
Case Study

Chapter 15 · Applying the full pipeline

A system that searches without reasoning. It can perturb and observe but never asks why. It tries every combination, times each, keeps the best. No theory, no pruning, no explanation. This chapter applies the full pipeline from chapters 1–14.

Search without theory

The optimizer has three moving parts: a generator that enumerates candidate configurations, a harness that times each one on the workload, and a ranker that keeps the best result. It finds fast configurations. But it has three structural problems:

| Problem | Symptom | Root cause |
|---|---|---|
| Regressions | Some workloads run slower than a naive baseline | No fallback. The search can find a local optimum worse than the default. |
| Waste | 90% of search budget spent on workloads already near-optimal | No triage. Every workload gets the same search depth. |
| Opacity | When a configuration fails, the failure teaches nothing | No diagnosis. Bad results are discarded, not analyzed. |

All three trace to the same absence: the system has Experiment and Observation but no Theory. It runs the bottom edge of the triangle only.

1. Two missing edges (Ch 1–3)

The trichotomy names what's missing.

| Edge | Mode | Present? |
|---|---|---|
| Theory → Experiment | Deduction | Absent. No theory predicts which configurations should work. |
| Experiment → Observation | Induction | Present. The system measures, ranks, and generalizes (top-k). |
| Observation → Theory | Abduction | Absent. No mechanism asks why a result was fast or slow. |

Two of three edges missing. The system compensates with brute force: time enough configurations and you'll stumble onto a good one.

2. The diff is the primitive (Ch 4–6)

Chapter 4's abductive primitive is the diff: given two states with different outcomes, the diff is the set of changes between them.

Configuration A runs slow, configuration B runs fast. The diff: tile size 4→16, loop order swapped, vectorization enabled. That diff is the figure. Everything unchanged is the ground.

The figure is a candidate explanation: "this workload is faster with larger tiles." The system already produces this diff (it must, to move from A to B) but discards it immediately. It keeps the timing, throws away the structure.

Bi-abduction (Ch 5) adds the frame: what the diff didn't touch. If A and B differ only in tile size, the frame includes loop order, vectorization, memory layout. The frame bounds the hypothesis: "larger tiles help when loop order is X and vectorization is Y." Without the frame, the hypothesis overgeneralizes.

Tri-abduction (Ch 6) handles forks. Larger tiles help on one workload and hurt on another. The branching condition (memory-bound vs. compute-bound) splits the hypothesis into two. The current system cannot represent this split. It would need to test both branches separately and discover which condition governs.
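The diff-and-frame decomposition above can be sketched in a few lines. This is a minimal illustration, assuming configurations are plain dicts mapping knob names to values; the function name `diff` mirrors the primitive described in the text, but the dict representation is an assumption:

```python
def diff(config_a, config_b):
    """Split two configurations into figure (what changed, with both values)
    and frame (what stayed fixed). Configs are assumed to be dicts sharing keys."""
    figure = {k: (config_a[k], config_b[k])
              for k in config_a if config_a[k] != config_b[k]}
    frame = {k: config_a[k] for k in config_a if config_a[k] == config_b[k]}
    return figure, frame

a = {"tile": 4, "order": "ij", "vectorize": False}
b = {"tile": 16, "order": "ij", "vectorize": False}
figure, frame = diff(a, b)
# figure is the candidate explanation: {"tile": (4, 16)}
# frame bounds the hypothesis: {"order": "ij", "vectorize": False}
```

The frame is what keeps the hypothesis honest: "larger tiles help" is only licensed while `order` and `vectorize` hold the values recorded in the frame.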

3. Spend budget where uncertainty is highest (Ch 7)

200 configuration options. Exhaustive search is infeasible. Beam search samples from the space, but it commits prematurely to a local basin instead of exploring broadly.

Peirce's economy of research asks: which experiment gives the most information per unit cost?

A never-varied knob has high information gain: testing it tells you whether it matters at all. A knob tested ten times with consistent results has low information gain: you already know its effect. This criterion replaces beam search's uniform sampling with directed probing.

If the current hypothesis says "tile size is the bottleneck," the highest-value experiment tests tile size in a new context (different workload shape, different memory pressure), not a knob already known to be irrelevant.

4. Measurements are trajectories, not snapshots (Ch 8–9)

Each measurement is a point on a trajectory. Chapter 8 introduced e-values for tracking cumulative evidence; Chapter 9 classified trajectories into four bins.

| Trajectory shape | Meaning for the optimizer | Action |
|---|---|---|
| Convergent | This knob's effect is stable across workloads. Changing it always helps or never helps. | Lock it. Stop testing it. Move budget elsewhere. |
| Divergent | Each new measurement increases the e-value. This knob is critical. | Follow it. This is the bottleneck. Allocate more budget. |
| Oscillatory | The knob helps on workload A, hurts on workload B, helps on C, hurts on D. | Split. Two constraints are fighting. Find the branching condition. |
| Flat / indeterminate | Measurements are noisy. The knob might matter, might not. | Collect more data or reduce measurement variance before deciding. |

The current system keeps the best time and discards the rest. It cannot distinguish a convergent knob (safe to freeze) from an oscillatory one (masking a deeper structural question). Trajectory classification turns raw measurements into decision procedures.
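The four bins can be approximated with a small classifier over a knob's effect history. The thresholds and the proxy for "growing e-value" (effect magnitude increasing) are assumptions chosen for illustration, not the chapter's e-value machinery:

```python
def classify_trajectory(effects, noise=0.05):
    """Classify a knob's measured effects (signed speedups across workloads)
    into one of the four trajectory bins. Thresholds are illustrative."""
    if len(effects) < 3:
        return "indeterminate"
    signs = [e > 0 for e in effects if abs(e) > noise]  # drop noise-level effects
    if len(signs) < 3:
        return "indeterminate"                # mostly noise: collect more data
    if all(signs) or not any(signs):
        # consistent direction: is the effect stable or still growing?
        mags = [abs(e) for e in effects]
        return "divergent" if mags[-1] > 2 * mags[0] else "convergent"
    return "oscillatory"                      # helps here, hurts there: split

classify_trajectory([0.3, 0.31, 0.29])       # "convergent": freeze the knob
classify_trajectory([0.1, 0.25, 0.6])        # "divergent": follow the lead
classify_trajectory([0.4, -0.5, 0.3, -0.4])  # "oscillatory": find the branch
```

The payoff is the decision procedure: each bin maps directly to a budget action (freeze, deepen, split, or re-measure).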

5. Failures name the next experiment (Ch 10)

Chapter 10's central claim: when a hypothesis fails, the failure mode names the next experiment.

Hypothesis: "This workload is memory-bound. Grouping memory accesses will speed it up." The optimizer tests a grouped-access configuration. It's slower. The hypothesis is killed. But the shape of the failure carries information.

Each failure mode generates a directed edge in the hypothesis graph. The current system has none. Slow configurations get ranked low and forgotten. Failure information, the richest output the system produces, is discarded.
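A sketch of that edge-generating step, applied to the memory-bound example above. The failure taxonomy here (hurt outright vs. helped less than predicted) is an illustrative assumption; Chapter 10's claim is only that the shape of the failure selects the follow-up experiment:

```python
def failure_edge(hypothesis, observed_speedup, noise=0.02):
    """Map a failed prediction's shape to the next experiment.
    hypothesis: dict with a "predicted_speedup" (assumed shape).
    Returns None if the prediction held, else a follow-up experiment."""
    predicted = hypothesis["predicted_speedup"]
    if observed_speedup >= predicted - noise:
        return None  # prediction held: no new edge needed
    if observed_speedup < 0:
        # the change actively hurt: the bottleneck diagnosis itself is suspect
        return "probe whether the workload is compute-bound instead"
    # right direction, wrong magnitude: the mechanism is partially right
    return "vary the grouping granularity and re-measure"

failure_edge({"predicted_speedup": 0.2}, -0.1)
# "probe whether the workload is compute-bound instead"
```

Each non-`None` return is a directed edge: the killed hypothesis points at the experiment its failure mode names.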

6. Stop when you can explain, not when you're exhausted (Ch 11)

Two stopping conditions look alike but differ fundamentally. Theory-based: "Tile size 16 is optimal because the working set fits exactly in L1 cache, and the loop order minimizes TLB misses given this tile." Exhaustion-based: "I tried 500 configurations and this one was the fastest."

Exhaustion-based stopping fails silently: the search may be stuck in a local basin, and the system cannot detect this because it has no theory of the search space. Theory-based stopping is falsifiable: if changing cache size should change the optimal tile size but doesn't, the theory is wrong and the search must continue.

Chapter 11's convergence criterion (the hypothesis graph closes when every open edge resolves) gives a structural stopping condition. The optimizer has converged when:

  1. Every knob has been classified (convergent, divergent, oscillatory, or indeterminate).
  2. Every divergent knob has a theory explaining its effect.
  3. Every oscillatory knob has a branching condition that resolves the oscillation.
  4. No indeterminate knobs remain.

Stronger than "no improvement found," and cheaper in the long run: it prevents wasted budget on basins the system has already explained.
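The four criteria collapse into one predicate. This is a structural sketch; the data shapes (a state dict plus two sets recording which knobs have explanations) are assumptions, not the book's API:

```python
def has_converged(knob_states, theories, branch_conditions):
    """Structural stopping condition from the four criteria above.
    knob_states: knob -> "convergent" | "divergent" | "oscillatory" | "indeterminate"
    theories: knobs whose divergent effect has an explanatory theory
    branch_conditions: knobs whose oscillation has a resolving condition"""
    for knob, state in knob_states.items():
        if state == "indeterminate":
            return False                                  # criterion 4 (and 1)
        if state == "divergent" and knob not in theories:
            return False                                  # criterion 2
        if state == "oscillatory" and knob not in branch_conditions:
            return False                                  # criterion 3
    return True                                           # graph closed: stop

has_converged({"tile": "divergent", "unroll": "convergent"},
              theories={"tile"}, branch_conditions=set())
# True: every knob classified, the divergent one explained
```

Note what is absent: no trial counter. The search stops on explanation, not exhaustion.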

7. The optimizer can deceive itself (Ch 12–13)

Chapter 12's prereg checklist catches self-deception before it compounds.

The most dangerous failure: proxy measurement. The optimizer measures on a benchmark workload and deploys on production. The benchmark is smaller, fits in cache, has regular access patterns. Production is larger, irregular, cold-cache. Rankings do not transfer.

The prereg checklist asks: "Is your measurement on the actual workload or a proxy?" If proxy, the entire evidence trajectory is suspect. E-values computed on proxy data do not compose with e-values on production data. Measure on the real workload or explicitly model the gap.

Other checklist items:

Chapter 13's proof manual adds structural validation: "If the bottleneck is memory bandwidth, then doubling bandwidth should halve runtime." If the deduction fails, the abduction was wrong. Re-enter the loop.
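The bandwidth check is a one-line deduction. A minimal sketch, assuming runtimes measured at 1x and 2x memory bandwidth and an illustrative 10% tolerance:

```python
def deduction_holds(runtime_1x, runtime_2x, tolerance=0.1):
    """If the bottleneck is memory bandwidth, doubling bandwidth should
    roughly halve runtime. Tolerance is an assumed noise allowance."""
    predicted = runtime_1x / 2
    return abs(runtime_2x - predicted) / predicted <= tolerance

deduction_holds(10.0, 5.2)  # True: close enough to the predicted halving
deduction_holds(10.0, 9.0)  # False: the abduction was wrong; re-enter the loop
```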

8. Wired together (Ch 14)

The exhaustive search loop:

def grid_search(workload, candidates, budget):
    """Enumerate, time, keep best. No theory."""
    results = []
    for config in candidates[:budget]:
        elapsed = measure(workload, config)
        results.append((elapsed, config))
    results.sort()
    return results[0]  # fastest

# Problems:
# - budget spent uniformly across knobs
# - bad results discarded (no diagnosis)
# - no stopping criterion beyond exhaustion
# - no explanation of why the winner wins
# - regressions undetected (no baseline comparison)

The abduction loop:

def abduction_engine(workload, knobs, budget):
    """Observe, hypothesize, test, update. Theory-guided."""
    baseline = measure(workload, default_config(knobs))
    graph = HypothesisGraph()
    spent = 0

    while spent < budget and graph.has_open_edges():
        # 1. Select: economy of research
        knob = max(knobs, key=lambda k: info_gain(k, graph))

        # 2. Perturb: generate a test configuration
        config = perturb(knob, current_best(graph))
        time = measure(workload, config)
        spent += 1

        # 3. Diff: extract figure and ground
        figure, ground = diff(config, current_best(graph))

        # 4. Classify: trajectory shape
        shape = classify_trajectory(knob, graph)

        if shape == CONVERGENT:
            graph.freeze(knob)          # stop testing
        elif shape == DIVERGENT:
            graph.deepen(knob, figure)  # follow the lead
        elif shape == OSCILLATORY:
            graph.split(knob, ground)   # find the branch
        else:
            pass  # need more data

        # 5. Kill check: does the current theory survive?
        for hyp in graph.active_hypotheses():
            if hyp.prediction_fails(time, config):
                next_exp = hyp.failure_mode()  # Ch 10
                graph.add_edge(hyp, next_exp)

        # 6. Baseline guard: never regress
        if current_best(graph).time > baseline:
            graph.revert_to(default_config(knobs))

    return graph.best_config(), graph.explanation()

Side by side:

| Property | Exhaustive search | Abduction engine |
|---|---|---|
| Selection | Enumerate | Information gain per cost |
| Bad results | Discarded | Diagnosed (kill condition → next edge) |
| Stopping | Budget exhausted | Hypothesis graph closed |
| Output | Best configuration | Best configuration + explanation |
| Regressions | Undetected | Caught by baseline guard |

Three testable predictions

If the framework from chapters 1–14 is correct, the abduction engine beats exhaustive search on the same budget in three measurable ways:

  1. Fewer trials to equivalent quality. Theory-guided selection skips irrelevant regions of configuration space. The savings grow with the size of the space.
  2. Fewer regressions. The baseline guard prevents deploying configurations known to be slower. Noisy timing and benchmark/production mismatch can still cause regressions, but the guard eliminates the obvious ones.
  3. Transferable explanations. The hypothesis graph from workload A partially transfers to workload B. If the theory says "this workload is memory-bound because of access pattern X," and workload B shares that pattern, the same configuration works without re-searching. Exhaustive search cannot transfer because it produces no theory.
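Prediction 3 has a concrete mechanical form. The sketch below assumes a theory is stored as a branching condition plus the configuration it licenses (both names and shapes are assumptions for illustration): if workload B satisfies the condition learned on A, the configuration transfers with zero additional search.

```python
def transfer_config(theory, features_b):
    """Reuse workload A's configuration on workload B if B satisfies the
    theory's branching condition; otherwise signal that search is needed."""
    if all(features_b.get(k) == v for k, v in theory["condition"].items()):
        return theory["config"]  # condition shared: no re-search needed
    return None                  # condition not shared: search workload B

theory = {"condition": {"access_pattern": "strided"},
          "config": {"tile": 16, "vectorize": True}}
transfer_config(theory, {"access_pattern": "strided", "size": "large"})
# returns {"tile": 16, "vectorize": True}
```

Exhaustive search has no analogue of this function: its only output is a ranking, and rankings do not carry conditions.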

How this textbook fails

The framework makes a specific, falsifiable claim: abduction reduces sample complexity. Theory-guided search finds configurations of equal or better quality in fewer measurement rounds than theory-free search.

If the abduction engine consistently requires more trials than exhaustive search to reach the same quality, the overhead of the hypothesis graph is not justified. The core assumption (that failure structure carries exploitable information) would be wrong. The configuration space would be adversarial to theory: no pattern in the failures, no transferable structure, pure noise.

The textbook provides the machinery to run the experiment and the criteria to interpret the result. If the abduction engine wins, the framework works. If it loses, the framework is wrong, and the shape of the loss will say why.

That last sentence is the framework applying to itself. The failure mode of this textbook names the next one.
