Case Study
Chapter 15 · Applying the full pipeline
A system that searches without reasoning. It can perturb and observe but never asks why. It tries every combination, times each, keeps the best. No theory, no pruning, no explanation. This chapter applies the full pipeline from chapters 1–14 to that system.
Search without theory
The optimizer has three moving parts:
- A combinatorial space of roughly 200 configuration options: knobs, flags, orderings, tile sizes, data layouts.
- A measurement oracle: run any configuration on hardware, get a wall-clock time.
- A search strategy: beam search. Enumerate candidates, time each, keep the top-k, expand from there.
It finds fast configurations. But it has three structural problems:
| Problem | Symptom | Root cause |
|---|---|---|
| Regressions | Some workloads run slower than a naive baseline | No fallback. The search can find a local optimum worse than the default. |
| Waste | 90% of search budget spent on workloads already near-optimal | No triage. Every workload gets the same search depth. |
| Opacity | When a configuration fails, the failure teaches nothing | No diagnosis. Bad results are discarded, not analyzed. |
All three trace to the same absence: the system has Experiment and Observation but no Theory. It runs the bottom edge of the triangle only.
1. Two missing edges (Ch 1–3)
The trichotomy names what's missing.
| Edge | Mode | Present? |
|---|---|---|
| Theory → Experiment | Deduction | Absent. No theory predicts which configurations should work. |
| Experiment → Observation | Induction | Present. The system measures, ranks, and generalizes (top-k). |
| Observation → Theory | Abduction | Absent. No mechanism asks why a result was fast or slow. |
Two of three edges missing. The system compensates with brute force: time enough configurations and you'll stumble onto a good one.
2. The diff is the primitive (Ch 4–6)
Chapter 4's abductive primitive is the diff: given two states with different outcomes, the diff is the set of changes between them.
Configuration A runs slow, configuration B runs fast. The diff: tile size 4→16, loop order swapped, vectorization enabled. That diff is the figure. Everything unchanged is the ground.
The figure is a candidate explanation: "this workload is faster with larger tiles." The system already produces this diff (it must, to move from A to B) but discards it immediately. It keeps the timing, throws away the structure.
Bi-abduction (Ch 5) adds the frame: what the diff didn't touch. If A and B differ only in tile size, the frame includes loop order, vectorization, memory layout. The frame bounds the hypothesis: "larger tiles help when loop order is X and vectorization is Y." Without the frame, the hypothesis overgeneralizes.
Tri-abduction (Ch 6) handles forks. Larger tiles help on one workload and hurt on another. The branching condition (memory-bound vs. compute-bound) splits the hypothesis into two. The current system cannot represent this split. It would need to test both branches separately and discover which condition governs.
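A minimal sketch of the diff primitive, assuming configurations are flat dictionaries of knob settings (the representation and names here are illustrative, not the engine's actual data model):

```python
def diff(config_a, config_b):
    """Split two configurations into figure (what changed) and ground (what stayed fixed)."""
    figure = {k: (config_a[k], config_b[k]) for k in config_a if config_a[k] != config_b[k]}
    ground = {k: config_a[k] for k in config_a if config_a[k] == config_b[k]}
    return figure, ground

slow = {"tile": 4,  "loop_order": "ij", "vectorize": False}
fast = {"tile": 16, "loop_order": "ji", "vectorize": True}

figure, ground = diff(slow, fast)
# figure -> {'tile': (4, 16), 'loop_order': ('ij', 'ji'), 'vectorize': (False, True)}
#   the candidate explanation: these changes made the workload fast
# ground -> {} in this example; with more knobs it is the frame that bounds the hypothesis
```

The tri-abductive split would add a branching condition keyed on a workload property (memory-bound vs. compute-bound) and keep one figure per branch.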
3. Spend budget where uncertainty is highest (Ch 7)
200 configuration options. Exhaustive search is infeasible. Beam search samples from the space, but it commits prematurely to a local basin instead of exploring broadly.
Peirce's economy of research asks: which experiment gives the most information per unit cost?
- Cost = wall-clock time to run one configuration on hardware.
- Information gain = how much the result would change the current hypothesis.
A never-varied knob has high information gain: testing it tells you whether it matters at all. A knob tested ten times with consistent results has low information gain: you already know its effect. This criterion replaces beam search's uniform sampling with directed probing.
If the current hypothesis says "tile size is the bottleneck," the highest-value experiment tests tile size in a new context (different workload shape, different memory pressure), not a knob already known to be irrelevant.
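A stand-in sketch for that selection rule. The engine in section 8 calls info_gain(knob, graph); this version scores a simple per-knob history of observed effects instead, and the scoring itself (variance of effects, infinite for an untested knob) is an assumption:

```python
import statistics

def info_gain(knob, history):
    """history[knob]: signed speedups observed from past perturbations of this knob."""
    observations = history.get(knob, [])
    if not observations:
        return float("inf")                    # never varied: does it matter at all?
    if len(observations) == 1:
        return 1.0                             # direction known, magnitude and context untested
    return statistics.pvariance(observations)  # consistent results -> little left to learn

def next_knob(knobs, history, cost):
    """Highest expected information per unit measurement cost (Peirce's economy of research)."""
    return max(knobs, key=lambda k: info_gain(k, history) / cost(k))
```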
4. Measurements are trajectories, not snapshots (Ch 8–9)
Each measurement is a point on a trajectory. Chapter 8 introduced e-values for tracking cumulative evidence; Chapter 9 classified trajectories into four bins.
| Trajectory shape | Meaning for the optimizer | Action |
|---|---|---|
| Convergent | This knob's effect is stable across workloads. Changing it always helps or never helps. | Lock it. Stop testing it. Move budget elsewhere. |
| Divergent | Each new measurement increases the e-value. This knob is critical. | Follow it. This is the bottleneck. Allocate more budget. |
| Oscillatory | The knob helps on workload A, hurts on workload B, helps on C, hurts on D. | Split. Two constraints are fighting. Find the branching condition. |
| Flat / indeterminate | Measurements are noisy. The knob might matter, might not. | Collect more data or reduce measurement variance before deciding. |
The current system keeps the best time and discards the rest. It cannot distinguish a convergent knob (safe to freeze) from an oscillatory one (masking a deeper structural question). Trajectory classification turns raw measurements into decision procedures.
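A sketch of that four-way classification, applied to a chronological list of signed effects for one knob (positive means the knob helped on that workload). The noise floor, the sample minimum, and the growth test for divergence are illustrative thresholds, not the book's calibrated ones:

```python
CONVERGENT, DIVERGENT, OSCILLATORY, INDETERMINATE = "convergent", "divergent", "oscillatory", "indeterminate"

def classify_trajectory(effects, noise_floor=0.02, min_samples=4):
    """effects: chronological list of signed speedups from perturbing one knob."""
    if len(effects) < min_samples:
        return INDETERMINATE                 # not enough evidence to decide anything
    signs = [e > 0 for e in effects if abs(e) > noise_floor]
    if not signs:
        return INDETERMINATE                 # everything within measurement noise
    if any(signs) and not all(signs):
        return OSCILLATORY                   # helps some workloads, hurts others: split
    magnitudes = [abs(e) for e in effects]
    if magnitudes[-1] > 2 * magnitudes[0]:
        return DIVERGENT                     # evidence keeps growing: follow this knob
    return CONVERGENT                        # stable effect: lock it, move budget elsewhere
```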
5. Failures name the next experiment (Ch 10)
Chapter 10's central claim: when a hypothesis fails, the failure mode names the next experiment.
Hypothesis: "This workload is memory-bound. Grouping memory accesses will speed it up." The optimizer tests a grouped-access configuration. It's slower. The hypothesis is killed. But the shape of the failure carries information.
- If grouping didn't change throughput at all, the workload isn't memory-bound. The bottleneck is elsewhere. Next experiment: test compute-bound interventions.
- If grouping helped latency but hurt throughput, the bottleneck is bandwidth, not latency. Grouping reduced round-trips but each trip now carries more data than the bus can handle. Next experiment: test a smaller group size or a different memory layout.
- If grouping helped on small inputs but hurt on large ones, the working set overflows cache at scale. Next experiment: test with explicit cache management.
Each failure mode generates a directed edge in the hypothesis graph. The current system has none. Slow configurations get ranked low and forgotten. Failure information, the richest output the system produces, is discarded.
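A sketch of that mapping for the memory-bound example above, with the failure shape reduced to a few booleans (the field names and experiment labels are illustrative):

```python
from dataclasses import dataclass

@dataclass
class FailureShape:
    throughput_unchanged: bool = False
    latency_better_throughput_worse: bool = False
    small_inputs_faster_large_slower: bool = False

def next_experiment(f):
    """Each recognized failure shape names the next experiment; an unrecognized one does not."""
    if f.throughput_unchanged:
        return "probe compute-bound interventions"            # not memory-bound after all
    if f.latency_better_throughput_worse:
        return "retry with smaller groups or a new layout"    # bandwidth, not latency, is the wall
    if f.small_inputs_faster_large_slower:
        return "retry with explicit cache management"         # working set overflows cache at scale
    return None                                               # hypothesis dies without an heir

next_experiment(FailureShape(latency_better_throughput_worse=True))
# -> 'retry with smaller groups or a new layout'
```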
6. Stop when you can explain, not when you're exhausted (Ch 11)
Two stopping conditions look alike but differ fundamentally. Theory-based: "Tile size 16 is optimal because the working set fits exactly in L1 cache, and the loop order minimizes TLB misses given this tile." Exhaustion-based: "I tried 500 configurations and this one was the fastest."
Exhaustion-based stopping fails silently: the search may be stuck in a local basin, and the system cannot detect this because it has no theory of the search space. Theory-based stopping is falsifiable: if changing cache size should change the optimal tile size but doesn't, the theory is wrong and the search must continue.
Chapter 11's convergence criterion (the hypothesis graph closes when every open edge resolves) gives a structural stopping condition. The optimizer has converged when:
- Every knob has been classified (convergent, divergent, oscillatory, or indeterminate).
- Every divergent knob has a theory explaining its effect.
- Every oscillatory knob has a branching condition that resolves the oscillation.
- No indeterminate knobs remain.
Stronger than "no improvement found," and cheaper in the long run: it prevents wasted budget on basins the system has already explained.
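A sketch of that check, reusing the classification labels from section 4. Representing the graph's per-knob state as plain dictionaries is an assumption made to keep the example self-contained:

```python
def has_converged(classification, theory, branch_condition):
    """classification / theory / branch_condition: dicts keyed by knob name (illustrative shapes)."""
    for knob, shape in classification.items():
        if shape == INDETERMINATE:
            return False                          # still an open question
        if shape == DIVERGENT and knob not in theory:
            return False                          # critical knob, no explanation yet
        if shape == OSCILLATORY and knob not in branch_condition:
            return False                          # fork without a branching condition
    return True                                   # graph closed: stop, even with budget left
```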
7. The optimizer can deceive itself (Ch 12–13)
Chapter 12's prereg checklist catches self-deception before it compounds.
The most dangerous failure: proxy measurement. The optimizer measures on a benchmark workload and deploys on production. The benchmark is smaller, fits in cache, has regular access patterns. Production is larger, irregular, cold-cache. Rankings do not transfer.
The prereg checklist asks: "Is your measurement on the actual workload or a proxy?" If proxy, the entire evidence trajectory is suspect. E-values computed on proxy data do not compose with e-values on production data. Measure on the real workload or explicitly model the gap.
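A toy sketch of that composition rule. Sequential e-values for the same hypothesis multiply into a running evidence total (Ch 8); the regime tag here is an illustrative guard against silently pooling benchmark and production measurements:

```python
from math import prod

def combined_evidence(e_values, regimes):
    """Multiply e-values only when every measurement came from the same regime."""
    if len(set(regimes)) != 1:
        raise ValueError("e-values from different regimes (benchmark vs. production) do not compose")
    return prod(e_values)

combined_evidence([1.8, 2.5, 1.2], ["production"] * 3)         # evidence accumulates: 5.4
# combined_evidence([1.8, 2.5], ["benchmark", "production"])   # raises: model the gap explicitly
```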
Other checklist items:
- Did you abduce from the test data? If the optimizer generates hypotheses and evaluates them on the same measurements, it is fitting to noise. Hold out a validation set.
- Is the measurement deterministic? Thermal throttling, background processes, and memory allocation all inject variance. Control these or account for them in the e-value computation.
- Can the result reproduce on a different machine? If the optimal configuration depends on a specific hardware quirk, it is an artifact, not a theory.
Chapter 13's proof manual adds structural validation: "If the bottleneck is memory bandwidth, then doubling bandwidth should halve runtime." If the deduction fails, the abduction was wrong. Re-enter the loop.
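A sketch of that deductive check. The 15% tolerance and the numbers are illustrative; the point is that the theory commits to a quantitative prediction before the measurement is made:

```python
def deduction_holds(predicted_runtime, measured_runtime, tolerance=0.15):
    """True if the measurement matches the prediction; False re-enters the abduction loop."""
    return abs(measured_runtime - predicted_runtime) <= tolerance * predicted_runtime

# Theory: "memory-bandwidth-bound" => doubling bandwidth should halve runtime.
runtime_at_b = 2.0                       # seconds at bandwidth B (illustrative)
predicted = runtime_at_b / 2             # 1.0 s expected at bandwidth 2B
measured = 1.7                           # actual runtime at bandwidth 2B
deduction_holds(predicted, measured)     # False: the abduction was wrong, back into the loop
```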
8. Wired together (Ch 14)
The exhaustive search loop:
```python
def grid_search(workload, candidates, budget):
    """Enumerate, time, keep best. No theory."""
    results = []
    for config in candidates[:budget]:
        elapsed = measure(workload, config)
        results.append((elapsed, config))
    results.sort(key=lambda r: r[0])
    return results[0]  # fastest (time, config) pair

# Problems:
# - budget spent uniformly across knobs
# - bad results discarded (no diagnosis)
# - no stopping criterion beyond exhaustion
# - no explanation of why the winner wins
# - regressions undetected (no baseline comparison)
```

The abduction loop:
```python
def abduction_engine(workload, knobs, budget):
    """Observe, hypothesize, test, update. Theory-guided."""
    default = default_config(knobs)
    baseline = measure(workload, default)
    graph = HypothesisGraph()
    spent = 0
    while spent < budget and graph.has_open_edges():
        # 1. Select: economy of research
        knob = max(knobs, key=lambda k: info_gain(k, graph))
        # 2. Perturb: generate a test configuration and measure it
        config = perturb(knob, current_best(graph))
        elapsed = measure(workload, config)
        spent += 1
        graph.record(knob, config, elapsed)  # the trajectory for this knob grows by one point
        # 3. Diff: extract figure and ground
        figure, ground = diff(config, current_best(graph))
        # 4. Classify: trajectory shape
        shape = classify_trajectory(knob, graph)
        if shape == CONVERGENT:
            graph.freeze(knob)             # stop testing
        elif shape == DIVERGENT:
            graph.deepen(knob, figure)     # follow the lead
        elif shape == OSCILLATORY:
            graph.split(knob, ground)      # find the branch
        else:
            pass                           # indeterminate: need more data
        # 5. Kill check: does the current theory survive?
        for hyp in graph.active_hypotheses():
            if hyp.prediction_fails(elapsed, config):
                next_exp = hyp.failure_mode()  # Ch 10: the failure names the next experiment
                graph.add_edge(hyp, next_exp)
        # 6. Baseline guard: never regress
        if current_best(graph).time > baseline:
            graph.revert_to(default)
    return graph.best_config(), graph.explanation()
```

Side by side:
| Property | Exhaustive search | Abduction engine |
|---|---|---|
| Selection | Enumerate | Information gain per cost |
| Bad results | Discarded | Diagnosed (kill condition → next edge) |
| Stopping | Budget exhausted | Hypothesis graph closed |
| Output | Best configuration | Best configuration + explanation |
| Regressions | Undetected | Caught by baseline guard |
Three testable predictions
If the framework from chapters 1–14 is correct, the abduction engine beats exhaustive search on the same budget in three measurable ways:
- Fewer trials to equivalent quality. Theory-guided selection skips irrelevant regions of configuration space. The savings grow with the size of the space.
- Fewer regressions. The baseline guard prevents deploying configurations known to be slower. Noisy timing and benchmark/production mismatch can still cause regressions, but the guard eliminates the obvious ones.
- Transferable explanations. The hypothesis graph from workload A partially transfers to workload B. If the theory says "this workload is memory-bound because of access pattern X," and workload B shares that pattern, the same configuration works without re-searching. Exhaustive search cannot transfer because it produces no theory.
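A sketch of the experiment those predictions call for, reusing the two loops from section 8. The workload list and the regression counting are illustrative; testing the trial-count prediction would additionally need both loops to report how many measurements they spent:

```python
def compare(workloads, knobs, candidates, budget):
    """Same workloads, same budget: count regressions and compare best times."""
    rows = []
    for w in workloads:
        baseline = measure(w, default_config(knobs))
        grid_time, _ = grid_search(w, candidates, budget)          # theory-free
        abduct_config, _ = abduction_engine(w, knobs, budget)      # theory-guided
        abduct_time = measure(w, abduct_config)
        rows.append({
            "workload": w,
            "grid_regressed": grid_time > baseline,        # prediction 2: the guard should make this rarer
            "abduct_regressed": abduct_time > baseline,
            "speedup_ratio": grid_time / abduct_time,      # prediction 1 at a fixed budget
        })
    return rows
```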
How this textbook fails
The framework makes a specific, falsifiable claim: abduction reduces sample complexity. Theory-guided search finds configurations of equal or better quality in fewer measurement rounds than theory-free search.
If the abduction engine consistently requires more trials than exhaustive search to reach the same quality, the overhead of the hypothesis graph is not justified. The core assumption (that failure structure carries exploitable information) would be wrong. The configuration space would be adversarial to theory: no pattern in the failures, no transferable structure, pure noise.
The textbook provides the machinery to run the experiment and the criteria to interpret the result. If the abduction engine wins, the framework works. If it loses, the framework is wrong, and the shape of the loss will say why.
That last sentence is the framework applying to itself. The failure mode of this textbook names the next one.