The Pipeline
Chapter 14 · Calcagno et al. 2009, Wang et al. 2023, Arjovsky et al. 2019
Chapter 12 protected the loop from self-deception. Chapter 13 gave escalation rules for validating claims. Now: wire every tool from the textbook into an autonomous pipeline. Five stations, one outer loop, and a contract that makes the stations compose.
Three independent convergences
Three systems close the abduction loop in production. Different fields, different problems, zero shared citations. The convergence is the evidence.
Facebook Infer (Calcagno et al. 2009)
Infer takes a code diff and produces bug reports. The pipeline is: observe the change (diff, Ch 4), infer the anti-frame and frame (bi-abduction, Ch 5), compose contracts across call boundaries, report where the inferred precondition conflicts with the calling context. The observation is formal (a code diff). The goal is formal (memory safety). The hypothesis is formal (the anti-frame). Every component is mechanized. No human intervenes between input and output.
Infer runs on every diff submitted to the Facebook codebase. Millions of lines, every day. The abduction primitive ("what must have been true for this function to succeed?") executes at industrial scale because the decomposition is compositional (Ch 5): each function is analyzed once, in isolation, and the results compose linearly.
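The compositional shape can be sketched in a few lines. This is a hypothetical toy, not Infer's implementation: conditions are modeled as flat sets of tagged facts, where real bi-abduction manipulates separation-logic assertions. The point it illustrates is the one above: each function yields a summary once, and callers compose summaries instead of re-analyzing bodies.

```python
def analyze(body_facts):
    """Infer a summary from a function body: the facts it needs
    (the anti-frame) and the facts it establishes."""
    pre = {f for f in body_facts if f.startswith("needs:")}
    post = {f for f in body_facts if f.startswith("gives:")}
    return pre, post

def compose(caller_post, callee_pre):
    """Return what the callee needs that the caller context
    does not provide: the seed of a bug report."""
    provided = {f.replace("gives:", "needs:") for f in caller_post}
    return callee_pre - provided

_, caller_post = analyze({"gives:ptr_valid"})
callee_pre, _ = analyze({"needs:ptr_valid", "needs:lock_held"})
missing = compose(caller_post, callee_pre)
# missing == {"needs:lock_held"}: the inferred precondition the
# calling context fails to satisfy
```

Because each summary is computed in isolation, adding a function costs one analysis, not a re-analysis of every caller.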
Voyager (Wang et al. 2023)
Voyager is a Minecraft agent. It observes failure (a task it cannot complete), proposes a fix (a new skill program), self-verifies (executes and checks the postcondition), and commits the result to a skill library. The loop is: observe → hypothesize → test → commit. It produced 180 verified skills without human intervention.
The structure maps onto the textbook. Observation: the agent fails to craft an item. Diff (Ch 4): the gap between the current skill library and the task requirement. Hypothesis: a new code snippet that fills the gap. Test: execute in the environment. Kill condition (Ch 10): if execution fails, the error message names the next hypothesis. Convergence (Ch 11): the skill is added to the library only when it passes verification. The agent has no theory of Minecraft physics. It has a theory of its own capabilities, maintained as a growing library, and that theory guides which tasks to attempt next.
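The observe → hypothesize → test → commit loop fits in one function. This is a minimal sketch, not Voyager's code: `propose_skill` stands in for the LLM code generator and `execute` for the Minecraft environment plus postcondition check, both supplied by the caller.

```python
def grow_library(tasks, propose_skill, execute, max_attempts=3):
    """Commit a skill to the library only after its check passes."""
    library = {}
    for task in tasks:
        prompt = task
        for _ in range(max_attempts):
            skill = propose_skill(prompt, library)  # hypothesize from the gap
            ok, error = execute(skill)              # test in the environment
            if ok:
                library[task] = skill               # commit the verified skill
                break
            # kill condition: the error message names the next hypothesis
            prompt = prompt + " | hint: " + error
    return library
```

The library is the agent's theory of its own capabilities: it grows only through verified commits, and failed attempts feed back as refined prompts rather than being retried verbatim.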
Invariant Risk Minimization (Arjovsky et al. 2019)
IRM observes that a classifier trained on one environment fails on another. Standard training minimizes loss. IRM minimizes loss subject to a constraint: the learned features must be invariant across training environments. Features that correlate with the label in environment A but not environment B are discarded. Only features that predict in all environments survive.
Environment variation is the lever for figure-ground separation (Ch 4). Invariant features are the figure: what the system depends on. Spurious correlations are the ground: what holds in one environment but not another. IRM automates the tri-abductive split (Ch 6). When a feature helps in one environment and hurts in another (oscillatory, Ch 9), the branching condition is the environment itself. The system splits the hypothesis without a human identifying the confound.
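A deliberately simplified stand-in for the constraint (not the IRMv1 penalty from the paper, which penalizes the gradient of per-environment risk): keep only features whose correlation with the label points the same direction in every environment. The data below is invented to show feature 1 flipping sign across environments.

```python
def invariant_features(envs):
    """envs: list of (X, y) pairs, X given as rows of feature lists.
    Returns indices of features whose label-correlation sign is
    consistent across all environments."""
    def corr_sign(X, y, j):
        mx = sum(row[j] for row in X) / len(X)
        my = sum(y) / len(y)
        cov = sum((row[j] - mx) * (yi - my) for row, yi in zip(X, y))
        return (cov > 0) - (cov < 0)

    n_features = len(envs[0][0][0])
    keep = []
    for j in range(n_features):
        signs = {corr_sign(X, y, j) for X, y in envs}
        if len(signs) == 1 and 0 not in signs:  # same direction everywhere
            keep.append(j)                       # figure: invariant
        # else: ground, or oscillatory across environments; discard
    return keep

env_a = ([[0, 0], [1, 1]], [0, 1])  # both features track the label
env_b = ([[0, 1], [1, 0]], [0, 1])  # feature 1 flips: spurious
# invariant_features([env_a, env_b]) keeps only feature 0
```

A feature that helps in `env_a` and hurts in `env_b` is exactly the oscillatory case: the environment is the branching condition, and the filter performs the split mechanically.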
One pattern across three domains
Three witnesses, three domains, one pattern.
| Step | Infer | Voyager | IRM |
|---|---|---|---|
| Observe | Code diff | Task failure | Cross-environment loss gap |
| Hypothesize | Infer anti-frame | Generate skill code | Propose invariant features |
| Test | Check precondition vs. context | Execute in Minecraft | Evaluate across environments |
| Kill / commit | Report bug or pass | Reject or add to library | Discard spurious or keep invariant |
| Loop | Next function in call graph | Next task | Next gradient step |
Each system embeds Peirce's three modes (Ch 1). Observation produces theory (abduction). Theory produces prediction (deduction). Prediction is tested (induction). The triangle closes. None of the three papers names the triangle. All three implement it.
Every station produces a standalone artifact
The textbook's tools assemble into five stations. Each produces a self-contained artifact. Any station can stop and the output stands alone: a diagnosis, an experiment repo, a measurement, a convergence report, a PR.
| Station | Artifact | Tools from the textbook | Kill condition |
|---|---|---|---|
| 1. Hypothesize | Hypothesis graph | Diff (Ch 4), bi-abduction (Ch 5), economy of research (Ch 7) | No surviving hypothesis → stop |
| 2. Prework | Experiment repo | Prereg checklist (Ch 12), proof manual (Ch 13) | Extraction script fails to confirm bug → re-enter station 1 |
| 3. Benchmark | Measurement table | E-values (Ch 8), four bins (Ch 9), kill conditions (Ch 10) | Oscillatory → split hypothesis, re-enter station 1 |
| 4. Bug hunt | Convergence report | Convergence (Ch 11), proof manual (Ch 13) | New bug found → classify and re-enter station 1 |
| 5. Ship | PR | Prereg checklist (Ch 12) | Conflicts with project vision → re-enter station 1 |
Every kill condition re-enters station 1. This is the outer loop. The pipeline does not branch midstream. It returns to the hypothesis graph, classifies the new observation using the four bins (Ch 9), and the trajectory shape names what happens next: divergent follows the edge (Ch 10), convergent refines the hypothesis (Ch 11), oscillatory splits it (Ch 6), chaotic redesigns the experiment.
Stations compose because contracts are explicit
Each station writes everything the next station needs. No hidden state:
| From | To | Contract artifact |
|---|---|---|
| Hypothesize | Prework | HYPOTHESIS_GRAPH.md: ranked hypotheses with evidence trails, predicted outcomes, falsification conditions |
| Prework | Benchmark | Experiment repo: reference implementation, candidate fix, compatibility suite, extraction script |
| Benchmark | Bug hunt | Measurement table: before/after numbers, trajectory classification, environment specs |
| Bug hunt | Ship | Convergence report: two independent reviewers, iterated to zero new findings |
| Ship (kill) | Hypothesize | Kill condition: classified observation re-entering the graph as a new edge |
The contract is monoidal. Station A's output is Station B's input. The composition is associative: (A ; B) ; C = A ; (B ; C). No station needs to know about any station except its immediate neighbors. This is separation logic's frame rule (Ch 5) applied to the pipeline itself: each station has a footprint (what it reads and writes) and a frame (everything else). The frame passes through untouched.
Stations can be replaced, reordered, or run by different agents. The contract is the interface. The implementation is local.
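The footprint/frame discipline can be made concrete. In this hypothetical sketch (the `station` decorator and artifact names are illustrative, not from the textbook's tooling), each station declares the one artifact it reads and the one it writes; everything else in the artifact dict is frame and passes through untouched, and composition fails loudly when contracts do not line up.

```python
def station(reads, writes):
    """Declare a station's footprint: one artifact consumed, one
    produced. All other artifacts are frame."""
    def wrap(fn):
        def run(artifacts):
            out = dict(artifacts)               # frame: copied through
            out[writes] = fn(artifacts[reads])  # footprint: read one, write one
            return out
        run.reads, run.writes = reads, writes
        return run
    return wrap

def compose(*stations):
    """Chain stations; each output contract must match the next input."""
    for a, b in zip(stations, stations[1:]):
        assert a.writes == b.reads, f"contract break: {a.writes} != {b.reads}"
    def pipeline(artifacts):
        for s in stations:
            artifacts = s(artifacts)
        return artifacts
    return pipeline

hypothesize = station("observation", "hypothesis_graph")(lambda o: f"graph({o})")
prework = station("hypothesis_graph", "experiment_repo")(lambda g: f"repo({g})")
run = compose(hypothesize, prework)
result = run({"observation": "latency spike", "vision": "VISION.md"})
# result["experiment_repo"] is built from the observation;
# result["vision"] passes through untouched: frame, not footprint.
```

Associativity falls out for free: composing `(hypothesize, prework)` then adding a third station gives the same pipeline as composing the last two first, because each `run` only touches its declared keys.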
Kills re-enter station 1, never retry
The five stations form the inner pass. The outer loop handles re-entry. When any station fires a kill condition, the pipeline returns to station 1, adds a new edge to the hypothesis graph, and starts a fresh inner pass.
The outer loop terminates when one of three conditions holds:
- Station 5 succeeds. The PR ships. The hypothesis graph is closed. Every open edge has been resolved (Ch 11).
- No surviving hypothesis. Station 1 cannot generate a hypothesis that survives the prereg checklist (Ch 12). The investigation concludes: the system is not broken, or the breakage is outside the investigation's scope.
- Budget exhausted. The economy of research (Ch 7) says the marginal information gain per experiment has dropped below threshold. Further investigation costs more than it returns.
Condition 2 is the most informative termination. "No surviving hypothesis after three rounds" teaches you something concrete: the initial observation was misleading, the system's behavior is within spec, or the hypothesis space was too narrow. Each diagnosis is itself a classified observation that can re-enter a higher-level investigation.
Code: the five-station loop
```python
def pipeline(system, observation, vision, max_rounds=5):
    """Five stations, one outer loop.

    system: the target under investigation
    observation: the surprising fact that started the inquiry
    vision: VISION.md — project philosophy, not tool configuration
    max_rounds: budget for outer-loop iterations
    """
    graph = HypothesisGraph(observation)
    for round in range(max_rounds):
        # ── Station 1: Hypothesize ──────────────────────────
        # Diff (Ch 4), bi-abduction (Ch 5), economy (Ch 7)
        hypotheses = graph.fan_out()
        hypotheses = rank_by_info_gain(hypotheses)  # Ch 7
        best = hypotheses[0]
        if not prereg_check(best):  # Ch 12
            graph.kill(best, reason="failed prereg")
            continue

        # ── Station 2: Prework ──────────────────────────────
        # Build experiment repo before touching production
        repo = build_experiment_repo(best, system)
        confirmed = repo.extraction_script.run()
        if not confirmed:
            graph.kill(best, reason="bug not confirmed")
            continue  # re-enter station 1

        # ── Station 3: Benchmark ────────────────────────────
        # E-values (Ch 8), four bins (Ch 9), kills (Ch 10)
        before = measure(system, baseline_config)
        after = measure(system, repo.candidate_fix)
        shape = classify_trajectory(before, after)  # Ch 9
        if shape == OSCILLATORY:
            # Split the hypothesis (Ch 6). Two constraints fighting.
            h1, h2 = graph.split(best, branching_condition)
            continue  # re-enter station 1
        if shape == FLAT:
            graph.kill(best, reason="no improvement")
            continue

        # ── Station 4: Bug hunt ─────────────────────────────
        # Convergence (Ch 11), proof manual (Ch 13)
        findings = adversarial_review(repo.candidate_fix)
        while findings:
            for bug in findings:
                shape = classify_trajectory(bug)  # Ch 9
                graph.add_edge(best, bug, shape)  # Ch 10
                repo.fix(bug)
            findings = adversarial_review(repo.candidate_fix)
        # Zero new findings = converged (Ch 11)

        # ── Station 5: Ship ─────────────────────────────────
        if not aligns_with_vision(repo.candidate_fix, vision):
            graph.kill(best, reason="conflicts with VISION.md")
            continue
        return ship(repo)  # PR

    # Budget exhausted. Report what was learned.
    return graph.summary()
```

The continue statements are the outer loop. Every kill re-enters station 1. The hypothesis graph accumulates across rounds: a hypothesis killed in round 2 is not re-proposed in round 3. The graph is the memory. The kills are the edges. The trajectory shape (Ch 9) classifies each edge, and the classification determines whether the next round follows, splits, refines, or stops.
Every kill teaches the graph something
Kill conditions from downstream stations do not discard the hypothesis. They re-enter the hypothesis graph as new observations. The four-bin classification (Ch 9) names what happens next:
| Kill source | Trajectory shape | Graph operation |
|---|---|---|
| Prework: extraction fails | Flat | Kill hypothesis. The diagnosis was wrong. Prune the branch. |
| Benchmark: helps A, hurts B | Oscillatory | Split hypothesis (Ch 6). Two constraints fighting. Find the branching condition. |
| Benchmark: no improvement | Flat | Kill hypothesis. The lever doesn't move this system. |
| Bug hunt: structural flaw | Divergent | Follow the edge (Ch 10). The flaw names a deeper problem. |
| Ship: conflicts with vision | Convergent (wrong direction) | Reject. The fix is correct but philosophically wrong. Adjust the heuristic's defaults, don't add a special case. |
The oscillatory case is the most instructive: the investigation found a real lever but applied it too broadly. The split produces a simpler, more targeted hypothesis. Investigations that pass through the oscillatory case once almost always converge on the second pass, because the branching condition narrows scope to where the lever actually works.
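The table above is small enough to be a dispatch. This sketch invents return tuples to stand in for the graph operations; the shape names follow the pipeline code, and `"convergent"` covers the refine path from Ch 11.

```python
# One graph operation per trajectory shape (Ch 9).
GRAPH_OPS = {
    "flat":        lambda h, obs: ("kill", h),            # prune the branch
    "oscillatory": lambda h, obs: ("split", h, obs),      # obs = branching condition
    "divergent":   lambda h, obs: ("follow_edge", h, obs),  # flaw names a deeper problem
    "convergent":  lambda h, obs: ("refine", h, obs),     # narrow the hypothesis
}

def reenter(hypothesis, observation, shape):
    """Every kill re-enters station 1 as a classified edge."""
    return GRAPH_OPS[shape](hypothesis, observation)

# reenter("cache thrash", "helps A, hurts B", "oscillatory")
#   -> ("split", "cache thrash", "helps A, hurts B")
```

The dispatch makes the chapter's claim mechanical: no kill is a dead end, because every shape maps to exactly one graph operation.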
Isolation makes the pipeline embarrassingly parallel
One agent, one system, one fix. Ten agents, ten systems, ten fixes. No shared state between runs.
Each pipeline run is independent. Agent A investigating system X cannot corrupt agent B investigating system Y, because the contract artifacts (hypothesis graph, experiment repo, measurement table, convergence report) are local to each run. The pipeline's footprint is the system under investigation. Its frame is everything else. The frame rule (Ch 5) guarantees non-interference.
The pipeline scales because the contract forces isolation. The same property that makes individual stations composable (no hidden state) makes entire pipeline runs composable (no shared state). Parallelism falls out of the structure; it is not bolted on as an optimization.
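The isolation claim translates directly to standard-library parallelism. In this sketch, `pipeline_stub` is an invented stand-in for the five-station loop above; because every run owns its artifacts, mapping over targets needs no locks or shared memory.

```python
from concurrent.futures import ProcessPoolExecutor

def pipeline_stub(system, observation, vision):
    # stand-in for the five-station loop: all contract artifacts
    # live inside this call's frame
    return {"system": system, "pr": f"fix({observation})", "vision": vision}

def run_one(target):
    system, observation = target
    return pipeline_stub(system, observation, vision="VISION.md")

def run_many(targets, workers=4):
    # no shared state between runs, so map is safe without coordination
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, targets))
```

Ten agents on ten systems is `run_many` with ten targets: the frame rule does the synchronization work that locks would otherwise do.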
The human writes one document: VISION.md
VISION.md holds statements like "The compiler replaces the kernel library," "No vendor primitives," "The search should find it." These are constraints on what counts as a valid fix. A pipeline that reads VISION.md knows to reject its own custom kernel before the maintainer has to. It knows to adjust the heuristic's defaults instead of adding a special case.
VISION.md is the prereg checklist (Ch 12) applied to the project, not the experiment. It blocks a specific failure mode: the investigation succeeds technically but fails architecturally. Without VISION.md, the pipeline ships the fix and the maintainer rejects the PR. With VISION.md, the pipeline rejects the fix itself, re-enters the hypothesis graph, and searches for a solution that is both correct and aligned.
The human provides direction: what kind of system is this, and what kind of system should it be. The pipeline handles everything else.
Remove any tool and the pipeline breaks
Every tool from the textbook appears in the pipeline:
| Tool | Chapter | Where it appears in the pipeline |
|---|---|---|
| Diff | 4 | Station 1: snapshot before/after, extract figure and ground |
| Bi-abduction | 5 | Station 1: infer the frame automatically; contract compositionality |
| Tri-abduction | 6 | Station 3: split oscillatory hypotheses at the branching condition |
| Economy of research | 7 | Station 1: rank hypotheses by information gain per cost |
| E-values | 8 | Station 3: track cumulative evidence, not point estimates |
| Four bins | 9 | Stations 3–4: classify trajectory shape to determine next action |
| Kill conditions | 10 | All stations: failures generate edges in the hypothesis graph |
| Convergence | 11 | Station 4: two reviewers, iterated to zero new findings |
| Prereg checklist | 12 | Stations 1–2: block self-deception before experiments run |
| Proof manual | 13 | Stations 2, 4: escalation rules for validating claims structurally |
No tool is optional. Remove the diff and the pipeline cannot generate hypotheses. Remove bi-abduction and it cannot infer the frame, so hypotheses overgeneralize. Remove the economy of research and it tests hypotheses in arbitrary order. Remove e-values and it relies on point estimates that lie. Remove the four bins and it cannot read the trajectory shape. Remove kill conditions and failures teach nothing. Remove convergence and it never stops. Remove the prereg checklist and it deceives itself. Remove the proof manual and it cannot escalate when induction fails.
The textbook is the pipeline. The pipeline is the textbook.
The framework predicts it will generalize. Chapter 15 tests that.
The pipeline works on one system at a time. One agent, one system, one fix. The contract guarantees isolation. The monoidal structure guarantees composability.
The three independent convergences predict generalization: the pattern recurs in static analysis, game AI, and machine learning without shared lineage. The pipeline implements the pattern as an explicit loop. Any system that can be perturbed and measured should be a valid target.
The next chapter tests this claim on a concrete case: a system that searches without reasoning, that perturbs and observes but never asks why. If the pipeline produces a fix that grid search cannot find in the same budget, the prediction holds. If it does not, the shape of the failure will name what the textbook got wrong.
Chapter 15.
Neighbors
- 🔭 Methodeutics
- Ch 5: Bi-abduction — the frame rule and compositional analysis
- Ch 7: Economy of Research — which hypotheses are worth testing?
- Ch 15: Case Study — applying the pipeline to a concrete system
- Investigate — the blog post that implements this pipeline
External