The Pipeline
Chapter 14 · Calcagno et al. 2009, Wang et al. 2023, Arjovsky et al. 2019
Chapter 12 protected the loop from self-deception. Chapter 13 gave escalation rules for validating claims. Now: wire every tool from the textbook into an autonomous pipeline. Five stations, one outer loop, and a contract that makes the stations compose.
Three independent convergences
Three systems close the abduction loop in production. Different fields, different problems, zero shared citations. The convergence is the evidence.
Facebook Infer (Calcagno et al. 2009)
Infer takes a code diff and produces bug reports. The pipeline is: observe the change (diff, Ch 4), infer the anti-frame and frame (bi-abduction, Ch 5), compose contracts across call boundaries, report where the inferred precondition conflicts with the calling context. The observation is formal (a code diff). The goal is formal (memory safety). The hypothesis is formal (the anti-frame). Every component is mechanized. No human intervenes between input and output.
Infer runs on every diff submitted to the Facebook codebase. Millions of lines, every day. The abduction primitive ("what must have been true for this function to succeed?") executes at industrial scale because the decomposition is compositional (Ch 5): each function is analyzed once, in isolation, and the results compose linearly.
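The compositional shape can be sketched in a few lines. This is a hypothetical toy, not Infer's implementation: conditions are modeled as flat sets of tagged facts, where real bi-abduction manipulates separation-logic assertions. The point it illustrates is the one above: each function yields a summary once, and callers compose summaries instead of re-analyzing bodies.

```python
def analyze(body_facts):
    """Infer a summary from a function body: the facts it needs
    (the anti-frame) and the facts it establishes."""
    pre = {f for f in body_facts if f.startswith("needs:")}
    post = {f for f in body_facts if f.startswith("gives:")}
    return pre, post

def compose(caller_post, callee_pre):
    """Return what the callee needs that the caller context
    does not provide: the seed of a bug report."""
    provided = {f.replace("gives:", "needs:") for f in caller_post}
    return callee_pre - provided

_, caller_post = analyze({"gives:ptr_valid"})
callee_pre, _ = analyze({"needs:ptr_valid", "needs:lock_held"})
missing = compose(caller_post, callee_pre)
# missing == {"needs:lock_held"}: the inferred precondition the
# calling context fails to satisfy
```

Because each summary is computed in isolation, adding a function costs one analysis, not a re-analysis of every caller.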
Voyager (Wang et al. 2023)
Voyager is a Minecraft agent. It observes failure (a task it cannot complete), proposes a fix (a new skill program), self-verifies (executes and checks the postcondition), and commits the result to a skill library. The loop is: observe → hypothesize → test → commit. It produced 180 verified skills without human intervention.
The structure maps onto the textbook. Observation: the agent fails to craft an item. Diff (Ch 4): the gap between the current skill library and the task requirement. Hypothesis: a new code snippet that fills the gap. Test: execute in the environment. Kill condition (Ch 10): if execution fails, the error message names the next hypothesis. Convergence (Ch 11): the skill is added to the library only when it passes verification. The agent has no theory of Minecraft physics. It has a theory of its own capabilities, maintained as a growing library, and that theory guides which tasks to attempt next.
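The observe → hypothesize → test → commit loop fits in one function. This is a minimal sketch, not Voyager's code: `propose_skill` stands in for the LLM code generator and `execute` for the Minecraft environment plus postcondition check, both supplied by the caller.

```python
def grow_library(tasks, propose_skill, execute, max_attempts=3):
    """Commit a skill to the library only after its check passes."""
    library = {}
    for task in tasks:
        prompt = task
        for _ in range(max_attempts):
            skill = propose_skill(prompt, library)  # hypothesize from the gap
            ok, error = execute(skill)              # test in the environment
            if ok:
                library[task] = skill               # commit the verified skill
                break
            # kill condition: the error message names the next hypothesis
            prompt = prompt + " | hint: " + error
    return library
```

The library is the agent's theory of its own capabilities: it grows only through verified commits, and failed attempts feed back as refined prompts rather than being retried verbatim.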
Invariant Risk Minimization (Arjovsky et al. 2019)
IRM observes that a classifier trained on one environment fails on another. Standard training minimizes loss. IRM minimizes loss subject to a constraint: the learned features must be invariant across training environments. Features that correlate with the label in environment A but not environment B are discarded. Only features that predict in all environments survive.
Environment variation is the lever for figure-ground separation (Ch 4). Invariant features are the figure: what the system depends on. Spurious correlations are the ground: what holds in one environment but not another. IRM automates the tri-abductive split (Ch 6). When a feature helps in one environment and hurts in another (oscillatory, Ch 9), the branching condition is the environment itself. The system splits the hypothesis without a human identifying the confound.
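A deliberately simplified stand-in for the constraint (not the IRMv1 penalty from the paper, which penalizes the gradient of per-environment risk): keep only features whose correlation with the label points the same direction in every environment. The data below is invented to show feature 1 flipping sign across environments.

```python
def invariant_features(envs):
    """envs: list of (X, y) pairs, X given as rows of feature lists.
    Returns indices of features whose label-correlation sign is
    consistent across all environments."""
    def corr_sign(X, y, j):
        mx = sum(row[j] for row in X) / len(X)
        my = sum(y) / len(y)
        cov = sum((row[j] - mx) * (yi - my) for row, yi in zip(X, y))
        return (cov > 0) - (cov < 0)

    n_features = len(envs[0][0][0])
    keep = []
    for j in range(n_features):
        signs = {corr_sign(X, y, j) for X, y in envs}
        if len(signs) == 1 and 0 not in signs:  # same direction everywhere
            keep.append(j)                       # figure: invariant
        # else: ground, or oscillatory across environments; discard
    return keep

env_a = ([[0, 0], [1, 1]], [0, 1])  # both features track the label
env_b = ([[0, 1], [1, 0]], [0, 1])  # feature 1 flips: spurious
# invariant_features([env_a, env_b]) keeps only feature 0
```

A feature that helps in `env_a` and hurts in `env_b` is exactly the oscillatory case: the environment is the branching condition, and the filter performs the split mechanically.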
One pattern across three domains
Three witnesses, three domains, one pattern.
| Step | Infer | Voyager | IRM |
|---|---|---|---|
| Observe | Code diff | Task failure | Cross-environment loss gap |
| Hypothesize | Infer anti-frame | Generate skill code | Propose invariant features |
| Test | Check precondition vs. context | Execute in Minecraft | Evaluate across environments |
| Kill / commit | Report bug or pass | Reject or add to library | Discard spurious or keep invariant |
| Loop | Next function in call graph | Next task | Next gradient step |
Each system embeds Peirce's three modes (Ch 1). Observation produces theory (abduction). Theory produces prediction (deduction). Prediction is tested (induction). The triangle closes. None of the three papers names the triangle. All three implement it.
Every station produces a standalone artifact
The textbook's tools assemble into five stations. Each produces a self-contained artifact. Any station can stop and the output stands alone: a diagnosis, an experiment repo, a measurement, a convergence report, a PR.
| Station | Artifact | Tools from the textbook | Kill condition |
|---|---|---|---|
| 1. Hypothesize | Hypothesis graph | Diff (Ch 4), bi-abduction (Ch 5), economy of research (Ch 7) | No surviving hypothesis → stop |
| 2. Prework | Experiment repo | Prereg checklist (Ch 12), proof manual (Ch 13) | Extraction script fails to confirm bug → re-enter station 1 |
| 3. Benchmark | Measurement table | E-values (Ch 8), four bins (Ch 9), kill conditions (Ch 10) | Oscillatory → split hypothesis, re-enter station 1 |
| 4. Bug hunt | Convergence report | Convergence (Ch 11), proof manual (Ch 13) | New bug found → classify and re-enter station 1 |
| 5. Ship | PR | Prereg checklist (Ch 12) | Conflicts with project vision → re-enter station 1 |
Every kill condition re-enters station 1. This is the outer loop. The pipeline does not branch midstream. It returns to the hypothesis graph, classifies the new observation using the four bins (Ch 9), and the trajectory shape names what happens next: divergent follows the edge (Ch 10), convergent refines the hypothesis (Ch 11), oscillatory splits it (Ch 6), chaotic redesigns the experiment.
Stations compose because contracts are explicit
Each station writes everything the next station needs. No hidden state:
| From | To | Contract artifact |
|---|---|---|
| Hypothesize | Prework | HYPOTHESIS_GRAPH.md: ranked hypotheses with evidence trails, predicted outcomes, falsification conditions |
| Prework | Benchmark | Experiment repo: reference implementation, candidate fix, compatibility suite, extraction script |
| Benchmark | Bug hunt | Measurement table: before/after numbers, trajectory classification, environment specs |
| Bug hunt | Ship | Convergence report: two independent reviewers, iterated to zero new findings |
| Ship (kill) | Hypothesize | Kill condition: classified observation re-entering the graph as a new edge |
The contract is monoidal. Station A's output is Station B's input. The composition is associative: (A ; B) ; C = A ; (B ; C). No station needs to know about any station except its immediate neighbors. This is separation logic's frame rule (Ch 5) applied to the pipeline itself: each station has a footprint (what it reads and writes) and a frame (everything else). The frame passes through untouched.
Stations can be replaced, reordered, or run by different agents. The contract is the interface. The implementation is local.
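The footprint/frame discipline can be made concrete. In this hypothetical sketch (the `station` decorator and artifact names are illustrative, not from the textbook's tooling), each station declares the one artifact it reads and the one it writes; everything else in the artifact dict is frame and passes through untouched, and composition fails loudly when contracts do not line up.

```python
def station(reads, writes):
    """Declare a station's footprint: one artifact consumed, one
    produced. All other artifacts are frame."""
    def wrap(fn):
        def run(artifacts):
            out = dict(artifacts)               # frame: copied through
            out[writes] = fn(artifacts[reads])  # footprint: read one, write one
            return out
        run.reads, run.writes = reads, writes
        return run
    return wrap

def compose(*stations):
    """Chain stations; each output contract must match the next input."""
    for a, b in zip(stations, stations[1:]):
        assert a.writes == b.reads, f"contract break: {a.writes} != {b.reads}"
    def pipeline(artifacts):
        for s in stations:
            artifacts = s(artifacts)
        return artifacts
    return pipeline

hypothesize = station("observation", "hypothesis_graph")(lambda o: f"graph({o})")
prework = station("hypothesis_graph", "experiment_repo")(lambda g: f"repo({g})")
run = compose(hypothesize, prework)
result = run({"observation": "latency spike", "vision": "VISION.md"})
# result["experiment_repo"] is built from the observation;
# result["vision"] passes through untouched: frame, not footprint.
```

Associativity falls out for free: composing `(hypothesize, prework)` then adding a third station gives the same pipeline as composing the last two first, because each `run` only touches its declared keys.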
Kills re-enter station 1, never retry
The five stations form the inner pass. The outer loop handles re-entry. When any station fires a kill condition, the pipeline returns to station 1, adds a new edge to the hypothesis graph, and starts a fresh inner pass.
The outer loop terminates when one of three conditions holds:
- Station 5 succeeds. The PR ships. The hypothesis graph is closed. Every open edge has been resolved (Ch 11).
- No surviving hypothesis. Station 1 cannot generate a hypothesis that survives the prereg checklist (Ch 12). The investigation concludes: the system is not broken, or the breakage is outside the investigation's scope.
- Budget exhausted. The economy of research (Ch 7) says the marginal information gain per experiment has dropped below threshold. Further investigation costs more than it returns.
Condition 2 is the most informative termination. "No surviving hypothesis after three rounds" teaches you something concrete: the initial observation was misleading, the system's behavior is within spec, or the hypothesis space was too narrow. Each diagnosis is itself a classified observation that can re-enter a higher-level investigation.
Code: the five-station loop
```python
def pipeline(system, observation, vision, max_rounds=5):
    """Five stations, one outer loop.

    system: the target under investigation
    observation: the surprising fact that started the inquiry
    vision: VISION.md — project philosophy, not tool configuration
    max_rounds: budget for outer-loop iterations
    """
    graph = HypothesisGraph(observation)
    for round in range(max_rounds):
        # ── Station 1: Hypothesize ──────────────────────────
        # Diff (Ch 4), bi-abduction (Ch 5), economy (Ch 7)
        hypotheses = graph.fan_out()
        hypotheses = rank_by_info_gain(hypotheses)  # Ch 7
        best = hypotheses[0]
        if not prereg_check(best):  # Ch 12
            graph.kill(best, reason="failed prereg")
            continue

        # ── Station 2: Prework ──────────────────────────────
        # Build experiment repo before touching production
        repo = build_experiment_repo(best, system)
        confirmed = repo.extraction_script.run()
        if not confirmed:
            graph.kill(best, reason="bug not confirmed")
            continue  # re-enter station 1

        # ── Station 3: Benchmark ────────────────────────────
        # E-values (Ch 8), four bins (Ch 9), kills (Ch 10)
        before = measure(system, baseline_config)
        after = measure(system, repo.candidate_fix)
        shape = classify_trajectory(before, after)  # Ch 9
        if shape == OSCILLATORY:
            # Split the hypothesis (Ch 6). Two constraints fighting.
            h1, h2 = graph.split(best, branching_condition)
            continue  # re-enter station 1
        if shape == FLAT:
            graph.kill(best, reason="no improvement")
            continue

        # ── Station 4: Bug hunt ─────────────────────────────
        # Convergence (Ch 11), proof manual (Ch 13)
        findings = adversarial_review(repo.candidate_fix)
        while findings:
            for bug in findings:
                shape = classify_trajectory(bug)  # Ch 9
                graph.add_edge(best, bug, shape)  # Ch 10
                repo.fix(bug)
            findings = adversarial_review(repo.candidate_fix)
        # Zero new findings = converged (Ch 11)

        # ── Station 5: Ship ─────────────────────────────────
        if not aligns_with_vision(repo.candidate_fix, vision):
            graph.kill(best, reason="conflicts with VISION.md")
            continue
        return ship(repo)  # PR

    # Budget exhausted. Report what was learned.
    return graph.summary()
```

The continue statements are the outer loop. Every kill re-enters station 1. The hypothesis graph accumulates across rounds: a hypothesis killed in round 2 is not re-proposed in round 3. The graph is the memory. The kills are the edges. The trajectory shape (Ch 9) classifies each edge, and the classification determines whether the next round follows, splits, refines, or stops.
Every kill teaches the graph something
Kill conditions from downstream stations do not discard the hypothesis. They re-enter the hypothesis graph as new observations. The four-bin classification (Ch 9) names what happens next:
| Kill source | Trajectory shape | Graph operation |
|---|---|---|
| Prework: extraction fails | Flat | Kill hypothesis. The diagnosis was wrong. Prune the branch. |
| Benchmark: helps A, hurts B | Oscillatory | Split hypothesis (Ch 6). Two constraints fighting. Find the branching condition. |
| Benchmark: no improvement | Flat | Kill hypothesis. The lever doesn't move this system. |
| Bug hunt: structural flaw | Divergent | Follow the edge (Ch 10). The flaw names a deeper problem. |
| Ship: conflicts with vision | Convergent (wrong direction) | Reject. The fix is correct but philosophically wrong. Adjust the heuristic's defaults, don't add a special case. |
The oscillatory case is the most instructive: the investigation found a real lever but applied it too broadly. The split produces a simpler, more targeted hypothesis. Investigations that pass through the oscillatory case once almost always converge on the second pass, because the branching condition narrows scope to where the lever actually works.
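The table above is small enough to be a dispatch. This sketch invents return tuples to stand in for the graph operations; the shape names follow the pipeline code, and `"convergent"` covers the refine path from Ch 11.

```python
# One graph operation per trajectory shape (Ch 9).
GRAPH_OPS = {
    "flat":        lambda h, obs: ("kill", h),            # prune the branch
    "oscillatory": lambda h, obs: ("split", h, obs),      # obs = branching condition
    "divergent":   lambda h, obs: ("follow_edge", h, obs),  # flaw names a deeper problem
    "convergent":  lambda h, obs: ("refine", h, obs),     # narrow the hypothesis
}

def reenter(hypothesis, observation, shape):
    """Every kill re-enters station 1 as a classified edge."""
    return GRAPH_OPS[shape](hypothesis, observation)

# reenter("cache thrash", "helps A, hurts B", "oscillatory")
#   -> ("split", "cache thrash", "helps A, hurts B")
```

The dispatch makes the chapter's claim mechanical: no kill is a dead end, because every shape maps to exactly one graph operation.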
Isolation makes the pipeline embarrassingly parallel
One agent, one system, one fix. Ten agents, ten systems, ten fixes. No shared state between runs.
Each pipeline run is independent. Agent A investigating system X cannot corrupt agent B investigating system Y, because the contract artifacts (hypothesis graph, experiment repo, measurement table, convergence report) are local to each run. The pipeline's footprint is the system under investigation. Its frame is everything else. The frame rule (Ch 5) guarantees non-interference.
The pipeline scales because the contract forces isolation. The same property that makes individual stations composable (no hidden state) makes entire pipeline runs composable (no shared state). Parallelism falls out of the structure; it is not bolted on as an optimization.
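The isolation claim translates directly to standard-library parallelism. In this sketch, `pipeline_stub` is an invented stand-in for the five-station loop above; because every run owns its artifacts, mapping over targets needs no locks or shared memory.

```python
from concurrent.futures import ProcessPoolExecutor

def pipeline_stub(system, observation, vision):
    # stand-in for the five-station loop: all contract artifacts
    # live inside this call's frame
    return {"system": system, "pr": f"fix({observation})", "vision": vision}

def run_one(target):
    system, observation = target
    return pipeline_stub(system, observation, vision="VISION.md")

def run_many(targets, workers=4):
    # no shared state between runs, so map is safe without coordination
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, targets))
```

Ten agents on ten systems is `run_many` with ten targets: the frame rule does the synchronization work that locks would otherwise do.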
The human writes one document: VISION.md
VISION.md holds statements like "The compiler replaces the kernel library," "No vendor primitives," "The search should find it." These are constraints on what counts as a valid fix. A pipeline that reads VISION.md knows to reject its own custom kernel before the maintainer has to. It knows to adjust the heuristic's defaults instead of adding a special case.
VISION.md is the prereg checklist (Ch 12) applied to the project, not the experiment. It blocks a specific failure mode: the investigation succeeds technically but fails architecturally. Without VISION.md, the pipeline ships the fix and the maintainer rejects the PR. With VISION.md, the pipeline rejects the fix itself, re-enters the hypothesis graph, and searches for a solution that is both correct and aligned.
The human provides direction: what kind of system is this, and what kind of system should it be. The pipeline handles everything else.
Remove any tool and the pipeline breaks
Every tool from the textbook appears in the pipeline:
| Tool | Chapter | Where it appears in the pipeline |
|---|---|---|
| Diff | 4 | Station 1: snapshot before/after, extract figure and ground |
| Bi-abduction | 5 | Station 1: infer the frame automatically; contract compositionality |
| Tri-abduction | 6 | Station 3: split oscillatory hypotheses at the branching condition |
| Economy of research | 7 | Station 1: rank hypotheses by information gain per cost |
| E-values | 8 | Station 3: track cumulative evidence, not point estimates |
| Four bins | 9 | Stations 3–4: classify trajectory shape to determine next action |
| Kill conditions | 10 | All stations: failures generate edges in the hypothesis graph |
| Convergence | 11 | Station 4: two reviewers, iterated to zero new findings |
| Prereg checklist | 12 | Stations 1–2: block self-deception before experiments run |
| Proof manual | 13 | Stations 2, 4: escalation rules for validating claims structurally |
No tool is optional. Remove the diff and the pipeline cannot generate hypotheses. Remove bi-abduction and it cannot infer the frame, so hypotheses overgeneralize. Remove the economy of research and it tests hypotheses in arbitrary order. Remove e-values and it relies on point estimates that lie. Remove the four bins and it cannot read the trajectory shape. Remove kill conditions and failures teach nothing. Remove convergence and it never stops. Remove the prereg checklist and it deceives itself. Remove the proof manual and it cannot escalate when induction fails.
The textbook is the pipeline. The pipeline is the textbook.
The framework predicts it will generalize. Chapter 15 tests that.
The pipeline works on one system at a time. One agent, one system, one fix. The contract guarantees isolation. The monoidal structure guarantees composability.
The three independent convergences predict generalization: the pattern recurs in static analysis, game AI, and machine learning without shared lineage. The pipeline implements the pattern as an explicit loop. Any system that can be perturbed and measured should be a valid target.
The next chapter tests this claim on a concrete case: a system that searches without reasoning, that perturbs and observes but never asks why. If the pipeline produces a fix that grid search cannot find in the same budget, the prediction holds. If it does not, the shape of the failure will name what the textbook got wrong.
Chapter 15.
Neighbors
- 🔭 Methodeutics
- Ch 5: Bi-abduction — the frame rule and compositional analysis
- Ch 7: Economy of Research — which hypotheses are worth testing?
- Ch 15: Case Study — applying the pipeline to a concrete system
- Investigate — the blog post that implements this pipeline
External