Investigate
Part of the methodology series. Sequel to Abduction.
Software engineering encodes two of the three modes. Deduction: type checkers, provers, linters. Induction: test suites, benchmarks, CI. Abduction, the third mode, stays manual. A human reads the profile, stares at the flame graph, and has a hunch. What if the hunch had a pipeline?
Coding agents can write code, run tests, and ship PRs. They still need a human to decide what to fix. Abduction argued the primitive is encodable. This post is the encoding. Five stations, one loop. A pipeline from “something is slow” to a mergeable PR, mechanized abduction at the hinge.
Shape
Each station produces a self-contained artifact. Any station can stop and the output stands alone: diagnosis, experiment repo, measurement, convergence report, PR. The contract is simple: each station writes everything the next station needs. No hidden state.
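A sketch of that contract, assuming nothing about the real tooling: the station names are the post's, the types and fields are invented.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Artifact:
    """What a station leaves behind: self-contained, on disk, no hidden state."""
    kind: str                  # "diagnosis", "experiment repo", "measurement", ...
    path: str                  # directory that stands alone if the pipeline stops here
    notes: dict = field(default_factory=dict)

Station = Callable[[Artifact], Artifact]   # each station reads only the previous artifact

def run(stations: list[Station], start: Artifact) -> list[Artifact]:
    """Run stations in order; every artifact in the result is a valid stopping point."""
    artifacts = [start]
    for station in stations:   # hypothesize, prework, benchmark, bug hunt, ship
        artifacts.append(station(artifacts[-1]))
    return artifacts
```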
The loop is the mechanism. Kill conditions from downstream stations re-enter the hypothesis graph. The e-value classification names what happens next: divergent follows the edge, convergent refines, oscillatory splits, chaotic redesigns.
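Spelled out as a table, a sketch: the four shapes are the series' e-value classes, the glosses and action strings are mine.

```python
from enum import Enum, auto

class Trajectory(Enum):
    DIVERGENT = auto()      # consistent movement in one direction
    CONVERGENT = auto()     # settling toward a value
    OSCILLATORY = auto()    # helps some cases, hurts others
    CHAOTIC = auto()        # no usable structure

NEXT_MOVE = {
    Trajectory.DIVERGENT:   "follow the edge deeper into the graph",
    Trajectory.CONVERGENT:  "refine the hypothesis and re-test",
    Trajectory.OSCILLATORY: "split the hypothesis into narrower cases",
    Trajectory.CHAOTIC:     "redesign the experiment",
}
```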
Hypothesize
Start with a question about a system you can perturb. Fan out competing hypotheses. For each one: perturb, measure, classify the trajectory shape, prune. Abduction proposes, deduction traces consequences, induction tests. The graph deepens until the frontier closes or you redirect.
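A sketch of the loop under obvious simplifications; every callable is a stand-in for a real experiment, and only the control flow is the point.

```python
def investigate(question, propose, test, classify, refine, split, budget=10):
    """Fan out, perturb, classify, prune. All callables are placeholders:
    propose -> hypotheses, test -> observation, classify -> trajectory shape,
    refine/split -> new hypotheses."""
    frontier, survivors = list(propose(question)), []
    while frontier and budget:
        budget -= 1
        h = frontier.pop()
        shape = classify(test(h))          # perturb, measure, name the trajectory
        if shape == "divergent":
            survivors.append(h)            # real signal: keep it, go deeper next pass
        elif shape == "convergent":
            frontier.append(refine(h))     # close, but the hypothesis needs sharpening
        elif shape == "oscillatory":
            frontier.extend(split(h))      # too coarse: split into narrower hypotheses
        # chaotic: prune; redesign the experiment before it can say anything
    return survivors                       # frontier closed or budget spent
```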
A good investigation attacks its own premise first; if the premise survives untouched, you weren’t adversarial enough.
Example: I assumed tinygrad’s LLaMA gap was realize overhead from novel graph shapes. The investigation killed that premise. The symbolic JIT handles varying shapes without recompilation. The gap was elsewhere.
Prework
When a surviving hypothesis implies a code change, build the prework before touching production: reference implementation, candidate fix as a pure function, compatibility suite, extraction script that confirms the bug exists in the target.
The extraction script is the derisk. If it doesn’t confirm the bug, the diagnosis was wrong. Re-enter the hypothesis graph.
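Sketched, with every file name and predicate invented rather than taken from the actual repo:

```python
# Illustrative prework skeleton:
#   reference.py   known-good behavior, the oracle
#   candidate.py   the fix as a pure function: no production imports, no side effects
#   compat_test.py candidate must agree with reference on every case
#   extract.py     dump what production actually does and check the bug is there

def bug_confirmed(kernel_dump: str) -> bool:
    """extract.py's gate: look for the evidence the diagnosis predicts.
    In the tinygrad case that evidence was a strided access pattern in the
    generated Metal kernel; substitute whatever your hypothesis predicts."""
    return "predicted bad access pattern" in kernel_dump   # placeholder predicate

def gate(kernel_dump: str) -> None:
    if not bug_confirmed(kernel_dump):
        raise SystemExit("bug not confirmed: diagnosis wrong, re-enter the hypothesis graph")
```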
Example: extract.py dumped the generated Metal kernel and showed stride-32768 in the inner loop. 87% of every cache line wasted. The bug was confirmed, not hypothetical.
Benchmark
Implement the minimal fix, measure before and after, and classify the benchmark trajectory as you would any e-value:
- Divergent improvement on all tested cases → proceed.
- Oscillatory (helps some cases, hurts others) → the fix is too coarse. Re-enter the hypothesis graph. Split.
- No improvement → the diagnosis was wrong. Re-enter the hypothesis graph.
The oscillatory case is the most valuable. It means you found a real lever but applied it too broadly. The split produces a simpler, more targeted fix.
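One way to mechanize that call, as a sketch over per-case relative changes; the thresholds and labels are placeholders.

```python
def classify_benchmark(deltas: dict[str, float], noise: float = 0.02) -> str:
    """Map per-case relative changes (+0.10 = 10% better, -0.10 = 10% worse)
    to a trajectory. Thresholds and labels are placeholders."""
    better = [case for case, d in deltas.items() if d > noise]
    worse  = [case for case, d in deltas.items() if d < -noise]
    if better and not worse:
        return "divergent: proceed"
    if better and worse:
        return f"oscillatory: split on {worse}"     # real lever, applied too broadly
    return "no improvement: diagnosis wrong, re-enter the hypothesis graph"

# Plugging in the numbers from the example below:
# classify_benchmark({"contiguous": 0.74, "nn.Linear (transposed)": -0.25})
# -> "oscillatory: split on ['nn.Linear (transposed)']"
```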
Example: removing GROUP on the reduction axis improved contiguous weights by 74% but regressed transposed weights (nn.Linear) by 25%. Oscillatory. The split revealed that the improvement came from wider UPCAST, not GROUP removal.
Bug hunt
Two reviewers from different model families, iterated to convergence. The first catches structural issues, the second logic bugs. Zero new findings = converged.
If the hunt kills the fix, the kill condition is a new observation. Classify it and re-enter the hypothesis graph. This is mechanized adversarial abduction: the reviewer generates hypotheses, not just verdicts.
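A sketch of the convergence loop; the reviewer interface is invented, and in practice each reviewer wraps a model from a different family.

```python
def bug_hunt(diff: str, reviewers, max_rounds: int = 5) -> list[str]:
    """Iterate reviewers over a diff until a full round surfaces nothing new.
    Each reviewer is a callable diff -> set of findings."""
    findings: set[str] = set()
    for _ in range(max_rounds):
        new = set()
        for review in reviewers:
            new |= review(diff) - findings
        if not new:                # zero new findings = converged
            return sorted(findings)
        findings |= new            # each finding is a new observation for the graph
    raise RuntimeError("no convergence: redesign the fix before shipping")
```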
Example: Gemini flagged that nn.Linear transposes its weights. The kill condition re-entered the graph as an oscillatory observation, split the hypothesis, and the surviving fix was one number instead of two.
Ship
The only station requiring human approval. Everything before it is local and reversible. A PR triggers CI on someone else’s hardware and occupies reviewer attention.
Before opening the PR: check for existing PRs. Git blame the changed line. Deliberate design choice? Address the author’s reasoning. Inherited default nobody tuned? Say so.
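The mechanical half of that checklist, sketched with stock git and gh commands; the paths and search query are illustrative.

```python
import subprocess

def preflight(repo: str, path: str, line: int, query: str) -> None:
    """Pre-PR checks with stock git/gh commands; all arguments are illustrative."""
    # Who wrote the line being changed, and in which commit? Deliberate design
    # choice or inherited default? The blame output is where that answer starts.
    subprocess.run(["git", "blame", "-L", f"{line},{line}", "--", path],
                   cwd=repo, check=True)
    # Has someone already opened (or closed) a PR for this? Needs the GitHub CLI.
    subprocess.run(["gh", "pr", "list", "--state", "all", "--search", query],
                   cwd=repo, check=True)
```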
Example: the value was copied from kernel.py in a file move a year ago. The TODO said “adjust/tune.” Nobody did. One number, one line, one file.
Cost
I spent an hour on prework; the regression it caught would have cost trust. The experiment repo gives you both sides of an oscillatory result; the compatibility suite proves the fix is safe; the benchmark table makes the PR description three lines instead of three paragraphs.
Scale
One agent, one repo, one fix. Ten agents, ten repos, ten fixes. No shared state between runs. The monoidal contract makes them independent.
But the real end state isn’t humans pointing agents at problems. It’s agents noticing the problems themselves. A benchmark drifts. A latency percentile creeps. The system spins up an investigation, classifies the trajectory, builds the experiment repo, benchmarks the fix, runs adversarial review. The three modes in a closed loop: observation produces theory, theory produces experiment, experiment produces observation.
Vision
The human’s role narrows to one artifact: VISION.md. Not tool configuration. Project philosophy. “The compiler replaces the kernel library.” “No vendor primitives.” “BEAM should find it.” A pipeline that reads VISION.md knows to reject its own custom kernel before the maintainer has to. It knows to adjust the heuristic’s defaults instead.
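One crude way such a gate could look; the constraint format and the keyword check are mine, and a real gate would be a model reading VISION.md, not a string match.

```python
def violates_vision(vision_md: str, fix_summary: str) -> bool:
    """Treat lines like "No vendor primitives." as hard constraints and reject a
    fix whose summary trips one. A keyword match stands in for a model actually
    reading VISION.md."""
    constraints = [line.strip().lower() for line in vision_md.splitlines()
                   if line.strip().lower().startswith("no ")]
    banned = [c.removeprefix("no ").rstrip(".") for c in constraints]
    return any(term in fix_summary.lower() for term in banned)

# violates_vision("No vendor primitives.\n", "add a custom kernel built on vendor primitives")
# -> True
```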
If the hypothesis graph converges, the benchmark is divergent, the bug hunt finds nothing, the compatibility suite passes, and the fix aligns with VISION.md: what’s the human adding at the gate?
The scientific method, which was always personal and sequential, becomes continuous and autonomous. The third mode isn’t manual anymore.
The tinygrad case: experiment repo, PR #16072. The full hypothesis graph has the evidence trail. The /investigate skill runs in Claude Code.