Prework

Part of the methodology series.

Coding agents are fast on greenfield and sloppy on brownfield. Give one a blank repo and it ships. Point it at a production codebase — internal monorepo, open source project, anything with existing reviewers — and the output degrades. The diff is large, the rationale is thin, the tests look generated. The agent doesn’t know what the reviewer needs to see to say yes. It optimizes for “code that works,” not “code that gets merged.”

The bottleneck in a collaborative codebase is never the code generation. It’s the review — a human deciding whether to trust your contribution. Prework is what closes that gap.

Predicate vs. transformation

Every contribution to an existing codebase answers two questions. Does this approach work? And does this code implement it correctly? The first is a claim about the world — call it the predicate. The second is a mechanical task — call it the transformation. They need different evidence, different tools, different iteration speeds.

I needed to know whether union-find compaction preserves more detail than flat summarization for chat history. The production repo had no cheap way to compare variants — no fixtures, no evaluation harness, no controlled conditions. So the predicate lived in a prototype repo: synthetic conversations, its own runner, seven trials. The transformation (porting the algorithm to TypeScript) took ninety minutes. Rebase, fix, verify, push. No design arguments during review. No “should we?” — only “does it work?”
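
To make the predicate concrete: a minimal Python sketch of union-find compaction, clustering linked messages so each cluster can be summarized separately instead of flattening everything into one summary. The message texts and the `links` input are hypothetical stand-ins, not the prototype's actual fixtures.

```python
# Sketch of the predicate under test: union-find groups related messages
# into clusters; summarizing per cluster preserves detail that a single
# flat summary would collapse. Messages and links are invented examples.

class UnionFind:
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def compact(messages: list[str], links: list[tuple[int, int]]) -> list[list[str]]:
    """Cluster messages connected (directly or transitively) by `links`."""
    uf = UnionFind(len(messages))
    for a, b in links:
        uf.union(a, b)
    clusters: dict[int, list[str]] = {}
    for i, msg in enumerate(messages):
        clusters.setdefault(uf.find(i), []).append(msg)
    return list(clusters.values())

messages = ["fix auth bug", "auth bug repro", "ship dark mode", "dark mode CSS"]
clusters = compact(messages, links=[(0, 1), (2, 3)])
# Two clusters survive; flat summarization would merge all four messages.
```

The experiment's job was to check whether those surviving clusters actually preserve more detail downstream — a claim the production repo couldn't test cheaply.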

Settle the predicate before you touch production. The transformation becomes verifiable, delegatable, and fast.

What to build

Prework is worth building when you can name the failure category. Not the specific bug — the kind of surprise.

Experiment repo. Guards against: the approach doesn’t actually help. Build it when production can’t answer cheaply — feedback loop too slow, fixtures too coupled, or evaluation needs controlled conditions CI can’t provide. The experiment repo should have its own fixtures, its own runner, and results a reviewer can audit independently.
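
The shape is small. A sketch, assuming a Python prototype — the variant names, fixtures, and scoring function are placeholders for whatever the experiment actually measures:

```python
# Minimal shape of an experiment repo's runner: its own fixtures, its own
# loop, and a results file a reviewer can audit independently. Variant
# names and the scoring function are hypothetical placeholders.
import json
import os
import statistics
import tempfile

FIXTURES = [  # synthetic inputs the repo owns outright
    {"id": "conv-01", "input": "short chat"},
    {"id": "conv-02", "input": "long multi-topic chat"},
]

def run_variant(name: str, fixture: dict) -> float:
    """Placeholder: score one variant on one fixture (higher = more detail kept)."""
    return len(fixture["input"]) * (1.2 if name == "union_find" else 1.0)

def main() -> dict:
    results = {}
    for variant in ("union_find", "flat_summary"):
        scores = [run_variant(variant, f) for f in FIXTURES]
        results[variant] = {"scores": scores, "mean": statistics.mean(scores)}
    # The auditable artifact: committed to the repo, linked from the PR.
    path = os.path.join(tempfile.gettempdir(), "results.json")
    with open(path, "w") as fh:
        json.dump(results, fh, indent=2)
    return results

results = main()
```

The point is ownership of the whole loop: nothing here depends on production's build system, so iteration is seconds, not minutes.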

Compatibility suite. Guards against: the port diverges from the original. Build it when the prework is in a different language or framework than the target. Import the prototype’s fixtures, run them through the production implementation, assert identical outputs. Two bugs survived three review rounds because this artifact didn’t exist.
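
A compatibility suite can be this small. The sketch below assumes the prototype exports its fixtures as JSON pairs of input and expected output; `compact_production` is a hypothetical stand-in for the ported implementation:

```python
# Sketch of a compatibility suite: replay the prototype's recorded fixtures
# through the production implementation and demand identical output. The
# fixture format and `compact_production` are invented stand-ins; in the
# real setup the fixtures are exported by the prototype repo.
import json

# Each fixture pairs an input with the output the prototype produced.
FIXTURES_JSON = """
[
  {"input": [1, 2, 2, 3], "expected": [1, 2, 3]},
  {"input": [], "expected": []}
]
"""

def compact_production(xs: list[int]) -> list[int]:
    """Stand-in for the ported implementation under test (order-preserving dedupe)."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

failures = []
for case in json.loads(FIXTURES_JSON):
    got = compact_production(case["input"])
    if got != case["expected"]:
        failures.append((case["input"], case["expected"], got))

assert not failures, f"port diverges from prototype: {failures}"
```

The asymmetry is the value: the prototype's outputs are frozen as data, so any divergence in the port fails loudly instead of surviving review.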

Transformation design doc. Guards against: structural decisions made ad hoc during the port. Build it when the prototype’s architecture doesn’t map onto the target’s. It specifies which module maps where, which interfaces change shape, and which behaviors differ intentionally.

Integration manifest. Guards against: forgetting where things live. Build it when the contribution spans multiple repos, remotes, or branches. One line per artifact: repo, branch, remote, what it hosts. Free, prevents twenty-minute archaeology sessions.
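
To scale: a hypothetical manifest — every repo, branch, and artifact name below is invented.

```
# INTEGRATION.md — one line per artifact (all names hypothetical)
prototype   github:me/compaction-lab    main      experiment, fixtures, results.json
port        internal:chat/monorepo      feat/uf   TypeScript implementation
compat      internal:chat/monorepo      feat/uf   tests importing prototype fixtures
writeup     github:me/blog              draft     pre-registered results post
```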

What not to build

Prework against failure categories you can’t name is speculative architecture — wrong by the time the failure arrives.

Don’t pre-build defenses against unknown review feedback — that’s the reviewer’s job. Pre-building abstractions for hypothetical extensions, or a compatibility suite for a one-shot script, falls into the same trap. Prework that misses its target is worse than none: it creates false confidence.

The filter: can you name the failure category? If yes, is the prework artifact legible to someone other than you — a reviewer, a CI pipeline, a future agent? If both yes, build it. If you can’t name the category, skip it and learn from the failure when it comes.

Artifacts compound

Prework that stays in your head isn’t prework. It’s thinking, and thinking doesn’t compound. The artifact is what compounds: an experiment repo gets cited in a blog post, linked from a PR, referenced in review, audited in a retro. A design doc gets read by the agent doing the port. A compatibility suite catches regressions on every future change.

Artifacts that encode what’s true — experiments, specs, validated prototypes — have long half-lives and accumulate references; artifacts that encode where things are — manifests, checklists — have short half-lives and go stale. Build the first kind when the failure category is project-killing. Build the second when rediscovery costs more than a one-line note.

The recipe

Given a spec for a brownfield contribution:

  1. Identify the predicate. What claim does this contribution make? What would falsify it?
  2. Check whether production can falsify it cheaply. If the test suite is fast, fixtures are simple, and evaluation is binary — skip the prototype, work in production.
  3. If not, build a prototype repo. Match the target’s data shapes and interfaces; skip its build system and dependencies. Own the whole loop: fixtures, runner, evaluation. Validate the predicate with evidence a reviewer can audit.
  4. Name every failure category you can foresee for the port. For each one, decide: experiment, compatibility suite, design doc, or manifest.
  5. Port. The transformation should be mechanical. If it requires design decisions, the prework missed something — go back and build the missing artifact.

The port itself is where tools like /forge and /volley operate — automated pipelines that sharpen specs, implement, verify, and clean up PRs. Those tools optimize the transformation. Prework optimizes what comes before: the evidence that makes the transformation worth reviewing at all.

The reviewer opens the diff: one commit, a link to a pre-registered experiment, eighty tests matching the prototype’s assertions. They review the code, not the premise. No back-and-forth about whether the algorithm works — that question has a URL with seven trials and a p-value.

Reviewers don’t separate “does this work?” from “should we do this?” They just feel “I can’t say yes to this.” Prework resolves the first question before they encounter the diff. Whatever doubt remains is about direction, not correctness — and now it’s small enough to name. You don’t need to categorize the uncertainty upfront. You build the artifacts, and the distinction reveals itself.

The bottleneck in a collaborative repo is rarely the build. It’s the review — a maintainer deciding whether to trust your contribution. Prework converts “trust me” into “here’s the provenance.” What lands in production is just the receipt.


Written via the double loop.