How Not to Run SWE-bench Pro

How I thought I beat the bench, found out I was using the wrong rulebook, and the field guide I wish I'd had first

I ran all 728 eligible instances of SWE-bench Pro and resolved 694 of them, 95.3%, under the official grader, on a fresh container, solo. Most of it billed to a $200/month Claude Max plan and a handful of EC2 boxes. No grant, no cluster, no API budget that needed sign-off. The receipts are public: github.com/kimjune01/swebench-pro. That number is a harness result on the public split, not a leaderboard claim and not a statement about model capability; the difference is most of what this post is about.

This post is not the result. The result has its own writeup. This is the part nobody publishes: the operational stuff I got wrong, the one mistake that would have made the whole number meaningless, and the day I realized the number didn’t mean what I thought it did. It’s written for anyone building a coding harness and trying to measure whether it actually works, on consumer tokens instead of a budget that needs approval. Read it before you spend any.

I thought I’d beaten the bench. I was keeping score by the wrong rulebook.

When the harness first cleared 95%, I thought I’d won. I’d resolved more of SWE-bench Pro than the models on the public leaderboard, on a subscription, alone. For about a day it felt like beating the bench.

I hadn’t beaten anything. I’d been scoring myself by a rulebook the bench doesn’t use, on three counts at once.

The leaderboard ranks models through one fixed, standardized harness. I’d done the opposite: held the model roughly fixed and rebuilt the harness around it. A harness number can’t sit on that board by construction; there was never a slot for it to win. I’d been comparing my run to numbers it shares no axis with.

The public split may be contaminated: these repos can sit in a model’s training data, which is the whole reason the bench holds out a private split. A number on the public set is an audition for the held-out one, not a trophy (README, “Will this hold on the private set?”).

And the part that stung, because it’s the subject of this whole post: some of my earlier, even-higher numbers weren’t the harness at all. They were the oracle and my own gate doing the work; I’d been grading myself by my rules, not the bench’s. The 95.3% that survived is real in a bounded sense: a public-split, source-only, officially graded harness result, and nothing beyond that. The triumphant version in my head, “I beat SWE-bench Pro,” was scoring a game whose rules I’d quietly written myself. The painful lesson wasn’t a bug in the code. It was that a big, true-looking number is trivial to manufacture when you’re the one holding the answer key.

How I found out is the lesson under the lesson: the rulebook wasn’t in the repo, it was in the paper. The runnable code (the grader, the dataset, the harness) will happily let you run a task and tell you nothing about whether it’s the bench’s task. I had to read pages 4–9 of the Scale paper to learn what Pro actually does: it hands the agent the problem, requirements, and interface, and holds the FAIL_TO_PASS tests out for grading. It is not a red-to-green task. My early setup restored the gold tests, stood up a gate, and iterated to green, a different and much easier task that the code never warned me I’d wandered into. The cost of not reading the paper first was about a week and a thousand dollars of compute on a number that couldn’t be interpreted (WORKLOG, “Why the 94% was never a leaderboard number”). The cheap probe I’d skipped: a surprising number is a stop sign, and the first thing to read when you hit one is the source that defines the task, not the source that runs it.

Rule one: don’t hand over the oracle

SWE-bench grades a patch by running the project’s own test suite. Each instance ships with a test patch: the new or modified tests a correct fix has to pass. That test patch is the answer key. In the literature it’s the oracle, and the single fastest way to inflate your number is to hand it to the agent. It leaks through two surfaces that fail in opposite directions.

Input leak: the test source in the prompt. It is tempting to paste the failing test body into context. The bug report is vague, the test is precise, and the model fixes the bug faster when it can read exactly what “fixed” means. That’s the trap. A model that can see the assertion writes the smallest thing that satisfies the assertion, which is often not a fix but a hard-coded special case shaped like the test. You stop measuring whether the harness can diagnose and repair and start measuring whether it can pattern-match a function to an assert. I ran this as a deliberate baseline (PREREGISTRATION-baseline-replication.md): on a 50-instance sample, an implement-only loop with no oracle floored at 50%, and restoring the real tests for the gate to iterate against pushed the ceiling to 96%. That floor and ceiling are the bracket. Forty-six points sat between the two arms, bought by the gate having the tests to iterate against, not by the same model reasoning any harder. If the test bodies are in the context, your number lives near that ceiling and it is the oracle’s, with your harness’s name on it.

The one permissible boundary, and it’s narrow: on the public set you may read the FAIL_TO_PASS test names as a stopping signal, “keep going until these pass,” since the public tests are visible anyway. The names tell the loop when to halt. The bodies tell the model what to write. The first is budget control; the second is cheating. SWE-bench Pro holds out a private split, so on the real target even the names are gone and the gate has to go blind; a harness that only works with the names in hand never really worked.
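The names-versus-bodies boundary can be made structural instead of a matter of discipline. A minimal Python sketch, assuming a hypothetical loop in which `results` maps test names to pass/fail from the local gate; `build_prompt` and `should_stop` are illustrative names, not functions from the actual harness:

```python
def build_prompt(problem: str, requirements: str, interface: str) -> str:
    """Assemble the agent's context from task text only.

    Test names and test bodies are deliberately not parameters, so neither
    can reach the model through this function.
    """
    return "\n\n".join([problem, requirements, interface])


def should_stop(results: dict[str, bool], fail_to_pass: list[str]) -> bool:
    """Halt only when every named test has been run and has passed.

    Only the loop sees the names. A test missing from `results` counts as
    not passing, and an empty name list never stops the loop.
    """
    return bool(fail_to_pass) and all(
        results.get(name, False) for name in fail_to_pass
    )
```

On a held-out set the second function has nothing to consume, which is the point: whatever replaces it has to work without the names.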

Capture leak: test edits in the graded diff. This one is sneakier because it fools you, not the score. Left alone, an agent that wants its own gate green will sometimes edit the test instead of the code. Grade the agent’s raw git diff and you grade its tampering and log a win that isn’t one. The fix is to capture source only: take the diff, then strip every test-file hunk, every build and runtime blob (node_modules, redis appendonly.aof, build directories), and any single-file diff over 256 KB, before anything is graded (METHODOLOGY.md). The official grader restores the real gold tests on a clean container regardless, so the only thing the agent’s test edits can ever do is deceive a local grader that trusts the agent’s diff. Source-only capture deletes that surface. Two of the only skill edits I made between starting and freezing were source-only capture rules. Neither changed how the agent solves anything; both bias against false wins. That’s the right place to spend your paranoia.
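A minimal sketch of a source-only filter over a unified git diff, to show the shape of the stripping pass. The path patterns are assumptions for illustration (a real harness would tune them per repo and language), and the function names are hypothetical; only the 256 KB cap and the example blob names come from the text above:

```python
import re

MAX_FILE_DIFF_BYTES = 256 * 1024  # per-file cap from the text

# Assumed patterns, for illustration only.
TEST_PATTERNS = [
    re.compile(r"(^|/)(tests?|__tests__|spec)/"),  # test directories
    re.compile(r"(^|/)test_[^/]+\.py$"),           # Python test modules
    re.compile(r"_test\.go$"),                     # Go test files
    re.compile(r"\.(test|spec)\.[jt]sx?$"),        # JS/TS test files
]
BLOB_PATTERNS = [
    re.compile(r"(^|/)(node_modules|build)/"),     # dependency and build dirs
    re.compile(r"(^|/)appendonly\.aof$"),          # redis runtime file
]


def split_per_file(diff: str) -> list[str]:
    """Split a unified git diff into one chunk per 'diff --git' header."""
    chunks, current = [], []
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git ") and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


def path_of(chunk: str) -> str:
    """Return the b/ path from the 'diff --git a/x b/y' header line."""
    return chunk.splitlines()[0].rsplit(" b/", 1)[-1]


def source_only(diff: str) -> str:
    """Keep only per-file diffs that are source, small, and not blobs."""
    kept = []
    for chunk in split_per_file(diff):
        if not chunk.startswith("diff --git "):
            continue  # drop any preamble before the first file header
        path = path_of(chunk)
        if any(p.search(path) for p in TEST_PATTERNS + BLOB_PATTERNS):
            continue
        if len(chunk.encode("utf-8")) > MAX_FILE_DIFF_BYTES:
            continue
        kept.append(chunk)
    return "".join(kept)
```

The filter works on whole per-file chunks, so it never emits a half-stripped file. Paths containing " b/" or spaces would need a more careful header parser than this sketch has.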

Rule two: your gate is a stopping signal, never the verdict

Follow capture to its conclusion: the agent’s opinion of whether it passed must never count. The harness has an internal gate so the loop knows when to stop. That gate is allowed to be wrong. Mine disagreed with the official grader once, a PATH bug that made it false-negative on Go while the gold patch graded resolved. It cost budget and changed no verdict, because the verdict is always the official grader re-run on the captured source-only diff in a fresh container.

I learned how load-bearing this is by trying to get out of it. On the held-out set there is no visible oracle, so I went looking for a substitute: let the agent reason its way to confidence, or read its own diff critically, or write and run its own repro from the bug report and stop when that passes. Every substitute failed, and the last one failed loudly. On four hard losses I had codex write its own check from the prose, no peeking at the gold tests, and stop when it went green. It went green 4 out of 4. The official grader failed all 4 (WORKLOG, “self-audit rung”). Not merely unhelpful: actively misleading. Trust that self-authored check as your stop signal and you ship four broken fixes believing they’re correct. You cannot bootstrap the oracle you don’t have. Reasoning about correctness, reviewing for it, executing a check you wrote yourself: all the same move wearing different hats, and none of them an independent ground-truth signal. Only the official grader is. Build that wall before you tune anything, because every optimization afterward is pushing on a gate you’ve already decided not to trust.

Rule three: settle the attribution on a sample, before you spend 728 instances

This is the rule that separates building a harness from measuring one, and the one I most wish I’d internalized before the celebration. You do not run the full set to find out whether your harness works. You run a random subset, settle the attribution there, freeze, and spend the full run as confirmation. By the time you commit 728 instances of tokens you should already know the number within sampling error. The big run is the witness, not the experiment.
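“Within sampling error” can be put in numbers before any tokens are spent. A sketch using the Wilson score interval for a binomial proportion; the choice of interval is mine, not something the run’s methodology specifies, and the counts in the usage note are hypothetical:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a resolve rate (z = 1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For a hypothetical 48 of 50, `wilson_interval(48, 50)` gives a band more than ten points wide. That is the precision a 50-instance sample buys, and it is worth knowing before deciding whether the sample has settled the question or only suggested an answer.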

The corrected claim you’re working toward is narrow and worth saying out loud: this measures a harness on the public split under official grading, not a model’s capability on the benchmark. What you’re attributing is the thing it’s easy to fool yourself about: how much of that harness number is the harness, and how much is something else wearing its name. A resolve rate alone can’t tell you, so on a sample you run the arms that decompose it. Run the bare model on the standardized scaffold over the same instances, so your headline is a lift over a real baseline and not a figure floating in space. Run the oracle arm against the no-oracle arm, the 50-versus-96 split above, so you see exactly how many wins evaporate when the answer key leaves the prompt. Knock a stage out and re-run, to see whether it earns its tokens. Each of these is a few dozen instances, and each answers a question the full run never can. A big number you can’t decompose is a vibe, not a measurement.
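A paired comparison over the same instance IDs is enough to run any of these decompositions. A minimal sketch, assuming each arm’s official verdicts are available as a mapping from instance ID to resolved or not; the function names and the IDs in the usage note are hypothetical:

```python
def resolve_rate(verdicts: dict[str, bool]) -> float:
    """Fraction of instances the official grader marked resolved."""
    if not verdicts:
        raise ValueError("no verdicts")
    return sum(verdicts.values()) / len(verdicts)


def compare_arms(baseline: dict[str, bool], treatment: dict[str, bool]) -> dict:
    """Compare two arms run over exactly the same instances."""
    if baseline.keys() != treatment.keys():
        raise ValueError("arms must cover the same instances")
    gained = sorted(i for i in baseline if treatment[i] and not baseline[i])
    lost = sorted(i for i in baseline if baseline[i] and not treatment[i])
    return {
        "baseline_rate": resolve_rate(baseline),
        "treatment_rate": resolve_rate(treatment),
        "lift": resolve_rate(treatment) - resolve_rate(baseline),
        "gained": gained,  # resolved only with the treatment
        "lost": lost,      # resolved only without it
    }
```

Calling `compare_arms(no_oracle, oracle)` returns the lift and also names the instances that flip, which is what lets you read the individual wins the answer key bought. The key-set check refuses to compare arms that ran over different instances.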

A warning that belongs here, because the audience is people writing measurement code: your instrument has bugs too, and they void experiments silently. One of my review-arm runs detected a sentinel string in the agent’s stdout to decide “no findings, stop.” The sentinel matched the prompt’s own echo of that string, so it clean-stopped all 35 runs while 13 reviewers had actually found real defects. The whole arm was noise and I almost reported it (WORKLOG, “A2 sentinel bug”). Before you trust an ablation, confirm the instrument actually moved: that revisions happened, that the knob you turned changed behavior downstream. An experiment that runs clean and changes nothing is the most expensive kind of nothing.
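One way to harden that kind of sentinel check, as a sketch. The sentinel value, the function names, and the stop-reason labels are all hypothetical. The idea is to drop the echoed prompt before matching, require the sentinel to be a whole line, and separately check that the arm’s runs did not all take the same branch:

```python
SENTINEL = "NO_FINDINGS"  # hypothetical sentinel value


def reviewer_said_clean(stdout: str, prompt: str) -> bool:
    """True only if the sentinel is a whole line outside the echoed prompt."""
    body = stdout.replace(prompt, "")  # remove a verbatim echo of the prompt
    return any(line.strip() == SENTINEL for line in body.splitlines())


def instrument_moved(stop_reasons: list[str]) -> bool:
    """An arm where every run took the same branch has not shown the knob works."""
    return len(set(stop_reasons)) > 1
```

The second function is deliberately crude. An arm in which all 35 runs report the same stop reason may be legitimate, but it should trigger a manual look at a few transcripts before the result is trusted.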

And freeze before the scored run, for real. Iterating on a public sample is free and good; that’s how the harness takes shape. Tuning against the full scored set after you’ve seen its verdicts is overfitting with extra steps, and it’s invisible in the final number. I froze the harness as a tagged config and ran it once over the whole eligible set, scored verdicts walled off from further edits. The sample is where you learn; the full run is where you stop learning and report.
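A freeze is easier to keep when the scored run refuses to start on anything else. A sketch of one way to enforce that, assuming the harness config is a JSON-serializable dict; the fingerprint scheme and function names are illustrative and are not how the tagged config in this run was implemented:

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Stable SHA-256 over a canonical JSON encoding of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def require_frozen(config: dict, frozen_fingerprint: str) -> None:
    """Raise before the scored run if the config differs from the frozen one."""
    actual = config_fingerprint(config)
    if actual != frozen_fingerprint:
        raise RuntimeError(
            f"config drifted since freeze: {actual[:12]} != {frozen_fingerprint[:12]}"
        )
```

Recording the fingerprint next to the tag at freeze time and calling `require_frozen` at the top of the scored run turns a late tweak into a hard failure.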

The subscription will betray the measurement, not just the wallet

Running on consumer tokens is where the wrong rulebook gets you a second time. The billing path manufactures its own false scoreboard: an auth stall or a starved quota produces a sub-90-second failure that looks exactly like the model giving up on a hard bug, and if you score it as one, your number is depressed by infrastructure wearing the model’s face.

A stray API key silently overrides your subscription. The claude CLI resolves credentials in a precedence order, and an ANTHROPIC_API_KEY in the environment wins over your subscription OAuth. A single leaked key, inherited from a shell profile or an old .env, doesn’t error: it bills you per request at pay-as-you-go rates while you believe you’re on the flat plan. I set CLAUDE_SUBSCRIPTION=1 and had the dispatch environment actively drop any stray key before launch. Treat a present-but-unwanted key as a fault to scrub, not a convenience to ignore.
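A sketch of scrubbing at the dispatch boundary, so the child process never inherits the key whatever the parent shell holds. The function name is hypothetical, and CLAUDE_SUBSCRIPTION is taken from the text above as this harness’s own variable; its exact semantics are assumed here:

```python
import os
from typing import Mapping, Optional

UNWANTED_KEYS = ("ANTHROPIC_API_KEY",)


def dispatch_env(base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Copy the environment, drop stray API keys, mark the subscription path."""
    env = dict(os.environ if base is None else base)
    for key in UNWANTED_KEYS:
        env.pop(key, None)  # a present-but-unwanted key is scrubbed, not ignored
    env["CLAUDE_SUBSCRIPTION"] = "1"  # harness variable from the text; meaning assumed
    return env
```

Passing the result as `env=` to `subprocess.run` keeps the scrub at the call site, where a later change to a shell profile cannot undo it.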

Put the load on the bigger bucket. My harness runs two models behind two subscriptions: Claude for the generator, codex for the craft challenger. In my May 2026 setup the codex allowance is more than double the Claude Max one, so the binding constraint over a multi-day run is almost always the Claude bucket. That asymmetry is a design input. If a stage can run on either model, put it on the bucket with headroom and keep the scarce tokens for the work only that model does. I considered moving the challenger onto Claude for a same-model control and dropped it precisely because it would have dumped all the load onto the bucket that runs out first. Map the token ceilings before you assign the stages; the cheapest model is the one with headroom left at hour 60, not the one with the lowest per-token rate.

The OAuth bucket is shared with the consumer app, and the credential rotates out from under you. Max OAuth draws from the same pool as Claude.ai chat traffic, so the fleet gets starved exactly when that traffic peaks. My losses clustered hard at 19:00–21:00 UTC, US working hours, and went near-zero overnight: same boxes, same code, same instances, tracking other people’s chat sessions and nothing about my run. Separately, three times the provider rotated my pushed OAuth credential server-side and started rejecting it, producing waves of sub-90-second “losses” carrying the literal string “Failed to authenticate. API Error: 401 Invalid authentication credentials” over a 0-byte patch (auth_storm_2026-05-29.md). Read either of those as reasoning collapse and your score is depressed by infrastructure. The fix was never more retries: schedule off-peak, warm each box until it returns OK before dispatching, and move the tail to paid API billing so it stops sharing the bucket.

Don’t let infra failures launder into your score, and don’t let recovery become a re-roll lever

Here’s the trap inside that last section. Once you decide an auth storm isn’t a real loss, you’ve handed yourself a knife: “that one was infrastructure” becomes an excuse you can apply to any loss you dislike, after you see which ones they are. That’s how an honest run quietly becomes a curated one.

The defense is to commit the escape hatch in advance and make it mechanical. I wrote the fault class down with four invariants that all have to hold: the verbatim 401 string, a 0-byte patch, a wave of at least three instances, and resolution by a fresh credential. I committed that definition the day I first saw the storm, before any re-dispatched verdict landed, so it couldn’t become a post-hoc filter on results I’d already seen (PREREGISTRATION.md §14). Anything missing one invariant stays a loss and stands. A non-empty patch that graded “not resolved” is always a loss, no matter how I feel about it. Recovered instances re-run from a byte-identical task and reach a real verdict; the final tally had zero parked in limbo. Infrastructure gets discounted by a rule written before the outcomes were visible, not by judgment applied after.
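The four invariants translate directly into a predicate, which is what makes the rule mechanical. A sketch with a hypothetical record type; the 401 string is the one quoted earlier, and the caller is assumed to have grouped losses into a wave already (for example by time window):

```python
from dataclasses import dataclass

AUTH_401 = "Failed to authenticate. API Error: 401 Invalid authentication credentials"
MIN_WAVE = 3


@dataclass(frozen=True)
class Loss:
    instance_id: str
    log: str
    patch_bytes: int
    cleared_by_fresh_credential: bool


def recoverable(wave: list[Loss]) -> list[Loss]:
    """Return the losses that meet every invariant, or none if the wave is too small."""
    candidates = [
        loss for loss in wave
        if AUTH_401 in loss.log                 # verbatim 401 string
        and loss.patch_bytes == 0               # 0-byte patch
        and loss.cleared_by_fresh_credential    # fault clears with a fresh credential
    ]
    # Fewer than MIN_WAVE qualifying instances is not a storm; every loss stands.
    return candidates if len(candidates) >= MIN_WAVE else []
```

A loss with a non-empty patch can never pass this filter, however it feels, and nothing here consults the re-dispatched verdict.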

Two more infra lessons earned at hour 60, both one line. Sanity-check absolute state, not derived state. My grader-health watchdog killed healthy graders because it inferred liveness from sibling containers, and the bench’s own grader leaks orphan containers that poisoned that inference (WORKLOG, “grader leaks containers”). And guard your env at the call site. An unset variable after set -u once turned a fleet rsync into a copy of my entire root filesystem to each box; : ${REPO:?} ${SSH:?} would have stopped it cold. Tired-operator-at-hour-60 is part of your threat model.
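The shell guard `${VAR:?}` aborts on a variable that is unset or empty, whereas `set -u` on its own traps only unset names. For dispatch code written in Python, a rough equivalent looks like the sketch below; `require_env` is a hypothetical helper name:

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the named variables, exiting if any is unset or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise SystemExit(f"refusing to run: unset or empty: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

Calling `require_env("REPO", "SSH")` on the line directly above the command that interpolates them keeps the guard and the hazard in the same place on screen.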

Do every free check before you spend a token

The cheapest lesson, learned fastest. On a subscription, token volume is the scarce resource, so anything verifiable at $0 should be verified before any tokens move. Grade the gold patches through your adapter to confirm the grader works across every language in the set. Run your gate’s self-tests, red-on-base and green-on-gold, so you know it discriminates before you trust it to stop a loop. Confirm Docker, the pinned eval commit, the capture stripping, all free. My bootstrap.sh prints READY or the exact debug command and refuses to proceed until green. Every minute spent making the free checks loud is a token you don’t waste discovering, mid-run, that your adapter silently mis-graded one language the whole time.
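The “READY or the exact debug command” pattern is small enough to sketch. This is not the bootstrap.sh from the run. It is a Python rendering of the same idea, with hypothetical check names and placeholder debug hints supplied by the caller:

```python
from typing import Callable, NamedTuple, Sequence


class Check(NamedTuple):
    name: str
    probe: Callable[[], bool]  # returns True when the check passes
    debug_hint: str            # the command to run when it does not


def preflight(checks: Sequence[Check]) -> str:
    """Run free checks in order; stop at the first failure with its debug hint."""
    for check in checks:
        try:
            ok = check.probe()
        except Exception:
            ok = False  # a crashing probe is a failed check, not a skipped one
        if not ok:
            return f"NOT READY: {check.name} failed; debug with: {check.debug_hint}"
    return "READY"
```

A caller would register one `Check` per free verification (gold patches grade resolved, the gate is red on base and green on gold, capture stripping works) and dispatch only when the return value is exactly `"READY"`.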

What it cost, and the field card

So you can budget your own: about $5.14 per instance at full metered API rates for both models, the portable figure to plan against (COST_BASIS.md). My actual cash was far lower, roughly $870 all in, because most of the 728 ran on the flat Max plan at near-zero marginal cost and only the auth-stall tail, around 310 instances, billed to paid API; EC2 was about $58. Report those two numbers separately and always quote the economic rate to a reproducer, because the subscription subsidy doesn’t reproduce. (The 728: 731 dataset instances minus 3 whose own gold patch fails the official grader, frozen as defects before the run.)
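The arithmetic behind those figures, as a short sketch. Every input is a number quoted in this post. The cash figure is the rounded “roughly $870”, so the per-instance cash value is approximate:

```python
DATASET_INSTANCES = 731
GOLD_DEFECTS = 3            # instances whose own gold patch fails the grader
RESOLVED = 694
ECONOMIC_RATE_USD = 5.14    # per instance at full metered API rates
CASH_ALL_IN_USD = 870       # rounded figure from the text

eligible = DATASET_INSTANCES - GOLD_DEFECTS           # 728
resolve_rate = RESOLVED / eligible                     # about 0.953
economic_total_usd = ECONOMIC_RATE_USD * eligible      # about 3,742
cash_per_instance_usd = CASH_ALL_IN_USD / eligible     # about 1.20

print(f"eligible={eligible} resolve_rate={resolve_rate:.1%}")
print(f"economic total=${economic_total_usd:,.2f}")
print(f"cash per instance=${cash_per_instance_usd:.2f}")
```

The gap between roughly $3,742 at metered rates and roughly $870 in cash is the subscription subsidy, which is why a reproducer should plan against the first number.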

The walls that keep a coding-harness number honest, on one card:

Read the paper, not just the repo. The task definition lives in the spec; the code will run a different, easier task and never tell you. A surprising number is a stop sign.
No test bodies in the prompt. Public-set test names as a stop signal at most; on a held-out set, blind.
Capture source only. Strip test files, build and runtime blobs, oversized diffs, before grading.
The official grader on a fresh container is the only verdict. Your gate is a stop signal that’s allowed to lie.
You can’t bootstrap the oracle you don’t have. Self-audit went 4/4 false-green against the real grader.
Attribute on a sample first: lift over the bare model, oracle vs no-oracle, stage ablations. Confirm the instrument actually moved.
Freeze before the full run. The sample is for learning; the scored run is for reporting.
Predeclare infra fault classes with invariants, before any re-dispatch lands. Otherwise recovery is a re-roll lever.
Pin the billing path. Scrub stray API keys, load the bigger bucket, run off-peak or pay for the tail.
Spend $0 before you spend tokens. Gold-grade, gate self-test, language coverage, all free.
Report economic cost separately from cash.

A benchmark this size was runnable by one person on a consumer plan, at least in this run and at these quotas. The thing between you and an honest number isn’t compute; it’s whether you hand over the oracle, trust your own gate, or let the billing path lie to you. I learned that by getting a number I was proud of, then finding out I’d scored it by my own rules. The 95.3% survived everything on that card (source-only capture, the official grader, a frozen config, and infra rules written before the outcomes were visible), which is exactly why it’s the smaller, bounded claim and not the trophy I first mistook it for. Build the walls first, and the celebration you get to keep is the one you graded honestly.