
Registered Prediction

Falsification as daily practice · Example: Gwern Branwen, 2010→ · gwern.net/prediction-market

Any claim can be hedged after the fact. The defense is a long, public, timestamped record that makes post-hoc revision impossible.

[Figure: calibration plot of predicted probability vs. actual frequency, with the perfect-calibration diagonal; the region above the diagonal marks underconfidence (actual > predicted).]

The method

Popper gave science a criterion: a theory is scientific if it specifies what would disprove it. But the criterion is rarely practiced. Scientists state hypotheses. They don't often write down, in advance, the exact observation that would make them abandon the hypothesis. Pundits never do. Registered prediction operationalizes falsification: state your claim, assign a probability, set a deadline, record it publicly, then wait.

The mechanics are simple. "I assign 70% probability that X happens by 2025." When 2025 arrives, X either happened or it didn't. Over hundreds of such predictions, a calibration curve emerges. If your 70% predictions come true 70% of the time, you're well-calibrated. If they come true 90% of the time, you're systematically underconfident. If 50%, you're overconfident. The curve is a diagnostic for self-deception, measured quantitatively.
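To make the bookkeeping concrete, here is a minimal sketch in Python of how a calibration curve can be computed from a list of resolved predictions. The record format, bucket width, and sample records are illustrative assumptions, not part of any particular tracking system.

```python
from collections import defaultdict

# Each resolved prediction: (stated probability, whether it came true).
# These records are made-up illustrations, not real data.
predictions = [
    (0.70, True), (0.70, True), (0.70, False),
    (0.90, True), (0.50, False), (0.50, True),
    # ... hundreds more in a real record
]

def calibration_curve(predictions, bucket_width=0.1):
    """Group predictions by stated probability and compare to actual frequency."""
    buckets = defaultdict(list)
    for p, outcome in predictions:
        bucket = round(p / bucket_width) * bucket_width   # e.g. 0.68 -> 0.7
        buckets[bucket].append(outcome)
    curve = []
    for bucket in sorted(buckets):
        outcomes = buckets[bucket]
        actual = sum(outcomes) / len(outcomes)
        curve.append((bucket, actual, len(outcomes)))
    return curve

for predicted, actual, n in calibration_curve(predictions):
    gap = actual - predicted
    label = "underconfident" if gap > 0 else "overconfident" if gap < 0 else "calibrated"
    print(f"predicted {predicted:.0%}: actual {actual:.0%} over {n} predictions ({label}, gap {gap:+.0%})")
```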

Kant anticipated this. In the Critique of Pure Reason, he observed that a bet "disconcerts" the dogmatist because it forces honest assessment of confidence. Alex Tabarrok gave the modern version: "A bet is a tax on bullshit." A registered prediction is a bet against your future self, scored by reality.

What the method catches

Feynman's first principle: you must not fool yourself. If you consistently predict 80% and events happen 60% of the time, you are fooling yourself by exactly 20 percentage points. You can't argue with the calibration curve. You can only update your predictions.

Mayo's severity criterion asks: could this test have caught you being wrong? A prediction registered at 90% that resolves false is a severe test. If your model survives thousands of such tests with good calibration, the model has passed severe testing. If it doesn't, the curve shows exactly where it breaks down. Over 2,000 predictions, the sample is large enough that systematic bias cannot hide behind variance.
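Why volume matters can be put in numbers. A back-of-the-envelope sketch, with invented figures (an 80% stated probability hitting 75% of the time is an assumption for illustration only), shows how the same miscalibration gap goes from noise to unmistakable as the count of resolved predictions grows:

```python
from math import sqrt
from statistics import NormalDist

def miscalibration_z(stated_p, hits, n):
    """z-score for the gap between stated probability and observed hit rate,
    under the null hypothesis that the forecaster is perfectly calibrated."""
    observed = hits / n
    se = sqrt(stated_p * (1 - stated_p) / n)   # binomial standard error
    return (observed - stated_p) / se

# Illustrative numbers: 80% predictions coming true about 75% of the time.
for n in (50, 500, 2000):
    z = miscalibration_z(0.80, int(0.75 * n), n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    print(f"n={n:5d}: z={z:+.2f}, two-sided p≈{p_value:.1e}")
# A 5-point gap looks like noise at n=50 but cannot hide at n=2000.
```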

Chamberlin warned that a favorite hypothesis blinds you. Prediction registration works against this: you can't retroactively claim you predicted the outcome you observed. The record is public, timestamped, immutable. Post-hoc rationalization — the most common form of self-deception in science and punditry alike — is structurally impossible.
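The "public, timestamped, immutable" property needs very little machinery. As a hedged sketch (a hypothetical helper, not any particular registry's API), the snippet below serializes a prediction and produces a SHA-256 digest; posting the record, or just the digest, on a dated public page fixes the claim in time, and any later edit changes the hash.

```python
import hashlib
import json
from datetime import datetime, timezone

def commit(claim, probability, deadline):
    """Produce a tamper-evident commitment to a prediction.
    Publishing the digest (or the full record) timestamps the claim;
    any later edit produces a different hash."""
    record = {
        "claim": claim,
        "probability": probability,
        "deadline": deadline,
        "registered": datetime.now(timezone.utc).isoformat(),
    }
    canonical = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256(canonical.encode()).hexdigest()
    return record, digest

record, digest = commit("X happens", 0.70, "2025-12-31")
print(digest)   # post this publicly; reveal the full record at resolution
```

In practice the simplest version is just posting the full prediction on a public, dated page, which is what the text describes; the digest only adds tamper-evidence when the claim itself needs to stay private until resolution.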

What the practice requires

Element | Why it matters
Explicit probability | "I think X" is unfalsifiable. "70% by March" is not. The number is the method.
Public registration | Prevents revision. The prediction exists before the outcome. Popper's criterion, enforced by a timestamp.
Resolution tracking | Most predictions vanish. Tracked ones accumulate into a calibration record that Mayo would recognize as a severity audit.
Volume | A few lucky calls prove nothing. Thousands reveal systematic bias. Fisher's lesson: significance requires sample size.
Published failures | Feynman's integrity test: report what didn't work. A calibration record that hides the misses is cargo cult prediction.
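Resolution tracking and published failures together mean every resolved prediction, hit or miss, gets scored. One standard way to do that is the Brier score, which the text does not name but which is the usual proper scoring rule for probability forecasts; the sketch below uses made-up records.

```python
def brier_score(resolved):
    """Mean squared error between stated probability and the outcome (1 or 0).
    Lower is better. Scoring every resolved prediction, misses included,
    is what keeps the record from becoming cargo cult prediction."""
    return sum((p - outcome) ** 2 for p, outcome in resolved) / len(resolved)

# Illustrative records only: (stated probability, outcome as 1 or 0).
resolved = [(0.70, 1), (0.70, 0), (0.90, 1), (0.60, 0), (0.95, 1)]
print(f"Brier score over {len(resolved)} predictions: {brier_score(resolved):.3f}")
# Reference points: always answering 50% scores 0.25; a perfect oracle scores 0.0.
```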

Discussion

Pre-registration in clinical trials solves the same problem at the institutional level. Researchers declare their hypotheses and analysis plan before running the experiment. This prevents the p-hacking and HARKing that Ioannidis showed inflates false positives across entire fields. Registered prediction is pre-registration scaled down to one person making claims about the world. The mechanism is the same: commit to your hypothesis before seeing the data.

The Good Judgment Project showed that forecasting skill is real, measurable, and trainable. It also showed that it degrades without practice. Participation across multiple seasons produced rankings that went up and down — top quartile one year, 12th percentile another. That variation, published alongside the peaks, is itself a data point about the method's demands. Calibration isn't a talent you have. It's a practice you maintain.

The deeper point is about Goodhart's law. Popper's falsification criterion can be gamed: state a vague hypothesis, then reinterpret the outcome to fit. Mayo's severity criterion can be selectively applied: report the tests that passed, bury the ones that didn't. The paper trail resists gaming because it's public, timestamped, and immutable. You can Goodhart a method. You can't Goodhart a record that anyone can audit.

Neighbors