Testing by Betting

Glenn Shafer · 2021 · JRSS-A 184(2), 407–431

If you can get rich wagering against the null hypothesis, the null is wrong. The wealth you accumulate is the evidence.

The argument

Statistical testing began in two minds. Fisher's p-value asked: how surprising are these data under the null? Neyman and Pearson asked: how often will this procedure reject a true null in the long run? Both framings dominated twentieth-century practice, and both have a peculiar property: they punish you for looking at your data more than once. Peek and you must correct for it; combine two studies and you must adjust for dependence; stop early and your p-value is no longer valid. The mathematics treats curiosity as a crime against rigor.

Shafer is reviving an older tradition. On the betting interpretation, probability has always been about wagers. Pascal, Bernoulli, and de Moivre worked in this idiom, and Kolmogorov's measure-theoretic axiomatization in the 1930s buried it. Shafer and Vovk's program over the 2000s and 2010s pulls it back into hypothesis testing. Skeptic puts up $1 against the null, designs a wager whose expected payoff under the null, conditional on what has been seen so far, is at most the amount staked, and lets the data play out. The ratio of final wealth to starting wealth is the e-value. If Skeptic ends with $20, the data argued against the null with strength 20, in the same units regardless of when Skeptic chose to stop.
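
The wager can be made concrete with a toy case. The sketch below assumes the null is "this coin is fair" and that Skeptic stakes a fixed fraction of current wealth on heads at even odds each round. The function name `wealth_path`, the parameter `bet_fraction`, and the flip sequence are all hypothetical, chosen for illustration rather than taken from the paper.

```python
from typing import Iterable, List


def wealth_path(outcomes: Iterable[int], bet_fraction: float) -> List[float]:
    """Skeptic's wealth while betting against the null 'the coin is fair'.

    Each round Skeptic stakes `bet_fraction` of current wealth on heads at
    even odds, so wealth is multiplied by (1 + f) on heads and (1 - f) on
    tails. Under the null the expected multiplier is
    0.5 * (1 + f) + 0.5 * (1 - f) = 1, so the bet is fair to the null.
    """
    if not 0.0 <= bet_fraction < 1.0:
        raise ValueError("bet_fraction must lie in [0, 1) so wealth stays positive")
    wealth = 1.0  # Skeptic starts with $1
    path = [wealth]
    for x in outcomes:
        wealth *= (1.0 + bet_fraction) if x == 1 else (1.0 - bet_fraction)
        path.append(wealth)
    return path


# Hypothetical data: 1 = heads, 0 = tails.
flips = [1, 1, 0, 1, 1, 1, 1, 1]
path = wealth_path(flips, bet_fraction=0.5)

# The e-value is final wealth divided by starting wealth.
e_value = path[-1] / path[0]
print(e_value)  # 1.5**7 * 0.5 = 8.54296875
```

Every entry of `path` is a legitimate place to stop and report, which is the sense in which the units do not depend on the stopping rule.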

The trick is that the wealth process is a (super)martingale under the null: a realized path can still rise, but the bet is not expected to win unless the null fails. Ville's inequality, a martingale generalization of Markov's, then guarantees that under the null the probability the wealth ever crosses 1/α is at most α. Rejecting the first time the wealth reaches 1/α therefore gives a valid level-α test that is anytime-valid by construction.
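
Written out, the two inequalities differ only in the quantifier over time. The notation here is an assumption of this sketch, not the paper's: $W_t$ is Skeptic's wealth after round $t$ with $W_0 = 1$, and $P_0$, $\mathbb{E}_0$ denote probability and expectation under the null.

```latex
% Markov: a bound at one fixed, pre-specified time n
\[
P_0\!\left( W_n \ge \tfrac{1}{\alpha} \right)
  \;\le\; \alpha \, \mathbb{E}_0[W_n]
  \;\le\; \alpha .
\]

% Ville: the same bound holds simultaneously over all times
\[
P_0\!\left( \exists\, t \ge 0 : W_t \ge \tfrac{1}{\alpha} \right)
  \;\le\; \alpha .
\]
```

Markov's bound covers a single look at time $n$; Ville's covers every look at once, which is what licenses stopping whenever the wealth is high enough.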

Key concepts

Concept	What it means
E-value	A non-negative random variable with expectation ≤ 1 under the null. Larger values are stronger evidence against the null.
Test supermartingale	A wealth process whose conditional expectation never increases under the null. Realized paths can rise; the bet is just not expected to win unless the null fails. Its value at any stopping time is a valid e-value.
Skeptic vs. Nature	Game-theoretic framing: Skeptic bets against the null, Nature reveals data. The story is the proof.
Multiplicative composition	When e-values are chained as a sequential test, with each bet fixed before its own data arrive, their product is a valid e-value for the combined test. Within that construction, the bound holds at every stopping time without Bonferroni or alpha-spending.
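
The multiplicative row of the table can be sketched in a few lines. This is a minimal illustration, assuming each stage's e-value comes from a bet that was fixed before that stage's data were seen. The helper names `combine_sequential` and `rejects` and the stage values are hypothetical.

```python
import math
from typing import Sequence


def combine_sequential(e_values: Sequence[float]) -> float:
    """Multiply stage-wise e-values into one e-value.

    Assumption: each stage's bet has expectation at most 1 under the null
    given everything observed in earlier stages. Without that, the product
    need not be a valid e-value.
    """
    if any(e < 0 for e in e_values):
        raise ValueError("e-values must be non-negative")
    return math.prod(e_values)


def rejects(e_value: float, alpha: float) -> bool:
    """Reject the null at level alpha when wealth has reached 1/alpha."""
    return e_value >= 1.0 / alpha


# Hypothetical e-values from four successive stages of one experiment.
# A stage can lose money (0.5) without invalidating the running product.
stage_e_values = [2.0, 0.5, 4.0, 5.0]
total = combine_sequential(stage_e_values)
print(total)                       # 20.0
print(rejects(total, alpha=0.1))   # True, since 20 >= 10
```

No correction factor appears anywhere: the running product is itself the wealth, and the threshold 1/α is the only place the level enters.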

Connections

Fisher's p-value answered "how surprised should I be by these data, under the null?" Shafer reframes the question as "how much would I bet against the null, given these data?" The two are not the same number, and Shafer argues the betting question is the more honest one. A p-value of 0.04 says "if the null were true, results at least this extreme would arise 4% of the time," but that probability is fragile under any deviation from the assumed sampling plan: peeking, optional stopping, dependent observations, post-hoc analyses. An e-value of 25 says "Skeptic turned $1 into $25 betting against the null." That statement does not depend on when Skeptic chose to stop.
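
The fragility under peeking is easy to see in simulation. The sketch below assumes a fair coin (so the null is true), a two-sided z-test at nominal level 0.05, and an analyst who re-runs the test after every flip and stops at the first "significant" result. The function name and all parameter values are hypothetical.

```python
import math
import random


def peeking_false_positive_rate(
    n_sims: int, max_flips: int, min_flips: int, seed: int
) -> float:
    """Fraction of fair-coin runs in which a naive z-test, re-checked after
    every flip from `min_flips` onward, ever reports |z| > 1.96."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        heads = 0
        for n in range(1, max_flips + 1):
            heads += 1 if rng.random() < 0.5 else 0
            if n < min_flips:
                continue
            # z-statistic for H0: P(heads) = 0.5 after n flips
            z = (heads - 0.5 * n) / math.sqrt(0.25 * n)
            if abs(z) > 1.96:
                hits += 1  # analyst stops and declares significance
                break
    return hits / n_sims


rate = peeking_false_positive_rate(n_sims=2000, max_flips=200, min_flips=10, seed=0)
print(f"false positive rate with peeking: {rate:.3f}")
```

With these settings the reported rate should come out well above the nominal 5%, even though every individual test was computed correctly. The error is in the sampling plan, not the arithmetic.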

Mayo's severity principle and Shafer's betting framework are first cousins. Both ask whether a test could have caught a false hypothesis. Severity asks counterfactually: would a false null have probably produced different data? Betting asks operationally: did Skeptic, betting against the null, walk away rich? When the test is well-designed, the two answers align. The betting framework adds something Mayo's framework lacks: a single number, in interpretable units, that composes across experiments.

Ioannidis diagnosed the replication crisis as a tooling problem dressed up as a discipline problem. Researchers peek at their data because curiosity is unstoppable; the tools then punish them for it, so they hide the peeking. The mathematics encourages concealment. Shafer's framework removes the incentive: peek all you want, the wealth process is valid at every stopping time. The tool stops fighting the human.
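
The "peek all you want" claim can be checked empirically under the same kind of toy setup. This sketch assumes a fair coin (the null is true), a Skeptic who stakes a fixed fraction of wealth on heads each round, and an analyst who looks after every flip and rejects the moment wealth reaches 1/α. The function name and parameter values are hypothetical.

```python
import random


def ever_crosses_rate(
    n_sims: int, max_flips: int, bet_fraction: float, alpha: float, seed: int
) -> float:
    """Fraction of null (fair-coin) runs in which Skeptic's wealth ever
    reaches 1/alpha within `max_flips` rounds, checking after every flip."""
    rng = random.Random(seed)
    threshold = 1.0 / alpha
    crossings = 0
    for _ in range(n_sims):
        wealth = 1.0
        for _flip in range(max_flips):
            heads = rng.random() < 0.5
            # Even-odds bet on heads: fair to the null, expected multiplier 1
            wealth *= (1.0 + bet_fraction) if heads else (1.0 - bet_fraction)
            if wealth >= threshold:
                crossings += 1  # analyst peeks, sees enough wealth, rejects
                break
    return crossings / n_sims


rate = ever_crosses_rate(n_sims=2000, max_flips=200, bet_fraction=0.5, alpha=0.05, seed=0)
print(f"false rejection rate with continuous peeking: {rate:.3f}")
```

Ville's inequality bounds the true crossing probability by α = 0.05, so the simulated rate should sit at or below that level up to simulation noise, despite 200 looks per run.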

The reframing

This is not a small technical adjustment. Shafer is claiming that fixed-horizon significance testing was always a special case of something more general, and a poorly behaved one. The general thing is the betting game; the fixed-sample-size test is what you get when you forbid Skeptic from reacting to intermediate results. Most real experiments do not satisfy that constraint, which is why p-values are so easy to misuse and so hard to combine.
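
One way to make the special case explicit is to write a fixed-level test as a single all-or-nothing bet. The notation is an assumption of this sketch: $X$ is the full data set, $R$ is a rejection region chosen in advance with $P_0(X \in R) = \alpha$, and $E$ is Skeptic's payoff per $1 staked.

```latex
\[
E(X) \;=\; \frac{\mathbf{1}\{X \in R\}}{\alpha},
\qquad
\mathbb{E}_0\bigl[E(X)\bigr]
  \;=\; \frac{P_0(X \in R)}{\alpha}
  \;=\; 1 .
\]
```

Here $E$ takes only the values $0$ and $1/\alpha$, so "wealth reaches $1/\alpha$" and "the test rejects" are the same event. Skeptic places one bet before any data arrive and cannot adjust it, which is the constraint the paragraph above describes.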

The historical irony is that the betting framework is closer to what scientists actually want to do (accumulate evidence, peek when curious, stop when convinced) than the Fisherian apparatus that occupied the twentieth century. The mathematics caught up to the practice, not the other way around.

The paper is on arXiv:1903.06991. The longer book treatment is Shafer & Vovk's Game-Theoretic Foundations for Probability and Finance (Wiley, 2019). The 2001 precursor is Probability and Finance: It's Only a Game!

Neighbors

🔬 Fisher 1935 — the p-value Shafer's framework supersedes
🔬 Mayo 2018 — severity as a counterfactual cousin of betting
🔬 Ioannidis 2005 — the crisis the betting framework attacks mathematically
🔬 Ramdas 2023 — the consolidation into safe anytime-valid inference

by june.kim