
Game-Theoretic Statistics and Safe Anytime-Valid Inference

Aaditya Ramdas, Peter Grünwald, Vladimir Vovk & Glenn Shafer · 2023 · Statistical Science 38(4), 576–601

Two decades of game-theoretic statistics, surveyed and named: SAVI. E-values are the evidence, anytime validity is the guarantee, multiplication is how you compose along a sequential test.

[Figure: three properties p-values lack, all valid by construction. Panels: continuous monitoring (peek anytime, stop anytime, no α inflation); composition under dependence (multiply e-values across studies, no Bonferroni); optional stopping (stop on success, stop on futility, stop on funding). Test supermartingale: E[Eₙ | ℱₙ₋₁] ≤ 1 under H₀; Doob + Markov ⇒ P(sup Eₙ ≥ 1/α) ≤ α. A single inequality covers all three properties.]

The argument

The 2023 paper is a survey, not a discovery. Ramdas, Grünwald, Vovk, and Shafer take twenty years of scattered work on game-theoretic statistics (Shafer & Vovk's 2001 monograph, Vovk & Wang's e-value calibration, Grünwald's safe testing, Howard et al.'s confidence sequences, Ramdas et al.'s anytime-valid concentration inequalities) and consolidate it under a single name: safe anytime-valid inference, SAVI.

The unification matters because it makes the framework usable. Before this paper, anyone wanting to escape the p-value had to assemble parts from a dozen sources, reconcile incompatible notations, and figure out which results applied to their setting. After it, there is a textbook treatment with one vocabulary, one set of theorems, and worked examples for the standard tests: t-test, ANOVA, binomial, contingency tables, Cox regression. The mathematics had been ready for a decade. The packaging was the bottleneck.

The framework rests on a single inequality. If the wealth process is a nonnegative (super)martingale under the null, starting at 1, then by Ville's inequality (a martingale generalization of Markov's) the probability that the wealth ever exceeds 1/α is at most α. From that one fact, three properties fall out at once. Continuous monitoring is direct: the bound holds simultaneously over all steps. Optional stopping is the same statement evaluated at any stopping time. Sequential composition takes one extra step: when each round's bet is a valid e-value conditional on the past, the running product is itself a test supermartingale, even with conditionally dependent increments. The survey gives a recipe for each standard test. The freedom is real, but local to those constructions, not a universal "any dependence goes" license.
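
The chain of reasoning in this paragraph can be written out in a few lines. This is a sketch in assumed notation, not the survey's own: E_t stands for the round-t betting factor (a conditional e-value) and W_n for the wealth after n rounds, so the supremum bound applies to W_n.

```latex
% Assumed notation: E_t = round-t betting factor, W_n = wealth after n rounds,
% \mathcal{F}_{t-1} = information available before round t. Requires amsmath, amssymb.
\begin{align*}
&\text{Per-round condition under } H_0: && E_t \ge 0, \quad \mathbb{E}[E_t \mid \mathcal{F}_{t-1}] \le 1, \\
&\text{Wealth:} && W_0 = 1, \quad W_n = \prod_{t=1}^{n} E_t, \\
&\text{Supermartingale step:} && \mathbb{E}[W_n \mid \mathcal{F}_{n-1}] = W_{n-1}\, \mathbb{E}[E_n \mid \mathcal{F}_{n-1}] \le W_{n-1}, \\
&\text{Ville's inequality:} && \mathbb{P}_{H_0}\Big( \sup_{n \ge 0} W_n \ge 1/\alpha \Big) \le \alpha, \\
&\text{At any stopping time } \tau: && \mathbb{E}_{H_0}[W_\tau] \le 1 .
\end{align*}
```

The third line is where dependence is handled: W_{n-1} is known given the past, so it factors out of the conditional expectation, and nothing requires the factors E_t to be independent of one another.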

What changed

Old (Fisher)                                New (SAVI)
P-value: P(T ≥ t | H₀)                      E-value: E[E | H₀] ≤ 1
Fixed sample size, no peeking               Anytime-valid; peek and stop freely
Combine via Bonferroni or alpha-spending    Multiply along a sequential test (no separate correction)
Independence between observations assumed   Conditionally dependent increments allowed by construction
Power calculation before the experiment     Wealth target sets the stopping rule
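
The contrast in the "no peeking" and "wealth target" rows can be checked with a small simulation. The following is a minimal sketch under stated assumptions, not a construction taken from the survey: the null is a fair coin, the bettor stakes a fixed fraction `lam` on each side and averages the two wealths, and the function names `monitor` and `false_positive_rates` are hypothetical. It counts how often each rule ever rejects a true null when the data are checked after every flip.

```python
import math
import random
from statistics import NormalDist


def monitor(flips, alpha=0.05, lam=0.5, min_look=10):
    """Watch one sequence of 0/1 coin flips after every observation.

    Returns (naive_hit, wealth_hit): whether a repeated two-sided z-test
    ever rejected, and whether the betting wealth ever reached 1/alpha.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    heads = 0
    up = down = 1.0  # wealth from betting on heads / on tails
    naive_hit = wealth_hit = False
    for n, x in enumerate(flips, start=1):
        heads += x
        step = lam * (2 * x - 1)  # +lam on heads, -lam on tails
        up *= 1 + step    # each factor has expectation 1 when P(heads) = 1/2
        down *= 1 - step
        wealth = 0.5 * (up + down)  # an average of two martingales is a martingale
        if wealth >= 1 / alpha:
            wealth_hit = True
        # Fixed-alpha z-test for p = 1/2, recomputed at every look (the "peeking" rule).
        if n >= min_look and abs(2 * heads - n) / math.sqrt(n) > z_crit:
            naive_hit = True
    return naive_hit, wealth_hit


def false_positive_rates(n_paths=2000, n_max=500, alpha=0.05, seed=0):
    """Fraction of fair-coin paths on which each rule ever rejects H0."""
    rng = random.Random(seed)
    naive = wealth = 0
    for _ in range(n_paths):
        flips = [int(rng.random() < 0.5) for _ in range(n_max)]
        n_hit, w_hit = monitor(flips, alpha=alpha)
        naive += n_hit
        wealth += w_hit
    return naive / n_paths, wealth / n_paths


if __name__ == "__main__":
    naive_rate, wealth_rate = false_positive_rates()
    print(f"repeated z-test, ever rejects: {naive_rate:.3f}")
    print(f"wealth >= 1/alpha, ever:       {wealth_rate:.3f}")
```

Under these assumptions the repeated z-test should reject far more often than α, while the wealth rule should stay near or below α up to Monte Carlo error. The fixed stake `lam` is chosen for simplicity; it is not tuned for power.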

Connections

Shafer's 2021 paper laid out the betting interpretation as a philosophical and practical reframing. The 2023 survey is the technical follow-through: every standard testing scenario, recast in the betting framework, with explicit constructions of the test supermartingales. Where Shafer 2021 said "here is the better way to think about evidence," Ramdas et al. say "here is the working manual."

Fisher's p-value and the Neyman-Pearson decision framework were built for fixed-horizon experiments where the analyst commits to a sample size before seeing the data. That assumption was always a constraint on how the math could be used, not a description of how scientists actually work. Ioannidis showed what happens when the constraint is widely ignored: false positives at scale. SAVI does not claim Fisher was wrong; it proposes a different evidence machinery for the regime where peeking, sequential analysis, and combination across studies are the norm rather than the exception. Fisher remains correct inside his assumptions. SAVI extends the toolkit when those assumptions don't hold.

Mayo's severity framework and SAVI converge from different directions. Severity asks counterfactually whether a test could have caught a false claim. SAVI delivers a related operational guarantee: the wealth process is not expected to grow under the null, so accumulated wealth is a measure of how thoroughly the null has been probed. Mayo argues for severity from the philosophy of science. Ramdas et al. derive a complementary guarantee from the calculus of martingales. Different machinery, overlapping epistemic role.

Gwern's long content and SAVI are the two halves of the post-Ioannidis recovery. Gwern fixes the human side: publish the trail so the method cannot be gamed. SAVI fixes the tool side: build a tool that does not penalize the things humans naturally want to do. Without Gwern's discipline, SAVI tools can still be selectively reported. Without SAVI's tools, Gwern's discipline still struggles against the perverse incentives of fixed-sample-size testing. Both responses are necessary because the replication crisis is both a behavioral and a mathematical problem.

Adoption

The slow part is now ahead. The tools exist; the question is whether journals, regulators, and reviewers will accept them. Online experimentation moved first: Optimizely's Stats Engine, Statsig, and Eppo all ship sequential or always-valid tests, because their users were peeking at dashboards anyway and the alternatives leaked false positives. Pharma followed, with group-sequential designs and confidence sequences used in adaptive trials; the FDA's 2019 adaptive-design guidance treats adaptive designs as a recognized regulatory pathway. Academic statistics is the slowest mover, which is the usual order: industry adopts first, journals last. The mathematics is settled. The sociology is still in motion.

The paper is on arXiv:2210.01948. Aaditya Ramdas's CMU page hosts an extensive bibliography of follow-up work and tutorials. The Game-Theoretic Probability and Finance project at probabilityandfinance.com archives the founding work by Shafer & Vovk.