Statistical Inference as Severe Testing
Deborah G. Mayo · 2018 · Cambridge University Press
A test confirms a hypothesis only if it had a real chance of catching that hypothesis being wrong. The severity principle.
Severe vs. non-severe tests
Two experiments both produce data consistent with hypothesis H. Only one of them tells you anything.
The argument
Mayo unites two traditions that had been at war for decades. Popper said science works by trying to falsify hypotheses. Fisher and Neyman-Pearson built the statistical machinery for testing them. The two sides never agreed on what a "significant result" actually means. Mayo resolves the dispute by showing that both contribute to the same underlying logic: severity.
A result is evidence for H only if the test had a high probability of producing a different result when H is false. That is the severity principle. It sounds obvious once stated. The consequences are not.
Consider a drug trial. The drug shows a small positive effect, p = 0.04. By the standard ritual, this is "significant." But severity asks: if the drug did not work, how likely was this test to say so? If the trial was underpowered, the answer is: not very. A weak test that happens to produce p < 0.05 has low severity. The positive result is not informative. The drug passed a test that could not have caught it failing.
Flip it around. A well-powered trial produces p = 0.06. By the ritual, this is "not significant." But severity asks a different question: how hard did this test probe for an effect? If the trial had 95% power to detect a clinically meaningful effect and the best it found was p = 0.06, then the claim that there is no meaningful effect has been severely tested and not refuted. The data are evidence that any effect is small. The ritual says "not significant." Severity says "informative."
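A minimal numerical sketch of both scenarios, assuming a one-sided z-test with known standard deviation; the sample sizes, means, and σ = 1 are invented to roughly match the p-values in the text:

```python
from math import sqrt
from scipy.stats import norm

# One-sided z-test of H0: mu <= 0 vs H1: mu > 0, observed mean xbar,
# known sigma, sample size n. Numbers are invented for illustration.

def sev_effect_exceeds(xbar, mu1, sigma, n):
    """After a rejection: probability the test would have given a result
    as small as xbar (or smaller) if the true effect were only mu1.
    High value = the claim 'effect > mu1' passed a severe test."""
    return norm.cdf((xbar - mu1) / (sigma / sqrt(n)))

def sev_effect_at_most(xbar, mu1, sigma, n):
    """After a non-rejection: probability the test would have given a result
    larger than xbar if the true effect were as big as mu1.
    High value = the claim 'effect <= mu1' passed a severe test."""
    return 1 - norm.cdf((xbar - mu1) / (sigma / sqrt(n)))

# Underpowered trial: n = 20, xbar = 0.39 gives p ~ 0.04, yet the claim of a
# meaningful effect (mu > 0.3) passes with severity only ~0.66.
print(sev_effect_exceeds(xbar=0.39, mu1=0.3, sigma=1.0, n=20))

# Well-powered trial: n = 400 (~95% power to detect mu = 0.16), xbar = 0.078
# gives p ~ 0.06, yet 'effect <= 0.16' passes with severity ~0.95.
print(sev_effect_at_most(xbar=0.078, mu1=0.16, sigma=1.0, n=400))
```

The point is the asymmetry: the "significant" result from the weak test earns little, while the "non-significant" result from the strong test rules out anything but a small effect.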
Key concepts
| Concept | What it means |
|---|---|
| Severity principle | Data support H only when the test had a high probability of turning up evidence against H if H were false (formalized below). |
| Error statistics | The framework: evaluate inference methods by their error probabilities, not by posterior beliefs. |
| Piecemeal testing | Don't test an entire theory at once. Isolate specific claims and severely test each one. Build from parts. |
| Probativeness | The capacity of a test to discriminate between H and alternatives. A test is probative when passing it counts for something. |
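The drug-trial calculations above are instances of a general schema. For a one-sided test of H0: μ ≤ μ0 against H1: μ > μ0, with test statistic d(X) and observed data x0, the severity assessment can be written as follows (a paraphrase of Mayo and Spanos's standard presentation, not a quotation from the book):

```latex
% Test T+: H0: mu <= mu0 versus H1: mu > mu0, test statistic d(X), data x0.
% If H0 is rejected, the claim  mu > mu1  passes with severity
\mathrm{SEV}(\mu > \mu_1) \;=\; \Pr\bigl(d(X) \le d(x_0)\,;\, \mu = \mu_1\bigr)
% If H0 is not rejected, the claim  mu <= mu1  passes with severity
\mathrm{SEV}(\mu \le \mu_1) \;=\; \Pr\bigl(d(X) > d(x_0)\,;\, \mu = \mu_1\bigr)
```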
Connections
Popper gave falsification its philosophical force. A theory is scientific only if it can be falsified. But Popper never provided a statistical account of how much evidence counts as falsification. He dealt in logical possibility, not probability. Mayo gives falsification statistical teeth. Severity quantifies how hard you tried to falsify and how decisively the attempt failed. Popper says "try to refute." Mayo says "here is how to measure whether you tried hard enough."
Fisher introduced the p-value as a measure of surprise under the null hypothesis. Neyman and Pearson reframed testing as a decision procedure with Type I and Type II error rates. The two approaches were never reconciled in their lifetimes. Fisher called Neyman's framework "childish." Neyman called Fisher's p-values "worse than useless." Mayo shows that both contribute to severity. Fisher's p-value measures how surprising the data are under the null. Neyman-Pearson's power measures how likely the test was to catch a false null. Severity needs both: a low p-value from a high-power test is severe. A low p-value from a low-power test is not.
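For concreteness, power here is simply the probability that the test rejects when a given alternative is true. A quick sketch, reusing the invented trial sizes from the example above:

```python
from scipy.stats import norm

def power_one_sided_z(mu_true, sigma, n, alpha=0.05):
    """Probability that a one-sided z-test of H0: mu <= 0 rejects at level
    alpha when the true effect is mu_true."""
    se = sigma / n ** 0.5
    cutoff = norm.ppf(1 - alpha) * se   # smallest sample mean that rejects
    return 1 - norm.cdf((cutoff - mu_true) / se)

print(power_one_sided_z(mu_true=0.3, sigma=1.0, n=20))   # ~0.38: weak test
print(power_one_sided_z(mu_true=0.3, sigma=1.0, n=400))  # ~1.00: strong test
```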
Ioannidis argued that most published research findings are false. Severity is the diagnostic for his problem. The studies Ioannidis flags as unreliable are precisely those with low severity: small samples, flexible designs, underpowered tests aimed at long-shot hypotheses. They produced p < 0.05, but they could not have caught the hypothesis being wrong. Severity does not merely diagnose the replication crisis. It predicts which findings will fail to replicate and which will hold.
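The arithmetic behind that connection is short. Ioannidis's positive predictive value combines power, the significance level, and the prior odds that a tested hypothesis is true; the sketch below omits the bias term from his paper, and the two scenarios are invented:

```python
def ppv(power, alpha, prior_odds):
    """Share of p < alpha findings that are true, per Ioannidis (2005),
    ignoring his bias term: PPV = power*R / (power*R + alpha)."""
    return power * prior_odds / (power * prior_odds + alpha)

# Exploratory field: long-shot hypotheses, underpowered studies.
print(ppv(power=0.30, alpha=0.05, prior_odds=1 / 20))  # ~0.23: most positives false
# Confirmatory trial: plausible hypothesis, well-powered design.
print(ppv(power=0.90, alpha=0.05, prior_odds=1.0))     # ~0.95: positives mostly real
```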
Feynman said the first principle is that you must not fool yourself. Severity operationalizes that principle. A non-severe test is exactly the mechanism by which you fool yourself: you designed a procedure that confirms your hypothesis regardless of whether it is true. The test looks like science. It has a p-value, a sample size, a confidence interval. But it could not have caught you being wrong. Severity turns Feynman's informal principle into something you can calculate.
The capstone
This book sits at the end of the scientific method collection for a reason. Everything before it is building toward this.
Bacon gave us observation. Descartes gave us systematic doubt. Hume showed that observation alone proves nothing. Mill provided methods for extracting causes. Popper said the only honest move is to try to refute your own theory. Kuhn showed that scientists usually don't. Feynman said they should. Ioannidis showed that they don't, at massive scale, with measurable consequences.
Mayo provides the answer: severity. Your test counts only if it could have caught you being wrong. Apply this principle and you get Popper's falsificationism with statistical rigor. Ignore it and you get the replication crisis. The principle is simple. Following it is hard. Following it is the whole point.
Integrity as severity
Scientific integrity, in Mayo's framework, has a precise definition. An honest scientist designs tests that are severe with respect to their own hypotheses. They choose experiments that could refute their claims. They report the sensitivity of their tests alongside the results. They ask "what would it take for this test to show I am wrong?" before they ask "what did the test show?"
This is more demanding than it sounds. A researcher who runs many analyses and reports the one that works has low severity. A researcher who pre-registers a single analysis and sticks to it has higher severity. The difference is not honesty in the colloquial sense. It is structural: the second researcher designed a procedure that could catch a mistake. The first designed one that could not.
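A back-of-the-envelope calculation shows how fast severity drains away, treating the candidate analyses as independent looks at pure noise (real analyses of one dataset are correlated, so this is only illustrative):

```python
# Chance of at least one p < 0.05 under a true null when a researcher
# gets k shots at the data and reports the one that "works".
for k in (1, 5, 10, 20):
    print(k, round(1 - 0.95 ** k, 2))
# 1 -> 0.05   5 -> 0.23   10 -> 0.4   20 -> 0.64
```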
Neighbors
- Popper 1934 – falsification without statistical machinery; Mayo supplies it
- Fisher 1935 – the p-value and significance testing that severity subsumes
- Ioannidis 2005 – the crisis severity was built to diagnose
- Feynman 1974 – "you must not fool yourself," now with a formal definition
- Chamberlin 1890 – multiple hypotheses as a prerequisite for severe testing
- Platt 1964 – strong inference as a special case of severity
External
- errorstatistics.com – Mayo's blog, the best free resource on severity
- Severity in philosophy of science (Wikipedia)
- Neyman–Pearson lemma (Wikipedia)