Why Most Published Research Findings Are False
John P. A. Ioannidis · 2005 · PLoS Medicine 2(8): e124
Most published research findings are false. The replication crisis, diagnosed before it had a name.
The base rate problem
Suppose one in ten hypotheses tested is actually true. A study with 80% power and a 5% significance threshold will produce results like this (per 1,000 hypotheses):

| | Significant | Not significant |
|---|---|---|
| True hypotheses (100) | 80 true positives | 20 misses |
| False hypotheses (900) | 45 false positives | 855 correct nulls |
The arithmetic is not subtle. With a 10% prior, more than a third of published positives are false. Drop the prior to 1% and the false discovery rate climbs past 80%. The significance threshold was never designed to handle low base rates. Fisher knew this. The field forgot.
The argument
Ioannidis builds from a Bayesian framework. Before a study runs, the hypothesis has some prior probability of being true. The study applies a significance threshold (usually p < 0.05) and has some statistical power (often far below the nominal 80%). The positive predictive value of a "significant" result depends on how these three numbers combine.
When prior probability is low, the math tilts against truth. A 5% false positive rate sounds small until you realize it applies to the 90% of hypotheses that are wrong. The 80% true positive rate applies to only the 10% that are right. False positives outnumber true positives.
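The combination has a closed form: Bayes' theorem applied to the counts above. A minimal sketch (the function and variable names are mine, not Ioannidis's notation):

```python
def ppv(prior, power, alpha):
    """P(hypothesis true | significant result), by Bayes' theorem."""
    true_pos = prior * power          # true hypothesis, detected
    false_pos = (1 - prior) * alpha   # false hypothesis, lucky p-value
    return true_pos / (true_pos + false_pos)

print(round(ppv(0.10, 0.80, 0.05), 2))  # 0.64: the textbook best case
print(round(ppv(0.10, 0.20, 0.05), 2))  # 0.31: same prior, realistic power
```

The second call shows why underpowered studies matter: cutting power from 80% to a more typical 20% drops the predictive value of a significant result below one in three.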
That alone would be manageable. Ioannidis identifies five factors that make it worse.
Five accelerants
| Factor | Why it inflates false positives |
|---|---|
| Low prior probability | Most hypotheses tested are wrong. The base rate dominates. |
| Small effect sizes | Small effects require large samples. Underpowered studies miss true effects and amplify noise. |
| Flexible designs | More outcome variables, more subgroups, more analytic choices. Each is another chance to find a "significant" result by accident. |
| Financial and career interests | Incentives to find positive results. Publication bias: journals want novelty, not null results. |
| Many teams testing the same hypothesis | If 20 teams test the same false hypothesis, one will get p < 0.05 by chance. That team publishes. The other 19 don't. |
Each factor operates independently. Stack them and the positive predictive value of a published finding can drop below 50%. In fields where all five are present, the majority of published claims are false.
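The multiple-teams factor alone is enough to sink the predictive value. A sketch, assuming 20 independent teams and publication of whichever team crosses the threshold; this is the at-least-one-positive form of the Bayes calculation, algebraically equivalent to the multiple-teams correction in Ioannidis's paper:

```python
def ppv_many_teams(prior, power, alpha, teams):
    """PPV when a positive from any of `teams` independent groups gets published."""
    # P(at least one team detects a true effect), and
    # P(at least one team hits p < alpha on a false hypothesis)
    hit_true = 1 - (1 - power) ** teams
    hit_false = 1 - (1 - alpha) ** teams
    tp = prior * hit_true
    fp = (1 - prior) * hit_false
    return tp / (tp + fp)

print(round(ppv_many_teams(0.10, 0.80, 0.05, 1), 2))   # 0.64 with one team
print(round(ppv_many_teams(0.10, 0.80, 0.05, 20), 2))  # ~0.15 with twenty
```

With 20 teams, the chance that at least one gets p < 0.05 on a false hypothesis is 1 − 0.95²⁰ ≈ 64%, so published "winners" are mostly noise.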
The telltale bump
If researchers test true hypotheses honestly, p-values cluster near zero. If they test false hypotheses honestly, p-values spread uniformly. What actually shows up in the published literature is neither: a suspicious bump just below 0.05.
Head et al. (2015) found this bump across thousands of papers in multiple fields. It's the fingerprint of p-hacking: researchers who run analyses until they cross the threshold, then stop. The method selects for results that barely pass, which is exactly the population most likely to be false.
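The two honest distributions are easy to reproduce in simulation. A sketch using simple one-sample z-tests; the effect size and sample size are illustrative choices, not figures from Head et al.:

```python
import math
import random

random.seed(0)

def p_value(effect, n):
    """Two-sided p for a one-sample z-test on n draws from N(effect, 1)."""
    mean = sum(random.gauss(effect, 1) for _ in range(n)) / n
    z = abs(mean) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# 5,000 honest studies of false hypotheses vs. true ones (d = 0.5, n = 30)
null_ps = [p_value(0.0, 30) for _ in range(5000)]
true_ps = [p_value(0.5, 30) for _ in range(5000)]

def share(ps, lo, hi):
    """Fraction of p-values falling in [lo, hi)."""
    return sum(lo <= p < hi for p in ps) / len(ps)

print(share(null_ps, 0.00, 0.01), share(null_ps, 0.04, 0.05))  # flat: ~0.01 each
print(share(true_ps, 0.00, 0.01), share(true_ps, 0.04, 0.05))  # piled up near zero
```

Honest nulls put about as much mass just below 0.05 as anywhere else; honest true effects pile up near zero. A published literature with a bump at 0.04–0.05 matches neither.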
The replication crisis
Ioannidis published in 2005. The replication crisis arrived on schedule.
In 2015, the Reproducibility Project attempted to replicate 100 psychology studies. Ninety-seven of the originals reported significant results. Thirty-six of the replications did. The effect sizes in the replications were half the originals on average. Some landmark findings vanished entirely.
Cancer biology fared no better. Amgen tried to replicate 53 "landmark" studies. Six replicated. Bayer reported similar numbers: roughly 25% replication across their therapeutic areas.
In economics, two Federal Reserve economists attempted to replicate 67 studies. Roughly half held up: better than psychology, but still alarming for a field that informs policy.
These are not cherry-picked failures. They are systematic attempts to reproduce the published record. The published record is unreliable. Ioannidis saw it coming a decade before the data confirmed it.
Connections
The p-value was never meant to carry this weight. Fisher introduced it as a rough guide for when to look more closely, not as a decision boundary for truth. Somewhere between 1935 and 2005, the scientific community turned a heuristic into a ritual. Ioannidis is documenting what happens when the ritual replaces the reasoning.
🔬 Feynman warned in 1974 that scientists were performing the motions of science without the substance. He called it cargo cult science: build the runway, light the fires, wait for the planes that never come. Ioannidis is showing that the cargo cult operates at industrial scale. Thousands of papers per year, most of them reporting planes that never landed.
🔬 Mayo provides the diagnostic. Her severity principle asks: did the test have a high probability of producing a different result if the hypothesis were false? If not, a positive result is uninformative. Most of the studies Ioannidis flags would fail the severity test. The design was too flexible, the power too low, the prior too thin. The machinery produced a number below 0.05, but the test was not severe.
🔬 Popper said science advances by trying to falsify. Ioannidis is showing that scientists routinely fail to falsify. The incentive structure rewards confirmation. Publication bias filters for it. Career advancement depends on it. The community selects for the behavior Popper said disqualifies a field from being science at all.
The incentive problem
Publish or perish is a selection mechanism. Researchers who produce positive results get published. Those who publish get funded. Those who get funded survive. The system selects for false positives the same way evolution selects for fitness: not by intent, but by differential survival.
Pre-registration helps. Open data helps. Replication requirements help. But the fundamental incentive remains: your career depends on producing results that journals will print. Journals print novelty and significance. Null results and replications sit in file drawers. Until the incentive structure changes, the replication crisis is a chronic condition, not an acute one.
Neighbors
- 🔬 Fisher 1935 – the machinery Ioannidis shows being misused
- 🔬 Feynman 1974 – the integrity principle being violated at scale
- 🔬 Popper 1934 – falsification as the standard scientists are failing to meet
- 🔬 Mayo 2018 – the proposed fix: severity as the formal diagnostic
- 🔬 Chamberlin 1890 – multiple hypotheses as the antidote to confirmation bias
- 📰 Probability Ch.7 – Bayes' theorem and base rates: Ioannidis's core argument is a Bayesian calculation about prior probability
External