Why Most Published Research Findings Are False

John P. A. Ioannidis · 2005 · PLoS Medicine 2(8): e124

Most published research findings are false. The replication crisis, diagnosed before it had a name.

The base rate problem

Suppose one in ten hypotheses tested is actually true. A study with 80% power and a 5% significance threshold will produce results like this:

|                  | Study says positive | Study says negative |
|------------------|---------------------|---------------------|
| Hypothesis true  | true positives      | false negatives     |
| Hypothesis false | false positives     | true negatives      |

Lower the prior and the false discovery rate climbs.

The arithmetic is not subtle. With a 10% prior, more than a third of published positives are false. Drop the prior to 1% and the false discovery rate climbs past 80%. The significance threshold was never designed to handle low base rates. Fisher knew this. The field forgot.
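The numbers are easy to check. Here is a minimal sketch in Python; the function name and defaults are mine, not Ioannidis's notation:

```python
# False discovery rate among "significant" results, given the prior
# probability that a tested hypothesis is true, the study's power,
# and the significance threshold.
def false_discovery_rate(prior: float, power: float = 0.80, alpha: float = 0.05) -> float:
    true_positives = prior * power          # true hypotheses the study detects
    false_positives = (1 - prior) * alpha   # false hypotheses that slip past the threshold
    return false_positives / (true_positives + false_positives)

for prior in (0.10, 0.01):
    print(f"prior {prior:.0%}: FDR = {false_discovery_rate(prior):.0%}")
# prior 10%: FDR = 36%
# prior 1%:  FDR = 86%
```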

The argument

Ioannidis builds from a Bayesian framework. Before a study runs, the hypothesis has some prior probability of being true. The study applies a significance threshold (usually p < 0.05) and has some statistical power (often far below the nominal 80%). The positive predictive value of a "significant" result depends on how these three numbers combine.

When prior probability is low, the math tilts against truth. A 5% false positive rate sounds small until you realize it applies to the 90% of hypotheses that are wrong. The 80% true positive rate applies to only the 10% that are right. False positives outnumber true positives.
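In symbols: with prior probability π that the hypothesis is true, power 1 − β, and threshold α, the positive predictive value of a significant result is

$$\mathrm{PPV} = \frac{(1-\beta)\,\pi}{(1-\beta)\,\pi + \alpha\,(1-\pi)}$$

Ioannidis states the same quantity in terms of pre-study odds R = π/(1 − π): PPV = (1 − β)R / ((1 − β)R + α). Plug in π = 0.10, 1 − β = 0.80, α = 0.05 and PPV = 0.08 / 0.125 = 64%, the 36% false discovery rate from the table above.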

That alone would be manageable. Ioannidis identifies five factors that make it worse.

Five accelerants

| Factor | Why it inflates false positives |
|--------|---------------------------------|
| Low prior probability | Most hypotheses tested are wrong. The base rate dominates. |
| Small effect sizes | Small effects require large samples. Underpowered studies miss true effects and amplify noise. |
| Flexible designs | More outcome variables, more subgroups, more analytic choices. Each is another chance to find a "significant" result by accident. |
| Financial and career interest | Incentives to find positive results. Publication bias: journals want novelty, not null results. |
| Many teams testing the same hypothesis | If 20 teams test the same false hypothesis, one will get p < 0.05 by chance. That team publishes. The other 19 don't. |

Each factor operates independently. Stack them and the positive predictive value of a published finding can drop below 50%. In fields where all five are present, the majority of published claims are false.

Figure: 20 teams test the same false hypothesis. Nineteen null results go to the file drawer; the one that crossed p < .05 by chance gets published.
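The file-drawer arithmetic is one line. With a 5% threshold, the chance that at least one of 20 independent tests of a false hypothesis comes up significant is 1 − 0.95^20 ≈ 64%. A quick check, with the team count and threshold taken from the text:

```python
# Chance that at least one of `teams` independent tests of a FALSE
# hypothesis crosses the significance threshold purely by luck.
alpha, teams = 0.05, 20
p_any = 1 - (1 - alpha) ** teams
print(f"P(at least one false positive) = {p_any:.0%}")  # 64%
```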

The telltale bump

If researchers test true hypotheses honestly, p-values cluster near zero. If they test false hypotheses honestly, p-values spread uniformly. What actually shows up in the published literature is neither: a suspicious bump just below 0.05.

Figure: p-value distribution in published results (schematic). Frequency against p-value from 0 to .05; the curve sits at the expected flat level, then bumps upward just below α = .05.

Head et al. (2015) found this bump across thousands of papers in multiple fields. It's the fingerprint of p-hacking: researchers run analyses until they cross the threshold, then stop. The method selects for results that barely pass, which is exactly the population most likely to be false.
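A toy simulation shows where the bump comes from. The optional-stopping rule below (start with 30 observations, keep adding 10 until p < .05 or a budget of 100 runs out) is one hypothetical p-hacking mechanism, not the procedure Head et al. analyzed:

```python
# Three p-value populations: honest tests of a real effect, honest
# tests of a null effect, and p-hacked nulls with optional stopping.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def honest_p(effect, n=30):
    """One-sample t-test p-value for n draws from N(effect, 1)."""
    return stats.ttest_1samp(rng.normal(effect, 1, n), 0).pvalue

def hacked_p(start=30, step=10, budget=100):
    """Test a null effect, adding observations until p < .05 or the budget runs out."""
    data = list(rng.normal(0, 1, start))
    while True:
        p = stats.ttest_1samp(data, 0).pvalue
        if p < 0.05 or len(data) >= budget:
            return p
        data.extend(rng.normal(0, 1, step))

populations = {
    "true effects":   [honest_p(0.5) for _ in range(2000)],
    "honest nulls":   [honest_p(0.0) for _ in range(2000)],
    "p-hacked nulls": [hacked_p() for _ in range(2000)],
}
for name, ps in populations.items():
    ps = np.asarray(ps)
    density = np.mean((0.04 < ps) & (ps < 0.05)) / 0.01  # relative to a uniform distribution
    print(f"{name:14s} density just below .05: {density:.1f}x uniform")
```

The honest nulls land at roughly 1x uniform in that bin; the p-hacked nulls pile up above it. That pile is the bump.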

The replication crisis

Ioannidis published in 2005. The replication crisis arrived on schedule.

In 2015, the Reproducibility Project attempted to replicate 100 psychology studies. Ninety-seven of the originals reported significant results. Thirty-six of the replications did. The effect sizes in the replications were half the originals on average. Some landmark findings vanished entirely.

Cancer biology fared no better. Amgen tried to replicate 53 "landmark" studies. Six replicated (11%). Bayer reported similar numbers: roughly 25% replication across their therapeutic areas.

In economics, the Federal Reserve Bank of St. Louis replicated 67 studies. About 50% held up, better than psychology but still alarming for a field that informs policy.

These are not cherry-picked failures. They are systematic attempts to reproduce the published record. The published record is unreliable. Ioannidis saw it coming a decade before the data confirmed it.

Connections

The p-value was never meant to carry this weight. Fisher introduced it as a rough guide for when to look more closely, not as a decision boundary for truth. Somewhere between 1935 and 2005, the scientific community turned a heuristic into a ritual. Ioannidis is documenting what happens when the ritual replaces the reasoning.

🔬 Feynman warned in 1974 that scientists were performing the motions of science without the substance. He called it cargo cult science: build the runway, light the fires, wait for the planes that never come. Ioannidis is showing that the cargo cult operates at industrial scale. Thousands of papers per year, most of them reporting planes that never landed.

🔬 Mayo provides the diagnostic. Her severity principle asks: did the test have a high probability of producing a different result if the hypothesis were false? If not, a positive result is uninformative. Most of the studies Ioannidis flags would fail the severity test. The design was too flexible, the power too low, the prior too thin. The machinery produced a number below 0.05, but the test was not severe.

🔬 Popper said science advances by trying to falsify. Ioannidis is showing that scientists routinely fail to falsify. The incentive structure rewards confirmation. Publication bias filters for it. Career advancement depends on it. The community selects for the behavior Popper said disqualifies a field from being science at all.

The incentive problem

Publish or perish is a selection mechanism. Researchers who produce positive results get published. Those who publish get funded. Those who get funded survive. The system selects for false positives the same way evolution selects for fitness: not by intent, but by differential survival.

Pre-registration helps. Open data helps. Replication requirements help. But the fundamental incentive remains: your career depends on producing results that journals will print. Journals print novelty and significance. Null results and replications sit in file drawers. Until the incentive structure changes, the replication crisis is a chronic condition, not an acute one.
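You can watch the selection operate in a few lines. Under illustrative assumptions (a true effect of 0.2 standard deviations, samples of 30, journals printing only positive significant results), the filter alone roughly doubles the average published effect, consistent with replications finding effects half the size of the originals:

```python
# Differential survival in miniature: many honest studies of the same
# modest true effect, but only significant positive results get "published".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n = 0.2, 30  # illustrative: small effect, underpowered study

all_estimates, published = [], []
for _ in range(10_000):
    sample = rng.normal(true_effect, 1, n)
    p = stats.ttest_1samp(sample, 0).pvalue
    all_estimates.append(sample.mean())
    if p < 0.05 and sample.mean() > 0:  # the journal filter
        published.append(sample.mean())

print(f"true effect:           {true_effect:.2f}")
print(f"mean of all studies:   {np.mean(all_estimates):.2f}")  # unbiased, about 0.20
print(f"mean published effect: {np.mean(published):.2f}")      # inflated, roughly double
```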

The paper is open access. Read it at PLoS Medicine. It is nine pages long and requires no specialized training. There is no excuse for not reading it.