
The Design of Experiments

R. A. Fisher · 1935 · Oliver & Boyd · Internet Archive

Randomization, significance tests, analysis of variance. Fisher built the statistical machinery that modern experimental science runs on. The null hypothesis, the p-value, factorial design: all codified here.

[Diagram: the randomized controlled trial. A population of N subjects is split by random assignment into a treatment group (intervention) and a control group (no intervention); outcomes are measured in both groups and compared to yield a p-value. Random assignment makes the groups comparable on all variables, known and unknown.]

Key concepts

Null hypothesis: The default claim that there is no effect; the thing you try to reject.
p-value: The probability of data at least as extreme as observed, if the null hypothesis were true.
Randomization: Assign subjects to conditions by chance, so confounders balance out in expectation.
Factorial design: Vary multiple factors simultaneously to detect interactions.
Replication: Repeat measurements to estimate variability and improve precision.
ANOVA: Partition total variance into components (treatment, error) to test group differences.
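ANOVA's central identity can be sketched in a few lines. This is a minimal illustration with invented yield numbers, not Fisher's notation: total variation splits exactly into a between-group part and a within-group part, and the F statistic compares the two.

```python
# Sketch: ANOVA's variance partition (invented data, two groups).
groups = {
    "treatment": [23.1, 25.4, 24.8, 26.0],
    "control":   [20.2, 19.8, 21.5, 20.9],
}

all_values = [x for xs in groups.values() for x in xs]
grand_mean = sum(all_values) / len(all_values)

# Total sum of squares: variation of every observation around the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Between-group: how far each group mean sits from the grand mean.
ss_between = sum(
    len(xs) * ((sum(xs) / len(xs)) - grand_mean) ** 2
    for xs in groups.values()
)

# Within-group ("error"): variation of observations around their own group mean.
ss_within = sum(
    (x - sum(xs) / len(xs)) ** 2
    for xs in groups.values()
    for x in xs
)

# The partition SS_total = SS_between + SS_within holds identically.
assert abs(ss_total - (ss_between + ss_within)) < 1e-9

# F statistic: between-group variance relative to within-group variance.
k, n = len(groups), len(all_values)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F means the group means differ by more than the within-group noise would predict.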

The argument

Fisher's starting point was agricultural field trials at Rothamsted. Different plots of land vary in soil quality, drainage, sunlight. If you assign all treatment plots to one side of the field and all control plots to the other, any measured difference might be the treatment or might be the soil. Randomization solves this by making it equally likely that each plot ends up in either group. Confounding variables do not disappear, but they distribute evenly in expectation.
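The assignment step itself is almost trivially simple, which is part of the point. A minimal sketch with hypothetical plot IDs: chance, not the experimenter, decides which plots are treated.

```python
# Sketch: random assignment of field plots (hypothetical IDs and group sizes).
import random

plots = list(range(20))        # 20 plots of land
random.shuffle(plots)          # chance decides the ordering
treatment = set(plots[:10])    # first half receives the treatment
control = set(plots[10:])      # second half is left untreated

# Soil quality, drainage, and sunlight are not eliminated; each plot is
# simply equally likely to land in either group, so confounders balance
# in expectation across many such randomizations.
```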

From randomization Fisher built significance testing. Under the null hypothesis, the treatment has no effect, so any measured difference is due to chance variation in assignment. The p-value measures how surprising the observed difference would be under this null. A small p-value means the data are unlikely if nothing is happening, which is grounds (Fisher argued) for rejecting the null.
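A randomization (permutation) test makes this logic concrete: under the null the treatment labels are arbitrary, so we enumerate every possible assignment and ask how often chance alone produces a difference as large as the one observed. The data below are invented for illustration.

```python
# Sketch: an exact randomization test on made-up yields.
import itertools

treatment = [24.0, 25.1, 26.3, 23.8]
control = [20.5, 21.2, 19.9, 22.0]
observed = sum(treatment) / len(treatment) - sum(control) / len(control)

pooled = treatment + control
count = 0   # assignments at least as extreme as the observed one
total = 0   # all possible assignments
for idx in itertools.combinations(range(len(pooled)), len(treatment)):
    t = [pooled[i] for i in idx]
    c = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diff = sum(t) / len(t) - sum(c) / len(c)
    if diff >= observed:
        count += 1
    total += 1

p_value = count / total  # the p-value is just a fraction of assignments
```

With 4 treated units out of 8 there are 70 possible assignments; here only the observed one is this extreme, so the p-value is 1/70.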

Fisher also formalized factorial design: instead of testing one variable at a time (does fertilizer work? does irrigation work?), vary both simultaneously. This is more efficient and reveals interactions that single-factor experiments cannot detect. A fertilizer that works on dry soil but not irrigated soil is invisible to one-factor-at-a-time testing.
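A tiny 2×2 example, with invented yields, shows what an interaction looks like and why averaging over one factor can hide it.

```python
# Sketch: a 2x2 factorial (fertilizer x irrigation) with an interaction.
# Mean yields are invented to mirror the example in the text.
yields = {
    # (fertilizer, irrigated): mean yield
    (False, False): 10.0,
    (True,  False): 15.0,  # fertilizer helps on dry soil...
    (False, True):  14.0,
    (True,  True):  14.0,  # ...but does nothing on irrigated soil
}

# Main effect of fertilizer, averaged over irrigation levels.
fert_effect = ((yields[(True, False)] - yields[(False, False)]) +
               (yields[(True, True)] - yields[(False, True)])) / 2

# Interaction: how much the fertilizer effect changes with irrigation.
interaction = ((yields[(True, True)] - yields[(False, True)]) -
               (yields[(True, False)] - yields[(False, False)]))
```

The averaged main effect (2.5) understates the dry-soil benefit (5.0), and the nonzero interaction is exactly the signal a one-factor-at-a-time design cannot see.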

The lady tasting tea opens the book and illustrates the whole framework in miniature. A colleague claimed she could tell whether milk or tea was poured first. Fisher designed a protocol: eight cups, four of each type, presented in random order. Under the null hypothesis (she is guessing), the probability of getting all eight correct is 1/70. The example contains randomization, a null hypothesis, and a significance calculation in three pages.
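The 1/70 figure is pure combinatorics, and can be checked in two lines:

```python
# The lady tasting tea: 8 cups, 4 of each kind, identified under the null.
import math

arrangements = math.comb(8, 4)    # ways to choose which 4 cups are "milk first"
p_all_correct = 1 / arrangements  # only one arrangement is exactly right
# arrangements is 70, so getting all eight correct by guessing has p = 1/70
```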

Discussion

Fisher formalized what Mill called the method of difference: if two situations differ in one factor and produce different results, that factor is the cause. Randomization is what makes the "differ in one factor" condition achievable in practice. Without it, Mill's method is an aspiration. With it, the method becomes a protocol.

The Neyman-Pearson framework reframed significance testing as decision theory. Where Fisher treated the p-value as a measure of evidence against the null, Neyman and Pearson introduced Type I errors (false positives) and Type II errors (false negatives) and treated testing as choosing between two actions. Fisher rejected this framing for the rest of his life. He saw the p-value as a property of the data, not a decision rule. The two frameworks coexist in modern practice, often inconsistently. Most textbooks teach a hybrid that neither Fisher nor Neyman would endorse.

Deborah Mayo has spent decades reconciling them. Her "severe testing" framework says a result is evidence for a hypothesis only if the test had a high probability of not producing that result if the hypothesis were false. This recovers Fisher's intuition (the p-value matters) within a Neyman-Pearson structure (error probabilities matter). It also connects back to Popper: a severe test is one that could have falsified the hypothesis but didn't.

Failure and integrity

Fisher was a eugenicist. Not casually or incidentally, but as a matter of sustained intellectual commitment. He served on the editorial board of the Annals of Eugenics. He argued against the scientific consensus on the harms of tobacco, partly to defend individual liberty but also because his statistical judgment was colored by personal conviction. He opposed the UNESCO statement on race in 1950.

This is the hardest integrity question in the collection. Fisher's statistical methods are sound. Randomization works. ANOVA partitions variance correctly. The p-value does what it says. The tools are morally neutral. The scientist who built them was not.

It would be simpler if his eugenics reflected statistical errors, but it does not. Fisher applied competent statistics to questions framed by ideology. The framing itself was the failure: which questions to ask, which populations to study, which outcomes to measure. Sound method applied to a monstrous program produces monstrous conclusions with an unearned veneer of rigor. The tools did not cause the harm, but they lent it authority.

The lesson is that statistical competence does not protect against moral error. Chamberlin's method of multiple working hypotheses might have helped at the level of scientific questions, but the problem with eugenics was not a shortage of alternative hypotheses. The problem was that the questions themselves were corrupted by the assumption that human worth admits of quantitative comparison.

Neighbors
  • Chamberlin 1890 — hold multiple hypotheses; Fisher provided the machinery to test them
  • Popper 1934 — falsifiability as demarcation; Fisher's p-value as a falsification tool
  • 🎰 Probability Ch.3 — combinatorics and the null distribution: Fisher's exact test counts permutations to compute p-values
  • Mayo 2018 — severity testing as a rigorous reinterpretation of Fisher's p-value framework
  • Ioannidis 2005 — why Fisher's framework breaks down under selection pressure and low base rates

External

The book remains in print. Read it on the Internet Archive. The lady tasting tea is Chapter II.