
Evidence Has a Trajectory

Chapter 8 · de Finetti 1937, Ville 1939, Wald 1947, Robbins 1967, Shafer 2021, Ramdas 2023

You run an A/B test on two code variants. B converts 12% better than A, p < .01, sample size 50,000. You ship B. Six months later, B is worse than A ever was. The snapshot lied. This chapter introduces the e-value, a measure of evidence that preserves the temporal structure the p-value threw away.

Chapters 4–7 built the abductive primitive. You can generate hypotheses from a surprising observation. You can select which experiment to run next. But after the experiment runs and data arrives, how do you know the hypothesis is right?

The standard answer is the p-value: collect a fixed sample, compute a test statistic, compare to a threshold. If p < 0.05, declare significance. The p-value compresses the entire temporal structure of evidence into one number. That compression is mandatory. Peek at the data mid-experiment and the false positive guarantee breaks. The math demands you look once.

This chapter traces a 90-year lineage of work that removes that demand. The result is a measure of evidence you can compute at every observation, stop whenever you want, and whose error guarantee holds at every moment. The evidence becomes a trajectory. And the trajectory carries temporal patterns that a scalar discards.


Three snapshots that lied

Before the theory, three cases where compressing evidence to a snapshot produced catastrophic decisions.

Google: the feedback loop

Google A/B-tested ad load: more ads per page, more revenue per session, ship it. The test assumed stationarity (the treatment effect stays constant) and independence (each user's behavior is unaffected by others'). Both assumptions reduce to one: no feedback. The treatment enters, the outcome exits, nothing loops back.

Long-term measurement revealed the loop. Users learned to ignore ads. Click-through rates degraded over months. Ad blindness spread as a learned behavior. Google cut mobile ad load by 50%. The snapshot said "more ads = more revenue." The trajectory said "more ads = more revenue now, less revenue later, and the crossover happens outside your test window."

Boeing: oscillation killed 346 people

The Maneuvering Characteristics Augmentation System (MCAS) read a single sensor snapshot: nose too high. Correct response: push it down. Correct logic, wrong sensor. The angle-of-attack sensor was faulty. MCAS pushed the nose down. Pilots trimmed up. MCAS pushed again. Two subsystems fighting. The oscillation diverged. Lion Air Flight 610. Ethiopian Airlines Flight 302. 346 dead.

Nobody was watching the trajectory. Each MCAS activation was locally correct: the snapshot said "nose too high, push down." The oscillation between MCAS and pilot inputs was the signal. A system that tracked the temporal pattern of its own corrections would have seen the fight. The snapshot saw only the current angle.

Vioxx: delayed cardiovascular risk

Rofecoxib (Vioxx) passed its 9-month trial: fewer gastrointestinal bleeds than naproxen, ship it. The cardiovascular risk emerged at 18 months, past the trial window. Cumulative thrombotic damage, invisible in the snapshot. 88,000–140,000 excess coronary events before the drug was pulled.

The trial tested for GI bleeds over 9 months. It found what it tested for. The cardiovascular trajectory was in the data (thrombotic risk accumulated monotonically), but the test window ended before the risk crossed the detection threshold.


Same structure, same failure

Each case has the same skeleton:

  1. A system with feedback (users adapt, pilots fight the computer, cumulative organ damage).
  2. A test that assumes no feedback (stationarity, independence, fixed time window).
  3. A scalar summary (p-value, revenue delta, hazard ratio) computed at a single moment.
  4. A decision shipped on the scalar.
  5. A trajectory that would have revealed the danger, had anyone preserved it.

Meehl's structural flaw at industrial scale. In 1967 Meehl showed that directional predictions in soft psychology corroborate nothing because the crud factor guarantees significance. The same structure (directional prediction, snapshot window, no trajectory) now ships products and drugs. A/B testing culture inherited the pattern wholesale: directional predictions, fixed sample sizes, ship on significance.


The betting line

The alternative to snapshot testing has a 90-year lineage. Six contributions, each building on the last. The thread ran parallel to Fisher's fixed-horizon view for decades before converging.

Year Who Contribution
1937 de Finetti Probability as coherent betting. If your beliefs can be Dutch-booked, they aren't probabilities. Established that evidence is a bet, not a frequency.
1939 Ville Proved that von Mises' definition of randomness fails. Introduced martingales as a test of randomness: if you can make money betting against a sequence, it isn't random.
1947 Wald Sequential analysis. You don't need a fixed sample size. Collect data one observation at a time, decide to stop or continue after each one. The first formal framework for optional stopping.
1953 Doob The optional stopping theorem. A supermartingale cannot grow in expectation, even if you choose when to stop. This is the mathematical backbone: honest betting under the null can't make you rich on average, regardless of your stopping rule.
1967 Darling & Robbins Proved that confidence sequences (confidence intervals valid at every sample size) are achievable. Peek at every observation and the coverage guarantee holds.
2019–2023 Ramdas, Shafer, Grünwald Unified the lineage under the name e-values. Proved composability, anytime validity, and connections to game-theoretic probability. Made it usable.

De Finetti: evidence is a bet. Ville: test a hypothesis by trying to make money against it. Wald: you don't need a fixed sample size. Doob: the guarantee holds no matter when you stop. Robbins: confidence intervals hold at every peek. Ramdas, Shafer, and Grünwald unified it and named the central object.


E-values: definition from scratch

A p-value is the probability of seeing data at least as extreme as what you observed, assuming the null hypothesis is true. If p is small, the data is surprising under the null. Smaller p = more evidence against the null.

An e-value is the payoff of a bet against the null hypothesis. If the null is true, the expected payoff is at most 1:

E[e] ≤ 1    under the null

That's the entire definition. An e-value is a non-negative random variable whose expectation under the null is at most 1. Large values (e >> 1) mean the null is losing badly. An e-value of 20 means the data is 20 times more consistent with the alternative than the null.

The constraint E[e] ≤ 1 is the supermartingale property: the expected future value of the process is at most its current value. If you bet against the null and the null is true, your wealth cannot grow on average. If your wealth grows anyway, the null is losing.
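The defining constraint can be checked directly. A minimal Monte Carlo sketch (illustrative, separate from the chapter's later example) using the Gaussian likelihood ratio: under the null its expectation is exactly 1, so averaging many null draws should land near 1.

```python
import math
import random

def gaussian_lr(x, mu_alt=0.5):
    """Likelihood ratio of N(mu_alt, 1) vs N(0, 1) for one observation."""
    return math.exp(mu_alt * x - mu_alt ** 2 / 2)

random.seed(0)
# Average e-value over many draws from the null N(0, 1): should hover near 1.
draws = [gaussian_lr(random.gauss(0, 1)) for _ in range(200_000)]
avg = sum(draws) / len(draws)
print(f"mean e-value under H0: {avg:.3f}")  # close to 1
```

A bet whose null expectation exceeded 1 could grow without any real evidence; this constraint is what every later guarantee rests on.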

Why "at most 1" and not "equals 1"

A martingale has E[e] = 1 exactly. A supermartingale relaxes this to E[e] ≤ 1. The relaxation matters because it makes e-values composable. If you run two sequential experiments producing e-values e1 and e2, their product e1 × e2 is also an e-value. The supermartingale property guarantees it: two honest bets composed are still an honest bet. This composition does not require any correction for multiple testing. It is exact.

P-values do not compose this way. Multiply two p-values and the result is not a valid p-value. Corrections (Bonferroni, Holm, BH) restore validity but throw away statistical power. E-values compose for free.
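Both claims are easy to check by simulation. A sketch, reusing the Gaussian likelihood-ratio e-value from above: the product of two independent e-values keeps its null expectation at 1, while the product of two uniform p-values falls below 0.05 roughly four times as often as a valid p-value would.

```python
import math
import random

random.seed(1)
N = 100_000
e_prod_mean = 0.0
p_prod_hits = 0

for _ in range(N):
    # Two independent "experiments", each under the null.
    # E-values: Gaussian likelihood ratios (mu_alt = 0.5, one draw each).
    e1 = math.exp(0.5 * random.gauss(0, 1) - 0.125)
    e2 = math.exp(0.5 * random.gauss(0, 1) - 0.125)
    e_prod_mean += e1 * e2 / N
    # P-values under the null are Uniform(0, 1).
    p1, p2 = random.random(), random.random()
    if p1 * p2 < 0.05:
        p_prod_hits += 1

print(f"mean of e1*e2 under H0: {e_prod_mean:.3f}")      # stays near 1
print(f"P(p1*p2 < 0.05 | H0):   {p_prod_hits / N:.3f}")  # far above 0.05
```

The e-value product is still an honest bet; the p-value product would reject the null about 20% of the time at a nominal 5% level.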

Anytime validity

The deepest property: you can compute an e-value at every observation and the guarantee holds at each one. No peeking penalty. No alpha spending. Look whenever you want, stop whenever you want, false positive rate controlled at every moment.

This is Doob's optional stopping theorem in action. A supermartingale evaluated at any stopping time is still a supermartingale. The math does not care when you look or why you stopped.

Contrast with a p-value. Compute p at n=100: 0.06, not significant. Peek again at n=200: 0.04, significant. Your false positive rate is no longer 5%. Peeking twice inflated it. The p-value guarantee requires exactly one look. The e-value guarantee holds at every look.
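A sketch of the two-look scenario, with hypothetical decision rules (reject at p < 0.05, or at e ≥ 20, at either look): under the null, the p-value rule's realized false positive rate exceeds its nominal 5%, while the e-value rule stays below it.

```python
import math
import random

def p_one_sided(xs):
    """One-sided z-test p-value for H0: mean = 0."""
    n = len(xs)
    z = sum(xs) / n * math.sqrt(n)
    return 0.5 * math.erfc(z / math.sqrt(2))

def e_lr(xs, mu_alt=0.5):
    """Cumulative likelihood-ratio e-value, N(mu_alt, 1) vs N(0, 1)."""
    return math.exp(sum(mu_alt * x - mu_alt ** 2 / 2 for x in xs))

random.seed(2)
N = 20_000
p_reject = e_reject = 0
for _ in range(N):
    xs = [random.gauss(0, 1) for _ in range(200)]  # null is true
    # Peek at n=100 and again at n=200; reject if either look "wins".
    if p_one_sided(xs[:100]) < 0.05 or p_one_sided(xs) < 0.05:
        p_reject += 1
    if e_lr(xs[:100]) >= 20 or e_lr(xs) >= 20:
        e_reject += 1

print(f"p-value rule, two peeks: {p_reject / N:.3f}")  # inflated above 0.05
print(f"e-value rule, two peeks: {e_reject / N:.3f}")  # still below 0.05
```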


Trajectories expose feedback loops

A p-value compresses evidence into a scalar. An e-value stream preserves evidence over time. The system's dynamics live in the temporal structure. Compress to a scalar and they vanish.

Point an e-value stream at a system with feedback. The trajectory preserves temporal patterns that a terminal scalar discards.

Smoking and stress. The e-value climbs after each cigarette (relief), falls during withdrawal (stress spikes), climbs on relapse. The oscillation is the addiction cycle. A two-week snapshot says "smoking reduces stress, p < .05." The trajectory shows the loop.

Ad load. An e-value stream tracking revenue per session climbs for weeks (more ads = more revenue), flattens (ad blindness developing), then declines (degraded engagement). The inflection point marks the moment the feedback loop activates. A fixed-sample test that ends before the inflection sees only the climb.

MCAS. An e-value tracking "does the correction succeed?" oscillates. Push nose down (evidence for success), pilot trims up (evidence resets), MCAS fires again. The oscillation is two subsystems fighting. A snapshot at any single activation says "correction successful."


Evidence compounds across experiments

E-values compose multiplicatively across sequential experiments.

Framework    Method    Sample size  Composition
Frequentist  Fixed     Fixed        Requires correction (Bonferroni, etc.)
Bayesian     Fixed     Variable     Requires shared prior
E-value      Variable  Variable     Multiply. No correction needed.

Run ten small experiments, each isolating one variable. Experiment A gives e1 = 3 (suggestive). Experiment B gives e2 = 4 (suggestive). Neither is strong alone. But e1 × e2 = 12, strong combined evidence, valid with no correction.

Change your test, your sampling strategy, your hypothesis mid-stream. The evidence still composes. The supermartingale guarantee holds regardless. Fixed-method testing cannot accumulate evidence across changing experiments. E-values can.
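A sketch of cross-experiment composition under assumed designs (the sample sizes and the mu_alt values are hypothetical): two experiments tested against different alternatives still combine by plain multiplication.

```python
import math
import random

def e_lr(xs, mu_alt):
    """Likelihood-ratio e-value for H0: N(0, 1) vs H1: N(mu_alt, 1)."""
    return math.exp(sum(mu_alt * x - mu_alt ** 2 / 2 for x in xs))

random.seed(3)
# Experiment A: 40 observations, tested against mu_alt = 0.5.
# Experiment B: a different design, 80 observations, tested against mu_alt = 0.2.
exp_a = [random.gauss(0.3, 1) for _ in range(40)]
exp_b = [random.gauss(0.3, 1) for _ in range(80)]

e1 = e_lr(exp_a, mu_alt=0.5)
e2 = e_lr(exp_b, mu_alt=0.2)

# The product is itself an e-value: no Bonferroni, no shared prior.
print(f"e1 = {e1:.2f}, e2 = {e2:.2f}, combined = {e1 * e2:.2f}")
```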


Code: e-value trajectory for a sequential test

A simple sequential test. Observations arrive one at a time. Under the null, N(0, 1). Under the alternative, N(μ, 1) for some μ > 0. At each step, compute the likelihood ratio (the simplest e-value) and track the trajectory.

import math
import random

def e_value_trajectory(observations, mu_alt=0.5):
    """Compute cumulative e-value trajectory via likelihood ratio.

    E-value at step t = product of likelihood ratios up to t.
    Under H0: X_i ~ N(0,1). Under H1: X_i ~ N(mu_alt, 1).
    Likelihood ratio for one observation:
        L(x) = exp(mu_alt * x - mu_alt^2 / 2)
    Cumulative e-value: product of L(x_i) for i = 1..t.
    """
    trajectory = []
    log_e = 0.0

    for x in observations:
        # Log-likelihood ratio for one observation
        log_lr = mu_alt * x - (mu_alt ** 2) / 2
        log_e += log_lr
        trajectory.append(math.exp(log_e))

    return trajectory

def p_value_at_t(observations):
    """Two-sided z-test p-value computed at the end of the sample."""
    n = len(observations)
    if n == 0:
        return 1.0
    mean = sum(observations) / n
    z = mean * math.sqrt(n)
    # One-sided upper-tail p-value via the exact normal CDF (erfc)
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return p


# --- Scenario 1: Null is true (no effect) ---
random.seed(42)
null_obs = [random.gauss(0, 1) for _ in range(100)]
null_traj = e_value_trajectory(null_obs)

print("=== Null is true (no real effect) ===")
print(f"E-value at t=10:  {null_traj[9]:.4f}")
print(f"E-value at t=50:  {null_traj[49]:.4f}")
print(f"E-value at t=100: {null_traj[99]:.4f}")
print(f"P-value at t=100: {p_value_at_t(null_obs):.4f}")
print()

# --- Scenario 2: Alternative is true (real effect, mu=0.3) ---
alt_obs = [random.gauss(0.3, 1) for _ in range(100)]
alt_traj = e_value_trajectory(alt_obs, mu_alt=0.5)

print("=== Alternative is true (mu=0.3) ===")
print(f"E-value at t=10:  {alt_traj[9]:.4f}")
print(f"E-value at t=50:  {alt_traj[49]:.4f}")
print(f"E-value at t=100: {alt_traj[99]:.4f}")
print(f"P-value at t=100: {p_value_at_t(alt_obs):.4f}")
print()

# --- Scenario 3: System with feedback (effect reverses at t=50) ---
feedback_obs = (
    [random.gauss(0.5, 1) for _ in range(50)] +  # positive effect
    [random.gauss(-0.3, 1) for _ in range(50)]   # effect reverses
)
feedback_traj = e_value_trajectory(feedback_obs, mu_alt=0.5)

print("=== Feedback system (effect reverses at t=50) ===")
print(f"E-value at t=25:  {feedback_traj[24]:.4f}  (rising)")
print(f"E-value at t=50:  {feedback_traj[49]:.4f}  (peak)")
print(f"E-value at t=75:  {feedback_traj[74]:.4f}  (falling)")
print(f"E-value at t=100: {feedback_traj[99]:.4f}  (collapsed)")
print(f"P-value at t=50:  {p_value_at_t(feedback_obs[:50]):.4f}  (snapshot at peak)")
print(f"P-value at t=100: {p_value_at_t(feedback_obs):.4f}  (snapshot at end)")

Output:

=== Null is true (no real effect) ===
E-value at t=10:  0.1432
E-value at t=50:  0.0005
E-value at t=100: 0.0001
P-value at t=100: 0.2802

=== Alternative is true (mu=0.3) ===
E-value at t=10:  2.3119
E-value at t=50:  68.6328
E-value at t=100: 73236.5661
P-value at t=100: 0.0000

=== Feedback system (effect reverses at t=50) ===
E-value at t=25:  122.3931  (rising)
E-value at t=50:  1161.3707  (peak)
E-value at t=75:  1.5770  (falling)
E-value at t=100: 0.0027  (collapsed)
P-value at t=50:  0.0001  (snapshot at peak)
P-value at t=100: 0.0942  (snapshot at end)

Three scenarios. Three stories.

Null is true: the e-value drops below 1 immediately and keeps falling. By t=100, e = 0.0001. The bet against the null is thoroughly lost. Under the null, honest betting cannot grow in expectation. Individual paths fluctuate, but the trend is downward.

Alternative is true: the e-value climbs steadily. By t=100, e = 73,237. You could have stopped at t=50 (e = 69, already strong) or continued. The guarantee held at every peek. Monotone climb: accumulating.

Feedback system: the Google ad-load scenario. The effect is real for the first 50 observations, then reverses. The e-value climbs to 1,161 at t=50, then collapses as the reversal erodes the accumulated evidence. By t=100, e = 0.003. Meanwhile: p at t=50 = 0.0001 (highly significant), p at t=100 = 0.09 (borderline). Two snapshots, two conclusions, no way to reconcile them. The e-value trajectory tells the whole story: rise, peak, collapse. The shape of the trajectory is the shape of the system.


The p-value's guarantee demands one look

A p-value controls the type I error rate at a single, predetermined stopping time. Fix n in advance, compute once: P(p < α | H0) = α. Peek at t=50 and again at t=100, and you get two chances to cross the threshold. The error rate inflates.

An e-value controls the error rate uniformly over all stopping times. By Markov's inequality and the supermartingale property, for any stopping time τ:

P(e_τ ≥ 1/α) ≤ α    under the null

This holds for any stopping time: fixed, random, data-dependent, adversarial. The proof is three lines from Doob's optional stopping theorem. The p-value guarantee requires the stopping time to be independent of the data. The e-value guarantee does not.
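The uniform guarantee can be stress-tested with a deliberately adversarial stopping rule: stop the moment the e-process crosses 1/α, the rule most likely to inflate false positives. A minimal sketch under the null (the horizon and mu_alt are arbitrary choices):

```python
import math
import random

random.seed(4)
ALPHA = 0.05
THRESH = 1 / ALPHA  # reject when the e-process reaches 20
N, HORIZON, MU = 20_000, 200, 0.5

rejections = 0
for _ in range(N):
    log_e = 0.0
    for _t in range(HORIZON):
        x = random.gauss(0, 1)  # null is true
        log_e += MU * x - MU ** 2 / 2
        if log_e >= math.log(THRESH):  # adversarial stop: quit while ahead
            rejections += 1
            break

rate = rejections / N
print(f"false positive rate, data-dependent stopping: {rate:.4f}")  # <= 0.05
```

Even with the stopping rule chosen to maximize false alarms, the realized rate stays at or below α, exactly as Markov's inequality plus the supermartingale property promises.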

The tradeoff: at any single sample size, a p-value test is more powerful. If you commit to looking exactly once at exactly n observations, you should use it. The e-value pays a constant-factor efficiency loss at each time point. What it buys is the entire trajectory.


Connection to Part II

The economy of research (Chapter 7) selects which experiment to run. The e-value tracks what the experiment says over time. The two connect directly:

  1. Generate hypotheses via abduction (Chapters 4–6).
  2. Select the next experiment via information gain per unit cost (Chapter 7).
  3. Run the experiment and track the e-value trajectory (this chapter).
  4. Read the trajectory's shape to classify the system's dynamics (Chapter 9).

Step 3 is new. Before e-values, step 2 fed directly into a terminal decision: run, compute p, decide. Now a trajectory sits between the experiment and the decision, carrying information the terminal scalar discards.


Trajectory shape is ambiguous without classification

You have the trajectory. It oscillates. Is the system fighting itself (MCAS vs. pilots), or is your measurement just noisy?

Structural oscillation (cyclic dynamics in the system) and statistical oscillation (weak effect plus finite-sample noise) look the same in the raw trajectory. The trajectory alone cannot distinguish them.

Classifying whether evidence is converging, diverging, oscillating, or chaotic requires tools from dynamical systems theory: convergence rates, Lyapunov exponents, stability analysis. These tools apply directly to evidence trajectories. Chapter 9 builds the classification scheme.


Sources

de Finetti 1937 "La prévision: ses lois logiques, ses sources subjectives." Probability as coherent betting. Established the foundational framing.
Ville 1939 Étude critique de la notion de collectif. Introduced martingales as tests of randomness. The origin of the betting interpretation of evidence.
Wald 1947 Sequential Analysis. Fixed-sample testing is a special case. Optional stopping is the general framework.
Doob 1953 Stochastic Processes. The optional stopping theorem: supermartingales can't grow in expectation at stopping times.
Darling & Robbins 1967 "Confidence Sequences for Mean, Variance, and Median." Anytime-valid confidence intervals are achievable.
Shafer 2021 Testing by Betting. Game-theoretic probability as the foundation for hypothesis testing.
Ramdas et al. 2023 "Game-Theoretic Statistics and Safe Anytime-Valid Inference." Unified treatment of e-values, anytime validity, and composability.
Grünwald et al. 2024 "Safe Testing." E-values as the basis for hypothesis tests that are valid under optional stopping and optional continuation.
Meehl 1967 "Theory-Testing in Psychology and Physics: A Methodological Paradox." The structural flaw in directional-prediction snapshot testing.