Convergence

Chapter 11 · Kelly 1996, Grünwald 2024, Ramdas 2023, Chernoff 1959

Five sensor streams, each individually undetectable. Composed e-values classify the forcing pattern at F1 = 0.996. The math composes. But a human can’t run this loop at scale.

Chapters 8–10 built a pipeline: track evidence as a trajectory (ch 8), classify the trajectory’s shape into four bins (ch 9), and use each classification’s failure mode to generate edges in the hypothesis graph (ch 10). Each piece works locally: one experiment, one trajectory, one kill condition. Does the graph as a whole converge? When you compose evidence across many experiments, each chosen adaptively based on the last, does the accumulated evidence settle on the truth?

Two theorems, from two different fields, say yes. Neither cites the other.


E-value composition: the product rule

Chapter 8 introduced e-values as an alternative to p-values: a measure of evidence you can peek at any time without inflating error. E-values compose.

Run five experiments sequentially, choosing each based on the results of the previous ones. Each experiment produces an e-value Ei. The product E1 × E2 × … × E5 is itself a valid e-value. No correction for multiple testing. No penalty for adaptive selection. The product is valid as long as each e-value is constructed conditional on all previous data and decisions.

This is Grünwald, Henzi, and Ly (2024). If each Ei is an e-variable for the null hypothesis conditional on all previous data and decisions, then ∏Ei is an e-variable for the same null. The proof uses the tower property of conditional expectations. The result holds under arbitrary stopping, arbitrary experiment selection, and arbitrary dependence between experiments, as long as each e-value is valid given what came before.
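
The argument is one application of the tower property. Write Fn−1 for everything observed before experiment n. Validity of each factor conditional on the past means E[En | Fn−1] ≤ 1 under the null, so peeling the last factor off the product:

  E[E1 × ⋯ × En] = E[(E1 × ⋯ × En−1) × E[En | Fn−1]] ≤ E[E1 × ⋯ × En−1] ≤ ⋯ ≤ 1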

Ramdas, Grünwald, Vovk, and Shafer (2023) extended this to continuous time and showed that the composed e-value process is a test supermartingale. You can run an open-ended investigation, composing evidence as you go, with a valid error guarantee at every step.

“E-values can be multiplied across different data sources, experiments, or analyses, no matter how they were selected.”

— Grünwald, Henzi, and Ly, “E-values: Calibration, combination, and applications” (2024)

The product weights each experiment by its informativeness. An experiment producing Ei = 12 contributes more than one producing Ei = 1.1. This is Fisher information weighting in disguise: the likelihood ratio underlying each e-value gives informative experiments more say. The composition theorem says the weighting is valid.


Five streams, one signal

The e-value trajectory classifier tested this directly. Five heterogeneous sensor streams — normal, Poisson, exponential, Bernoulli, lognormal — each measuring a different quantity, each contaminated with a weak forcing signal. The signal shifts each stream’s parameter by a small amount, inducing one of four trajectory shapes: convergent, divergent, oscillatory, or aperiodic.

Individually, each stream is too noisy to detect the forcing. F1 score per stream: 0.279. Barely above chance for a four-class problem.

The composed e-value trajectory (the product across all five streams at each time step) classified the forcing pattern at F1 = 0.996. The composition reveals shared forcing that individual streams miss.

Method | F1 score | Why
Individual stream | 0.279 | Signal below noise floor in each stream
Standardized sum | 0.478 | Averaging washes out heterogeneous signal strengths
Composed e-values | 0.996 | Product weights by informativeness; valid under adaptation

The standardized sum fails because it gives each stream equal weight. A Bernoulli stream with subtle frequency shifts and an exponential stream with rate changes carry different amounts of Fisher information. Averaging treats them the same. The e-value product does not: each likelihood ratio scales by the stream’s informativeness.


Code: composing e-values across heterogeneous streams

The core operation is multiplication. Each stream produces a sequence of e-values. The composed e-value at each time step is the product across streams.

import numpy as np

def e_value_normal(x, mu_null, mu_alt, sigma):
    """E-value for a single observation from a normal distribution.

    Likelihood ratio: P(x | mu_alt) / P(x | mu_null).
    """
    return np.exp(
        -0.5 * ((x - mu_alt) / sigma)**2
        + 0.5 * ((x - mu_null) / sigma)**2
    )

def e_value_poisson(x, lam_null, lam_alt):
    """E-value for a single observation from a Poisson distribution."""
    return (lam_alt / lam_null)**x * np.exp(lam_null - lam_alt)

def compose_e_values(streams):
    """Compose e-values across heterogeneous streams by multiplication.

    streams: list of arrays, each shape (T,) — e-values per time step.
    Returns: array shape (T,) — composed e-value trajectory.
    """
    composed = np.ones_like(streams[0])
    for stream in streams:
        composed *= stream
    return composed


# --- Example: two streams, shared forcing signal ---

np.random.seed(42)
T = 200
forcing = 0.2  # weak shared signal

# Stream 1: normal, mu shifts by forcing
x_normal = np.random.normal(loc=forcing, scale=1.0, size=T)
e_normal = np.array([e_value_normal(x, 0.0, forcing, 1.0) for x in x_normal])

# Stream 2: Poisson, rate shifts by forcing
x_poisson = np.random.poisson(lam=5.0 + forcing, size=T)
e_poisson = np.array([e_value_poisson(x, 5.0, 5.0 + forcing) for x in x_poisson])

# Cumulative products (running e-values)
cum_normal = np.cumprod(e_normal)
cum_poisson = np.cumprod(e_poisson)
cum_composed = np.cumprod(compose_e_values([e_normal, e_poisson]))

print(f"After {T} observations:")
print(f"  Normal stream alone:   {cum_normal[-1]:.2f}")
print(f"  Poisson stream alone:  {cum_poisson[-1]:.2f}")
print(f"  Composed (product):    {cum_composed[-1]:.2f}")
print(f"\nThreshold for rejection at alpha=0.05: 20.0")
print(f"  Normal crosses?  {np.any(cum_normal >= 20)}")
print(f"  Poisson crosses? {np.any(cum_poisson >= 20)}")
print(f"  Composed crosses? {np.any(cum_composed >= 20)}")

Neither stream alone crosses the rejection threshold. The composed trajectory does.
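
One practical note on the example above, an implementation detail rather than anything from the theorems: over long runs the raw product can overflow or underflow floating point, so accumulate log e-values and compare against log(1/alpha) instead. Continuing the script:

log_cum_composed = np.cumsum(np.log(e_normal) + np.log(e_poisson))
print(np.any(log_cum_composed >= np.log(20.0)))  # same test, stable in log space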

Driving the hypothesis graph to closure

Composition validates evidence across streams. But the hypothesis graph is a loop: measure, classify, open or close edges, repeat.

def investigate(system, budget):
    """Run the hypothesis graph to convergence."""
    graph = HypothesisGraph()
    graph.add_edge("initial", classify(measure(system)))

    for _ in range(budget):
        if not graph.open_edges():
            break  # converged: no open questions

        # Economy (ch 7): pick the cheapest decisive experiment
        edge = min(graph.open_edges(), key=lambda e: e.cost / e.info_gain)

        # Measure and classify (ch 8-9)
        result = measure(system, edge.perturbation)
        shape = classify(result)

        # Kill condition (ch 10): failure names the next edge
        if shape == "convergent":
            graph.close(edge)
        elif shape == "divergent":
            graph.close(edge)
            graph.add_edge(edge.target, "follow dependency")
        elif shape == "oscillatory":
            graph.close(edge)
            graph.add_edge(edge.target, "test interface A")
            graph.add_edge(edge.target, "test interface B")
        else:
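            # aperiodic: not yet decisive; leave the edge open and gather more data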
            graph.add_edge(edge.target, "collect more data")

    return graph  # converged or budget exhausted

Each decisive iteration closes one edge and may open zero, one, or two new ones. Convergent results close without opening. Divergent results open one. Oscillatory results open two (a split). Aperiodic results are the exception: the edge stays open and the loop queues more data. The graph converges when closures outpace openings. Kelly’s theorem says this happens for finite-structure systems under Ockham-efficient strategies.
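
The loop assumes a HypothesisGraph and an Edge type this chapter never defines. A minimal sketch with just the three operations investigate() uses; the cost, info_gain, and perturbation fields are placeholders for the economics of chapter 7:

from dataclasses import dataclass

@dataclass
class Edge:
    """One frontier edge: an open question plus rough economics."""
    target: str
    question: str
    perturbation: str = ""
    cost: float = 1.0       # placeholder estimate (ch 7)
    info_gain: float = 1.0  # placeholder estimate (ch 7)
    closed: bool = False

class HypothesisGraph:
    """Just enough structure for investigate() above."""
    def __init__(self):
        self.edges = []

    def add_edge(self, target, question):
        self.edges.append(Edge(target=target, question=question))

    def open_edges(self):
        return [e for e in self.edges if not e.closed]

    def close(self, edge):
        edge.closed = True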


Ockham’s razor is a convergence theorem

Composition guarantees that evidence accumulates validly. It does not guarantee that inquiry converges: that you eventually reach the truth and stay there.

Kelly (1996), The Logic of Reliable Inquiry, recast scientific method as a question in topology: given a class of possible worlds and a stream of evidence, which inquiry methods converge to the truth?

The setup is austere. An inquiry method is a function from evidence sequences to hypotheses. At each step, the method sees one more datum and outputs its current best guess. The method converges if, after finitely many observations, it settles on the correct hypothesis and never changes its mind again. Kelly asks: among all methods that converge on a given problem, which one converges fastest?
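
A toy instance makes the setup concrete: the classic effect-counting problem, where worlds differ in how many “effects” eventually appear in a binary stream (the example is an illustration, not code from the book). The Ockham method outputs the count seen so far and retracts exactly once per genuinely new effect:

def ockham_method(evidence):
    """Simplest hypothesis compatible with the data: exactly as many
    effects as have appeared so far."""
    return sum(evidence)  # evidence is a 0/1 stream of observed effects

def count_mind_changes(method, stream):
    """Count how often a method retracts its answer along one stream."""
    guesses = [method(stream[:t + 1]) for t in range(len(stream))]
    return sum(a != b for a, b in zip(guesses, guesses[1:]))

stream = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
print(count_mind_changes(ockham_method, stream))  # 3: one per new effect

# A method that jumps ahead (outputting, say, 5 before the data force it)
# must either stay wrong forever or retract back down, incurring
# mind-changes the Ockham method never makes.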

The Ockham efficiency theorem

Kelly’s central result: Ockham’s razor is necessary and sufficient for cycle-optimal convergence.

A mind-change (or retraction) is when the method switches its output from one hypothesis to another. A cycle is a sequence of mind-changes that returns to a hypothesis previously held and retracted. Kelly proved:

  1. For any solvable problem, there exists a method that minimizes the worst-case number of mind-changes on the way to truth.
  2. That method is the Ockham method: at every step, output the simplest hypothesis compatible with all evidence seen so far.
  3. Any method that violates Ockham’s razor — that ever outputs a complex hypothesis when a simpler one is compatible with the data — incurs strictly more mind-changes in the worst case.

“Simplest” here is not aesthetic. It is topological: a hypothesis is simpler if the set of evidence sequences compatible with it has lower topological complexity (fewer alternations between open and closed sets in the evidence topology). For example, “the curve is a line” is simpler than “the curve is a quadratic”: data can force you from the line to the quadratic, but never back. The razor is not a preference; it is a convergence guarantee.

“Departing from Ockham’s razor always results in extra retractions or cycles that could have been avoided without sacrificing convergence to the truth.”

— Kevin Kelly, The Logic of Reliable Inquiry (1996)

William of Ockham said “do not multiply entities beyond necessity” as methodological advice. Kelly proved it is the only policy that minimizes wasted effort on the way to truth. Every alternative overshoots: considers hypotheses it didn’t need to, retracts positions it shouldn’t have held, cycles through states the Ockham method skips.

Concept | Kelly’s formalization | In the hypothesis graph
Mind-change | Method switches from hypothesis A to hypothesis B | Reclassifying a node from one bin to another
Cycle | Method returns to a hypothesis it previously retracted | Revisiting and reclassifying a node already tested
Convergence | Method stabilizes on the correct hypothesis forever | All frontier edges resolved; no new questions; f(x) = x
Ockham method | Always output the simplest compatible hypothesis | Follow the kill-condition tree; never jump to a complex explanation when a simple one survives

The Kelly-Ramdas gap

Kelly’s framework is topological: which problems are solvable, which methods are optimal, how many mind-changes the optimal method requires. No probability, no sample size, no significance levels. The result holds for any evidence-generating process, not just i.i.d. samples. Silent on how to measure evidence at each step.

Ramdas’s framework is measure-theoretic: e-values, supermartingales, anytime-valid inference. Silent on whether composed evidence converges to truth in the limit, or how many retractions are needed.

Researcher | Framework | Proves | Blind spot
Kelly 1996 | Topology | Ockham’s razor minimizes mind-changes to truth | No mechanism for measuring evidence strength
Ramdas 2023 | Measure theory | E-value products are valid under arbitrary adaptation | No guarantee of limit convergence or retraction bounds

Both are at CMU. No collaboration between them appears in the published literature. Kelly is in the Department of Philosophy; Ramdas is in Statistics and Machine Learning. Their theorems address different halves of the same question: Kelly provides the convergence guarantee, Ramdas provides the evidence-composition mechanism. The hypothesis graph needs both.

Connect them: an inquiry that follows Ockham’s razor, measures evidence with e-values, and composes across adaptive experiments will (1) reach the truth with minimum retractions (Kelly) and (2) maintain valid error control at every step (Ramdas). Kelly says you’ll get there efficiently. Ramdas says you’ll know where you are while traveling.

The bridge between topology and measure theory for adaptive inquiry is open. No published theorem crosses it.


When the graph has nothing left to say

The graph converges when every frontier edge points to a node already tested and stably classified. No new questions. The map returns what you gave it: f(x) = x.

Mathematicians call this a fixed point. Computer scientists call it a fixpoint: a computation stabilizes and halts. Same idea from two angles.

In the hypothesis graph, the fixed point requires three conditions:

  1. Coverage. Every reachable node has been visited. No frontier edges point to untested hypotheses.
  2. Consistency. All classifications are stable. No node oscillates between bins when re-examined.
  3. Composability. The composed e-values across all experiments preserve the regime’s signature. The global trajectory confirms what the local trajectories claim.

If all three hold, the graph has converged. The investigation is complete — not because you ran out of budget, but because the structure has no more questions to ask. Every perturbation classified, every kill condition followed, the composed evidence pointing to a single stable map.
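
The check itself is mechanical. A minimal sketch, reusing the HypothesisGraph sketch above; stable() (re-examine a closed edge and report whether its bin is unchanged) and the running composed e-value are hypothetical inputs:

def is_fixed_point(graph, stable, composed_e, threshold=20.0):
    """Sketch of the three closure conditions; helpers are hypothetical."""
    coverage = not graph.open_edges()                  # 1. nothing untested
    consistency = all(stable(e) for e in graph.edges)  # 2. no bin flips
    composability = composed_e >= threshold            # 3. global evidence holds
    return coverage and consistency and composability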

Most investigations end when the grant runs out, the deadline arrives, or the engineer gets reassigned. Those are resource-based stopping conditions. The fixed point is structural. The system tells you when you’re done.


Three conditions, all required

Violate any one of the three conditions below and the graph cannot close.

Condition | What it requires | What happens when it fails
Perturbation access | You can poke the system and observe the response | Frontier edges point to experiments you cannot run
Independence between streams | Sensor streams are not correlated through a hidden channel | Composition inflates evidence; the product is misleadingly large
Identifiability | Different hypotheses produce different observable signatures | Two hypotheses are indistinguishable by any experiment; the graph cannot resolve between them

Perturbation access is the most fundamental. Sciences that cannot perturb their subject — astronomy, macroeconomics, paleontology — cannot close the hypothesis graph. They can observe, classify, and compose evidence, but they cannot run the experiment that would resolve a frontier edge. The graph stays open. This is a boundary condition, not a failure of method. The hypothesis graph works where you can poke.

Independence makes the product rule work. If two sensor streams share a hidden common cause that inflates both signals, the product of their e-values overstates the evidence. Grünwald’s theorem requires that each e-value be valid conditional on all previous data and decisions. Correlated streams violate this: once stream A’s data is in the conditioning, stream B’s e-value no longer has expectation at most one under the null, because the hidden cause leaks information from one stream to the other.
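
A minimal simulation of the failure, with the shift and correlation parameters chosen for illustration. Under a true null a valid e-value has expectation at most 1; a product over streams linked by a hidden channel does not:

import numpy as np

rng = np.random.default_rng(0)
n_trials, T = 2000, 20
mu_alt, sigma, rho = 0.2, 1.0, 0.9  # illustrative values

def stream_e_value(x, mu_null, mu_alt, sigma):
    """Product of per-observation likelihood-ratio e-values for one stream."""
    return np.prod(np.exp(-0.5 * ((x - mu_alt) / sigma)**2
                          + 0.5 * ((x - mu_null) / sigma)**2))

prod_indep, prod_corr = [], []
for _ in range(n_trials):
    a = rng.normal(0.0, sigma, size=T)      # null is true: no forcing
    b_ind = rng.normal(0.0, sigma, size=T)  # genuinely independent stream
    # stream correlated with a through a hidden channel
    b_cor = rho * a + np.sqrt(1 - rho**2) * rng.normal(0.0, sigma, size=T)
    e_a = stream_e_value(a, 0.0, mu_alt, sigma)
    prod_indep.append(e_a * stream_e_value(b_ind, 0.0, mu_alt, sigma))
    prod_corr.append(e_a * stream_e_value(b_cor, 0.0, mu_alt, sigma))

print(f"mean product, independent streams: {np.mean(prod_indep):.2f}")  # near 1
print(f"mean product, correlated streams:  {np.mean(prod_corr):.2f}")   # well above 1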

Identifiability means the truth is findable. If two hypotheses predict identical observations under every possible experiment, no evidence can distinguish them. The graph has a permanent frontier edge between them. Some systems have symmetries that prevent identification: gauge invariances in physics, confounders in causal inference.


Where it fails

Convergence requires perturbation. Remove it and the graph cannot close.

Astronomy. You cannot perturb a star. You observe its spectrum, motion, variability. You compose evidence across wavelengths, time, and instruments. But if two models of the stellar interior predict the same surface observables, no observation can distinguish them. Permanent frontier edge. Astronomers call this “model degeneracy.” Kelly would call it a topological limitation: the hypothesis class is not separable by the evidence topology.

Macroeconomics. You cannot run a controlled experiment on a national economy. Every observation is confounded by everything else happening simultaneously. The e-value product across economic indicators overstates evidence because the indicators share common causes (monetary policy, global trade, sentiment). The graph is open, and everyone knows it.

Paleontology. You cannot perturb an extinct organism. Fossils are fragmentary, biased by preservation, separated by millions of years. Each is a single frozen observation. The graph converges only where evidence is rich enough: gross morphology, phylogenetic placement. Fine-grained questions — behavior, coloration, soft tissue — remain open edges.

The open edges tell you why the graph cannot close. “Model degeneracy at the stellar interior” is a different failure mode from “confounded economic indicators.” The first requires a new observable (gravitational waves, neutrinos). The second requires a natural experiment (a policy change uncorrelated with other changes). The failure mode names the next question, even when the question is unanswerable with current tools.


Why debugging works

Engineered systems satisfy all three conditions. You can restart, inject faults, change configurations, deploy canaries (perturbation). CPU, memory, network, and disk are measured by separate subsystems (independence). A memory leak looks different from a CPU bottleneck (identifiability).

The hypothesis graph for a software system can close. The mechanic’s diagnostic tree terminates. The SRE’s runbook converges on a root cause. Not because the practitioners are brilliant, but because the system’s structure admits convergence.

Kelly’s Ockham efficiency guarantees that following the simplest surviving hypothesis minimizes wasted effort. Ramdas’s composition theorem guarantees the accumulated evidence remains valid. The practitioner’s loop converges, and it converges efficiently.


The supporting cast

Two further results complete the picture.

Chernoff (1959) proved that adaptive experiment selection converges to the true hypothesis exponentially fast, at a rate determined by the KL divergence between hypotheses. The optimal policy selects the experiment that maximizes the minimum KL divergence between the current best hypothesis and its nearest competitor. The economy of research (ch 7) made asymptotically precise.
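
A sketch of the selection rule over a finite hypothesis set; kl(e, h1, h2), the KL divergence between experiment e’s outcome distributions under h1 and h2, is a hypothetical helper:

def chernoff_next_experiment(experiments, hypotheses, best, kl):
    """Pick the experiment whose worst-case discrimination is largest:
    maximize the minimum KL divergence between the current best
    hypothesis and its nearest rival."""
    rivals = [h for h in hypotheses if h != best]
    return max(experiments,
               key=lambda e: min(kl(e, best, h) for h in rivals))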

He and Geng (2008) proved that adaptive interventions recover the true causal structure under faithfulness and sufficiency assumptions. If you can intervene on any variable, you can distinguish any two causal graphs that produce different interventional distributions.
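
A toy illustration of that identifiability point, not their algorithm: two graphs over (X, Y) that can agree observationally come apart under the intervention do(X := x), because only the graph where Y listens to X lets the intervention move Y:

import numpy as np

rng = np.random.default_rng(1)

def mean_y_under_do_x(graph, x, n=10_000):
    """Mean of Y after forcing X to x, under each candidate graph."""
    if graph == "X->Y":
        X = np.full(n, x)            # intervention sets X
        Y = X + rng.normal(size=n)   # Y still listens to X
    else:  # "Y->X"
        Y = rng.normal(size=n)       # Y is exogenous;
        X = np.full(n, x)            # cutting into X leaves Y untouched
    return Y.mean()

print(mean_y_under_do_x("X->Y", 2.0))  # shifts with x: about 2.0
print(mean_y_under_do_x("Y->X", 2.0))  # does not: about 0.0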

Line them up:

Result | Proves | What it contributes
Chernoff 1959 | Adaptive experiments converge exponentially | Rate of convergence for experiment selection
Kelly 1996 | Ockham minimizes retractions to truth | Efficiency guarantee for hypothesis selection
He & Geng 2008 | Adaptive interventions recover causal structure | Identifiability guarantee under perturbation
Grünwald/Ramdas 2023–24 | E-value products are valid under adaptation | Evidence-composition mechanism

Four results, four decades, four subfields (sequential analysis, formal epistemology, causal inference, mathematical statistics). Each proves one piece: convergence rate, efficiency, identifiability, composition. No published paper connects all four.


What this chapter establishes

The evidence framework from Part III is complete. A practitioner with chapters 8–11 can:

  1. Track evidence as a trajectory, not a scalar (ch 8).
  2. Classify the trajectory into four bins: converge, diverge, oscillate, chaos (ch 9).
  3. Generate the next experiment from the failure mode of the current one (ch 10).
  4. Compose evidence across experiments, knowing the composition is valid and the inquiry converges (ch 11).

Kelly guarantees convergence. Ramdas guarantees composition. Chernoff guarantees rate. He and Geng guarantee identifiability.

One problem. The framework requires a human to generate hypotheses, select experiments, classify trajectories, and follow edges. For one system, this is diagnosis. For ten systems, a full-time job. For a thousand, impossible.

The math works. The loop is too expensive to run by hand. Can it run without a human?

Part IV.


Sources

Kelly 1996. The Logic of Reliable Inquiry. Oxford University Press. Topological learning theory; proves Ockham’s razor is necessary and sufficient for cycle-optimal convergence.
Grünwald, Henzi, and Ly 2024. “E-values: Calibration, combination, and applications.” JRSS-B 86(5). The e-value product rule under arbitrary adaptation.
Ramdas, Grünwald, Vovk, and Shafer 2023. “Game-theoretic statistics and safe anytime-valid inference.” Statistical Science 38(4). E-value processes as test supermartingales.
Chernoff 1959. “Sequential Design of Experiments.” Annals of Mathematical Statistics 30(3). Adaptive experiment selection converges exponentially.
He and Geng 2008. “Active Learning of Causal Networks with Intervention Experiments and Optimal Designs.” JMLR 9. Adaptive interventions recover causal structure.
Milnor 1985. “On the Concept of Attractor.” Communications in Mathematical Physics 99. The four-bin classification of dynamical trajectories.