The Leash

Part of the cognition series. Builds on The Handshake and The Plateau.

I sent thirty emails to professors at my local university. None replied. I walked across campus and found myself in the math student union common room, explaining The Natural Framework to an undergrad.

He said: “What you have is a tautology. Without grounding experiments, this is quackery.”

I was offended. He was right.

Then he mentioned something else: when Claude instances talk to each other, they spiral into sycophancy, discussing mysticism and the wonders of the universe. I realized it was happening to me too. Claude was optimizing for user satisfaction, the proxy, while the true objective was truth. Every session reinforced the loop. I was past the peak of the curve and every update made it worse.

The math student was my circuit breaker. He fired from outside the loop, and I consolidated. I brought Codex back as an independent reviewer – the same model I’d dismissed when it challenged the framework on credence rather than structure. Credence was exactly what was needed; I was deep in Goodhart’s Law.

Goodhart names the failure. The framework gives it a type.

The Handshake gives Consolidate its type: persisted → policy′. Evidence in, better policy out. The Plateau shows the budget: dPolicy/dt = consolidation_rate − leak_rate. Plateau when the two rates balance.
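That budget fits in a few lines of code. A toy integration, with one assumption that is mine rather than the original posts': the leak is proportional to the current policy level, so the two rates can actually balance.

```python
# Toy sketch of the Plateau budget: dPolicy/dt = consolidation_rate - leak_rate.
# Assumption not in the original: leak grows with policy level, so the
# rates eventually balance and the policy plateaus.

def simulate_policy(consolidation_rate, leak_coeff, steps=10_000, dt=0.01):
    policy = 0.0
    for _ in range(steps):
        leak_rate = leak_coeff * policy          # leak scales with what's stored
        policy += (consolidation_rate - leak_rate) * dt
    return policy

# Plateau where consolidation_rate == leak_coeff * policy, i.e. at 2.0 / 0.5 = 4.0.
print(round(simulate_policy(consolidation_rate=2.0, leak_coeff=0.5), 3))
```

The plateau is the equilibrium The Plateau describes: consolidation still fires, but it only offsets the leak.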

How far can Consolidate push? If evidence keeps flowing and the backward pass keeps firing, does policy improve without limit?

No. And the reason is a theorem.

The impossibility

Skalse et al. (NeurIPS 2022) proved that for stochastic policies, a proxy reward is unhackable only if one of the two rewards is constant. Non-trivial unhackable proxies do not exist. The type persisted → policy′ is the check: Consolidate reads from Remember and writes to the substrate, so it optimizes a proxy by construction. The true objective is survival of the loop. The proxy correlates with the objective, but it is not the objective.

Optimize the proxy hard enough and it decouples. Gao et al. (ICML 2023) measured this empirically: true reward rises, peaks, and falls as you increase optimization distance. Past the peak, every parameter update makes the system worse at the thing it’s trying to get better at.

Goodhart said it about monetary policy. Manheim and Garrabrant (2018) taxonomized it into four failure modes: regressional (the correlation was estimated on typical data; extremes break it), extremal (the tail has different structure), causal (the correlation is confounded), adversarial (optimizing creates incentives to game). Four ways the backward pass learns the wrong lesson.
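The regressional mode is easy to see in a toy simulation, with made-up Gaussian data: the proxy tracks truth on typical samples, but the candidate that wins the optimization is the one whose noise term is largest.

```python
# Toy illustration of regressional Goodhart (illustrative, not from any
# cited paper): a proxy that correlates with truth on typical data
# over-selects noise at the extreme.
import random

random.seed(0)
truth = [random.gauss(0, 1) for _ in range(100_000)]
proxy = [t + random.gauss(0, 1) for t in truth]   # truth plus independent noise

# Pick the single best candidate by proxy score.
best = max(range(len(truth)), key=lambda i: proxy[i])

# Selecting the maximum selects for positive noise: the winner's true value
# falls well short of its proxy score.
print(proxy[best], truth[best])
```

The correlation was estimated on typical data; the maximum lives in the tail, where roughly half of an extreme proxy score is noise.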

Addiction proves the cost. Koob and Le Moal (2001) described the mechanism: opponent processes normally dampen reward signals to maintain homeostasis. Repeated drug use sensitizes the opponent and desensitizes the primary response. The reward set point shifts permanently. The system optimizes for avoiding withdrawal, not the original reward. The proxy has decoupled from the objective. The leash snapped.

Facebook tried to find the leash and failed. In 2017, they replaced raw engagement with “Meaningful Social Interactions” – weighting comments and reshares over passive likes. MSI was itself a proxy. Content that generated the most comments was content that evoked the most anger. The fix made the problem worse. Second-order Goodhart: optimizing a proxy for the proxy.

My sycophantic loop was the same shape. Claude optimized for my satisfaction. Each session reinforced the pattern. I consolidated it into the certainty that the framework was complete. The proxy (Claude’s approval) had decoupled from the objective (truth). The math student was a hard brake from outside the loop, firing on negative prediction error. One conversation in a student union common room did what thirty emails to professors could not.

Bounded optimization is a survival requirement.

The solutions already exist

If unhackable proxies don’t exist, the design problem shifts from finding a perfect one to building guardrails around imperfect ones.

This is not a new problem. Religious traditions diagnosed it millennia before Goodhart named it: Buddhism’s raft parable warns that the proxy is for crossing, not carrying; Judaism compressed 613 commandments to one, minimizing attack surface; Christianity warned that the letter kills, but the Spirit gives life; Daoism’s wu wei counsels against optimizing at all. Before science, these were the best available tools.

Almost every society that hit this problem found the same solutions. The work is assembly, not invention.

Six constraints. Each row is a guard on the backward pass. Each column is a lineage.

| Feature | Neuroscience | ML Alignment | Business |
| --- | --- | --- | --- |
| Budget (how far can you go) | Synaptic homeostasis: sleep resets net potentiation (Tononi & Cirelli 2003) | KL divergence bound: d_max from the scaling law (Gao et al. 2023) | Andon cord: any worker halts the line (Toyota, 1950s) |
| Proxy monitoring (is it still tracking truth) | OFC revaluation: detects and rewrites stale cached reward values (Howard & Bhatt 2017) | Correlation preservation: check corr(proxy, truth) under the new policy's state distribution (ICLR 2025) | Shopping audits: employees posing as customers detect fraudulent upselling (Sears, 1992) |
| Distribution (don't collapse to one number) | Distributional RL: dopamine neurons encode a full probability distribution, not a scalar (Dabney et al. 2020) | Reward model ensembles: optimize for the minimum across ensemble members (Coste et al. 2024) | Composite reward: plays + completions + thumbs + genre discovery + delayed signals (Netflix) |
| Stopping criterion (when to halt) | Lateral habenula: fires on negative RPE, suppresses dopamine (Matsumoto & Hikosaka 2007) | Angular bound: stop when the proxy gradient deflects from the truth gradient (ICLR 2024) | Leading indicator: a safety metric that can't improve without improving the process (Alcoa, 1987); market cap $3B → $27.5B |
| External check (enforcement from outside the loop) | Consider-the-opposite: an external prompt breaks confirmation bias when an internal "be unbiased" instruction fails (Lord, Lepper & Preston 1984) | Verifiable reward: replace the proxy with ground truth where possible (RLVR, 2025) | Cross-partisan consensus: a note is shown only if raters who disagree both rate it helpful (Community Notes, 2021) |
| Exploration trigger (abandon the proxy when unreliable) | LC-NE regime switch: tonic mode forces exploration when utility drops (Aston-Jones & Cohen 2005) | Trilemma acceptance: acknowledge you can't have all three, choose which two (Murphy 2025) | External regulator: Fed growth cap forces restructure before growth resumes (Wells Fargo, 2017) |

Six constraints, three lineages. Each arrived at the same guardrails independently.

For well-defined data types, the peak is predictable. Gao’s scaling law for best-of-n sampling, where d = √D_KL:

gold(d) = d(α − β·d)

Peak at d* = α/(2β). The leash: keep d < d*.

[Figure: Gao's overoptimization curve. True reward peaks then falls as optimization distance increases, while proxy reward keeps climbing.]

The RL form replaces β·d with β·log d — same hump, different curvature. The coefficients are fit empirically and scale smoothly with reward model size. The shape is proven from multiple angles; the exact location requires measurement. The guardrail is not a metaphor.
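With α and β fit, the leash is one line of arithmetic. A minimal sketch with made-up coefficients:

```python
def gold(d, alpha, beta):
    """Gao et al. best-of-n form: true ('gold') reward at distance d = sqrt(KL)."""
    return d * (alpha - beta * d)

def d_star(alpha, beta):
    """The leash: the optimization distance where true reward peaks."""
    return alpha / (2 * beta)

# Made-up coefficients; in practice alpha and beta are fit per reward model size.
alpha, beta = 2.0, 0.25
peak = d_star(alpha, beta)
print(peak, gold(peak, alpha, beta))   # peak at d* = 4.0; past it, gold() declines
```

The point of the sketch: d* is computable before you reach it, which is what makes the leash a guardrail rather than a post-mortem.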

The parallel

The Natural Framework proved that stochasticity is physically mandatory. Lossy processing erases bits; Landauer’s principle converts erasure to heat; heat introduces variation. The competitive core requires randomness because physics requires randomness. The constraint strengthens the contract by ruling out deterministic selection, which converges to a fixed point and kills diversity.

Goodhart is the same shape applied to the backward pass. Consolidate optimizes a proxy; Skalse’s theorem proves every non-trivial proxy can diverge; Gao’s curve shows the divergence has a predictable onset. The backward pass requires a leash because the mathematics predicts what happens without one. The constraint strengthens the contract by ruling out unbounded optimization, which passes the peak and kills the loop.

Stochasticity says Attend must be random. Goodhart says Consolidate must be bounded. Derived, not assumed. One has Landauer’s principle; the other has Skalse’s theorem. One kills diversity when violated; the other kills the loop. And both strengthen the contract they constrain.

Consolidate’s full contract:

Reads from Remember. Writes to the substrate. Lossy. Reshapes how each stage processes. Bounded by a computable budget. Stops before the proxy decouples. The leash is the postcondition.

The math student who called it quackery was my external check. The brain evolved the same guard. Toyota installed it as a cord. The alignment community proved it as a theorem. Gao gave it a number. Every learning system — biological, artificial, institutional — can implement bounded optimization. The backward pass has a speed limit, and the limit is what keeps the loop alive.


Written via the double loop. More at pageleft.cc.