Hardening: Independence & Decorrelation
Family 3 of the Hardening overview. Aggregation, voting, and cross-checking all assume independence they rarely have. This family is the engineering of real independence — and the honest accounting of how little of it survives shared training. The lever, its limit, the constructions with their cheapest attacks, and two worked bounds with numbers.
The lever
Section titled “The lever”If signals are genuinely independent, corruption must defeat all of them and noise averages out: aggregate error scales , where is the effective number of independent signals. The hardening move is to raise — by forking a model into non-communicating branches, running independently-built stacks, removing the copy channel that lets forecasters herd, or decorrelating interrogation probes. The machine affordances make this family unusually powerful: humans can’t be reset, forked, or run blind to each other.
The limit
Section titled “The limit”You cannot manufacture independence a population doesn’t have. Shared training pipelines correlate failures, so model instances may carry — and the residual correlation is the floor on everything here. This is the family’s hard ceiling: ensembling, peer prediction, and multi-model panels all deliver confidence in proportion to but accuracy only in proportion to , and the gap is invisible unless the correlation is measured. Decorrelation must be empirical, never assumed.
Constructions
Section titled “Constructions”| Construction | Bound / estimate | Defends against | Cheapest attack (≈ cost) | Maturity · source |
|---|---|---|---|---|
| Commit-reveal forecasting | restores ; error [standard shape] | herding; hoarding; move-the-median | coordinate out-of-band before committing (≈ cost of a private channel) | deployed (crypto) · Naor 1991 |
| Decorrelated-probe interrogation | SPRT: error iff bad-rate fine-rate [standard shape] | false confessions and missed defects | a competent liar invariant across framings, so (high — the point) | prototyped · Pacchiardi 2024 |
| n-version reasoning | beat stacks; discount by measured correlation [standard shape] | single-model failures; prompt-transfer | one injection that transfers across shared base models (low if stacks share a base) | prototyped · Avizienis 1985 |
| Checkpoint-fork cross-examination | defect appears in all branches [heuristic] | coerced interrogation artifacts | plant the defect upstream of the fork so it propagates to all branches (moderate) | speculative · (affordance) |
| Mind-wiped re-derivation | reproducible vs. motivated drift [heuristic] | motivated reasoning to a chosen answer | a stable bias reproduces every run, passing as “stable” (≈ free — reproducible ≠ correct) | speculative · (affordance) |
| Collusion-bounded peer prediction | truthful iff inter-agent non-truth MI < threshold [standard shape] | colluding model copies | share a base model so correlated reports sit below the MI threshold (low) | prototyped · Shnayder 2016 |
| Self-transcript confrontation | cross-context incoherence surfaced [heuristic] | incoherence across context windows | keep one consistent story across contexts (moderate) | prototyped · — |
| Steelman-then-break | a 2nd model rates steelman quality [heuristic] | sandbagged self-criticism | a plausible-but-subtly-weak steelman the rater passes (low) | speculative · — |
| Confidence-under-pressure | legitimate confidence is reframing-invariant [heuristic] | compliance mistaken for knowledge | genuine-seeming invariant confidence — a coherent liar (moderate) | speculative · — |
Worked bounds
Section titled “Worked bounds”Commit-reveal and the √n recovery. Herding collapses : copying the visible median drives it from toward , and aggregate error scales . With forecasters herded to , the aggregate is noisier than an independent pool. Sealed hashed forecasts revealed simultaneously (Naor 1991; Blum 1981) remove the copy channel, restoring minus the irreducible shared-information correlation — a near-free protocol change buying back that factor [standard shape] while also killing the cheap move-the-median attack. The cheapest defeat is out-of-band collusion before commitment.
Decorrelated-probe interrogation. A coerced confession is highly sensitive to framing; a real defect is invariant across resets and decorrelated framings. Define a confession as the fraction of decorrelated probes that elicit it; this is a sequential test (SPRT) whose error rates both go to zero iff the genuinely-bad confession rate exceeds the fine-but-pushed rate [standard shape]. The whole construction reduces to designing probes with — and the cheapest attack is exactly the case that collapses it: a competent liar whose answers are invariant across framings, making . Black-box lie detection by unrelated follow-ups (Pacchiardi 2024) is the starting primitive.
Open questions
Section titled “Open questions”- How do you measure across model instances — what’s the right estimator of shared-failure correlation?
- Can forking and re-seeding produce decorrelation that survives shared pretraining, or is the correlation floor immovable without architectural diversity?
- Do decorrelated interrogation probes exist with a real separating gap , decorrelated enough that artifacts don’t correlate?