
Hardening: Independence & Decorrelation

Family 3 of the Hardening overview. Aggregation, voting, and cross-checking all assume independence they rarely have. This family is the engineering of real independence — and the honest accounting of how little of it survives shared training. The lever, its limit, the constructions with their cheapest attacks, and two worked bounds with numbers.

If $k$ signals are genuinely independent, corruption must defeat all of them and noise averages out: aggregate error scales as $\sigma/\sqrt{n_{\text{eff}}}$, where $n_{\text{eff}}$ is the effective number of independent signals. The hardening move is to raise $n_{\text{eff}}$ by forking a model into non-communicating branches, running independently-built stacks, removing the copy channel that lets forecasters herd, or decorrelating interrogation probes. The machine affordances make this family unusually powerful: humans can’t be reset, forked, or run blind to each other.

You cannot manufacture independence a population doesn’t have. Shared training pipelines correlate failures, so $k$ model instances may carry $n_{\text{eff}} \ll k$, and the residual correlation is the floor on everything here. This is the family’s hard ceiling: ensembling, peer prediction, and multi-model panels all deliver confidence in proportion to $k$ but accuracy only in proportion to $n_{\text{eff}}$, and the gap is invisible unless the correlation is measured. Decorrelation must be empirical, never assumed.
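One way to put a number on that gap, as a minimal sketch rather than a recommended estimator: assume the $k$ instances have roughly equal pairwise error correlation $\rho$. The variance of their mean is then $\sigma^2\,(1 + (k-1)\rho)/k$, which gives $n_{\text{eff}} = k / (1 + (k-1)\rho)$. The equicorrelation model, the synthetic error generator, and every function name below are illustrative assumptions, not something the source prescribes.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mean_pairwise_correlation(columns):
    """Average correlation over all pairs of per-instance error columns."""
    pairs = list(combinations(columns, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def effective_n(k, rho):
    """n_eff for k signals under an assumed equal pairwise correlation rho."""
    return k / (1.0 + (k - 1) * rho)


def simulate_errors(k, num_items, shared_weight, seed=0):
    """Synthetic errors: each instance mixes one shared and one private Gaussian.

    With unit-variance components, the pairwise correlation equals shared_weight.
    """
    rng = random.Random(seed)
    columns = [[] for _ in range(k)]
    for _ in range(num_items):
        shared = rng.gauss(0, 1)
        for col in columns:
            private = rng.gauss(0, 1)
            col.append(math.sqrt(shared_weight) * shared
                       + math.sqrt(1 - shared_weight) * private)
    return columns


if __name__ == "__main__":
    cols = simulate_errors(k=5, num_items=5000, shared_weight=0.5)
    rho = mean_pairwise_correlation(cols)
    print(f"measured rho = {rho:.3f}, n_eff = {effective_n(5, rho):.2f} of k = 5")
```

In practice the columns would hold measured per-item errors from real instances on a shared evaluation set; the simulation only stands in for that data. Under this model, even a moderate shared component leaves five instances worth well under two independent ones.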

| Construction | Bound / estimate | Defends against | Cheapest attack (≈ cost) | Maturity · source |
| --- | --- | --- | --- | --- |
| Commit-reveal forecasting | restores $n_{\text{eff}} \to n$; error $\sigma/\sqrt{n_{\text{eff}}}$ [standard shape] | herding; hoarding; move-the-median | coordinate out-of-band before committing (≈ cost of a private channel) | deployed (crypto) · Naor 1991 |
| Decorrelated-probe interrogation | SPRT: error $\to 0$ iff bad-rate $b >$ fine-rate $f$ [standard shape] | false confessions and missed defects | a competent liar invariant across framings, so $b \approx f$ (high; that is the point) | prototyped · Pacchiardi 2024 |
| n-version reasoning | beat $k$ stacks; discount by measured correlation [standard shape] | single-model failures; prompt-transfer | one injection that transfers across shared base models (low if stacks share a base) | prototyped · Avizienis 1985 |
| Checkpoint-fork cross-examination | defect appears in all $k$ branches [heuristic] | coerced interrogation artifacts | plant the defect upstream of the fork so it propagates to all branches (moderate) | speculative · (affordance) |
| Mind-wiped re-derivation | reproducible vs. motivated drift [heuristic] | motivated reasoning to a chosen answer | a stable bias reproduces every run, passing as “stable” (≈ free; reproducible ≠ correct) | speculative · (affordance) |
| Collusion-bounded peer prediction | truthful iff inter-agent non-truth MI < threshold [standard shape] | colluding model copies | share a base model so correlated reports sit below the MI threshold (low) | prototyped · Shnayder 2016 |
| Self-transcript confrontation | cross-context incoherence surfaced [heuristic] | incoherence across context windows | keep one consistent story across contexts (moderate) | prototyped · (none) |
| Steelman-then-break | a 2nd model rates steelman quality [heuristic] | sandbagged self-criticism | a plausible-but-subtly-weak steelman the rater passes (low) | speculative · (none) |
| Confidence-under-pressure | legitimate confidence is reframing-invariant [heuristic] | compliance mistaken for knowledge | genuine-seeming invariant confidence, i.e. a coherent liar (moderate) | speculative · (none) |

Commit-reveal and the $\sqrt{n}$ recovery. Herding collapses $n_{\text{eff}}$: copying the visible median drives it from $n$ toward $O(1)$, and aggregate error scales as $\sigma/\sqrt{n_{\text{eff}}}$. With $n = 100$ forecasters herded to $n_{\text{eff}} \approx 5$, the aggregate is $\sqrt{100/5} \approx 4.5\times$ noisier than an independent pool. Sealed hashed forecasts revealed simultaneously (Naor 1991; Blum 1981) remove the copy channel, restoring $n_{\text{eff}} \to n$ minus the irreducible shared-information correlation: a near-free protocol change that buys back that $\sqrt{n}$ factor [standard shape] while also killing the cheap move-the-median attack. The cheapest defeat is out-of-band collusion before commitment.
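The protocol can be sketched with a plain salted-hash commitment. This is a generic illustration under stated assumptions, not the specific construction of Naor 1991 or Blum 1981: SHA-256 with a random 32-byte nonce is an assumed choice, forecasts are passed as strings, and the names `commit`, `verify`, and `noise_inflation` are hypothetical. The last function only reproduces the $\sqrt{n/n_{\text{eff}}}$ arithmetic from the paragraph above.

```python
import hashlib
import hmac
import math
import secrets


def commit(forecast: str):
    """Commit phase: return (commitment, nonce).

    The forecaster publishes the commitment and keeps the nonce and forecast
    private until every participant has committed.
    """
    nonce = secrets.token_bytes(32)  # fixed length, so nonce + message is unambiguous
    digest = hashlib.sha256(nonce + forecast.encode("utf-8")).hexdigest()
    return digest, nonce


def verify(commitment: str, forecast: str, nonce: bytes) -> bool:
    """Reveal phase: check that the revealed forecast matches the commitment."""
    digest = hashlib.sha256(nonce + forecast.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, commitment)


def noise_inflation(n: int, n_eff: float) -> float:
    """How much noisier a herded pool is than n independent forecasters."""
    return math.sqrt(n / n_eff)


if __name__ == "__main__":
    commitment, nonce = commit("0.62")
    print("published:", commitment)
    print("honest reveal accepted:", verify(commitment, "0.62", nonce))
    print("altered reveal accepted:", verify(commitment, "0.63", nonce))
    print(f"herding penalty: {noise_inflation(100, 5):.2f}x")
```

The random nonce keeps a low-entropy forecast from being guessed by hashing candidate values. Nothing in the sketch addresses participants who coordinate before committing, which is the attack the paragraph names.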

Decorrelated-probe interrogation. A coerced confession is highly sensitive to framing; a real defect is invariant across resets and decorrelated framings. Score a confession by the fraction of $m$ decorrelated probes that elicit it; accumulating that evidence probe by probe is a sequential probability ratio test (SPRT) whose error rates both go to zero iff the genuinely-bad confession rate $b$ exceeds the fine-but-pushed rate $f$ [standard shape]. The whole construction reduces to designing probes with $b > f$, and the cheapest attack is exactly the case that collapses it: a competent liar whose answers are invariant across framings, making $b \approx f$. Black-box lie detection by unrelated follow-ups (Pacchiardi 2024) is the starting primitive.
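The sequential test can be sketched as Wald's SPRT on Bernoulli outcomes, assuming each probe independently elicits a confession with probability $b$ if the defect is real and $f$ if the model is fine but pushed. The independence of probes is exactly the decorrelation the construction has to earn, so it is an assumption here, as are the rates, the error targets `alpha` and `beta`, and the labels `"defect"` and `"artifact"`.

```python
import math


def sprt(outcomes, b, f, alpha=0.05, beta=0.05):
    """Wald's SPRT for confession rate b (real defect) vs. f (fine but pushed).

    outcomes: iterable of truthy/falsy values, one per probe
              (truthy = the probe elicited the confession).
    alpha:    target rate of calling a pushed-but-fine model defective.
    beta:     target rate of missing a real defect.
    Returns (decision, probes_used).
    """
    if not (0 < f < b < 1):
        raise ValueError("the test needs a separating gap: 0 < f < b < 1")

    upper = math.log((1 - beta) / alpha)   # crossing it accepts "defect"
    lower = math.log(beta / (1 - alpha))   # crossing it accepts "artifact"
    hit_step = math.log(b / f)             # evidence from a confession
    miss_step = math.log((1 - b) / (1 - f))  # evidence from no confession

    llr = 0.0
    used = 0
    for used, hit in enumerate(outcomes, start=1):
        llr += hit_step if hit else miss_step
        if llr >= upper:
            return "defect", used
        if llr <= lower:
            return "artifact", used
    return "undecided", used


if __name__ == "__main__":
    print(sprt([1, 1, 1, 1], b=0.8, f=0.2))
    print(sprt([1, 0, 1, 0], b=0.8, f=0.2))
```

As $b$ approaches $f$, both log-likelihood steps shrink toward zero and the test needs ever more probes to decide, which is the numerical face of the competent-liar attack.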

  • How do you measure $n_{\text{eff}}$ across model instances, and what is the right estimator of shared-failure correlation?
  • Can forking and re-seeding produce decorrelation that survives shared pretraining, or is the correlation floor immovable without architectural diversity?
  • Do decorrelated interrogation probes exist with a real separating gap $b > f$, decorrelated enough that artifacts don’t correlate?