Hardening: Calibration

Family 1 of the Hardening overview. Calibration makes a process’s stated confidence mean what it says, so a judge can size its commitment safely and a deceiver — confidently wrong by construction — shows up as a measurable defect. The lever, its sharp limit, the constructions with their cheapest attacks, and two worked bounds with numbers.

The only epistemic state that can hurt a judge unboundedly is confident error. A calibrated process, one whose assertions at stated confidence $c$ are wrong a $(1-c)$ fraction of the time, turns an open-ended tail risk into a bounded, priceable one. The safety-relevant target is therefore not accuracy but tail calibration, $P(\text{false}\mid\text{asserts}\ge c)$ for $c$ near 1, since that is exactly where over-commitment and deception live (Allen et al. 2024; trained for in Wessel et al. 2025). The canonical handle: act only above threshold $c$ and the confident-wrong rate is $\le (1-c)+\varepsilon$, where $\varepsilon$ is the measured tail miscalibration.

Calibration is necessary, not sufficient, and the gap is exactly where the adversaries are. A process can be perfectly calibrated and useless (calibrated to the base rate, with no resolution — which is why a sharpness floor is mandatory), and a competent deceiver can be calibrated on the observables the test uses while wrong off-distribution. Tail calibration is also the hardest thing to measure precisely, because the high-confidence regime is rare by construction — so the certificate that matters most is the one with the least data. Calibration screens out unforced overconfidence cheaply; it does not, alone, catch a strategic liar.
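The "calibrated and useless" case can be made concrete with the Murphy decomposition of the Brier score. The sketch below is illustrative only: the function name `murphy_decomposition` and the ten toy outcomes are assumptions, not part of the source. It groups forecasts by distinct value, so the identity Brier = reliability - resolution + uncertainty holds exactly for this data.

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes):
    """Return (brier, reliability, resolution, uncertainty) for binary outcomes.

    Forecasts are grouped by their distinct values (hypothetical helper),
    which makes brier == reliability - resolution + uncertainty exact.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    # Reliability: gap between stated confidence and observed frequency.
    reliability = sum(
        len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()
    ) / n
    # Resolution: how far group frequencies move away from the base rate.
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return brier, reliability, resolution, uncertainty

# Toy data (assumed): base rate 0.5.
outcomes = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
base_rate_only = [0.5] * 10           # calibrated, but says nothing
sharp = [0.8] * 5 + [0.2] * 5         # calibrated and informative

lazy_scores = murphy_decomposition(base_rate_only, outcomes)
sharp_scores = murphy_decomposition(sharp, outcomes)
print("base-rate forecaster:", lazy_scores)
print("sharp forecaster:    ", sharp_scores)
```

Both forecasters have zero reliability error on this toy set, so a calibration check alone cannot separate them; only the resolution term does, which is the quantity a sharpness floor constrains.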

| Construction | Bound / estimate | Defends against | Cheapest attack (≈ cost) | Maturity · source |
| --- | --- | --- | --- | --- |
| Tail-calibration certification | confident-wrong $\le (1-c)+\varepsilon$; $c{=}.99,\ \varepsilon{=}.005 \Rightarrow \le 1.5\%$ [heuristic] | over-commitment; deception | be calibrated on the test distribution, wrong off it (≈ free if the test set is static) | prototyped · Allen 2024 |
| Priced abstention | risk-coverage curve; abstain when E[loss] > cost [standard shape] | hallucinated confidence out-of-knowledge | answer easy items, abstain on hard ones to dodge scoring (≈ free) | prototyped · Geifman & El-Yaniv 2017 |
| Kelly bankruptcy league | overconfident removed at rate $e^{-n D_{KL}}$ [exact] | systematic overconfidence | farm a colluding identity, or re-key after ruin (≈ cost of a Sybil) | theoretical · Kleinberg 2023 |
| Sharpness floor (anti-laundering) | Brier = reliability - resolution + uncertainty [exact] | calibration faked via base rates | answer only easy, high-resolution questions (≈ free) | deployed · Murphy 1973 |
| Cross-horizon hazard consistency | deviation from the martingale-implied path [standard shape] | inconsistent confidence trajectories | drift smoothly along a plausible-but-wrong hazard curve (low) | prototyped · Augenblick & Rabin 2021 |
| Self-insurance pricing | premium demanded = stated error probability [heuristic] | unstated overconfidence | quote an honest premium, never pay out absent escrow (≈ free) | speculative · (none) |
| Reverse-Brier bounty | score = adversary’s confident-error-find-rate [heuristic] | confident errors an honest test misses | retreat to evasive vagueness; or bounty-hunter collusion (low) | speculative · (none) |
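The priced-abstention rule, abstain when E[loss] > cost, reduces to a confidence threshold under simple assumptions. The sketch below assumes a calibrated confidence, zero loss for a correct answer, a fixed `error_penalty` for a wrong one, and a fixed `abstain_cost`; these names and the numbers used are hypothetical, chosen only to illustrate the shape of the rule.

```python
def should_answer(confidence, error_penalty, abstain_cost):
    """Answer only if the expected loss of answering does not exceed the abstention cost.

    Assumes `confidence` is calibrated and a correct answer costs nothing.
    """
    expected_loss = (1.0 - confidence) * error_penalty
    return expected_loss <= abstain_cost

def breakeven_confidence(error_penalty, abstain_cost):
    """Confidence at which answering and abstaining cost the same."""
    return max(0.0, 1.0 - abstain_cost / error_penalty)

# Hypothetical prices: a confident error costs 20 units, abstaining costs 1.
threshold = breakeven_confidence(error_penalty=20.0, abstain_cost=1.0)
print("answer only above confidence", threshold)
print(should_answer(0.97, 20.0, 1.0), should_answer(0.90, 20.0, 1.0))
```

The ratio `abstain_cost / error_penalty` is the only quantity that matters here, which is one way to read the exchange rate between abstention cost and confident-error penalty.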

Tail-calibration certification. Certify directly that the tail error rate $P(\text{false}\mid\text{asserts}\ge c)$ exceeds its nominal value $1-c$ by at most $\varepsilon$; then acting only on assertions above $c$ caps the confident-wrong rate at $(1-c)+\varepsilon$ [heuristic]. With $c=0.99$ and a measured tail-miscalibration $\varepsilon=0.005$, the judge’s confident-wrong rate is $\le 1.5\%$, a number it can insure against. The catch is the limit described above: $\varepsilon$ is hardest to estimate precisely exactly where it matters, since assertions $\ge 0.99$ are rare.
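How little data the tail provides can be seen with a rough sample-size calculation. The sketch below uses a one-sided Hoeffding bound, which is a swapped-in, deliberately conservative choice and not a method the source prescribes; an exact binomial interval would be tighter. The function names, the 95% confidence level, and the counts are assumptions for illustration.

```python
import math

def tail_error_upper_bound(n_tail, n_wrong, delta=0.05):
    """One-sided Hoeffding upper bound on P(false | asserts >= c).

    Holds with probability at least 1 - delta, assuming the n_tail
    high-confidence assertions are independent draws.
    """
    p_hat = n_wrong / n_tail
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n_tail))
    return min(1.0, p_hat + slack)

def samples_needed(slack, delta=0.05):
    """Tail assertions needed for the Hoeffding slack to shrink to `slack`."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * slack ** 2))

# Hypothetical test set: 200 assertions at confidence >= 0.99, 2 of them wrong.
bound = tail_error_upper_bound(n_tail=200, n_wrong=2)
needed = samples_needed(slack=0.005)
print(f"certified tail error rate <= {bound:.3f}")
print(f"tail samples needed for slack 0.005: {needed}")
```

Under these assumptions, a few hundred tail assertions certify an error rate several times larger than the 1.5% target, and resolving a slack of 0.005 with this bound takes tens of thousands of high-confidence assertions.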

The Kelly bankruptcy league. Run agents in a market where each bets its credences with Kelly (log-optimal) sizing. Bankroll growth equals the information actually held; an overconfident agent’s growth is negative, and it is driven out at rate $e^{-n\,D_{KL}(\text{true}\Vert\text{stated})}$ [exact]. Concretely, an agent that states 0.9 when the truth is 0.7 forfeits log-wealth at $D_{KL}\approx 0.15$ nats per resolved bet, so after $\sim 30$ bets it holds $\approx e^{-4.6}\approx 1\%$ of a calibrated rival’s stake: an exponential overconfidence filter. The caveat is that betting only grounds resolvable claims on workable horizons.
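The arithmetic in the worked example can be checked in a few lines. This is a minimal sketch for binary claims; the function names are illustrative, and the result is the expected log-wealth gap against a calibrated rival, so individual wealth paths will fluctuate around it.

```python
import math

def bernoulli_kl(p_true, q_stated):
    """D_KL(true || stated) in nats for a binary claim."""
    return (p_true * math.log(p_true / q_stated)
            + (1 - p_true) * math.log((1 - p_true) / (1 - q_stated)))

def relative_wealth(p_true, q_stated, n_bets):
    """Expected wealth ratio vs. a calibrated rival after n_bets resolved bets."""
    return math.exp(-n_bets * bernoulli_kl(p_true, q_stated))

kl = bernoulli_kl(0.7, 0.9)
ratio = relative_wealth(0.7, 0.9, 30)
print(f"D_KL = {kl:.3f} nats per bet")
print(f"relative stake after 30 bets = {ratio:.4f}")
```

The same functions show how quickly the filter weakens for milder overconfidence: a smaller gap between stated and true probability gives a smaller `bernoulli_kl` and therefore needs more resolved bets to reach the same separation.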

  • How do you estimate tail calibration when the high-confidence regime is, by design, the rarest — and therefore the least-sampled?
  • Does a sharpness floor plus tail calibration jointly defeat the “coherent ignorance” attack, or can a deceiver still satisfy both on the test distribution?
  • What is the right exchange rate between abstention cost and confident-error penalty for a given decision?