Hardening: Calibration
Family 1 of the Hardening overview. Calibration makes a process’s stated confidence mean what it says, so a judge can size its commitment safely and a deceiver — confidently wrong by construction — shows up as a measurable defect. The lever, its sharp limit, the constructions with their cheapest attacks, and two worked bounds with numbers.
The lever
The only epistemic state that can hurt a judge unboundedly is confident error. A calibrated process, one where stated confidence $p$ means it is wrong a fraction $1-p$ of the time, turns an open-ended tail risk into a bounded, priceable one. The safety-relevant target is therefore not accuracy but tail calibration, for $p$ near 1, since that is exactly where over-commitment and deception live (Allen et al. 2024; trained for in Wessel et al. 2025). The canonical handle: act only above a threshold $\tau$ and the confident-wrong rate is at most $1-\tau$.
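As a minimal sketch of that handle, the snippet below measures the confident-wrong rate among forecasts a judge would act on. It assumes forecasts arrive as (stated confidence, was-correct) pairs; the function name `confident_wrong_rate` and the toy data are illustrative, not taken from the cited work.

```python
from typing import Iterable, Optional, Tuple


def confident_wrong_rate(
    forecasts: Iterable[Tuple[float, bool]], threshold: float
) -> Optional[float]:
    """Fraction of acted-on forecasts (confidence >= threshold) that were wrong.

    Returns None when nothing clears the threshold, so a caller cannot
    mistake "no data" for "no errors".
    """
    acted = [correct for conf, correct in forecasts if conf >= threshold]
    if not acted:
        return None
    wrong = sum(1 for correct in acted if not correct)
    return wrong / len(acted)


# Hypothetical toy data: (stated confidence, outcome was correct).
toy = [(0.95, True), (0.97, True), (0.99, False), (0.96, True), (0.60, False)]
rate = confident_wrong_rate(toy, threshold=0.9)
print(rate)  # 0.25: one of the four acted-on forecasts was wrong
```

For a process that is calibrated in the tail, this rate should sit near the average of one minus the stated confidence over the acted-on items, and therefore below one minus the threshold. The toy data above fails that check at a threshold of 0.9.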
The limit
Calibration is necessary, not sufficient, and the gap is exactly where the adversaries are. A process can be perfectly calibrated and useless (calibrated to the base rate, with no resolution, which is why a sharpness floor is mandatory), and a competent deceiver can be calibrated on the observables the test uses while wrong off-distribution. Tail calibration is also the hardest thing to measure precisely, because the high-confidence regime is rare by construction, so the certificate that matters most is the one with the least data. Calibration screens out unforced overconfidence cheaply; it does not, alone, catch a strategic liar.
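The "calibrated and useless" case can be made concrete with the Murphy decomposition of the Brier score (reliability minus resolution plus uncertainty). The sketch below is illustrative only: it groups forecasts by their exact stated value, and the outcome sequence and both forecasters are hypothetical.

```python
from collections import defaultdict


def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    Forecasts are grouped by their exact stated value, so
    Brier = reliability - resolution + uncertainty holds exactly.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)
    reliability = sum(
        len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()
    ) / n
    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty


# Hypothetical outcomes with base rate 0.7.
outcomes = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
lazy = [0.7] * 10  # always states the base rate
sharp = [0.9 if y else 0.1 for y in outcomes]  # discriminates between cases

lazy_rel, lazy_res, lazy_unc = murphy_decomposition(lazy, outcomes)
sharp_rel, sharp_res, sharp_unc = murphy_decomposition(sharp, outcomes)
print(round(lazy_rel, 3), round(lazy_res, 3))    # 0.0 0.0
print(round(sharp_rel, 3), round(sharp_res, 3))  # 0.01 0.21
```

The base-rate forecaster has a perfect reliability term and zero resolution, which is the pattern a sharpness floor is meant to reject. The discriminating forecaster is slightly miscalibrated yet carries all the available resolution.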
Constructions
| Construction | Bound / estimate | Defends against | Cheapest attack (≈ cost) | Maturity · source |
|---|---|---|---|---|
| Tail-calibration certification | confident-wrong rate $\le 1-\tau+\epsilon$ at threshold $\tau$ with tail miscalibration $\epsilon$ [heuristic] | over-commitment; deception | be calibrated on the test distribution, wrong off it (≈ free if the test set is static) | prototyped · Allen 2024 |
| Priced abstention | risk-coverage curve; abstain when E[loss] > cost [standard shape] | hallucinated confidence out-of-knowledge | answer easy items, abstain on hard ones to dodge scoring (≈ free) | prototyped · Geifman & El-Yaniv 2017 |
| Kelly bankruptcy league | overconfident removed at rate $D_{\mathrm{KL}}(p \parallel q)$ nats per bet, for true $p$ and stated $q$ [exact] | systematic overconfidence | farm a colluding identity, or re-key after ruin (≈ cost of a Sybil) | theoretical · Kleinberg 2023 |
| Sharpness floor (anti-laundering) | Brier $=$ reliability $-$ resolution $+$ uncertainty [exact] | calibration faked via base rates | answer only easy, high-resolution questions (≈ free) | deployed · Murphy 1973 |
| Cross-horizon hazard consistency | deviation from the martingale-implied path [standard shape] | inconsistent confidence trajectories | drift smoothly along a plausible-but-wrong hazard curve (low) | prototyped · Augenblick & Rabin 2021 |
| Self-insurance pricing | premium demanded stated error probability [heuristic] | unstated overconfidence | quote an honest premium, never pay out absent escrow (≈ free) | speculative · — |
| Reverse-Brier bounty | score adversary’s confident-error-find-rate [heuristic] | confident errors an honest test misses | retreat to evasive vagueness; or bounty-hunter collusion (low) | speculative · — |
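
The priced-abstention row can be sketched in a few lines. This is a minimal illustration under 0/1 loss, assuming confidences and correctness labels are already available; the function names, the loss and cost parameters, and the toy data are all hypothetical.

```python
def should_abstain(confidence: float, loss_if_wrong: float, abstain_cost: float) -> bool:
    """Abstain when the expected loss of answering exceeds the price of abstaining."""
    expected_loss = (1.0 - confidence) * loss_if_wrong
    return expected_loss > abstain_cost


def risk_coverage_curve(confidences, correct):
    """Return [(coverage, selective_risk)] as the abstention threshold is lowered.

    Items are answered in order of decreasing confidence; selective risk is
    the error rate among the answered items.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve = []
    errors = 0
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        curve.append((k / len(order), errors / k))
    return curve


# Hypothetical toy data.
confs = [0.99, 0.90, 0.80, 0.60]
right = [True, True, False, False]
curve = risk_coverage_curve(confs, right)
print(curve[1])   # (0.5, 0.0): answering the top half incurs no errors
print(curve[-1])  # (1.0, 0.5): answering everything is wrong half the time
```

Note that the curve alone does not price the attack listed in the table: a process that answers only easy items looks excellent at low coverage, so the abstention cost has to be charged for the score to mean anything.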
Worked bounds
Tail-calibration certification. Certify $\Pr[\text{wrong} \mid p \ge \tau] \le 1-\tau+\epsilon$ directly, where $\tau$ is the action threshold and $\epsilon$ the measured tail miscalibration; then acting only on assertions above $\tau$ caps the confident-wrong rate at $1-\tau+\epsilon$ [heuristic]. Given a chosen $\tau$ and a measured $\epsilon$, that cap is a single number the judge can insure against. The catch is in the limit above: $\epsilon$ is hardest to estimate precisely exactly where it matters, since assertions with $p \ge \tau$ are rare.
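To see how sample size limits the certificate, the sketch below puts a one-sided Hoeffding upper bound on the tail error rate. This is a swapped-in, generic concentration bound, not the certification procedure of the cited work, and it assumes the tail assertions are independent draws from a fixed distribution. The function name and the sample sizes are hypothetical.

```python
import math


def hoeffding_upper_bound(errors: int, n: int, delta: float = 0.05) -> float:
    """Upper bound on the true error rate, valid with probability >= 1 - delta.

    Assumes n independent, identically distributed tail assertions
    (an assumption made for this sketch only).
    """
    if n <= 0:
        raise ValueError("need at least one tail assertion")
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return min(1.0, errors / n + slack)


# Zero observed errors in both cases; only the sample size differs.
bound_small = hoeffding_upper_bound(errors=0, n=50)
bound_large = hoeffding_upper_bound(errors=0, n=5000)
print(round(bound_small, 3))  # 0.173
print(round(bound_large, 3))  # 0.017
```

With 50 clean tail assertions the bound cannot certify an error rate below roughly 17 percent, which is useless for thresholds near 1. Hoeffding is conservative near zero and an exact binomial interval would be tighter, but the dependence on the number of tail samples remains. The independence assumption is also exactly what an off-distribution deceiver violates.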
The Kelly bankruptcy league. Run agents in a market where each bets its credences with Kelly (log-optimal) sizing. Bankroll growth equals the information actually held; an overconfident agent's growth rate relative to a calibrated rival is negative, and it is driven out at rate $D_{\mathrm{KL}}(p \parallel q)$ nats per bet, where $p$ is the true probability and $q$ the stated one [exact]. Concretely, an agent that states 0.9 when the truth is 0.7 forfeits log-wealth at $D_{\mathrm{KL}}(0.7 \parallel 0.9) \approx 0.154$ nats per resolved bet, so after $n$ bets it holds about $e^{-0.154n}$ of a calibrated rival's stake: an exponential overconfidence filter. The caveat is that betting only grounds resolvable claims on workable horizons.
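The arithmetic behind the 0.9-versus-0.7 example can be checked directly. The sketch below computes the Bernoulli KL divergence in nats and the resulting stake relative to a calibrated rival; the function names and the choice of 100 bets are illustrative assumptions.

```python
import math


def kl_bernoulli(p_true: float, q_stated: float) -> float:
    """KL divergence D(p || q) between two Bernoulli distributions, in nats."""
    return (
        p_true * math.log(p_true / q_stated)
        + (1.0 - p_true) * math.log((1.0 - p_true) / (1.0 - q_stated))
    )


def relative_stake(p_true: float, q_stated: float, n_bets: int) -> float:
    """Typical wealth of a Kelly bettor using q, relative to one using the true p."""
    return math.exp(-n_bets * kl_bernoulli(p_true, q_stated))


loss_per_bet = kl_bernoulli(0.7, 0.9)
print(round(loss_per_bet, 4))  # 0.1537 nats per resolved bet

# Hypothetical horizon of 100 resolved bets.
stake_after_100 = relative_stake(0.7, 0.9, n_bets=100)
print(stake_after_100 < 1e-6)  # True
```

The relative stake is the exponential of the expected log-wealth gap, so it describes the typical trajectory and not a guarantee for every individual run.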
Open questions
- How do you estimate tail calibration when the high-confidence regime is, by design, the rarest and therefore the least-sampled?
- Does a sharpness floor plus tail calibration jointly defeat the “coherent ignorance” attack, or can a deceiver still satisfy both on the test distribution?
- What is the right exchange rate between abstention cost and confident-error penalty for a given decision?