
Hardening: Overview

The opening of Part III: given the Core Model’s quantities, how do you engineer good values of them? One objective — eliminate confident error — and six families of mechanism that achieve it, each developed on its own page. This overview states the objective, maps the families onto the field’s two axes, gives the threat-coverage picture and the composition algebra that span families, and notes the machine affordances that let every mechanism run at superhuman intensity.

The only epistemic state that can hurt a judge unboundedly is being confidently wrong. A calibrated process, one whose stated confidence $c$ means it is wrong a $(1-c)$ fraction of the time, converts an open-ended tail risk into a bounded, priceable one: the judge sizes its commitment to $c$, and a deceiver, being confidently wrong by construction, shows up as a tail-calibration failure. So the safety-relevant target is not accuracy but tail calibration, $P(\text{false}\mid \text{asserts} \ge c)$ for $c$ near 1, and the agenda becomes “make the process never confidently wrong” (achievable: it can always hedge or abstain) rather than “make it accurate” (often impossible) (Allen et al. 2024; trained for in Wessel et al. 2025). [heuristic] Every family below is, ultimately, a way to make confident error rare or expensive.
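The tail-calibration target can be estimated from resolved claims. The following is a minimal sketch, not a method taken from the cited papers: the function names and the `(stated_confidence, was_correct)` record format are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

Record = Tuple[float, bool]  # (stated_confidence, was_correct); hypothetical format


def tail_error_rate(records: Iterable[Record], c: float) -> Optional[float]:
    """Empirical P(false | asserts >= c) over resolved claims.

    Returns None when no claim reaches confidence c, since there is
    nothing to estimate from.
    """
    tail = [correct for conf, correct in records if conf >= c]
    if not tail:
        return None
    return sum(1 for correct in tail if not correct) / len(tail)


def tail_calibration_gap(records: Iterable[Record], c: float) -> Optional[float]:
    """Observed tail error minus the error a calibrated process would show.

    A calibrated claim at confidence x is wrong a (1 - x) fraction of the
    time, so the allowed tail error is the mean of (1 - x) over the tail.
    A positive gap means the process is overconfident in the tail.
    """
    tail = [(conf, correct) for conf, correct in records if conf >= c]
    if not tail:
        return None
    observed = sum(1 for _, correct in tail if not correct) / len(tail)
    allowed = sum(1.0 - conf for conf, _ in tail) / len(tail)
    return observed - allowed
```

For example, a process that asserts 0.99 on 100 claims and gets 2 of them wrong has a tail error rate of 0.02 at $c = 0.95$, roughly twice what its stated confidence allows. Tail estimates need many resolved high-confidence claims, so in practice a confidence interval would accompany the point estimate.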

A hardening mechanism moves a process on one of the field’s two axes — it makes validated information cheaper, or it makes the output costlier to corrupt. The six families sort onto those axes, and the sort explains the Catalogue’s empty quadrant:

| Family | What it does | Primarily buys |
| --- | --- | --- |
| Calibration | make stated confidence match reality | cheaper trustworthy bits (the output self-rates) |
| Verifiability asymmetry | make checking cheaper than producing | cheaper validated bits |
| Independence / decorrelation | manufacture genuinely independent signals | cheaper validated bits + higher corruption cost |
| Invariance / low sensitivity | output insensitive to controllable inputs | higher corruption cost |
| Incentive-compatibility | make honesty the best strategy | higher corruption cost |
| Deterrence | off-equilibrium penalties | higher corruption cost at near-zero information cost |

The non-obvious structural point: most corruption-resistance is bought with verification, which costs real information-extraction work every time the process runs — so it raises both axes together and never reaches the cheap-and-incorruptible corner. Deterrence is the exception — bonds, clawbacks, tamper-evidence cost almost nothing on the honest path — which is why the empty quadrant is reachable only through it, and why identity/track-record infrastructure (which sets the maximum enforceable penalty) is the field’s most leveraged missing institution. Each family is developed on its own page, with its mechanism, its limit, its constructions, and worked bounds.

Sorting the families by the threats they close shows both the coverage and, more usefully, where the field has none. (Primary defenders listed; many threats are addressed weakly by several families.)

| Threat | Primary defending families | Where coverage is thin |
| --- | --- | --- |
| Sybil / identity forgery | Deterrence, Invariance | needs identity infrastructure to bind bonds |
| Prompt-injection / framing | Invariance, Calibration | |
| Persuasion / sycophancy | Incentive-compatibility, Verifiability, Calibration | claims that never resolve |
| Collusion | Independence, Incentive-compatibility | shared-training correlation |
| Sandbagging / capability-hiding | Independence, affordances (read-state) | interpretability is unreliable |
| Resolution-criteria gaming | Deterrence, Incentive-compatibility | who defines the criteria |
| Selection / optional-stopping | Verifiability, Deterrence | |
| Correlated error | Independence (weakly) | largely undefended; the field’s biggest hole |
| Goodhart-after-publication | Deterrence, externally-grounded families | style-mimicry of high-tier forms |

The standout is correlated error: every aggregation-, ensemble-, and panel-based defense assumes an independence that shared training pipelines erode, and no family closes it cleanly. That empty row is a research priority, not a footnote.
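One way to see how quickly shared correlation erodes a panel is the standard variance formula for equally correlated signals. The sketch below is illustrative only: the equicorrelation model and the name `effective_panel_size` are assumptions, not something the source specifies.

```python
def effective_panel_size(n: int, rho: float) -> float:
    """Effective number of independent judges among n equally correlated ones.

    Standard result: the mean of n unit-variance signals with common
    pairwise correlation rho has variance (1 + (n - 1) * rho) / n, the
    same as the mean of n / (1 + (n - 1) * rho) independent signals.
    Assumption: every pair of judges shares the same correlation rho.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("this sketch only covers 0 <= rho <= 1")
    return n / (1.0 + (n - 1) * rho)
```

Under this model the effective size is capped at $1/\rho$ however many judges are added: with $\rho = 0.5$, a panel of 100 behaves like fewer than two independent judges.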

The family pages treat constructions as atoms; real systems are stacks, and the catalogue’s value is conditional on how stacking behaves. Model a hardened stack as an attack graph: its corruption cost is the min-cut over attack paths. Series composition without randomization inherits the cheapest stage (the weakest link); randomized escalation with probability $p$ to an expensive stage adds roughly $p\cdot C_{\text{expensive}}$ to every path that must survive escalation: the inspection-game frontier made compositional. Whether real stacks multiply corruption costs or merely add the weakest link is the central unmeasured question, the same one the Process Catalogue flags.
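The composition rule can be sketched on an explicitly enumerated set of attack paths. This is a simplified model under stated assumptions, not a general min-cut solver: the `AttackPath` type, the idea that a path's cost is the sum of the stages it must defeat, and the flat expected surcharge of `p * c_expensive` are all illustrative choices.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AttackPath:
    """One way to corrupt the stack (hypothetical model for illustration)."""
    stage_costs: List[float]  # cost to defeat each stage on this path
    faces_escalation: bool    # True if this path must survive escalation


def path_cost(path: AttackPath, p: float, c_expensive: float) -> float:
    """Attacker's expected cost along a single path."""
    base = sum(path.stage_costs)
    surcharge = p * c_expensive if path.faces_escalation else 0.0
    return base + surcharge


def stack_corruption_cost(
    paths: Sequence[AttackPath], p: float = 0.0, c_expensive: float = 0.0
) -> float:
    """The stack is only as costly to corrupt as its cheapest attack path."""
    if not paths:
        raise ValueError("need at least one attack path")
    return min(path_cost(path, p, c_expensive) for path in paths)


# Series composition, assuming that defeating any one stage corrupts the output:
series = [
    AttackPath([5.0], True),
    AttackPath([50.0], True),
    AttackPath([20.0], True),
]
```

With no escalation, `stack_corruption_cost(series)` is 5.0, the weakest link. With `p=0.1` and `c_expensive=200.0`, every path gains about 20, so the cost rises to about 25. A path with `faces_escalation=False` would keep its base cost and could become the new cheapest route, which is why escalation coverage matters as much as escalation strength.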

The cross-cutting multiplier: machine-oversight affordances


The Process Catalogue lists processes evolved for humans, who have rights, privacy, and fatigue, and who exist as single un-forkable instances. None of these constraints binds an AI reasoner, and the slack is the field’s largest untapped resource. The affordances are not a seventh family; each lets you turn one of the six cranks harder than any human process allows:

| Affordance | Turns up which family | Example construction |
| --- | --- | --- |
| Reset / re-run | Independence | mind-wiped re-derivation |
| Fork mid-reasoning | Independence | checkpoint-fork cross-examination |
| Decorrelated resampling | Independence | confession-as-invariant interrogation |
| Sandboxed information diet | Invariance | reject conclusions resting on un-logged retrievals |
| Narrow, fixed action space | Invariance | lower the Lipschitz constant by removing levers |
| Full logging / no privacy | Deterrence | tamper-evident reasoning trace, free |
| Forced bets / bonds | Deterrence + Incentives | set the enforceable penalty $B$ |
| Read internal state | (all, weakly) | activation-honesty cross-check [speculative] |
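As one concrete reading of the “tamper-evident reasoning trace” row, a hash chain makes any after-the-fact edit to a logged step detectable. The sketch below is an assumption about how such a trace might be built, using only Python's standard `hashlib` and `json`; the entry layout and function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder predecessor hash for the first entry


def _digest(prev_hash: str, step: str) -> str:
    """SHA-256 over the previous hash and this step's text."""
    payload = json.dumps({"prev": prev_hash, "step": step}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_step(trace: List[Dict[str, str]], step: str) -> None:
    """Append a reasoning step, chaining it to the previous entry's hash."""
    prev_hash = trace[-1]["hash"] if trace else GENESIS
    trace.append(
        {"step": step, "prev": prev_hash, "hash": _digest(prev_hash, step)}
    )


def verify_trace(trace: List[Dict[str, str]]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in trace:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["step"]):
            return False
        prev_hash = entry["hash"]
    return True
```

The chain alone does not detect truncation of the most recent entries or a wholesale rewrite of the entire trace. Catching those requires committing the latest hash somewhere the logged process cannot modify.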

The “read their minds” affordance is the highest-leverage and the least reliable: truth appears linearly represented in activations (Marks & Tegmark 2024; Goldowsky-Dill et al. 2025), but chain-of-thought is often unfaithful (Chen et al. 2025), deception survives safety training (Hubinger et al. 2024), and training against an honesty probe can induce evasion (Cundy et al. 2025). Use it as corroboration, never as the load-bearing check.

The capstone — one fully-worked hardened stack on a real task — is planned as the closing chapter of this Part.

  • Are six families the right cut, or do calibration and verifiability collapse into one (both make the output self-checkable), and does independence belong under invariance?
  • Which families compose multiplicatively and which merely add the weakest link?
  • Is there a single inequality the whole Part is special cases of — corruption cost as a function of sensitivity, enforceable penalty, and effective independence?
  • What closes the correlated-error row, the one threat with no clean defender?