
What Grounds an Oversight Protocol?

Scalable-oversight protocols are best organized by what their incentives ground out in: a judge’s verdict, external resolution, internal coherence, or peer agreement. No grounding dominates — they cover different claim classes and fail differently — which is why the natural design is a composition, and why the field still lacks an economics of which claims to verify at what price.

Scalable oversight (DeepMind’s term: amplified oversight) asks: how does a weaker principal extract honest, high-quality information from a more capable agent — essentially the ELK problem restated? The incentive-based proposals — debate, market-making, prover-verifier games, recursive reward modeling — share a structure: each is a mechanism for making honesty the agent’s best strategy. Non-incentive approaches — AI control, interpretability, weak-to-strong generalization — sit outside this taxonomy by design.

The claim of this page: these protocols are best organized by what their incentive grounds out in. Every protocol needs some final arbiter that the reward chain terminates in, and there are only a few candidates:

| Grounding | Protocols | The agent profits by… | Characteristic failure |
|---|---|---|---|
| A judge’s verdict | Debate, consultancy, prover-verifier games, recursive reward modeling, market-making | convincing the judge | persuasion ≠ truth; obfuscation |
| External resolution | Prediction markets, retrodiction games, producer/consumer prediction games | beating the market and being right later | only covers resolvable claims; latency; resolution ambiguity/manipulation |
| Internal coherence | Consistency evaluations, Dutch-book tests | being unexploitable | consistent-but-wrong; evasive vagueness |
| Peer agreement | Peer prediction, Bayesian Truth Serum | reporting what informed peers would report | collusion; herding on plausible falsehoods |

No grounding dominates. They cover different claim classes, fail differently, and cost differently — which is exactly why they compose. (A possible fifth grounding — a principal’s own future decisions — was sketched in decision forecasting (2018); we leave it aside here.)

Judge-grounded: debate and its discontents

In safety via debate, two agents argue opposing sides and a weaker judge picks the winner; the hope is that the equilibrium strategy is honesty, because lies have refutable structure. The line has real theory: doubly-efficient debate gives complexity-theoretic guarantees. It also has real experiments: with human debaters and judges (Michael et al. 2023); with weak LLM judges overseeing strong LLM debaters (Kenton et al. 2024); the finding that selecting debaters for inference-time persuasiveness increased judge accuracy (Khan et al. 2024); and the finding that training debaters via self-play likewise improved judge accuracy over consultancy (Arnesen et al. 2024). Kenton et al., however, found such gains largely confined to information-asymmetric QA, not replicating across tasks.

The predict-the-judge structure predates LLMs: Prediction-Augmented Evaluation Systems (2018) proposed amplifying expensive evaluations with forecasters, and a 2019 experiment found crowd predictions of a trusted evaluator captured roughly 72% of the benefit-cost value of direct evaluation (Amplifying generalist research via forecasting).

Its pathologies all stem from the grounding. The training signal is what convinces the judge, so truth wins only where the argument landscape favors it. LLM judges add an instability of their own: prompt phrasing alone can shift their judgments by double digits, and ranges across models and personas are far larger (opinion fuzzing measures and partially tames this). The obfuscated arguments problem is the sharpest version: a dishonest debater can construct arguments whose flaws are too expensive to locate, and the judge cannot tell obfuscation from depth — a problem that symmetrically denies honest debaters their large implicit argument trees. A protocol-level fix has been proposed — prover-estimator debate — but is so far theoretical, and its honesty guarantee requires a stability assumption on the problem.

In economic terms: an obfuscated argument is like a mispriced security whose audit cost exceeds the arbitrage profit — nobody is paid enough to find the flaw. But the supply is endogenous — a debater manufactures obfuscated claims more cheaply than anyone can audit them — so verification subsidies alone are drainable; the load-bearing instrument is an assertion fee (stake-to-assert), with subsidies and decomposition bounties secondary. The game-theoretic framing generates none of this.
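To make the stake-to-assert point concrete, here is a minimal sketch under a deliberately crude model that is our assumption, not part of any cited protocol: a false claim earns `payoff` if it goes unaudited, is audited with probability `p_audit`, and an audit always finds the flaw and forfeits the `stake`. All names are hypothetical.

```python
def false_claim_ev(payoff: float, stake: float, p_audit: float) -> float:
    """Expected value to a debater of asserting one false claim.

    Assumed model: unaudited claims collect `payoff`; audited claims
    are always caught and lose `stake`.
    """
    return (1.0 - p_audit) * payoff - p_audit * stake


def min_deterrent_stake(payoff: float, p_audit: float) -> float:
    """Smallest stake at which a false claim stops being profitable."""
    if not 0.0 < p_audit <= 1.0:
        raise ValueError("p_audit must be in (0, 1]")
    return payoff * (1.0 - p_audit) / p_audit


# With no assertion fee, every false claim has positive expected value,
# so supply is limited only by how fast the debater can manufacture them.
print(false_claim_ev(payoff=10.0, stake=0.0, p_audit=0.2))
print(min_deterrent_stake(payoff=10.0, p_audit=0.2))
```

In this toy model the required stake grows as audits get rarer, which is one way to read the claim that subsidies alone are drainable: raising `p_audit` costs the overseer per claim, while a stake is paid by whoever asserts.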

Debate’s distinctive virtue: for claims that never resolve, it is the only grounding that delivers a verdict on the specific claim at hand — peer prediction offers only a statistical incentive, consistency only a partial filter.

Resolution-grounded: markets and prediction games

The alternative is to ground incentives in external resolution.

Our construction (not established literature), named so it can be referenced:

The consumer must be a committed, subsidized market maker scored by a proper scoring rule (Hanson’s LMSR): it knowingly loses money to buy information — the subsidy is the price of elicitation — which also dissolves the Milgrom–Stokey no-trade objection (a rational counterparty would update on the offer itself; a committed market maker cannot).
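A minimal sketch of the LMSR mechanics this relies on, assuming a market over a fixed set of mutually exclusive outcomes. The function names and the liquidity value are illustrative; only the cost function, the price formula, and the `b * ln(n)` loss bound are standard LMSR.

```python
import math


def lmsr_cost(q: list[float], b: float) -> float:
    """LMSR cost function C(q) = b * log(sum_i exp(q_i / b)).

    Computed with the max shifted out for numerical stability.
    """
    m = max(q)
    return m + b * math.log(sum(math.exp((qi - m) / b) for qi in q))


def lmsr_prices(q: list[float], b: float) -> list[float]:
    """Instantaneous prices; they are positive and sum to 1."""
    m = max(q)
    weights = [math.exp((qi - m) / b) for qi in q]
    total = sum(weights)
    return [w / total for w in weights]


def trade_cost(q: list[float], delta: list[float], b: float) -> float:
    """What a trader pays to move outstanding shares from q to q + delta."""
    q_new = [qi + di for qi, di in zip(q, delta)]
    return lmsr_cost(q_new, b) - lmsr_cost(q, b)


def max_subsidy(n_outcomes: int, b: float) -> float:
    """Worst-case market-maker loss when starting from uniform prices."""
    return b * math.log(n_outcomes)


q0 = [0.0, 0.0]
print(lmsr_prices(q0, b=100.0))             # uniform prior
print(trade_cost(q0, [50.0, 0.0], b=100.0))  # cost of 50 shares of outcome 0
print(max_subsidy(2, b=100.0))               # the bounded price of elicitation
```

The liquidity parameter `b` is the design knob: a larger `b` means prices move less per share and the consumer commits to a larger maximum subsidy.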

Hubinger’s market-making proposal is the closest relative, but it grounds out in the human judge’s stable equilibrium belief, not in any external event — hence the judge row above, market machinery notwithstanding. The producer/consumer game’s explicit novel move is replacing that human-equilibrium resolution with genuine external resolution. Empirical work is only beginning: a proof-of-concept demo (2024) and an inference-time multi-agent version reporting accuracy gains over single-agent baselines (Gho et al. 2025) — neither pays at resolution, and training models to a market’s stable equilibrium remains untested.

Persuading the consumer is necessary but not sufficient — false persuasion is priced at the resolution horizon, wherever resolution is trusted and actually reached. This removes judge-grounding’s core failure — a training signal that rewards persuasiveness rather than truth — as the terminal incentive, though whoever interprets the resolution criteria becomes a smaller persuasion target. The incentive failures here are documented, not hypothetical: real platforms’ scoring rules and tournament structures reward probability distortion and information hoarding (Alignment Problems with Current Forecasting Platforms), and corporate-internal prediction markets have repeatedly failed on tooling, question-writing cost, and social disruption (Prediction Markets in the Corporate Setting). The price is coverage: the protocol only directly handles resolvable claims, and resolution is slow. Resolution can also be deferred: fix a protocol now for selecting the most-trusted resolver at resolution time — Epistemic Selection Protocols — a question first posed as whether post-AGI evaluators can be trusted (2019).

Both costs can be pushed back:

  • Retrodiction can expand the resolvable set — hide known facts (redacted study results, post-cutoff events, masked dataset entries) and resolve immediately against them — if contamination can be controlled; held-out banks decay with each model release, and authoring genuinely hidden questions is itself costly.
  • Fast partial validators (consistency checks, spot-audits) let the system advance credit against slow resolution. The advance is a gameable proxy, so it wants escrow, clawbacks at resolution, and randomized audit timing.
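The advance-against-resolution idea can be sketched as simple bookkeeping. This is an illustration under our own assumptions (a fixed `advance_rate`, one settlement per account, no randomized audit timing modeled); the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CreditAccount:
    advanced: float = 0.0  # paid out early on a fast validator's score
    escrowed: float = 0.0  # held back until the slow resolution arrives


def advance(account: CreditAccount, provisional_credit: float,
            advance_rate: float = 0.5) -> float:
    """Pay part of a provisional score now and escrow the rest."""
    paid = provisional_credit * advance_rate
    account.advanced += paid
    account.escrowed += provisional_credit - paid
    return paid


def settle(account: CreditAccount, resolved_credit: float) -> float:
    """Net transfer to the producer at resolution.

    A negative value is a clawback: the fast proxy overpaid.
    """
    net = resolved_credit - account.advanced
    account.advanced = 0.0
    account.escrowed = 0.0
    return net


acct = CreditAccount()
advance(acct, provisional_credit=10.0)
print(settle(acct, resolved_credit=2.0))  # negative: clawback
```

A lower `advance_rate` shrinks what a producer can extract by gaming the fast validator, at the price of slower feedback.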

Question provenance needs the same care — a known cost center on human forecasting platforms. A producer that picks its own questions can arbitrage ambiguity, or under retrodiction concentrate bets on items it has memorized. Hence: separate proposer from trader, overseer-selected retrodiction targets, resolution criteria fixed before trading, no profit on questions later ruled ambiguous.

Consistency protocols need no judge and no resolution: the agent is scored on whether its own outputs can be arbitraged — Dutch-booked, as when its probabilities on a claim and its negation sum past 1. Cheap, unsupervised, scalable, and strictly partial — a useful screen, neither necessary (honest bounded reasoners are incoherent somewhere) nor sufficient (coherent deceivers exist). See Consistency Evaluations for the full treatment.
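The simplest case, a claim and its negation, can be checked in a few lines. This sketch assumes the agent's stated probabilities are treated as prices at which it will buy or sell a ticket paying 1 if the event occurs; the function name and return labels are illustrative.

```python
from typing import Optional


def dutch_book(p_claim: float, p_negation: float,
               tol: float = 1e-9) -> Optional[tuple[str, float]]:
    """Return (strategy, guaranteed profit per unit) or None if coherent.

    Exactly one of the two tickets pays out 1, so any gap between the
    combined price and 1 is a sure profit for the counterparty.
    """
    total = p_claim + p_negation
    if total > 1.0 + tol:
        # Sell both tickets to the agent: collect `total`, pay out 1.
        return ("sell_both_to_agent", total - 1.0)
    if total < 1.0 - tol:
        # Buy both tickets from the agent: pay `total`, collect 1.
        return ("buy_both_from_agent", 1.0 - total)
    return None


print(dutch_book(0.7, 0.5))  # incoherent: probabilities sum past 1
print(dutch_book(0.6, 0.4))  # coherent on this pair
```

Passing this check says nothing about whether 0.6 is the right number, which is the consistent-but-wrong failure in miniature.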

Peer prediction scores a report by how well it predicts other informed agents’ reports, making truth-telling an equilibrium with no judge and no resolution — built precisely for claims with no ground truth (Miller, Resnick & Zeckhauser 2005; Prelec 2004). The mechanism-design line has since hardened it considerably — removing the common-prior assumption (Witkowski & Parkes 2012), making truth-telling the maximally-paid equilibrium via Correlated Agreement (Shnayder et al. 2016), and unifying the family under an information-theoretic, dominantly-truthful framework (Kong & Schoenebeck 2016). It remains the least trusted grounding here: the guarantees assume non-colluding reporters, and LLM copies are the easiest colluders imaginable.
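For intuition, here is a simplified sketch in the spirit of the Correlated Agreement mechanism, written from our reading of it rather than taken from the paper: two agents, a small set of discrete reports, and a joint report distribution `joint` that is assumed to be known or estimated. Task identifiers and function names are hypothetical.

```python
def ca_score_matrix(joint: list[list[float]]) -> list[list[int]]:
    """Score 1 for report pairs that co-occur more often than chance.

    joint[a][b] is the probability that agent 1 reports a and agent 2
    reports b on the same task.
    """
    k = len(joint)
    p1 = [sum(joint[a]) for a in range(k)]
    p2 = [sum(joint[a][b] for a in range(k)) for b in range(k)]
    return [[1 if joint[a][b] - p1[a] * p2[b] > 0 else 0 for b in range(k)]
            for a in range(k)]


def ca_payment(score: list[list[int]], r1: dict, r2: dict,
               bonus: str, pen1: str, pen2: str) -> int:
    """Agreement score on a shared task minus the score on unrelated tasks.

    `bonus` is a task both agents answered; `pen1` and `pen2` are two
    different tasks, one answered by each agent.
    """
    return score[r1[bonus]][r2[bonus]] - score[r1[pen1]][r2[pen2]]


joint = [[0.4, 0.1],
         [0.1, 0.4]]  # assumed: reports are positively correlated
score = ca_score_matrix(joint)
r1 = {"t1": 0, "t2": 1, "t3": 0}
r2 = {"t1": 0, "t2": 1, "t3": 0}
print(score)
print(ca_payment(score, r1, r2, bonus="t1", pen1="t2", pen2="t3"))
```

Under the assumption that tasks are drawn identically, a report that ignores the task is scored the same way on bonus and penalty tasks and so earns zero in expectation. Nothing in this payment rule detects coordinated reporters, which is where the collusion worry bites.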

Since the groundings cover complementary territory, the natural design is layered:

flowchart LR
    C[Unresolvable claim] --> D[Debate]
    D --> S1[Resolvable subclaims] --> M[Markets / prediction games]
    D --> S2[Debater belief structure] --> K[Consistency checks]
    M --> G[Resolution / retrodiction]

Debate operates over the unresolvable claim, but its subclaims are anchored by markets wherever they touch resolvable ground, and the debaters’ probability assignments are continuously consistency-checked. The judge’s job shrinks from “evaluate everything” to “evaluate the genuinely unresolvable residue” — which is the only part a judge was ever the right tool for. Each edge buys something specific — markets cap the payoff on false resolvable subclaims; consistency checks catch incoherent belief structure — but the cover is partial: a consistent, obfuscated, unresolvable argument slips all three.

A final observation. Debate gave oversight a game theory: how to verify a claim, given unbounded debaters and a bounded judge. It says nothing about which claims are worth verifying, what verification should cost, or how to allocate bounded oversight across thousands of claims. That allocation layer — value of information, subsidies, question weights, portfolio effects — is an economics. AI control has the nearest existing piece — explicit audit budgets and defer-to-trusted thresholds — but for catching bad actions, not pricing truth-seeking. Epistemic Impact Analysis is our attempt to extend allocation from audit budgets to epistemic value across claims, and Overseeing Automated Research is the application where the economics becomes the whole point.

Composition shrinks the judge’s job to the unresolvable residue; the same move generalizes into a design discipline for the whole pipeline. Deception is a live risk only inside a conjunction of conditions — verification gap, reproduction gap, output-dependence, goal-divergence, never-resolves, stakes — so the program is to route as many sub-questions as possible out of that conjunction: make them self-checkable (break verification), trusted-reproducible (break reproduction), or eventually-resolvable (break never-resolves). This is the defender’s design-time move: the untrustworthy-sources chapter frames routing as attack-surface reduction — the first move of a Stackelberg game, not a cleanup step. What remains is the residual attack surface — questions that turn on an irreplaceable, output-critical advantage, never resolve, and where goals diverge — and on those you accept trusted-frontier quality (the Blackwell-best trusted experiment) unless hardening that specific question pays for itself. That fallback is the defer-to-trusted move of AI control (Greenblatt et al. 2024), generalized from runtime actions to question types.
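The routing discipline above can be sketched as a small decision function. The six boolean fields mirror the conjunction named in the text; the field names, route labels, and the order of checks are our illustrative assumptions, not a specified algorithm.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Question:
    verification_gap: bool   # overseer cannot check the answer itself
    reproduction_gap: bool   # no trusted source can reproduce it
    output_dependence: bool  # the decision actually turns on the answer
    goal_divergence: bool    # the source may want a different outcome
    never_resolves: bool     # no external resolution will ever arrive
    high_stakes: bool


def in_deception_conjunction(q: Question) -> bool:
    """Deception is treated as live only if every condition holds."""
    return all(getattr(q, f.name) for f in fields(q))


def route(q: Question) -> str:
    if not q.verification_gap:
        return "self-check"
    if not q.reproduction_gap:
        return "trusted-reproduce"
    if not q.never_resolves:
        return "await-resolution"
    if in_deception_conjunction(q):
        return "residue"  # fall back to trusted-frontier quality
    return "use-source"   # conjunction broken some other way


print(route(Question(True, True, True, True, True, True)))
print(route(Question(True, True, True, True, False, True)))
```

Only the `"residue"` branch spends oversight budget; every other branch is a way of making the question cheap to trust.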

Two consequences:

  • This sharpens the allocation question the economics above raised. Spend hardening budget only where the value of the source’s advantage over the trusted frontier is both large and irreducible; everywhere else, routing is cheaper than oversight. AI control supplies a partial instance already — a fixed audit budget spent on the most-suspicious tail — but it routes actions by suspicion at runtime rather than sub-questions by epistemic type at decomposition time, and stops short of a value-of-information economics.
  • Routing can be coarse, because its errors are asymmetric. Misrouting a safe question into the residue only costs some quality (you fall back to the trusted frontier); misrouting a dangerous one as safe is what gets you deceived. This is the logic of selective classification — Chow’s reject-option rule and Neyman–Pearson / cost-sensitive thresholding all push the boundary toward the costlier error — so the router should err toward “treat as residue,” and a crude conservative router suffices; the hard problem of perfectly categorizing real-world questions need not be solved first. The caveat carries over too: asymmetric-cost theory tells you where to set the threshold, not that an uncalibrated router actually hits it.

Open questions

  • Head-to-head empirics: on the same task set with the same weak overseer, how do resolution-grounded protocols compare to debate? Standardized comparison infrastructure is emerging (a benchmark for oversight protocols; scaling laws for oversight across capability gaps), but no published comparison yet includes payment-at-resolution incentives or deceptive-producer conditions — and some prior optimism may be artifactual: Recchia et al. 2025 find simple oversight protocols show no overall advantage once the evaluators’ knowledge edge over the model is removed, an edge that vanishes as models scale.
  • How far can retrodiction extend resolution-grounding before models’ background knowledge contaminates the held-out facts?
  • Do peer-prediction mechanisms survive contact with colluding LLM agents? Qiu et al. 2026 post-train LLMs via peer prediction and find resistance to deception grows with the capability gap, but collusion remains open.
  • What does equilibrium look like in the hybrid protocol — does anchoring subclaims actually change debate’s training dynamics, or just its evaluation?
  • Who controls decomposition in the hybrid? A dishonest debater could route convenient subclaims onto resolvable ground while the load-bearing claim stays in the unresolvable residue.
  • Can questions be cheaply and conservatively routed out of the deception conjunction at scale, and what is the quality cost of the residue left behind — how often must we settle for trusted-frontier quality rather than the best available source?
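On conservative routing specifically, the asymmetric-error argument made earlier reduces to a standard cost-sensitive threshold. A minimal sketch, assuming the router outputs a probability `p_dangerous` and that the two error costs can be put on a common scale; the names and numbers are illustrative.

```python
def safe_threshold(cost_quality_loss: float, cost_deceived: float) -> float:
    """Probability of danger below which treating a question as safe
    has lower expected cost than sending it to the residue.

    Derived from p * cost_deceived < (1 - p) * cost_quality_loss.
    """
    return cost_quality_loss / (cost_quality_loss + cost_deceived)


def route_by_risk(p_dangerous: float, cost_quality_loss: float,
                  cost_deceived: float) -> str:
    threshold = safe_threshold(cost_quality_loss, cost_deceived)
    if p_dangerous < threshold:
        return "treat-as-safe"
    return "treat-as-residue"


# If being deceived is assumed 99x worse than losing some quality,
# a 5% chance of danger is already enough to route to the residue.
print(safe_threshold(1.0, 99.0))
print(route_by_risk(0.05, 1.0, 99.0))
print(route_by_risk(0.005, 1.0, 99.0))
```

The rule is only as good as `p_dangerous`: as noted above, the theory places the threshold but does not calibrate the router.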