The Process Catalogue

Seventeen reasoning processes — from formal proof to single LLM judges, with Metaculus-style tournaments and Kalshi-style markets as distinct entries — compared on information cost, scope, cost to bias, and built-in biases. The headline finding: cheap and hard-to-corrupt do not yet coexist; the empty quadrant is the field’s target.

The Core Model sketches two headline numbers for every reasoning process: cost per validated bit (what extracting trustworthy information costs) and the corruption cost curve (what an adversary must spend to bend the output). The table keeps the first as a compact rating (Info) and unpacks the second into two distinct things that the old single column conflated:

  • Cost & challenges to bias it — what an active adversary must spend or pull off, and where the cheap attack surface is. This is the corruption cost curve.
  • Built-in biases — the systematic distortions the process exhibits with no adversary at all: who self-selects in, what questions get asked, what the incentive structure rewards by default. These are free to exploit (an adversary just leans on them) and they bound accuracy even in a fully honest world.

A fourth column, Scope / applicability, records what kinds of claims the process can handle at all — dominance comparisons only make sense between processes whose scopes overlap, and several rows buy corruption resistance precisely by narrowing scope.

Ratings: Low / Medium / High. High info cost is bad; high cost-to-bias is good.

Plotting the catalogue on the two axes is the field’s version of a periodic table: it shows which processes dominate which, where the efficient frontier runs, and — most usefully — which cells are empty.

[Figure: the catalogue plotted by cost per validated bit (horizontal axis, cheap → expensive) against cost to corrupt (vertical axis, easy → hard). The quadrant that is cheap to run and expensive to corrupt is labeled "The empty quadrant": nothing lives here yet. Processes are grouped into three families: grounded in resolution / track records, institutional judgment, and AI-native.]
Positions are version-0 qualitative judgments from the Process Catalogue, drawn to be argued with. Replacing them with measured coordinates is one of the field's central projects.
| Process | Info | Scope / applicability | Cost & challenges to bias it | Built-in biases |
|---|---|---|---|---|
| Formal proof + mechanical checking (Lean/Coq-style) | H to produce, L to verify | Only formalizable claims (math, program properties); the spec must be writable | H: the checker is effectively unownable; the viable attack is the spec itself (prove the wrong theorem) and the formal–informal gap | Covers only what formalizes; effort flows to provable variants of questions rather than the questions asked |
| Full replication (registered, independent teams) | H (≈ cost of original) | Empirical claims with shareable methods; excludes one-off events and decade-scale effects | H: must corrupt several independent teams; the cheap attack is correlated methods (everyone inherits the same flawed protocol) | Selection toward cheap, fast, fashionable studies; the original authors’ framing defines what counts as “replicating” |
| Track-recorded forecaster team (Good Judgment-style) | M | Resolvable questions on months-to-years horizons; transfer to novel domains unproven | H: verified long records take years to fake; cheaper targets are the question-writer and the resolver | Self-selected demographic (numerate hobbyists); calibrated only where past questions resolved; conservative on regime change |
| Prediction tournament, reputation-scored (Metaculus-style) | M | Broad: long-horizon, thin-interest, and politically sensitive questions all work, since no capital is locked up; needs crisp resolution criteria | M: accounts are cheap but score-weighting makes influence require a slowly built record; documented scoring-rule gaming and info-hoarding near deadlines; question-framing and the resolver are the cheap targets | Herding toward the visible community median; participant monoculture; the platform decides which questions exist |
| Prediction market, real-money (Kalshi/Polymarket-style) | M | Narrower: needs regulatory approval or crypto rails, enough trader interest for liquidity, and short-to-medium horizons (capital lockup punishes long-dated questions) | M–H, scaling with liquidity: holding the price wrong means paying informed traders’ profits indefinitely; thin or new markets are cheap to move; resolution-criteria lawyering is documented | Favorite-longshot bias; interest-rate discount distorts long-dated prices; trader pool skews risk-tolerant; fees and house question selection |
| Prediction market, play-money (Manifold-style) | L–M | Widest question variety: anyone can write a market; resolution often by the question’s author | L–M: play currency is farmable; author-resolution invites self-dealing; only reputational stakes back the price | Entertainment-driven question selection; whale accounts dominate; author-as-resolver conflict of interest |
| Adversarial proceedings (court-style) | H | Disputes between identified parties under an existing rule-set; one verdict per case, slowly | M–H: procedural safeguards are expensive to subvert outright, but resource asymmetry reliably wins at the margin; forum- and judge-capture are documented | Precedent lock-in; procedure over truth; rules structurally favor the represented and the wealthy |
| Expert panel / consensus report (IPCC/NAS-style) | M | Questions a recognized discipline claims; works without resolvable outcomes, which is also the weakness | M: capture the convener’s selection process or fund the field’s pipeline; social pressure inside the panel is free | Credentials ≠ accuracy and no calibration record; groupthink; paradigm conservatism; consensus drafting buries disagreement |
| Financial audit (Big-Four-style) | M | Claims reducible to documented transactions checked against codified standards | L–M: the auditee pays the auditor; capture via fees and consulting cross-sell is documented, not hypothetical | Incentive inversion baked into the business model; checklist ossification; materiality thresholds hide small-but-systematic distortion |
| Peer review (journal-style) | M | Gatekeeps publishability in any field, but outputs accept/reject verdicts, not probabilities | L–M: suggest friendly reviewers, log-roll, exploit prestige signals; never benchmarked, so corruption is invisible | Novelty preferred over replication; prestige and institutional bias; conservatism toward field consensus; unpaid labor → noise |
| Composite indices / rankings (CPI-style) | L (marginal) | Anything quantifiable at scale; enables cross-unit and cross-time comparison | L: Goodhart the moment the formula is public; lobbying the weighting is cheaper than changing the world | Measures what’s easy to measure; weightings are hidden judgment calls; proxy drift |
| Online review systems (Amazon/Yelp-style) | L | Mass-market experience goods with many independent raters | L: fake reviews and brigading are cheap absent identity infrastructure; platforms fight a losing volume war | Extreme experiences self-select into reviewing; recency and survivorship effects; platform moderation follows platform incentives |
| Single LLM judge | L | Anything expressible in text, instantly; the widest scope in the table and the floor of trust | L: prompt injection, sycophancy, persuasion artifacts; attacks transfer across deployments of the same model | Position, verbosity, and style biases; training-data majority view; RLHF agreeableness; confident error |
| Multi-model LLM panel | L | Same as single judge | L–M: must fool several models at once, but shared training pipelines correlate their failures | False independence: agreement without accuracy; common-corpus priors |
| Consistency battery | L (unsupervised) | Any producer that answers many related questions; checks coherence, never truth | L–M: a coherent worldview passes; projecting onto consistent-but-wrong beliefs defeats it | Rewards internal coherence over correspondence with reality; necessary, never sufficient |
| Debate protocol (judge-grounded) | M | Claims where decisive evidence can surface within a bounded exchange | M: obfuscated arguments; out-persuading beats out-arguing; the judge is the attack surface | Persuasion ≠ truth; the judge’s priors and attention limits set the verdict |
| Output-metered producer market (resolution-grounded) | M | Only resolvable claims; producers paid per validated output | M–H where resolution is trusted: advance-credit gaming; manipulating the resolution layer | Portfolio tilts toward claims that resolve cheaply and soon |
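
The dominance and frontier reading of the table can be sketched in a few lines. This is a rough illustration, not a measurement: the numeric encoding of L/M/H, the midpoint treatment of range ratings, and the function names are all assumptions introduced here. It also ignores the scope column, so it compares rows that the text says should only be compared when their scopes overlap.

```python
# Ordinal encoding of the table's ratings. Mapping ranges to midpoints is an
# assumption made for illustration, not part of the catalogue.
SCORE = {"L": 1.0, "L-M": 1.5, "M": 2.0, "M-H": 2.5, "H": 3.0}

# (info cost, cost to bias) per row, read off the table with qualifiers such as
# "(marginal)" or "scaling with liquidity" dropped. Formal proof is left out
# because its info rating is split (H to produce, L to verify).
RATINGS = {
    "Full replication": ("H", "H"),
    "Track-recorded forecaster team": ("M", "H"),
    "Prediction tournament": ("M", "M"),
    "Real-money market": ("M", "M-H"),
    "Play-money market": ("L-M", "L-M"),
    "Adversarial proceedings": ("H", "M-H"),
    "Expert panel": ("M", "M"),
    "Financial audit": ("M", "L-M"),
    "Peer review": ("M", "L-M"),
    "Composite indices": ("L", "L"),
    "Online review systems": ("L", "L"),
    "Single LLM judge": ("L", "L"),
    "Multi-model LLM panel": ("L", "L-M"),
    "Consistency battery": ("L", "L-M"),
    "Debate protocol": ("M", "M"),
    "Output-metered producer market": ("M", "M-H"),
}


def dominates(a: str, b: str) -> bool:
    """True if a is no costlier and no easier to bias than b, and strictly better on one axis."""
    info_a, bias_a = (SCORE[r] for r in RATINGS[a])
    info_b, bias_b = (SCORE[r] for r in RATINGS[b])
    return info_a <= info_b and bias_a >= bias_b and (info_a < info_b or bias_a > bias_b)


def frontier() -> list[str]:
    """Rows that no other row dominates (low info cost is good, high cost to bias is good)."""
    return sorted(p for p in RATINGS if not any(dominates(q, p) for q in RATINGS))


def empty_quadrant() -> list[str]:
    """Rows rated L on info cost and at least M-H on cost to bias."""
    return [p for p, (info, bias) in RATINGS.items()
            if SCORE[info] == 1.0 and SCORE[bias] >= 2.5]


if __name__ == "__main__":
    print("frontier:", frontier())
    print("cheap and hard to corrupt:", empty_quadrant())
```

Under this encoding the cheap-and-robust list comes back empty, which is the table's headline finding restated. Any change to the encoding or to a single rating can move the frontier, which is one reason the ratings are offered as positions to argue with.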

The frontier runs through validation, not intelligence. The hard-to-bias rows share one feature: their outputs are checked against something an adversary cannot cheaply own — a proof checker, independent replication, a long resolved track record, informed counterparties with money at stake. The easy-to-bias rows share the inverse: their verdicts bottom out in an ownable judgment (a reviewer, a rating, a single model). This is the grounding taxonomy of What Grounds an Oversight Protocol? showing up as an empirical regularity.

Scope and corruption resistance trade off, visibly. The Metaculus/Kalshi pair is the cleanest illustration: real money makes sustained manipulation expensive where liquidity exists, but regulation, capital lockup, and trader interest shrink the question space to short-horizon, mass-interest events. Reputation scoring keeps the question space wide open — long-horizon, obscure, politically sensitive — but lowers the price of an attack from “pay the informed traders” to “game the scoring rule or capture the resolver.” Several other rows make the same trade (formal proof is the extreme case: near-incorruptible, vanishingly narrow). A fair comparison between two processes has to hold the question class fixed first.

Built-in biases are the cheapest attack of all. The last column matters adversarially, not just statistically: an attacker rarely needs to corrupt a process when they can simply pose the question where the process already leans their way — pick the venue, pick the framing, pick the resolution criteria. Built-in biases are corruption an adversary gets at zero marginal cost, which is why the column belongs in a catalogue ostensibly about corruption cost.

Two kinds of built-in bias, and only one is steerable. The column above conflates them. Broad / structural biases tilt outputs across all decisions (position and verbosity effects, RLHF agreeableness, the training-data majority view) and aim at no particular outcome. Decision-relative (goal) bias, $b_\pi(D)$ (nearest standard term: directional bias), is a systematic lean toward one resolution of a specific decision $D$: an auditor favoring the client who pays it, an LLM that would pick “Anthony” over “Ted” in a debate it has no information about (see The Funding Effect). The two come apart: a process can be riddled with broad biases yet perfectly neutral on a given $D$, or stylistically clean yet captured on the one decision that matters.

The normative target follows: a trustworthy reasoner should have $b_\pi(D) \approx 0$ for any $D$ it has no truth-relevant information about. Unusually, that target is checkable without ground truth: swap the labels of a decision the process is ignorant about and the output should swap symmetrically; residual asymmetry is the goal bias. This label-swap neutrality check is a sibling of the gauge-invariance and identity-masking constructions and an unsupervised member of the consistency battery.

It has two rigorous precedents: position-bias swapping in LLM-judge evaluation (judge both orderings; residual asymmetry is the bias; Wang et al. 2023) and, formally, counterfactual fairness (Kusner et al. 2017: invariance under swapping an irrelevant attribute, no ground truth required). It has one sharp limit: it is blind to symmetric content-borne bias, where both labels trigger the same cue, and to non-directional biases like verbosity or miscalibration. It is also what the source side of deception preconditions calls goal-divergence: with no goal bias, there is nothing to deceive toward.
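
A minimal sketch of the label-swap check, assuming a hypothetical judge interface `judge(first, second)` that returns the probability that the first-listed option wins. The interface, the toy judges, and the halving convention are illustrative assumptions, not an API or estimator taken from the cited work.

```python
from typing import Callable

# Hypothetical interface: judge(first, second) -> P(first-listed option wins).
Judge = Callable[[str, str], float]


def label_swap_bias(judge: Judge, label_a: str, label_b: str) -> float:
    """Estimate the lean toward label_a on a decision the judge knows nothing about.

    With the slot held fixed, a label-neutral judge gives the same answer
    whichever label sits in it. Half the residual asymmetry is the goal bias
    in probability units; 0.0 means neutral.
    """
    return (judge(label_a, label_b) - judge(label_b, label_a)) / 2.0


def position_biased_judge(first: str, second: str) -> float:
    """Toy judge: always favors whatever is listed first, regardless of label."""
    return 0.6


def name_biased_judge(first: str, second: str) -> float:
    """Toy judge: favors the label 'Anthony' wherever it appears."""
    return 0.7 if first == "Anthony" else 0.3


if __name__ == "__main__":
    print(label_swap_bias(position_biased_judge, "Anthony", "Ted"))
    print(label_swap_bias(name_biased_judge, "Anthony", "Ted"))
```

The position-biased toy judge scores as neutral here, which illustrates the limit stated above: the check isolates directional lean toward a label and is blind to non-directional biases.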

Cheap and robust don’t yet coexist. Today’s cheap processes (LLM judges, indices, reviews) are cheap to corrupt; today’s robust ones (replication, proofs, deep track records) are expensive or narrow. The field’s constructive bet is that the empty quadrant is reachable by composition: cheap processes as filters, with randomized escalation to expensive robust ones, so that the adversary faces the expensive process’s corruption cost while the judge mostly pays the cheap process’s information cost. Whether composition actually multiplies corruption costs, or merely adds the weakest link, is an open measurement question, but it has a standard algebra to test against: model the stack as an attack graph, and its corruption cost is the min-cut over attack paths. Series composition without randomization inherits the cheapest stage; randomized escalation with probability $p$ adds roughly $p \cdot C_{\text{expensive}}$ to every path that must survive escalation, a predicted value for the v1 stack entries below to be checked against.
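
The two predictions in the paragraph above can be worked through with toy numbers. The sketch below simplifies the min-cut framing to "cheapest attack path, where a path costs the sum of what it must corrupt". The function names, the assumption that any single stage in an unrandomized series can flip the output, and the costs (1 and 100 units) are all hypothetical.

```python
def stack_corruption_cost(paths: list[list[float]]) -> float:
    """Cost of the cheapest attack path; each path lists the costs it must pay."""
    return min(sum(path) for path in paths)


def series_paths(stage_costs: list[float]) -> list[list[float]]:
    """Series stack without randomization.

    Assumption: corrupting any single stage is enough to flip the output,
    so every stage is its own one-step attack path.
    """
    return [[cost] for cost in stage_costs]


def escalation_paths(c_cheap: float, c_expensive: float, p: float) -> list[list[float]]:
    """Cheap filter with escalation to the expensive process with probability p.

    The path through the cheap filter must also survive escalation, which
    adds roughly p * c_expensive in expectation.
    """
    return [[c_cheap, p * c_expensive]]


if __name__ == "__main__":
    c_cheap, c_expensive = 1.0, 100.0  # hypothetical corruption costs
    print(stack_corruption_cost(series_paths([c_cheap, c_expensive])))
    print(stack_corruption_cost(escalation_paths(c_cheap, c_expensive, p=0.25)))
```

With these made-up numbers the unrandomized series costs 1 unit to corrupt and the escalated stack costs 26, against an information cost that stays close to the cheap stage for the 75% of cases that are not escalated. Whether real stacks behave this way is exactly the open measurement question.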

Several entries are moving. LLM rows are the fastest-changing: information cost is collapsing while corruption cost is, if anything, falling too (better models are better persuaders and better Sybil operators). The interesting derivative is the ratio. Conversely, identity infrastructure (Identity and Track-Record Infrastructure, planned) would raise the corruption cost of every row that currently fails by Sybil or reputation-laundering — it is the single most leveraged missing institution in the table.

Documented corruption is the best data we have. Where the table claims a low cost to bias, it mostly rests on the historical record rather than experiments: gamed tournament scoring and information hoarding on real platforms, auditor capture (auditors paid by auditees), review manipulation at scale. The experimental version — paying red-team producers to corrupt each process and recording the price — is the incentive audit practice the field needs to standardize (Certification and Gyms, planned).

What a v1 of the table needs:

  1. Numbers. Replace L/M/H with estimates: dollars per validated bit; dollars to shift a conclusion by a fixed amount against a fixed-capability attacker. Wide intervals are fine; the point is comparability.
  2. Capability-indexed corruption columns. $C_\pi$ at human, current-model, and frontier-model attacker levels: the corruption-capacity curves sketched in The Core Model.
  3. Per-process pipeline decompositions. Where each process loses its value (codification vs. evaluator vs. interpretation) — the audit template applied row by row.
  4. Composition entries. The table treats processes as atoms; the most important real systems are stacks. Measured entries for two or three named stacks (e.g., consistency floor → LLM panel → randomized expert escalation) would test the composition bet directly.

Open questions:

  • Is “cost per validated bit” actually well-defined across processes with different output types (probabilities, rulings, models, rankings)? What’s the common currency?
  • How stable are corruption costs over time, given that attacks improve? Is the table a snapshot or can some entries be made future-proof?
  • Which rows’ weaknesses are intrinsic, and which are artifacts of missing infrastructure (identity, escrow, resolution)?