Hardening: Invariance & Low Sensitivity

Family 4 of the Hardening overview. The design law behind “the answer shouldn’t depend on anything the agent can change,” made quantitative by fifty years of robust statistics. The lever, its limit, the constructions with their cheapest attacks, and a worked bound with numbers.

A robust process’s output should depend on the truth-relevant variables and be near-invariant to the nuisance variables an adversary controls: phrasing, format, order, persona, which study to cite, how to frame the question. The corruption gain is the Lipschitz constant $\lVert\partial(\text{output})/\partial(\text{adversary levers})\rVert$, and the breakdown point (the fraction of inputs that must be corrupted to move the output arbitrarily) is its off-the-shelf metric (Huber 1964; Donoho & Huber 1983). This is robust statistics, imported wholesale.
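
The breakdown point described above can be written out in its usual finite-sample (replacement) form. The symbols here are assumptions introduced for illustration and do not appear in the original: $T$ is the aggregation rule, $x$ is a clean sample of $n$ inputs, and $x'^{(m)}$ is any sample obtained by replacing $m$ of those inputs with arbitrary values.

```latex
% Finite-sample replacement breakdown point: the smallest corrupted
% fraction m/n that can move the output arbitrarily far.
\varepsilon^{*}_{n}(T; x)
  \;=\; \min\left\{ \frac{m}{n} \;:\;
        \sup_{x'^{(m)}} \bigl\lVert T\bigl(x'^{(m)}\bigr) - T(x) \bigr\rVert = \infty \right\}

% Two standard cases:
\varepsilon^{*}_{n}(\text{mean}) = \frac{1}{n} \;\to\; 0,
\qquad
\varepsilon^{*}_{n}(\text{median}) = \frac{\lfloor (n+1)/2 \rfloor}{n} \;\to\; \frac{1}{2}
```

Under this reading, a single corrupted input already breaks the mean, while the median breaks only once about half the inputs are corrupted.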

Invariance trades against signal and against correlated error. Reducing sensitivity to nuisance variables also blunts sensitivity to genuine ones — a maximally robust estimator (the median) discards information a mean would use, so the breakdown point buys robustness at a stated efficiency cost. And invariance is silent against the failure that hits every input equally: if a shared prior or a correlated-error mode is built into the truth-relevant channel, no amount of low sensitivity to nuisance helps — the whole population can be confidently wrong together. Invariance bounds what a single manipulated input can do, not what a systematically biased world can do.
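
The efficiency cost mentioned above can be checked with a small simulation. This is a minimal sketch under stated assumptions: clean, unit-variance Gaussian inputs, and sample sizes chosen only for illustration. For Gaussian data the median's asymptotic efficiency relative to the mean is known to be $2/\pi \approx 0.64$; the simulated figure at finite $n$ will differ somewhat.

```python
import random
import statistics


def estimator_variance(estimator, n, trials, rng):
    """Empirical variance of an estimator over repeated clean Gaussian samples."""
    estimates = [
        estimator([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.pvariance(estimates)


rng = random.Random(0)   # fixed seed so the run is repeatable
n, trials = 25, 4000     # illustrative sizes, not taken from the source

var_mean = estimator_variance(statistics.fmean, n, trials, rng)
var_median = estimator_variance(statistics.median, n, trials, rng)

# Relative efficiency of the median: a value below 1 means the median needs
# more samples than the mean to reach the same precision on uncorrupted data.
relative_efficiency = var_mean / var_median
print(f"var(mean)={var_mean:.4f}  var(median)={var_median:.4f}  "
      f"efficiency={relative_efficiency:.2f}")
```

The ratio printed at the end is the price paid on clean data for the median's higher breakdown point.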

| Construction | Bound / estimate | Defends against | Cheapest attack (≈ cost) | Maturity · source |
| --- | --- | --- | --- | --- |
| Byzantine-robust aggregation | no linear rule tolerates 1 adversary; median 50% [exact] | adversarial aggregator inputs | corrupt just past the breakdown fraction (e.g. >50% for median) | prototyped · Blanchard 2017 |
| Breakdown-point ratings | corrupted-fraction survived (mean 0%, median 50%) [exact] | a minority swinging the output | exceed the breakdown fraction (cost scales with it) | theoretical · Donoho & Huber 1983 |
| Influence-function audit | cap any source whose removal flips the verdict [standard shape] | one source dominating | spread manipulation across many sources, each below the cap (≈ cost of many Sybils) | prototyped · (Hampel et al. 1986) |
| DP-noise corruption ceiling | $k$ adversarial inputs move output $\le k\varepsilon$ [exact] | bounded-count manipulation | control many inputs so $k\varepsilon$ is large (≈ cost of $k$ inputs) | theoretical · Dwork 2006 |
| Randomize-everything harness | report the invariant; variance = exposure [heuristic] | prompt/format/persona manipulation | find a bias invariant across the fuzz distribution (moderate) | prototyped · (opinion fuzzing) |
| Identity-masking gap | masked-vs-revealed affiliation swing [heuristic] | source-identity / funding bias | leak affiliation through content/style the mask can’t hide (low) | prototyped · Lundh 2017 |
| Label-swap decision-neutrality | output invariant under relabeling a no-information decision; residual asymmetry $= b_\pi(D)$ [heuristic] | decision-relative (goal) bias | carry the bias in content a label-swap can’t reach: symmetric cue, or asymmetric evidence vs. a planted lean (low–moderate) | prototyped · Kusner 2017, Wang 2023 |
| Extremizing / recalibration | logit-pool with an extremizing parameter [standard shape] | shared-info under-confidence | feed correlated forecasts that extremizing wrongly sharpens (low) | deployed · Satopää 2014 |
| Gauge-invariance tests | score violations of invariant transforms [heuristic] | hidden framing dependence | use a bias that respects tested gauges but not an untested one (moderate) | speculative · n/a |
| Minimal-sufficient-input reduction | commit to the smallest determining set [heuristic] | attacks via non-load-bearing inputs | put the manipulation inside the minimal sufficient set (moderate) | speculative · n/a |
| Reasoning fuzzer (CI) | monitor for output cliffs under perturbation [heuristic] | discontinuities an adversary sits on | make the manipulation smooth, so there is no cliff to detect (moderate) | speculative · n/a |
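
The influence-function audit row, and its cheapest attack, can be illustrated with a leave-one-out check. This is a crude finite-sample stand-in for the influence function, not the audit as the source defines it. The `verdict` rule (mean score against a 0.5 threshold), the function names, and the scores are all hypothetical.

```python
from statistics import fmean


def verdict(scores, threshold=0.5):
    """Hypothetical decision rule: accept if the mean score clears a threshold."""
    return fmean(scores) >= threshold


def flipping_sources(scores, threshold=0.5):
    """Return indices of sources whose removal, on its own, flips the verdict."""
    baseline = verdict(scores, threshold)
    flagged = []
    for i in range(len(scores)):
        rest = scores[:i] + scores[i + 1:]
        if rest and verdict(rest, threshold) != baseline:
            flagged.append(i)
    return flagged


# One loud source (index 4) carries the verdict, so the audit flags it.
single = [0.40, 0.45, 0.42, 0.48, 0.95]
flagged_single = flipping_sources(single)

# The same push spread over four quieter sources: the verdict is still
# "accept", but no single removal flips it, so nothing is flagged.
spread = [0.40, 0.45, 0.42, 0.48, 0.60, 0.60, 0.60, 0.60]
flagged_spread = flipping_sources(spread)

print(verdict(single), flagged_single)
print(verdict(spread), flagged_spread)
```

The second case is the cheapest attack listed in the table: each manipulated source stays below the cap, so a per-source audit alone does not catch it.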

Breakdown-point ratings and the Krum impossibility. The corruption gain of an aggregating process is its Lipschitz constant with respect to adversary-controllable inputs, and the breakdown point is its single-number metric. With $n=10$ judges, the mean has breakdown 0 (one corrupted judge can drag the average arbitrarily far), while the median survives up to 4 corrupted judges and fails at the fifth (breakdown 50%). Sharpened: no linear aggregation rule tolerates even one Byzantine input (Blanchard et al. 2017) [exact]. This is the rigorous form of “averaging LLM judges is maximally corruptible,” so robust pooling (coordinate-wise median, trimmed mean) is mandatory, not optional. Differential privacy is the same property from the other side: it bounds per-contributor influence to $\varepsilon$. The cheapest defeat is simply to exceed the breakdown fraction, so the breakdown point is the attack budget.
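
The ten-judge case can be replayed directly. This is a minimal sketch in which the honest scores, the attack value, and the helper names are made up for illustration; a trimmed mean that drops two values from each end is included to show a rule whose breakdown sits between those of the mean and the median.

```python
from statistics import fmean, median


def trimmed_mean(values, trim):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    ordered = sorted(values)
    return fmean(ordered[trim:len(ordered) - trim])


# Ten honest judge scores (made-up numbers).
honest = [0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69]
ATTACK = 1e9  # stand-in for "arbitrarily far"


def corrupt(scores, k, value=ATTACK):
    """Replace the first k scores with the adversary's value."""
    return [value] * k + scores[k:]


for k in (0, 1, 2, 3, 4, 5):
    pool = corrupt(honest, k)
    print(f"k={k}  mean={fmean(pool):.3g}  median={median(pool):.3g}  "
          f"trimmed(2)={trimmed_mean(pool, trim=2):.3g}")
```

In this run the mean leaves the honest range at one corrupted judge, the trimmed mean at three, and the median at five, which matches the reading of the breakdown point as the attacker's budget.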

  • What is the right efficiency-vs-breakdown trade for LLM-judge aggregation — how much accuracy do you give up for a 50%-breakdown pool?
  • Can sensitivity to nuisance be driven down without also blunting sensitivity to genuine signal, or is there a conservation law?
  • Is there an invariance construction that addresses correlated error rather than only single-input manipulation?