Constructing Utility Functions

How weighted question portfolios — the utility functions downstream pages assume — can actually be built: relative value functions as the format, comparison polling as the elicitation mechanism, two small experiments’ worth of sobering empirical results, and what LLMs change.

To score research by how much it moves the questions an organization cares about (the bet Epistemic Impact Analysis and Overseeing Automated Research both make), you need a weighted question portfolio: a utility function $U$. Both pages treat its construction as a quiet input cost: “weeks of specialist work.” This page covers how such objects can actually be built. QURI spent roughly 2021–2023 on this problem; the accumulated work (a format, an elicitation mechanism, two small experiments, statistical grounding) is the most complete attempt we know of at eliciting ratio-scale relative values over research artifacts and grants specifically (GiveWell’s and Rethink Priorities’ moral-weights work shares the goal, not the ratio-distribution format).

The format: relative value functions

Absolute units failed first, in QURI’s hands. Forcing heterogeneous items — grants, papers, forecasting questions — into one currency (QALYs, dollars) produced either extremely uncertain estimates or a proliferation of units with a huge rule book. Relative value functions sidestep this: represent value as pairwise ratio distributions, fn(id1, id2) => distribution, capturing comparisons where judgment is actually precise (“this grant is 2–10x that one”) without committing to a global unit.

The format’s load-bearing feature is correlation preservation. Two absolute estimates each spanning five orders of magnitude may still have a tightly known ratio; stored as independent distributions, that information is destroyed. Encoding values as programmatic functions (QURI’s implementation uses Squiggle) keeps shared uncertainty factors so they divide out in comparisons — for a question portfolio, the difference between individually meaningless weights and pairwise defensible ones. So far this is a design argument, not an empirical result: both experiments below used point ratios only.
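The correlation argument can be illustrated with a small Monte Carlo sketch. This is plain Python rather than Squiggle, and the model is entirely hypothetical: two items share one very uncertain factor (`shared`) and each has a narrow item-specific factor. The names `value_a`, `value_b`, and `log10_spread` are illustrative and not part of any QURI tool.

```python
import math
import random

def log10_spread(samples):
    """Width of the central 90% interval, in orders of magnitude."""
    logs = sorted(math.log10(x) for x in samples)
    lo = logs[int(0.05 * len(logs))]
    hi = logs[int(0.95 * len(logs))]
    return hi - lo

rng = random.Random(0)
N = 20_000

# Hypothetical model: one shared, very uncertain factor (sigma chosen so each
# absolute value spans roughly five orders of magnitude) times a narrow
# item-specific factor.
shared = [rng.lognormvariate(0.0, 3.5) for _ in range(N)]
a_specific = [rng.lognormvariate(math.log(5.0), 0.3) for _ in range(N)]
b_specific = [rng.lognormvariate(0.0, 0.3) for _ in range(N)]

value_a = [s * x for s, x in zip(shared, a_specific)]
value_b = [s * x for s, x in zip(shared, b_specific)]

# Function-style encoding: the ratio is taken sample by sample, so the
# shared factor divides out.
ratio_joint = [a / b for a, b in zip(value_a, value_b)]

# Independent-distribution encoding: shuffling b destroys the pairing,
# which is what storing two separate marginals amounts to.
shuffled_b = value_b[:]
rng.shuffle(shuffled_b)
ratio_indep = [a / b for a, b in zip(value_a, shuffled_b)]

print("spread of value_a     :", round(log10_spread(value_a), 2))
print("spread of joint ratio :", round(log10_spread(ratio_joint), 2))
print("spread of indep ratio :", round(log10_spread(ratio_indep), 2))
```

Under these assumptions the joint ratio stays well under one order of magnitude wide, while the ratio of independently stored marginals is wider than either absolute estimate.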

The elicitation mechanism: comparison polling

Where do the ratios come from? Simple comparison polling: repeatedly ask “how much more valuable is X than Y?”, collect numeric ratios, and stitch them into a full utility function. QURI’s Utility Function Extractor implemented this as a web app — merge-sort ordering of comparisons (minimal comparison sets are cheap, but assume transitive answers and leave little redundancy to audit), one reference item pinned at value 1, everything else valued via comparison paths from it. The tool deliberately surfaces inconsistencies (A = 2×B, B = 10×C, yet A ≠ 20×C) for manual repair. Known weaknesses, flagged at the time: values spanning many orders of magnitude are hard to enter, and point ratios should really be distributions.
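The stitching and inconsistency-surfacing steps can be sketched as follows. This is a minimal illustration, not the Utility Function Extractor's code: it skips the merge-sort ordering, assigns values by breadth-first search from the reference item, and uses a hypothetical `tolerance` parameter to decide what counts as inconsistent. The comparison data are made up to match the A = 2×B, B = 10×C pattern in the text.

```python
from collections import deque

def values_from_reference(comparisons, reference):
    """comparisons maps (x, y) -> how many times more valuable x is than y.
    Returns values with the reference item pinned at 1."""
    ratio = {}  # ratio[x][y] = value(x) / value(y)
    for (x, y), r in comparisons.items():
        ratio.setdefault(x, {})[y] = r
        ratio.setdefault(y, {})[x] = 1.0 / r
    values = {reference: 1.0}
    queue = deque([reference])
    while queue:
        cur = queue.popleft()
        for nxt, r in ratio[cur].items():
            if nxt not in values:
                # r = value(cur) / value(nxt), so value(nxt) = value(cur) / r
                values[nxt] = values[cur] / r
                queue.append(nxt)
    return values

def inconsistencies(comparisons, values, tolerance=1.5):
    """Comparisons whose stated ratio is off from the implied one by more
    than a factor of `tolerance` (a hypothetical threshold)."""
    flagged = []
    for (x, y), stated in comparisons.items():
        implied = values[x] / values[y]
        factor = max(stated / implied, implied / stated)
        if factor > tolerance:
            flagged.append((x, y, stated, implied))
    return flagged

# Made-up answers: A = 2xB and B = 10xC, yet A is stated as 5xC, not 20xC.
comparisons = {("A", "B"): 2.0, ("B", "C"): 10.0, ("A", "C"): 5.0}
values = values_from_reference(comparisons, reference="C")
flagged = inconsistencies(comparisons, values)
print(values)
print(flagged)
```

Which comparison gets flagged depends on which path assigned the values: here the direct comparisons to C are trusted, so the A-versus-B answer is the one that shows up for repair.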

The empirical record

Real elicitation data is rare; the two experiments deserve careful reporting.

Nine Open Philanthropy grants (2022). Six researchers (two with incomplete participation) elicited relative values over nine 2018 Open Philanthropy AI-safety grants ($100k to roughly $1.1M) — about two to three hours per participant in total, across four rounds: extractor-based pairwise comparisons, hierarchical tree estimates against reference points, individual all-things-considered estimates, then revision after group discussion. The sharpest finding is intra-rater, inter-method disagreement: the same participant’s distributions for the same grant from the first two methods often did not overlap. Post-discussion estimates were fairly concordant across researchers — though they spanned merely an order of magnitude, which the writeup itself flags as plausibly overconfident. Documented pitfalls: geometric-mean aggregation breaks when some estimates are negative; the target quantity (expected vs realized value) was underspecified; the briefing materials may have anchored everyone. The author’s verdict: suitable for rapid evaluation; for decisions that matter, explicitly model pathways to impact instead — a verdict the oversight uses of $U$ must either escape (the precision question closing this page) or accommodate with explicit impact models behind the top-weight questions.
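The geometric-mean pitfall is easy to reproduce. The sketch below uses Python's standard `statistics.geometric_mean` on made-up estimates; it is not the writeup's aggregation code, only a demonstration that a log-based mean has no value to return once one estimate is negative.

```python
import statistics

# Made-up relative-value estimates for one grant from three raters.
estimates = [3.0, 10.0, 30.0]
gm = statistics.geometric_mean(estimates)  # cube root of 3 * 10 * 30

# A fourth rater thinks the grant was net harmful.
with_negative = estimates + [-5.0]
try:
    statistics.geometric_mean(with_negative)
    failed = False
except statistics.StatisticsError:
    # The logarithm of a negative number is undefined, so the mean is too.
    failed = True

print(gm, failed)
```

Any workaround (aggregating signs and magnitudes separately, or switching to a different mean) is a further aggregation choice the analyst has to make explicitly.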

Fifteen research works (2022). Six EA researchers spent 1–2 hours each making pairwise comparisons over 15 research pieces; values were aggregated by taking the geometric mean over all monotonic comparison paths to a reference item. Inter-rater disagreement was structural, not noise: individual raters’ value ranges spanned from 5.1 to 12.6 orders of magnitude (average 7.6) — raters disagreed about the scale of the value landscape itself, not just orderings. Transitivity failed within single raters too: one participant’s chained comparisons implied a 400× ratio where the direct estimate was 33×, making aggregates path-sensitive. The stated conclusion: individual estimates, even from respected researchers, are likely very noisy and often inaccurate, with explicit skepticism that the aggregate constitutes ground truth.
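Path sensitivity can be made concrete with a small sketch. It simplifies the experiment's method by averaging over all simple paths rather than only "monotonic" ones, and the intermediate ratios (20× and 20×) are hypothetical numbers chosen only so that the chain implies 400× against a direct 33×; the source does not report them.

```python
import math

def path_ratios(ratio, start, goal, visited=None):
    """Yield value(start) / value(goal) along every simple path."""
    visited = (visited or set()) | {start}
    if start == goal:
        yield 1.0
        return
    for nxt, r in ratio[start].items():
        if nxt not in visited:
            for rest in path_ratios(ratio, nxt, goal, visited):
                yield r * rest

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical direct answers from one rater; C is the reference item.
direct = {("A", "B"): 20.0, ("B", "C"): 20.0, ("A", "C"): 33.0}
ratio = {}  # ratio[x][y] = value(x) / value(y)
for (x, y), r in direct.items():
    ratio.setdefault(x, {})[y] = r
    ratio.setdefault(y, {})[x] = 1.0 / r

paths = sorted(path_ratios(ratio, "A", "C"))
aggregate = geometric_mean(paths)
print(paths, aggregate)
```

With these numbers the two paths give 33× and 400×, and their geometric mean is about 115×. Restricting to one path, or weighting paths differently, would give a different value for A, which is the sense in which the aggregate is path-sensitive.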

What the data support: the pipeline runs end-to-end at hours of expert time per 10–15 items; intra- and inter-rater inconsistency is large; rank concordance, external validity, and test-retest reliability went unmeasured; and consistency violations were large enough that the aggregation method visibly matters. How costs scale to the hundreds-of-questions portfolios downstream pages assume is unextrapolated.

Statistical grounding

None of this needs inventing from scratch. Pairwise preference estimation has decades of literature in economics, marketing, and statistics — Thurstone (1927) and McFadden’s conditional logit work (1974) onward; the broader literature adds Bradley–Terry-style paired-comparison models and best-worst scaling. A short exploratory note by David Moss (Rethink Priorities) draws the key distinction: economics leans on discrete choice models, which recover orderings but not the scale of value differences; graded ratio comparisons of the QURI style cost more cognitive effort but capture the scaling information a portfolio’s weights require. His recommendation — a rigorous statistical model first, validated in small methodological studies — remains largely unexecuted.

The geometry is also already worked out. HodgeRank (Jiang, Lim, Yao & Ye 2011) treats pairwise log-ratios as flows on a comparison graph and decomposes them exactly into a gradient component — the best-fitting consistent utility function — plus cyclic components that measure inconsistency. Under this lens, the 400×-vs-33× transitivity failures above stop being anecdotes: each rater gets a measured cyclic-residual norm, the implied utility function is the gradient part, and the least-squares repair is canonical rather than ad hoc. (The decision to repair by projection is still an aggregation policy — the decomposition tells you what the consistent core is, not that the consistent core is right.)
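A minimal sketch of the least-squares step, under strong simplifying assumptions: every pair has been compared exactly once and all edges have unit weight, in which case the best-fitting potential for each item is just the average of its log-ratios against all items. General HodgeRank handles sparse, weighted graphs by solving a graph-Laplacian system; that is not shown here. The function name `hodge_complete` and the 20×/20×/33× answers are hypothetical.

```python
import math
from itertools import combinations

def hodge_complete(log_ratio, items):
    """Least-squares potentials on a complete comparison graph.
    log_ratio[(i, j)] = log10(value_i / value_j), one entry per pair."""
    n = len(items)

    def y(i, j):
        if i == j:
            return 0.0
        if (i, j) in log_ratio:
            return log_ratio[(i, j)]
        return -log_ratio[(j, i)]  # skew-symmetry: log(j/i) = -log(i/j)

    # Gradient part: the consistent log-utility, centered to sum to zero.
    potential = {i: sum(y(i, j) for j in items) / n for i in items}
    # Residual: what the consistent utility cannot explain on each edge.
    residual = {
        (i, j): y(i, j) - (potential[i] - potential[j])
        for i, j in combinations(items, 2)
    }
    norm = math.sqrt(sum(r * r for r in residual.values()))
    return potential, residual, norm

items = ["A", "B", "C"]
# Hypothetical answers: the chain implies 400x, the direct answer is 33x.
log_ratio = {
    ("A", "B"): math.log10(20.0),
    ("B", "C"): math.log10(20.0),
    ("A", "C"): math.log10(33.0),
}
potential, residual, norm = hodge_complete(log_ratio, items)
fitted_a_over_c = 10 ** (potential["A"] - potential["C"])
print(fitted_a_over_c, norm)
```

For this triangle the fitted A-to-C ratio comes out at about 76×, between the direct and chained answers, and the inconsistency is spread evenly over the three edges. The residual norm is the per-rater inconsistency score the paragraph above describes.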

What kind of utility function this is

Conceptual hygiene, via Distinctions when Discussing Utility Functions: a question portfolio is an operational utility function (a runnable artifact, not a description of anyone’s terminal values), largely on-demand (computed as needed rather than precomputed), and proximal rather than terminal. It is also strictly post-repair: raw comparisons are often intransitive, becoming a utility function only via an aggregation step (the path aggregation above) — a known degree of freedom. Claims about elicitation noise are claims about this modest object — not about whether the organization “really has” a utility function.

The weighting itself cannot be skipped. Sempere’s analysis of 200 Metaculus questions found most had little direct decision impact: “optimized for being fairly interesting to forecasters rather than directly valuable.” An unweighted portfolio inherits exactly this pathology; importance weights are where the mission actually enters. Weights also presuppose the question substrate: crisply resolvable questions (writing those is the scarce skill, the same analysis’s harder lesson) and a settled choice between ex-ante and ex-post value, the target quantity the 2022 grants experiment left underspecified.

The LLM upgrade

The 2022 experiments were bottlenecked on hours of expensive researcher attention, yielding sparse, inconsistent comparison matrices. LLMs could change each step, with the emphasis on “could”: only the last part has support so far. Candidate comparisons and question variants become cheap to generate. Each rater’s sparse matrix can be densified, but within-rater only; across raters who disagree structurally about scale, “humans anchor the scale” is ill-defined, and a dense matrix quietly bakes in an answer to the first open question below. And ratio matrices are exactly where Dutch-book and transitivity checks apply, making consistency enforcement cheap at scale. Detection was never the gap, though: the extractor already caught the 400×-vs-33× failures. The unsolved step is repair, a trust policy that is itself an aggregation choice. Coherence is also a floor, not accuracy: a consistent-but-wrong $U$ simply gets optimized against with more confidence. The judge caveat splits in two. The first is variance: phrasing, model choice, and persona shift judgments substantially, which opinion fuzzing (sampling across prompts, models, and personas) mitigates. The second is correlated biases (position, verbosity, sycophancy toward the framing), which survive sampling and need validation against held-out human comparisons. Cheap elicitation with documented variance but undocumented bias is still a different regime from scarce elicitation with neither.
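The variance-versus-bias split can be sketched with a stand-in judge. Everything here is hypothetical: `stub_judge` replaces a real LLM call with fixed, made-up offsets per prompt, model, and persona, and `fuzz` is an illustrative name, not an existing API. The point is only the shape of the procedure: sample the full grid, aggregate in log space, report center and spread.

```python
import itertools
import math
import statistics

PROMPTS = ["neutral", "x_first", "y_first"]
MODELS = ["model-1", "model-2"]
PERSONAS = ["skeptic", "optimist"]

def stub_judge(prompt, model, persona):
    """Hypothetical stand-in for an LLM call: 'X is r times as valuable as Y'."""
    base = 1.0  # made-up underlying judgment: 10x, in log10 units
    prompt_shift = {"neutral": 0.0, "x_first": 0.2, "y_first": -0.2}[prompt]
    model_shift = {"model-1": 0.1, "model-2": -0.1}[model]
    persona_shift = {"skeptic": -0.3, "optimist": 0.3}[persona]
    return 10 ** (base + prompt_shift + model_shift + persona_shift)

def fuzz(judge):
    """Sample every prompt/model/persona combination; aggregate in log10."""
    logs = [
        math.log10(judge(p, m, s))
        for p, m, s in itertools.product(PROMPTS, MODELS, PERSONAS)
    ]
    center = 10 ** statistics.fmean(logs)  # geometric mean of the ratios
    spread = statistics.pstdev(logs)       # documented variance, in log10
    return center, spread

def biased_judge(prompt, model, persona):
    """Same judge with a bias shared by every sample (made-up factor of 2)."""
    return 2.0 * stub_judge(prompt, model, persona)

center, spread = fuzz(stub_judge)
biased_center, biased_spread = fuzz(biased_judge)
print(center, spread)
print(biased_center, biased_spread)
```

The shared bias doubles the aggregate while leaving the measured spread unchanged, so sampling alone cannot reveal it; only a comparison against held-out human judgments would.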

Open questions

How should we aggregate across raters whose disagreement is structural — 5 vs 12 orders of magnitude of value range — rather than noise around a shared signal?
How should portfolio weights drift over time, and who audits the drift? (The meta loop assumes this is solved.)
Can weights be elicited for not-yet-asked questions — valuing question discovery, the known blind spot of fixed portfolios?
How much elicitation precision does Epistemic Impact Analysis actually need? If rankings of $V$ — the information-value score it computes against $U$ — survive order-of-magnitude weight noise, the 2022 results suffice; nobody has checked. And noise bounds only passive evaluation: a metered system optimizing against $U$ selects its errors adversarially (Goodhart), so required weight precision grows with the optimization pressure applied.