WORKING PAPER · VERSION 1.0 · JULY 2026 · A live inquiry, released in versions as evidence accrues. Comments and case evidence welcome: info@impactthinking.co.uk
Institutions say they want good judgment. What they reward, observably, is the performance of certainty — the confident brief, the unhedged forecast, the leader who “projects decisiveness” while the question is still genuinely open. The preference is not cosmetic: a robust experimental literature shows that people select and retain confident advisers over accurate ones, and that advisers, knowing this, compete on confidence — a market that systematically punishes calibrated uncertainty. Meanwhile the forecasting literature shows the correlation running the other way: the most accurate judges of complex questions are precisely the hedgers, the updaters, the takers of many small positions — and the most confident public experts are among the least accurate. This working paper examines the institutional machinery that sustains the mismatch, its cost in premature closure and unexamined risk, and the question of what an institution would have to change — in its meetings, its promotion signals, and its leaders’ own capacity to stand in not-knowing — to stop paying for confidence and start paying for judgment.
Begin with the demand side. In controlled studies of advice-taking, people systematically prefer advisers who express high confidence — the confidence heuristic — using certainty as a proxy for competence even when accuracy information is available.1 The supply side responds as any market would: in competitive advice settings, advisers inflate their expressed confidence beyond their private beliefs, because calibrated hedging loses the client.2 Radzevick and Soll’s finding deserves its bluntness: competition makes advisers more overconfident, not more accurate, because confidence is what is being bought.
Now the accuracy side. Tetlock’s two-decade study of expert political judgment found the experts most in demand — the most famous, most quoted, most confident — performing worst against reality, while the best performers were the cognitively humble: foxes rather than hedgehogs, takers of provisional positions, relentless updaters.3 The superforecasting research sharpened the point: the distinguishing marks of the most accurate judges under genuine uncertainty are precisely the behaviours institutional theatre punishes — granular probabilities, visible revision, comfort saying “I don’t know yet.”4
Put the three findings together and the conclusion is uncomfortable: the confidence an institution rewards and the judgment it needs are not merely different quantities. Under uncertainty, they are inversely correlated — and the institution’s selection machinery is pointed at the wrong end.
Inside organisations the market for confidence operates through mechanisms mundane enough to escape notice. The meeting rewards the first structured answer: a genuinely open question, voiced in a senior forum, survives on average a few minutes before someone relieves the room of it — and the reliever is experienced as having contributed, the holder-open as having stalled. Career signalling compounds it: “decisive” appears on every promotion rubric; “knows when not to decide yet” appears on none. The asymmetry of accountability completes the machine: the confident wrong call, taken with the room’s approval, is a shared misfortune; the hedged right call is barely remembered; and the leader who held a question open under pressure and was later vindicated receives no credit at all, because the counterfactual disaster never happened and unhappened disasters have no constituency.
Janis documented the collective endpoint half a century ago: groups under pressure for consensus converge on premature certainty and suppress the doubt that would have saved them.5 The inquiry literature keeps re-finding it — the confident assessment that hardened too early, the dissenting analysis that was structurally unhearable (Working Paper: The Cost of Silence treats the voice side of this machinery). What the confidence market adds to groupthink is a career dimension: it is not only the group dynamic that closes questions early; it is that every individual’s incentive, read correctly, is to be the closer.
The alternative to performed certainty is not indecision — that is the false binary the confidence market trades on. It is the capacity this desk’s programme research calls standing: remaining steady in genuine not-knowing, holding the question open on purpose while the picture forms, and then authoring a real position — a declaration one stands behind — rather than laundering the choice through false inevitability. Standing is more demanding than confidence, not less: it requires tolerating the exposure of visible uncertainty in cultures that read it as weakness, and it requires knowing the difference between the decisions that need analysis, the decisions that need speed, and the decisions that need a stand — a triage the confidence performance never has to make, because it treats everything as already answerable.
The institutional question, then, is not how to make leaders humbler — humility campaigns meet the same fate as openness campaigns — but how to re-price the behaviours: what would have to change for calibrated uncertainty to be career-safe and premature certainty career-expensive?
Early practice suggests the levers are, as elsewhere in this research programme, conversational and evidential rather than exhortative. Decision forums that separate question-holding from answer-giving as explicit phases, so that holding open is a sanctioned activity with a name rather than a failure to contribute. Position papers that require probability language and named uncertainties, making calibration visible and hedging legible as rigour rather than weakness. Decision post-audits that score process under uncertainty — was dissent heard, were probabilities stated, was the closure timed or premature — rather than outcomes alone, which under uncertainty reward luck. And, hardest, senior modelling: one leader visibly saying “I don’t know yet, and here is when and how we will” — and being seen to prosper — re-prices the behaviour faster than any framework, for the same reason one punished speaker re-prices silence.
The strongest field evidence on what accurate judgment under uncertainty actually looks like comes from the IARPA forecasting tournaments run for the US intelligence community from 2011, a rare setting where confident and calibrated judgment competed head-to-head on hundreds of real geopolitical questions with verifiable outcomes. The Good Judgment Project’s teams won the tournament decisively; its best forecasters, selected and trained for exactly the fox-like behaviours institutions under-reward, outperformed, by widely reported accounts, even intelligence analysts working with access to classified information.4 The winning behaviours were unglamorous and consistent: granular probabilities rather than verbal certainty, frequent small updates rather than defended positions, active soliciting of disconfirming argument, and teams whose norms made changing one’s mind cheap.
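What "granular probabilities" buy an institution is the ability to check calibration: of all the calls a forecaster marked at roughly 70 per cent, did roughly 70 per cent come true? The sketch below is one minimal way to run that check, not a description of the tournament's own method. The function name, the bin count and the sample figures are all hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes, n_bins=5):
    """Compare stated probabilities with observed frequencies.

    forecasts: probabilities in [0, 1], one per question.
    outcomes:  1 if the event happened, 0 if it did not.
    Returns one row per non-empty bin:
    (bin_index, count, mean_stated_probability, observed_frequency).
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")

    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # Equal-width bins; a forecast of exactly 1.0 goes in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        rows.append((idx, len(pairs), mean_p, freq))
    return rows


# Hypothetical track record: ten calls at 90%, nine of which came true,
# and ten calls at 30%, three of which came true.
forecasts = [0.9] * 10 + [0.3] * 10
outcomes = [1] * 9 + [0] + [1] * 3 + [0] * 7

for idx, count, mean_p, freq in calibration_table(forecasts, outcomes):
    print(f"bin {idx}: n={count} stated={mean_p:.2f} observed={freq:.2f}")
```

A well-calibrated forecaster shows stated and observed values close together in every bin; performed certainty shows up as a top bin whose observed frequency falls well short of what was stated.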
Read as an institutional finding rather than a psychological one, the tournament result is pointed: the behaviours that won are all observable and trainable, and every one of them is punished, mildly and continuously, by the ordinary confidence market of a senior forum — where the update reads as inconsistency, the probability as hedging, and the solicited counter-argument as weakness. The scarce input is not cognitive talent. It is an institutional setting in which calibrated behaviour is career-safe.
For executive teams, four near-term moves. One: split significant decision forums into named phases — framing and question-holding, then evidence and dissent, then closure — so that holding-open is a sanctioned activity rather than a failure to contribute. Two: require probability language and named uncertainties in decision papers, making calibration legible as rigour. Three: institute decision post-audits scored on process under uncertainty — dissent heard, probabilities stated, closure timed — not outcomes alone, which under uncertainty reward luck and punish honesty. Four: appoint a rotating challenge role with real standing (the pre-mortem made permanent7), so the counter-case arrives by design rather than by career risk.
For boards, the instrument is the closure question: for each major recommendation, “what would have to be true for this to be premature — and who in the organisation currently believes it is?” A management team that cannot produce a live dissenting view has not resolved the uncertainty; it has suppressed it, and the board is the last checkpoint at which that distinction is cheap.
For governments and advice functions, the tournament evidence carries a structural lesson: forecasting accuracy improved when it was scored — when advice carried probabilities that were later marked against reality. Advice functions that never score their calls are running the confidence market at national scale; the remedies (calibration training, scored forecasts, institutional red teams with protected status) are proven, cheap relative to a single premature-closure failure, and adopted almost nowhere.
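Scoring a forecast need not be elaborate. One standard rule is the Brier score: the mean squared gap between the stated probability and what actually happened (1 or 0), where lower is better. The sketch below assumes that rule as an illustration; it is not offered as the scoring scheme any particular advice function should adopt, and the two advisers and their numbers are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and outcomes.

    forecasts: probabilities in [0, 1], one per question.
    outcomes:  1 if the event happened, 0 if it did not.
    0.0 is a perfect record; always saying 50% scores 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty forecasts and outcomes")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical: ten questions, seven of which resolved "yes".
outcomes = [1] * 7 + [0] * 3

confident_adviser = [0.95] * 10   # performs certainty on every call
calibrated_adviser = [0.70] * 10  # states the rate the evidence supports

print("confident: ", round(brier_score(confident_adviser, outcomes), 4))
print("calibrated:", round(brier_score(calibrated_adviser, outcomes), 4))
```

In this invented record the confident adviser is right as often as the calibrated one, yet scores worse, because the rule charges heavily for the three calls made at 95 per cent that failed. That is the re-pricing in miniature: once calls are marked against reality, unearned confidence carries a visible cost.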
We are tracking, in participating senior forums: the survival time of open questions before closure; the ratio of probability-language to certainty-language in decision papers; post-audit verdicts on closure timing; and career outcomes of leaders coded (by structured observation) as standers versus closers — the last being the slow, decisive variable. Open questions we hold honestly: whether standing can be developed at mid-career or is largely formed earlier (the programme evidence says developed, the selection evidence is mixed); whether re-priced forums revert under crisis, when the demand for performed certainty spikes; how to distinguish, observationally, principled question-holding from ordinary indecision wearing its costume; and whether AI counsel — the always-available confident answer examined in The Atrophy of Judgment — is about to make the confidence market catastrophically more liquid. Evidence and counter-cases are invited; the paper will be revised against them.
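Of the measures listed above, the ratio of probability-language to certainty-language is the easiest to prototype. The sketch below is a crude first pass, assuming simple word lists and counting explicit percentages as probability language. The term lists, function name and sample sentence are hypothetical and are not the coding scheme used in this research; any serious version would need a validated lexicon and human checking.

```python
import re

# Hypothetical term lists, for illustration only.
PROBABILITY_TERMS = {
    "likely", "unlikely", "probably", "probability",
    "chance", "uncertain", "estimate", "roughly",
}
CERTAINTY_TERMS = {
    "will", "certainly", "definitely", "clearly",
    "undoubtedly", "guaranteed", "inevitable",
}

PERCENT_PATTERN = re.compile(r"\d+(?:\.\d+)?\s?%")
WORD_PATTERN = re.compile(r"[a-z']+")


def language_ratio(text):
    """Count probability and certainty language in a decision paper.

    Returns (probability_count, certainty_count, ratio); ratio is None
    when the text contains no certainty terms at all.
    """
    words = WORD_PATTERN.findall(text.lower())
    # Whole-word matching, so "unlikely" never counts as "likely".
    prob = sum(1 for w in words if w in PROBABILITY_TERMS)
    prob += len(PERCENT_PATTERN.findall(text))  # "70%" counts as probability language
    cert = sum(1 for w in words if w in CERTAINTY_TERMS)
    ratio = prob / cert if cert else None
    return prob, cert, ratio


sample = "We will definitely hit the target. A 70% chance is likely optimistic."
print(language_ratio(sample))
```

Word counts of this kind cannot tell a principled "likely" from an evasive one, which is why the measure is best read alongside the post-audit verdicts rather than on its own.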
IMPACT THINKING RESEARCH · BY BEN BOTES · WORKING PAPER 04 · v1.0 · JULY 2026