WORKING PAPER · VERSION 1.0 · JULY 2026 · A live inquiry, released in versions as evidence accrues. Comments and case evidence welcome: info@impactthinking.co.uk
The organisational conversation about AI runs almost entirely in one direction: adoption — how fast, how safely, how completely. This working paper runs the other way and asks what sustained AI assistance does to the human capacity it is assisting. Two mechanisms concern us. First, generative systems are confidence machines: they produce their most fluent, most certain-sounding output precisely where certainty is least warranted — in the novel, ambiguous, precedent-poor situations where judgment matters most — and fluent confidence is exactly the signal humans use to decide when checking is unnecessary. Second, judgment is maintained by use: the discomfort of standing in not-knowing and authoring a position is not a cost of judgment but its gymnasium, and a technology that relieves the discomfort on demand removes the training load. Neither mechanism is speculative; both extend forty years of automation research to a technology whose surface, for the first time, is language itself. We set out the mechanisms, the early institutional signals, and the questions we are gathering evidence on.
Institutions are asking, correctly, what AI can do for their decision-making. Almost none are asking what it does to their decision-makers — and the asymmetry is itself the finding. Governance frameworks address the machine’s failures: bias, hallucination, security. The human side of the loop is assumed constant. Four decades of automation research say it is not.
Bainbridge named the pattern in 1983, in what remains the field’s most quoted short paper: the ironies of automation.1 Automate the routine portion of a task and the human is left with exactly the parts automation cannot do — the abnormal, the novel, the judgment calls — while simultaneously losing, through disuse, the practised skill those parts require, and losing the situational awareness that manual involvement maintained. The human is repositioned as monitor, a role humans are demonstrably poor at; vigilance decays, and reliance calibrates to the machine’s average performance rather than its worst.2 Aviation spent thirty years and a body count learning this curve — and responded, notably, not by removing automation but by deliberately protecting manual practice inside it.
Generative AI extends the pattern to cognitive work with two aggravations the cockpit never faced.
The confidence surface. Autopilots do not argue their case. Language models do: fluently, with structure, in the register of a capable analyst. Critically, their fluency does not degrade where their reliability does. On precedent-poor questions, the prose remains immaculate while the epistemic floor drops away. This inverts the natural warning system: with human advisers, hesitation and hedging signal thin ice; with generative systems, the signal is absent exactly where the ice is thinnest. Judge-adviser research shows humans lean heavily on expressed confidence when deciding whom to trust and when to verify.3 That heuristic served tolerably among humans, whose confidence loosely tracks their competence; it fails against systems whose expressed confidence does not track their reliability.
The frontier problem. Dell’Acqua and colleagues’ consultant experiment made the shape empirical: substantial performance gains on tasks inside the model’s frontier of competence; significantly worse performance than unassisted peers on a task designed to sit just outside it — professionals accepting confident, wrong guidance, falling asleep at the wheel.4 Map that onto institutional decision-making and the exposure is precise: the routine analysis inside the frontier was never where institutions failed. They fail on the novel case — the pandemic, the market structure that has no precedent, the adversary who read the same playbook — which is definitionally outside any frontier built from precedent, and which is now flooded with fluent, structured, confident counsel.
The second mechanism is slower and, we suspect, larger. Judgment — the capacity this desk’s programme research calls standing: remaining steady in not-knowing long enough for the real picture to form, then authoring a position — is not a stock but a fitness. It is maintained by exactly the experience professionals least enjoy: the discomfort of an open question with stakes attached and no available answer. That discomfort is not a cost of judgment. It is the gymnasium in which judgment is built and kept.
A technology that dissolves the discomfort on demand — that converts any open question, within seconds, into a structured, confident, plausible answer — removes the training load without announcing itself as doing so. Each individual relief is rational; the compound effect is a professional population whose tolerance for standing in not-knowing quietly shortens, and whose reflex under uncertainty becomes consultation rather than formation. The endpoint is not dramatic incompetence. It is subtler: institutions that remain excellent at everything that resembles the past, staffed by leaders who have not held an open question unaided in years — encountering, eventually, the question that resembles nothing.
Because the argument can sound speculative, it is worth dwelling on the one domain that has run the full experiment. Commercial aviation automated cognitive-motor work decades ahead of the professions, harvested enormous safety gains — and then met the residue Bainbridge predicted. The loss of Air France 447 in 2009 became the canonical case: when the automation disengaged in cruise, a serviceable aircraft was flown into the sea by a crew whose manual high-altitude handling had, through years of normal automated operation, quietly atrophied8 — and the accident investigation’s findings on degraded manual skill and startle under automation surprise now anchor the field’s training doctrine. The regulatory response is the instructive part: the FAA formally encouraged operators to have pilots manually fly more often — deliberately reinserting the inefficiency that maintains the capability9 — and recency requirements treat recent hands-on practice, not accumulated qualification, as the unit of competence.
Translate the doctrine and the professional analogue writes itself: qualification in judgment decays without recency; the abnormal case arrives on the automation’s schedule, not the human’s; and an institution that wants judgment available in the exceptional moment must pay for its maintenance in the routine ones. No knowledge-work institution we are aware of currently has an equivalent of manual-flying policy for its decision-makers. Several now have the equivalent of full-autopilot fleets.
For executive teams, the near-term agenda: adopt a position-first protocol for significant decisions — the human position drafted and logged before AI counsel is taken, converting the tool from oracle to sparring partner; establish judgment-recency expectations for senior roles (when did this leader last work a precedent-poor question unaided, with stakes?); and run frontier drills — periodic scenario exercises deliberately constructed beyond the tools’ competence, scored on process (Working Paper 04) rather than outcome.
For AI programme owners, the implication is a widening of the risk register: alongside model risk (bias, hallucination, security) sits user-capability risk — the predictable drift of verification behaviour and the slow transfer of judgment load — which belongs in the same governance, with the same seriousness, and with its own indicators (the reflex and verification measures below).
For governments, the exposure compounds: national decision-making capability is a strategic asset with no owner, and the confidence gap lands hardest on exactly the questions states exist for — the crisis without precedent. Policy functions adopting AI counsel at scale need the aviation settlement explicitly: protected manual practice for analysts and advisers, adversarial red-team use of the tools institutionalised, and the capacity to stand in not-knowing treated as trainable senior capability rather than temperament.
Observables, across participating organisations: the share of significant decisions where AI counsel was obtained before any human position was drafted (the reflex measure); verification behaviour on AI-supported recommendations inside versus outside precedent-rich domains; performance of matched leadership cohorts on precedent-poor scenario exercises, with and without assistance; and self-report and observed measures of tolerance for unresolved questions in senior forums — how long a genuinely open question survives in a room before someone closes it with an answer.
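As an illustration only, the reflex measure could be computed from a decision log along the following lines. This is a minimal sketch, not the desk's instrument: the `DecisionRecord` structure, its field names, and the sample log are hypothetical, and it assumes each significant decision carries a timestamp for the first human position drafted and for the first AI counsel obtained (either may be missing).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DecisionRecord:
    """One significant decision in a hypothetical decision log."""
    decision_id: str
    human_position_at: Optional[datetime]  # first human position drafted, if any
    ai_counsel_at: Optional[datetime]      # first AI counsel obtained, if any


def ai_came_first(record: DecisionRecord) -> bool:
    """True when AI counsel was obtained before any human position was drafted."""
    if record.ai_counsel_at is None:
        return False  # no AI counsel was taken, so there is no reflex to count
    if record.human_position_at is None:
        return True   # AI was consulted and no human position was ever drafted
    # Assumption: identical timestamps are not counted as "AI first".
    return record.ai_counsel_at < record.human_position_at


def reflex_measure(records: list[DecisionRecord]) -> float:
    """Share of significant decisions where AI counsel preceded any human position."""
    if not records:
        raise ValueError("reflex measure is undefined for an empty decision log")
    return sum(ai_came_first(r) for r in records) / len(records)


# Hypothetical sample log: two of the four decisions are "AI first".
log = [
    DecisionRecord("D1", datetime(2026, 7, 1, 9, 0), datetime(2026, 7, 1, 10, 0)),   # human first
    DecisionRecord("D2", datetime(2026, 7, 2, 11, 0), datetime(2026, 7, 2, 9, 30)),  # AI first
    DecisionRecord("D3", None, datetime(2026, 7, 3, 14, 0)),                         # AI only
    DecisionRecord("D4", datetime(2026, 7, 4, 9, 0), None),                          # human only
]
print(reflex_measure(log))  # 0.5
```

The denominator here is all significant decisions, including those where no AI counsel was taken, which follows the wording of the measure above. An organisation might reasonably prefer to restrict it to AI-supported decisions, which would give a different figure.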
Where the evidence could move us: whether AI counsel can be configured as a judgment gymnasium rather than a judgment substitute (adversarial use, forced position-first protocols, the machine as sparring partner rather than oracle), and whether such configurations survive contact with deadline pressure; whether the atrophy mechanism is real at the individual level or only at the cohort-formation level (the two have different remedies); whether institutional safeguards from aviation (protected manual practice, recency requirements for judgment as for flying hours) transfer to cognitive work; and, the uncomfortable one, whether the confidence gap is narrowing faster than the atrophy is compounding, in which case this paper’s concern has a horizon, or the reverse.
Evidence, counter-evidence, and natural experiments are actively sought; this paper will be revised against them.
IMPACT THINKING RESEARCH · BY BEN BOTES · WORKING PAPER 02 · v1.0 · JULY 2026