The Reliability Ledger
Ask most AI teams how much autonomy an agent should have, and the answer is a vibe. It feels reliable. It passed the demo. We’ll let it run for an hour and watch. Authority gets handed out the way a parent extends a curfew — on accumulated good behavior, loosely remembered, never written down. That is exactly the instinct two of the most consequence-heavy industries on Earth abolished, deliberately, because it kept killing people and losing money.
A twin-engine airliner is not allowed to fly far from a runway because a regulator likes the look of it. It is allowed to fly far from a runway because the airplane-and-engine combination has demonstrated a specific, low engine-failure rate across hundreds of thousands of fleet engine-hours. The permission is a line item. It is denominated in a number — the rate at which engines quit in flight — and the number is tracked continuously, for years, before the airline gets to fly the longer route. Autonomy, in aviation, is literally an entry on a reliability ledger.
Banking does the identical thing with a different vocabulary. A risk model does not earn the right to set a trading limit because its builders trust it; it earns that right by being backtested — its predicted loss bound measured against what actually happened, day after day, with authority contracting automatically when the failures pile up. Both fields converged, independently, on the same discipline: measure the rate of failure, demonstrate it is stable and low over real experience, and only then widen the budget. The move AI has not yet made is to recognize that an agent’s autonomy is the same kind of quantity — priced in failure-rate data, not asserted in a launch announcement.
Autonomy is not granted. It is priced. The unit is the measured rate of undetected failure on a task class, demonstrated stable and low over accumulated operating experience — and the price includes the tail.
How Aviation Prices Autonomy: The IFSD Rate
For decades, a twin-engine airliner could not legally fly more than sixty minutes’ flying time from the nearest adequate airport. The reasoning was crude but defensible: with only two engines, lose one and you are flying on the remaining one, and the further you are from a runway the longer you are exposed. The sixty-minute rule kept twins hugging the coastlines while four-engine jets took the direct ocean routes.
Then came ETOPS — Extended-range Twin-engine Operational Performance Standards — and the door to longer routes opened in 1985. But it did not open on confidence. It opened on data. The governing quantity is the in-flight shutdown rate: how often, per thousand engine-hours of fleet operation, an engine has to be shut down in flight. A longer diversion tier — the maximum single-engine flying time a twin may legally be from a runway — is granted only when the airplane-engine combination demonstrates an in-flight-shutdown rate at or below a target, and the target gets tighter as the minutes grow: on the order of 0.05 shutdowns per thousand engine-hours for the shorter tiers, tightening to 0.02 at 180 minutes and 0.01 for the longest-range ones. Higher authority demands a lower demonstrated failure rate. That is consequence-scaling, written into the rule.
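The tier structure can be read as a lookup from a demonstrated shutdown rate to the widest diversion time that rate supports. Here is a minimal sketch in Python using the thresholds quoted above. The 120-minute label for the "shorter tiers", the 240-minute stand-in for the "longest-range" ones, and the function names are illustrative assumptions, not regulatory text, and the rate is treated as the only criterion, which is a deliberate simplification.

```python
# (max diversion minutes, max in-flight shutdowns per 1,000 engine-hours).
# Rates follow the figures quoted in the text. The 120 and 240 minute labels
# are assumptions made for illustration.
TIERS = [
    (120, 0.05),
    (180, 0.02),
    (240, 0.01),
]
BASELINE_MINUTES = 60  # the original sixty-minute rule


def ifsd_rate(shutdowns: int, engine_hours: float) -> float:
    """In-flight shutdowns per 1,000 fleet engine-hours."""
    if engine_hours <= 0:
        raise ValueError("engine_hours must be positive")
    return shutdowns / engine_hours * 1000.0


def widest_tier(shutdowns: int, engine_hours: float) -> int:
    """Widest diversion time (minutes) whose rate target the fleet record meets."""
    rate = ifsd_rate(shutdowns, engine_hours)
    allowed = BASELINE_MINUTES
    for minutes, max_rate in TIERS:  # targets tighten as minutes grow
        if rate <= max_rate:
            allowed = minutes
    return allowed


if __name__ == "__main__":
    # 4 shutdowns in 250,000 engine-hours is a rate of 0.016 per 1,000 hours.
    print(widest_tier(4, 250_000))
```

Under these assumed tiers, a record of 4 shutdowns in 250,000 engine-hours clears the 0.02 target but not the 0.01 one, so the lookup stops at 180 minutes.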
And the watching never stops. The reporting requirement keeps the world fleet under continuous surveillance: tracking continues beyond 250,000 engine-hours of fleet operating experience, until a stable shutdown rate is shown. A tier earned is not a tier owned forever; if the rate drifts upward, the authority comes into question. The lesson for AI is almost embarrassingly direct. The autonomy budget is a function of a measured, stable, low failure rate over accumulated experience. It is not a function of a demo that went well once, or of a rate you stopped measuring the day after you shipped.
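Continuous monitoring of this kind can be sketched as a rolling window over reporting periods. The class below is a hypothetical illustration: the window length, the per-period reporting shape, and the names are assumptions, and a real programme would also test for stability rather than compare a single number to a target.

```python
from collections import deque


class RollingRateMonitor:
    """Failures per 1,000 exposure-hours over the most recent reporting periods."""

    def __init__(self, target_rate: float, window_periods: int = 12):
        self.target_rate = target_rate
        # deque(maxlen=...) silently drops the oldest period once the window is full
        self.periods = deque(maxlen=window_periods)

    def record(self, failures: int, exposure_hours: float) -> None:
        self.periods.append((failures, exposure_hours))

    def rate(self) -> float:
        hours = sum(h for _, h in self.periods)
        if hours <= 0:
            raise ValueError("no exposure recorded yet")
        return sum(f for f, _ in self.periods) / hours * 1000.0

    def authority_in_question(self) -> bool:
        """True when the rolling rate has drifted above the target."""
        return self.rate() > self.target_rate


if __name__ == "__main__":
    monitor = RollingRateMonitor(target_rate=0.02, window_periods=3)
    for _ in range(3):
        monitor.record(failures=0, exposure_hours=20_000)
    print(monitor.authority_in_question())  # clean record so far
    monitor.record(failures=2, exposure_hours=20_000)
    print(monitor.authority_in_question())  # two failures push the window over target
```

The design point is that the window keeps moving: a clean history ages out, so the authority has to be re-earned from recent experience rather than banked once.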
Extended range is not a reward for good engineering. It is what a demonstrated, continuously monitored, low failure rate buys. Stop measuring the rate and you have stopped earning the minutes.
How Banking Prices the Same Thing: Backtested Exceptions
Banking reached the identical accounting from the other side of the world, under its own regulatory pressure. A bank that runs a risk model stating, say, a 99% one-day Value-at-Risk is making a precise probabilistic claim: the day's loss should exceed this number on only about one trading day in a hundred. Regulators do not take that claim on faith. They backtest it: they count the days the actual loss blew through the stated bound, expect roughly two and a half such exceptions a year (1% of 250 trading days), and run a traffic-light regime around the count. Stay in the green zone and the model keeps its authority. Drift into yellow or red, with too many exceptions over the rolling window, and the bank is forced to recalibrate and to hold more capital against the model it can no longer fully trust.
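The backtest described above reduces to a count and a zone lookup. The sketch below assumes the zone boundaries commonly cited from the Basel backtesting framework for a 99% one-day VaR over 250 observations (0 to 4 green, 5 to 9 yellow, 10 or more red). Those boundaries are not stated in the text above, and the function names are hypothetical.

```python
from typing import Sequence

EXPECTED_EXCEPTIONS = 250 * 0.01  # about 2.5 per year at a 99% level


def count_exceptions(losses: Sequence[float], var_bounds: Sequence[float]) -> int:
    """Number of days the realized loss exceeded that day's stated VaR bound."""
    if len(losses) != len(var_bounds):
        raise ValueError("need one VaR bound per observed loss")
    return sum(1 for loss, bound in zip(losses, var_bounds) if loss > bound)


def traffic_light(exceptions: int) -> str:
    """Zone for a 99% one-day VaR backtested over 250 trading days.

    Boundaries are an assumption taken from the commonly cited Basel
    convention: 0-4 green, 5-9 yellow, 10+ red.
    """
    if exceptions < 0:
        raise ValueError("exceptions cannot be negative")
    if exceptions <= 4:
        return "green"
    if exceptions <= 9:
        return "yellow"
    return "red"


if __name__ == "__main__":
    # Toy year: the stated bound is 2.0 every day and is breached on 5 days.
    losses = [1.0] * 245 + [5.0] * 5
    bounds = [2.0] * 250
    n = count_exceptions(losses, bounds)
    print(n, traffic_light(n))
```

In the toy year, five breaches against an expectation of about two and a half is enough to leave the green zone, which is the automatic contraction of authority the text describes.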
The structure is identical to ETOPS, term for term. A stated tolerance scaled to consequence. An empirical failure rate measured against it over a rolling body of real experience. Authority that widens only on a demonstrated low rate and contracts automatically on drift. A bound, in banking, is not a claim. It is a measured, regularly verified commitment, and the verification is performed by someone structurally independent of the people who built the model — effective challenge, in the supervisory language of SR 11-7 and its 2026 successor.
State the AI translation plainly. An AI vendor who asserts a low error rate without an ongoing regime to measure that rate against reality is, in banking terms, running an unvalidated model in production. That is not a selling point; it is an examiner finding. Autonomy you cannot backtest is autonomy you have not earned — you have merely asserted it, and an assertion is precisely what both aviation and banking spent decades learning never to accept.
The Dividend: Reliability Pays
It is tempting to read all of this as overhead — reliability as a tax that careful people pay and bold people skip. Aviation tells the opposite story. Provable reliability is not the cost; it is what unlocks the cheaper, faster, better operation. The reliability is the thing that pays.
On 1 December 2015, Air New Zealand became the first airline ever to fly a scheduled ETOPS-330 service — Auckland to Buenos Aires, a Boeing 777-200ER on Rolls-Royce Trent 800 engines, the longest extended-range authority then operating on the 777. The airline did not leap there. It had flown 240-minute ETOPS routes on that airframe-engine combination from October 2014, accumulating roughly a year of operating experience before receiving the 330-minute approval in November 2015. And the payoff for that earned budget was geometric: a 330-minute diversion allowance let the twin fly the direct great-circle line across the empty Southern Ocean instead of bending the route into a fuel-wasting dogleg to stay within reach of a runway it would almost certainly never need.
That straight line is money. A modern twin burns substantially less fuel than a four-engine jet over the same sector — the gap runs to tens of percent per seat — and because engine-related costs are a major share of maintenance, carrying two engines instead of four compounds the saving across fuel, spares, and overhaul. Every diversion-minute earned through reliability data converts directly into fuel saved, time saved, and carbon not burned. The reliability the airline demonstrated is what let it bank the efficiency.
The AI parallel is exact. A provably reliable agent earns a wider autonomy budget, and a wider budget is the efficiency dividend: fewer human checkpoints in the loop, more tasks completed end-to-end without a hand-back, lower supervision cost per unit of work. The teams that measure their agents’ failure rates and demonstrate them stable and low are not slowing themselves down. They are the only ones who get to fly the direct route. Reliability accounting is how you bank the savings.
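One way to make "demonstrated stable and low" concrete for an agent is to set the budget from an upper confidence bound on the failure rate rather than from the observed rate itself. The sketch below uses a one-sided Clopper-Pearson bound computed by bisection. The budget tiers, the 95% confidence level, and every name here are hypothetical. It also assumes that failures are found by an independent audit of completed tasks, since a failure nobody detects cannot be counted.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def upper_bound(failures: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper bound on the per-task failure probability."""
    if trials <= 0 or not 0 <= failures <= trials:
        raise ValueError("need trials > 0 and 0 <= failures <= trials")
    if failures == trials:
        return 1.0
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(60):  # the CDF decreases in p, so bisect on it
        mid = (lo + hi) / 2.0
        if binom_cdf(failures, trials, mid) > alpha:
            lo = mid  # this p is still consistent with the record
        else:
            hi = mid
    return hi


# Hypothetical tiers: (max tolerated failure bound, steps allowed between human checkpoints)
BUDGET_TIERS = [(0.05, 5), (0.01, 25), (0.001, 100)]


def autonomy_budget(failures: int, trials: int, confidence: float = 0.95) -> int:
    bound = upper_bound(failures, trials, confidence)
    steps = 1  # default: a human checkpoint after every step
    for max_p, allowed in BUDGET_TIERS:
        if bound <= max_p:
            steps = allowed
    return steps


if __name__ == "__main__":
    print(autonomy_budget(0, 50))   # 1: a clean demo of 50 tasks earns nothing yet
    print(autonomy_budget(0, 300))  # 25: the same clean record over 300 tasks does
```

Using the bound rather than the point estimate is the software version of requiring accumulated fleet hours: zero observed failures over a short record leaves the bound too wide to buy a longer leash.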
Provable reliability is not a tax on autonomy. It is the thing that buys it. The direct route — fewer checkpoints, faster completion — is the dividend the failure-rate data pays out.
Account for the Tail
There is a way to do reliability accounting badly, and it is the most natural way: price the average and ignore the tail. A low mean failure rate can sit comfortably on top of a rare, catastrophic failure mode that the average quietly buries. Aviation has a hard case that makes the point, and it has to be told accurately.
On 7 October 2013, a Royal New Zealand Air Force No. 40 Squadron Boeing 757 — not a civilian Air New Zealand aircraft — flew callsign NZ7571 from Christchurch toward Pegasus Field on the Ross Ice Shelf, Antarctica, with 130 people aboard. The 757 could not return to Christchurch without refuelling at Pegasus, so a point of safe return was computed before departure: a hard line past which the designed fallback — turning around and going home — was no longer reachable. Forecasters assured the crew the weather at Pegasus would improve and cleared the flight past that point. Roughly twenty minutes later, observations showed a fog bank had enveloped the runway and its approaches. The crew flew three approaches; on the third, at about 110 feet, they acquired the approach lighting and landed below published minima in near-whiteout. There was no damage and no injuries. It was a successful recovery — and that is the point, not a tragedy to dramatize. (It should not be confused with the 1979 Mount Erebus disaster, an unrelated Air New Zealand DC-10 navigation accident that killed 257.)
The inquiry found the crew’s decisions appropriate. What it faulted was the original risk assessment: it had gaps — no alternative approach procedures or aerodromes suitable for a 757, and an under-weighting of how quickly early-season Antarctic weather can turn. In other words, the plan committed past an irreversible point on a forecast that then diverged from reality, with the designed fallback already foreclosed by fuel and range, and an inadequate set of alternates behind it. The average mission to Antarctica is uneventful. The tail mission is the one that defines the risk.
Map that onto autonomous agents and it stops being a flying story. The NZ7571 failure mode, rendered in software, is an agent that commits past a rollback horizon on predicted rather than observed conditions, with no reachable safe-harbour and an under-specified fallback set — a long-running tool-use chain that crosses an irreversible operation because the upstream signal it was told to expect probably holds, and finds out twenty steps later that it did not. A reliability ledger that counts only the average task will price that mode at roughly zero, right up until it dominates the entire loss. Banking made exactly this correction when it moved from Value-at-Risk, which marks the loss quantile but says nothing about how bad losses get beyond it, to Expected Shortfall, which averages the tail beyond that quantile. The discipline is the same in software: govern the tail, and abstain — refuse to cross the irreversible point — when the forecast the decision rests on cannot be verified at the moment of commitment.
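The difference between the two measures is easy to see on a toy sample. The sketch below is a simplified empirical estimator, not a regulatory calculation: it takes the worst (1 - level) fraction of observations as the tail, reports the largest loss outside that tail as VaR, and reports the tail mean as Expected Shortfall. The function name and the loss figures are made up for illustration.

```python
from typing import Sequence, Tuple


def var_and_es(losses: Sequence[float], level: float = 0.99) -> Tuple[float, float]:
    """Empirical VaR and Expected Shortfall for a sample of per-task losses."""
    n = len(losses)
    if n == 0:
        raise ValueError("need at least one observation")
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    ordered = sorted(losses)
    # Number of observations in the tail, at least one and at most all of them.
    k = min(n, max(1, round(n * (1.0 - level))))
    tail = ordered[n - k:]
    var = ordered[max(n - k - 1, 0)]  # largest loss outside the tail
    es = sum(tail) / k                # average loss inside the tail
    return var, es


if __name__ == "__main__":
    # Two hypothetical agents over 1,000 tasks each. They differ only in how bad
    # their ten worst outcomes are.
    agent_a = [0.0] * 980 + [1.0] * 10 + [2.0] * 10
    agent_b = [0.0] * 980 + [1.0] * 10 + [500.0] * 10
    print(var_and_es(agent_a))  # (1.0, 2.0)
    print(var_and_es(agent_b))  # (1.0, 500.0)
```

Both agents report the same VaR, so a ledger that stops at the quantile prices them identically. Only the tail average separates the agent whose bad days are mildly bad from the one whose bad days are ruinous.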
That is reliability you can bank: an autonomy budget earned against a measured failure rate, monitored continuously, scaled to consequence, and priced at the tail rather than the mean. The in-depth companion develops the full account: the ETOPS in-flight-shutdown ledger, Basel backtesting and the move to Expected Shortfall, the selective-prediction machinery that lets an agent measure and bound its own undetected-failure rate, and the deployable discipline that ties them together. Read it at Priced in Failure-Rate Data.
A low average failure rate is not a safe one if it hides a rare, irreversible failure mode. Price the tail, not the mean — and abstain when you would commit past the point of safe return on a forecast you cannot verify.
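The abstention rule can be sketched as a gate that an agent runtime consults before each step. Everything here is hypothetical: the field names, the idea that a step can report whether its precondition was observed or merely forecast, and the specific policy, which abstains only when all three protections are missing.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    ABSTAIN = "abstain"


@dataclass(frozen=True)
class Step:
    name: str
    reversible: bool             # can the step be rolled back after it runs?
    precondition_observed: bool  # was the condition it relies on checked now, not forecast?
    fallback_reachable: bool     # is a tested safe harbour still reachable afterwards?


def commit_gate(step: Step) -> Decision:
    """Refuse to cross an irreversible point on an unverified forecast with no alternate."""
    if step.reversible:
        return Decision.PROCEED
    if step.precondition_observed or step.fallback_reachable:
        return Decision.PROCEED
    return Decision.ABSTAIN


if __name__ == "__main__":
    risky = Step("apply_migration", reversible=False,
                 precondition_observed=False, fallback_reachable=False)
    print(commit_gate(risky))  # Decision.ABSTAIN
```

A stricter policy would require both an observed precondition and a reachable fallback before any irreversible step. Which one is right depends on the consequence of the task class, which is the same scaling the diversion tiers apply.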