Priced in Failure-Rate Data: The Reliability Accounting Behind Earned Autonomy

KellerAI

Section 01

Autonomy Is Priced in Failure-Rate Data

There is a persistent instinct in AI deployment to set an agent’s autonomy by feel. Let it run unattended for an hour. Let it touch staging but not production. Let it open pull requests but not merge them. Each of these is a judgment about how much independent authority a system should hold, and almost all of them are made the way an early-stage engineering team makes most decisions—by intuition, calibrated against a handful of demonstrations that went well. The demonstrations are real, and the intuitions are often good. But the procedure has no unit of account. It cannot say how much autonomy the evidence supports, and it cannot say what new evidence would justify widening the budget or force contracting it.

Aviation rejected that instinct sixty years ago, and the rejection is instructive because it was forced by physics rather than chosen by preference. A twin-engine jet does not get to fly five and a half hours from the nearest runway because a regulator judged it broadly safe. It earns those minutes against a number: the world-fleet in-flight-shutdown (IFSD) rate for that specific airframe-engine combination, tracked across hundreds of thousands of engine-hours of accumulated operating experience, and held at or below a target that tightens as the permitted distance grows. 1 3 The autonomy to operate far from a runway is a line item on a reliability ledger, and the entry on that line is a measured failure rate, not an assurance.

Banking arrived at the identical move under different pressure. A bank does not get to use an internal risk model to set its own regulatory capital because the model’s builders believe it is accurate. It earns that privilege by backtesting—by counting, over a rolling window, how often realized losses exceeded the model’s stated quantile, and demonstrating that the empirical exception rate matches the claimed tolerance. 13 The supervisory guidance that frames this in the United States, SR 11-7 and its 2026 successor SR 26-2, treats any consequential quantitative model—including traditional machine-learning systems used for credit, fraud, and risk—as something whose use must be earned through independent validation and ongoing measurement, not granted on the strength of the developer’s confidence. 11 12 SR 26-2 stops short of pulling generative and agentic AI inside that framework—a deliberate scope choice this paper returns to—but the accounting discipline it codifies for the models it does cover is exactly the one AI agent autonomy needs.

This paper is about the discipline those two industries share, and about why it is the right discipline for governing AI agent autonomy. We call it reliability accounting. A companion paper in this series established that hallucination is best understood as the absence of a certificate, and that the object of control is the rate of undetected false assertions: errors that reach a decision-maker without a warning. Reliability accounting takes that object and asks the operational question the certificate framing leaves open: how much autonomy is a system allowed, and on what evidence? The answer has four moving parts. The unit of analysis is a rate: undetected failures per task run. It is measured over accumulated operating experience, not inferred from a successful demo. Its tolerance is scaled to consequence, tighter for higher blast radius. And it carries a tail term, because a low average failure rate is not a safe one if the rare failure is catastrophic.

A wider autonomy budget is not asserted; it is priced. You earn it by measuring a stable, low rate of undetected failure on a task class over real operating experience—and you price the tail, not just the average.

The accounting move

The claim of this paper is that three fields—extended-range aviation operations, market-risk regulation, and the machine-learning literature on selective prediction—have independently converged on this same accounting. 1 13 23 ETOPS prices diversion minutes in IFSD-rate data. Basel prices model use in backtested exceptions. Conformal and selective-prediction methods price an answer-or-abstain decision in measured, distribution-free risk-coverage tradeoffs. They use different vocabularies and were built by people who mostly never spoke to each other, which is precisely why the convergence is evidence rather than fashion. The structure is forced by the problem, not by a community’s conventions.

It is worth saying at the outset what this discipline is not. It is not a demand for perfection, and it is not a claim that a sufficiently careful agent can be made to never fail. The opposite, in fact: reliability accounting begins by conceding that failures will occur and asks instead what their rate is, whether that rate is measured, whether it is stable, and whether the authority granted to the system is proportionate to it. Aviation does not ground twins because engines occasionally fail; it lets them fly precisely as far as the demonstrated shutdown rate permits, and no farther. 1 Banking does not forbid risk models because they are sometimes wrong; it permits exactly the model use that the backtested exception rate supports. 13 The analogous posture for AI is neither the credulous one—trust the agent because the demos impressed—nor the paralyzed one—forbid autonomy because the model can hallucinate—but the actuarial one: measure the rate, price the budget to it, and revise both as the evidence accumulates.

Section 02

ETOPS: The Original Reliability Ledger

Until the mid-1980s, a twin-engine airliner in the United States could not legally fly a route that took it more than sixty minutes’ flying time—single engine, still air—from an adequate airport. The constraint lived in FAR 121.161, and its logic was conservative in the most literal sense: with only two engines, the loss of one halves the available thrust, and the rule kept the airplane close enough to a runway that the remaining engine could be relied upon to reach it. 2 The cost was geographic. Twins were effectively barred from the long over-water and over-wilderness routes—the Atlantic narrows, the Pacific, the polar regions—that the most efficient airplanes most wanted to fly. Those routes belonged to three- and four-engine aircraft, which carried the extra engines precisely to buy the diversion range the rule demanded.

FAA Advisory Circular 120-42, issued in 1985, opened the door beyond sixty minutes, and TWA flew the first ETOPS-120 service—Boston to Paris on a Boeing 767—that February. 2 But the door did not open on engineering optimism; it opened on data. The advisory circular, and the 2007 final rule that codified ETOPS into 14 CFR, built the approval on a relative-risk model whose central input is a target IFSD rate. 1 2 The longer the diversion time an operator wants, the tighter that target. The rolling twelve-month world-fleet IFSD rate must sit at or below approximately 0.05 per 1,000 engine-hours to support diversion times up to 120 minutes, at or below approximately 0.02 per 1,000 engine-hours for up to 180 minutes, and at or below approximately 0.01 per 1,000 engine-hours beyond 180 minutes. 3 Autonomy from the runway is sold by the minute, and the price is quoted in shutdowns per thousand hours.
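The tiered ceilings quoted above can be read as a lookup from a measured rate to the widest diversion budget it supports. The following is a minimal sketch under stated assumptions: the function and constant names are hypothetical, the ceilings are the approximate figures from the text, and a real ETOPS approval depends on far more than this single comparison.

```python
# Approximate IFSD-rate ceilings quoted in the text, in shutdowns per
# 1,000 engine-hours, ordered from the tightest ceiling to the loosest.
# The labels and the fallback string are illustrative, not regulatory text.
IFSD_CEILINGS = [
    (0.01, "beyond 180 minutes"),
    (0.02, "up to 180 minutes"),
    (0.05, "up to 120 minutes"),
]


def ifsd_rate_per_1000h(shutdowns: int, engine_hours: float) -> float:
    """In-flight shutdowns per 1,000 engine-hours over the reporting window."""
    if engine_hours <= 0:
        raise ValueError("engine_hours must be positive")
    return 1000.0 * shutdowns / engine_hours


def supported_diversion_budget(rate_per_1000h: float) -> str:
    """Return the widest diversion budget whose ceiling the rate satisfies."""
    for ceiling, budget in IFSD_CEILINGS:
        if rate_per_1000h <= ceiling:
            return budget
    # Assumption for illustration: above every ceiling, only the
    # pre-ETOPS sixty-minute rule applies.
    return "60-minute baseline only"


if __name__ == "__main__":
    rate = ifsd_rate_per_1000h(shutdowns=3, engine_hours=400_000)
    print(rate, supported_diversion_budget(rate))
```

The lookup walks from the tightest ceiling to the loosest, so a fleet is credited with the widest budget its measured rate can support and nothing wider.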

It is worth being precise about what “diversion time” means, because the precision is the point. It is the maximum time the airplane may be from an adequate alternate airport, computed for a single engine inoperative, in still air, at the approved one-engine-inoperative cruise speed. 1 It is not a distance and not an average; it is a worst-credible-case reachability budget. The tier names—ETOPS-120, -180, -240, -330, -370—are exactly these budgets in minutes, and the history of the tiers is the history of an earned ladder. The Boeing 777 entered service in 1995 certified for ETOPS-180 from day one, the first airliner to do so, which alone opened roughly ninety-five percent of the Earth’s surface to twin-engine operation. 4 The Airbus A330 reached 240 minutes; the 777 was approved for up to 330 minutes on its GE engines in December 2011; the Airbus A350 became the first new type cleared for up to 370 minutes before entry into service in 2014. 6 7 Each rung was a wider autonomy budget, and each was granted against a demonstrated, accumulated reliability record rather than a promise.

The most important feature of the ledger, for our purposes, is that the accounting never stops. ETOPS approval is not a certificate you earn once and file away. 14 CFR 21.4 requires the airframe and engine manufacturers to operate a continuing reporting system: monthly world-fleet reliability data, an IFSD rate computed on a rolling basis, and a problem-tracking-and-resolution process that, by regulation, remains in effect beyond 250,000 engine-hours of fleet operating experience until a stable IFSD rate is demonstrated. 3 Read that clause slowly. The autonomy budget is not unlocked by a single impressive number; it is unlocked by a number that has been shown to be stable over a quarter-million hours of real operation, and the obligation to keep watching persists until that stability is proven. If the fleet rate drifts upward, the approval can be constrained.

You cannot buy an ETOPS tier; you must fly the hours. The budget is a function of a measured, stable, low failure rate over accumulated experience—and the watching continues beyond a quarter-million engine-hours until that stability is shown.

The ledger’s discipline

That is the whole shape of reliability accounting in one institution: a consequence-scaled tolerance (tighter as the minutes grow), an empirical rate measured over accumulated experience (the IFSD rate), authority widened only when the rate is demonstrably stable and low, and continuous monitoring that can contract authority on drift. 1 3 Everything that follows—banking, machine learning, AI agents—is this same structure wearing different clothes.

Section 03

Banking: Backtesting Is Reliability Accounting Under Another Name

Banking built the same ledger independently, for money rather than for engines, and the parallel is close enough to be uncanny. A market-risk model states a Value-at-Risk: a loss threshold that, the model claims, will not be exceeded on more than one percent of days—a 99% one-day VaR. That claim is not taken on faith. Under the Basel Committee’s 1996 backtesting framework, the bank counts, over a rolling 250-trading-day window, how many days the realized loss exceeded the stated threshold. 13 If the model is well-calibrated, roughly two and a half exceptions per year are expected. The framework then sorts the observed count into a traffic light: zero to four exceptions is the green zone and the model stands; five to nine is the yellow zone and the regulatory capital multiplier rises; ten or more is the red zone and the model is presumed broken until recalibrated. 13 The bound is not a claim a bank gets to make. It is a measured commitment a bank gets to keep, audited continuously, with rising costs as the evidence turns against it.
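The traffic-light rule described above reduces to a count and a three-way classification. A minimal sketch, assuming losses and VaR thresholds are supplied as positive numbers for the same days; the function names are hypothetical, and the zone boundaries are the ones stated in the text for a 250-day window at 99% VaR.

```python
from typing import Sequence


def count_exceptions(realized_losses: Sequence[float],
                     var_thresholds: Sequence[float]) -> int:
    """Count the days on which the realized loss exceeded that day's VaR."""
    if len(realized_losses) != len(var_thresholds):
        raise ValueError("need one VaR threshold per observed loss")
    return sum(1 for loss, var in zip(realized_losses, var_thresholds)
               if loss > var)


def traffic_light_zone(exceptions: int) -> str:
    """Classify a 250-day exception count using the boundaries in the text."""
    if exceptions < 0:
        raise ValueError("exception count cannot be negative")
    if exceptions <= 4:
        return "green"
    if exceptions <= 9:
        return "yellow"
    return "red"


# Expected count for a well-calibrated 99% VaR over 250 trading days.
EXPECTED_EXCEPTIONS = 0.01 * 250  # 2.5

if __name__ == "__main__":
    for observed in (2, 6, 11):
        print(observed, traffic_light_zone(observed))
```

Recomputing the count each day over the trailing 250 observations is what makes the instrument a standing one: the zone can change without anyone deciding to re-examine the model.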

The statistical machinery underneath the traffic light is worth naming, because it is the same hypothesis-testing logic that the machine-learning literature would later rediscover. Kupiec’s 1995 proportion-of-failures test is a likelihood-ratio check of whether the empirical exception rate is consistent with the model’s stated tolerance—an unconditional coverage test. 14 Christoffersen’s 1998 conditional-coverage test adds the requirement that exceptions not cluster in time, because real losses bunch up when market regimes shift, and a model that passes the count test while failing the clustering test is miscalibrated precisely when miscalibration is most dangerous. 15 Together they formalize what the ETOPS reporting rule expresses in plainer language: it is not enough for the average rate to look acceptable; the rate must be stable, and its failures must not concentrate.
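The unconditional-coverage check can be written in a few lines. The sketch below implements the standard likelihood-ratio form of the proportion-of-failures test, comparing the stated tolerance p against the observed exception frequency; the function names are hypothetical, and the 3.841 cutoff is the 95% critical value of a chi-square distribution with one degree of freedom. It does not implement the clustering (conditional-coverage) check.

```python
import math

# 95% critical value of the chi-square distribution with 1 degree of freedom.
CHI2_1DOF_95 = 3.841


def _xlogy(x: float, y: float) -> float:
    """x * log(y), with the convention that the term is 0 when x == 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_pof_statistic(exceptions: int, n_obs: int, p: float) -> float:
    """Likelihood-ratio statistic for H0: true exception rate equals p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if n_obs <= 0 or not 0 <= exceptions <= n_obs:
        raise ValueError("need 0 <= exceptions <= n_obs with n_obs > 0")
    x, n = exceptions, n_obs
    p_hat = x / n
    # Binomial log-likelihood under the stated rate and the observed rate.
    log_lik_null = _xlogy(n - x, 1.0 - p) + _xlogy(x, p)
    log_lik_alt = _xlogy(n - x, 1.0 - p_hat) + _xlogy(x, p_hat)
    return -2.0 * (log_lik_null - log_lik_alt)


def rejects_stated_rate(exceptions: int, n_obs: int, p: float) -> bool:
    """True when the observed count is inconsistent with p at the 95% level."""
    return kupiec_pof_statistic(exceptions, n_obs, p) > CHI2_1DOF_95


if __name__ == "__main__":
    for x in (2, 5, 10):
        print(x, round(kupiec_pof_statistic(x, 250, 0.01), 3),
              rejects_stated_rate(x, 250, 0.01))
```

With only 250 observations the test has limited power, which is one reason the count alone is not treated as sufficient and the clustering check is layered on top.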

The supervisory spine that governs all of this in the United States is SR 11-7, the 2011 interagency guidance on model risk management, now superseded by SR 26-2, issued in April 2026. 11 12 SR 11-7’s definition of a model is deliberately broad—any quantitative method that turns input data into estimates—and it covers vendor-supplied and third-party models without exception, which is why it applies to deployed language models without modification. 11 Its central demand is effective challenge: the party that validates a model’s outputs must be structurally independent of the party that built it. A model cannot mark its own homework. SR 26-2 keeps that spine and makes the oversight explicitly risk-based—the intensity of validation scales with the model’s materiality. Traditional machine-learning models stay squarely in scope, but the 2026 guidance deliberately places generative and agentic AI outside its scope, treating them as too novel and fast-moving to govern under the model-risk framework as written, and instead directs institutions to apply their existing risk-management and governance practices to those systems while the agencies prepare further guidance. 12 That exclusion is itself the warning this paper answers: the obligation to measure an agentic system’s failure rate does not disappear because the supervisory letter declined to name it—it falls to the deploying institution to build the accounting the guidance left open.

Two features of the banking regime are easy to miss and load-bearing for the AI translation. The first is that the obligation to govern cannot be outsourced. SR 11-7 covers vendor and third-party models explicitly; a bank that buys a model it cannot validate, explain, or audit still owns the model risk, regardless of who built it. 11 The implication for an enterprise deploying a purchased AI agent is exact: procuring the agent does not transfer the duty to measure its failure rate. The buyer earns the autonomy budget, or no one does. The second is that the traffic light is not a one-time gate but a standing instrument. A model that passed backtesting last year is not thereby exempt this year; the exception count is recomputed on a rolling window, and a model that drifts into the yellow or red zone loses the capital treatment it had earned. 13 Authority granted by reliability accounting is always provisional, always contingent on the rate continuing to hold.

The translation to AI deployment is blunt. A vendor who asserts a low error rate for an AI system, and offers no ongoing regime that measures that rate against reality, is—in the exact language of bank supervision—running an unvalidated model in production. That is an examiner finding, not a product feature. 11 Autonomy you cannot backtest is autonomy you have not earned, because under this framework the right to use a model at a given level of consequence is contingent on a demonstrated, independently validated, continuously monitored measure of how often it is wrong.

Section 04

The Shared Accounting Move

Strip both systems to their skeleton and the same four-step procedure appears. First, state a tolerance scaled to consequence: an IFSD-rate ceiling that tightens with diversion minutes; a VaR exception budget that maps to a capital multiplier. 1 13 Second, measure the empirical failure or exception rate against that tolerance over real, accumulated experience—quarter-million-hour fleet monitoring in one case, a rolling 250-day backtest in the other. 3 13 Third, widen authority only when the measured rate is demonstrated stable and low. Fourth, monitor continuously and contract authority when the rate drifts—the ETOPS approval constrained on a rising fleet rate, the Basel multiplier climbing as exceptions accumulate. 3 13 One procedure, two institutions, no shared committee.

The currency in both systems is accumulated operating experience. This is the feature most easily lost in translation to software, and the most important to keep. You cannot buy an ETOPS tier; you must fly the engine-hours and let the fleet rate prove itself. You cannot assert a VaR model into regulatory acceptance; you must run it against live markets for enough days that the exception count means something. The autonomy budget is denominated in experience, and experience cannot be front-loaded. A demonstration that goes well once tells you almost nothing about a rate, because a rate is a property of a distribution, and a single draw is not a distribution.
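The claim that a single successful draw says almost nothing about a rate can be made numerical. A minimal sketch, assuming runs are independent and identically distributed (an assumption real agent workloads may violate): after n runs with zero observed failures, the exact one-sided upper confidence bound on the failure rate is 1 - alpha^(1/n). Function names are hypothetical.

```python
import math


def zero_failure_upper_bound(n_runs: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the failure rate after n failure-free runs.

    Solves (1 - r)^n = alpha for r, assuming independent runs.
    """
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_runs)


def failure_free_runs_needed(target_rate: float,
                             confidence: float = 0.95) -> int:
    """Smallest number of failure-free runs that bounds the rate by target."""
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(1.0 - target_rate))


if __name__ == "__main__":
    for n in (1, 30, 300, 3000):
        print(n, zero_failure_upper_bound(n))
    print(failure_free_runs_needed(0.01))
```

One clean demo is compatible, at 95% confidence, with a failure rate as high as 0.95; bounding the rate below one percent takes roughly three hundred failure-free runs, which is the arithmetic behind the statement that experience cannot be front-loaded.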

The machine-learning literature supplies the third instance of the same move, in the framework of selective prediction. A selective predictor is a model paired with a gate that may either answer or abstain. The quantities of interest are risk—the error rate among the answers it does give—and coverage—the fraction of inputs on which it answers rather than abstains. 23 The two trade off against each other, and the discipline is to fix a tolerable risk and accept whatever coverage that buys. Conformal methods make this rigorous: given a held-out calibration set drawn exchangeably from the deployment distribution, a conformal abstention policy bounds the error rate among answered queries to a pre-specified tolerance, with a finite-sample, distribution-free guarantee that requires no parametric assumptions about the data. 21 FactTest casts the same problem as a hypothesis test with controlled Type-I error—the rate at which a false claim is wrongly certified—again with distribution-free finite-sample guarantees. 22
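Risk and coverage, as defined above, are simple to compute for a gate that answers when a confidence score clears a threshold. The sketch below is illustrative only: the threshold gate, the score values, and the function name are assumptions, and it reports empirical quantities without any of the finite-sample guarantees the cited methods provide.

```python
from typing import Optional, Sequence, Tuple


def risk_and_coverage(confidences: Sequence[float],
                      correct: Sequence[bool],
                      threshold: float) -> Tuple[Optional[float], float]:
    """Empirical selective risk and coverage for a threshold gate.

    The gate answers when confidence >= threshold and abstains otherwise.
    Risk is the error rate among answered items (None if nothing is answered);
    coverage is the fraction of items answered.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    answered = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(answered) / len(confidences)
    if not answered:
        return None, coverage
    risk = sum(1 for ok in answered if not ok) / len(answered)
    return risk, coverage


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
    labels = [True, True, False, True, False, False]
    for t in (0.75, 0.55, 0.0):
        print(t, risk_and_coverage(scores, labels, t))
```

Raising the threshold in this toy data lowers risk and lowers coverage together, which is the tradeoff the paragraph describes: fix the tolerable risk first, then accept the coverage it buys.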

Map the three onto each other and the correspondence is structural, not merely thematic. The ETOPS IFSD-rate target, the Basel exception budget, and the conformal risk tolerance are the same object: a consequence-scaled ceiling on a measured failure rate. 1 13 21 The 250,000-engine-hour monitoring requirement and the rolling 250-day backtest window are the same object: an accumulated-experience base over which the rate is estimated and its stability judged. 3 13 The ETOPS approval-constraint-on-drift and the Basel traffic-light multiplier and conformal recalibration-under-shift are the same object: a continuous-monitoring rule that contracts authority when the measured rate degrades. 3 13 23 Three fields, one accounting.

There is a deeper reason the three coincide, and it connects this paper to the certificate framing of the companion paper. The ETOPS IFSD rate is a ledger of detected reliability events (an engine the crew shuts down is, by definition, a handled failure), and that detectability is exactly why it can be counted and bounded. The discipline aviation adds on top of that, and the one that maps most directly to AI, is the bounding of error that escapes detection, which aviation calls the failure of integrity. A navigation system’s integrity requirement is explicitly a bound on the probability of being wrong without a timely alert. 20 Required Navigation Performance makes this numerical: the receiver’s autonomous integrity monitoring cross-checks redundant measurements and issues an alert, forcing a missed approach rather than a guess, when self-consistency cannot be confirmed. 20 That is the same abstention rule conformal prediction encodes, and the same one DO-178C’s design-assurance discipline encodes for systematic software faults: drive the detectable error to zero through verification, and bound the residual undetected error to a consequence-scaled tolerance. 19 The reliability ledger, in every one of these fields, is an accounting of undetected failure, the only kind that reaches a decision-maker as an unflagged assertion.

Section 05

Mapping to AI Agent Autonomy

Now the move can be applied to its target. To justify a wider autonomy budget for an agent operating on a task class, you must measure and demonstrate a stable, low rate of undetected failures on that class—the same procedure ETOPS applies to engines and Basel applies to risk models. The unit is not “Did the demo work?” but “What is the rate of undetected failure on this task class, measured over accumulated runs, and is it stable?” An agent that cannot answer that question quantitatively has not earned unattended operation, in exactly the sense that an airframe-engine combination without a demonstrated IFSD rate cannot earn a diversion tier. 1 3

Selective prediction supplies the formal machinery to make this operational rather than aspirational. The agent’s task outputs are partitioned, at the moment of action, into those it certifies and those on which it abstains—escalating to a human, deferring, or refusing. 23 The undetected-failure rate is then the selective risk: the rate of acted-on outputs that were wrong and not flagged. Conformal calibration bounds that selective risk to a target tolerance with a finite-sample, distribution-free guarantee, which is the software analogue of an IFSD-rate ceiling: a measured, accumulated-experience-backed ceiling on the rate of failures that escape detection. 21 22 Abstention is not a degradation of the agent; it is the governance signal that keeps the certified set inside its measured reliability budget. It is the “missed approach”—the principled refusal to commit when the certificate is absent.
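One way to turn a tolerance into an abstention threshold is to require that an upper confidence bound on selective risk, not just its point estimate, sit below the tolerance. The sketch below is a deliberately simple stand-in, not the conformal or FactTest procedures cited above: it uses a Hoeffding bound, splits the error budget delta evenly across the candidate thresholds (a Bonferroni correction), and assumes calibration runs are independent draws from the deployment distribution. All names are hypothetical.

```python
import math
from typing import Optional, Sequence


def hoeffding_upper_bound(errors: int, n: int, delta: float) -> float:
    """Upper bound on the true error rate, valid with probability 1 - delta."""
    return errors / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def pick_threshold(confidences: Sequence[float],
                   correct: Sequence[bool],
                   candidates: Sequence[float],
                   tolerance: float,
                   delta: float = 0.05) -> Optional[float]:
    """Lowest candidate threshold whose bounded selective risk <= tolerance.

    Returns None when no candidate qualifies, meaning the agent should
    abstain (escalate) on everything until more experience accumulates.
    """
    if not candidates:
        raise ValueError("need at least one candidate threshold")
    per_test_delta = delta / len(candidates)  # Bonferroni split
    for t in sorted(candidates):  # lowest threshold gives highest coverage
        answered = [ok for c, ok in zip(confidences, correct) if c >= t]
        if not answered:
            continue
        errors = sum(1 for ok in answered if not ok)
        bound = hoeffding_upper_bound(errors, len(answered), per_test_delta)
        if bound <= tolerance:
            return t
    return None


if __name__ == "__main__":
    conf = [0.9] * 2000 + [0.3] * 1000
    ok = [True] * 2000 + [True, False] * 500
    print(pick_threshold(conf, ok, candidates=(0.2, 0.8), tolerance=0.05))
    # With far less calibration data, even a clean record earns nothing.
    print(pick_threshold([0.9] * 100, [True] * 100, (0.2, 0.8), 0.05))
```

The second call is the instructive one: one hundred flawless runs do not clear a five percent tolerance under this bound, so the gate stays closed. The Hoeffding bound is loose; tighter binomial bounds would open the gate sooner, but the shape of the rule is the same.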

The tier structure transfers directly. ETOPS tightens its IFSD-rate target as the diversion minutes grow; the autonomy ledger should tighten its undetected-failure tolerance as the blast radius grows. 1 An agent that drafts text for human review can operate at a looser tolerance than one that opens production pull requests, which in turn operates at a looser tolerance than one permitted to merge and deploy without a human in the loop. Each wider tier of authority should demand a tighter demonstrated undetected-failure rate, measured over more accumulated task-runs—the consequence-scaling that ETOPS encodes in its 0.05-to-0.01 step and that banking encodes in its rising capital multiplier. 3 13 Reversibility is the natural axis for the tiers: operations that can be cheaply rolled back sit in lower-consequence tiers, while irreversible operations—the ones past which the designed fallback is foreclosed—demand the tightest tolerances and the most conservative abstention. 24
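A tier ladder of this kind can be expressed as data plus one lookup. The sketch below is illustrative only: the tier names follow the examples in the paragraph, but every numeric tolerance and run count is a placeholder chosen for the example, not a recommendation, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AutonomyTier:
    name: str
    max_undetected_failure_rate: float  # ceiling on the rate's upper bound
    min_task_runs: int                  # accumulated experience required


# Ordered from narrowest to widest authority. Placeholder numbers only.
TIERS = [
    AutonomyTier("draft_for_human_review", 0.05, 200),
    AutonomyTier("open_production_pull_request", 0.01, 2_000),
    AutonomyTier("merge_and_deploy_unattended", 0.001, 20_000),
]


def widest_earned_tier(rate_upper_bound: float,
                       task_runs: int) -> Optional[AutonomyTier]:
    """Widest tier whose tolerance and experience requirement are both met.

    Tiers must be earned in order: failing a rung stops the climb.
    """
    earned = None
    for tier in TIERS:
        if (rate_upper_bound <= tier.max_undetected_failure_rate
                and task_runs >= tier.min_task_runs):
            earned = tier
        else:
            break
    return earned


if __name__ == "__main__":
    print(widest_earned_tier(rate_upper_bound=0.008, task_runs=5_000))
```

Passing an upper confidence bound rather than a point estimate keeps the comparison conservative, and re-running the lookup as the measured rate changes is what lets the same table both widen and contract authority.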

The accumulated-experience requirement is the part most likely to be resisted in practice, and the resistance is worth confronting directly. The objection runs: software is not an engine; a new model version is not the same airframe that flew the hours, so why should an autonomy budget earned by one version carry to the next? The objection is correct, and it is exactly the right instinct. An ETOPS approval is granted to a specific airframe-engine combination, and a significant change to the engine resets the relevant reliability tracking; a material change to a bank’s model triggers re-validation under model-risk governance rather than grandfathering the old approval. 1 11 The autonomy ledger inherits the same rule: a material change to the model, the tooling, or the task distribution resets the accumulated experience that justified the budget. You do not get to carry a reliability record across a change that could have altered the failure rate. This is not a weakness of the analogy; it is the analogy working—the experience base is tied to the configuration that produced it.
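The reset rule can be enforced mechanically by tying the experience count to a fingerprint of the configuration that produced it. A minimal sketch, assuming the configuration is a JSON-serializable dictionary and that every field placed in it counts as material; deciding which fields are material is a policy choice this example does not settle. The names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


def config_fingerprint(config: dict) -> str:
    """Stable hash of the configuration (model, tooling, task class, ...)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class ExperienceLedger:
    fingerprint: str
    runs: int = 0
    undetected_failures: int = 0

    def record(self, config: dict, undetected_failure: bool) -> None:
        """Log one run; discard prior experience if the configuration changed."""
        current = config_fingerprint(config)
        if current != self.fingerprint:
            # A material change: the old record no longer describes this system.
            self.fingerprint = current
            self.runs = 0
            self.undetected_failures = 0
        self.runs += 1
        self.undetected_failures += int(undetected_failure)


if __name__ == "__main__":
    v1 = {"model": "agent-v1", "tools": ["git"], "task_class": "pr-review"}
    v2 = {"model": "agent-v2", "tools": ["git"], "task_class": "pr-review"}
    ledger = ExperienceLedger(config_fingerprint(v1))
    for _ in range(3):
        ledger.record(v1, undetected_failure=False)
    ledger.record(v2, undetected_failure=False)
    print(ledger.runs)
```

Sorting keys before hashing means two dictionaries with the same content always produce the same fingerprint, so only a real change in a tracked field resets the count.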

The discipline this implies is the same one banking calls effective challenge and aviation calls independent monitoring. The estimator of an agent’s failure rate must be independent of the agent itself; a failure-rate measure built from the same model that produces the failures will share its blind spots and overstate reliability precisely where it is weakest. 11 24 An honest autonomy ledger requires an independent monitor, a defense-in-depth posture in which the checking layer does not inherit the generation layer’s assumptions. 24 The agent does not get to mark its own homework any more than a VaR model does. In the strongest form, the independent monitor is a different model on different evidence, or a deterministic verifier on the checkable subset of claims, or human review reserved for the highest-consequence tier—the same defense-in-depth layering by which aviation lets no single channel be the sole arbiter of integrity. 20 24

Section 06

The Dividend: What Provable Reliability Buys

The reason any of this is worth the cost is that provable reliability is not a tax. It is what unlocks the efficient operation. The clearest demonstration in aviation is also the most recent milestone in the ETOPS ladder. On 1 December 2015, Air New Zealand became the first airline ever to operate a scheduled ETOPS-330 flight—Auckland to Buenos Aires, a Boeing 777-200ER powered by Rolls-Royce Trent 800 engines. 5 8 The airline had flown the route under ETOPS-240 from October 2014, accumulating roughly a year of operating experience, before receiving the 330-minute approval in November 2015. 8 That sequence is reliability accounting made visible: the wider budget followed the accumulated experience, not the other way around.

The 330-minute diversion budget is what made the route economic. The Auckland–Buenos Aires great-circle line runs across the empty Southern Ocean, far from any adequate alternate. A narrower diversion budget would have forced a dogleg—a longer, fuel-wasting detour to stay within reach of a runway—or barred the twin from the route entirely in favor of a less efficient quad. 5 The 330-minute budget let the 777 fly the direct line. Every diversion minute the airline had earned through its demonstrated reliability record converted, on that route, into fuel saved, time saved, and CO2 not emitted.

The underlying economics is what makes the dividend large. A modern twin burns substantially less fuel than a four-engine aircraft over the same sector—the gap runs to tens of percent per seat—and engine count is a major driver of the cost difference, since a large share of an airframe’s maintenance expense is engine-related and scales with the number of engines to overhaul, spare, and crew-train around. 9 10 The ICCT’s analysis of widebody fuel efficiency cautions against the naive intuition that bigger is always thirstier—efficiency is driven by engine technology, fan diameter, and structural weight as much as by raw size—but the direction of the twin-versus-quad comparison is not in dispute: fewer, more modern engines on a route the reliability data has opened is the efficient configuration. 10 Trade-press figures putting specific dollar gaps on specific routes should be read as illustrative rather than audited, but the structural point stands: provable reliability is the asset that unlocks the efficient operation. 9

Air New Zealand did not earn a fuel saving and a wider autonomy budget as separate prizes. The wider budget was the saving. Reliability accounting is how the savings get banked.

The dividend

The AI translation is direct, and it is the load-bearing economic argument for treating autonomy as something measured. A provably reliable agent—one whose undetected-failure rate has been measured, demonstrated stable and low, and validated independently—earns a wider autonomy budget. 21 23 A wider budget is itself the dividend: fewer human checkpoints in the loop, more tasks carried directly to completion without escalation, and lower supervision overhead per unit of work. The dogleg an under-trusted agent flies is the chain of human approvals inserted around every consequential step because no one can quantify how often the agent is silently wrong. Reliability accounting removes the dogleg by replacing intuition about trust with a measured rate—and the efficiency that buys is the reason to do the accounting at all. The savings are earned by the data, never granted independently of it.

The economics of the human checkpoint are worth dwelling on, because they are where the dividend is realized or lost. A human approval gate inserted around every consequential agent action is the supervision analogue of carrying two extra engines: it buys safety, but it is expensive, and most of the time it is idle—the engine that never fails, the approval that always rubber-stamps. The cost is not only the reviewer’s time; it is the latency the gate adds, the throughput it caps, and the attrition of vigilance that sets in when a reviewer approves a thousand correct actions and stops reading carefully before the one that mattered. An organization that cannot quantify its agent’s undetected-failure rate has no principled way to remove any of these gates, so it keeps all of them, and the agent’s nominal autonomy is fictional—every step still routes through a human. Reliability accounting is what lets gates be removed selectively and defensibly: keep the gate where the measured tail risk warrants it, remove it where the demonstrated rate makes it dead weight. The dividend is not the absence of supervision; it is supervision allocated to where the data says it is needed. 23 24

Section 07

Account for the Tail: The NZ7571 Corrective

Reliability accounting that prices only the average is incomplete, and the incompleteness is dangerous in a specific way: the rare forecast-failure or envelope-edge event, not the typical day, dominates the real risk. The cautionary case is also a New Zealand flight, but not one operated by the civilian airline of the preceding section, and the distinction matters. On 7 October 2013, a Royal New Zealand Air Force No. 40 Squadron Boeing 757-2K2, callsign NZ7571, flew from Christchurch toward Pegasus Field on the Ross Ice Shelf in Antarctica with 130 people aboard: 117 passengers and 13 crew. 17 18 This was a military resupply flight, operated by the air force, not a scheduled Air New Zealand service. The two New Zealand threads in this paper, the civilian ETOPS-330 success and this military Antarctic flight, are entirely separate, and conflating them would invert the lesson.

The 757 could not return to Christchurch without refueling at Pegasus, so a point of safe return—a point of no return—was computed before departure: the last position from which the airplane retained the fuel to turn around and reach its origin. 17 The designed fallback, in other words, was a return to Christchurch, and it had an expiry. As the flight approached that point, forecasters assured the crew the weather at Pegasus would improve and cleared the flight to continue past it. Roughly twenty minutes after the airplane crossed the point of safe return—its fallback now foreclosed by fuel and range—observations showed that a fog bank had enveloped the runway and its approaches in near-whiteout conditions. 17 The forecast on which the irreversible commitment had been made diverged from the reality the airplane then had to fly into.

The crew flew three approaches. On the third, descending to about 110 feet, they acquired the approach lights and surface markings and landed—below the published minima, in near-whiteout. There was no damage and no injury. 17 18 This is essential to state plainly: NZ7571 was a successful recovery, not a tragedy. The crew handled an extraordinarily difficult situation well, and the Transport Accident Investigation Commission’s inquiry, AO-2013-009, found their in-flight decisions appropriate. 18 What the inquiry faulted was upstream: the original risk assessment had gaps. There were no alternative approach procedures or aerodromes suitable for the Boeing 757 in the region, and the likelihood of early-season Antarctic weather deterioration was underweighted. 18 The flight was committed past an irreversible point, on a forecast, with an under-specified set of fallbacks—and it was the quality of the crew, not the quality of the plan, that supplied the margin.

One distinction must be drawn explicitly, because the conjunction of “New Zealand,” “Antarctica,” and “whiteout” invites a fatal confusion. This is not the 1979 Mount Erebus disaster. That was Air New Zealand Flight 901, a civilian DC-10 sightseeing flight that flew into the slopes of Mount Erebus in whiteout conditions after a navigation-coordinate error, killing all 257 people aboard. NZ7571, thirty-four years later, was a military flight that landed safely. They share a continent and a weather phenomenon and nothing else, and treating one as the other would replace a lesson about tail-accounting with a lesson about navigation databases.

NZ7571’s average risk was acceptable; its tail was not priced. A low mean failure rate is not a safe one when the rare event—the forecast that fails after the fallback is foreclosed—is the one that can kill you.

Why the tail dominates

Banking made exactly this correction, independently, in its own ledger. The Fundamental Review of the Trading Book replaced Value-at-Risk with Expected Shortfall at the 97.5th percentile as the primary internal-models risk measure. 16 The motivation is precisely the NZ7571 lesson in quantitative form. VaR is a quantile threshold: it marks the boundary of acceptable loss but says nothing about how bad the loss is once that boundary is crossed. Expected Shortfall is the average of the losses beyond the threshold—it measures the tail directly. 16 A risk model can be well-calibrated on average and catastrophically miscalibrated in the tail, and VaR will not see it. FRTB’s move to Expected Shortfall is the banking analogue of pricing the fog bank that arrives after the point of safe return: govern by the worst credible failure, not by the comfortable mean. 16
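The difference between the two measures is easiest to see on a toy sample. The sketch below is illustrative only: the two loss books are invented, and the function uses one simple empirical convention (the tail is the worst 2.5% of observations, threshold included), which is an assumption rather than the FRTB calculation itself.

```python
def var_and_es(losses, level=0.975):
    """Empirical VaR and Expected Shortfall of a loss sample (larger = worse)."""
    ordered = sorted(losses)
    n = len(ordered)
    # Number of observations in the tail beyond `level`; always at least one.
    tail_n = max(1, round((1 - level) * n))
    tail = ordered[n - tail_n:]
    var = tail[0]                # the quantile threshold
    es = sum(tail) / len(tail)   # mean loss at or beyond the threshold
    return var, es


# Two hypothetical books of 1,000 loss observations each.
# They agree on 976 observations and differ only in the 24 worst ones.
calm_book = [1.0] * 975 + [2.0] * 25
fat_tail_book = [1.0] * 975 + [2.0] + [50.0] * 24

calm_var, calm_es = var_and_es(calm_book)
fat_var, fat_es = var_and_es(fat_tail_book)

print("calm book:     VaR", calm_var, "ES", calm_es)
print("fat-tail book: VaR", fat_var, "ES", fat_es)
```

Both books report the same VaR, because the threshold sits at the same observation. Only Expected Shortfall separates them, since it averages what lies beyond that threshold.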

The structure of the NZ7571 failure is precise enough to be worth isolating as a pattern, because it recurs in software far more often than the dramatic Antarctic setting suggests. Three conditions coincided. There was an irreversible commit point—the point of safe return, past which the designed fallback was foreclosed by fuel and range. There was a decision to cross that point on a forecast—a prediction of improving weather—rather than on an observation. And there was an under-specified fallback set—no 757-suitable alternate aerodrome or approach procedure, so that once Pegasus fogged in, the option space had collapsed to a single runway the crew had to make work. 18 Any system that commits irreversibly on a prediction, without a reachable alternative when the prediction fails, inherits this exact risk profile. The average behavior of such a system can be excellent for years; the tail is the day the forecast is wrong and the fallback is gone.

The AI mapping is now concrete. An autonomous agent that commits past a rollback horizon on predicted rather than observed conditions, executing an irreversible action because a model expected the environment to be favorable, with no reachable safe-harbour and an under-specified fallback set, is the NZ7571 failure rendered in software. 24 An agent whose autonomy was granted on a low average failure rate, with no tail term, has been certified the way NZ7571’s risk assessment was written: adequate for the typical case, silent on the rare one that foreclosed its own recovery. The corrective is the same one banking and aviation each made: price the tail, and abstain at the commit point when the forecast cannot be verified against observation or the fallback cannot be reached. 16 24
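One way to picture that precondition is as a gate evaluated immediately before any irreversible action. The sketch below is a minimal illustration, not a reference implementation: `CommitContext`, `Decision`, and `commit_gate` are hypothetical names, and the gate takes the stricter reading of the rule, abstaining when either condition fails.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    ABSTAIN = "abstain"


@dataclass(frozen=True)
class CommitContext:
    """Hypothetical description of an action at its commit point."""
    irreversible: bool         # does the action cross a rollback horizon?
    conditions_observed: bool  # True if observed, False if only predicted
    reachable_fallbacks: int   # fallbacks still reachable after committing


def commit_gate(ctx: CommitContext) -> Decision:
    """Abstain on irreversible actions that rest on a forecast or lack a fallback."""
    if not ctx.irreversible:
        return Decision.PROCEED   # reversible actions can be rolled back
    if not ctx.conditions_observed:
        return Decision.ABSTAIN   # committing on a prediction alone
    if ctx.reachable_fallbacks == 0:
        return Decision.ABSTAIN   # no safe-harbour left if things go wrong
    return Decision.PROCEED


# NZ7571 at the point of safe return: irreversible, forecast-only, no alternate.
nz7571 = CommitContext(irreversible=True, conditions_observed=False,
                       reachable_fallbacks=0)
print(commit_gate(nz7571).value)
```

The gate deliberately ignores the agent's average reliability. It is a structural check on the action, which is why it can be enforced even when the tail rate is too rare to estimate.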

Section 08

The Autonomy Budget as an Earned, Backtested, Tail-Aware Quantity

The deployable discipline collapses the whole argument into five steps, each of which has a direct precedent in the institutions this paper has examined. First, define the task class and the undetected-failure event. An IFSD is a precisely defined event; a VaR exception is a precisely defined event; the undetected-failure event for an agent on a task class must be defined with the same precision, or the rate that follows is meaningless. 1 13 Second, instrument and measure the rate over accumulated runs. This is backtesting, and it is the step most often skipped: an autonomy claim without an ongoing measurement regime is, in banking terms, an unvalidated model in production. 11 13

Third, scale the autonomy tier to a demonstrated, stable, low rate, with a tighter tolerance for higher blast radius: the ETOPS step from 0.05 to 0.01 in-flight shutdowns per 1,000 engine-hours and the DAL/materiality logic, applied to agent authority. 1 12 Fourth, price the tail. Govern by an Expected-Shortfall-style measure of the worst credible undetected failure, not by mean accuracy, because the mean hides the failure that matters. 16 Fifth, abstain or fall back at the commit point whenever the action would commit past a rollback horizon on predicted rather than observed conditions, or whenever no reachable safe-harbour exists. This is the NZ7571 rule, encoded as a hard precondition on irreversible action. 24
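Steps two and three can be sketched together: estimate the undetected-failure rate from accumulated runs, then grant the widest tier whose tolerance the evidence supports. The sketch below assumes runs are independent with a constant failure probability, and it governs by a one-sided exact (Clopper-Pearson) upper confidence bound rather than the point estimate. The tier names and tolerances are hypothetical and purely illustrative.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def upper_bound(failures, runs, alpha=0.05):
    """One-sided exact upper confidence bound on the failure rate."""
    if failures >= runs:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection on the rate p
        mid = (lo + hi) / 2
        if binom_cdf(failures, runs, mid) > alpha:
            lo = mid     # data still plausible at this rate; bound is higher
        else:
            hi = mid
    return hi


# Hypothetical tiers, widest authority first, each with a tighter tolerance.
TIERS = [
    ("merge-to-production", 0.001),
    ("merge-to-staging", 0.01),
    ("open-pull-request", 0.05),
]


def granted_tier(failures, runs, alpha=0.05):
    """Widest tier whose tolerance the upper bound meets, else human review."""
    bound = upper_bound(failures, runs, alpha)
    for name, tolerance in TIERS:
        if bound <= tolerance:
            return name
    return "human-in-the-loop"


print(granted_tier(0, 300))    # a clean record over a few hundred runs
print(granted_tier(0, 5000))   # the same record over far more experience
```

Governing by the bound rather than the observed rate is what makes accumulated experience matter: a record of zero failures in ten runs and zero failures in five thousand have the same point estimate but support very different tiers.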

These five steps are not a metaphor borrowed from aviation and banking; they are the same accounting object that ETOPS reporting, Basel backtesting, and conformal abstention each instantiate. 3 13 21 The ETOPS IFSD ledger, the Basel traffic-light backtest, and the conformal risk-coverage bound are one structure: a consequence-scaled tolerance on a measured failure rate, estimated over accumulated experience, monitored continuously, and contracted on drift. 1 13 23 Govern AI agent autonomy the way an ETOPS tier is governed—earned by data, contracted on drift, tail-aware by design—and the autonomy you grant is autonomy you can defend to an examiner.
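The "contracted on drift" half of that structure can be illustrated with a traffic-light check in the Basel style. The sketch below classifies an exception count by its cumulative binomial probability under the claimed tolerance, using cut-offs of 95% and 99.99%, which reproduce the familiar zones for 250 observations at a 1% tolerance. Applying the same zones to agent runs, and what each zone triggers, are assumptions of this sketch.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def traffic_light(exceptions, observations=250, tolerance=0.01):
    """Zone for an exception count, given the tolerance the model claims."""
    cumulative = binom_cdf(exceptions, observations, tolerance)
    if cumulative < 0.95:
        return "green"    # consistent with the claimed rate
    if cumulative < 0.9999:
        return "yellow"   # suspicious: tighten scrutiny, consider contracting
    return "red"          # claimed rate is not credible: contract the tier


for count in (4, 5, 9, 10):
    print(count, traffic_light(count))
```

With the default window this yields green for up to four exceptions, yellow for five through nine, and red from ten upward. For an agent, the window would be a rolling count of runs on one task class and the tolerance would be the rate the current autonomy tier was granted against.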

The honest limits deserve the same prominence as the discipline, because each maps to a way the accounting can quietly fail. The conformal guarantee rests on exchangeability between the calibration and deployment distributions, and under distribution shift the bound degrades. The only sound response is to treat shift as a monitored, recalibrated risk rather than to assert that the bound still holds: measure the empirical failure rate in deployment continuously and recalibrate when it diverges, exactly as ETOPS tracks the fleet rate and Basel runs the traffic light. 21 3 13 Monitor independence is load-bearing: a failure-rate estimator built from the same model it watches inherits that model’s blind spots and overstates reliability where it is weakest, which is why effective challenge demands an independent validator. 11 24 And the tail is estimable only to the extent that rare events appear in the experience base. The NZ7571 gap was an unrepresented alternate-failure mode, a tail the risk assessment never sampled, and no amount of averaging over the represented cases would have surfaced it. 18 16

The honest scope statement, then, is this: reliability accounting bounds undetected failure relative to a measured experience base and a specification—not relative to ground truth. 19 20 It is the same guarantee aviation’s integrity monitoring and banking’s validated bounds provide, and it has the same shape and the same limits. It does not make an agent infallible. It makes the agent’s reliability a measured, accountable, tail-aware quantity—a number on a ledger that can be earned, audited, and, when it drifts, taken back. That is what aviation and banking each spent decades learning to do, and it is what AI autonomy now has to learn. The autonomy you can bank is the autonomy you can price.
