Earned Range: ETOPS and the Graduated Autonomy Envelope

KellerAI

Section 01

The Blanket Prior and Its Replacement

For most of the jet age, a twin-engine airliner could not legally fly more than sixty minutes’ flying time—measured at single-engine cruise speed, in still air—from an adequate airport. The rule lives in United States regulation at 14 CFR 121.161, and equivalents existed worldwide. 1 Its logic was simple and, for its era, defensible. If one of two engines failed over the ocean, the aircraft had to be able to reach a runway on the remaining engine before any plausible compounding failure—a second engine problem, a cabin fire, a medical emergency—could turn a manageable situation into a fatal one. Sixty minutes was the margin the regulator was willing to grant on faith.

The sixty-minute rule was a blanket prior. It applied one conservative number to every twin, regardless of which airframe, which engine, or how that combination had actually performed in service. A brand-new high-bypass turbofan with a demonstrated failure rate an order of magnitude below the piston-era engines that motivated the rule was held to exactly the same line as the worst plausible twin. The prior was simultaneously too permissive—it granted a flat sixty minutes to combinations that had never earned it—and too conservative—it capped proven, reliable aircraft at a boundary set by the failures of a previous generation. It was a number applied in the absence of evidence, which is precisely what a prior is.

In 1985 the Federal Aviation Administration replaced the blanket prior with something better. Advisory Circular AC 120-42 introduced Extended-range Twin-engine Operational Performance Standards—ETOPS—and with them a new question. 2 The regulator stopped asking “may a twin cross the ocean?” as a yes-or-no matter of category, and started asking “how far from a runway has this specific airframe-engine combination earned the right to fly, on the evidence of how it has actually performed?” The answer was no longer a single line drawn for all twins. It was a graduated envelope, granted in tiers, each tier unlocked by demonstrated reliability and revocable if that reliability decayed. 3

This paper argues that the AI field stands where aviation stood before 1985, asking the wrong question about agent autonomy. The dominant framing is binary: an agent is “autonomous” or “human-in-the-loop,” “agentic” or “supervised.” That binary is a blanket prior. It grants a flat operating range regardless of how the agent has performed on the exact task class in question, and it caps a proven configuration at the same line as an unproven one. The ETOPS move—from blanket prior to evidence-graded, tiered, revocable envelope—is the move the field has not yet made. The thesis is direct: an agent’s unsupervised operating range should be earned, graduated, and revocable, tied to a demonstrated low rate of undetected failure on that exact task class, exactly as ETOPS minutes are tied to a demonstrated in-flight shutdown rate.

Autonomy is not a switch. It is a range you earn—granted in tiers on demonstrated reliability, and revoked when the evidence decays. ETOPS made that move in 1985. The AI field has not made it yet.

The reframe

The history that follows is not decoration. ETOPS is the most carefully documented case in engineering of a regulator converting a conservative categorical prohibition into a graduated, evidence-earned permission—complete with the reliability mathematics that justified each widening, the monitoring that kept it honest, the economics that made earning it worthwhile, and one well-investigated incident that shows exactly what operating at the edge of the envelope on a bad forecast costs. Every one of those elements has a precise translation in agent governance.

Section 02

Diversion Time as Rollback Horizon

The load-bearing quantity in ETOPS is diversion time: the maximum time the aircraft may be from an adequate alternate airport, computed as the single-engine-inoperative flight time to that alternate, in still air, at the approved one-engine-inoperative cruise speed. 3 4 When an authority grants “ETOPS-180,” it is granting permission to fly routes on which the aircraft is never more than 180 minutes—on one engine, in still air—from somewhere it can land. Diversion time is not a distance and not a speed; it is a bound on how long the aircraft can be from a place of safety. That framing is the whole reason ETOPS maps onto agent autonomy so cleanly.

For an AI agent acting in the world, the analogue of diversion time is the rollback horizon: the longest the agent may run unsupervised while a human or a deterministic gate can still retake control without harm. ETOPS diversion time bounds the time to a runway; the rollback horizon bounds the time to a person who can intervene, or to a checkpoint that can halt the action, before consequences become irreversible. The adequate alternate (the diversion airport that must be reachable within the diversion time) maps to a reachable safe harbour: a human approver, an escalation queue, or a deterministic policy gate that can actually intervene within the horizon. The two ideas, a bound on unsupervised duration and the reachability of a place of safety within it, are the geometry of earned autonomy.

The AI literature has begun to assemble the vocabulary that ETOPS gives rigor to. Surveys of agent autonomy describe it as a graduated spectrum—from pre-execution approval of every action, through post-execution review, to bounded self-direction with escalation triggers at domain boundaries—rather than a single on/off capability. 19 Control-measure research classifies agent actions by reversibility and blast radius, gating high-consequence or irreversible actions behind stronger oversight than cheap, reversible ones. 20 These are the right instincts. What they lack is the discipline ETOPS supplies: a way to say how far the envelope may extend, on what evidence, and how it is revoked. Reversibility and blast radius tell you which actions are dangerous. Diversion time tells you how long you may run before you must be able to undo them.

Two parameters define the envelope, and both must hold. The first is the rollback horizon itself: the duration of unsupervised action the agent is authorized for. The second is safe-harbour reachability: the guarantee that within that horizon, an intervening authority is genuinely available—not nominally on-call, but able to act in time. ETOPS enforces both. It is not enough that an alternate exists on the map; it must be adequate (a runway long enough, with the right approach aids and services) and reachable within the diversion time given the actual winds and the actual fuel. An agent governance regime that grants a long rollback horizon while the “human in the loop” is asleep, batching ten thousand actions for after-the-fact review, has granted range without a reachable harbour. That is the configuration ETOPS exists to forbid.
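The two-parameter check can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed data model: the names (`Envelope`, `harbour_latency_s`, `review_capacity_per_hour`, and so on) are hypothetical. It also assumes that reachability can be approximated by two numbers: the worst-case time for the safe harbour to act, and its review throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Hypothetical description of a granted autonomy envelope."""
    rollback_horizon_s: float        # authorized unsupervised duration
    harbour_latency_s: float         # worst-case time for a human or gate to act
    review_capacity_per_hour: float  # actions the harbour can actually review
    action_rate_per_hour: float      # consequential actions the agent emits


def envelope_holds(env: Envelope) -> bool:
    """Both parameters must hold: a bounded horizon AND a reachable harbour."""
    # Condition 1: the harbour can intervene inside the horizon.
    reachable_in_time = env.harbour_latency_s <= env.rollback_horizon_s
    # Condition 2: the harbour is not out-run by the agent's output
    # (the "ten thousand actions batched for later review" configuration).
    keeps_up = env.review_capacity_per_hour >= env.action_rate_per_hour
    return reachable_in_time and keeps_up


awake_reviewer = Envelope(3600, 300, 120, 60)
asleep_reviewer = Envelope(3600, 8 * 3600, 120, 10_000)
print(envelope_holds(awake_reviewer), envelope_holds(asleep_reviewer))
```

The second configuration fails on both counts: the reviewer cannot act within the hour, and could not keep up with the action rate even if awake.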

Diversion time is a bound on how long the aircraft can be from a place of safety. An agent’s rollback horizon is a bound on how long it can run while a person or a gate can still undo what it did. Earn the range; keep the harbour reachable.

The core metaphor

Section 03

Earned, Graduated, Revocable: The Tier Ladder

ETOPS did not arrive as a single permission. It arrived as a ladder, each rung a wider diversion time, and in its original form each rung was unlocked only after the rung below it had been demonstrated in service. The shape of the ladder is the argument. Reliability is not asserted once and banked; it is demonstrated, granted a tier, demonstrated again at scale, and only then granted the next tier. The later Early-ETOPS process relaxed the “in service first” requirement for sufficiently engineered aircraft—substituting demonstrated engineering and fleet evidence for accumulated in-service hours at the lower rung, as the next paragraphs describe—but the principle that range is granted on demonstrated reliability rather than asserted held throughout.

The first rung was ETOPS-120, opened by AC 120-42 in 1985. 2 The original regulation allowed a qualifying twin to enter service at 120 minutes; ETOPS-180 was available only after a year of trouble-free 120-minute operation. 5 That one-year-of-clean-service gate is the ratchet in its purest form: you may not buy the wider envelope; you must earn it with operating hours that show the failure rate is what you claimed. Boeing later persuaded the FAA that a sufficiently engineered aircraft could be granted ETOPS-180 at entry into service—the “Early ETOPS” process—and the Boeing 777 became the first airliner introduced at a 180-minute rating. 5 The 180-minute envelope was transformative: it opened the great majority of the planet’s surface to twin-engine point-to-point routing, the threshold beyond which doglegging to stay near alternates stopped dominating long-haul economics.

The ladder continued upward, and each step was earned by a specific airframe-engine combination, not granted to a category. In November 2009 the Airbus A330 became the first aircraft to receive ETOPS-240 approval, its maximum diversion-time capability assessed at roughly 240 minutes in still air. 7 8 On 12 December 2011 Boeing received FAA type-design approval for up to 330-minute ETOPS on its GE-powered 777 family—the 777-300ER, 777-200LR, 777 Freighter, and 777-200ER—with Rolls-Royce and Pratt & Whitney variants to follow. 6 In 2014 the Airbus A350 XWB became the first airliner approved for ETOPS beyond 180 minutes before entry into service, certified by EASA for 180-, 300-, and up to 370-minute capabilities—a certification that today covers on the order of 99.7% of the Earth’s surface. 8 5

Read the ladder as a governance pattern and the AI translation writes itself. Each tier corresponds to a wider unsupervised operating range. Each widening is gated on demonstrated reliability for a specific configuration on a specific task class—ETOPS approves an airframe-engine combination, never “twins in general,” just as agent autonomy should be earned per (agent, task class) pair, never “agents in general.” And the ratchet runs in both directions: the envelope extends on accumulated clean evidence, and it contracts when the evidence regresses. A graduated AI autonomy ladder would grant an agent a short rollback horizon at first—every consequential action reviewed—and widen it tier by tier as the agent demonstrates a low, stable rate of undetected failure on that task class over enough operating hours, demoting it when the rate drifts back up.
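As a rough illustration of the two-way ratchet, the sketch below tracks one (agent, task class) pair on a hypothetical four-rung ladder. The tier names, horizons, rate ceilings, and the 5,000-action evidence requirement are all assumptions made up for the example. The failure counts are assumed to come from an independent audit, not from the agent itself.

```python
from dataclasses import dataclass

# Hypothetical ladder: (tier name, rollback horizon in minutes,
# maximum undetected failures per 1,000 audited actions to hold the tier).
# All numbers are illustrative assumptions, not recommendations.
TIERS = [
    ("review-every-action", 0, float("inf")),
    ("short-run", 15, 0.5),
    ("medium-run", 60, 0.1),
    ("long-run", 240, 0.02),
]
MIN_ACTIONS_FOR_PROMOTION = 5_000  # assumed evidence requirement per rung


@dataclass
class LadderState:
    """Evidence accumulated by one (agent, task class) pair at its current tier."""
    tier: int = 0
    actions: int = 0
    failures: int = 0


def record(state: LadderState, actions: int, undetected_failures: int) -> LadderState:
    """Fold in audited evidence, then promote, hold, or demote."""
    state.actions += actions
    state.failures += undetected_failures
    rate = 1000 * state.failures / state.actions if state.actions else float("inf")

    if state.tier > 0 and rate > TIERS[state.tier][2]:
        # Evidence regressed: contract the envelope and start earning again.
        state.tier, state.actions, state.failures = state.tier - 1, 0, 0
    elif (
        state.tier < len(TIERS) - 1
        and state.actions >= MIN_ACTIONS_FOR_PROMOTION
        and rate <= TIERS[state.tier + 1][2]
    ):
        # Clean evidence at scale: unlock the next rung only, never two at once.
        state.tier, state.actions, state.failures = state.tier + 1, 0, 0
    return state


state = LadderState()
record(state, 5_000, 0)   # earns "short-run"
record(state, 5_000, 0)   # earns "medium-run"
record(state, 1_000, 1)   # 1.0 per 1,000 exceeds the 0.1 ceiling: demoted
print(TIERS[state.tier][0])
```

Demotion here is deliberately quick and promotion slow. That asymmetry is a design choice of the sketch, and a production rule would likely use a confidence bound on the rate rather than the raw ratio.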

Each ETOPS tier was unlocked only after the tier below it was demonstrated at scale. Autonomy should ratchet the same way: a short rollback horizon first, widened tier by tier on evidence, contracted the moment the evidence regresses.

The ratchet

The contrast with how agent autonomy is typically deployed is stark. The common pattern is a binary launch: an agent ships either gated on every action or trusted with a broad mandate, and the choice is made once, on a demo and a vibe, not on accumulated task-class evidence. ETOPS would call that granting ETOPS-330 to an airframe on its first flight. No regulator has ever done that, for a reason the next two sections make concrete: the wider envelope is where the value lives, and the edge of the envelope is where the failures concentrate.

Section 04

How the Range Is Earned: The Reliability Gate

What evidence earns a tier? In ETOPS, the central reliability metric is the in-flight shutdown (IFSD) rate—the rate at which engines are shut down in flight, per thousand engine-flight-hours, across the world fleet of that engine type. The 2007 FAA final rule that consolidated ETOPS for multi-engine airplanes ties the permissible diversion time to a demonstrated world-fleet IFSD rate: broadly, a target on the order of 0.02 shutdowns per thousand engine-hours for operations up to 180 minutes, tightening to roughly 0.01 per thousand engine-hours for operations beyond 180 minutes. 9 The number is not the point; the structure is. A wider envelope demands a lower demonstrated failure rate, measured across the fleet, before it is granted.
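The structure of the gate is simple enough to write down. The sketch below uses the approximate targets quoted above (0.02 and 0.01 per thousand engine-hours). The function names, class labels, and fleet figures are hypothetical, and the real rule attaches further conditions that are not modelled here.

```python
def ifsd_rate_per_1000(shutdowns: int, engine_hours: float) -> float:
    """In-flight shutdowns per 1,000 engine-hours across a fleet."""
    if engine_hours <= 0:
        raise ValueError("engine_hours must be positive")
    return 1000.0 * shutdowns / engine_hours


def widest_diversion_class(rate_per_1000: float) -> str:
    """Map a demonstrated rate to the widest class it supports.

    Thresholds are the approximate figures from the text above,
    used only to show the shape of the gate.
    """
    if rate_per_1000 <= 0.01:
        return "beyond-180"
    if rate_per_1000 <= 0.02:
        return "up-to-180"
    return "below-180-target"


# Hypothetical fleet: 12 shutdowns over one million engine-hours.
rate = ifsd_rate_per_1000(12, 1_000_000)
print(rate, widest_diversion_class(rate))
```

A lower demonstrated rate unlocks a wider class, and nothing else does.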

The measurement does not stop at approval. ETOPS reliability is monitored continuously. The framework tracks propulsion-system events across the fleet and continues that tracking—in the regulatory language, beyond hundreds of thousands of engine-hours of accumulated experience—until a stable IFSD rate is demonstrated. 11 Mandatory reporting of in-service events, required under 14 CFR 21.4 and related provisions, feeds the evidence base that the world-fleet rate is computed from. 10 If the rate rises—if a fleet starts shutting engines down more often than the tier assumes—the evidence that justified the tier has decayed, and the regulator has the standing to act on it. The permission is contingent on the data continuing to hold, not on the data having held once.

This is the single most important structural lesson for agent autonomy, and it is deliberately bounded here: how you compute and monitor that rate is a deep subject in its own right, and we touch it only as far as the earned-range argument requires. The point is the dependency, not the statistics. A wider unsupervised range for an agent must be justified by a measured, stable rate of undetected failure on the task class—false assertions or unsafe actions that escaped the agent’s own checks—held below a threshold scaled to the consequence of a miss. And that rate must be monitored in deployment, not certified once at launch, with the envelope demoted when the rate regresses.
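One way to make "measured, stable rate" concrete is to gate on an upper confidence bound rather than on the raw observed rate, so that a short clean record does not count as proof. The sketch below uses a one-sided Wilson score bound. This is a statistic chosen here for illustration, not one the argument prescribes, and it assumes audited actions are independent trials.

```python
import math


def wilson_upper(failures: int, trials: int, z: float = 1.645) -> float:
    """Approximate one-sided 95% Wilson upper bound on a failure proportion."""
    if trials <= 0:
        return 1.0  # no evidence: assume the worst
    p = failures / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return min(1.0, (centre + margin) / denom)


def may_hold_tier(failures: int, trials: int, max_rate: float) -> bool:
    """Hold the tier only if the upper bound, not the point estimate, clears the ceiling."""
    return wilson_upper(failures, trials) <= max_rate


# Zero failures is not enough on its own; the volume of evidence matters.
print(may_hold_tier(0, 100, 0.01))    # short record
print(may_hold_tier(0, 1000, 0.01))   # longer record
print(may_hold_tier(20, 1000, 0.01))  # regressed rate
```

Re-running the same check on a rolling window of deployment data gives a simple demotion trigger: the tier is lost when the bound climbs back above the ceiling.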

The connection to undetected error is exact, and it links this paper to its parent in the series. The object of control in the parent argument is the rate of undetected false assertions: not raw accuracy, but the rate of wrong outputs that escape the monitor and reach a decision-maker without a warning. The IFSD rate is aviation’s version of exactly that quantity for engines—a measured, fleet-wide rate of the failure mode the envelope is sized against. Earned range is the governance envelope built on top of a measured undetected-failure rate. You cannot earn a wider range honestly without first being able to measure the rate that justifies it.

The arithmetic of the threshold is worth making explicit, because it shows why the tier ladder has the shape it does. A diversion-time bound is, in effect, a bet that no compounding failure will occur during the worst-case time the aircraft is committed to single-engine flight. The longer the diversion time, the longer that exposure window, and the lower the per-hour failure rate must be to hold the probability of a bad outcome inside the same overall tolerance. Halving the permissible IFSD rate roughly compensates for a doubling of exposure time; tightening from 0.02 to 0.01 per thousand engine-hours as the envelope crosses 180 minutes is the regulator buying back, in demonstrated reliability, the additional risk that the wider envelope spends. 9 The envelope and the rate are two sides of one inequality: range times rate, held under a fixed ceiling. You can have more range only by demonstrating less rate.

That inequality is the discipline an agent-autonomy regime should internalize. A longer rollback horizon is a longer window in which an undetected failure can compound before a human or gate intervenes. To hold the probability of a harmful uncaught failure under a fixed tolerance, a longer horizon demands a proportionally lower demonstrated undetected-failure rate on the task class. An agent does not earn a wider envelope by being generally impressive; it earns it by demonstrating that the specific rate—the one that, multiplied by the longer exposure window, must stay under the consequence-scaled ceiling—has actually come down. Range and rate trade against each other under a fixed ceiling, in agents exactly as in engines.
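The inequality can be written out under one simplifying assumption that the argument itself does not state: failures arrive independently at a constant rate, so the count in a window is Poisson. The symbols below are introduced here for illustration only.

```latex
% Assumed notation (not from the source):
%   \lambda     = undetected-failure rate per hour (engine or agent)
%   T           = exposure window in hours (diversion time or rollback horizon)
%   \varepsilon = tolerated probability of at least one failure in the window
\[
  P(\text{at least one failure in } T)
  \;=\; 1 - e^{-\lambda T}
  \;\le\; \lambda T ,
  \qquad\text{so}\qquad
  \lambda T \le \varepsilon
  \;\Longrightarrow\;
  P(\text{at least one failure in } T) \le \varepsilon .
\]
\[
  \text{For fixed } \varepsilon:\qquad
  T_{\max} = \frac{\varepsilon}{\lambda},
  \qquad
  T \to 2T \;\;\text{requires}\;\; \lambda \to \tfrac{1}{2}\lambda .
\]
```

On this reading, range times rate is the product $\lambda T$, and the ceiling is $\varepsilon$. Doubling the window while holding the ceiling fixed means halving the rate, which matches the step from 0.02 to 0.01 described above.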

One caution travels with the metric. ETOPS works because the IFSD rate is measured by an authority structurally independent of the manufacturer that wants the wider envelope, fed by mandatory reporting that the operator cannot quietly suppress. 10 A reliability monitor that is correlated with the thing it monitors—an agent grading its own failure rate, or a monitor trained on the same data and inheriting the same blind spots—will systematically under-read the rate, and will under-read it precisely in the cases where the agent fails for a systematic reason. An undetected-failure rate measured by a dependent monitor is not evidence; it is the agent’s opinion of itself. Earned range demands an independent measure of the rate, for the same reason ETOPS demands fleet reporting rather than the manufacturer’s word.

Section 05

The Economics of a Wider Envelope (Air New Zealand ETOPS-330)

The reason to do the reliability work is not safety theatre. It is that the wider envelope is where the money is. On 1 December 2015, Air New Zealand became the first airline ever to operate a scheduled ETOPS-330 service—Auckland to Buenos Aires, a Boeing 777-200ER powered by Rolls-Royce Trent 800 engines—having received 330-minute approval in November 2015 and having first operated under ETOPS-240 from 2014. 12 13 The route runs across the empty Southern Ocean, one of the parts of the planet farthest from any adequate alternate. Without a wide diversion-time envelope, a twin cannot fly the direct great-circle track there; it must dogleg—bend its route toward whatever distant alternates exist—to stay within its permitted diversion time. The dogleg burns fuel and time on every flight, forever.

The 330-minute envelope let the twin fly the direct track instead of the dogleg. In Boeing’s and Air New Zealand’s framing of the broader 330-minute authorization, the wider envelope “allows more direct flights, burns less fuel, emits less carbon dioxide.” 6 As Air New Zealand’s chief pilot put it, the aircraft “is able to fly a straighter route between the city pairs…less fuel is burned and less carbon dioxide is emitted.” 6 The earned range did not merely make the route legal; it made it economic. The straight line over the ocean is the cheapest line, and only the earned envelope unlocked it.

The twin-engine economics underneath compound the point. Modern twin widebodies burn substantially less fuel per seat than the four-engine aircraft they displaced: trade analyses put the per-seat fuel-efficiency gap at roughly a quarter to a third, and a four-engine flagship can cost several times as much per hour to operate as a comparable twin. 14 Engines are a major share of an airframe’s maintenance burden, so carrying two instead of four removes whole inspection and overhaul cycles from the cost base. 15 The twin was already the cheaper aircraft; ETOPS let the cheaper aircraft fly the cheaper route. The earned envelope is where those two savings meet.

The dogleg to stay near a runway is a tax paid on every flight. The earned envelope removes it. The reason to earn a wider range is not safety theatre—it is that the straight line over the ocean is the cheapest line, and only the earned range unlocks it.

Where the value lives

There is a timing lesson buried in the Air New Zealand sequence that the AI field should not miss. Air New Zealand did not leap to ETOPS-330. It operated ETOPS-240 from 2014, accumulated the clean service the wider envelope required, and only then took the 330-minute approval in November 2015 and flew the first scheduled 330-minute service that December. 13 12 The economic prize—the direct Southern Ocean track—was waiting at the top of the ladder the entire time, but the operator reached it by climbing the rungs, not by skipping them. The reliability work was the path to the prize, and the path ran through the intermediate tier. An organization that wants the throughput of long unsupervised agent runs has to climb the same way: earn the intermediate envelope, demonstrate the rate at that horizon, then extend.

Now translate. An agent that must hand back to a human at every consequential step is doglegging: every checkpoint is a detour away from the straight line of getting the work done, a tax paid on throughput on every run. A wider earned rollback horizon—longer autonomous runs between human checkpoints—removes the dogleg. The economic case for agent autonomy is the same as the economic case for ETOPS: the value of automation lives in the length of the unsupervised run, because that is what eliminates the human-checkpoint detour that throttles throughput. This reframes the reliability work entirely. The IFSD-style measurement, the monitoring, the staged tiers are not compliance overhead grudgingly tolerated. They are the price of admission to the wider envelope—and the wider envelope is where the return on automation actually accrues. You do the reliability work in order to earn the longer run, because the longer run is the asset.

Section 06

Operating at the Edge on a Forecast (RNZAF NZ7571)

The cautionary thread is a different aircraft, a different operator, and a different lesson—and it must be attributed accurately, because the temptation to reach for a famous Antarctic disaster is exactly the error to avoid. On 7 October 2013, a Royal New Zealand Air Force No. 40 Squadron Boeing 757-2K2—military callsign NZ7571, not a civilian Air New Zealand service—departed Christchurch for Pegasus Field on the Ross Ice Shelf, Antarctica, with 130 people aboard. 16 17 The 757 lacked the fuel to return to Christchurch without first refuelling at Pegasus, so a point of safe return was pre-computed: the last position on the route from which the aircraft could still turn around and reach Christchurch on the fuel remaining. Past that point, the only destination was Antarctica.

Forecasters assured the crew that the weather at Pegasus would improve, and the flight was cleared past the point of safe return on that forecast. Roughly twenty minutes after committing, observations showed a fog bank had enveloped the runway in near-whiteout. 17 The designed fallback—return to Christchurch—was already foreclosed by the fuel and range envelope. The crew flew three approaches; on the third, at about 110 feet, they acquired the approach lighting and runway markings and landed below the published minima in near-whiteout. There was no damage and no injuries. The Transport Accident Investigation Commission’s inquiry, AO-2013-009, found that the crew’s decisions on the day were appropriate—but that the original risk assessment had gaps: there were no 757-suitable alternate approach procedures, and only thin consideration of which Antarctic aerodromes were genuinely suitable for the 757. 16 17 This was a successful recovery, not a tragedy.

It must be distinguished sharply from the Mount Erebus disaster. That was Air New Zealand Flight 901, a McDonnell Douglas DC-10, on 28 November 1979, in which a navigation-coordinate error compounded by sector whiteout flew the aircraft into the mountain and killed all 257 aboard. 18 NZ7571 and Erebus share only a continent. They are different aircraft, different operators, different decades, and—most importantly for the argument here—different outcomes. Conflating them would substitute a navigation-data error that killed everyone for a forecast-commitment problem that everyone survived. The lessons are not interchangeable.

The NZ7571 lesson, stated precisely, is the most dangerous edge in the whole earned-range geometry: the crew committed past an irreversible point on predicted conditions that diverged from observed reality, with the designed fallback already foreclosed by the envelope and an inadequate set of alternates behind it. Everything that could be done right on the day was done right; the recovery was skilful and the outcome was clean. But the structural exposure was set before the point of safe return, by accepting a forecast as if it were an observation, and by entering an envelope whose fallback set was under-specified. The crew flew out of a hole that a better risk assessment would not have dug.

The crew did everything right on the day—and the recovery was clean. The exposure was set earlier, by committing past the point of safe return on a forecast, with the fallback already foreclosed and the alternates under-specified. The forecast, not the flying, was the failure.

The most dangerous edge

The agent translation is direct and uncomfortable. An autonomous agent that commits past its rollback horizon on predicted rather than observed conditions—a model’s estimate that the downstream system will be in a recoverable state, a planner’s assumption that an external dependency will respond as expected—with no reachable safe harbour behind it and an under-specified fallback set, is flying NZ7571’s profile. It may well recover, skilfully, most of the time, just as the crew did. But the exposure was created the moment it crossed the point of safe return on a forecast. The fix is not “never extend the range.” It is to gate the commitment past the rollback horizon on an observed condition, not a predicted one, and to never let the fallback set go under-specified.
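A commitment gate of that kind might look like the following sketch. The `Condition` record, the `source` labels, and the 60-second freshness limit are hypothetical choices made for the example. The point is only that a forecast, however confident, never opens the gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical precondition for an irreversible step."""
    name: str
    ok: bool
    source: str    # "observed" (measured now) or "predicted" (a model's forecast)
    age_s: float   # seconds since the value was measured or forecast


MAX_OBSERVATION_AGE_S = 60.0  # assumed freshness limit for an observation


def may_commit_past_horizon(conditions: list, fallbacks: list) -> bool:
    """Allow an irreversible step only on fresh observations and a specified fallback set."""
    if not fallbacks or not conditions:
        return False  # under-specified fallback set, or nothing checked
    for c in conditions:
        if c.source != "observed":
            return False  # a forecast is not an observation
        if c.age_s > MAX_OBSERVATION_AGE_S:
            return False  # a stale observation is effectively a forecast
        if not c.ok:
            return False
    return True


forecast = [Condition("downstream_recoverable", True, "predicted", 5.0)]
observed = [Condition("downstream_recoverable", True, "observed", 5.0)]
print(may_commit_past_horizon(forecast, ["escalate_to_human"]))
print(may_commit_past_horizon(observed, ["escalate_to_human"]))
print(may_commit_past_horizon(observed, []))
```

Only the second call passes. The first relies on a prediction, and the third has no fallback behind it.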

Section 07

The AI Mapping, Made Precise

Pull the metaphor into engineering specifics and each ETOPS construct lands on an agent-governance primitive. The 60-minute rule maps to the “autonomous vs. supervised” binary—a flat range applied regardless of demonstrated task-class reliability. An ETOPS authorization, granted per airframe-engine combination, maps to an earned unsupervised range granted per (agent, task class) pair, never per agent in the abstract. Diversion time maps to the rollback horizon. The adequate alternate maps to a reachable safe harbour. The tier ladder maps to a graduated autonomy ladder. The IFSD-rate gate maps to a demonstrated undetected-failure rate on the task class, below a consequence-scaled threshold. The point of safe return maps to the moment the rollback option is foreclosed by the autonomy or resource envelope. And committing past that point on a forecast maps to acting past the rollback horizon on predicted rather than observed conditions—NZ7571’s exact failure.

The agent’s missed approach is abstention. When an ETOPS flight cannot acquire the runway within its minima, it goes around; it does not descend on hope. The formal machinery for that move in AI is selective prediction—the reject option. Geifman and El-Yaniv’s work on selective classification establishes the risk-coverage trade-off: a model equipped with a reject option can lower its error rate among the predictions it does make, at the cost of abstaining on the rest, and the trade-off is tunable to a target risk. 21 SelectiveNet integrates the reject option directly into the network, abstaining when a learned selector falls below a threshold rather than forcing an answer. 22 The reject option is the agent’s go-around: when the evidence inside the envelope will not support a confident, certifiable action, the sound move is to abstain and hand back, not to commit on a forecast.

The deeper guarantee, the one that makes the envelope an envelope rather than a hope, is a bound on the failure rate inside the granted range. Conformal methods supply exactly this. Conformal abstention for language models gives a distribution-free, finite-sample bound on the rate of hallucinations among answered queries: calibrate on a held-out set of exchangeable examples, compute a nonconformity score for each, and set the abstention threshold at the appropriate quantile of those scores. By rank-based uniformity arguments alone, requiring no parametric assumptions and no knowledge of the data-generating distribution, the fraction of asserted claims that are wrong is then provably held below the chosen tolerance, for as long as deployment inputs remain exchangeable with the calibration set. 23 That is the formal analogue of the IFSD bound. ETOPS requires that, inside the 330-minute envelope, the demonstrated propulsion-system failure rate stay below the threshold the envelope was sized against. Conformal abstention guarantees that, inside the granted autonomy range, the undetected-failure rate stays below the tolerance the range was sized against. The envelope is only meaningful if there is a bound on what happens inside it.
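A minimal sketch of threshold calibration follows. It swaps in the conformal risk control recipe as a stand-in, so the cited work's exact construction may differ. It also bounds a slightly different quantity: the expected fraction of all queries on which the agent asserts something wrong, rather than the rate among answered queries. The function name, the confidence scores, and the toy calibration set are hypothetical, and exchangeability between calibration and deployment is assumed.

```python
import math


def calibrate_threshold(scores: list, wrong: list, alpha: float) -> float:
    """Pick the smallest threshold whose adjusted calibration risk is <= alpha.

    Rule at deployment: assert only if score > threshold, otherwise abstain.
    Loss per example: 1 if the agent asserts AND is wrong, else 0. That loss
    can only fall as the threshold rises, which the recipe relies on.
    """
    n = len(scores)
    for lam in [-math.inf] + sorted(set(scores)):
        wrong_asserted = sum(1 for s, w in zip(scores, wrong) if s > lam and w)
        # Finite-sample adjustment: (n/(n+1)) * empirical risk + 1/(n+1).
        if (wrong_asserted + 1) / (n + 1) <= alpha:
            return lam
    return math.inf  # too little calibration data for this alpha: always abstain


# Toy calibration set: 19 confidence scores, six of them attached to wrong answers.
scores = [i / 20 for i in range(1, 20)]
wrong = [i in {1, 2, 3, 5, 8, 12} for i in range(1, 20)]

lam = calibrate_threshold(scores, wrong, alpha=0.12)
print(lam)
print(calibrate_threshold(scores, wrong, alpha=0.01))
```

With only 19 calibration points, a 1% tolerance cannot be certified at any threshold, so the function falls back to abstaining on everything. That is the small-sample counterpart of a fleet with too few engine-hours to earn a tier.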

The threshold is the knob that sets the tier. Tighten the abstention threshold and the agent asserts less, abstains more, and drives its undetected-failure rate among asserted actions down, at the cost of coverage: the fraction of work it completes without handing back. 21 Loosen it and coverage rises while the failure rate climbs toward the tolerance ceiling. This is the risk-coverage trade-off, and it is the exact dial ETOPS turns when it grants a tier: a wider envelope is a commitment to operate at a lower demonstrated failure rate, which is what a tighter internal threshold produces. An agent earning its way up the autonomy ladder is, mechanically, an agent whose abstention threshold has been calibrated tight enough that the measured failure rate at the new, longer horizon still clears the consequence-scaled bar. The tier is not granted by fiat; it is the observable consequence of a calibrated threshold meeting a measured rate.
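The dial can be seen directly on a toy example. The sketch below computes coverage and selective risk at a few thresholds. The scores and correctness labels are invented for illustration, and `risk_coverage` is a hypothetical helper, not an API from the cited work.

```python
def risk_coverage(scores: list, correct: list, threshold: float) -> tuple:
    """Return (coverage, selective_risk) when acting only if score >= threshold."""
    acted = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(acted) / len(scores)
    # Selective risk: error rate among the actions actually taken.
    risk = sum(1 for c in acted if not c) / len(acted) if acted else 0.0
    return coverage, risk


# Invented confidence scores and whether each action turned out correct.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
correct = [True, True, True, False, True, False, False, True]

for threshold in (0.0, 0.55, 0.75):
    coverage, risk = risk_coverage(scores, correct, threshold)
    print(f"threshold={threshold:.2f} coverage={coverage:.3f} risk={risk:.3f}")
```

On this data, raising the threshold lowers both coverage and risk at each step. On real data the empirical risk need not fall monotonically, which is one reason to calibrate the threshold on held-out evidence rather than pick it by eye.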

This paper’s angle sits deliberately alongside the parent series, not on top of it. The parent argument concerns the correctness of a single emitted claim: hallucination as the absence of a certificate, and the split between what can be deterministically checked and what must be probabilistically bounded. Earned range is the governance envelope around that certified-output machinery. It does not replace the per-claim certificate; it answers a different question. The question is not “is this claim certified?” but “how far may the agent run, accumulating certified and abstained actions, before it must hand back, and how was that distance earned?” The certificate governs the output. The earned range governs the scope of action between human checkpoints. A deployed agent needs both: a certificate on each consequential claim, and an earned, monitored envelope on how long it may go between hand-backs.

Section 08

Earning, Monitoring, Demoting: The Operating Discipline

The deployable posture has five parts, each a direct lift from ETOPS practice. First, define the task class narrowly. ETOPS approves an airframe-engine combination, not “twins”; an agent’s earned range must attach to a narrowly specified (agent, task class) pair, not to “the agent” in general. A configuration that has earned a wide envelope on one task class has earned nothing on a different one—the same way a 777-GE’s ETOPS-330 says nothing about a different engine variant until that variant is separately demonstrated.

Second, earn the tier on measured reliability before widening the range. The rollback horizon starts short, with consequential actions reviewed, and widens only on accumulated evidence of a low, stable undetected-failure rate on that exact task class. No tier is granted on a demo; it is granted on operating hours.

Third, keep the diversion option reachable, and gate on observed, not forecast, conditions. This is NZ7571’s lesson made operational: never let the rollback horizon close on a prediction. At the point where the agent would commit past the point of safe return (past the last position from which a human or gate could still undo the action), require a check against an observed condition, not a model’s forecast that the downstream state will be recoverable. A safe harbour that exists only in a forecast is the fog bank that enveloped Pegasus.

Fourth, specify the alternate set adequately. NZ7571’s investigated gap was an under-specified fallback set: no 757-suitable alternate procedures, thin consideration of suitable aerodromes. 16 An agent’s fallback set (what happens when it abstains, escalates, or hits an error past the rollback horizon) must be specified with the same rigour as the happy path, not left as an afterthought.

Fifth, monitor continuously and demote on regression. The undetected-failure rate is tracked in deployment, by an independent monitor, and the envelope contracts when the rate drifts back up, exactly as ETOPS continues to track the fleet IFSD rate and has standing to act when it rises. 11
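The five parts can be turned into a pre-deployment checklist. The sketch below is one hypothetical way to do that. The field names, the list of over-broad task-class markers, and the 5,000-action minimum are assumptions for illustration, not requirements drawn from ETOPS or from any regulation.

```python
from dataclasses import dataclass, field

VAGUE_TASK_CLASSES = {"", "*", "any", "general"}  # assumed markers of an over-broad scope


@dataclass
class EarnedRangePolicy:
    """Hypothetical record of what backs one granted envelope."""
    agent: str
    task_class: str
    audited_actions: int = 0
    commit_gate_is_observed: bool = False
    fallbacks: list = field(default_factory=list)
    monitor_is_independent: bool = False


def gaps(policy: EarnedRangePolicy, min_actions: int = 5_000) -> list:
    """Return the parts of the five-part discipline the policy fails to meet."""
    found = []
    if policy.task_class.strip().lower() in VAGUE_TASK_CLASSES:
        found.append("1: task class is not narrowly defined")
    if policy.audited_actions < min_actions:
        found.append("2: tier not yet earned on measured evidence")
    if not policy.commit_gate_is_observed:
        found.append("3: commitment gate relies on a forecast")
    if not policy.fallbacks:
        found.append("4: fallback set is unspecified")
    if not policy.monitor_is_independent:
        found.append("5: no independent monitor to trigger demotion")
    return found


ready = EarnedRangePolicy(
    "agent-x", "invoice-matching", 12_000, True, ["abstain", "escalate"], True
)
print(gaps(ready))
print(gaps(EarnedRangePolicy("agent-x", "any")))
```

An empty list is a necessary condition for widening the range, not a sufficient one. The checklist confirms that each part exists, not that it is any good.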

This discipline is not invented from nothing; it is the same shape regulated industries have already institutionalized, which is the point of borrowing it. Banking’s model-risk regime under SR 11-7 demands effective challenge—independent review of a model’s development, assumptions, and outputs by a party structurally separate from the developer—and refuses to let an institution discharge that obligation by buying a vendor model. 24 That is the independent-monitor requirement in another vocabulary. The 1996 Basel backtesting framework—the traffic-light test that counts exceptions over a rolling window and escalates capital requirements when the count rises—is the continuous-monitoring, demote-on-regression analogue in banking, applied to risk models rather than agents. 26 The earned-range posture is asking for the same two things those regimes already require: an independent measure of the failure rate, and a contraction of permission when that rate regresses.
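
The traffic-light test is simple enough to write down. The sketch below uses the zone boundaries and plus factors as commonly stated for the 1996 framework (a 99% one-day VaR model backtested over 250 trading days); those figures come from general knowledge of the framework rather than from this paper, so check them against the Basel text before relying on them. The function names are illustrative.

```python
# Plus factors added to the base multiplier of 3 in the yellow zone,
# as commonly stated for the 1996 Basel backtesting framework.
YELLOW_PLUS_FACTOR = {5: 0.40, 6: 0.50, 7: 0.65, 8: 0.75, 9: 0.85}


def basel_zone(exceptions: int) -> str:
    """Zone for a count of VaR exceptions over a 250-day window."""
    if exceptions < 0:
        raise ValueError("exception count cannot be negative")
    if exceptions <= 4:
        return "green"
    if exceptions <= 9:
        return "yellow"
    return "red"


def capital_multiplier(exceptions: int) -> float:
    """Capital multiplier implied by the exception count."""
    zone = basel_zone(exceptions)
    if zone == "green":
        return 3.0
    if zone == "red":
        return 4.0
    return 3.0 + YELLOW_PLUS_FACTOR[exceptions]
```

The shape is what carries over to agents: a count of observed failures in a fixed window, compared against pre-agreed thresholds, with a graduated penalty that takes effect without renegotiation.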

One honest caveat governs the borrowing. SR 11-7 was superseded in 2026 by SR 26-2, which carries forward the materiality-based, risk-proportionate spine—but SR 26-2 explicitly places generative and agentic AI outside its current scope, treating it as a novel and rapidly evolving area not yet covered by the revised guidance. 25 The model-risk discipline is therefore borrowed here as an engineering pattern, not invoked as a regulatory mandate that already binds agent deployments. The same is true of the aviation source: ETOPS is the design pattern, not a rule that governs software agents. The argument is that the pattern is sound and field-proven, not that any regulator has yet applied it to agents.

Section 09

Honest Limits

The metaphor earns its keep only if its failure modes are named with the same candour as its guarantees. Four limits deserve prominence.

The first is task-class drift. The conformal bound on the failure rate inside the granted range holds when the deployment distribution remains exchangeable with the calibration distribution on which the rate was measured. 23 When the task class quietly shifts—new kinds of inputs, a changed downstream system, an adversary probing the edges—the calibration stops matching the deployment, and the bound that justified the envelope no longer holds. ETOPS has the same exposure: a fleet IFSD rate measured in one operating regime does not automatically transfer to a harsher one. The mitigation is identical in both worlds—treat drift as a monitored risk, measure the live rate against the calibrated tolerance continuously, and demote the envelope when they diverge. 11 Drift is not a solved problem; it is a managed one.
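
A minimal sketch of the live check follows. It substitutes a plain Hoeffding bound for the conformal procedure the paper cites, purely because it fits in a few lines; the function names, the default `delta`, and the three status labels are assumptions made for illustration. The bound itself assumes independent outcomes within the monitoring window.

```python
import math


def hoeffding_interval(failures: int, n: int, delta: float = 0.05) -> tuple[float, float]:
    """Lower and upper bounds on the true failure rate.

    Each one-sided bound holds with probability at least 1 - delta,
    assuming independent outcomes within the window.
    """
    if n <= 0:
        raise ValueError("need at least one observed action")
    rate = failures / n
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return max(0.0, rate - eps), min(1.0, rate + eps)


def envelope_status(failures: int, n: int, tolerance: float, delta: float = 0.05) -> str:
    """Compare the live window against the calibrated tolerance."""
    lower, upper = hoeffding_interval(failures, n, delta)
    if lower > tolerance:
        return "regressed"      # live rate is credibly above tolerance: demote
    if upper <= tolerance:
        return "holds"          # live rate is credibly within tolerance
    return "inconclusive"       # window too small to say either way
```

With a tolerance of 0.02, a window of 10,000 actions with 50 failures returns "holds", the same window with 500 failures returns "regressed", and a window of 100 actions with one failure is "inconclusive". How to treat the inconclusive case is a policy choice; a conservative deployment might hold the envelope at its current tier or contract it until the window fills.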

The second, and the most dangerous, is the forecast-versus-observed gap—NZ7571’s exact failure. The single most reliable way to turn an earned envelope into a hazard is to let the agent cross its point of safe return on a prediction that the conditions past it will be recoverable, rather than on an observation. A forecast that the downstream system will accept a rollback, that a dependency will respond, that the world will be in the state the planner assumed—each is a fog forecast at Pegasus. The discipline that closes this gap is narrow and non-negotiable: the gate at the point of safe return must read an observed condition, never a forecast. An earned range never licenses operating past a foreclosed fallback on faith.
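
One way to make this gate hard to bypass is to carry the provenance of every condition in its type and have the gate ignore anything that is not a fresh observation. The sketch below is a hypothetical shape, not a prescribed interface: `Condition`, `Source`, `may_commit`, and the 30-second freshness limit are all illustrative names and values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Source(Enum):
    OBSERVED = "observed"   # read from the live system
    FORECAST = "forecast"   # predicted by a model or planner


@dataclass(frozen=True)
class Condition:
    name: str
    ok: bool
    source: Source
    age_s: float            # seconds since the reading was taken


def may_commit(
    conditions: Iterable[Condition],
    required: Iterable[str],
    max_age_s: float = 30.0,
) -> bool:
    """Allow crossing the point of safe return only if every required
    condition has a fresh, observed, passing reading. Forecasts are
    ignored however favourable they are."""
    confirmed = {
        c.name
        for c in conditions
        if c.source is Source.OBSERVED and c.ok and c.age_s <= max_age_s
    }
    return set(required) <= confirmed
```

A favourable forecast for `rollback_endpoint_up` does not open the gate, and neither does an observation that has gone stale; only a recent observed reading does. A required condition with no reading at all also keeps the gate closed, which is the safe default.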

The third is monitor independence. The entire earned-range argument rests on a measured undetected-failure rate, and a measurement is only as trustworthy as the independence of the monitor from the agent it watches. 24 A monitor correlated with the agent—same model, same training data, same blind spots—under-reads the failure rate, and under-reads it precisely in the systematic cases that matter most, manufacturing an unearned envelope on a flattering self-assessment. ETOPS avoids this by measuring the fleet rate through an independent authority fed by mandatory reporting. 10 Agent governance must do the structural equivalent: the monitor that earns the range must not be the agent grading itself.
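
The under-reading can be shown with simple arithmetic. The sketch below splits failures into idiosyncratic ones and systematic ones (those rooted in a blind spot a correlated monitor would share) and computes the rate each monitor would report. The split, the function name, and every number in the usage note are illustrative assumptions, not measurements.

```python
def measured_rate(
    p_idiosyncratic: float,
    p_systematic: float,
    catch_idiosyncratic: float,
    catch_systematic: float,
) -> float:
    """Failure rate a monitor reports, given the true rate of each
    failure type and the fraction of each type the monitor catches."""
    return (
        p_idiosyncratic * catch_idiosyncratic
        + p_systematic * catch_systematic
    )


# Illustrative numbers only: the true failure rate is 0.010.
TRUE_IDIOSYNCRATIC = 0.002
TRUE_SYSTEMATIC = 0.008

# An independent monitor catches 95% of both kinds.
independent = measured_rate(TRUE_IDIOSYNCRATIC, TRUE_SYSTEMATIC, 0.95, 0.95)

# A correlated monitor shares the blind spot and catches 10% of systematic failures.
correlated = measured_rate(TRUE_IDIOSYNCRATIC, TRUE_SYSTEMATIC, 0.95, 0.10)
```

Under these assumptions the independent monitor reports about 0.0095 and the correlated one about 0.0027. Against a tolerance of 0.005, the correlated monitor would grant a tier that the true rate of 0.010 does not support.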

The fourth is the boundary of the claim itself. Earned range governs the scope of action between checkpoints; it does not certify any individual action—that is the parent series’ job, and the two must be deployed together. A wide earned envelope around an agent whose per-action certificates are weak is a long unsupervised run accumulating uncertified actions. The envelope is necessary, not sufficient. And the deepest limit is the one NZ7571 makes vivid: an earned range never authorizes operating past a foreclosed fallback. No accumulated reliability, no tier, no clean operating history converts a forecast into an observation or a foreclosed return into a reachable one.

Earn the range on measured, independently-verified reliability for the exact task class. Widen it tier by tier and demote it on regression. Keep the diversion option reachable—and never let the rollback horizon close on a forecast.

The discipline, in one line

Aviation did not earn the wider envelope by flying less. It earned it by measuring the failure rate that justified each tier, monitoring it forever, and refusing to grant the next tier until the evidence held. The economic prize—the straight line over the ocean—was real, and it went to the operators who did the reliability work, not to the ones who asserted the range. The AI field can have the same prize, the longer unsupervised run that removes the human-checkpoint dogleg, on the same terms. Autonomy is not a switch. It is a range you earn—and keep earning—and the discipline is the whole point.

