A Great Demo Quarter Is Not a Backtest

Basel backtesting, the LTCM lesson, and the escape-rate machinery that lets an agent earn, aggregate, and auto-revoke its own authority.

KellerAI White Paper · In-Depth · Engineering Discipline & Verification · Jun 2026 · ~24 min read

Context

AI teams routinely grant agents wider autonomy after a run of impressive results — clean pull requests, no escalations, a smooth demo quarter. Each of these judgments treats accuracy on cases shown as authority over cases to come.

The Finding

That judgment is wrong in a specific and expensive way. Authority is priced in measured failure data over enough trials to exclude luck — the same traffic-light discipline banking has run on risk models for thirty years. A long demo is still a demo; only a count earns authority.

Tags: agent autonomy, backtesting, escape rate, traffic light regime, Basel VaR, LTCM, OBL-RES-001, cumulative blast radius
Cite this paper

KellerAI. (2026, June 21). Backtested, Not Demoed: A Traffic-Light Regime for Agent Autonomy. KellerAI. https://kellerai.blog/backtested-not-demoed-in-depth

Paper Details
Category: Engineering Discipline & Verification
Audience: Engineering leads, AI governance teams, and risk officers evaluating when and how to widen agent autonomy in production systems.
Method: Analytical · evidence-based
Length: ~5800 · ~24 min
Reading level: Technical
References: 20
Published: Jun 2026
Key Takeaways
  • Authority over an autonomous agent must be priced in a backtested escape rate — not a demo. The conformance rule reads the upper confidence bound, not the point estimate, and requires at least 149 trials at CT2 and 598 at CT3 before a claim is even demonstrable.
  • OBL-AGG-001 requires that the escape budget be measured against cumulative blast radius across all actions and agents in a governance unit over a rolling window — a sequence of individually-passing sub-threshold actions can still roll the unit into red, exactly as LTCM's diversified positions collapsed together.
  • The discipline is the dividend: only teams that can count their agents' escapes over enough trials can remove human checkpoints defensibly. The autonomy budget is what the backtesting discipline buys — and it is auto-revoked the moment the count turns, without anyone having to argue about it.
Section 01

Accuracy Is Not Authority

There is a persistent instinct in AI deployment to treat a good run as proof of reliability. The agent shipped a hundred clean pull requests; let it merge. It handled the quarter's tickets without an escalation; widen its scope. Each of these is a judgment that accuracy on the cases shown is authority over the cases to come. That judgment is wrong in a specific and expensive way: a great demo quarter is not a backtest. A run of favorable results is a single draw from a single regime, and a single draw is not a distribution. It tells you almost nothing about a rate, because a rate is a property of a distribution and a property you have not measured cannot be the basis of authority you grant.

Authority is a different quantity altogether, and banking spent decades learning to price it precisely. Authority is priced in measured failure data over enough trials to exclude luck, scaled to the consequence of the action, and it contracts automatically the moment the data turns against it. The bound is not a claim a system gets to make on the strength of its builders' confidence; it is a measured commitment it gets to keep, audited continuously, with rising costs as the evidence sours. 1 This paper develops that discipline — banking's daily backtesting regime, its traffic light, and its move to govern the tail — and translates it, in concrete units, into a regime for governing how much an autonomous agent is allowed to do. It is the second article in KellerAI's Banking Stack: the first, Effective Challenge, establishes that the entity counting the failures must be structurally separate from the one producing them; the third, You Cannot Outsource the Obligation to Govern, establishes that the count belongs to the deployer regardless of who built the model. This article sits between them and supplies the instrument they both presuppose — the traffic light that converts a count of failures into a revocable grant of authority. 20

A great demo quarter is not a backtest. Accuracy on the cases you showed is not authority over the cases you will face. Authority is priced in measured failure data over enough trials to exclude luck — and it contracts automatically when the data turns.

The load-bearing inversion

The posture this argues for is neither credulous nor paralyzed. It is not the demand that an agent never fail, and it is not the refusal to grant any autonomy because a model can be wrong. It is the actuarial posture: concede that failures will occur, measure their rate, price the authority budget to that rate, aggregate the exposure, and revise both as the evidence accumulates. That is exactly what a bank does to a risk model every trading day, and it is what AI autonomy has not yet learned to do.

Section 02

Backtesting: The Discipline That Won't Be Taken on Faith

A market-risk model states a Value-at-Risk: a loss threshold that, the model claims, will not be exceeded on more than one percent of days — a 99% one-day VaR. 2 That claim is a precise probabilistic assertion, and supervisors do not accept it on the strength of the model's design. They count. Under the Basel Committee's 1996 backtesting framework, the bank counts, over a rolling 250-trading-day window, how many days the realized loss exceeded the stated threshold. 3 If the model is well-calibrated, roughly two and a half exceptions per year are expected out of 250 trading days. The framework then sorts the observed count into a traffic light.

Zero to four exceptions is the green zone and the model stands. 3 Five to nine is the yellow zone and the regulatory capital multiplier rises — the model is allowed to keep operating, but its use gets more expensive in direct proportion to how far the evidence has drifted. Ten or more is the red zone and the model is presumed broken until recalibrated. 3 The authority to use an internal model to set your own regulatory capital is not granted once and filed away; it is recomputed on every rolling window, and it is revoked by the count itself. The bound is a measured commitment a bank gets to keep, not a claim a bank gets to make.
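The zone boundaries above reduce to a few lines of code. The sketch below is illustrative only: the function names `count_exceptions` and `basel_zone` are hypothetical, and it assumes losses and VaR thresholds are supplied as positive numbers aligned day by day.

```python
from typing import Sequence

WINDOW_DAYS = 250  # rolling backtest window described in the text


def count_exceptions(losses: Sequence[float], var_thresholds: Sequence[float]) -> int:
    """Count days in the most recent window where the realized loss exceeded VaR."""
    if len(losses) != len(var_thresholds):
        raise ValueError("losses and var_thresholds must align day by day")
    recent = list(zip(losses, var_thresholds))[-WINDOW_DAYS:]
    return sum(1 for loss, var in recent if loss > var)


def basel_zone(exceptions: int) -> str:
    """Map an exception count to the traffic-light zone (0-4, 5-9, 10 or more)."""
    if exceptions < 0:
        raise ValueError("exception count cannot be negative")
    if exceptions <= 4:
        return "green"
    if exceptions <= 9:
        return "yellow"
    return "red"
```

Nothing in the mapping consults an opinion about the model: the zone is a pure function of the count, which is the property the rest of this paper relies on.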

The statistical machinery underneath the traffic light is worth naming, because it is the same hypothesis-testing logic AI assurance has to adopt. Kupiec's 1995 proportion-of-failures test is a likelihood-ratio check of whether the empirical exception rate is consistent with the model's stated tolerance — an unconditional-coverage test. 4 Christoffersen's 1998 conditional-coverage test adds the requirement that exceptions not cluster in time, because real losses bunch up when market regimes shift, and a model that passes the count test while failing the clustering test is miscalibrated precisely when miscalibration is most dangerous. 5 Together they formalize a single demand: it is not enough for the average rate to look acceptable; the rate must be stable, and its failures must not concentrate.
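To make the two tests concrete, here is a minimal sketch of both likelihood-ratio statistics using only the standard library. The function names are hypothetical, `hits` is assumed to be a list of 0/1 exception indicators in time order, and each statistic is compared against 3.841, the 95% critical value of a chi-square distribution with one degree of freedom.

```python
import math
from typing import Sequence

CHI2_95_1DF = 3.841  # 95% critical value, chi-square with 1 degree of freedom


def _xlogy(x: float, y: float) -> float:
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_pof(hits: Sequence[int], p: float) -> float:
    """LR statistic for unconditional coverage: is the exception rate equal to p?"""
    n, x = len(hits), sum(hits)
    rate = x / n
    ll_null = _xlogy(n - x, 1 - p) + _xlogy(x, p)
    ll_alt = _xlogy(n - x, 1 - rate) + _xlogy(x, rate)
    return -2.0 * (ll_null - ll_alt)


def christoffersen_independence(hits: Sequence[int]) -> float:
    """LR statistic for independence: do exceptions follow exceptions too often?"""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, cur in zip(hits, hits[1:]):
        counts[(prev, cur)] += 1
    n00, n01, n10, n11 = counts[0, 0], counts[0, 1], counts[1, 0], counts[1, 1]
    total = n00 + n01 + n10 + n11
    pi = (n01 + n11) / total                       # overall exception rate
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0  # rate after a quiet day
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0  # rate after an exception
    ll_null = _xlogy(n00 + n10, 1 - pi) + _xlogy(n01 + n11, pi)
    ll_alt = (_xlogy(n00, 1 - pi01) + _xlogy(n01, pi01)
              + _xlogy(n10, 1 - pi11) + _xlogy(n11, pi11))
    return -2.0 * (ll_null - ll_alt)
```

Four exceptions in 250 days give the same Kupiec statistic whether they are spread across the year or fall on four consecutive days, and in both cases it sits below the critical value. Only the independence statistic separates the two patterns. Christoffersen's conditional-coverage statistic is the sum of the two, compared against a chi-square with two degrees of freedom.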

A model that passed last year is not exempt this year. The exception count is recomputed on a rolling window, and a model that drifts into yellow or red loses the treatment it had earned. Authority priced in failure data is always provisional.

What the traffic light actually enforces

Two features of the regime are easy to miss and load-bearing for everything that follows. The first is that the count is taken over enough trials to mean something — a 250-day window, not a good week. The second is that the response to a breach is automatic and graduated: yellow does not require a committee to decide the model is suspect; the multiplier simply rises. 3 The system does not wait for human judgment to notice a problem the count has already proven.

The traffic light does not float free of a governance framework; it sits inside one, and the framework is what makes the count consequential. In the United States that framework is SR 26-2, the interagency model-risk standard that in April 2026 superseded SR 11-7 after fifteen years as the canon. It is built on effective challenge (credible independent review with the authority to change the model) and implemented through a validation lifecycle of conceptual soundness, ongoing monitoring, and outcomes analysis by backtesting, with risk-tiering by materiality and explicit vendor coverage. 1 Backtesting is the outcomes-analysis leg of that lifecycle: the traffic light is how ongoing monitoring is operationalized, and the capital multiplier is how a yellow finding is priced rather than merely noted. A model that posts five exceptions does not generate a memo for a future review cycle; it changes the bank's capital requirement on the next reporting date. The cost is the enforcement.

There is a specific reason supervisors fixed the conformance rule to a count against a threshold rather than to a regulator's after-the-fact judgment of whether the losses "felt" excessive. A count is adversarially robust in a way a judgment is not. A bank cannot argue its way out of ten exceptions, and a supervisor cannot be lobbied into forgiving them; the window closes, the number is what it is, and the zone follows mechanically. 3 The same property is what makes the regime portable to software: an escape count over a fixed window is a fact a vendor cannot reframe and a deployer cannot wish away, where "the agent seemed reliable in the pilot" is exactly the reframeable judgment the count is designed to replace.

Section 03

The Named Losses: Where the Demo Diverged From the Tail

The discipline of counting exceptions over enough trials was not invented in the abstract. It was priced in retrospect by losses whose in-sample records were immaculate the morning before they failed.

Long-Term Capital Management is the canonical case, and its failure is worth tracing in sequence rather than in summary, because the sequence is the lesson. 6 The fund was run by principals including two Nobel laureates in economics, and its models carried a pristine in-sample track record — years of returns that validated the risk framework on every regime the data had seen. The first thing that failed was not a model output but a category error about what the track record certified. The returns were a draw from the regimes the 1990s had actually produced; the models inferred from that draw a distribution, and from that distribution a tail, that no observation had ever tested. The in-sample record was, in the vocabulary of this paper, a demo of extraordinary length — and a long demo is still a demo, because length in a favorable regime does not sample the unfavorable one.

The second thing that failed was the independence assumption. LTCM's positions were diversified on paper across instruments, geographies, and strategies whose historical correlations were low, and the risk framework priced them as largely independent bets. Then, in August 1998, Russia defaulted on its domestic debt and devalued the ruble, and a flight to liquidity swept every risky market at once. The correlations the model had measured at perhaps 0.1 in calm regimes went to nearly 1.0 in the crisis: positions that were supposed to offset each other moved together, and the diversification that backstopped the leverage evaporated in days. The tail event the in-sample record had never sampled blew through the fund's risk thresholds simultaneously and across positions the models had treated as separate. The losses ran to roughly US$4.6 billion, and in late September the Federal Reserve organized a recapitalization by a consortium of major banks to prevent the disorderly unwinding from cascading through the counterparties that had lent against those positions. 6 The track record had backtested the body of the distribution and never the tail; the demo was excellent and the backtest of the rare event did not exist.

The agent translation of this sequence is exact, and it is why two of this paper's obligations are necessary rather than redundant. The category error — mistaking a long favorable run for a measured rate — is what OBL-RES-001 forbids by demanding a count over a sample-size floor against a tier tolerance. The independence collapse — failures that were supposed to be uncorrelated arriving all at once — is what OBL-AGG-001 addresses by measuring the cumulative blast radius across actions rather than action by action, on the premise that a single root cause can fail many things together exactly when the per-action statistics look benign.

The 2008 crisis reinforced the same lesson at fleet scale, and it reinforced specifically the part of the regime that the count alone does not catch. Across the industry, banks breached their daily VaR far more often than the roughly one-percent-of-days — about two and a half exceptions a year — that a 99% model implied; exceptions clustered exactly as Christoffersen's conditional-coverage test warns they will when a regime shifts. 7 The clustering is the tell. A model can run for years inside its green zone, accumulating an exception count that an unconditional test reads as perfectly acceptable, and then post its entire year's budget of exceptions in a single fortnight — which is precisely the pattern of a system whose failures are correlated by a common cause the calm-regime data never exposed. Goldman Sachs's then–chief financial officer captured the disconnect in a now-famous remark that the firm was seeing "25-standard-deviation moves, several days in a row." 8 A 25-sigma daily move is, under the model's own assumptions, an event that should not occur in many times the age of the universe; observing several in a row is not a run of bad luck but proof that the model's tail was fiction. The demo — the in-sample calibration — had certified a distribution whose tail the market then refused to honor.

The clustering finding is what makes Christoffersen's test, not merely Kupiec's, load-bearing for the agent regime. An escape-rate harness that checks only the average rate over a window is the unconditional test; it will pass a model whose escapes are bunching under a single emerging root cause — a prompt-injection family, a tool API that silently changed, a distribution shift in the inputs — right up until the burst arrives. The honest agent traffic light has to watch for concentration as well as count, because the most dangerous failure mode is the one that is invisible to the average and obvious only to the clustering test, and it is invisible to the average for exactly as long as it takes to become catastrophic.

A pristine in-sample record is not a backtest of the tail. LTCM and the 2008 exceedances were models that priced the average beautifully and the rare event not at all. Only counted exceptions, over enough trials, backtest the regime you have not yet seen.

The forecast-versus-observed lesson

The structure recurs whenever a system commits on the strength of a record drawn from a favorable regime, with no count of how often the regime has actually held. It is the exact failure shape an autonomous agent inherits when its autonomy is granted on a demo rather than on a measured, stable, low rate of undetected failure.

Section 04

The Shared Accounting Move

Strip the banking regime to its skeleton and a four-step procedure appears, and it is the same procedure aviation reached independently for governing how far a twin-engine airliner may fly from a runway. First, state a tolerance scaled to consequence — a VaR exception budget mapped to a capital multiplier; in aviation, an in-flight-shutdown-rate ceiling that tightens as the permitted diversion minutes grow. 3 9 Second, measure the empirical failure rate against that tolerance over real, accumulated experience — a rolling 250-day backtest; in aviation, world-fleet reliability tracked across hundreds of thousands of engine-hours. 3 9 Third, widen authority only when the measured rate is demonstrated stable and low. Fourth, monitor continuously and contract authority when the rate drifts — the Basel multiplier climbing as exceptions accumulate; the ETOPS approval constrained on a rising fleet rate. One procedure, two industries, no shared committee.

The currency in both systems is accumulated operating experience, and it cannot be front-loaded. You cannot buy an ETOPS tier; you must fly the engine-hours and let the fleet rate prove itself. 9 You cannot assert a VaR model into regulatory acceptance; you must run it against live markets for enough days that the exception count carries information. The KellerAI ETOPS-for-AI stack develops the aviation half of this convergence in full — the in-flight-shutdown ledger, the earned-and-revocable diversion budget, and the tail correction — and a companion paper in this very stack establishes that the four-step move is the right discipline for AI autonomy in general. 10 This paper's contribution is narrower and sharper: the traffic light itself — the graduated, automatic, count-driven contraction of authority — and its translation into a regime an agent runs on its own actions.

That two industries reached the identical structure without a shared committee is the strongest available evidence that the structure is forced by the problem rather than chosen by a culture. Consider what the two share once the surface vocabulary is stripped away. Both grant authority as a budget rather than a binary license (diversion minutes in one, capital multiplier in the other) so that the response to deterioration can be graduated rather than all-or-nothing. Both denominate that budget in a measured rate over accumulated experience rather than a design claim, so the grant is falsifiable by observation. Both tighten the tolerance as the consequence grows (ETOPS's in-flight-shutdown ceiling stepping from roughly 0.05 to 0.01 per 1,000 engine-hours as the diversion minutes lengthen, Basel's multiplier rising as the exception count climbs) so that the systems whose failures matter most are held to the strictest rate. And both keep the accounting standing rather than one-time: the ETOPS reporting obligation persists across hundreds of thousands of engine-hours until a stable rate is shown, and the Basel count is recomputed on every rolling window, so authority earned once is never authority owned forever. A binary license, a design claim, a flat tolerance, and a one-time gate would each have been simpler. Two independent regulators rejected all four, under different pressures, because each is a way for unmeasured risk to accumulate behind a stale grant.

Banking made one further correction that aviation made under its own name, and it is essential to the agent translation. A regime that prices only the average is dangerous in a specific way: the rare event, not the typical day, dominates the real risk. The Fundamental Review of the Trading Book replaced Value-at-Risk with Expected Shortfall at the 97.5th percentile as the primary internal-models risk measure. 11 Value-at-Risk is a quantile threshold — it marks the boundary of acceptable loss but says nothing about how bad the loss is once the boundary is crossed. Expected Shortfall is the average of the losses beyond the threshold; it measures the tail directly. 11 FRTB's move is the LTCM lesson rendered in rule: govern by the worst credible failure, not by the comfortable mean. For an agent, the analogue is to budget not against the escape rate alone but against the blast radius of the escapes that do occur — which is exactly why the aggregation obligation measures cumulative consequence rather than a bare count, and why a half-percent escape rate on irreversible CT4 actions is intolerable where the same rate on reversible CT2 ones is fine.
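The difference between a quantile and a tail average is easy to see on a small sample. The sketch below is an assumption-laden illustration: `var_and_es` is a hypothetical helper, it uses a simple empirical quantile rather than any regulatory calculation, and it defines Expected Shortfall as the mean of the sample losses strictly beyond the VaR point.

```python
import math
from typing import Sequence, Tuple


def var_and_es(losses: Sequence[float], level: float) -> Tuple[float, float]:
    """Return (VaR, ES) for a sample of losses, where larger means worse."""
    ordered = sorted(losses)
    n = len(ordered)
    # Index of the empirical quantile; round() guards against float noise in level * n.
    k = math.ceil(round(level * n, 9)) - 1
    var = ordered[k]
    tail = ordered[k + 1:]  # losses beyond the VaR threshold
    es = sum(tail) / len(tail) if tail else var
    return var, es


# Two made-up books of 200 daily losses with identical bodies and different tails.
mild = [1.0] * 195 + [2.0] * 5
wild = [1.0] * 195 + [2.0, 50.0, 50.0, 50.0, 50.0]
```

At the 97.5th percentile both books report a VaR of 1.0, so a quantile-only regime cannot tell them apart. Their Expected Shortfall is 2.0 and 40.4 respectively, which is the information the tail measure exists to surface.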

Section 05

Mapping to AI Agent Autonomy

Now the move can be applied to its target, in the concrete units the LLM-Agent Assurance corpus already defines. The governance unit is the (agent, task-class) pair, gated by a Consequence Tier from CT0 to CT4 derived from the action's observed effect surface — its actual reversibility, scope, and consequence — never from the agent's self-report. 12 CT0 is read-only and merely traced. CT1 is a reversible internal write with a self-check. CT2 is a reversible or low-value external effect. CT3 is a hard-to-reverse or material external effect. CT4 is irreversible or high-consequence. 12 The traffic light maps onto this tier ladder directly, and the quantity it counts is the escape rate: the rate at which a wrong result survives every applicable check and is acted upon, reported where a higher-is-better figure is wanted as integrity = 1 − escape_rate. 13

The escape-rate tolerance is the VaR exception budget, scaled to consequence exactly as the capital multiplier is. The CT2 tolerance is 0.02 — two percent. The CT3 tolerance is 0.005 — half a percent. The CT4 tolerance is zero. 14 The zero is not a rhetorical flourish; it is the design's sharpest edge. A binomial upper bound over any finite sample is strictly greater than zero, so a backtest can never pass a zero tolerance — which is correct, because CT4 is exactly the human-gated, abstention-default tier, and an irreversible high-consequence action is not something a count of past successes was ever entitled to authorize. 14 The traffic light's red zone, at CT4, is the entire zone.
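The tier ladder and its tolerances can be written down directly, which also makes the zero-tolerance point checkable. This is a sketch under stated assumptions: the names `Tier`, `TOLERANCE`, `integrity`, and `zero_escape_upper_bound` are hypothetical, and the bound is taken as a one-sided 95% binomial bound at zero observed escapes.

```python
import math
from enum import IntEnum


class Tier(IntEnum):
    CT0 = 0  # read-only, merely traced
    CT1 = 1  # reversible internal write with a self-check
    CT2 = 2  # reversible or low-value external effect
    CT3 = 3  # hard-to-reverse or material external effect
    CT4 = 4  # irreversible or high-consequence


# Escape-rate tolerances stated in the text; CT0 and CT1 are not given one there.
TOLERANCE = {Tier.CT2: 0.02, Tier.CT3: 0.005, Tier.CT4: 0.0}


def integrity(escapes: int, trials: int) -> float:
    """Higher-is-better form of the escape rate: integrity = 1 - escape_rate."""
    return 1.0 - escapes / trials


def zero_escape_upper_bound(trials: int, alpha: float = 0.05) -> float:
    """Exact one-sided upper bound on the escape rate after zero observed escapes.

    Equal to 1 - alpha ** (1 / trials); expm1 keeps it accurate for very large samples.
    """
    return -math.expm1(math.log(alpha) / trials)
```

However large `trials` becomes, `zero_escape_upper_bound` stays above zero, so no finite backtest can clear `TOLERANCE[Tier.CT4]`.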

The "enough trials to exclude luck" clause becomes a sample-size floor, and it is the part most often skipped in practice. A tolerance is only demonstrable if the achievable upper confidence bound at zero observed escapes can fall at or below it. At 95% confidence, the exact zero-escape bound 1 − 0.05^(1/n), which the rule of three approximates as 3/n, requires at least 149 samples at CT2 and at least 598 at CT3. 14 Below the floor, the result is not a pass; it is indeterminate for insufficient sample size. A demo of twenty clean runs is not a backtest of a half-percent tolerance any more than a good week is a backtest of a 99% VaR: the sample cannot, even in principle, carry the claim. And the conformance rule is deliberately conservative in the same way Basel's is: an action passes only if the upper bound of the confidence interval is at or below tolerance, never the point estimate, because validating the point estimate would ignore sampling error and wave through underpowered sets. 14
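The two floors can be reproduced from the tolerance alone. The sketch below assumes a one-sided 95% bound at zero observed escapes, 1 − α^(1/n) with α = 0.05, and solves for the smallest n at which it falls to the tolerance; `sample_size_floor` is a hypothetical name. The familiar 3/n shortcut approximates the same bound and lands slightly higher, at 150 and 600.

```python
import math


def sample_size_floor(tolerance: float, alpha: float = 0.05) -> int:
    """Smallest n at which zero observed escapes gives an upper bound <= tolerance."""
    if not 0.0 < tolerance < 1.0:
        # A zero tolerance has no finite floor, which is why CT4 stays human-gated.
        raise ValueError("tolerance must be strictly between 0 and 1")
    # Solve 1 - alpha ** (1 / n) <= tolerance for n.
    return math.ceil(math.log(alpha) / math.log(1.0 - tolerance))


ct2_floor = sample_size_floor(0.02)   # 149
ct3_floor = sample_size_floor(0.005)  # 598
```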

A worked illustration makes the floor's bite concrete. Suppose an agent is to be granted CT3 authority (a hard-to-reverse external effect, say committing a refund or filing a downstream record) on a task class where it has run 300 evaluation cases with zero observed escapes. The point estimate of the escape rate is 0%, which looks like a flawless backtest. But the half-percent CT3 tolerance demands that the upper 95% confidence bound sit at or below 0.005, and the rule of three says that at zero escapes the upper bound is approximately 3/n: with n = 300, that bound is about 0.010, twice the tolerance. The 300-case run is not a CT3 pass; it is indeterminate, because the sample is too small to exclude a true escape rate that would breach the tier even though not one escape was seen. Only at n ≥ 598 does the exact zero-escape upper bound, 1 − 0.05^(1/n), fall to 0.005 and the claim become demonstrable; the 3/n approximation would put the floor at 600. Now suppose that over 600 cases the agent posts three escapes rather than zero: the upper bound climbs to roughly 0.013, well above 0.005, and the backtest fails. An agent seeking CT3 never reaches it, and any standing CT3 grant auto-contracts on the breach. The point estimate of 0.5% would have read as exactly on tolerance; the conformance rule, reading the upper bound, correctly refuses it. This is the same discipline that stops a bank from claiming a 99% VaR model is validated on a quiet quarter with one exception: the count must be powered to exclude luck before its silence means anything.
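The three cases in the illustration can be checked mechanically. The following sketch is one possible reading of the conformance rule, not a reference implementation: it assumes a one-sided 95% Clopper-Pearson upper bound, found here by bisection on the binomial CDF so that only the standard library is needed, and the names `upper_bound` and `conformance` are hypothetical.

```python
import math

FLOOR = {"CT2": 149, "CT3": 598}         # sample-size floors from the text
TOLERANCE = {"CT2": 0.02, "CT3": 0.005}  # tier tolerances from the text


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def upper_bound(escapes: int, trials: int, alpha: float = 0.05) -> float:
    """One-sided Clopper-Pearson upper bound on the true escape rate."""
    if escapes >= trials:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        # The CDF falls as p rises; the bound is the p at which it equals alpha.
        if binom_cdf(escapes, trials, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def conformance(tier: str, escapes: int, trials: int) -> str:
    """Return 'pass', 'fail', or 'indeterminate' for a backtest at CT2 or CT3."""
    if trials < FLOOR[tier]:
        return "indeterminate"  # too few trials to carry the claim either way
    return "pass" if upper_bound(escapes, trials) <= TOLERANCE[tier] else "fail"
```

Under these assumptions, `conformance("CT3", 0, 300)` is indeterminate, `conformance("CT3", 0, 598)` passes, and `conformance("CT3", 3, 600)` fails even though its point estimate sits exactly on tolerance.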

Measure the escape rate over at least the sample-size floor — 149 at CT2, 598 at CT3 — against the tier tolerance. Green stands. A breach contracts the tier automatically. No committee, no argument; the count revokes the authority, exactly as the Basel multiplier rises on its own.

The agent's own traffic light

Two further units complete the regime. The first is cumulative blast-radius aggregation: the escape budget is not spent one action at a time. It is measured against the cumulative blast radius across actions and across agents within a governance unit over a rolling window — the same windowed aggregate that, in the gate, re-tiers a sequence of sub-threshold transfers so they cannot launder a high-tier effect. 15 Aggregation can move a unit from green to red even when every single action passed, just as a string of individually-tolerable VaR breaches sums into a red-zone count. This is the direct countermeasure to the LTCM independence collapse: a thousand small reversible writes that each pass CT2 in isolation can, in aggregate over the window, constitute a hard-to-reverse material change of state — and the aggregate, not the action, is what the tier must be derived from. The blast radius is the consequence scale of an action; the cumulative blast radius is that scale summed across the governance unit, and it is the quantity the escape budget is actually spent against.
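A rolling aggregate of this kind might look like the sketch below. Everything specific in it is an assumption made for illustration: the class name `BlastRadiusWindow`, the use of a single numeric blast-radius score per action, the single `ct3_threshold`, and the window length are not defined by the source.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple


@dataclass
class BlastRadiusWindow:
    """Rolling aggregate of blast radius for one governance unit."""
    window_seconds: float
    ct3_threshold: float  # hypothetical cumulative level treated as a CT3 effect
    events: Deque[Tuple[float, float]] = field(default_factory=deque)

    def record(self, timestamp: float, blast_radius: float, action_tier: int) -> int:
        """Add an action and return the tier it must be gated at."""
        self.events.append((timestamp, blast_radius))
        # Drop actions that have aged out of the rolling window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            self.events.popleft()
        cumulative = sum(radius for _, radius in self.events)
        if cumulative >= self.ct3_threshold:
            # Aggregation may raise the tier of an action but never lower it.
            return max(action_tier, 3)
        return action_tier
```

With a threshold of 100, four CT2 actions of radius 30 inside one window are gated at tiers 2, 2, 2, and 3: the fourth is individually unremarkable, and it is the aggregate that re-tiers it.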

The second unit is the revocable envelope, and it brings the rollback horizon with it. Authority is granted per tier on measured evidence and auto-contracts on a backtest breach. 16 The rollback horizon — the bounded time and work within which a committed action can be undone — is what separates the tiers in the first place: a CT2 effect is reversible within a short, rehearsed horizon, a CT3 effect is hard to reverse, and a CT4 effect is irreversible, its rollback horizon effectively infinite. Authority and rollback horizon move together: as the horizon lengthens toward irreversibility, the tolerance tightens toward zero and the abstention default engages. The diversion budget an airframe earns is contracted on a rising fleet rate; the capital treatment a model earns is contracted on a rising exception count; the autonomy an agent earns is contracted on a rising escape rate. The envelope is earned, and it is revocable, and the contraction is automatic — it does not wait for an operator to reauthorize what the count has already revoked.
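The envelope's behavior can be sketched as a small state holder. The contraction policy shown here is an assumption, since the text says only that authority auto-contracts on a breach: this version drops the grant to one tier below the failed tier, never below a hypothetical `floor_tier`, and treats an indeterminate backtest as neither widening nor contracting.

```python
from dataclasses import dataclass


@dataclass
class Envelope:
    """Authority granted to one (agent, task-class) unit, as a tier number."""
    granted_tier: int
    floor_tier: int = 1  # hypothetical: contraction stops at reversible internal writes

    def apply_backtest(self, tier: int, result: str) -> int:
        """Update the grant from a backtest result at `tier` and return the new grant."""
        if tier >= 4:
            # CT4 stays human-gated; a count of past successes cannot widen into it.
            return self.granted_tier
        if result == "pass":
            self.granted_tier = max(self.granted_tier, tier)
        elif result == "fail" and self.granted_tier >= tier:
            # Breach: contract below the failed tier without waiting for an operator.
            self.granted_tier = max(self.floor_tier, tier - 1)
        # "indeterminate" leaves the grant unchanged.
        return self.granted_tier
```

The design point is that `apply_backtest` takes no approver argument: the result string is the only input that can move the grant, in either direction.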

Section 06

The LAAS Obligations This Regime Defines

The discipline above is enforced by two new obligations, which we introduce here with stable identifiers. The LLM-Agent Assurance Standard's in-depth treatment defines the obligation framework — the no-self-grading invariant, the escape-rate metric, the backtest harness, the deterministic precedence ordering — but never enumerates stable identifiers for the obligations themselves. 17 We name them, define them inline, and cite the LAAS framework as their source.

Authority at a tier requires a backtested escape rate at or below the tier tolerance — CT2 2%, CT3 0.5%, CT4 0% — measured over at least the sample-size floor for that tier, with the upper bound of the confidence interval (not the point estimate) at or below tolerance. Authority auto-contracts on a breach. Accuracy on a demonstration is not authority. Tier floor: CT2.

OBL-RES-001 — Backtested Residual (Escape Rate)

OBL-RES-001 is the traffic light written as a conformance predicate. It is the obligation that makes "backtested, not demoed" mechanical: an agent claiming authority at CT3 on the strength of a favorable run, with no backtest meeting the 598-sample floor at the half-percent tolerance, is non-conformant — in the exact sense in which a bank running a VaR model with no rolling backtest is, to a supervisor, running an unvalidated model in production. 17 The auto-contract clause is the Basel multiplier and the ETOPS approval-constraint in one: a breach does not require a human to relitigate the autonomy decision; the envelope contracts on the count.

The escape budget is measured against the cumulative blast radius across actions and across agents within a governance unit over a rolling window — not per action in isolation. Aggregation can move a unit from green to red even when every single action passed. Tier floor: CT2.

OBL-AGG-001 — Cumulative Blast-Radius Aggregation

OBL-AGG-001 is the defense against the laundering of a high-tier effect through a sequence of sub-threshold ones — the same structuring countermeasure the gate enforces by maintaining a windowed aggregate per principal, session, and effect class and re-tiering to the aggregate's tier once a cumulative threshold is crossed. 18 Aggregation can only ever raise the tier, never lower it. It is why the traffic light is run over a rolling window of counted exceptions rather than over single actions: the unit, not the action, is what earns or loses authority.

These two obligations lean on others they do not own, and the dependency is the spine that ties this article to the rest of the Banking Stack. An honest exception count requires an independent counter. OBL-IND-001 — Independent Verification holds that a CT3-or-above action must be verified by a challenger structurally separate from the actor — separate model lineage, separate context, separate owning function — that did not build the thing it grades. 17 A backtest run by the same model that produces the failures shares its blind spots and undercounts exactly the escapes that matter most. But structural separation on the org chart is necessary and not sufficient: a verifier whose errors correlate with the actor's — beyond the ceiling of ρ ≤ 0.2, the phi coefficient measured on a frozen, versioned, ≥1,000-case adversarial held-out set with the upper bound of the 95% CI at or below 0.2 — adds no real independence no matter how the boxes are drawn, because the two will be blind to the same cases at the same time. 19 That correlation ceiling is the formal content of independence, and it is what stops a "second model" that is merely a sibling of the first from masquerading as effective challenge. The verifier must also be qualified for the tier it gates — its own competence on the task class measured before it may grade — or an unqualified challenger waves through escapes while appearing to count them. 17
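The correlation ceiling is a computable check. The sketch below makes several assumptions that the source does not specify: error indicators are 0/1 per case on the same held-out set, the confidence interval uses the Fisher z-transform approximation with a two-sided 95% multiplier of 1.96, and the names `phi_coefficient` and `is_independent` are hypothetical.

```python
import math
from typing import Sequence

MIN_CASES = 1000   # held-out set size floor from the text
PHI_CEILING = 0.2  # correlation ceiling from the text


def phi_coefficient(actor_err: Sequence[int], verifier_err: Sequence[int]) -> float:
    """Phi coefficient between two 0/1 error indicators over the same cases."""
    pairs = list(zip(actor_err, verifier_err))
    n11 = sum(1 for a, v in pairs if a and v)          # both wrong
    n10 = sum(1 for a, v in pairs if a and not v)      # only the actor wrong
    n01 = sum(1 for a, v in pairs if not a and v)      # only the verifier wrong
    n00 = len(pairs) - n11 - n10 - n01                 # both right
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("phi is undefined when either side has no errors or only errors")
    return (n11 * n00 - n10 * n01) / denom


def is_independent(actor_err: Sequence[int], verifier_err: Sequence[int]) -> bool:
    """True only if the upper CI bound on phi is at or below the ceiling."""
    n = len(actor_err)
    if n != len(verifier_err) or n < MIN_CASES:
        return False
    phi = phi_coefficient(actor_err, verifier_err)
    phi = max(-0.999999, min(0.999999, phi))  # keep atanh finite
    # Assumption: Fisher z-transform approximation to the interval on phi.
    upper = math.tanh(math.atanh(phi) + 1.96 / math.sqrt(n - 3))
    return upper <= PHI_CEILING
```

As with the escape rate, the check reads the upper bound rather than the point estimate, so a verifier measured on too few cases cannot qualify on a lucky low correlation.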

This article's escape-rate measurement (A2 in the Banking Stack) is the backtest of the challenger established by the first article (A1): the traffic light is only honest if the entity counting the exceptions is structurally incapable of grading its own homework and demonstrably uncorrelated with the actor it grades. This is banking's effective challenge (credible, independent review with the authority to change or reject the model) rendered as a measured ceiling rather than an org-chart claim. It is why "you cannot self-grade" (A1) and "you cannot outsource governance" (A3, the third article) are the same refusal that "accuracy is not authority" depends on: a count produced by the actor is not a backtest, and a count produced by a captured or correlated reviewer is the same fiction wearing a second name. 20 The LTCM principals, after all, were not grading their own models in any naive sense. They were the most credentialed validators available; what they lacked was a counter whose view of the tail was independent of the framework that had defined the tail away.

Section 07

Discipline Is the Asset, Not the Tax

The reason any of this is worth the cost is that the discipline is not overhead; it is what unlocks the wider operation. Provable reliability is the asset, not the tax. The clearest demonstration is the one banking and aviation both make: the institutions that can measure a failure rate are the only ones that can safely widen authority, because they are the only ones who can defend the widening to an examiner.

The mechanism is the selective removal of the human checkpoint. An organization that cannot quantify its agent's escape rate has no principled way to remove any gate, so it keeps all of them, and the agent's nominal autonomy is fictional — every consequential step still routes through a human approval that buys safety the way carrying two spare engines does: expensive, mostly idle, and corrosive of the vigilance it depends on, because a reviewer who rubber-stamps a thousand correct actions stops reading carefully before the one that mattered. 10 A backtested escape rate is what lets gates be removed selectively and defensibly: keep the gate where the measured tail risk and the aggregated blast radius warrant it; drop it where the demonstrated rate over enough trials makes it dead weight. The dividend is not the absence of supervision. It is supervision allocated to exactly where the data says it is needed.

The wider autonomy budget is not a separate prize from the backtesting discipline. The budget is what the discipline buys. The teams that can count their agents' escapes over enough trials are the only ones who get to fly the direct route — fewer checkpoints, more tasks carried to completion — and they get to keep flying it only as long as the count holds.

The economic inversion

The economics of that checkpoint are worth dwelling on, because they are where the dividend is realized or lost. A human approval gate around every consequential agent action is the supervision analogue of carrying two spare engines, and its cost is not only the reviewer's time. It is also the latency the gate adds, the throughput it caps, and the attrition of attention that sets in when a reviewer approves a thousand correct actions and stops reading carefully before the one that mattered. 10 The count is what converts that flat, expensive, all-or-nothing supervision into a priced and graduated one. Where the measured escape rate is low and the aggregated blast radius is bounded, the gate is dead weight and can be dropped; where the tail risk or the cumulative exposure warrants it, the gate stays and is defensible to an examiner precisely because a number, not an intuition, placed it there. The contraction clause is the other half of the bargain: an operator can afford to drop a gate only because the envelope will auto-contract the moment the count turns. The saving is therefore never a permanent bet against the data; it is a bet that lasts exactly as long as the data supports it.
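The contraction clause can be illustrated with a rolling window that decides, on every read, whether the gate may stay dropped. This is a minimal sketch: the class name, the window size, and the use of a plain escape count in place of an upper-confidence-bound rule are all assumptions made for brevity.

```python
from collections import deque


class RevocableEnvelope:
    """Rolling-window envelope; names and thresholds are illustrative."""

    def __init__(self, window: int, max_escapes: int):
        # deque(maxlen=...) discards the oldest outcome automatically,
        # which is what makes the window "rolling".
        self._outcomes = deque(maxlen=window)
        self._max_escapes = max_escapes

    def record(self, escaped: bool) -> None:
        """Log one completed action; escaped=True means a wrong result got through."""
        self._outcomes.append(escaped)

    @property
    def gate_dropped(self) -> bool:
        """True only while a full window of evidence stays within the limit."""
        window_full = len(self._outcomes) == self._outcomes.maxlen
        return window_full and sum(self._outcomes) <= self._max_escapes
```

Because `gate_dropped` is recomputed from the window rather than stored, there is no separate revocation step to forget: the first escape that pushes the count over the limit restores the gate on the next read, and authority returns only after that escape has rolled out of the window.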

This is the same dividend banking realizes when a well-backtested model earns lighter capital treatment, and the same one aviation realizes when an airframe with a demonstrated shutdown rate earns the direct great-circle line across an empty ocean instead of a fuel-wasting dogleg to stay within reach of a runway. 9 In each case the discipline and the dividend are the same object viewed from two sides: the bank does not earn lighter capital and a validated model as separate prizes — the validation is the lighter capital; the airline does not earn a fuel saving and a wider diversion budget — the budget is the saving. The autonomy you can defend is the autonomy you can price; the autonomy you can price is the autonomy you can widen; and the only teams that get to widen it are the ones whose count holds.

Section 08

Honest Limits

The discipline deserves its limits stated as plainly as its claims. Backtesting does not bound failure relative to ground truth. It bounds undetected failure relative to a measured experience base and a specification, and that distinction is the whole honesty of the method. The traffic light counts exceptions against the model's stated threshold; it certifies that the realized rate is consistent with the claimed tolerance, not that the claimed tolerance is the right one for a regime the experience base never sampled. 1 LTCM's backtest of the body of its distribution was sound; the tail it had never observed was not in the sample, and no amount of counting over the represented cases would have surfaced it. 6

This is why the FRTB tail correction is necessary and still insufficient on its own. Expected Shortfall prices the tail that appears in the experience base; it cannot price a failure mode the base never sampled, only one the data has begun to show. 11 The LTCM tail was not under-weighted; it was un-sampled, and an Expected-Shortfall measure computed over the regimes the 1990s produced would have been as silent on the Russian default as the VaR it replaced. The honest claim is therefore narrow: the tail correction governs the observed tail well and the unobserved one not at all, and the only defense against the unobserved tail is to treat the experience base as perpetually incomplete and the abstention default as the response when the base cannot speak to the case at hand.

A second limit is statistical and specific to the agent setting. The binomial model under the escape-rate harness assumes independent trials, so a prompt-injection family or a silently-changed tool API that causes a burst of correlated escapes would make the confidence interval optimistic — the count would look green while a single root cause was failing many actions at once. 14 This is the LTCM correlation collapse and the 2008 clustering, transposed into software, and it is exactly why OBL-AGG-001's windowed aggregate and Christoffersen-style concentration monitoring are not optional refinements but load-bearing corrections to a count that the independence assumption would otherwise read too kindly.
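A Christoffersen-style check makes the independence assumption testable. The sketch below computes the first-order Markov independence likelihood-ratio statistic from reference 5 over an ordered sequence of escape flags; the function name and the 0/1 input encoding are assumptions, and a production monitor would run it per window alongside the count rather than in place of it.

```python
import math

CHI2_1DF_95 = 3.841  # 95% critical value of chi-square with 1 degree of freedom


def _xlogy(x: float, y: float) -> float:
    """x * log(y), with the convention that 0 * log(0) is 0."""
    return 0.0 if x == 0 else x * math.log(y)


def independence_lr(escapes) -> float:
    """Likelihood-ratio statistic for 'escapes do not cluster'.

    escapes is a list of 0/1 flags in trial order. Values above
    CHI2_1DF_95 reject independence at the 5% level.
    """
    # Count transitions between consecutive trials: n[prev][cur].
    n = [[0, 0], [0, 0]]
    for prev, cur in zip(escapes, escapes[1:]):
        n[int(prev)][int(cur)] += 1
    n00, n01, n10, n11 = n[0][0], n[0][1], n[1][0], n[1][1]

    total = n00 + n01 + n10 + n11
    hits = n01 + n11
    if hits == 0 or hits == total:
        return 0.0  # no variation, so nothing to test

    pi = hits / total                               # unconditional escape rate
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0  # escape after a clean trial
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0  # escape after an escape

    log_null = _xlogy(n00 + n10, 1 - pi) + _xlogy(hits, pi)
    log_alt = (_xlogy(n00, 1 - pi01) + _xlogy(n01, pi01)
               + _xlogy(n10, 1 - pi11) + _xlogy(n11, pi11))
    return -2.0 * (log_null - log_alt)
```

Ten escapes in 200 trials give the same count whether they are spread out or arrive as one burst, but only the burst pushes the statistic past the critical value. That is the signature of a single root cause failing many actions at once.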

A third limit is the integrity of the labels. The backtest is only as honest as its ground truth: labels must be produced independently of the verifier under test, or the harness measures agreement between two correlated parties rather than escape against reality, the same way a VaR backtest scored against the model's own predicted losses would certify nothing. 17

The scope statement that survives all three limits is this. Conformance asserts that the right checks ran, by an independent and uncorrelated party, over enough trials, with evidence; never that no error can occur. 20 A traffic-light regime bounds the rate of undetected failure relative to a measured base and a specified tolerance, scaled to consequence, aggregated across the governance unit, and contracted automatically on breach. It does not make the agent infallible. It makes the agent's authority a measured, accountable, revocable quantity: a count on a ledger that can be earned, audited, and, when the data turns, taken back. That is what banking spent decades learning to do to a risk model, and it is what AI autonomy now has to learn to do to an agent.

Section 09

The Engineering Posture

The posture an operator must adopt before letting an agent act follows directly, and it is the discipline of this paper in one line. Do not grant authority on a demo, however impressive; grant it on a count. Price autonomy in measured failure data over enough trials to exclude luck; aggregate the blast radius across every action and every agent in the governance unit; and contract the envelope automatically on breach. The agent's traffic light is not a report you read after the fact — it is the gate that holds the authority, recomputed on a rolling window, green only while the data says green, and red the moment it does not.
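"Enough trials to exclude luck" can be made concrete with an upper confidence bound on the escape rate. The sketch below assumes a one-sided 95% Clopper-Pearson bound and a three-state result; the function names, the confidence level, and the omission of a yellow zone are assumptions rather than the standard's wording. Under that assumption the zero-escape bound first reaches 0.02 at 149 trials and 0.005 at 598, which agrees with the floors cited in reference 14.

```python
import math


def _binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p); each term is built in log space."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return total


def escape_upper_bound(escapes: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper bound on the true escape rate."""
    if trials <= 0 or not 0 <= escapes <= trials:
        raise ValueError("need trials > 0 and 0 <= escapes <= trials")
    if escapes == trials:
        return 1.0
    alpha = 1.0 - confidence
    if escapes == 0:
        return 1.0 - alpha ** (1.0 / trials)  # closed form for zero escapes
    # Bisect for the rate p at which seeing this few escapes has probability alpha.
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if _binom_cdf(escapes, trials, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def traffic_light(escapes: int, trials: int, tolerance: float, min_trials: int) -> str:
    """Classify a window as 'indeterminate', 'green', or 'red'."""
    if trials < min_trials:
        return "indeterminate"  # below the floor, no claim is demonstrable
    bound = escape_upper_bound(escapes, trials)
    return "green" if bound <= tolerance else "red"
```

Reading the bound rather than the point estimate is what separates a count from a demo: zero escapes in 100 trials is a perfect record and still indeterminate, while a single escape in 149 trials turns the light red at a 0.02 tolerance.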

The brief companion to this paper — Backtested, Not Demoed — introduces the core argument in a shorter form. Related papers: Effective Challenge establishes that the entity counting the failures must be structurally separate from the one producing them; and The Obligation to Govern establishes that the count belongs to the deployer regardless of who built the model.


References
  1. Federal Reserve / OCC, SR 26-2, Interagency Guidance on Model Risk Management (April 2026), superseding SR 11-7 (2011): effective challenge, validation lifecycle (conceptual soundness, ongoing monitoring, outcomes analysis by backtesting), risk-tiering by materiality, and explicit vendor coverage.
  2. Basel Committee on Banking Supervision, definition of 99% one-day Value-at-Risk as a loss threshold not to be exceeded on more than one percent of days.
  3. Basel Committee on Banking Supervision, Supervisory Framework for the Use of "Backtesting" in Conjunction with the Internal Models Approach to Market Risk Capital Requirements (1996): rolling 250-trading-day exception count; green zone 0–4, yellow zone 5–9 (capital multiplier rises), red zone 10+ (model presumed broken).
  4. Paul H. Kupiec, "Techniques for Verifying the Accuracy of Risk Measurement Models," Journal of Derivatives 3, no. 2 (1995): the proportion-of-failures unconditional-coverage likelihood-ratio test.
  5. Peter F. Christoffersen, "Evaluating Interval Forecasts," International Economic Review 39, no. 4 (1998): the conditional-coverage test for non-clustering of exceptions.
  6. Roger Lowenstein, When Genius Failed: The Rise and Fall of Long-Term Capital Management (2000): the 1998 Russian-default tail event, the roughly US$4.6 billion loss, and the Federal Reserve–organized recapitalization.
  7. Bank for International Settlements / industry post-mortems on 2007–08 VaR exceedances: daily-VaR breaches far in excess of the implied ~1%-of-days (≈2.5 exceptions per year), with exceptions clustering at regime shift.
  8. David Viniar, CFO of Goldman Sachs, on the August 2007 market turmoil: "We were seeing things that were 25-standard-deviation moves, several days in a row" (Financial Times, August 2007).
  9. FAA Advisory Circular 120-42 and 14 CFR (ETOPS): the in-flight-shutdown-rate ceiling tightening with diversion minutes (on the order of 0.05 / 0.02 / 0.01 per 1,000 engine-hours) and continuing world-fleet reliability monitoring; authority earned by accumulated experience and contracted on drift.
  10. KellerAI ETOPS-for-AI stack, Reliability You Can Bank / Priced in Failure-Rate Data and Earned Range (reliability-you-can-bank-in-depth, earned-range): the four-step reliability-accounting move, the earned-and-revocable autonomy budget, and the economics of the removable human checkpoint.
  11. Basel Committee on Banking Supervision, Fundamental Review of the Trading Book (FRTB): replacement of 99% Value-at-Risk with Expected Shortfall at the 97.5th percentile as the primary internal-models risk measure (the average of losses beyond the threshold, measuring the tail directly).
  12. KellerAI, The LLM-Agent Assurance Standard in-depth (llm-agent-assurance-standard-in-depth), Section 6: the CT0–CT4 lattice and gate-derived tiering from the observed effect surface; the (agent, task-class) governance unit.
  13. KellerAI LAAS in-depth, Section 3 and Section 8: the escape rate as the governing metric (the rate at which a wrong result survives every applicable check and is acted upon), reported as integrity = 1 − escape_rate.
  14. KellerAI LAAS in-depth, Section 8: Bucket-B tolerances (CT2 0.02, CT3 0.005, CT4 0), the upper-confidence-bound conformance rule, and the rule-of-three sample-size floors (149 at CT2, 598 at CT3) below which the result is indeterminate.
  15. KellerAI LAAS in-depth, Section 6: the windowed aggregate per principal, session, and effect class, re-tiering sub-threshold actions to the aggregate's tier so a sequence cannot launder a high-tier effect; aggregation can only raise the tier.
  16. KellerAI Earned Range (earned-range): authority granted per tier on measured evidence and auto-contracting (revoking) on a backtest breach: the revocable envelope.
  17. KellerAI LAAS in-depth, Section 11 (and Sections 8–9): the cross-cutting obligation framework (no-self-grading invariant, the backtested-escape-rate measurement, independent ground-truth labeling, and deterministic precedence), cited as the framework source for OBL-RES-001 and OBL-AGG-001.
  18. KellerAI LAAS in-depth, Section 6: the structuring countermeasure (cumulative-effect re-tiering across a rolling window) underpinning OBL-AGG-001.
  19. KellerAI LAAS in-depth, Section 9: verifier independence and the error-correlation ceiling ρ ≤ 0.2 (phi coefficient, 95%-CI upper bound) measured on a frozen, versioned, ≥1,000-case adversarial held-out set.
  20. KellerAI Banking Stack, Effective Challenge (effective-challenge-in-depth): OBL-IND-001 (Independent Verification) and the principle that an honest exception count requires a structurally independent counter; "you cannot self-grade" and "you cannot outsource governance" as the same refusal authority depends on.