Backtesting: The Discipline That Won't Be Taken on Faith
A market-risk model states a Value-at-Risk: a loss threshold that, the model claims, will not be exceeded on more than one percent of days — a 99% one-day VaR. 2 That claim is a precise probabilistic assertion, and supervisors do not accept it on the strength of the model's design. They count. Under the Basel Committee's 1996 backtesting framework, the bank counts, over a rolling 250-trading-day window, how many days the realized loss exceeded the stated threshold. 3 If the model is well-calibrated, roughly two and a half exceptions per year are expected out of 250 trading days. The framework then sorts the observed count into a traffic light.
Zero to four exceptions is the green zone and the model stands. 3 Five to nine is the yellow zone and the regulatory capital multiplier rises — the model is allowed to keep operating, but its use gets more expensive in direct proportion to how far the evidence has drifted. Ten or more is the red zone and the model is presumed broken until recalibrated. 3 The authority to use an internal model to set your own regulatory capital is not granted once and filed away; it is recomputed on every rolling window, and it is revoked by the count itself. The bound is a measured commitment a bank gets to keep, not a claim a bank gets to make.
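The zone boundaries above can be sketched as a small classifier. This is an illustration under stated assumptions, not a supervisory implementation: the function names are hypothetical, and the base multiplier of 3 and the yellow-zone plus factors come from the Basel Committee's 1996 backtesting table rather than from the text above.

```python
from typing import Sequence, Tuple

WINDOW = 250           # trading days in the rolling window
BASE_MULTIPLIER = 3.0  # assumption: base multiplier from the 1996 framework
# Assumption: yellow-zone plus factors from the 1996 backtesting table.
YELLOW_PLUS = {5: 0.40, 6: 0.50, 7: 0.65, 8: 0.75, 9: 0.85}


def count_exceptions(losses: Sequence[float], var: Sequence[float]) -> int:
    """Count days in the most recent window where the realized loss exceeded VaR.

    Both inputs are positive loss amounts, aligned day by day.
    """
    recent = list(zip(losses, var))[-WINDOW:]
    return sum(1 for loss, threshold in recent if loss > threshold)


def traffic_light(exceptions: int) -> Tuple[str, float]:
    """Map an exception count to (zone, capital multiplier)."""
    if exceptions < 0:
        raise ValueError("exception count cannot be negative")
    if exceptions <= 4:
        return "green", BASE_MULTIPLIER
    if exceptions <= 9:
        return "yellow", BASE_MULTIPLIER + YELLOW_PLUS[exceptions]
    return "red", BASE_MULTIPLIER + 1.0
```

Nothing in the mapping takes a discretionary input: the zone and the multiplier are pure functions of the count, which is the property the rest of the argument relies on.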
The statistical machinery underneath the traffic light is worth naming, because it is the same hypothesis-testing logic AI assurance has to adopt. Kupiec's 1995 proportion-of-failures test is a likelihood-ratio check of whether the empirical exception rate is consistent with the model's stated tolerance — an unconditional-coverage test. 4 Christoffersen's 1998 conditional-coverage test adds the requirement that exceptions not cluster in time, because real losses bunch up when market regimes shift, and a model that passes the count test while failing the clustering test is miscalibrated precisely when miscalibration is most dangerous. 5 Together they formalize a single demand: it is not enough for the average rate to look acceptable; the rate must be stable, and its failures must not concentrate.
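Both tests reduce to likelihood ratios over a 0/1 series of daily exceptions. The sketch below is a minimal standard-library version, assuming a first-order Markov alternative for the independence test and the usual 95% chi-square critical values. The function names are hypothetical, and summing the two statistics into a conditional-coverage figure is the common textbook shortcut.

```python
import math
from typing import Sequence

CHI2_95_1DF = 3.841  # 95% critical value, chi-square with 1 degree of freedom
CHI2_95_2DF = 5.991  # 95% critical value, chi-square with 2 degrees of freedom


def _xlogy(count: int, prob: float) -> float:
    """count * ln(prob), with the convention 0 * ln(0) = 0."""
    return 0.0 if count == 0 else count * math.log(prob)


def kupiec_lr(hits: Sequence[int], p: float) -> float:
    """Unconditional-coverage (proportion-of-failures) likelihood ratio."""
    t, x = len(hits), sum(hits)
    if t == 0 or not 0.0 < p < 1.0:
        raise ValueError("need a non-empty series and 0 < p < 1")
    null = _xlogy(t - x, 1.0 - p) + _xlogy(x, p)
    alt = _xlogy(t - x, 1.0 - x / t) + _xlogy(x, x / t)
    return -2.0 * (null - alt)


def christoffersen_ind_lr(hits: Sequence[int]) -> float:
    """Independence likelihood ratio from first-order transition counts."""
    if len(hits) < 2:
        raise ValueError("need at least two observations")
    n = [[0, 0], [0, 0]]  # n[i][j]: days in state i followed by state j
    for prev, cur in zip(hits, hits[1:]):
        n[prev][cur] += 1
    n00, n01, n10, n11 = n[0][0], n[0][1], n[1][0], n[1][1]
    pi01 = n01 / (n00 + n01) if (n00 + n01) else 0.0
    pi11 = n11 / (n10 + n11) if (n10 + n11) else 0.0
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)
    pooled = _xlogy(n00 + n10, 1.0 - pi) + _xlogy(n01 + n11, pi)
    markov = (_xlogy(n00, 1.0 - pi01) + _xlogy(n01, pi01)
              + _xlogy(n10, 1.0 - pi11) + _xlogy(n11, pi11))
    return -2.0 * (pooled - markov)


def conditional_coverage_lr(hits: Sequence[int], p: float) -> float:
    """Joint statistic: compare against CHI2_95_2DF."""
    return kupiec_lr(hits, p) + christoffersen_ind_lr(hits)
```

Three isolated exceptions in a 250-day window stay under the critical value on both tests. The same three exceptions on consecutive days still pass the count test but fail the independence test, which is the miscalibration the count alone does not see.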
A model that passed last year is not exempt this year. The exception count is recomputed on a rolling window, and a model that drifts into yellow or red loses the treatment it had earned. Authority priced in failure data is always provisional.
Two features of the regime are easy to miss and load-bearing for everything that follows. The first is that the count is taken over enough trials to mean something — a 250-day window, not a good week. The second is that the response to a breach is automatic and graduated: yellow does not require a committee to decide the model is suspect; the multiplier simply rises. 3 The system does not wait for human judgment to notice a problem the count has already proven.
The traffic light does not float free of a governance framework; it sits inside one, and the framework is what makes the count consequential. In the United States that framework is SR 26-2 — the interagency model-risk standard that in April 2026 superseded SR 11-7 after fifteen years as the canon — built on effective challenge, credible independent review with the authority to change the model, implemented through a validation lifecycle of conceptual soundness, ongoing monitoring, and outcomes analysis by backtesting, with risk-tiering by materiality and explicit vendor coverage. 1 Backtesting is the outcomes-analysis leg of that lifecycle: the traffic light is how ongoing monitoring is operationalized, and the capital multiplier is how a yellow finding is priced rather than merely noted. A model that posts five exceptions does not generate a memo for a future review cycle; it changes the bank's capital requirement on the next reporting date. The cost is the enforcement.
There is a specific reason supervisors fixed the conformance rule to a count against a threshold rather than to a regulator's after-the-fact judgment of whether the losses "felt" excessive. A count is adversarially robust in a way a judgment is not. A bank cannot argue its way out of ten exceptions, and a supervisor cannot be lobbied into forgiving them; the window closes, the number is what it is, and the zone follows mechanically. 3 The same property is what makes the regime portable to software: an escape count over a fixed window is a fact a vendor cannot reframe and a deployer cannot wish away, where "the agent seemed reliable in the pilot" is exactly the reframeable judgment the count is designed to replace.
The Named Losses: Where the Demo Diverged From the Tail
The discipline of counting exceptions over enough trials was not invented in the abstract. It was priced in retrospect by losses whose in-sample records were immaculate the morning before they failed.
Long-Term Capital Management is the canonical case, and its failure is worth tracing in sequence rather than in summary, because the sequence is the lesson. 6 The fund was run by principals including two Nobel laureates in economics, and its models carried a pristine in-sample track record — years of returns that validated the risk framework on every regime the data had seen. The first thing that failed was not a model output but a category error about what the track record certified. The returns were a draw from the regimes the 1990s had actually produced; the models inferred from that draw a distribution, and from that distribution a tail, that no observation had ever tested. The in-sample record was, in the vocabulary of this paper, a demo of extraordinary length — and a long demo is still a demo, because length in a favorable regime does not sample the unfavorable one.
The second thing that failed was the independence assumption. LTCM's positions were diversified on paper across instruments, geographies, and strategies whose historical correlations were low, and the risk framework priced them as largely independent bets. Then, in August 1998, Russia defaulted on its domestic debt and devalued the ruble, and a flight to liquidity swept every risky market at once. The correlations the model had measured at perhaps 0.1 in calm regimes went to nearly 1.0 in the crisis: positions that were supposed to offset each other moved together, and the diversification that backstopped the leverage evaporated in days. The tail event the in-sample record had never sampled blew through the fund's risk thresholds simultaneously and across positions the models had treated as separate. The losses ran to roughly US$4.6 billion, and in late September the Federal Reserve organized a recapitalization by a consortium of major banks to prevent the disorderly unwinding from cascading through the counterparties that had lent against those positions. 6 The track record had backtested the body of the distribution and never the tail; the demo was excellent and the backtest of the rare event did not exist.
The agent translation of this sequence is exact, and it is why two of this paper's obligations are necessary rather than redundant. The category error — mistaking a long favorable run for a measured rate — is what OBL-RES-001 forbids by demanding a count over a sample-size floor against a tier tolerance. The independence collapse — failures that were supposed to be uncorrelated arriving all at once — is what OBL-AGG-001 addresses by measuring the cumulative blast radius across actions rather than action by action, on the premise that a single root cause can fail many things together exactly when the per-action statistics look benign.
The 2008 crisis reinforced the same lesson at fleet scale, and it reinforced specifically the part of the regime that the count alone does not catch. Across the industry, banks breached their daily VaR far more often than the roughly one-percent-of-days rate (about two and a half exceptions a year) that a 99% model implied; exceptions clustered exactly as Christoffersen's conditional-coverage test warns they will when a regime shifts. 7 The clustering is the tell. A model can run for years inside its green zone, accumulating an exception count that an unconditional test reads as perfectly acceptable, and then post its entire year's budget of exceptions in a single fortnight, which is precisely the pattern of a system whose failures are correlated by a common cause the calm-regime data never exposed. Goldman Sachs's then–chief financial officer captured the disconnect in a now-famous remark, made in August 2007 as the turmoil began, that the firm was seeing "25-standard-deviation moves, several days in a row." 8 A 25-sigma daily move is, under the model's own assumptions, an event that should not occur in many times the age of the universe; observing several in a row is not a run of bad luck but proof that the model's tail was fiction. The demo (the in-sample calibration) had certified a distribution whose tail the market then refused to honor.
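The "many times the age of the universe" arithmetic can be checked in a few lines. The sketch assumes normally distributed daily returns, which is what a quoted sigma count implicitly invokes, and an age of the universe of roughly 13.8 billion years. Both are assumptions of this illustration, and the function names are hypothetical.

```python
import math

AGE_OF_UNIVERSE_DAYS = 13.8e9 * 365.25  # assumption: about 13.8 billion years


def upper_tail_probability(sigmas: float) -> float:
    """P(Z > sigmas) for a standard normal variable."""
    return 0.5 * math.erfc(sigmas / math.sqrt(2.0))


def universe_ages_between_events(sigmas: float) -> float:
    """Expected waiting time for one such daily move, in ages of the universe."""
    expected_wait_days = 1.0 / upper_tail_probability(sigmas)
    return expected_wait_days / AGE_OF_UNIVERSE_DAYS
```

For a 25-sigma move the tail probability is on the order of 1e-138 per day, so the expected wait exceeds the age of the universe by more than a hundred orders of magnitude. Seeing several such days in a row says the Gaussian tail was wrong, not that the days were unlucky.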
The clustering finding is what makes Christoffersen's test, not merely Kupiec's, load-bearing for the agent regime. An escape-rate harness that checks only the average rate over a window is the unconditional test; it will pass a model whose escapes are bunching under a single emerging root cause — a prompt-injection family, a tool API that silently changed, a distribution shift in the inputs — right up until the burst arrives. The honest agent traffic light has to watch for concentration as well as count, because the most dangerous failure mode is the one that is invisible to the average and obvious only to the clustering test, and it is invisible to the average for exactly as long as it takes to become catastrophic.
A pristine in-sample record is not a backtest of the tail. LTCM and the 2008 exceedances were models that priced the average beautifully and the rare event not at all. Only counted exceptions, over enough trials, backtest the regime you have not yet seen.
The structure recurs whenever a system commits on the strength of a record drawn from a favorable regime, with no count of how often the regime has actually held. It is the exact failure shape an autonomous agent inherits when its autonomy is granted on a demo rather than on a measured, stable, low rate of undetected failure.
Mapping to AI Agent Autonomy
Now the move can be applied to its target, in the concrete units the LLM-Agent Assurance corpus already defines. The governance unit is the (agent, task-class) pair, gated by a Consequence Tier from CT0 to CT4 derived from the action's observed effect surface (its actual reversibility, scope, and consequence), never from the agent's self-report. 12 CT0 is read-only and merely traced. CT1 is a reversible internal write with a self-check. CT2 is a reversible or low-value external effect. CT3 is a hard-to-reverse or material external effect. CT4 is irreversible or high-consequence. 12 The traffic light maps onto this tier ladder directly, and the quantity it counts is the escape rate: the rate at which a wrong result survives every applicable check and is acted upon. Where a higher-is-better figure is wanted, the same quantity is reported as integrity = 1 − escape_rate. 13
The escape-rate tolerance is the VaR exception budget, scaled to consequence exactly as the capital multiplier is. The CT2 tolerance is 0.02 — two percent. The CT3 tolerance is 0.005 — half a percent. The CT4 tolerance is zero. 14 The zero is not a rhetorical flourish; it is the design's sharpest edge. A binomial upper bound over any finite sample is strictly greater than zero, so a backtest can never pass a zero tolerance — which is correct, because CT4 is exactly the human-gated, abstention-default tier, and an irreversible high-consequence action is not something a count of past successes was ever entitled to authorize. 14 The traffic light's red zone, at CT4, is the entire zone.
The "enough trials to exclude luck" clause becomes a sample-size floor, and it is the part most often skipped in practice. A tolerance is only demonstrable if the achievable upper confidence bound at zero observed escapes can fall at or below it. At 95% confidence the exact zero-escape bound is 1 − 0.05^(1/n), which requires at least 149 samples at CT2 and at least 598 at CT3; the familiar rule of three, 3/n, approximates the same bound and gives the slightly more conservative round figures of 150 and 600. 14 Below the floor, the result is not a pass; it is indeterminate for insufficient sample size. A demo of twenty clean runs is not a backtest of a half-percent tolerance any more than a good week is a backtest of a 99% VaR: the sample cannot, even in principle, carry the claim. And the conformance rule is deliberately conservative in the same way Basel's is: an action passes only if the upper bound of the confidence interval is at or below tolerance, never the point estimate, because validating the point estimate would ignore sampling error and wave through underpowered sets. 14
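The 149 and 598 floors can be reproduced from the exact one-sided binomial bound at zero observed escapes, 1 − α^(1/n), of which the rule of three (3/n) is the usual approximation. A minimal sketch, with hypothetical function names and 95% confidence assumed as the default:

```python
import math


def zero_escape_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on the escape rate after n clean trials."""
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


def sample_size_floor(tolerance: float, confidence: float = 0.95) -> int:
    """Smallest n at which zero escapes can demonstrate the tolerance."""
    if not 0.0 < tolerance < 1.0:
        # A zero tolerance (CT4) has no finite floor: the bound never reaches 0.
        raise ValueError("tolerance must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(1.0 - tolerance))
```

`sample_size_floor(0.02)` and `sample_size_floor(0.005)` return 149 and 598. The 3/n approximation would ask for 150 and 600, and twenty clean runs leave an upper bound near 14%, nowhere close to either tolerance.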
A worked illustration makes the floor's bite concrete. Suppose an agent is to be granted CT3 authority (a hard-to-reverse external effect, say committing a refund or filing a downstream record) on a task class where it has run 300 evaluation cases with zero observed escapes. The point estimate of the escape rate is 0%, which looks like a flawless backtest. But the half-percent CT3 tolerance demands that the upper 95% confidence bound sit at or below 0.005, and at zero escapes that bound is 1 − 0.05^(1/n), which the rule of three approximates as 3/n: with n = 300, the bound is about 0.010, twice the tolerance. The 300-case run is not a CT3 pass; it is indeterminate, because the sample is too small to exclude a true escape rate that would breach the tier even though not one escape was seen. Only at n ≥ 598 does the exact zero-escape upper bound fall to 0.005 and the claim become demonstrable. Now suppose that over 600 cases the agent posts three escapes rather than zero: the upper bound climbs to roughly 0.013, well above 0.005, and the backtest fails. An agent seeking CT3 never receives it, and any standing CT3 grant auto-contracts on the breach. The point estimate of 0.5% would have read as exactly on tolerance; the conformance rule, reading the upper bound, correctly refuses it. This is the same discipline that stops a bank from claiming a 99% VaR model is validated on a quiet quarter with one exception: the count must be powered to exclude luck before its silence means anything.
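The three cases in the illustration can be run through a small conformance predicate. This is a sketch only: it assumes a one-sided Clopper-Pearson upper bound at 95% confidence, solved by bisection so that no statistics library is needed, and the names `TIERS`, `upper_bound`, and `verdict` are hypothetical rather than taken from the standard.

```python
import math

# (tolerance, sample-size floor) per tier, as stated in the text.
TIERS = {"CT2": (0.02, 149), "CT3": (0.005, 598)}


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p); intended for small k."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def upper_bound(escapes: int, n: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper bound on the escape rate."""
    if escapes >= n:
        return 1.0
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binom_cdf(escapes, n, mid) > alpha:
            lo = mid  # CDF still above alpha, so the bound lies higher
        else:
            hi = mid
    return hi  # return the conservative end of the bracket


def verdict(tier: str, escapes: int, n: int) -> str:
    """Return 'pass', 'fail', or 'indeterminate' for a backtest at a tier."""
    tolerance, floor = TIERS[tier]
    if n < floor:
        return "indeterminate"  # below the floor the sample cannot carry the claim
    return "pass" if upper_bound(escapes, n) <= tolerance else "fail"
```

Under these assumptions `verdict("CT3", 0, 300)` is indeterminate, `verdict("CT3", 0, 598)` passes, and `verdict("CT3", 3, 600)` fails even though its point estimate sits exactly on the tolerance.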
Measure the escape rate over at least the sample-size floor — 149 at CT2, 598 at CT3 — against the tier tolerance. Green stands. A breach contracts the tier automatically. No committee, no argument; the count revokes the authority, exactly as the Basel multiplier rises on its own.
Two further units complete the regime. The first is cumulative blast-radius aggregation: the escape budget is not spent one action at a time. It is measured against the cumulative blast radius across actions and across agents within a governance unit over a rolling window — the same windowed aggregate that, in the gate, re-tiers a sequence of sub-threshold transfers so they cannot launder a high-tier effect. 15 Aggregation can move a unit from green to red even when every single action passed, just as a string of individually-tolerable VaR breaches sums into a red-zone count. This is the direct countermeasure to the LTCM independence collapse: a thousand small reversible writes that each pass CT2 in isolation can, in aggregate over the window, constitute a hard-to-reverse material change of state — and the aggregate, not the action, is what the tier must be derived from. The blast radius is the consequence scale of an action; the cumulative blast radius is that scale summed across the governance unit, and it is the quantity the escape budget is actually spent against.
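A windowed aggregate of this kind can be sketched as follows. Everything specific here is an assumption made for illustration: the class name, the scalar "magnitude" attached to each action, and the two cumulative thresholds. The only behavior taken from the text is that actions accumulate over a rolling window and that the aggregate can raise the effective tier.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple

TIER_ORDER = ["CT0", "CT1", "CT2", "CT3", "CT4"]

# Hypothetical cumulative thresholds, in arbitrary magnitude units, at which
# the windowed aggregate is treated as a higher tier.
AGGREGATE_THRESHOLDS = [(10_000.0, "CT3"), (100_000.0, "CT4")]


@dataclass
class WindowedAggregate:
    window_seconds: float
    events: Deque[Tuple[float, float]] = field(default_factory=deque)  # (timestamp, magnitude)
    total: float = 0.0

    def effective_tier(self, now: float, magnitude: float, action_tier: str) -> str:
        """Record an action and return the tier it must be gated at."""
        # Evict events that have aged out of the rolling window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old = self.events.popleft()
            self.total -= old
        self.events.append((now, magnitude))
        self.total += magnitude
        tier = action_tier
        for threshold, aggregate_tier in AGGREGATE_THRESHOLDS:
            if self.total >= threshold:
                # The aggregate can raise the tier but never lower it.
                tier = max(tier, aggregate_tier, key=TIER_ORDER.index)
        return tier
```

With these illustrative thresholds, nineteen CT2 actions of magnitude 500 inside one window stay at CT2 and the twentieth is gated at CT3, although no single action changed in kind.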
The second unit is the revocable envelope, and it brings the rollback horizon with it. Authority is granted per tier on measured evidence and auto-contracts on a backtest breach. 16 The rollback horizon — the bounded time and work within which a committed action can be undone — is what separates the tiers in the first place: a CT2 effect is reversible within a short, rehearsed horizon, a CT3 effect is hard to reverse, and a CT4 effect is irreversible, its rollback horizon effectively infinite. Authority and rollback horizon move together: as the horizon lengthens toward irreversibility, the tolerance tightens toward zero and the abstention default engages. The diversion budget an airframe earns is contracted on a rising fleet rate; the capital treatment a model earns is contracted on a rising exception count; the autonomy an agent earns is contracted on a rising escape rate. The envelope is earned, and it is revocable, and the contraction is automatic — it does not wait for an operator to reauthorize what the count has already revoked.
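The earn-and-contract behavior can be pictured as a small state holder. The sketch below is one possible reading, not the standard's definition: the `Envelope` class, the one-tier-at-a-time widening, and the rule that a failed backtest drops authority to just below the breached tier are all assumptions of this illustration.

```python
from dataclasses import dataclass

TIER_ORDER = ["CT0", "CT1", "CT2", "CT3", "CT4"]


@dataclass
class Envelope:
    granted_tier: str  # highest tier the agent may act at without a human gate

    def apply_backtest(self, tier: str, verdict: str) -> str:
        """Update authority from a backtest verdict ('pass', 'fail', 'indeterminate')."""
        idx = TIER_ORDER.index(tier)
        current = TIER_ORDER.index(self.granted_tier)
        if verdict == "fail" and idx <= current:
            # Breach inside the envelope: contract to just below the breached tier.
            self.granted_tier = TIER_ORDER[max(idx - 1, 0)]
        elif verdict == "pass" and idx == current + 1 and tier != "CT4":
            # Widen one tier at a time; CT4 stays human-gated by design.
            self.granted_tier = tier
        # 'indeterminate' changes nothing: it neither earns nor revokes authority.
        return self.granted_tier
```

The contraction branch takes no operator input. That is the point of the design: the same verdict that would have earned the tier is the only thing that can remove it.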
The LAAS Obligations This Regime Defines
The discipline above is enforced by two new obligations, which we introduce here with stable identifiers. The LLM-Agent Assurance Standard's in-depth treatment defines the obligation framework — the no-self-grading invariant, the escape-rate metric, the backtest harness, the deterministic precedence ordering — but never enumerates stable identifiers for the obligations themselves. 17 We name them, define them inline, and cite the LAAS framework as their source.
OBL-RES-001. Authority at a tier requires a backtested escape rate at or below the tier tolerance (CT2 2%, CT3 0.5%, CT4 0%), measured over at least the sample-size floor for that tier, with the upper bound of the confidence interval (not the point estimate) at or below tolerance. Authority auto-contracts on a breach. Accuracy on a demonstration is not authority. Tier floor: CT2.
OBL-RES-001 is the traffic light written as a conformance predicate. It is the obligation that makes "backtested, not demoed" mechanical: an agent claiming authority at CT3 on the strength of a favorable run, with no backtest meeting the 598-sample floor at the half-percent tolerance, is non-conformant — in the exact sense in which a bank running a VaR model with no rolling backtest is, to a supervisor, running an unvalidated model in production. 17 The auto-contract clause is the Basel multiplier and the ETOPS approval-constraint in one: a breach does not require a human to relitigate the autonomy decision; the envelope contracts on the count.
OBL-AGG-001. The escape budget is measured against the cumulative blast radius across actions and across agents within a governance unit over a rolling window, not per action in isolation. Aggregation can move a unit from green to red even when every single action passed. Tier floor: CT2.
OBL-AGG-001 is the defense against the laundering of a high-tier effect through a sequence of sub-threshold ones — the same structuring countermeasure the gate enforces by maintaining a windowed aggregate per principal, session, and effect class and re-tiering to the aggregate's tier once a cumulative threshold is crossed. 18 Aggregation can only ever raise the tier, never lower it. It is why the traffic light is run over a rolling window of counted exceptions rather than over single actions: the unit, not the action, is what earns or loses authority.
These two obligations lean on others they do not own, and the dependency is the spine that ties this article to the rest of the Banking Stack. An honest exception count requires an independent counter. OBL-IND-001 — Independent Verification holds that a CT3-or-above action must be verified by a challenger structurally separate from the actor — separate model lineage, separate context, separate owning function — that did not build the thing it grades. 17 A backtest run by the same model that produces the failures shares its blind spots and undercounts exactly the escapes that matter most. But structural separation on the org chart is necessary and not sufficient: a verifier whose errors correlate with the actor's — beyond the ceiling of ρ ≤ 0.2, the phi coefficient measured on a frozen, versioned, ≥1,000-case adversarial held-out set with the upper bound of the 95% CI at or below 0.2 — adds no real independence no matter how the boxes are drawn, because the two will be blind to the same cases at the same time. 19 That correlation ceiling is the formal content of independence, and it is what stops a "second model" that is merely a sibling of the first from masquerading as effective challenge. The verifier must also be qualified for the tier it gates — its own competence on the task class measured before it may grade — or an unqualified challenger waves through escapes while appearing to count them. 17
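The correlation ceiling can be checked mechanically once actor and verifier errors are recorded case by case on the held-out set. The sketch below computes the phi coefficient from the 2x2 error table. The interval method is an assumption: the text fixes the ceiling and the 95% level but not how the interval is built, so a Fisher z-transform approximation is used here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def phi_coefficient(actor_err: Sequence[bool], verifier_err: Sequence[bool]) -> float:
    """Phi coefficient between two aligned per-case error indicators."""
    n11 = n10 = n01 = n00 = 0
    for a, v in zip(actor_err, verifier_err):
        if a and v:
            n11 += 1
        elif a:
            n10 += 1
        elif v:
            n01 += 1
        else:
            n00 += 1
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("phi is undefined when a margin of the table is empty")
    return (n11 * n00 - n10 * n01) / denom


def phi_upper_bound(phi: float, n: int, z: float = 1.96) -> float:
    """Upper end of an approximate 95% CI (Fisher z-transform, an assumption)."""
    phi = max(min(phi, 1.0 - 1e-12), -1.0 + 1e-12)  # keep atanh finite
    return math.tanh(math.atanh(phi) + z / math.sqrt(n - 3))


def is_independent(actor_err: Sequence[bool], verifier_err: Sequence[bool],
                   ceiling: float = 0.2, min_cases: int = 1000) -> bool:
    """True only if the set is large enough and the CI upper bound is within the ceiling."""
    n = len(actor_err)
    if n != len(verifier_err) or n < min_cases:
        return False
    return phi_upper_bound(phi_coefficient(actor_err, verifier_err), n) <= ceiling
```

A verifier that misses the same cases as the actor produces a phi near 1 and is rejected however separate it looks on an org chart. A verifier with unrelated misses lands near 0 and clears the ceiling, provided the set meets the 1,000-case minimum.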
A2's escape-rate measurement is the backtest of A1's challenger: the traffic light is only honest if the entity counting the exceptions is structurally incapable of grading its own homework and demonstrably uncorrelated with the actor it grades. This is banking's effective challenge — credible, independent review with the authority to change or reject the model — rendered as a measured ceiling rather than an org-chart claim. It is why "you cannot self-grade" (A1) and "you cannot outsource governance" (A3) are the same refusal that "accuracy is not authority" depends on: a count produced by the actor is not a backtest, and a count produced by a captured or correlated reviewer is the same fiction wearing a second name. 20 The LTCM principals, after all, were not grading their own models in any naive sense — they were the most credentialed validators available; what they lacked was a counter whose view of the tail was independent of the framework that had defined the tail away.
Discipline Is the Asset, Not the Tax
The reason any of this is worth the cost is that the discipline is not overhead; it is what unlocks the wider operation. Provable reliability is the asset, not the tax. The clearest demonstration is the one banking and aviation both make: the institutions that can measure a failure rate are the only ones that can safely widen authority, because they are the only ones who can defend the widening to an examiner.
The mechanism is the selective removal of the human checkpoint. An organization that cannot quantify its agent's escape rate has no principled way to remove any gate, so it keeps all of them, and the agent's nominal autonomy is fictional — every consequential step still routes through a human approval that buys safety the way carrying two spare engines does: expensive, mostly idle, and corrosive of the vigilance it depends on, because a reviewer who rubber-stamps a thousand correct actions stops reading carefully before the one that mattered. 10 A backtested escape rate is what lets gates be removed selectively and defensibly: keep the gate where the measured tail risk and the aggregated blast radius warrant it; drop it where the demonstrated rate over enough trials makes it dead weight. The dividend is not the absence of supervision. It is supervision allocated to exactly where the data says it is needed.
The wider autonomy budget is not a separate prize from the backtesting discipline. The budget is what the discipline buys. The teams that can count their agents' escapes over enough trials are the only ones who get to fly the direct route — fewer checkpoints, more tasks carried to completion — and they get to keep flying it only as long as the count holds.
The economics of that checkpoint are worth dwelling on, because they are where the dividend is realized or lost. A human approval gate around every consequential agent action is the supervision analogue of carrying two extra engines: it buys safety, but it is expensive, mostly idle, and corrosive of the vigilance it depends on. The cost is not only the reviewer's time; it is the latency the gate adds, the throughput it caps, and the attrition of attention that sets in when a reviewer approves a thousand correct actions and stops reading carefully before the one that mattered. 10 The count is what converts that flat, expensive, all-or-nothing supervision into a priced and graduated one. Where the measured escape rate is low and the aggregated blast radius is bounded, the gate is dead weight and can be dropped; where the tail risk or the cumulative exposure warrants it, the gate stays and is defensible to an examiner precisely because a number, not an intuition, placed it there. The contraction clause is the other half of the bargain: an operator can afford to drop a gate only because the envelope will auto-contract the moment the count turns, so the saving is never a permanent bet against the data — it is a bet that lasts exactly as long as the data supports it.
This is the same dividend banking realizes when a well-backtested model earns lighter capital treatment, and the same one aviation realizes when an airframe with a demonstrated shutdown rate earns the direct great-circle line across an empty ocean instead of a fuel-wasting dogleg to stay within reach of a runway. 9 In each case the discipline and the dividend are the same object viewed from two sides: the bank does not earn lighter capital and a validated model as separate prizes — the validation is the lighter capital; the airline does not earn a fuel saving and a wider diversion budget — the budget is the saving. The autonomy you can defend is the autonomy you can price; the autonomy you can price is the autonomy you can widen; and the only teams that get to widen it are the ones whose count holds.
Honest Limits
The discipline deserves its limits stated as plainly as its claims. Backtesting does not bound failure relative to ground truth. It bounds undetected failure relative to a measured experience base and a specification, and the distinction is the whole honesty of the method. The traffic light counts exceptions against the model's stated threshold; it certifies that the realized rate is consistent with the claimed tolerance, not that the claimed tolerance is the right one for a regime the experience base never sampled. 1 LTCM's backtest of the body of its distribution was sound; the tail it had never observed was not in the sample, and no amount of counting over the represented cases would have surfaced it. 6
This is why the FRTB tail correction is necessary and still insufficient on its own. Expected Shortfall prices the tail that appears in the experience base; it cannot price a failure mode the base never sampled, only one the data has begun to show. 11 The LTCM tail was not under-weighted; it was un-sampled, and an Expected-Shortfall measure computed over the regimes the 1990s produced would have been as silent on the Russian default as the VaR it replaced. The honest claim is therefore narrow: the tail correction governs the observed tail well and the unobserved one not at all, and the only defense against the unobserved tail is to treat the experience base as perpetually incomplete and the abstention default as the response when the base cannot speak to the case at hand.
A second limit is statistical and specific to the agent setting. The binomial model under the escape-rate harness assumes independent trials, so a prompt-injection family or a silently-changed tool API that causes a burst of correlated escapes would make the confidence interval optimistic — the count would look green while a single root cause was failing many actions at once. 14 This is the LTCM correlation collapse and the 2008 clustering, transposed into software, and it is exactly why OBL-AGG-001's windowed aggregate and Christoffersen-style concentration monitoring are not optional refinements but load-bearing corrections to a count that the independence assumption would otherwise read too kindly.
A third limit is the integrity of the labels. The backtest is only as honest as its ground truth: labels must be produced independently of the verifier under test, or the harness measures agreement between two correlated parties rather than escape against reality — the same way a VaR backtest scored against the model's own predicted losses would certify nothing. 17 The scope statement that survives all three limits is this. Conformance asserts that the right checks ran, by an independent and uncorrelated party, over enough trials, with evidence — never that no error can occur. 20 A traffic-light regime bounds the rate of undetected failure relative to a measured base and a specified tolerance, scaled to consequence, aggregated across the governance unit, and contracted automatically on breach. It does not make the agent infallible. It makes the agent's authority a measured, accountable, revocable quantity — a count on a ledger that can be earned, audited, and, when the data turns, taken back. That is what banking spent decades learning to do to a risk model, and it is what AI autonomy now has to learn to do to an agent.
The Engineering Posture
The posture an operator must adopt before letting an agent act follows directly, and it is the discipline of this paper in one line. Do not grant authority on a demo, however impressive; grant it on a count. Price autonomy in measured failure data over enough trials to exclude luck; aggregate the blast radius across every action and every agent in the governance unit; and contract the envelope automatically on breach. The agent's traffic light is not a report you read after the fact — it is the gate that holds the authority, recomputed on a rolling window, green only while the data says green, and red the moment it does not.
The brief companion to this paper — Backtested, Not Demoed — introduces the core argument in a shorter form. Related papers: Effective Challenge establishes that the entity counting the failures must be structurally separate from the one producing them; and The Obligation to Govern establishes that the count belongs to the deployer regardless of who built the model.