Skip to main content
kellerai.blog

A Great Demo Quarter Is Not a Backtest

A demo quarter is not a backtest. Authority is priced in failure data, not favorable runs.

KellerAI White Paper · Engineering Discipline & Verification · Jun 2026

Context

AI teams routinely grant agents wider autonomy after a run of impressive results — clean pull requests, no escalations, a smooth demo quarter. Each of these judgments treats accuracy on cases shown as authority over cases to come.

The Finding

That judgment is wrong in a specific and expensive way. Authority is priced in measured failure data over enough trials to exclude luck — the same traffic-light discipline banking has run on risk models for thirty years. A long demo is still a demo; only a count earns authority.

Tags:
agent autonomybacktestingescape ratetraffic light regimeBasel VaR
Paper Details
CategoryEngineering Discipline & Verification
AudienceEngineering leads, AI governance teams, and risk officers evaluating when and how to widen agent autonomy in production systems.
MethodAnalytical · evidence-based
Length~550 · 2 min
Sections0
DateJun 2026
AuthorsKellerAI
Read the full paper
Section 01

A Demo Quarter Is Not a Backtest

The most expensive misreading in AI deployment is that a great demo quarter is evidence of reliability. It is not. A run of impressive results is a sample of one favorable regime, and a sample of one favorable regime is exactly what every famous blow-up had on the morning before it blew up. Accuracy on the cases you showed is not authority over the cases you will face — and the gap between the two is where the losses live.

Authority is a different kind of quantity. It is priced in measured failure data over enough trials to exclude luck — and it contracts automatically the moment the data turns. Banking learned this the hard way, wrote it into rule, and runs it every trading day. AI deployment has not yet made the move, and is spending the difference in silent, unaccounted failures.

Section 02

How Banking Prices the Same Thing It Won't Take on Faith

A market-risk model that states a 99% one-day Value-at-Risk is making a precise probabilistic claim: realized losses should breach the stated threshold on no more than one day in a hundred. Regulators do not accept that claim because the model looks good. They count. Over a rolling 250-trading-day window they tally how many days the realized loss exceeded the threshold, and they run a traffic light around the count.

Zero to four exceptions is the green zone — the model stands. Five to nine is the yellow zone — the regulatory capital multiplier rises, and the model's use gets more expensive. Ten or more is the red zone — the model is presumed broken until it is recalibrated. The authority to set your own capital is not asserted; it is earned in exceptions, recomputed continuously, and revoked by the count itself. A bound is not a claim a bank gets to make. It is a measured commitment a bank gets to keep.

Section 03

The Tail, Priced Directly

A model can be well-calibrated on the average day and catastrophically wrong about the rare one. Value-at-Risk marks the boundary of acceptable loss but says nothing about how bad the loss is once that boundary is crossed. So banking moved the goalposts: the Fundamental Review of the Trading Book replaced VaR with Expected Shortfall at the 97.5th percentile — the average of the losses beyond the threshold. Govern by the worst credible failure, not by the comfortable mean.

Models with pristine in-sample track records are precisely the ones whose tails were never tested. The traffic light and the move to the tail are the same refusal: a demo is not a backtest of the rare event, and only counted exceptions over enough trials are.

The tail correction

That correction was written in blood and money. Models with pristine in-sample track records are precisely the ones whose tails were never tested. The traffic light and the move to the tail are the same refusal: a demo is not a backtest of the rare event, and only counted exceptions over enough trials are.

Section 04

The Agent's Own Traffic Light

The translation to autonomous agents is exact. The unit is not "Did the demo work?" but "What is the rate of undetected failure on this task class, measured over enough runs to exclude luck, and is it still holding?" Price the autonomy budget in that rate. Tighten the tolerance as the blast radius grows — a looser bound for reversible internal writes, a tighter one for hard-to-reverse external effects, and zero for the irreversible tier that no backtest can ever pass.

Then aggregate. The escape budget is not spent one action at a time; it is spent across every action and every agent in a governance unit over a rolling window. A sequence of individually-passing actions can still move the unit from green to red. And when the count turns, the authority contracts on its own — the agent's own traffic light, switching without anyone having to argue about it.

Section 05

Discipline Is the Asset, Not the Tax

It is tempting to read all of this as overhead — measurement as a cost that careful teams pay and bold teams skip. The opposite is true. The teams that can quantify their agents' failure rates are the only ones who can remove a human checkpoint defensibly: keep the gate where the measured tail risk warrants it, drop it where the data shows it is dead weight. The wider autonomy budget is not a separate prize from the discipline. The discipline is what buys it.

The in-depth companion develops the full argument — the Basel traffic-light regime and the move to Expected Shortfall, the named losses that priced the tail in retrospect, and the escape-rate machinery that lets an agent earn, aggregate, and auto-revoke its own authority.

Read the in-depth companion →

End of paper↑ Back to top