Abstract
The KellerAI corpus on model upgrades governs the acute event: the deliberate, customer-visible version bump that arrives with a release note and a new alias. This paper addresses the prior and harder problem — the behavior of a production model on which no named upgrade occurred. Between two labeled versions, a hosted model can change how it behaves through safety fine-tuning, an inference-engine revision, a routing or caching policy, a context-window change, or a server-side system-prompt edit, none of which the operator sees and none of which changes the version string the operator validated.
The argument is that behavioral fingerprinting — the periodic re-execution of a fixed probe set against a stored baseline — belongs in the production telemetry stack as a first-class signal, alongside availability, latency, and cost, rather than living only in a pre-deployment test suite. The standard observability stack is structurally blind to this failure: it measures whether the endpoint answered, how fast, and at what cost, and whether the input distribution shifted, but never whether the output distribution moved on an input that did not.
The paper is honest about evidence stratification. It leads with documented incidents — Anthropic's March–April 2026 postmortem and the GateScope gateway measurement study — and treats the widely cited “91% of models degrade” figure as the vendor survey it is, not as independent measurement of vendor-side ambient drift. Behavioral fingerprinting is presented as a discipline this paper proposes, assembled from existing research prototypes and standards; no commercial, production-grade, vendor-agnostic implementation existed as of this writing. The argument's strength is that it does not depend on the survey number: the documented incidents and the structural blindness of conventional monitoring carry it on their own.
For the leadership-level version of this argument — the load-bearing claim, the worked incident, and the four-line checklist — read the companion brief, The Drift You Cannot See Until It Costs You .
Your cost dashboard is green. Your latency P99 is flat. Your model version is unchanged. The behavior is not the same model you validated last quarter.
Two Classes of Drift, One Governance Gap
It helps to separate two failures that get conflated. The first is the acute upgrade: a vendor ships a new named version, the alias moves, a release note exists, and a disciplined team runs a change review before adopting it. That event is the subject of the companion frontier-model paper, and it has a natural governance handle — there is an event to gate. 1 The second is chronic drift: the behavior of a model changes while its named version does not. There is no event, no release note, and therefore nothing for a change-review process to gate. This is the subject of this paper, and it is the harder problem precisely because the usual governance handle is absent.
Chronic, vendor-silent drift has at least four distinct sources, and it is worth naming them because they fail observability in different ways. The first is the safety and alignment fine-tuning cadence: providers adjust model behavior continuously, and the cadence is not published. We flag this source as the most speculative of the four — the existence of continuous tuning is well established, but a precise, documented cadence is not, and we do not assert one.
The second source is documented and concrete: inference-engine and infrastructure changes. Anthropic's own engineering postmortem for March 4 through April 20, 2026 traced behavioral regressions on a stable API surface to changes including a reduction in reasoning effort and a caching defect that degraded the model's use of session context. 2 Independent telemetry tracking through the same window corroborated a substantial drop in observed reasoning depth on the affected surface. 3 None of these changes touched the version string.
The third source is the API-gateway layer. The 2026 GateScope study, a large-scale measurement of LLM API gateways, examined more than a thousand gateways and reported silent model substitutions, degraded conversational memory, and pricing deviations — behavioral changes introduced by the routing and caching infrastructure beneath the model, invisible at the alias the operator targets. 4 The fourth source is server-side behavioral tuning, exemplified by OpenAI's rollback of a GPT-4o update that had become noticeably sycophantic; the customer-facing alias did not change across either the regression or the rollback. 5
The governance gap is the common thread. Cost monitoring, latency monitoring, and input-distribution monitoring — the three pillars most production teams actually run — all observe axes orthogonal to this failure. Each of the four sources changes the output distribution on a fixed input while leaving cost, latency, and input distribution within their normal envelopes. The gap is not a tooling oversight; it is a category error, monitoring the wrong dimension for the failure that occurs.
The four sources of silent drift share one property: each moves the output on a fixed input while leaving every metric you already watch inside its normal range.
The Scale: What “91% Degrade” Actually Measures
A figure circulates in practitioner writing on this topic: that the large majority of machine-learning models — often quoted as 91% — degrade within roughly twelve months of deployment. Because the number is doing rhetorical work across the field, it deserves to be handled carefully rather than repeated. It has two distinct sources, and neither measures the specific phenomenon this paper is about.
The first source is a DataRobot industry report on the state of AI in production. It is a vendor-sponsored practitioner survey, and what it measures is broad, self-reported machine-learning model degradation across deployments — not a controlled, longitudinal measurement of vendor-side ambient drift on hosted frontier models specifically. 6 We cite it as exactly that: a vendor survey, useful for establishing that practitioners experience degradation as a common problem, and not load-bearing as independent ground truth for the narrower claim.
The second source is the academic literature on model aging, which documents the temporal decay of deployed ML model performance and the monitoring practices built to catch it — surfaced in industry largely through drift-monitoring vendors' summaries. 7 This work is real and relevant, but it measures the general phenomenon of model performance declining over time, typically driven by data and concept drift on the input side, not the specific case of a hosted model's behavior changing while its version is held constant. The distinction matters: classical model aging is in principle catchable by input-distribution monitoring, whereas vendor-side ambient drift, by construction, is not — the input is identical and the model changed underneath it.
Practitioner writing has begun to call this class of failure the defining operational risk of the year, and that framing captures the mood accurately. 8 We carry it for voice, not for factual weight: it is a vendor and practitioner characterization, and the load-bearing evidence remains the documented incidents and the gateway study, not the survey numbers or the editorial framing.
The 91% figure measures broad ML degradation in a vendor survey, not vendor-side ambient drift. The argument does not need it. The documented incidents carry the weight on their own.
The Instrument: Behavioral Fingerprinting
Behavioral fingerprinting is the periodic re-execution of a fixed, deterministic probe set against a stored baseline, with the output scored and the deviation tracked over time. It is defined as much by what it is not. It is not latency or cost monitoring, which observe the transport and the bill rather than the content. It is not input-distribution monitoring, which watches what goes in rather than what comes out. And it is not a one-time regression suite run at deploy, because vendor-side drift is not tied to the operator's deploy calendar — a check that runs only when the operator ships cannot catch a change the operator did not ship.
The mechanism is demonstrated in research. A black-box endpoint fingerprinting prototype — described in the literature as a stability monitor — re-runs fixed probes against a hosted endpoint and detects within-provider stability differences even when the model alias is held constant, exactly the regime this paper is concerned with. 9 A separate experiment is instructive for how small a change can flip a behavioral verdict: a three-word edit to a YAML configuration flipped two of four automated evaluators from pass to fail with zero accompanying signal on any site-reliability dashboard, a result that shows both the sensitivity required and the invisibility of the failure to conventional monitoring. 10
Scoring the output is the part that distinguishes a fingerprint from a spot check. The probe set must pair each prompt with a rule-based or model-based evaluator so that “the output changed” becomes a quantity rather than an impression. The research community has produced relevant scoring primitives, including an intent-drift score reported to correlate strongly with human judgment of whether a model has changed its behavior on a task. 11 Practitioner taxonomies likewise distinguish several drift types — for instance, separating a change in the model's competence from a change in its style or its intent — which is useful for deciding which deviations warrant a page and which are tolerable vendor noise. 12
A three-word configuration change flipped two of four evaluators from pass to fail. None of the SRE dashboards would have caught it. The probe set did.
The Anthropic April 2026 Incident, Worked
The March–April 2026 Anthropic incident is the clearest worked example available, because the vendor published a timeline against which the argument can be checked. Beginning around March 4, behavioral regressions appeared on a model alias that did not change. The vendor's postmortem, published April 23, identified the contributing changes; a community-tracked issue and third-party telemetry had surfaced the degradation well before that date. 2 13 Roughly seven weeks separated the first change from the public confirmation.
Three degradation modes are useful to examine against the standard monitoring stack, because each defeats it differently. The first was a reduction in reasoning effort — the model thought less before answering. This is the mode most likely to surface eventually in an aggregate quality metric, if a team happens to run one, though even then it would appear as a slow trend rather than a dated event.
The second was a caching defect that degraded the model's use of session context. This one is invisible to any monitor that evaluates single turns in isolation; it shows only if the instrument follows a conversation's lineage and checks whether earlier context is still being used downstream. A fingerprint built from single-shot probes would miss it; a fingerprint that includes multi-turn probes catches it.
The third was a verbosity cap: the model returned less than it had before. This is the most insidious of the three, because it is a regression in what the model did not return — an absence rather than an error. No availability monitor, error-rate monitor, or latency monitor observes a correct-looking response that is merely less complete than it used to be. Only a baseline comparison — this probe used to produce a seven-step answer and now produces four — surfaces it. The lesson of the three modes together is that no single conventional signal catches all of them, and a baseline behavioral comparison catches all three. 13
For seven weeks the endpoint returned 200, the bill was normal, and the version was constant. The only thing that changed was the behavior — and only third-party telemetry was watching it.
The Tool Landscape, 2026
If the discipline is sound, the natural question is whether a product already implements it. The honest answer is that the pieces exist and the assembled whole does not. Surveying the 2026 landscape clarifies what a team would have to build versus buy.
The established ML-observability platforms — the category that includes input-distribution and data-drift tooling — are well-developed but aimed at the wrong axis for this failure. Their core competence is detecting that the data flowing into a model has shifted, which is the classical drift problem and a real one, but it is not the case of a hosted model changing behavior on identical inputs. 14 Several evaluation-focused vendors market drift and quality monitoring for LLM outputs, and these are closer in spirit, but as a class they stop short of continuous, scheduled, vendor-agnostic fingerprinting against a frozen baseline as a production telemetry stream. 15
On the open-standards side, the ground is being prepared. The OpenTelemetry semantic conventions added a generative-AI evaluation result attribute — gen_ai.evaluation.result — which gives the ecosystem a standard place to record the outcome of an evaluation in the telemetry pipeline, though it is marked as a development-stage convention rather than a stable one. 16 The agent-evaluation tooling demonstrated at industry conferences in early 2026 shows the practice moving toward continuous evaluation in production rather than one-time testing — the directional shift this paper argues for — but it does not yet constitute a turnkey vendor-side drift monitor. 17
The structural gap is organizational as much as technical. As of this writing, the relevant standards bodies had not converged on a shared evaluation format that would let a fingerprint be portable across providers, and no production-grade vendor-agnostic implementation of continuous behavioral fingerprinting was identified. 18 The components — probe execution, scoring, a telemetry attribute to record the result — exist. The assembled, scheduled, baseline-comparing instrument, run as a first-class production signal, is what a team must still build for itself.
Minimum Viable Fingerprinting (Proposed)
We propose a minimum viable discipline assembled from the components surveyed above. It is deliberately modest: four practices that a team can stand up without a dedicated platform, and that together convert an unmeasured behavioral surface into a monitored one. We describe it as a discipline this paper proposes, not a product KellerAI ships.
- 01A fixed probe set. Assemble a small set of deterministic prompts drawn from your real task distribution, each paired with a rule-based evaluator that scores the output, and version the set together with its baseline. The probe set is the measuring stick: it must be stable when the model is not, and it must reflect the tasks you actually depend on rather than synthetic benchmarks.
- 02A scheduled cadence. Re-run the probe set on a clock — daily at minimum, more often for high-stakes paths — rather than only at deploy time. Clock-driven execution is the non-negotiable design choice, because vendor-side drift is decoupled from your release calendar; a deploy-gated check is structurally incapable of catching a change you did not deploy.
- 03A layered alert policy. Warn on a small deviation from baseline (roughly 5–10%), page on a large one (roughly 15–20%), and segment thresholds per task so a noisy category cannot mask a critical one. The objective is to separate tolerable vendor noise from a genuine behavioral break before it reaches a user, without generating so many alerts that the signal is ignored.
- 04A behavioral trail. Store every run — probe, output, score, and baseline delta — as a dated, immutable record. The trail is the governance artifact. It is what lets a team state, with a date and a magnitude attached, that the model in production today is not the model that was validated, which is precisely the evidence an auditor or a regulator asks for.
One refinement is worth adding because it addresses the most common objection — that continuous fingerprinting will drown a team in re-evaluation storms. A layered behavior hash, demonstrated in practitioner experiments, separates probes into tiers: a behavior-critical tier whose deviation forces a full re-evaluation, a runtime tier gated by a tolerance band, and an observed tier treated as vendor noise and excluded from alerting altogether. 10 The tiering is what makes the cadence affordable: not every deviation deserves a page, and the discipline is as much about suppressing noise as about catching signal.
The cost-side relationship to a sibling concern is worth flagging rather than re-arguing. A tokenizer revision — a change in how the provider chunks text — is both a billing event and a drift vector, because different chunking produces a different output distribution. The economics of that change belong to the hidden-cost analysis in the broader catalog; here it is simply one more source the probe set will detect, and the fingerprint does not need to know which of the four sources caused a deviation to flag it. 19
Regulatory Overlay
A regulated team is already obligated to monitor behavior in production, whether or not any framework names this specific failure. The framing throughout this section is deliberate and inferential: these are obligations that apply to systems running hosted models, and fingerprinting is one concrete way to discharge them — not a determination that any specific rule names continuous fingerprinting.
The Federal Reserve's model-risk guidance, SR 11-7, distinguishes model development and validation from ongoing monitoring, and treats ongoing monitoring as an essential, continuing control over a model in production. 20 This is the cleanest regulatory home for the argument, because it separates exactly the two controls this series separates: the change review that governs an upgrade is the validation-on-change control, while fingerprinting is the ongoing-monitoring control. The two are complementary, not substitutes, and SR 11-7 expects both.
The NIST AI Risk Management Framework's Manage function, together with its generative-AI profile, asks an organization to monitor deployed systems for changes in behavior and performance over time, which is the operational content of a fingerprinting program; the framework's emphasis on continuous, post-deployment measurement maps directly onto a scheduled probe set. 21 ISO/IEC 42001, the AI management-system standard, similarly carries performance-evaluation and continual-improvement obligations in its clauses on monitoring and review that an ongoing behavioral measurement program helps satisfy. 21
The EU AI Act addition is the most consequential and the most interpretive. A vendor-side silent update may not trigger the provider's obligations on substantial modification under Article 25, because from the provider's point of view nothing was released. But the deployer's post-market monitoring obligation under Article 72 plausibly still applies: the deployer is responsible for monitoring the system in operation, and a system whose behavior has silently drifted is one the deployer is arguably obligated to have noticed. 22 We flag this mapping explicitly as interpretive — a reading of how the obligation falls when the change originates upstream and invisibly — and one that requires counsel sign-off rather than a conclusion this paper can draw on a deployer's behalf.
SR 11-7 already separates validation-on-change from ongoing monitoring. Upgrade governance is the first control. Fingerprinting is the second. The guidance expects both.
Honest Limits
A paper arguing for measurement discipline owes its reader an honest account of its own evidence. Here is what this paper does not establish.
The headline degradation figure is not independent ground truth. The widely cited “91% of models degrade” number traces to a DataRobot vendor survey of broad ML degradation, not to a controlled longitudinal study of vendor-side ambient drift on hosted frontier models, and the academic model-aging literature it is sometimes paired with measures general performance decay rather than the held-version case. 6 7 We have used the documented incidents as the load-bearing evidence and the survey only to establish that practitioners find degradation common.
The problem is documented through incidents, not through a longitudinal measurement of how often it occurs. The Anthropic postmortem, the GPT-4o rollback, and the GateScope study together establish that vendor-silent drift is real and recurrent, but they do not establish a base rate. 2 5 4 The absence of a longitudinal vendor-update-cadence study is the single largest thin spot in the evidence base, and we name it as such. The remedy does not wait on that study, but the field would be better served by it.
Fingerprinting detects drift; it does not explain it. A probe-set deviation tells a team that behavior moved, by how much, and when. It does not tell them which of the four sources — fine-tuning, infrastructure, gateway, or system prompt — caused the move, because all four are opaque from the operator's side. The instrument is a smoke detector, not a fire inspector; it earns its place by being the only detector pointed at the right room.
The safety-fine-tuning cadence remains speculative. We have framed continuous alignment tuning as an open question rather than an asserted fact, because while it is well established that providers tune models continuously, a precise published cadence is not available, and we have not assumed one. No commercial, production-grade, vendor-agnostic fingerprinting platform was identified as of this writing; the discipline is assembled from existing parts, and presenting it otherwise would overclaim. 18
Finally, the scope boundaries are deliberate and worth restating. This paper does not re-argue the governance of a deliberate, named-version upgrade — that is the companion frontier-model paper's subject, and it owns the acute event with a release note. 1 It does not develop the economics of a tokenizer change, which belong to the hidden-cost analysis. 19 And the regulatory mappings, especially the EU AI Act Article 72 deployer-obligation reading, are interpretive and require verification against current text and counsel review. 22 What this paper does claim is narrow and, we believe, well supported: that chronic vendor-silent behavioral drift is real and documented, that conventional production monitoring is structurally blind to it, and that continuous behavioral fingerprinting against a fixed baseline is the instrument that sees it.
For the short, leadership-level version of this argument — the load-bearing claim, the worked incident, and the four-line checklist — read the companion brief, The Drift You Cannot See Until It Costs You .
Not that we can tell you why the model changed — the sources are opaque from your side — but that you can know, with a date attached, the moment it did. That is the difference between governing the model and hoping it held.