The Drift You Cannot See Until It Costs You

KellerAI

Section 01

The Version Did Not Change

A production team validates a model against a frozen alias, ships it, and moves on. The alias does not change. Months later the same prompt produces a meaningfully different answer — shorter, less careful, missing a step it used to take. No release note announced it. No deploy of theirs caused it. This is the problem the upgrade governance literature does not reach: behavior moves on a version that, by every label the team can see, has not.

Two documented incidents in 2026 make the failure concrete. Between March 4 and April 20, 2026, Anthropic's own postmortem traced a cluster of behavioral regressions on a stable API surface — including a reasoning-effort change and a caching bug that degraded session memory — that surfaced publicly only after weeks of third-party reports. Separately, OpenAI rolled back a GPT-4o update after it became noticeably sycophantic; the model alias customers called did not change across the regression and the rollback. In both cases the version string was a constant and the behavior was not.

The infrastructure that delivers these models adds a second silent surface. A 2026 measurement study of LLM API gateways, GateScope, examined more than a thousand gateways and reported silent model substitutions, degraded conversational memory, and pricing deviations — behavioral changes introduced below the model, invisible at the alias. The drift can originate in safety fine-tuning, an inference-engine change, a routing or caching policy, or a system-prompt edit on the provider's side. The operator sees none of these. They see only the output, and only if they are looking at it.

Your cost dashboard is green. Your latency P99 is flat. Your model version is unchanged. The behavior is not the same model you validated last quarter.

The load-bearing claim

Section 02

One Story: Seven Weeks of Silent Drift

The clearest worked example is Anthropic's March–April 2026 incident, because the vendor published the timeline. Beginning around March 4, a series of changes degraded output quality on a model alias that did not change. The reported modes were distinct: reasoning effort was reduced; a caching defect degraded the model's use of session context; and verbosity was capped so the model returned less than it had before. A community-tracked issue and third-party telemetry surfaced the regression well before the vendor confirmed it — roughly six weeks elapsed between the first change and the public postmortem on April 23.

Throughout that window, every conventional health signal was nominal. The endpoint returned HTTP 200. Latency and cost moved within normal bounds. The model version string was constant. A team monitoring availability, spend, and input distribution would have seen nothing, because none of those instruments observes the one thing that changed: the distribution of the model's output on a fixed input.

The three modes are instructive because they fail observability differently. A reasoning-effort drop might eventually show in aggregate quality metrics. A caching defect that degrades session memory shows only if you track behavior across a conversation's lineage. A verbosity cap is a regression in what the model did not return — invisible to any monitor that checks for errors rather than for completeness. Only a baseline comparison on a fixed probe catches all three at once.

Section 03

Why Cost and Latency Miss It

The reason this drift goes unseen is structural, not negligent. The standard production observability stack measures the wrong axes for this failure. Availability monitoring asks whether the endpoint answered. Cost monitoring asks how many tokens it billed. Latency monitoring asks how fast. Input-distribution monitoring — the mainstay of classical ML drift detection — asks whether the data coming in has shifted. None of these asks whether the output distribution changed on an input that did not.

Vendor framing has tried to size the problem with a headline figure: that a large share of models degrade within a year of deployment. That number traces to a DataRobot industry survey, and it is worth being precise about what it does and does not establish. It is a vendor-sponsored practitioner survey of broad machine-learning degradation; it is not an independent, longitudinal measurement of vendor-side ambient drift on hosted frontier models. Adjacent academic work on model aging documents broad ML performance decay over time, but likewise does not isolate the silent inter-version drift this paper is about. We lead with the documented incidents, not the survey number, because the incidents are what we can stand behind.

The honest statement is narrow and strong at once: the failure is real, repeatedly documented in named-vendor postmortems and a peer-reviewed gateway study, and structurally invisible to the metrics most teams already run. What it lacks is a longitudinal study of how often it happens. The remedy does not wait on that study, because the remedy is to instrument for it directly.

Cost, latency, and input-distribution monitors all measure axes orthogonal to the failure. Only a probe that re-runs a fixed input and compares the output to a baseline observes drift on an unchanged version.

The instrument gap

Section 04

The Checklist: Fingerprint the Behavior

The discipline that catches this drift is behavioral fingerprinting: periodically re-running a fixed probe set and comparing the output to a stored baseline. It is distinct from latency, cost, and input-distribution monitoring, and distinct from a one-time regression suite run at deploy. We describe it as a discipline this paper proposes, assembled from existing parts; no commercial, production-grade, vendor-agnostic implementation existed as of this writing. Research prototypes demonstrate the core mechanism — black-box endpoint fingerprinting that detects within-provider stability differences under a stable alias. Four practices make it minimally viable.

01Fix a probe set. Assemble a small, fixed set of deterministic prompts drawn from your real task distribution, paired with rule-based evaluators that score each output, and version the set alongside its baseline. This is the measuring stick; it must not move when the model does.
02Schedule the cadence. Re-run the probe set on a clock — daily at minimum — rather than only at deploy time. Vendor-side drift is not tied to your release calendar, so a deploy-gated check cannot catch it. Clock-driven re-execution is what turns a one-time regression test into continuous behavioral telemetry.
03Layer the alerts. Warn on a small deviation from baseline (roughly 5–10%), page on a large one (roughly 15–20%), and segment the thresholds by task so a noisy category does not drown a critical one. The point is to distinguish vendor noise from a behavioral break before it reaches a user.
04Keep the trail. Store each run as a dated behavioral record — probe, output, score, baseline delta — so that when behavior moves you have evidence of when it moved and by how much. The trail is the governance artifact: it is what lets you say, with a date attached, that the model you run today is not the model you validated.

The companion in-depth paper develops each practice against its evidence base — the verified Anthropic timeline, the GateScope gateway study, the fingerprinting research prototypes, the 2026 tool landscape, and the regulatory frameworks that already oblige a regulated team to do this work. For the instrument's full design, the layered behavior-hash that avoids re-evaluation storms, and the regulatory overlay, read the in-depth companion .

Section 05

The Point

A regulated team does not need a new mandate to be obligated here. The Federal Reserve's model-risk guidance, SR 11-7, asks for ongoing monitoring of a model in production — a control distinct from, and complementary to, the change review that governs a deliberate upgrade. For systems in scope of the EU AI Act, the post-market monitoring obligation in Article 72 falls on the deployer; a vendor-side silent update that may not trip the provider's own change obligations can still bear on the deployer's duty to monitor the system in operation. We flag that mapping as interpretive and one for counsel to confirm, not a settled determination.

The scope of this argument is deliberately bounded. Its companion paper on frontier-model upgrades owns the acute, deliberate, customer-visible version bump — the named event with a release note. This paper owns the chronic, ambient, vendor-silent drift between those events, where there is no event to govern and the only available control is to watch the output continuously. The cost dimension of a related change — a tokenizer revision that alters chunking and billing — belongs to the hidden-cost paper; here the same change is simply one more drift vector.

Behavioral drift on an unchanged version is not exotic. It is the ordinary consequence of running production on a model you do not control, observed through metrics that were never designed to see it. Fingerprint the behavior, and the drift you could not see until it cost you becomes the drift you catch the day it starts.

Upgrade governance covers the event. Fingerprinting covers everything between the events. Run both, and the model you ship stays the model you validated — or you find out the day it stops.

The point