kellerai.blog

The Drift You Cannot See Until It Costs You

The model version did not change. The behavior did.

KellerAI White Paper · Observability & Drift · May 2026

Context

Production teams validate a model against a frozen alias and move on, assuming the behavior stays constant. Between named versions, hosted models drift silently through safety fine-tuning, inference-engine changes, gateway routing, and system-prompt edits — documented in Anthropic's March–April 2026 postmortem and the peer-reviewed GateScope gateway study. Cost dashboards stay green. Latency dashboards stay green. Version strings stay constant. The behavior does not.

The Finding

The remedy is behavioral fingerprinting: periodically re-run a fixed probe set, compare the outputs to a stored baseline, layer alerts on deviation (5–10% warn, 15–20% page), and keep a dated trail of each run. This is not new infrastructure; it is a discipline assembled from existing parts. It observes what cost and latency monitors cannot, and it is load-bearing for any team running production on a model it does not control. The companion paper develops the full design, the regulatory overlay, and the fingerprinting research prototypes.

Tags: Model Observability, Behavioral Drift, Production Governance
Paper Details
Category: Observability & Drift
Audience: Engineering teams, operations leads, and compliance officers responsible for production AI systems
Method: Incident postmortem analysis (Anthropic March–April 2026) + peer-reviewed gateway study (GateScope) + observability discipline framework
Length: ~1,500 · 6 min
Sections: 5
Date: May 2026
Authors: KellerAI
Section 01

The Version Did Not Change

A production team validates a model against a frozen alias, ships it, and moves on. The alias does not change. Months later the same prompt produces a meaningfully different answer — shorter, less careful, missing a step it used to take. No release note announced it. No deploy of theirs caused it. This is the problem the upgrade governance literature does not reach: behavior moves on a version that, by every label the team can see, has not.

Two documented incidents make the failure concrete. Anthropic's own postmortem traced a cluster of behavioral regressions between March 4 and April 20, 2026, on a stable API surface, including a reasoning-effort change and a caching bug that degraded session memory; the regressions surfaced publicly only after weeks of third-party reports. Separately, in April 2025, OpenAI rolled back a GPT-4o update after it became noticeably sycophantic; the model alias customers called did not change across the regression and the rollback. In both cases the version string was a constant and the behavior was not.

The infrastructure that delivers these models adds a second silent surface. A 2026 measurement study of LLM API gateways, GateScope, examined more than a thousand gateways and reported silent model substitutions, degraded conversational memory, and pricing deviations — behavioral changes introduced below the model, invisible at the alias. The drift can originate in safety fine-tuning, an inference-engine change, a routing or caching policy, or a system-prompt edit on the provider's side. The operator sees none of these. They see only the output, and only if they are looking at it.

Your cost dashboard is green. Your latency P99 is flat. Your model version is unchanged. The behavior is not the same model you validated last quarter.

The load-bearing claim
Section 02

One Story: Seven Weeks of Silent Drift

The clearest worked example is Anthropic's March–April 2026 incident, because the vendor published the timeline. Beginning around March 4, a series of changes degraded output quality on a model alias that did not change. The reported modes were distinct: reasoning effort was reduced; a caching defect degraded the model's use of session context; and verbosity was capped so the model returned less than it had before. A community-tracked issue and third-party telemetry surfaced the regression well before the vendor confirmed it: roughly seven weeks elapsed between the first change and the public postmortem on April 23.

Throughout that window, every conventional health signal was nominal. The endpoint returned HTTP 200. Latency and cost moved within normal bounds. The model version string was constant. A team monitoring availability, spend, and input distribution would have seen nothing, because none of those instruments observes the one thing that changed: the distribution of the model's output on a fixed input.

The three modes are instructive because they fail observability differently. A reasoning-effort drop might eventually show in aggregate quality metrics. A caching defect that degrades session memory shows only if you track behavior across a conversation's lineage. A verbosity cap is a regression in what the model did not return — invisible to any monitor that checks for errors rather than for completeness. Only a baseline comparison on a fixed probe catches all three at once.
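As a minimal sketch of what rule-based checks on a fixed probe might look like for the three modes, consider the helpers below. The function names, the substring heuristics, and the sample strings are illustrative assumptions, not taken from the incident or from any library; real evaluators would be tuned to the task.

```python
def answer_matches(output: str, expected: str) -> bool:
    """Reasoning-effort probe: does the output contain the known answer?"""
    return expected.strip().lower() in output.lower()


def recalls_fact(final_reply: str, fact: str) -> bool:
    """Session-memory probe: does a fact stated earlier in the
    conversation still appear in the final reply?"""
    return fact.lower() in final_reply.lower()


def step_coverage(output: str, required_steps: list[str]) -> float:
    """Completeness probe: fraction of required steps the output mentions.
    A verbosity cap shows up here as missing steps, not as an error."""
    text = output.lower()
    found = sum(1 for step in required_steps if step.lower() in text)
    return found / len(required_steps)


# Hypothetical baseline and current outputs for the same fixed prompt.
steps = ["back up", "run the migration", "verify"]
baseline_output = ("First back up the database. Then run the migration. "
                   "Finally verify the row counts.")
current_output = "Run the migration."

baseline_cov = step_coverage(baseline_output, steps)  # every step present
current_cov = step_coverage(current_output, steps)    # only one step present
```

Both outputs would arrive with HTTP 200; only the comparison against the baseline shows that two of the three steps are gone.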

Section 03

Why Cost and Latency Miss It

The reason this drift goes unseen is structural, not negligent. The standard production observability stack measures the wrong axes for this failure. Availability monitoring asks whether the endpoint answered. Cost monitoring asks how many tokens it billed. Latency monitoring asks how fast. Input-distribution monitoring — the mainstay of classical ML drift detection — asks whether the data coming in has shifted. None of these asks whether the output distribution changed on an input that did not.

Vendor framing has tried to size the problem with a headline figure: that a large share of models degrade within a year of deployment. That number traces to a DataRobot industry survey, and it is worth being precise about what it does and does not establish. It is a vendor-sponsored practitioner survey of broad machine-learning degradation; it is not an independent, longitudinal measurement of vendor-side ambient drift on hosted frontier models. Adjacent academic work on model aging documents broad ML performance decay over time, but likewise does not isolate the silent inter-version drift this paper is about. We lead with the documented incidents, not the survey number, because the incidents are what we can stand behind.

The honest statement is narrow and strong at once: the failure is real, repeatedly documented in named-vendor postmortems and a peer-reviewed gateway study, and structurally invisible to the metrics most teams already run. What it lacks is a longitudinal study of how often it happens. The remedy does not wait on that study, because the remedy is to instrument for it directly.

Cost, latency, and input-distribution monitors all measure axes orthogonal to the failure. Only a probe that re-runs a fixed input and compares the output to a baseline observes drift on an unchanged version.

The instrument gap
Section 04

The Checklist: Fingerprint the Behavior

The discipline that catches this drift is behavioral fingerprinting: periodically re-running a fixed probe set and comparing the output to a stored baseline. It is distinct from latency, cost, and input-distribution monitoring, and distinct from a one-time regression suite run at deploy. We describe it as a discipline this paper proposes, assembled from existing parts; no commercial, production-grade, vendor-agnostic implementation existed as of this writing. Research prototypes demonstrate the core mechanism — black-box endpoint fingerprinting that detects within-provider stability differences under a stable alias. Four practices make it minimally viable.

  1. Fix a probe set. Assemble a small, fixed set of deterministic prompts drawn from your real task distribution, paired with rule-based evaluators that score each output, and version the set alongside its baseline. This is the measuring stick; it must not move when the model does.
  2. Schedule the cadence. Re-run the probe set on a clock (daily at minimum) rather than only at deploy time. Vendor-side drift is not tied to your release calendar, so a deploy-gated check cannot catch it. Clock-driven re-execution is what turns a one-time regression test into continuous behavioral telemetry.
  3. Layer the alerts. Warn on a small deviation from baseline (roughly 5–10%), page on a large one (roughly 15–20%), and segment the thresholds by task so a noisy category does not drown a critical one. The point is to distinguish vendor noise from a behavioral break before it reaches a user.
  4. Keep the trail. Store each run as a dated behavioral record (probe, output, score, baseline delta) so that when behavior moves you have evidence of when it moved and by how much. The trail is the governance artifact: it is what lets you say, with a date attached, that the model you run today is not the model you validated.
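The four practices can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the companion paper's design: every name here (`Probe`, `THRESHOLDS`, `run_probe_set`, `append_trail`) is hypothetical, `call_model` stands in for whatever client reaches the hosted endpoint, and "deviation" is assumed to mean the relative change in an evaluator score against its baseline, which the text does not pin down. Scheduling is left to an external clock such as cron.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable

# Assumed (warn_at, page_at) bands per task, following the rough ranges in
# the checklist; a noisier task can carry looser limits.
THRESHOLDS = {"default": (0.05, 0.15), "summarization": (0.10, 0.20)}


@dataclass(frozen=True)
class Probe:
    probe_id: str
    task: str
    prompt: str
    evaluator: Callable[[str], float]  # rule-based score in [0, 1]
    baseline_score: float              # recorded at validation; must be > 0


def alert_level(task: str, deviation: float) -> str:
    warn_at, page_at = THRESHOLDS.get(task, THRESHOLDS["default"])
    if deviation >= page_at:
        return "page"
    if deviation >= warn_at:
        return "warn"
    return "ok"


def run_probe_set(probes, call_model, run_at=None):
    """Run each probe once and return one dated record per probe."""
    stamp = (run_at or datetime.now(timezone.utc)).isoformat()
    records = []
    for probe in probes:
        output = call_model(probe.prompt)
        score = probe.evaluator(output)
        # Assumed metric: relative change in score against the baseline.
        deviation = abs(score - probe.baseline_score) / probe.baseline_score
        records.append({
            "run_at": stamp,
            "probe_id": probe.probe_id,
            "task": probe.task,
            "output": output,
            "score": score,
            "baseline_delta": round(deviation, 4),
            "alert": alert_level(probe.task, deviation),
        })
    return records


def append_trail(path, records):
    """Append records as JSON lines so earlier runs are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


# Usage with a stubbed model in place of a real endpoint.
def mentions_refund(output: str) -> float:
    return 1.0 if "refund" in output.lower() else 0.0

probes = [Probe("p1", "support", "How do I return an item?", mentions_refund, 1.0)]
records = run_probe_set(probes, lambda prompt: "Ship the item back to us.")
```

Keeping the alert decision inside the stored record means the trail shows not only that behavior moved but what the monitor concluded on that date.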

The companion in-depth paper develops each practice against its evidence base: the verified Anthropic timeline, the GateScope gateway study, the fingerprinting research prototypes, the 2026 tool landscape, and the regulatory frameworks that already oblige a regulated team to do this work. For the instrument's full design, the layered behavior-hash that avoids re-evaluation storms, and the regulatory overlay, read the in-depth companion.

Section 05

The Point

A regulated team does not need a new mandate to be obligated here. The Federal Reserve's model-risk guidance, SR 11-7, asks for ongoing monitoring of a model in production, a control distinct from, and complementary to, the change review that governs a deliberate upgrade. For systems in scope of the EU AI Act, the post-market monitoring obligation in Article 72 falls on the provider, while Article 26 requires deployers of high-risk systems to monitor the system in operation; a vendor-side silent update that may not trip the provider's own change obligations can still bear on the deployer's duty to monitor. We flag that mapping as interpretive and one for counsel to confirm, not a settled determination.

The scope of this argument is deliberately bounded. Its companion paper on frontier-model upgrades owns the acute, deliberate, customer-visible version bump — the named event with a release note. This paper owns the chronic, ambient, vendor-silent drift between those events, where there is no event to govern and the only available control is to watch the output continuously. The cost dimension of a related change — a tokenizer revision that alters chunking and billing — belongs to the hidden-cost paper; here the same change is simply one more drift vector.

Behavioral drift on an unchanged version is not exotic. It is the ordinary consequence of running production on a model you do not control, observed through metrics that were never designed to see it. Fingerprint the behavior, and the drift you could not see until it cost you becomes the drift you catch the day it starts.

Upgrade governance covers the event. Fingerprinting covers everything between the events. Run both, and the model you ship stays the model you validated — or you find out the day it stops.

The point