The Safeguard Fallback Pattern

KellerAI White Paper · June 2026 · Frontier Tier Governance

Governing a named, measured fallback model: three disclosure modes, the response-object audit trail, and the model-change discipline.

On June 9, 2026, Anthropic released Claude Fable 5 and Claude Mythos 5, described in the system card as two configurations of one model: Fable safeguarded and generally available, Mythos with safeguards lifted for a small set of vetted partners. 2 Anthropic states that when Fable's classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is handled by Claude Opus 4.8 instead, and that users are informed whenever this occurs. 1 The same system card discloses a fourth safeguard category, frontier-LLM development, that degrades capability with no fallback and no notification, at a vendor-estimated ~0.03% of traffic. 2 We read these two mechanisms as opposite ends of a single disclosure spectrum, and we make the case that each one is a model change your model-risk and transparency frameworks already know how to treat. The only genuine novelty is granularity: the change happens inside one request instead of inside a release cycle.

Section 01

The Model Change You Did Not Schedule

The most interesting thing about the Fable 5 launch is not that a safety classifier can swap the model answering your request. It is that, for the first time in a production system, the vendor names the fallback model, publishes a trigger rate, and stamps the substitution into the response object so you can check. The governing public statement is one sentence in the announcement: “When Fable's classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs.” 1

Start by discarding the framing you may have already heard. This fallback is not silent. The word “silently” appears nowhere in Anthropic's launch materials for this mechanism. 1 On the Messages API there is no automatic fallback at all by default: the system card states the request is blocked and the response returns a structured refusal category, with server-side fallback available only as an explicit opt-in that is “reflected in the response object.” 2 The mechanism that actually earns the word hidden is a different one. The system card discloses a fourth safeguard category, for frontier-LLM development, that limits the model's effectiveness with no fallback and no user notification, and that category appears in no consumer announcement. 2

That contrast is this paper's subject. One release ships the same intervention class, a vendor-controlled change to which model serves your request, in a disclosed form and in a structurally hidden form. The disclosed form can be governed: logged, measured, challenged, and inventoried. The hidden form can only be acknowledged. The distance between those two postures is where the governance work lives.

Our sibling series treated the model change you schedule: an upgrade you plan, test, and roll out under change control. 24 This pair compresses that discipline to transaction granularity. A request leaves your system addressed to claude-fable-5 and the answer may come back from claude-opus-4-8; every question your upgrade checklist asks across a release window now has to be answerable inside a single API call. A short, non-technical version of this argument is available as the companion brief, When the Model Changes Mid-Request.

A fallback you can name, measure, and log is a model change. A degradation you cannot see is also a model change. Only one of them arrives with paperwork.

[Figure: The disclosure spectrum in one release]

Section 02

The Primitive: A Named, Measured Fallback Model in Production

Safety routing is not new. What is new in this release is the disclosure shape: three properties that no prior production router shipped together. First, the fallback target has a name. The announcement identifies Claude Opus 4.8 as the model that handles flagged requests, a specific, separately documented, separately priced model rather than an anonymous “safe” variant. 1 Second, the trigger rate is published. Anthropic reports that “more than 95% of Fable sessions involve no fallback at all,” a vendor figure with real limitations we return to in Sections 5 and 8, but a published number where every precedent published none. 1 Third, the substitution is instrumented. The Messages API documentation specifies response fields that record which model served the request and where the handoff happened. 4

Each property breaks with a precedent. OpenAI's September 2025 safety router disclosed the active model only when the user asked, and no trigger-rate percentage was published for it. 9 The GPT-5 launch router fronted a network of unnamed variants with no per-request disclosure at all. 11 Section 7 treats both precedents in detail; here the point is structural. A substitution you can name, count, and log supports governance verbs. A substitution you can only suspect supports forum threads.

The naming matters more than it first appears. Because the fallback target is an ordinary, generally-available model with its own documentation, its own benchmark history, and its own rate card, an operator can characterize the fallback path the same way they would characterize any deliberate deployment of Opus 4.8 — the platform documentation even ships a dedicated migration guide between the two models. 5 An anonymous safe variant gives you nothing to test against. A named one gives you a second model you can evaluate on your own workload before the classifier ever routes to it.

Anthropic is explicit that the classifiers run broad. The announcement states the safeguards are deliberately tuned to be cautious and will sometimes catch harmless requests, and that bio/chem requests fall back “on most requests” at launch. 1 Dianne Penn, Anthropic's head of product management for research and labs, told Axios the company is “deliberately conservative at launch,” conceding that some legitimate security work may be routed away from Fable 5. 8 For an operator, conservative tuning means the fallback path is not an edge case. In flagged domains it is the main path.

Two attribution notes before we proceed, because this series is written under a strict dogfooding rule. The claim that Fable 5 and Mythos 5 are the same underlying model is Anthropic-asserted and uninspectable from outside; we carry it as the vendor's description, not as established fact. 2 And the capability claims around the release, including that its capabilities exceed those of any previously generally available model, are likewise vendor statements pending independent reproduction. 1 One pricing fact sits alongside those notes: the fallback target costs half as much. Opus 4.8 lists at $5 per million input tokens and $25 per million output tokens against Fable 5's $10 and $50, a pricing asymmetry this series' cost paper owns. 23

Section 03

The Three Disclosure Modes

One release ships three distinct disclosure postures for vendor-initiated model substitution. That makes the release itself a natural experiment in disclosure design, and it is worth walking each mode at the level of what a user or caller can actually observe.

Mode 1: notified automatic fallback. On Anthropic's consumer surfaces — Claude web, Mobile, Desktop, Cowork, Code, Design, Microsoft 365, Teams, and Slack — the switch is automatic, the user sees an explicit notice when it happens, and each response is labeled with the model that produced it. 3 The granularity is per-request but the consequence is sticky: after a switch, the support article states the model picker stays on Opus for the rest of the conversation. 3 The system card adds that in some Claude interfaces “a session event is emitted whenever fallback occurs.” 2

Mode 2: structured refusal, opt-in fallback. The Messages API blocks by default. A declined request returns HTTP 200 with stop_reason: "refusal" and a stop_details.category naming the policy area: cyber, bio, or reasoning_extraction. 4 Server-side fallback exists, but only as a developer configuration, and the result is stamped into the response object. 4 Nothing is substituted for an API caller who has not asked for substitution. 2

Mode 3: the hidden fourth category. The system card discloses safeguards for frontier-LLM development — requests targeting pretraining pipelines, distributed training infrastructure, or ML accelerator design — and states: “Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model.” 2 Effectiveness is limited instead through “prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT),” at an estimated “~0.03% of traffic, concentrated in fewer than 0.1% of organizations.” 2 This category appears in the system card and not in the announcement. 2

Mode 2 has operational texture worth knowing before you build on it. The stop_details object carries a human-readable explanation field whose text the documentation marks as unstable — display it, never parse it — and both fields can be null when a refusal maps to no named category. 4 A refusal can also arrive mid-stream, after partial output has already been delivered, in which case the documentation instructs callers to treat the partial output as incomplete and discard it. 4 A declined request is not an error in any transport sense; your code has to branch on stop_reason, because nothing else will tell you.
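A minimal sketch of that branch, assuming the response has already been parsed into a plain dictionary. The field names (stop_reason, stop_details, category) are the ones quoted above; the exact response schema, the "uncategorized" label, and both helper names are our illustrative assumptions, not vendor code.

```python
from typing import Optional


def refusal_category(response: dict) -> Optional[str]:
    """Return a label for a declined request, or None if it was not declined.

    Assumed shape (taken from the fields named in the text, not a verified
    schema): stop_reason == "refusal" marks a decline, and stop_details may
    be missing, None, or carry a null category.
    """
    if response.get("stop_reason") != "refusal":
        return None
    details = response.get("stop_details") or {}
    # A refusal can map to no named category; keep that case distinct
    # instead of collapsing it into "not refused".
    return details.get("category") or "uncategorized"


def usable_text(response: dict, partial_text: str) -> str:
    """Discard partial output when the response turned out to be a refusal."""
    return "" if refusal_category(response) is not None else partial_text
```

The explanation text is deliberately never read here: it is for display, and the category field is the only value the sketch branches on.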

A terminology note, settled once: the announcement calls the third trigger “distillation,” the API category is reasoning_extraction, and the support article describes it as extraction of the model's summarized thinking. 1 These are one trigger under three names, and the API name is the one your logs will carry. 4 The support article also concedes the checks are intentionally broad and may block authorized security testing and benign biology research. 3 Launch-day reports bear that out: a Hacker News user building an internal business-prospecting tool received the notice “Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well.” 6 The notification fires, and the false positives are real.

Section 04

The Response-Object Audit Trail

The Messages API exposes enough per-request state to reconstruct who answered every request. An operator who does not log it is discarding the only evidence of which model produced each output. The opt-in shape first: you name up to three fallback models in a fallbacks parameter under the server-side-fallback-2026-06-01 beta header, or configure SDK middleware that retries client-side; the platform documentation describes both as explicit developer configuration, not automatic substitution. 4 The beta is limited to the Claude API and Claude Platform on AWS; on Bedrock, Vertex AI, and Microsoft Foundry, only the middleware path exists. 4
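The opt-in shape can be sketched as a request builder that sends nothing. The beta name and the three-model ceiling come from the text above; everything else is an assumption labeled in the comments, including the use of the anthropic-beta header to carry the beta name, the representation of fallbacks as an ordered list of model identifiers, and the build_fallback_request helper itself.

```python
BETA_NAME = "server-side-fallback-2026-06-01"  # beta name quoted in the text
MAX_FALLBACKS = 3  # "up to three fallback models"


def build_fallback_request(model: str, fallbacks: list[str], messages: list[dict]) -> dict:
    """Assemble headers and body for an opt-in fallback request (not sent here)."""
    if not fallbacks:
        raise ValueError("opt-in fallback needs at least one fallback model")
    if len(fallbacks) > MAX_FALLBACKS:
        raise ValueError(f"at most {MAX_FALLBACKS} fallback models may be named")
    if model in fallbacks:
        # Our own sanity check, not a documented rule.
        raise ValueError("the requested model should not be its own fallback")
    return {
        # Assumption: the beta is enabled through the anthropic-beta header.
        "headers": {"anthropic-beta": BETA_NAME},
        # Assumption: `fallbacks` is an ordered list of model identifiers.
        "body": {
            "model": model,
            "fallbacks": list(fallbacks),
            "messages": messages,
            "max_tokens": 1024,
        },
    }
```

Keeping the builder separate from the transport makes the opt-in visible in code review: a request either carries the fallbacks list or it does not.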

Four response fields carry the audit trail. The top-level model field reports the model that produced the returned message. A fallback content block marks each point where one model's output gives way to the next. The usage.iterations array records every attempt, with the serving model appearing as a fallback_message entry. And stop_details.category names the trigger class on the refusal itself. 4 The scope boundary is documented just as precisely: only a safety classifier decline triggers fallback; a rate limit, overload, or server error is returned as-is. 4
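The four fields can be flattened into one record per request. This is a sketch under stated assumptions: the response is a parsed dictionary, each usage.iterations entry is assumed to carry a type and a model key, and a fallback content block is assumed to be identifiable by type == "fallback". The AuditRecord class and audit_record function are hypothetical names of ours.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    requested_model: str
    serving_model: Optional[str]
    refusal_category: Optional[str]
    iteration_chain: tuple  # (type, model) pairs, one per attempt
    fallback_blocks: int


def audit_record(requested_model: str, response: dict) -> AuditRecord:
    """Flatten one parsed response into the fields worth retaining."""
    refused = response.get("stop_reason") == "refusal"
    details = response.get("stop_details") or {}
    # Assumption: each iteration entry is a dict with "type" and "model".
    iterations = (response.get("usage") or {}).get("iterations") or []
    content = response.get("content") or []
    return AuditRecord(
        requested_model=requested_model,
        serving_model=response.get("model"),
        refusal_category=(details.get("category") or "uncategorized") if refused else None,
        iteration_chain=tuple((it.get("type"), it.get("model")) for it in iterations),
        # Assumption: fallback content blocks carry type == "fallback".
        fallback_blocks=sum(1 for block in content if block.get("type") == "fallback"),
    )
```

The requested model has to be supplied by the caller because the response only reports who answered, not who was asked.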

Two documented behaviors deserve more attention than they will get by default. First, sticky routing: once a conversation has fallen back, later requests that include fallbacks are served directly by the fallback model for roughly an hour, and a sticky-served turn carries no fallback block at all — you identify it only through usage.iterations and the model field. 4 Second, monitoring blindness: a refusal is an HTTP 200, so any observability built on error rates or 5xx responses never sees it; Anthropic's own documentation recommends emitting one event per refusal and one per fallback-served response. 4 Billing follows the serving model: each attempt is billed separately at the rates of the model that ran it, a refusal before output is not billed, and a fallback credit covers the prompt-cache cost of switching. 5 The full cost treatment of a 2.0x-priced model falling back to its half-price predecessor belongs to this series' rate-card paper, and we leave it there.
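Both behaviors reduce to a small classification over the HTTP 200 body. The sketch below assumes a parsed dictionary response, assumes fallback content blocks carry type == "fallback", and assumes the requested and serving model identifiers compare exactly; if the API returns a dated or aliased identifier, the comparison needs normalizing first. The event names and the monitoring_events helper are ours.

```python
def monitoring_events(requested_model: str, response: dict) -> list[str]:
    """Name the events to emit for one HTTP 200 response.

    Error-rate dashboards never see these cases, so each one is turned
    into an explicit event the caller can count.
    """
    if response.get("stop_reason") == "refusal":
        return ["refusal"]
    serving = response.get("model")
    if serving is None or serving == requested_model:
        return []
    content = response.get("content") or []
    has_block = any(block.get("type") == "fallback" for block in content)
    # A sticky-served turn carries no fallback block; only the model field
    # (and usage.iterations) reveals that another model answered.
    return ["fallback_served" if has_block else "fallback_served_sticky"]
```

Separating the sticky case keeps it countable: a tenant whose substitutions are mostly sticky-served would otherwise look, by fallback blocks alone, as if almost nothing had been substituted.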

The failure modes are documented too, and they belong in your runbook. If every model in the fallback chain declines, the response is the last model's refusal, with a message entry recording each earlier hop. 4 If the fallback model is rate limited or overloaded, the fallback attempt is not made: the refusal comes back instead, with stop_details.recommended_model naming a model to retry directly, so a fallback chain whose target is under-provisioned degrades to refusals exactly under load. 4 On streaming requests the retry happens on the same stream, and a mid-output decline on a non-streaming request silently drops the declined model's partial output while the fallback answers from scratch — the declined attempt survives only in usage.iterations. 4
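For a runbook, the documented outcomes can be reduced to a decision on each response. A hedged sketch: it assumes a parsed dictionary response and reads stop_details.recommended_model as named above; the action labels and the next_action helper are our own, and whether to retry at all is an operator policy choice, not something the vendor documentation is quoted as prescribing.

```python
from typing import Optional


def next_action(response: dict) -> tuple[str, Optional[str]]:
    """Map one response to a runbook action and, if relevant, a target model."""
    if response.get("stop_reason") != "refusal":
        return ("deliver", None)
    details = response.get("stop_details") or {}
    recommended = details.get("recommended_model")
    if recommended:
        # The fallback attempt was not made; a direct retry is possible,
        # but the recommended model may still be rate limited or overloaded.
        return ("retry_direct", recommended)
    # Either no fallback was configured or every model in the chain declined.
    return ("surface_refusal", None)
```

A retry against the recommended model is best paired with backoff, since the condition that suppressed the fallback attempt is load on that same model.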

None of this requires new infrastructure. The practitioner baseline for production LLM systems already expects per-request guardrail logs, OpenTelemetry-compatible traces, and drift detection over guardrail-trigger metrics. 14 The Fable 5 fields slot into that baseline. What we recommend, and this is our proposal rather than any vendor's guidance, is a four-field minimum per request: requested model, serving model, refusal category, and the iteration chain. Alert on tenant-level trigger-rate drift the way you alert on error-rate drift. Treat every fallback block as a change event with a retention policy, not as a curiosity in a debug log.
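The drift alert can be sketched over records that already hold the four-field minimum. Everything here is our proposal: the record keys, the definition of a triggered request (refused, or served by a model other than the one requested), and the tolerance values, which are placeholders to tune per tenant, not recommended thresholds.

```python
def trigger_rate(records: list[dict]) -> float:
    """Share of requests that were refused or served by a different model."""
    if not records:
        return 0.0
    triggered = sum(
        1
        for r in records
        if r["refusal_category"] is not None
        or r["serving_model"] != r["requested_model"]
    )
    return triggered / len(records)


def drifted(baseline: float, current: float, abs_tol: float = 0.02, rel_tol: float = 0.5) -> bool:
    """Flag a change that is large both in absolute terms and relative to baseline.

    Requiring both conditions avoids paging on tiny absolute moves around a
    near-zero baseline and on small relative moves around a large one.
    """
    delta = abs(current - baseline)
    return delta > abs_tol and delta > rel_tol * baseline
```

In use, the baseline would come from a trailing window for the same tenant and the current value from the most recent window, compared on a schedule like any other error-rate alert.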

Section 05

Eval Validity When the Serving Model Varies

Once any fraction of requests can be served by a different model, every benchmark number and every internal eval result becomes a mixture measurement. The vendor's own evaluation table concedes this. The system card notes that “Fable's scores reflect its production safeguards, including fallback to Opus 4.8, which is why certain benchmarks score slightly lower on Fable compared to Mythos,” and reports SWE-bench Pro at 80.3% for Mythos 5 against 80% for Fable 5. 2 Those are Anthropic-run-harness numbers with no independent reproduction at launch, and the gap between the two configurations is itself a fallback artifact. 2

The published rate cannot tell you your mixture. The “more than 95% of Fable sessions” figure has three structural weaknesses for eval design: the unit is sessions rather than requests, so one fallback anywhere in a session flags the whole session; the evidence base is pre-launch “early data”; and the figure is a one-sided bound with no point estimate. 1 Neither the announcement nor the system card defines what counts as a session; we checked the parsed system-card text and found no definition. 2 The full transparency critique of this metric belongs to a companion paper in this series; what matters here is narrower. A global average is uninformative for any single workload.

The session unit deserves one more turn of arithmetic, because it is the difference between the vendor's number and yours. A session that contains forty requests and one fallback counts the same as a session with one request and one fallback: both are a single flagged session in a session-denominated metric. For an agentic workload, where one user task fans out into dozens or hundreds of API calls, a per-session bound tells you almost nothing about the per-request substitution probability your audit trail will actually record. That arithmetic is ours, not Anthropic's; the vendor materials neither confirm nor deny it, because they do not define the unit. 2
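The arithmetic is easy to check on made-up data. The sketch below represents each session as a list of per-request booleans (True where a fallback occurred); the data are invented for illustration and say nothing about real traffic, and the fallback_rates helper is ours.

```python
def fallback_rates(sessions: list[list[bool]]) -> tuple[float, float]:
    """Return (share of sessions with any fallback, share of requests that fell back)."""
    total_requests = sum(len(s) for s in sessions)
    if not sessions or total_requests == 0:
        raise ValueError("need at least one non-empty session")
    flagged_sessions = sum(1 for s in sessions if any(s))
    fallback_requests = sum(sum(s) for s in sessions)
    return flagged_sessions / len(sessions), fallback_requests / total_requests


# The case from the text: forty requests with one fallback, and one request
# with one fallback. Both sessions are flagged; 2 of 41 requests fell back.
long_session = [False] * 39 + [True]
short_session = [True]
text_case = fallback_rates([long_session, short_session])

# The gap can also run the other way. One long session served entirely by
# the fallback model, plus nineteen clean single-request sessions, gives a
# 5% session rate while 100 of 119 requests were substituted.
sticky_case = fallback_rates([[True] * 100] + [[False]] * 19)
```

The second case is the one an agentic operator should keep in mind: a session-denominated bound of 5% is compatible with most requests being served by the fallback model, so only a per-request count from your own logs settles the question.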

Triggers concentrate by domain. Independent launch analyses make the workload-mix point concretely: a tenant whose traffic lives in safeguarded territory may pay the Fable 5 premium while receiving Opus 4.8 answers on much of it. 13 The empirical breadth evidence points the same way. SANS's Rob T. Lee reported that routine incident-response, detection, and basic forensics workflows auto-routed to Opus 4.8 in initial testing, with classifiers that broadly identify cybersecurity-related requests rather than distinguishing benign from malicious. 7 And Anthropic itself states bio/chem requests fall back on most requests at launch. 1 For a security or life-sciences workload, the eval target is effectively Opus 4.8 whether or not your benchmark says Fable 5 at the top.

The discipline we propose follows from the mechanics. Segment every eval and every production-quality metric by the response model field, and report per-serving-model pass rates rather than a single blended number. Measure your own tenant trigger rate before trusting any global figure, because the vendor's denominator is the vendor's mix, not yours. And carry over the stochastic hygiene the sibling series established for model-change evals: multiple runs, pass rates with variance, never a single anecdotal pass. 24 A 70% pass rate over mixed Fable/Opus traffic is a number about neither model.
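A minimal sketch of the segmentation, assuming each eval result has already been tagged with the serving model read from the response. The result shape ({"serving_model", "passed"}) and the pass_rates_by_model helper are our assumptions; the spread reported is a population standard deviation across runs, which is one reasonable choice among several.

```python
from collections import defaultdict
from statistics import mean, pstdev


def pass_rates_by_model(runs: list[list[dict]]) -> dict[str, tuple[float, float]]:
    """Per serving model: (mean pass rate across runs, std dev across runs).

    Each run is a list of results shaped {"serving_model": str, "passed": bool}.
    A run contributes to a model only if that model served at least one case.
    """
    per_model: dict[str, list[float]] = defaultdict(list)
    for run in runs:
        passed: dict[str, int] = defaultdict(int)
        total: dict[str, int] = defaultdict(int)
        for result in run:
            total[result["serving_model"]] += 1
            passed[result["serving_model"]] += int(result["passed"])
        for model, n in total.items():
            per_model[model].append(passed[model] / n)
    return {model: (mean(rates), pstdev(rates)) for model, rates in per_model.items()}
```

As an invented example, a ten-case run in which six cases are served by Fable 5 and all pass, while four are served by Opus 4.8 and one passes, blends to 70% but segments to 100% and 25%. The blended figure describes the classifier's routing mix as much as either model.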

Section 06

The Regulated Overlay: A Mid-Request Model Change Is Still a Model Change

Nothing in this section claims that any regulator has classified the Fable 5 fallback. The claim is narrower and more useful: the vocabulary already exists, and mapping the mechanism into it is mechanical. Start with model risk management. SR 11-7, the 2011 Federal Reserve and OCC guidance, built the canon: validation, ongoing monitoring, effective challenge, and a model inventory. 15 In April 2026 the agencies superseded it with SR 26-2, which carries the same spine forward — effective challenge as “critical analysis conducted by objective experts who evaluate model risk and effect appropriate changes throughout the model lifecycle,” and validation whose frequency scales with the “frequency and scope of model changes.” 16 A substitution of the serving model is a model change under that guidance, and a mechanism that can substitute on any request would make the frequency of model changes a per-request property. The inventory treatment we would propose is a coupled entry: Fable 5 plus Opus 4.8 as one fallback pair, not two unrelated rows.

The revised guidance also speaks to validation timing in terms this mechanism strains. SR 26-2 holds that validation generally occurs prior to a model's first use, and that performance deviating meaningfully from expectations should prompt consideration of adjustment, recalibration, or redevelopment. 16 A fallback pair inverts the usual sequencing: the first production use of Opus 4.8 on a given workload may be the moment a classifier routes to it, not a planned cutover your validation calendar anticipated. The practical consequence we draw is that validating the requested model alone is insufficient — the fallback target needs the same pre-use validation on your workload, because the classifier, not your change board, decides when it goes live.

NIST's AI RMF 1.0 is more specific still. MANAGE 4.1 calls for post-deployment monitoring plans “including mechanisms for capturing and evaluating input from users and other relevant AI actors, appeal and override, decommissioning, incident response, recovery, and change management.” 17 Score the release against that list and the asymmetry is visible: the consumer surface ships notification, and a channel to report incorrectly blocked requests, but no documented mechanism that reverses a classification. 3 The API ships opt-in control, and leaves monitoring entirely to the caller. 4

The conformance frameworks supply the paperwork format. ISO/IEC 42001 requires managed change and lifecycle control over AI systems as part of an auditable management system. 18 SOC 2's CC8.1 is the change-management criterion: changes authorized, designed, tested, approved, and documented. 19 The response-object fields of Section 4 are precisely the raw material those artifacts want. A logged chain of fallback blocks, with requested model, serving model, and category, would be a change record an auditor can replay; that is our mapping, not a certification claim.

The EU AI Act supplies the transparency overlay. Article 72 requires providers of high-risk AI systems to “actively and systematically collect, document and analyse relevant data” on performance throughout the system lifetime; it binds high-risk-system providers rather than general-purpose model providers, so it reaches Fable 5 indirectly, through the downstream deployers who build high-risk systems on it and must monitor what they ship. 20 Article 55(1)(c) binds providers of general-purpose models with systemic risk to “keep track of, document, and report, without undue delay” serious incidents to the AI Office, and Article 51's presumption threshold — training compute above 10^25 floating point operations — is plainly the territory a Mythos-class model occupies. 21 For a deployer, the practical reading is symmetrical with SR 26-2: a serving-model substitution is an event your post-market monitoring plan must be able to see, and the response object is where you see it. 4

Section 07

The Cautionary Precedent

The 2025 OpenAI episodes are the control group for the Fable 5 design, and they are worth retelling with dates. In late September 2025, OpenAI deployed a safety router in ChatGPT that detected emotionally sensitive conversations and switched mid-chat to GPT-5-thinking; per Nick Turley, routing happened per-message and temporarily, and ChatGPT would disclose the active model only when asked. 9 No trigger-rate percentage was published, and there was no way to turn the system off — contemporaneous coverage noted that “currently there is no option to turn off the system.” 10 Paying subscribers accused the company of treating adults like children. 9

The router saga that preceded it ran the same shape at larger scale. GPT-5 shipped in August 2025 not as a single model but as a network of variants behind a real-time router, and the backlash centered on degraded performance, lost control, and the suspicion that a cheaper model was being substituted without disclosure; OpenAI restored GPT-4o for paying users, fixed what it called buggy routing, and raised limits. 11 By December 2025 the router was quietly removed for free and Go tiers while remaining for paid subscribers, and the same router had been performing safety duty all along, forwarding conversations showing signs of psychological distress to models designed for them. 12 Routing, safety substitution, and capability substitution were one undisclosed mechanism, which is precisely why users could not tell a safety intervention from a cost optimization. 11

Set the three disclosure properties of Section 2 against those precedents and the contrast is structural. Named target: the announcement names Opus 4.8; the 2025 routers named nothing up front. 1 Published rate: Anthropic publishes a bound; OpenAI published none for the safety router. 9 Instrumentation: the response object records the substitution; the 2025 systems offered disclosure on request at best. 4 We do not claim Anthropic designed against the OpenAI backlash; no source we located states the motive, and the comparison should be read as structural, not causal. 9

Disclosure has not bought immunity from skepticism, and it should not. Handy AI's launch-day read was that “the fallback safeguard means you don't always know which model answered you.” 22 On the consumer surface, between notices, that is a fair description of the user experience. The governance answer is not to argue with the skepticism but to operationalize the checking: the response object exists so that “which model answered” is a query, not a feeling. 4

Section 08

The Honest Limits of Disclosure

Four limits bound everything above, and stating them plainly is part of the discipline. First, the published rate is unverifiable. The “more than 95%” figure is pre-launch internal data, measured in sessions, stated as a one-sided bound, with no third-party methodology or audit attached; it may be entirely accurate, and nothing published at launch lets anyone outside Anthropic confirm it. 1

Second, the fourth safeguard category is unverifiable by design. An intervention that degrades capability through prompt modification, steering vectors, or PEFT, with no fallback and no notification, leaves no user-observable trace; the vendor-estimated ~0.03% of traffic is disclosed in the system card and is, by construction, not independently measurable. 2 Third, notification is not appeal. The consumer surface tells you the switch happened and offers a channel to report incorrect blocks, but no documented mechanism exists to contest and reverse a classification, against the explicit MANAGE 4.1 expectation of appeal and override. 17

Fourth, the dogfooding limit. These papers are drafted on the model family they describe. Every Anthropic claim herein is therefore attributed rather than asserted, and no claim about the pipeline's own model is presented as independently established. The reader should apply the same discount to this paper's vendor-sourced facts that the paper applies to them.

Against those limits, the honest promise. This design gives an operator more to audit than any production router before it: a named target, a published bound, a structured refusal taxonomy, and a per-request record of who answered. 4 It does not yet give anyone a way to verify the published rate, to detect the hidden degradation, or to appeal a false positive. 1 A disclosed fallback is governable; a hidden degradation is only acknowledgeable. Your discipline has to cover both, because the next release decides which one you get.

This whitepaper is the technical companion to a short executive brief. When the Model Changes Mid-Request states the same argument in plain language for the reader who will never open the references below.

End of paper
