KellerAI White Paper · Frontier Tier Governance · June 2026
When the vendor grades itself.
Reading the “>95% of sessions” disclosure as a post-deployment-monitoring precedent — novel in kind, weak in form, and gradeable against grammars that already exist.
Audience: Engineering, risk & compliance leadership
Scope: Vendor safety-metric disclosure design
Method: Primary-source verification · regime transfer
Length: ~4,000 words · 29 references
Abstract
On 2026-06-09 Anthropic released Claude Fable 5 and Claude Mythos 5, which it describes as two configurations of one model: Fable 5 generally available with safeguards that block high-risk domains, Mythos 5 with those safeguards lifted for a small group of vetted partners 1. The launch post carries a number no frontier vendor has attached to a flagship release before: “Our early data shows that more than 95% of Fable sessions involve no fallback at all” 4.
We argue that this disclosure should be read as a post-deployment-monitoring precedent, and graded as one. Read that way it is novel in kind and weak in form: a one-sided bound on an undefined denominator, measured before launch, published once, with no per-category breakdown, no false-positive split, no cadence commitment, and no audit path 4. The same release carries a second intervention channel the headline number excludes entirely: the system card states that requests targeting frontier-LLM development are degraded through “prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT)” with no fallback and no user notification, at an estimated ~0.03% of traffic 5.
The paper performs two transfers. First, it scores the metric against the transparency grammar content moderation built over seven years of enforcement reporting 15 16 and against the critique literature that grammar accumulated 17. Second, it reads the disclosure through post-market surveillance regimes (FAERS's denominator failure and ASRS's incentive design 22) and derives the disclosure a credible version would contain. Between the transfers sits the regulated overlay: EU AI Act post-market-monitoring and serious-incident duties, the GPAI Code of Practice, and NIST MANAGE 4.1, none of which the voluntary metric satisfies 18 21. The paper closes with the consumer-side instrumentation that lets an operator compute the one number the vendor's global average cannot supply: their own intervention rate 23 24.
A short, executive version of this argument is available as the companion brief, When the Vendor Grades Itself.
A one-sided bound on an undefined denominator, measured before launch, published once. That is not a monitoring regime. It is a press-day number with a monitoring regime's costume.
The Metric as Published: Anatomy of the Disclosure
The governing sentence is short. “Our early data shows that more than 95% of Fable sessions involve no fallback at all” — and, of the safeguards, “they trigger, on average, in less than 5% of sessions” 4. The mechanism it summarizes is disclosed, not silent: “When Fable's classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs” 2. Read precisely, the disclosure makes five separable design choices — unit, evidence base, bound form, breakdown, and adjacent-channel scope — and each choice is favorable to the vendor.
The unit. The metric is denominated in sessions, not requests, and no Anthropic surface defines what a session is — we checked the announcement, the full system-card text, the support article, and the platform documentation 29. A session with one fallback anywhere in it sits outside the 95%, so the figure cannot be converted into a per-request rate, compared across surfaces, or recomputed by anyone outside the company 4. A workload of long agentic sessions and a workload of single-shot queries could both report “5% of sessions” while experiencing utterly different per-request realities 4. The problem with the unit is not that it inflates the number. It is that it makes the number unfalsifiable.
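A small arithmetic sketch makes the unit problem concrete. It assumes, purely for illustration, that a session is one conversation made of discrete requests. No Anthropic surface defines the term, so that definition, the `rates` helper, and both workloads are hypothetical.

```python
def rates(sessions):
    """Return (session_rate, request_rate) for a hypothetical workload.

    `sessions` is a list of sessions; each session is a list of booleans,
    one per request, where True means a fallback fired on that request.
    """
    flagged_sessions = sum(1 for s in sessions if any(s))
    total_requests = sum(len(s) for s in sessions)
    flagged_requests = sum(sum(s) for s in sessions)
    return flagged_sessions / len(sessions), flagged_requests / total_requests


# Workload A (hypothetical): 100 single-shot sessions; 5 of them fall back.
single_shot = [[True]] * 5 + [[False]] * 95

# Workload B (hypothetical): 100 agentic sessions of 200 requests each;
# 5 sessions contain exactly one fallback.
agentic = [[True] + [False] * 199] * 5 + [[False] * 200] * 95

a_session, a_request = rates(single_shot)
b_session, b_request = rates(agentic)

print(f"single-shot: {a_session:.2%} of sessions, {a_request:.3%} of requests")
print(f"agentic:     {b_session:.2%} of sessions, {b_request:.3%} of requests")
```

Under these assumptions both workloads report 5% of sessions, while their per-request rates differ by a factor of 200 (5% against 0.025%). The session figure alone cannot distinguish them.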
The evidence base. The qualifier is “our early data” — pre-launch internal measurement, not production telemetry 4. Anthropic also states its aim “to reduce false positives as we update and refine the safeguards after launch,” which means the classifiers generating the number are scheduled to change after the number was taken 6. The published figure describes neither today's classifiers nor today's traffic.
The bound and the breakdown. The figure is a one-sided bound with no point estimate and no per-category split 4. The announcement separately states that, “for the time being we have arranged for Fable to fall back to Opus 4.8 on most requests related to biology and chemistry” — in a different section than the 95% sentence 6. Assemble the two and you learn that the reassuring global average coexists with at least one category where fallback is the norm, not the exception. The breakdown that would reconcile them is not published 4.
The adjacent channel. The system card discloses a fourth safeguard category the headline excludes: “Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model” — effectiveness is limited through prompt modification, steering vectors, or PEFT, at an estimated “~0.03% of traffic, concentrated in fewer than 0.1% of organizations” 5. Note the unit: traffic — a third denomination, also undefined, alongside the announcement's sessions and the API's requests 29.
The placement of that fourth category is itself a disclosure choice. The announcement's notification promise — “users will be informed whenever this occurs” — reads as a statement about the safeguard apparatus as a whole, and the announcement nowhere mentions an intervention that is never notified 25. A reader of the launch post alone would reasonably conclude that every intervention is visible. Only the system card corrects them 5.
The system card also refines the fallback surface in ways the announcement does not: claude.ai falls back automatically with a notification and no configuration; the Messages API blocks by default and returns a structured refusal, with server-side fallback opt-in and “reflected in the response object”; some Claude interfaces emit a session event on every fallback 3. And the vendor monitors all of it on a substrate no customer sees: Anthropic requires 30-day retention “for all traffic on Mythos-class models” in part to defend against “attacks that operate across many requests” and to “identify and reduce false positives” 7. Anthropic can measure its own false-positive rate. Nothing in the launch materials commits it to publishing one 7.
Mainstream coverage reproduced the figure essentially verbatim. TechCrunch relayed “early data showing at least 95% of Fable sessions running entirely on the model's own responses” without interrogating the denominator, the evidence base, or the audit status 27. The number passed into the record ungraded. Grading it is the rest of this paper.
Novel in Kind: The Precedent Landscape
To grade the disclosure you first need the prior art, and the prior art is thin. What frontier vendors normally publish are evaluation-set rates. Anthropic's Opus 4.5 system card reports the model refusing 88.39% of harmful requests against 66.96% for Opus 4.1; the Opus 4.6 card reports a 0.04% over-refusal rate against 8.50% for Sonnet 4.5 11. These are real measurements, but their denominator is an eval set the vendor constructed — not live traffic.
The same practice holds across the industry. OpenAI's GPT-5 system card reports per-category “not_unsafe” rates of 0.995 to 0.999, graded by LLM judges against OpenAI's own policy, alongside its shift from hard refusals to safe-completions 12. Google's Gemini model cards report eval-based safety results only, and Google drew criticism for shipping Gemini 2.5 Pro without a timely safety report at all 12. An eval-set rate answers a design question: how does the system behave on the prompts we chose to test? A production rate answers an operational one: how often does the system intervene on the traffic customers actually send? The two are not substitutes, and a governance reviewer who accepts one as the other has accepted a category error 11 12.
Production rates are rare. The closest precedent is OpenAI's sensitive-conversations disclosure of October 2025: an estimated ~0.07% of active users showing possible signs of psychosis or mania, and 0.15% with conversations containing explicit suicide-planning indicators, against roughly 800 million weekly users 13. But that figure quantifies the prevalence of user behavior. The Fable 5 figure quantifies the rate of vendor intervention: how often the safety apparatus itself fires. Those are different objects, and the second had never had a launch-day number before 13 4.
The nearest mechanism precedent published no rate at all. In September 2025 OpenAI began routing emotionally sensitive ChatGPT conversations mid-chat to GPT-5-thinking, on a per-message and temporary basis, with the active model disclosed only when the user asks — and no trigger-rate percentage has ever been published for it 10. Anthropic's own prior work fits the same pattern: Clio analyzed roughly one million conversations and published per-cluster flag rates but no global production intervention rate, and Constitutional Classifiers published eval-set block rates as a one-time research result 14.
So the novelty judgment is two-sided. The Fable 5 figure is, to our knowledge, the first frontier-launch headline metric quantifying how often a vendor-side safety intervention fires across all usage of a flagship model, attached to a notification promise 4 2. That is genuinely new, and against OpenAI's rateless routing it is strictly more transparent 10. It is also weak in form, in every dimension Section 02 catalogued. Both halves matter, because the next vendor will copy the precedent as shipped, not as intended.
Weak in Form: The Content-Moderation Grammar
A vendor publishing a rate about its own enforcement system is not a new genre. Content moderation has been doing it, quarterly and publicly, since 2018 — and the grammar it converged on defines what a credible intervention metric contains. Meta's Community Standards Enforcement Report publishes per-policy content actioned, a proactive rate, appeals and restores, and — as its flagship metric — prevalence: violating views as a share of all views, for example 0.15-0.16% for violent and graphic content 15. YouTube's enforcement report publishes removals by policy and by detection method, the Violative View Rate, and appeal and reinstatement counts 16. Five elements recur: a defined denominator, exposure weighting, a per-policy breakdown, a fixed cadence, and an appeals channel.
That grammar also accumulated a critique corpus, and the critique transfers to AI safety metrics almost without editing. The DSA Observatory argues that platform transparency reports withhold the two numbers that measure accuracy — precision and recall — “even though companies already calculate evaluation metrics internally,” and that appeal rates are a poor accuracy proxy 17. The Center for Countering Digital Hate documents how platform-chosen metrics flatter enforcement performance, and an audit of the DSA Transparency Database found self-reports heterogeneous and hard to compare; count-based metrics “cannot tell whether content moderation systems are accurate, proportionate, or effective” 17.
Map the critique onto Fable 5 and it lands squarely. “Less than 5% of sessions” is an action count 4. It says nothing about precision — what fraction of triggers were false positives — or recall — what fraction of genuinely dangerous sessions were caught 17. Anthropic concedes the precision problem qualitatively: the safeguards are tuned conservatively, will sometimes catch harmless requests, and currently send most biology and chemistry requests to Opus 4.8 6. The quantitative version of that concession is exactly what the disclosure omits 4.
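The gap between an action count and an accuracy measure can be shown with two invented confusion matrices. Every count below is hypothetical and chosen only so that both classifiers fire on the same share of sessions; nothing here is drawn from vendor data.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    true_pos: int   # dangerous sessions that triggered the safeguard
    false_pos: int  # benign sessions that triggered the safeguard
    false_neg: int  # dangerous sessions that did not trigger
    true_neg: int   # benign sessions that did not trigger

    def action_rate(self) -> float:
        """Share of all sessions on which the safeguard fired."""
        total = self.true_pos + self.false_pos + self.false_neg + self.true_neg
        return (self.true_pos + self.false_pos) / total

    def precision(self) -> float:
        """Share of triggers that were genuinely dangerous."""
        return self.true_pos / (self.true_pos + self.false_pos)

    def recall(self) -> float:
        """Share of genuinely dangerous sessions that were caught."""
        return self.true_pos / (self.true_pos + self.false_neg)


# Two hypothetical classifiers over 1,000 sessions each.
accurate = Confusion(true_pos=45, false_pos=5, false_neg=5, true_neg=945)
noisy = Confusion(true_pos=5, false_pos=45, false_neg=45, true_neg=905)

for name, c in (("accurate", accurate), ("noisy", noisy)):
    print(name, c.action_rate(), c.precision(), c.recall())
```

Both hypothetical classifiers trigger on 5% of sessions. One has 90% precision and recall, the other 10%. A published action rate is compatible with either.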
Score the metric against the five elements. Defined denominator: no — the session is undefined 29. Exposure weighting: no — a flat session count, not an impact-weighted rate 4. Per-policy breakdown: no — one aggregate spanning cyber, bio-chem, and distillation 4. Fixed cadence: no — one launch-day figure, with a tuning commitment guaranteed to change the underlying rate silently 6. Appeals channel: no documented contest path, although the notification at least tells the user the intervention happened — which is more than most moderation actions offered in 2018 8.
The notification deserves a precise compliment, because it is doing different work than an appeal. A notification discloses that the system intervened; an appeal channel corrects the system when it intervened wrongly, and its reversal rate doubles as a crude public precision signal; that is what Meta's restores and YouTube's reinstatement counts provide 15 16. Fable 5 ships the disclosure half without the correction half: the support article tells users the checks may flag normal content and offers no path to contest a flag 8.
The maturity comparison is the uncomfortable part. Content moderation took roughly seven years (2018 to the DSA-era critique) to evolve from raw action counts to exposure-weighted prevalence plus appeals, and it is still missing precision and recall 15 17. The Fable 5 metric enters at the 2018 maturity level: a single self-reported action rate on an undefined denominator, with a notification but no contest path 4 8. The seven-year curve is sitting there, already paid for, free to copy.
The field being copied already wrote the critique of the copy.
Design Lessons from Older Surveillance Regimes
Two regimes much older than content moderation specify the remaining design variables. The FDA's FAERS database shows what an intervention count without a denominator cannot do: because reporting is spontaneous and largely voluntary, “neither the prevalence nor incidence of an adverse event can be calculated” from it, underreporting is severe and skewed, and the system is positioned as hypothesis-generating only 22. An event count is interpretable only against a defined exposure denominator — precisely the piece the Fable 5 session metric leaves undefined 29.
NASA's Aviation Safety Reporting System shows that reporter incentives and stewardship are design variables, not givens. ASRS is voluntary, confidential, and non-punitive, with limited immunity from FAA enforcement under AC 00-46E, and it is administered by NASA precisely because a neutral third party with no enforcement power increases report flow; it publishes searchable incident narratives, not only aggregates 22. Four regimes, five dimensions:
| Dimension | FAERS | ASRS | Meta CSER | Fable 5 metric |
|---|---|---|---|---|
| Denominator | None (known flaw) | None (narrative system) | Views (prevalence) | “Sessions,” undefined |
| Cadence | Continuous + quarterly | Continuous | Quarterly | One-time, launch day |
| Reporter incentive | Mandatory / voluntary | Immunity + confidentiality | Internal telemetry | Internal telemetry only |
| Independent steward | FDA | NASA (neutral) | None (self-report) | None |
| Public granularity | Case-level | Narrative-level | Per-policy aggregates | Single aggregate bound |
The table cites the regimes' own documentation for the first three columns and Sections 02 and 04 for the last 22 15 4. Read across the rows and the Fable 5 column combines the weakest cell of each neighbor: FAERS's missing denominator, the self-report problem Meta's regime is criticized for, none of ASRS's incentive design, and a granularity coarser than any of the three 22 15 4.
From the table falls a specification — and we state plainly that this is the disclosure we recommend, not one any vendor has committed to; as of 2026-06-09 no Anthropic surface commits to any element of it 29. A credible safeguard-transparency disclosure would: define the session denominator; commit to a reporting cadence; break out per-trigger-category rates; operate a false-positive and appeal channel whose incentives favor reporting; and submit the figure to an independent steward or auditor 22 17.
The incentive element deserves the emphasis ASRS earned. A user whose benign request trips the classifier today receives a notification and has nowhere incentive-aligned to send the false-positive report — the support article tells them the checks “may flag safe, normal content,” and the loop ends there 8. Meanwhile the vendor runs its own false-positive review on 30 days of retained traffic 7. Anthropic has, in effect, built itself a private ASRS and published none of its outputs 7.
The Regulated Overlay: Art. 72, Art. 55(1)(c), the Code of Practice, and MANAGE 4.1
The voluntary metric sits adjacent to instruments that are not voluntary, and the adjacency is instructive. EU AI Act Article 72 requires providers of high-risk AI systems to “actively and systematically collect, document and analyse relevant data… on the performance of high-risk AI systems throughout their lifetime,” under a documented post-market monitoring plan, with a Commission template due by 2 February 2026 18. Article 72 binds providers of high-risk systems, not GPAI model providers as such — so Fable 5 maps onto it by analogy, and concretely through every downstream deployer who embeds it in a high-risk system 18.
Anthropic's own binding channel runs through the GPAI provisions. Article 55(1)(c) requires providers of general-purpose models with systemic risk to “keep track of, document, and report, without undue delay, to the AI Office and, as appropriate, to national competent authorities” relevant information about serious incidents; Article 51(2) presumes high-impact capability above 10^25 training FLOP, a threshold plainly implicated by a model whose vendor says its capabilities exceed anything it has ever made generally available 19 1. The obligations have applied since 2 August 2025, a Commission serious-incident reporting template exists, and Article 73 carries the parallel duty for high-risk-system providers; legal analyses place Commission fining powers from August 2026 19.
The GPAI Code of Practice's Safety and Security chapter, finalized in July 2025, makes the incident half concrete for its signatories: systemic-risk identification, assessment, and mitigation commitments, serious-incident reporting on staggered deadlines down to two days for critical-infrastructure disruption, and incident documentation retained at least five years 20. Whether Anthropic is a signatory is not stated on the Code's overview page, and we did not locate a primary statement of its status — we flag that as unverified rather than assume it 20.
Against these instruments, the classification of the 95% figure is clear: an aggregate intervention rate is neither a serious incident report nor a risk-assessment artifact, though it could serve as supporting evidence inside a safety framework 20 18. The best reading is that the metric voluntarily implements part of Article 72's spirit (systematic collection of lifetime performance data) while committing to none of the cadence, completeness, or regulator-facing reporting the binding instruments require 18 19.
NIST's AI Risk Management Framework completes the overlay. The metric is a Measure-function output. But MANAGE 4.1 asks for more: “Post-deployment AI system monitoring plans are implemented, including mechanisms for capturing and evaluating input from users and other relevant AI actors, appeal and override, decommissioning, incident response, recovery, and change management” 21. Fable 5 supplies the measurement with no documented appeal or override mechanism for fallback decisions 21 8.
The operator consequence is direct. A deployer embedding Fable 5 in a high-risk system inherits Article 72 duties it cannot discharge by pointing at a one-time vendor bound on an undefined denominator 18 4. The vendor-risk review should therefore ask the questions the disclosure leaves open (what is a session, on what cadence will the figure be refreshed, which categories drive it, and who has examined the telemetry) and treat every unanswered one as monitoring scope the deployer must cover itself 4 29. Generating that monitoring data yourself is Section 07.
The Workload-Mix Problem and Instrumenting Your Side
A global average is uninformative for any single tenant when triggers concentrate by domain, and the launch week supplied the evidence. SANS Institute's Rob T. Lee reported that routine incident-response, detection, and basic forensics workflows were automatically routed from Fable 5 to Opus 4.8 in his initial testing, and that the classifiers “broadly identify cybersecurity-related requests rather than attempting to distinguish between benign and malicious”; Anthropic's response is that the tuning is “intentionally conservative” 25. A launch-day Hacker News user reported a benign internal business-prospecting workload tripping the classifier, with the notification preserved verbatim: “Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well” 28.
The economics follow. Launch commentary states it directly: a workload heavy in safeguarded territory “may pay the Fable 5 premium while receiving Opus 4.8 answers, in which case routing directly to Opus 4.8 is both cheaper and equivalent” 24. As an analytic illustration — not a measurement — a 5% global session rate is fully consistent with a security shop seeing 40% of its own sessions flagged, because the global figure averages over workloads that never touch the trigger domains 24 25. Your rate is not the vendor's rate. The vendor's number cannot tell you what your rate is.
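The averaging argument can be checked with a few lines of arithmetic. The workload names and session counts below are invented for illustration, in the same spirit as the 40% figure itself; they are not measurements.

```python
# Hypothetical traffic mix: (total sessions, sessions with at least one fallback).
workloads = {
    "security_shop": (5_000, 2_000),   # 40% of its own sessions flagged
    "everyone_else": (95_000, 1_900),  # 2% of sessions flagged
}

# Per-workload rate: what each tenant actually experiences.
local_rates = {
    name: flagged / total for name, (total, flagged) in workloads.items()
}

# Global rate: the session-weighted average a vendor would report.
total_sessions = sum(total for total, _ in workloads.values())
total_flagged = sum(flagged for _, flagged in workloads.values())
global_rate = total_flagged / total_sessions

print(f"global: {global_rate:.1%}")
for name, rate in local_rates.items():
    print(f"{name}: {rate:.1%}")
```

With this invented mix the global figure stays under 5% while the security workload sits at 40%, because the heavily flagged tenant contributes only a small share of all sessions.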
What makes the consumer-side answer feasible is that the disclosed channels are observable; this is the corrected framing this series is built on. Consumer surfaces show a switch notice and label each response with the model that produced it 8. The Messages API returns a refusal as HTTP 200 with stop_reason: "refusal" and a structured stop_details.category of “cyber”, “bio”, or “reasoning_extraction”; opt-in server-side fallback marks every model boundary with a fallback content block, records each attempt in usage.iterations, and reports the serving model in the top-level model field 9. Anthropic's own docs make the monitoring argument for us: a refusal is an HTTP 200, so error-rate dashboards never see it: “instrument refusals as their own signal” 9. The full API mechanics are the companion FTG-1 paper's lane, When the Model Changes Mid-Request.
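As a minimal sketch of what reading those fields might look like, the function below reduces one response payload to the signals worth logging. It works on a plain dictionary rather than any SDK object. The field names follow the description above, but the exact payload shape is an assumption: in particular, that a fallback block carries `type == "fallback"` and that `usage.iterations` is a list. The model identifiers are placeholders, not real model IDs.

```python
SAFEGUARD_CATEGORIES = {"cyber", "bio", "reasoning_extraction"}


def intervention_signal(response: dict) -> dict:
    """Reduce one response payload (assumed shape) to loggable intervention fields."""
    stop_details = response.get("stop_details") or {}
    content = response.get("content") or []
    usage = response.get("usage") or {}

    refused = response.get("stop_reason") == "refusal"
    # Assumption: a fallback content block is identifiable by type == "fallback".
    fell_back = any(block.get("type") == "fallback" for block in content)
    category = stop_details.get("category") if refused else None

    return {
        "serving_model": response.get("model"),
        "refused": refused,
        "category": category,
        "known_category": category in SAFEGUARD_CATEGORIES,
        "fell_back": fell_back,
        # Assumption: usage.iterations is a list with one entry per attempt.
        "attempts": len(usage.get("iterations") or []),
        "intervened": refused or fell_back,
    }


# Hypothetical payloads for illustration only.
refusal = {
    "model": "fable-placeholder",
    "stop_reason": "refusal",
    "stop_details": {"category": "cyber"},
    "content": [],
}
fallback = {
    "model": "opus-placeholder",
    "stop_reason": "end_turn",
    "content": [{"type": "fallback"}, {"type": "text", "text": "..."}],
    "usage": {"iterations": [{}, {}]},
}

print(intervention_signal(refusal))
print(intervention_signal(fallback))
```

Because a refusal arrives as a successful response, a check like this has to run on every response body, not only on error paths.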
The practitioner baseline already expects this discipline for guardrails you operate yourself: per-request block/allow logs, OpenTelemetry-compatible traces, drift detection over trigger metrics, and the rule that “teams should measure false positive rates and tune rules instead of turning them off” 23. What follows extends that baseline to a guardrail you do not operate. It is the instrumentation we recommend — a proposed discipline, not a shipped product 23.
1. Log every vendor-intervention signal (fallback notices, responding-model identity, refusal stop-reasons and categories) as first-class telemetry with tenant and workload labels, using the response fields the vendor already exposes 9 23.
2. Compute per-tenant, per-workload intervention rates and alert on drift. Anthropic states it will “update and refine the safeguards after launch,” so the vendor's safety layer is an unversioned dependency: your rate can change without any change on your side 6. There is no version pin, no changelog, and no deprecation window for a classifier update; drift in your own intervention rate is the only signal that the dependency changed 6 9.
3. Feed the rate into routing economics. Above a threshold, route the workload directly to Opus 4.8, which is cheaper and, on the vendor's own account, equivalent for safeguarded turns 24. The billing mechanics of a fallback-served response belong to the companion FTG-4 paper, When the Rate Card Has Tiers.
4. Carry the unnotified channel as residual risk. The ~0.03% frontier-LLM-development degradation leaves no consumer-observable trace by design: no fallback, no notification 5. No consumer-side control closes it; only vendor reporting or third-party audit can 5.
The four steps recover, tenant by tenant, the disclosure the vendor did not publish: a defined denominator (your sessions), a cadence (your alerting window), a per-category breakdown (your refusal categories), and a false-positive record (your labeled triggers) 9 23. The fifth element, the independent steward, you cannot build alone 22.
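The rate, drift, and routing steps can be sketched in a few functions. This is an illustration of the proposed discipline, not a shipped implementation: the event tuple format, the function names, and both thresholds (`tolerance` and `threshold`) are hypothetical values an operator would have to choose for themselves.

```python
from collections import defaultdict


def intervention_rates(events):
    """Compute intervention rate per (tenant, workload).

    `events` is an iterable of (tenant, workload, intervened) tuples,
    one per session, where `intervened` is True if any safeguard fired.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [intervened, total]
    for tenant, workload, intervened in events:
        entry = counts[(tenant, workload)]
        entry[0] += int(intervened)
        entry[1] += 1
    return {key: hit / total for key, (hit, total) in counts.items()}


def drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a change in rate larger than a chosen (hypothetical) tolerance."""
    return abs(current - baseline) > tolerance


def should_route_direct(rate: float, threshold: float = 0.30) -> bool:
    """Above a chosen (hypothetical) threshold, send the workload elsewhere."""
    return rate >= threshold


# Hypothetical session log for one tenant with two workloads.
events = (
    [("acme", "soc", True)] * 4
    + [("acme", "soc", False)] * 6
    + [("acme", "sales", False)] * 10
)
current = intervention_rates(events)
baseline = {("acme", "soc"): 0.10, ("acme", "sales"): 0.0}

for key, rate in current.items():
    print(key, rate, drifted(baseline[key], rate), should_route_direct(rate))
```

In this invented log the security-operations workload moves from a 10% baseline to 40%, which trips both the drift check and the routing threshold, while the sales workload is untouched. The unnotified channel leaves no event to log, so it stays outside anything this sketch can measure.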
Honest Limits
This paper's own evidence has edges, and stating them is what separates the paper from the disclosure it critiques. The “>95%” and “~0.03%” figures are vendor-reported and unverifiable by construction: the telemetry that produced them is vendor-internal, and no audit or replication path exists as of 2026-06-09 4 5. Every use of either number in this paper is an attributed vendor statement, not an established fact.
We parsed the full system-card PDF on 2026-06-09 rather than relying on the truncated extracts that circulated at launch. The degradation estimate reads, verbatim, “~0.03% of traffic, concentrated in fewer than 0.1% of organizations” — resolving the unit question, but resolving it to “traffic,” a denomination as undefined as “sessions” 5 29. No source we checked — announcement, system card, support article, platform documentation — defines the session denominator; the absence is a finding, not a gap in our search 29.
The ~40% local-skew figure in Section 07 is an analytic illustration grounded in launch-week reporting, not a measurement; no published per-workload trigger-rate data exists yet 24 25. The five-element disclosure specification and the four-step instrumentation are proposals: no vendor has committed to the former and no shipped implementation of the latter exists that we know of 29 23.
Credit must land where it is due. This is, on our reading of the precedent landscape, the most transparent production-intervention number any frontier vendor has attached to a launch, and OpenAI's rateless, disclosed-on-request routing remains the worse baseline 10. The critique here is of the disclosure's form, not of the act of disclosing. A vendor that publishes a weak number can be asked for a better one. A vendor that publishes none cannot.
Last, the discourse gap is itself a dated observation. CyberScoop's launch analysis pressed on jailbreak durability but raised no demand for a denominator, a cadence, or an audit, and we located no published commentary making the content-moderation transfer this paper performs 26. Independent academic analysis typically lags a launch by weeks; this paper's novelty claims should be re-checked against the literature before being repeated 26.
For the executive version of this argument (the number you cannot check, the security-shop story, and the four moves you can make anyway), read the companion brief, When the Vendor Grades Itself: The Safety Number You Cannot Check.