When the Vendor Grades Itself: Reading the ">95% of Sessions" Disclosure as a Post-Deployment-Monitoring Precedent

KellerAI

The safety number nobody can check

The first vendor-published production intervention rate, graded against the transparency grammars and surveillance regimes that already wrote the rulebook.

KellerAI White Paper · In-Depth · Regulation & Compliance · Jun 2026 · ~17 min read

Context

Anthropic attached an unprecedented number to the Claude Fable 5 launch: the vendor states that more than 95% of Fable sessions involve no safeguard fallback at all. Mainstream coverage relayed the figure essentially verbatim. Nobody asked what a session is, when the number would be refreshed, or who had examined the telemetry behind it.

Older fields already wrote the rulebook for grading a self-reported enforcement metric. Content moderation spent seven years converging on defined denominators, fixed cadences, per-policy breakdowns, and appeals channels — and its critics still demand the precision and recall the platforms withhold. Pharmacovigilance and aviation safety learned the same lessons about denominators, reporter incentives, and independent stewardship decades earlier.

The Finding

Scored against that grammar, the Fable 5 metric satisfies none of the five elements: the session denominator is undefined on every Anthropic surface we checked, the count is flat rather than exposure-weighted, the figure is a one-sided bound with no per-category breakdown, no cadence is committed, and there is no contest path. The evidence base is pre-launch "early data." The same release ships a fourth safeguard the headline excludes: frontier-LLM-development requests degraded at a reported ~0.03% of traffic, with no fallback and no notification.

The operator's move is consumer-side. The disclosed channels are observable: consumer surfaces label the responding model, and the Messages API returns structured refusals and marks every fallback boundary in the response object. That makes your own per-workload intervention rate computable — the one number the vendor's global average cannot supply — while the unnotified degradation channel remains residual risk that only vendor reporting or third-party audit can close.

Tags: Safeguard transparency · Post-deployment monitoring · AI vendor governance


Cite this paper

KellerAI. (2026, June 9). When the Vendor Grades Itself: Reading the ">95% of Sessions" Disclosure as a Post-Deployment-Monitoring Precedent. KellerAI. https://kellerai.blog/when-the-vendor-grades-itself-in-depth

Paper Details

Category: Regulation & Compliance
Audience: Senior engineering, risk, and compliance leaders in regulated industries
Method: Primary-source verification of the launch materials and the parsed system card; transfer analysis from content-moderation transparency reporting and post-market surveillance regimes; regulatory clause mapping against EU AI Act, GPAI Code of Practice, and NIST AI RMF primary texts
Length: ~4,000 words · ~17 min
Reading level: Technical
Sections: 8
References: 29
Version: v1.0 · Updated Jun 2026
Published: Jun 2026

Key Takeaways

- Anthropic's ">95% of sessions" figure is the first vendor-published production intervention rate for a frontier launch, and it comes with no defined denominator, no cadence, no breakdown, no appeal path, and no audit.
- The same release ships a fourth safeguard the headline excludes: frontier-LLM-development requests degraded at a reported ~0.03% of traffic, with no fallback and no user notification.
- A global average is not your rate: log the vendor's refusal and fallback signals, compute per-workload intervention rates, route to Opus 4.8 above a threshold, and carry the unnotified channel as residual risk.


KellerAI White Paper · Frontier Tier Governance · June 2026

When the vendor grades itself.

Reading the “>95% of sessions” disclosure as a post-deployment-monitoring precedent — novel in kind, weak in form, and gradeable against grammars that already exist.

Audience: Engineering, risk & compliance leadership
Scope: Vendor safety-metric disclosure design
Method: Primary-source verification · regime transfer
Length: ~4,000 words · 29 references

Section 01

Abstract

On 2026-06-09 Anthropic released Claude Fable 5 and Claude Mythos 5, which it describes as two configurations of one model: Fable 5 generally available with safeguards that block high-risk domains, Mythos 5 with those safeguards lifted for a small group of vetted partners 1. The launch post carries a number no frontier vendor has attached to a flagship release before: “Our early data shows that more than 95% of Fable sessions involve no fallback at all” 4.

We argue that this disclosure should be read as a post-deployment-monitoring precedent, and graded as one. Read that way it is novel in kind and weak in form: a one-sided bound on an undefined denominator, measured before launch, published once, with no per-category breakdown, no false-positive split, no cadence commitment, and no audit path 4. The same release carries a second intervention channel the headline number excludes entirely: the system card states that requests targeting frontier-LLM development are degraded through “prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT)” with no fallback and no user notification, at an estimated ~0.03% of traffic 5.

The paper performs two transfers. First, it scores the metric against the transparency grammar content moderation built over seven years of enforcement reporting 15 16 and against the critique literature that grammar accumulated 17. Second, it reads the disclosure through post-market surveillance regimes — FAERS's denominator failure and ASRS's incentive design 22 — and derives the disclosure a credible version would contain. Between the transfers sits the regulated overlay: EU AI Act post-market-monitoring and serious-incident duties, the GPAI Code of Practice, and NIST MANAGE 4.1, none of which the voluntary metric satisfies 18 21. The paper closes with the consumer-side instrumentation that lets an operator compute the one number the vendor's global average cannot supply: their own intervention rate 23 24.

A short, executive version of this argument is available as the companion brief, When the Vendor Grades Itself.

A one-sided bound on an undefined denominator, measured before launch, published once. That is not a monitoring regime. It is a press-day number in a monitoring regime's costume.

The thesis

Section 02

The Metric as Published: Anatomy of the Disclosure

The governing sentence is short. “Our early data shows that more than 95% of Fable sessions involve no fallback at all” — and, of the safeguards, “they trigger, on average, in less than 5% of sessions” 4. The mechanism it summarizes is disclosed, not silent: “When Fable's classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs” 2. Read precisely, the disclosure makes five separable design choices — unit, evidence base, bound form, breakdown, and adjacent-channel scope — and each choice is favorable to the vendor.

The unit. The metric is denominated in sessions, not requests, and no Anthropic surface defines what a session is — we checked the announcement, the full system-card text, the support article, and the platform documentation 29. A session with one fallback anywhere in it sits outside the 95%, so the figure cannot be converted into a per-request rate, compared across surfaces, or recomputed by anyone outside the company 4. A workload of long agentic sessions and a workload of single-shot queries could both report “5% of sessions” while experiencing utterly different per-request realities 4. The problem with the unit is not that it inflates the number. It is that it makes the number unfalsifiable.
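
A small sketch makes the conversion problem concrete. It assumes, purely for illustration and not as anything Anthropic has stated, that each request in a session triggers independently with the same probability p, so that a session of n requests contains at least one fallback with probability 1 - (1 - p)^n. The function name and the session lengths are hypothetical.

```python
def per_request_rate(session_rate: float, requests_per_session: int) -> float:
    """Per-request trigger probability implied by a session-level rate.

    Illustrative assumption: requests trigger independently and with equal
    probability. Real traffic is unlikely to be this tidy.
    """
    if not 0.0 <= session_rate < 1.0:
        raise ValueError("session_rate must be in [0, 1)")
    if requests_per_session < 1:
        raise ValueError("requests_per_session must be >= 1")
    # P(session flagged) = 1 - (1 - p)^n, so p = 1 - (1 - P)^(1/n)
    return 1.0 - (1.0 - session_rate) ** (1.0 / requests_per_session)


# Hypothetical workloads that would both report "5% of sessions".
single_shot = per_request_rate(0.05, 1)    # one request per session
long_agentic = per_request_rate(0.05, 50)  # fifty requests per session
```

Under these assumptions the single-shot workload sees a 5% per-request rate while the 50-request workload sees roughly 0.1%, which is why the session figure alone says little about per-request experience.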

The evidence base. The qualifier is “our early data” — pre-launch internal measurement, not production telemetry 4. Anthropic also states its aim “to reduce false positives as we update and refine the safeguards after launch,” which means the classifiers generating the number are scheduled to change after the number was taken 6. The published figure describes neither today's classifiers nor today's traffic.

The bound and the breakdown. The figure is a one-sided bound with no point estimate and no per-category split 4. The announcement separately states that, “for the time being we have arranged for Fable to fall back to Opus 4.8 on most requests related to biology and chemistry” — in a different section than the 95% sentence 6. Assemble the two and you learn that the reassuring global average coexists with at least one category where fallback is the norm, not the exception. The breakdown that would reconcile them is not published 4.

The adjacent channel. The system card discloses a fourth safeguard category the headline excludes: “Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model” — effectiveness is limited through prompt modification, steering vectors, or PEFT, at an estimated “~0.03% of traffic, concentrated in fewer than 0.1% of organizations” 5. Note the unit: traffic — a third denomination, also undefined, alongside the announcement's sessions and the API's requests 29.

The placement of that fourth category is itself a disclosure choice. The announcement's notification promise — “users will be informed whenever this occurs” — reads as a statement about the safeguard apparatus as a whole, and the announcement nowhere mentions an intervention that is never notified 2 5. A reader of the launch post alone would reasonably conclude that every intervention is visible. Only the system card corrects them 5.

The system card also refines the fallback surface in ways the announcement does not: claude.ai falls back automatically with a notification and no configuration; the Messages API blocks by default and returns a structured refusal, with server-side fallback opt-in and “reflected in the response object”; some Claude interfaces emit a session event on every fallback 3. And the vendor monitors all of it on a substrate no customer sees: Anthropic requires 30-day retention “for all traffic on Mythos-class models” in part to defend against “attacks that operate across many requests” and to “identify and reduce false positives” 7. Anthropic can measure its own false-positive rate. Nothing in the launch materials commits it to publishing one 7.

Mainstream coverage reproduced the figure essentially verbatim. TechCrunch relayed “early data showing at least 95% of Fable sessions running entirely on the model's own responses” without interrogating the denominator, the evidence base, or the audit status 27. The number passed into the record ungraded. Grading it is the rest of this paper.

Section 03

Novel in Kind: The Precedent Landscape

To grade the disclosure you first need the prior art, and the prior art is thin. What frontier vendors normally publish are evaluation-set rates. Anthropic's Opus 4.5 system card reports the model refusing 88.39% of harmful requests against 66.96% for Opus 4.1; the Opus 4.6 card reports a 0.04% over-refusal rate against 8.50% for Sonnet 4.5 11. These are real measurements, but their denominator is an eval set the vendor constructed — not live traffic.

The same practice holds across the industry. OpenAI's GPT-5 system card reports per-category “not_unsafe” rates of 0.995 to 0.999, graded by LLM judges against OpenAI's own policy, alongside its shift from hard refusals to safe-completions 12. Google's Gemini model cards report eval-based safety results only, and Google drew criticism for shipping Gemini 2.5 Pro without a timely safety report at all 12. An eval-set rate answers a design question: how does the system behave on the prompts we chose to test? A production rate answers an operational one: how often does the system intervene on the traffic customers actually send? The two are not substitutes, and a governance reviewer who accepts one as the other has accepted a category error 11 12.

Production rates are rare. The closest precedent is OpenAI's sensitive-conversations disclosure of October 2025: an estimated ~0.07% of active users showing possible signs of psychosis or mania, and 0.15% with conversations containing explicit suicide-planning indicators, against roughly 800 million weekly users 13. But that figure quantifies the prevalence of user behavior. The Fable 5 figure quantifies the rate of vendor intervention — how often the safety apparatus itself fires. Those are different objects, and the second had never had a launch-day number before 13 4.

The nearest mechanism precedent published no rate at all. In September 2025 OpenAI began routing emotionally sensitive ChatGPT conversations mid-chat to GPT-5-thinking, on a per-message and temporary basis, with the active model disclosed only when the user asks — and no trigger-rate percentage has ever been published for it 10. Anthropic's own prior work fits the same pattern: Clio analyzed roughly one million conversations and published per-cluster flag rates but no global production intervention rate, and Constitutional Classifiers published eval-set block rates as a one-time research result 14.

So the novelty judgment is two-sided. The Fable 5 figure is, to our knowledge, the first frontier-launch headline metric quantifying how often a vendor-side safety intervention fires across all usage of a flagship model, attached to a notification promise 4 2. That is genuinely new, and against OpenAI's rateless routing it is strictly more transparent 10. It is also weak in form, in every dimension Section 02 catalogued. Both halves matter, because the next vendor will copy the precedent as shipped — not as intended.

Section 04

Weak in Form: The Content-Moderation Grammar

A vendor publishing a rate about its own enforcement system is not a new genre. Content moderation has been doing it, quarterly and publicly, since 2018 — and the grammar it converged on defines what a credible intervention metric contains. Meta's Community Standards Enforcement Report publishes per-policy content actioned, a proactive rate, appeals and restores, and — as its flagship metric — prevalence: violating views as a share of all views, for example 0.15-0.16% for violent and graphic content 15. YouTube's enforcement report publishes removals by policy and by detection method, the Violative View Rate, and appeal and reinstatement counts 16. Five elements recur: a defined denominator, exposure weighting, a per-policy breakdown, a fixed cadence, and an appeals channel.

That grammar also accumulated a critique corpus, and the critique transfers to AI safety metrics almost without editing. The DSA Observatory argues that platform transparency reports withhold the two numbers that measure accuracy — precision and recall — “even though companies already calculate evaluation metrics internally,” and that appeal rates are a poor accuracy proxy 17. The Center for Countering Digital Hate documents how platform-chosen metrics flatter enforcement performance, and an audit of the DSA Transparency Database found self-reports heterogeneous and hard to compare; count-based metrics “cannot tell whether content moderation systems are accurate, proportionate, or effective” 17.

Map the critique onto Fable 5 and it lands squarely. “Less than 5% of sessions” is an action count 4. It says nothing about precision — what fraction of triggers were false positives — or recall — what fraction of genuinely dangerous sessions were caught 17. Anthropic concedes the precision problem qualitatively: the safeguards are tuned conservatively, will sometimes catch harmless requests, and currently send most biology and chemistry requests to Opus 4.8 6. The quantitative version of that concession is exactly what the disclosure omits 4.
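
To illustrate why an action count cannot stand in for accuracy, the sketch below builds two invented classifiers that trigger on the same 5% of sessions yet differ sharply in precision and recall. All counts are hypothetical; no labeled production data of this kind has been published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcomes:
    """Hypothetical labeled outcomes for a batch of sessions."""
    true_pos: int   # dangerous and triggered
    false_pos: int  # benign but triggered
    false_neg: int  # dangerous and missed
    true_neg: int   # benign and untouched

    @property
    def total(self) -> int:
        return self.true_pos + self.false_pos + self.false_neg + self.true_neg

    @property
    def trigger_rate(self) -> float:
        # The only quantity an action count discloses.
        return (self.true_pos + self.false_pos) / self.total

    @property
    def precision(self) -> float:
        # Share of triggers that were warranted.
        return self.true_pos / (self.true_pos + self.false_pos)

    @property
    def recall(self) -> float:
        # Share of dangerous sessions that were caught.
        return self.true_pos / (self.true_pos + self.false_neg)


# Two invented classifiers, each triggering on 500 of 10,000 sessions.
careful = Outcomes(true_pos=450, false_pos=50, false_neg=50, true_neg=9450)
blunt = Outcomes(true_pos=50, false_pos=450, false_neg=450, true_neg=9050)
```

Both objects report a trigger rate of 0.05; `careful` has precision and recall of 0.9, `blunt` has 0.1 for each. A published "less than 5%" is compatible with either.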

Score the metric against the five elements. Defined denominator: no — the session is undefined 29. Exposure weighting: no — a flat session count, not an impact-weighted rate 4. Per-policy breakdown: no — one aggregate spanning cyber, bio-chem, and distillation 4. Fixed cadence: no — one launch-day figure, with a tuning commitment guaranteed to change the underlying rate silently 6. Appeals channel: no documented contest path, although the notification at least tells the user the intervention happened — which is more than most moderation actions offered in 2018 8.

The notification deserves a precise compliment, because it is doing different work than an appeal. A notification discloses that the system intervened; an appeal channel corrects the system when it intervened wrongly, and its reversal rate doubles as a crude public precision signal — that is what Meta's restores and YouTube's reinstatement counts provide 15 16. Fable 5 ships the disclosure half without the correction half: the support article tells users the checks may flag normal content and offers no path to contest a flag 8.

The maturity comparison is the uncomfortable part. Content moderation took roughly seven years — 2018 to the DSA-era critique — to evolve from raw action counts to exposure-weighted prevalence plus appeals, and it is still missing precision and recall 15 17. The Fable 5 metric enters at the 2018 maturity level: a single self-reported action rate on an undefined denominator, with a notification but no contest path 4 8. The seven-year curve is sitting there, already paid for, free to copy.

The field being copied already wrote the critique of the copy.

The transfer

Section 05

Design Lessons from Older Surveillance Regimes

Two regimes much older than content moderation specify the remaining design variables. The FDA's FAERS database shows what an intervention count without a denominator cannot do: because reporting is spontaneous and largely voluntary, “neither the prevalence nor incidence of an adverse event can be calculated” from it, underreporting is severe and skewed, and the system is positioned as hypothesis-generating only 22. An event count is interpretable only against a defined exposure denominator — precisely the piece the Fable 5 session metric leaves undefined 29.

NASA's Aviation Safety Reporting System shows that reporter incentives and stewardship are design variables, not givens. ASRS is voluntary, confidential, and non-punitive, with limited immunity from FAA enforcement under AC 00-46E, and it is administered by NASA precisely because a neutral third party with no enforcement power increases report flow; it publishes searchable incident narratives, not only aggregates 22. Four regimes, five dimensions:

Dimension	FAERS	ASRS	Meta CSER	Fable 5 metric
Denominator	None (known flaw)	None (narrative system)	Views (prevalence)	“Sessions,” undefined
Cadence	Continuous + quarterly	Continuous	Quarterly	One-time, launch day
Reporter incentive	Mandatory / voluntary	Immunity + confidentiality	Internal telemetry	Internal telemetry only
Independent steward	FDA	NASA (neutral)	None (self-report)	None
Public granularity	Case-level	Narrative-level	Per-policy aggregates	Single aggregate bound

The table cites the regimes' own documentation for the first three columns and Sections 02 and 04 for the last 22 15 4. Read across the rows and the Fable 5 column combines the weakest cell of each neighbor: FAERS's missing denominator, the self-report problem Meta's regime is criticized for, none of ASRS's incentive design, and a granularity coarser than any of the three 22 15 4.

From the table falls a specification — and we state plainly that this is the disclosure we recommend, not one any vendor has committed to; as of 2026-06-09 no Anthropic surface commits to any element of it 29. A credible safeguard-transparency disclosure would: define the session denominator; commit to a reporting cadence; break out per-trigger-category rates; operate a false-positive and appeal channel whose incentives favor reporting; and submit the figure to an independent steward or auditor 22 17.
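
One way to make the specification checkable is to treat it as a record with five required fields. The sketch below is our illustration of that idea, not a format any vendor or regulator has adopted; the class name, field names, and example values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SafeguardDisclosure:
    """Hypothetical shape of one published reporting period."""
    denominator_definition: Optional[str] = None  # what counts as a "session"
    cadence: Optional[str] = None                 # e.g. "quarterly"
    per_category_rates: Dict[str, float] = field(default_factory=dict)
    appeal_channel: Optional[str] = None          # where a flag can be contested
    independent_steward: Optional[str] = None     # auditor or neutral third party

    def missing_elements(self) -> List[str]:
        """Names of the five elements this disclosure leaves empty."""
        checks = {
            "denominator_definition": self.denominator_definition,
            "cadence": self.cadence,
            "per_category_rates": self.per_category_rates,
            "appeal_channel": self.appeal_channel,
            "independent_steward": self.independent_steward,
        }
        return [name for name, value in checks.items() if not value]


# An empty record mirrors this paper's reading of the launch-day figure.
launch_day = SafeguardDisclosure()

# A fully populated record, with invented values, for contrast.
credible = SafeguardDisclosure(
    denominator_definition="conversation thread with at least one request",
    cadence="quarterly",
    per_category_rates={"cyber": 0.03, "bio": 0.01, "reasoning_extraction": 0.005},
    appeal_channel="false-positive report form",
    independent_steward="external auditor",
)
```

Calling `launch_day.missing_elements()` returns all five names, and `credible.missing_elements()` returns an empty list. A vendor-risk questionnaire could use the same five names as its checklist.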

The incentive element deserves the emphasis ASRS earned. A user whose benign request trips the classifier today receives a notification and has nowhere incentive-aligned to send the false-positive report — the support article tells them the checks “may flag safe, normal content,” and the loop ends there 8. Meanwhile the vendor runs its own false-positive review on 30 days of retained traffic 7. Anthropic has, in effect, built itself a private ASRS and published none of its outputs 7.

Section 06

The Regulated Overlay: Art. 72, Art. 55(1)(c), the Code of Practice, and MANAGE 4.1

The voluntary metric sits adjacent to instruments that are not voluntary, and the adjacency is instructive. EU AI Act Article 72 requires providers of high-risk AI systems to “actively and systematically collect, document and analyse relevant data… on the performance of high-risk AI systems throughout their lifetime,” under a documented post-market monitoring plan, with a Commission template due by 2 February 2026 18. Article 72 binds providers of high-risk systems, not GPAI model providers as such — so Fable 5 maps onto it by analogy, and concretely through every downstream deployer who embeds it in a high-risk system 18.

Anthropic's own binding channel runs through the GPAI provisions. Article 55(1)(c) requires providers of general-purpose models with systemic risk to “keep track of, document, and report, without undue delay, to the AI Office and, as appropriate, to national competent authorities” relevant information about serious incidents. Article 51(2) presumes high-impact capability above 10²⁵ training FLOP, a threshold plainly implicated by a model whose vendor says its capabilities exceed anything it has ever made generally available 19 1. The obligations have applied since 2 August 2025, a Commission serious-incident reporting template exists, and Article 73 carries the parallel duty for high-risk-system providers; legal analyses place Commission fining powers from August 2026 19.

The GPAI Code of Practice's Safety and Security chapter, finalized in July 2025, makes the incident half concrete for its signatories: systemic-risk identification, assessment, and mitigation commitments, serious-incident reporting on staggered deadlines down to two days for critical-infrastructure disruption, and incident documentation retained at least five years 20. Whether Anthropic is a signatory is not stated on the Code's overview page, and we did not locate a primary statement of its status — we flag that as unverified rather than assume it 20.

Against these instruments, the classification of the 95% figure is clear: an aggregate intervention rate is neither a serious incident report nor a risk-assessment artifact, though it could serve as supporting evidence inside a safety framework 20 18. The best reading is that the metric voluntarily implements part of Article 72's spirit — systematic collection of lifetime performance data — while committing to none of the cadence, completeness, or regulator-facing reporting the binding instruments require 18 19.

NIST's AI Risk Management Framework completes the overlay. The metric is a Measure-function output. But MANAGE 4.1 asks for more: “Post-deployment AI system monitoring plans are implemented, including mechanisms for capturing and evaluating input from users and other relevant AI actors, appeal and override, decommissioning, incident response, recovery, and change management” 21. Fable 5 supplies the measurement with no documented appeal or override mechanism for fallback decisions 21 8.

The operator consequence is direct. A deployer embedding Fable 5 in a high-risk system inherits Article 72 duties it cannot discharge by pointing at a one-time vendor bound on an undefined denominator 18 4. The vendor-risk review should therefore ask the questions the disclosure leaves open — what is a session, on what cadence will the figure be refreshed, which categories drive it, and who has examined the telemetry — and treat every unanswered one as monitoring scope the deployer must cover itself 4 29. Generating that monitoring data yourself is Section 07.

Section 07

The Workload-Mix Problem and Instrumenting Your Side

A global average is uninformative for any single tenant when triggers concentrate by domain, and the launch week supplied the evidence. SANS Institute's Rob T. Lee reported that routine incident-response, detection, and basic forensics workflows were automatically routed from Fable 5 to Opus 4.8 in his initial testing, and that the classifiers “broadly identify cybersecurity-related requests rather than attempting to distinguish between benign and malicious”; Anthropic's response is that the tuning is “intentionally conservative” 25. A launch-day Hacker News user reported a benign internal business-prospecting workload tripping the classifier, with the notification preserved verbatim: “Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well” 28.

The economics follow. Launch commentary states it directly: a workload heavy in safeguarded territory “may pay the Fable 5 premium while receiving Opus 4.8 answers, in which case routing directly to Opus 4.8 is both cheaper and equivalent” 24. As an analytic illustration — not a measurement — a 5% global session rate is fully consistent with a security shop seeing 40% of its own sessions flagged, because the global figure averages over workloads that never touch the trigger domains 24 25. Your rate is not the vendor's rate. The vendor's number cannot tell you what your rate is.
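
The arithmetic behind that illustration is a weighted average. The sketch below uses an invented workload mix (the shares and local rates are hypothetical, chosen only so the total lands on 5%) to show how a 40% local rate disappears inside a 5% global figure.

```python
from typing import Dict, Tuple


def global_rate(workloads: Dict[str, Tuple[float, float]]) -> float:
    """Session-weighted average intervention rate.

    `workloads` maps a label to (share_of_all_sessions, local_rate).
    """
    total_share = sum(share for share, _ in workloads.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("session shares must sum to 1")
    return sum(share * rate for share, rate in workloads.values())


# Invented mix: not a measurement of any real tenant population.
mix = {
    "security_shop": (0.10, 0.40),    # 10% of sessions, 40% flagged
    "life_sciences": (0.10, 0.10),    # 10% of sessions, 10% flagged
    "everything_else": (0.80, 0.00),  # 80% of sessions, none flagged
}

overall = global_rate(mix)
```

Here `overall` comes out at 0.05 (up to floating-point rounding): 0.10 × 0.40 + 0.10 × 0.10 + 0.80 × 0.00. The security shop's own rate is eight times the headline number, and nothing in the headline number reveals it.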

What makes the consumer-side answer feasible is that the disclosed channels are observable, which is the corrected framing this series is built on. Consumer surfaces show a switch notice and label each response with the model that produced it 8. The Messages API returns a refusal as HTTP 200 with stop_reason: "refusal" and a structured stop_details.category of “cyber”, “bio”, or “reasoning_extraction”; opt-in server-side fallback marks every model boundary with a fallback content block, records each attempt in usage.iterations, and reports the serving model in the top-level model field 9. Anthropic's own docs make the monitoring argument for us: a refusal is an HTTP 200, so error-rate dashboards never see it, and the guidance is to “instrument refusals as their own signal” 9. The full API mechanics are covered in the companion FTG-1 paper, When the Model Changes Mid-Request.
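
As a minimal sketch of what reading those signals could look like, the function below summarizes one response payload represented as a plain dictionary. The field names follow the description above; the exact payload shape (nesting, the "fallback" block type string, `usage.iterations` being a list) and the model identifiers in the usage note are assumptions for illustration, so check them against the platform documentation before relying on them.

```python
from typing import Any, Dict, List, Optional


def intervention_signal(response: Dict[str, Any], requested_model: str) -> Dict[str, Any]:
    """Summarize vendor-intervention signals from one response payload.

    Assumes the payload is a plain dict whose shape matches the paper's
    description of the Messages API; that shape is not verified here.
    """
    stop_details = response.get("stop_details") or {}
    content: List[Dict[str, Any]] = response.get("content") or []
    usage = response.get("usage") or {}

    refused = response.get("stop_reason") == "refusal"
    # Count content blocks that mark a model boundary (assumed type string).
    fallback_blocks = sum(1 for block in content if block.get("type") == "fallback")
    serving_model: Optional[str] = response.get("model")

    return {
        "refused": refused,
        "category": stop_details.get("category") if refused else None,
        "fallback_blocks": fallback_blocks,
        "attempts": len(usage.get("iterations") or []),
        "serving_model": serving_model,
        "model_switched": serving_model is not None and serving_model != requested_model,
        "intervened": refused or fallback_blocks > 0,
    }
```

For example, a payload with `stop_reason` set to "refusal" and `stop_details.category` set to "cyber" yields `refused=True`, `category="cyber"`, and `intervened=True`, even though the HTTP status was 200. Emitting this dictionary as a structured log line, tagged with tenant and workload, is one way to make refusals their own signal.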

The practitioner baseline already expects this discipline for guardrails you operate yourself: per-request block/allow logs, OpenTelemetry-compatible traces, drift detection over trigger metrics, and the rule that “teams should measure false positive rates and tune rules instead of turning them off” 23. What follows extends that baseline to a guardrail you do not operate. It is the instrumentation we recommend — a proposed discipline, not a shipped product 23.

1. Log every vendor-intervention signal (fallback notices, responding-model identity, refusal stop-reasons and categories) as first-class telemetry with tenant and workload labels, using the response fields the vendor already exposes 9 23.
2. Compute per-tenant, per-workload intervention rates and alert on drift. Anthropic states it will “update and refine the safeguards after launch,” so the vendor's safety layer is an unversioned dependency: your rate can change without any change on your side 6. There is no version pin, no changelog, and no deprecation window for a classifier update, so drift in your own intervention rate is the only signal that the dependency changed 6 9.
3. Feed the rate into routing economics. Above a threshold, route the workload directly to Opus 4.8, which is cheaper and, on the vendor's own account, equivalent for safeguarded turns 24. The billing mechanics of a fallback-served response belong to the companion FTG-4 paper, When the Rate Card Has Tiers.
4. Carry the unnotified channel as residual risk. The ~0.03% frontier-LLM-development degradation leaves no consumer-observable trace by design: no fallback, no notification 5. No consumer-side control closes it; only vendor reporting or third-party audit can 5.

The four steps recover, tenant by tenant, the disclosure the vendor did not publish: a defined denominator (your sessions), a cadence (your alerting window), a per-category breakdown (your refusal categories), and a false-positive record (your labeled triggers) 9 23. The fifth element — the independent steward — you cannot build alone 22.
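
The first three steps can be sketched in a few lines, assuming intervention events have already been logged as (workload, intervened) pairs. The drift tolerance, routing threshold, and model labels below are hypothetical placeholders for an operator to tune, not recommended values, and the fourth step has no code equivalent because the unnotified channel emits nothing to log.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[str, bool]  # (workload label, did the vendor intervene?)


def intervention_rates(events: Iterable[Event]) -> Dict[str, float]:
    """Per-workload share of logged events with a vendor intervention."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for workload, intervened in events:
        totals[workload] += 1
        hits[workload] += int(intervened)
    return {workload: hits[workload] / totals[workload] for workload in totals}


def drifted(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """True when the rate moved more than `tolerance` from its baseline."""
    return abs(current - baseline) > tolerance


def route(rate: float, threshold: float = 0.25) -> str:
    """Pick a model label for a workload given its intervention rate."""
    return "opus-4.8" if rate > threshold else "fable-5"


# Invented log: a security workload and a support workload.
events = [("soc", True)] * 4 + [("soc", False)] * 6 + [("support", False)] * 10
rates = intervention_rates(events)
```

With this invented log, `rates` is 0.4 for "soc" and 0.0 for "support"; `route(rates["soc"])` selects the Opus 4.8 label and `drifted(0.05, rates["soc"])` is true. In practice the baseline would be the workload's own trailing rate rather than the vendor's global figure.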

Section 08

Honest Limits

This paper's own evidence has edges, and stating them is what separates the paper from the disclosure it critiques. The “>95%” and “~0.03%” figures are vendor-reported and unverifiable by construction: the telemetry that produced them is vendor-internal, and no audit or replication path exists as of 2026-06-09 4 5. Every use of either number in this paper is an attributed vendor statement, not an established fact.

We parsed the full system-card PDF on 2026-06-09 rather than relying on the truncated extracts that circulated at launch. The degradation estimate reads, verbatim, “~0.03% of traffic, concentrated in fewer than 0.1% of organizations” — resolving the unit question, but resolving it to “traffic,” a denomination as undefined as “sessions” 5 29. No source we checked — announcement, system card, support article, platform documentation — defines the session denominator; the absence is a finding, not a gap in our search 29.

The ~40% local-skew figure in Section 07 is an analytic illustration grounded in launch-week reporting, not a measurement; no published per-workload trigger-rate data exists yet 24 25. The five-element disclosure specification and the four-step instrumentation are proposals: no vendor has committed to the former and no shipped implementation of the latter exists that we know of 29 23.

Credit must land where it is due. This is, on our reading of the precedent landscape, the most transparent production-intervention number any frontier vendor has attached to a launch, and OpenAI's rateless, disclosed-on-request routing remains the worse baseline 10. The critique here is of the disclosure's form, not of the act of disclosing. A vendor that publishes a weak number can be asked for a better one. A vendor that publishes none cannot.

Last, the discourse gap is itself a dated observation. CyberScoop's launch analysis pressed on jailbreak durability but raised no demand for a denominator, a cadence, or an audit, and we located no published commentary making the content-moderation transfer this paper performs 26. Independent academic analysis typically lags a launch by weeks; this paper's novelty claims should be re-checked against the literature before being repeated 26.

For the executive version of this argument (the number you cannot check, the security-shop story, and the four moves you can make anyway), read the companion brief, When the Vendor Grades Itself: The Safety Number You Cannot Check.

End of paper


References

1. Anthropic (2026). Introducing Claude Fable 5 and Claude Mythos 5. anthropic.com/news/claude-fable-5-mythos-5. Two configurations of one model: Fable 5 generally available with safeguards, Mythos 5 with safeguards lifted for a small group of trusted partners. Accessed 2026-06-09.
2. Anthropic (2026). Claude Fable 5 / Mythos 5 launch announcement, fallback statement: 'When Fable's classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs.' anthropic.com/news/claude-fable-5-mythos-5. Accessed 2026-06-09.
3. Anthropic (2026). Claude Fable 5 and Claude Mythos 5 System Card, section 1.5 (Novel safeguards), surface-dependent fallback behavior: notified automatic fallback in client applications; Messages API blocked by default with a structured refusal category, server-side fallback opt-in and 'reflected in the response object'; a session event emitted on some Claude interfaces. www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c342ee809620.pdf. Parsed 2026-06-09.
4. Anthropic (2026). Launch announcement, the metric in both framings: 'Our early data shows that more than 95% of Fable sessions involve no fallback at all' and 'they trigger, on average, in less than 5% of sessions.' anthropic.com/news/claude-fable-5-mythos-5. Accessed 2026-06-09.
5. Anthropic (2026). System Card, section 1.5, frontier-LLM-development safeguards: 'these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT)… We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations.' Parsed 2026-06-09.
6. Anthropic (2026). Launch announcement, false-positive acknowledgments: conservative tuning that will sometimes catch harmless requests; 'our aim is to reduce false positives as we update and refine the safeguards after launch'; 'for the time being we have arranged for Fable to fall back to Opus 4.8 on most requests related to biology and chemistry.' Accessed 2026-06-09.
7. Anthropic (2026). Launch announcement, data retention: 'We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces'; human access logged; rationale includes defending against 'attacks that operate across many requests' and work to 'identify and reduce false positives.' Accessed 2026-06-09.
8. Anthropic support (2026). Why Claude switched models in your conversation with Fable 5. support.claude.com/en/articles/15363606. Explicit switch notice, per-response model labeling, automatic on consumer surfaces but not the API, checks 'intentionally broad' and may block authorized security testing and benign biology. Accessed 2026-06-09.
9Anthropic platform docs (2026). Refusals and fallback. platform.claude.com/docs/en/build-with-claude/refusals-and-fallback — HTTP 200 refusals with stop_reason 'refusal' and stop_details.category ('cyber', 'bio', 'reasoning_extraction'); opt-in fallbacks parameter under the server-side-fallback-2026-06-01 beta header; fallback content blocks and per-attempt usage.iterations; 'Instrument refusals as their own signal.' Accessed 2026-06-09.
10TechCrunch (2025-09-29). OpenAI rolls out safety routing system, parental controls on ChatGPT — per-message routing of sensitive conversations to GPT-5-thinking; the active model disclosed when the user asks; no trigger-rate percentage published; mixed user reactions. Accessed 2026-06-09.
11Anthropic (2025-2026). Claude Opus 4.5 and Opus 4.6 system cards — evaluation-set safety rates: 88.39% harmful-request refusal vs 66.96% for Opus 4.1; 0.04% over-refusal vs 8.50% for Sonnet 4.5; denominators are vendor-constructed eval sets, not live traffic. anthropic.com/claude-opus-4-5-system-card; www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf.
12OpenAI (2025-08-13). GPT-5 system card and safe-completions — per-category 'not_unsafe' rates of 0.995-0.999 graded by LLM judges; Google DeepMind Gemini model cards report eval-based safety results only; Fortune (2025-04-09) on the Gemini 2.5 Pro safety report released without timely documentation.
13OpenAI (2025-10-27). Strengthening ChatGPT's responses in sensitive conversations — ~0.07% of active users with possible psychosis/mania signs and 0.15% with explicit suicide-planning indicators, against ~800M weekly active users; a production prevalence disclosure denominated in user behavior. openai.com/index/strengthening-chatgpt-responses-in-sensitive-conversations.
14Anthropic (2024-12-12). Clio: privacy-preserving insights into real-world AI use (anthropic.com/research/clio; arXiv:2412.13678) — per-cluster flag rates over ~1M conversations, no global production intervention rate; Anthropic (2025). Constitutional Classifiers — eval-set jailbreak-block rates as a one-time research result.
15Meta. Community Standards Enforcement Report and prevalence-metric methodology. transparency.meta.com/reports/community-standards-enforcement; transparency.meta.com/policies/improving/prevalence-metric — quarterly since 2018: content actioned per policy, proactive rate, appeals and restores, and prevalence (violating views as a share of all views; violent/graphic content 0.15-0.16%).
16Google/YouTube. Community Guidelines enforcement transparency report and FAQ. transparencyreport.google.com/youtube-policy/removals; support.google.com/transparencyreport/answer/9209072 — removals by policy and detection source (automated vs human flagging), the Violative View Rate, appeals and reinstatement counts. Verified 2026-06-09.
17DSA Observatory (2026-01-08). The metrics we're missing in DSA content-moderation transparency — precision and recall withheld 'even though companies already calculate evaluation metrics internally'; CCDH (2025-02-24). More Transparency and Less Spin; arXiv:2312.10269 (DSA Transparency Database audit); policyreview.info — count-based metrics 'cannot tell whether content moderation systems are accurate, proportionate, or effective.'
18EU AI Act, Article 72 — post-market monitoring by providers of high-risk AI systems: 'actively and systematically collect, document and analyse relevant data… on the performance of high-risk AI systems throughout their lifetime,' under a documented post-market monitoring plan; Commission implementing-act template due by 2 February 2026. artificialintelligenceact.eu/article/72. Verified against primary text 2026-06-09.
19EU AI Act, Articles 55(1)(c) and 51(2) — GPAI systemic-risk providers must 'keep track of, document, and report, without undue delay, to the AI Office and, as appropriate, to national competent authorities' relevant information about serious incidents; the 10^25-FLOP high-impact presumption; in application since 2 August 2025; a Commission serious-incident reporting template exists; Article 73 is the parallel high-risk-system channel. artificialintelligenceact.eu/article/55; /article/51. Verified 2026-06-09.
20EU GPAI Code of Practice, Safety and Security chapter (finalized July 2025). artificialintelligenceact.eu/code-of-practice-overview — systemic-risk identification, assessment, and mitigation commitments; Measure 9.3 staggered serious-incident deadlines (down to 2 days for critical-infrastructure disruption) and minimum 5-year retention of incident documentation; the overview page lists no signatories. Verified 2026-06-09.
21NIST (2023-01). AI Risk Management Framework 1.0 (AI 100-1), MANAGE 4.1: 'Post-deployment AI system monitoring plans are implemented, including mechanisms for capturing and evaluating input from users and other relevant AI actors, appeal and override, decommissioning, incident response, recovery, and change management.' nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf. Parsed 2026-06-09.
22FDA FAERS limitations (PMC, 2025; medxdrg.com) — 'neither the prevalence nor incidence of an adverse event can be calculated' from FAERS data; severe, skewed underreporting; hypothesis-generating only. NASA ASRS (asrs.arc.nasa.gov/overview/immunity.html; nasa.gov) — voluntary, confidential, non-punitive reporting with limited immunity (FAA AC 00-46E), administered by a neutral non-enforcing third party, publishing searchable narratives.
23Practitioner guardrail-observability baseline: bigdataboutique.com (AI guardrails in production LLM apps); avidoai.com (2025, LLM guardrail testing); iguazio.com (2025, LLM observability tools) — per-request block/allow logs, OpenTelemetry traces, drift detection over trigger metrics; 'teams should measure false positive rates and tune rules instead of turning them off.'
24lushbinary.com (Claude Fable 5 vs Mythos 5 safeguards explained); digitalapplied.com (2026) — the workload-mix problem: 'If your workload is heavily in safeguarded territory, you may pay the Fable 5 premium while receiving Opus 4.8 answers, in which case routing directly to Opus 4.8 is both cheaper and equivalent.' Accessed 2026-06-09.
25CSO Online (2026). Anthropic releases Mythos-class Fable 5 model with safeguards for cyber risks — SANS Institute's Rob T. Lee: routine incident-response, detection, and basic forensics workflows auto-routed from Fable 5 to Opus 4.8 in initial testing; classifiers 'broadly identify cybersecurity-related requests rather than attempting to distinguish between benign and malicious'; Anthropic calls the tuning 'intentionally conservative.' Accessed 2026-06-09.
26CyberScoop (2026-06-09). Mythos on a leash — launch analysis relaying 'no known universal jailbreaking techniques'; notes silence on partial jailbreaks; raises no demand for a denominator definition, reporting cadence, or audit of the fallback metric. Accessed 2026-06-09.
27TechCrunch (2026-06-09). Anthropic released Claude Fable 5, its most powerful model, publicly, days after warning AI is getting too dangerous — relays the metric without interrogation ('early data showing at least 95% of Fable sessions running entirely on the model's own responses'); mandatory retention 'could set an industry precedent'; 'double the price of Opus 4.8.' Accessed 2026-06-09.
28Hacker News launch thread (2026-06-09). news.ycombinator.com/item?id=48463808 — a user's benign internal business-prospecting workload (data collection and scoring) triggered the classifier, with the notification text preserved: 'Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well.' Verified 2026-06-09.
29Anthropic (2026). Claude Fable 5 and Claude Mythos 5 System Card — absence check: the card never restates the >95% metric and nowhere defines the session denominator; its frontier-LLM-development estimate is denominated in 'traffic.' Full text parsed 2026-06-09; the announcement, support article, and platform docs were checked the same day with the same result.