kellerai.blog

Who gets the unrestricted model?

Vetting as the safety control in the first two-tier frontier release

KellerAI White Paper · In-Depth · Model Governance & Upgrades · Jun 2026 · ~16 min read

Context

In April 2026, Anthropic unveiled Mythos and restricted it to a vetted partner cohort, citing the damage the model could do in the wrong hands. Two months later it shipped Claude Fable 5, reportedly an equally powerful configuration of the same underlying model, to the general public. Fable 5's safety case rests on classifiers that route high-risk requests to Claude Opus 4.8 with notification, while its safeguards-lifted sibling, Mythos 5, remains gated behind Project Glasswing.

Anthropic's own Responsible Scaling Policy names the mechanism that separates the two products: an enhanced due-diligence vetting process that acts as a compensating control when a partner's use-case requires adjusting safeguards. The safeguard can be lifted for the right customer, which means the safeguard is no longer the control. The vetting is.

The Finding

The vendor publishes the program's outcome metric (more than 10,000 high- or critical-severity flaws found by roughly 50 initial partners) but not the vetting criteria that gate access, so the load-bearing safety control is asserted rather than auditable. OpenAI runs the same tiered structure, and the two vendors converged on it within four months, with one telling difference: OpenAI publishes its criteria class (objective, KYC-based identity verification), while Anthropic's security requirements remain unenumerated.

The enterprise consequence is a mirror-image discipline. A capability tier is structurally a new data-sensitivity axis, and the paper proposes governing it with the RBAC machinery regulated organizations already run: a tier registry that includes the fallback model you never chose, role mapping for people and service principals, change-managed tier escalation under SOC 2 CC8.1 and SR 11-7, and an incident path that alerts on the per-request fallback observability the Messages API already exposes.

Tags: Tiered model access · AI access governance · Frontier model safeguards
Cite this paper

KellerAI. (2026, June 9). Tiered Access Governance. KellerAI. https://kellerai.blog/when-access-is-the-safeguard-in-depth

Paper Details
Category: Model Governance & Upgrades
Audience: Senior engineering, risk, and compliance leaders in regulated industries
Method: Primary-source verification of all 28 citations on 2026-06-09 (vendor announcements, system-card reporting, RSP and Glasswing program pages, OpenAI Trusted Access pages, EU AI Act articles, NIST AI RMF, ISO/IEC 42001, SOC 2 CC8.1, SR 11-7), recorded in a per-claim verification ledger; comparative analysis of the two vendor vetting regimes; proposed mapping of enterprise access-control and change-management standards onto model capability tiers.
Length: ~3,900 words · ~16 min
Reading level: Technical
Sections: 8
References: 28
Version: v1.0 · Updated Jun 2026
Published: Jun 2026
Key Takeaways
  • Anthropic's RSP calls partner vetting a compensating control that replaces deployed safeguards, but the vetting criteria are not publicly enumerated — the control is asserted, not auditable.
  • Two vendors converged on identity-based tiered access within four months: OpenAI's Trusted Access for Cyber and Project Glasswing make capability tiers an industry pattern, not an experiment.
  • Map your existing RBAC discipline onto capability tiers: a tier registry, role mapping, change-managed approvals, and an incident path for tier changes you did not initiate.

KellerAI White Paper · June 2026 · Frontier Tier Governance

Tiered Access Governance


On June 9, 2026, Anthropic released one reported frontier model as two products: Claude Fable 5, generally available with safeguards that block high-risk domains, and Claude Mythos 5, described as the same capabilities with relevant safeguards lifted, restricted to vetted partners. We make the case that when a vendor's own risk assessment, not capability economics, draws the product boundary, access governance — who qualifies for the unrestricted tier, who vets them, who audits the vetting — becomes the primary safety mechanism. Anthropic's own policy language calls partner vetting a compensating control that substitutes for deployed safeguards, and OpenAI converged on the same structure within four months, which makes tiered access an industry pattern rather than one vendor's improvisation. We argue that enterprises should govern this boundary the way they already govern data sensitivity: a tier registry, role mapping, an approval workflow, and an incident path for tier changes they did not initiate. The enterprise framework in Section 6 is a recommended discipline, not shipped tooling, and we say so throughout.

Section 01

When the Safeguard Can Be Lifted

Anthropic released Claude Fable 5 and Claude Mythos 5 on June 9, 2026, and its system card describes them as two configurations of one new model: Fable 5 for general use, carrying safeguards that block tasks in high-risk domains such as biology and cybersecurity, and Mythos 5 with "relevant safeguards lifted," available only to a small number of trusted partners 1. Anthropic states that the model's capabilities "exceed those of any model we've ever made generally available" 1. Mythos 5 is restricted to Project Glasswing partners, and soon to select biology researchers, "until our broader trusted access program is available"; it carries the API identifier claude-mythos-5 and is not generally available 2. The asymmetry is total at the distribution layer: Fable 5 is generally available on the Claude API, AWS, Amazon Bedrock, Vertex AI, and Microsoft Foundry, while Mythos 5 is reachable only by approved Glasswing customers through a vendor account team 2.

The public tier's safeguards are disclosed, not silent. Anthropic states that when Fable's classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, "the response is automatically handled by Claude Opus 4.8 instead," and that "users will be informed whenever this occurs" 3. In the Messages API there is no automatic fallback at all: a declined request returns HTTP 200 with stop_reason: "refusal" and a structured stop_details.category, and server-side fallback is an explicit opt-in whose result is reflected in the response object 4. The trigger is narrow by design: only a safety-classifier decline starts a fallback, while rate limits, overloads, and server errors are returned to the caller as-is 4. One safeguard in the same release is structurally different: Anthropic's system card reports a fourth category, covering frontier-LLM development, that degrades capability with no fallback and no user notification, at an estimated ~0.03% of traffic — a figure no outside party can measure 5.
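To make the HTTP 200 point concrete, here is a minimal Python sketch that classifies one Messages API response using only the fields the paragraph names. It operates on a plain decoded JSON dict rather than any SDK object, and the exact shapes are assumptions: we assume stop_details is an object with a category key, and that the fallback marker is a content block whose type is "fallback". The category value "cyber" in the usage lines is hypothetical.

```python
from typing import Optional, Tuple


def classify_response(http_status: int, body: dict) -> Tuple[str, Optional[str]]:
    """Return (outcome, refusal_category) for one Messages API call.

    Assumed body shape (not verified against any SDK): "stop_reason",
    optional "stop_details": {"category": ...}, and a "content" list in
    which a block of type "fallback" marks a fallback-served answer.
    """
    if http_status != 200:
        # Rate limits, overloads, and server errors come back as-is; no fallback.
        return ("error", None)
    blocks = body.get("content") or []
    if any(block.get("type") == "fallback" for block in blocks):
        # Hypothetical marker: the opt-in fallback answered on another model.
        return ("fallback_served", None)
    if body.get("stop_reason") == "refusal":
        # A decline is a successful HTTP 200, so error-rate dashboards miss it.
        category = (body.get("stop_details") or {}).get("category")
        return ("refusal", category)
    return ("answered", None)


declined = {
    "model": "claude-fable-5",
    "stop_reason": "refusal",
    "stop_details": {"category": "cyber"},  # hypothetical category value
    "content": [],
}
print(classify_response(200, declined))  # ('refusal', 'cyber')
```

Checking the status code first mirrors the trigger rule described above: only a classifier decline can start a fallback, so every non-200 status is treated as an ordinary error.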

That spectrum, from disclosed fallback to structurally hidden degradation, is the subject of the first pair of papers in this series. This paper governs the boundary those safeguards sit inside. When the same reported model ships in a safeguarded form for everyone and a safeguards-lifted form for the vetted few, the question "is the model safe?" becomes the question "is the access boundary governed?", and the binding control is no longer a classifier.

When the safeguard can be lifted for the right customer, the safeguard is no longer the control. The vetting is.

The FTG-2 claim

We develop that claim in four moves: the release itself as the first shipped two-tier frontier boundary (Section 2), the vendor's vetting program and the policy text that makes vetting a compensating control (Section 3), the convergent OpenAI precedent (Section 4), and the instability of the boundary (Section 5). Sections 6 and 7 then turn to your side of the boundary: an RBAC-shaped discipline we recommend for capability tiers, and the regulatory overlay that already attaches to it. A note on method before we start: these papers are written about the very model family running our authoring pipeline, so every Anthropic claim in this paper is phrased as an attributed report, never as established fact.

Section 02

The First Shipped Two-Tier Frontier Release

Tiered access itself is not new. OpenAI began gating enhanced cyber capability behind an identity- and trust-based program in February 2026 14, and shipped a fine-tuned, more permissive variant to its highest tiers in April 15. What is new on June 9 is the same-day pairing: a generally available frontier model and an explicitly reduced-safeguard sibling of the same reported model, released together as two products of one launch 1 2. We scope the "first" carefully — first general-availability frontier release shipped the same day as a safeguards-lifted sibling tier — because the precedents differ in mechanism, and Section 4 returns to them.

The "same underlying model" framing deserves its own caution. Anthropic's announcement says the two products share the same underlying model, and that claim is Anthropic-asserted and uninspectable from outside, so we treat it as a report rather than a fact 1. Everything that follows holds either way: what separates the two products at the point of sale is not a training run but a safeguard configuration and an access decision 1 2. Price is not the boundary either. Anthropic's platform documentation lists both models at $10 per million input tokens and $50 per million output tokens, with the same 1M-token context window 2. When two products share a model, a rate card, and a context window, the only thing the vendor is selling differently is the safeguard posture — and the only thing separating the customers is the vetting.
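The pricing argument reduces to arithmetic. A minimal Python sketch, assuming only the reported list prices ($10 and $50 per million input and output tokens for both models) and ignoring caching, batch, or negotiated discounts, which the source does not address; the workload size is hypothetical.

```python
# Reported list prices, USD per million tokens: (input, output).
RATE_CARD = {
    "claude-fable-5": (10, 50),
    "claude-mythos-5": (10, 50),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """List-price cost in USD for a given token volume on one model."""
    input_rate, output_rate = RATE_CARD[model]
    return (input_rate * input_tokens + output_rate * output_tokens) / 1_000_000


# Hypothetical monthly workload: 40M input tokens, 8M output tokens.
fable = workload_cost("claude-fable-5", 40_000_000, 8_000_000)
mythos = workload_cost("claude-mythos-5", 40_000_000, 8_000_000)
print(fable, mythos, fable == mythos)  # 800.0 800.0 True
```

Any token volume gives the same result, which is the point: no line on the invoice distinguishes the two tiers.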

The boundary's history is short and instructive. CNBC reports that Anthropic unveiled Mythos in April 2026 as a model that excels at identifying security flaws in software, restricted it to a select group explicitly because of concerns about "the model's potential to do damage in the wrong hands," and two months later declared itself ready to release an equally powerful model to the public 16. The capability did not change in those two months; the vendor's risk determination did. A boundary that a vendor can move in eight weeks is a governance surface, not a constant.

The public tier's safety case rests on the disclosed safeguard spectrum from Section 1: classifier-routed fallback with user notification on consumer surfaces 3, and a blocked-by-default Messages API with structured refusals and opt-in fallback 4. The restricted tier's risk posture is described only by its maker: Anthropic's system card reports that Mythos 5 is "near the border of our RSP and FCF threshold," that the catastrophic chemical-biological risk from its development is "low, but higher than for any previous model," and observers note no ASL designation was stated in the launch announcement 8. These are self-assessments with no published methodology, and we phrase them accordingly 8.

Access to either tier also carries a new condition: Anthropic states it "will require 30-day retention for all traffic on Mythos-class models," with retained data not used to train new Claude models or for any non-safety purpose 6. AWS corroborates the requirement independently and adds a disclosure Anthropic's page does not make: once you opt in, "your data will leave AWS's data and security boundary" 7. The tier boundary, in other words, is not only about what the model will do for you — it reprices what you must accept to be allowed near it.

Section 03

Vetting as a Compensating Control: Glasswing and the RSP

Anthropic's Responsible Scaling Policy states that the company is developing "a tiered access system that allows for nuanced control over safeguard adjustments" 11. The policy describes an enhanced due-diligence process that evaluates potential partners on two criteria — "their overall trustworthiness and the beneficial nature of their use-case" — and then says the load-bearing thing plainly: "this vetting process will act as a compensating control" when a partner's use-case requires adjusting safeguards 11. Compensating control is auditor language. It means the primary control is absent and something else is standing in for it. For the Mythos tier, the something else is the vetting. And auditor language invites the auditor's questions: who performs the vetting, against what written criteria, on what review cycle, and who examines the examiners. Anthropic's public materials answer the first question — Anthropic — and none of the other three 11 10.

The same policy carries a conditional commitment with teeth: once a model reaches certain Capability Thresholds, it can no longer be deployed under baseline measures, and Anthropic must first upgrade its safeguards to the ASL-3 Security Standard or the ASL-3 Deployment Standard 12. Read as a whole, the RSP is a capability-gated access framework: the policy infrastructure a two-tier release needs was written before the release shipped 11 12.

Project Glasswing is that policy in operation, and its public page is precise about dates and names. The page, dated April 7, 2026, names the gated model "Claude Mythos Preview" — not Mythos 5 — and states "we do not plan to make Claude Mythos Preview generally available," while framing safe at-scale deployment of Mythos-class models as an eventual goal 10. It names eleven launch partners, including AWS, Apple, Cisco, CrowdStrike, Google, JPMorganChase, Microsoft, NVIDIA, and Palo Alto Networks, plus over 40 additional organizations that build or maintain critical software 10. It describes partner categories. It does not disclose vetting criteria 10.

Keep the two model names distinct, because the dates carry governance weight. Every fact on the Glasswing page is an April fact about Claude Mythos Preview; the platform documentation names Mythos 5 as that model's successor, shipped June 9 10 2. A vetting decision made in April for one model now governs access to its more capable successor — which is precisely the kind of silent scope expansion an access-review cycle exists to catch.

The program's outcomes are published with more precision than its gate. Anthropic reports that the initial cohort of roughly 50 partners, with access from early April 2026, found more than 10,000 high- or critical-severity security flaws in their codebases, and on June 2, 2026 it announced an expansion to approximately 150 new organizations across more than 15 countries, in sectors including power, water, healthcare, communications, and hardware 13. The stated gate for every new partner: "each one will need to meet our security requirements before they gain access" — and those requirements are not publicly enumerated anywhere in the announcement 13.

Name the asymmetry, because it is the section's finding. The vendor publishes the outcome metric — flaws found — but not the vetting criteria, so the compensating control is asserted rather than auditable 10 13. When vetting substitutes for deployed safeguards, the vetting criteria are the safety documentation. As of June 9, 2026, that documentation consists of the phrase "our security requirements," checked against Anthropic's Glasswing page and its expansion announcement 10 13.

Section 04

Convergent Precedent: OpenAI's Trusted Access for Cyber

On February 5, 2026, OpenAI announced it was piloting Trusted Access for Cyber, "an identity and trust-based framework designed to help ensure enhanced cyber capabilities are being placed in the right hands" 14. The program includes an invite-only tier for researchers and teams who need "even more cyber capable or permissive models," and it enumerates prohibited behavior that survives the trust grant — data exfiltration, malware creation or deployment, and destructive or unauthorized testing — with trusted users still bound by the usage policies 14.

The April 14, 2026 expansion made the top tier concrete: customers in the highest tiers receive GPT-5.4-Cyber, a variant "purposely fine-tuned for additional cyber capabilities and with fewer capability restrictions," which lowers the refusal boundary for legitimate security work and adds binary reverse-engineering capability 15. OpenAI states its vetting uses "clear, objective criteria and methods — such as strong KYC and identity verification," automated over time, and it classified GPT-5.4 as "high" cyber capability under its Preparedness Framework 15. That expansion landed one week after Anthropic's Glasswing page went up 10 15.

Two vendors independently converged on identity-based tiered access to frontier capability within four months. That makes tiered access an industry pattern, not one vendor's improvisation — and it hands you a comparison axis. The framework that follows is our analysis, not either vendor's. On vetting transparency: OpenAI publishes its criteria class — objective, KYC-based — while Anthropic's security requirements are unenumerated 15 13. On onboarding: OpenAI offers self-serve individual verification and an enterprise path through a representative, while Glasswing partners are vendor-selected 14 10. On capability mechanism: OpenAI ships a fine-tuned permissive variant, while Anthropic reports a safeguards-lifted configuration of the same underlying model 15 1. On threshold frameworks: a Preparedness Framework classification on one side, an RSP capability-threshold regime on the other 15 11.

One TAC detail matters most for Section 6. The prohibited behaviors that persist even for trusted users are a contractual acceptable-use overlay: the vendor grants the capability and simultaneously constrains it in writing 14. That is exactly the shape of the policy your own organization will need on its side of the boundary — a grant is never just a grant; it is a grant plus enumerated limits plus someone accountable for both. And the comparison axis is not academic: if your organization consumes frontier capability from both vendors, you are now subject to two differently shaped vetting regimes for the same class of capability, and your tier registry should record which regime vetted which credential 14 13.
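One way to record which regime vetted which credential is a small provenance record per credential. The Python sketch below is hypothetical throughout: the class, field names, credential IDs, and dates are illustrative, and the criteria-class strings paraphrase what each vendor discloses as described above. It also flags any credential whose vetting decision covered a different model than the one it now reaches, the April-to-June pattern described in Section 3.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Set


@dataclass(frozen=True)
class VettedCredential:
    credential_id: str      # hypothetical internal identifier
    vendor: str
    program: str            # the vendor program that vetted this credential
    criteria_class: str     # what the vendor publicly discloses about its criteria
    vetted_on: date         # hypothetical date
    vetted_for_model: str   # model the vetting decision originally covered
    reaches_model: str      # model the credential can invoke today


def by_program(creds: List[VettedCredential]) -> Dict[str, Set[str]]:
    """Group credential IDs by the vetting program that approved them."""
    groups: Dict[str, Set[str]] = {}
    for cred in creds:
        groups.setdefault(cred.program, set()).add(cred.credential_id)
    return groups


def scope_drift(creds: List[VettedCredential]) -> List[str]:
    """Credentials now reaching a model their vetting decision never covered."""
    return [c.credential_id for c in creds if c.vetted_for_model != c.reaches_model]


ledger = [
    VettedCredential("cred-001", "Anthropic", "Project Glasswing",
                     "unenumerated ('our security requirements')",
                     date(2026, 4, 7), "Claude Mythos Preview", "claude-mythos-5"),
    VettedCredential("cred-002", "OpenAI", "Trusted Access for Cyber",
                     "objective criteria: KYC and identity verification",
                     date(2026, 4, 14), "GPT-5.4-Cyber", "GPT-5.4-Cyber"),
]
print(by_program(ledger))
print(scope_drift(ledger))  # ['cred-001']
```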

Section 05

The Boundary in Motion: Staged Rollout or Standing Regime?

How long will the boundary exist? The vendor's own statements disagree. Cybersecurity Dive reports that Anthropic expects to be able to bring Mythos-class models "to all our customers in the coming weeks," conditioned on robust safeguards against misuse — the two-tier structure framed as a temporary staged rollout 18. Axios, reporting the same launch, states that the formal trusted-access program that would determine who gets Mythos 5 is still in development and that Anthropic "has not provided a timeline" for launching it 17.

We do not resolve that conflict, and you should distrust any analysis that does. "Coming weeks" and "no timeline" are both vendor statements published the same week, and the finding is the tension itself: the boundary's lifespan is undefined by the vendor that drew it 1718. Dianne Penn, Anthropic's head of product management for research and labs, told Axios the company is "being deliberately conservative at launch," which means some legitimate security work will be routed away from the public tier while the formal program remains unbuilt 17.

The boundary also moves fast in both directions. In April the restricted model was too dangerous for general release; in June an equally powerful sibling was public 16. Meanwhile the gated side grew: Axios notes that organizations spent two months lobbying for Mythos Preview access and that the partnership expanded to more than 200 companies and governments in the week before launch 17. Existing Mythos Preview users are being offered an upgrade to Mythos 5 17.

Here is the governance consequence. If you treat the Fable/Mythos split as a launch artifact — a temporary inconvenience that resolves "in the coming weeks" — you will build nothing, and the next boundary will find you with no discipline to apply. The obligations the boundary creates are standing obligations: someone in your organization can now hold credentials to a safeguards-lifted frontier model, the vendor can move the boundary without your consent, and both facts persist whether this particular boundary lasts three weeks or three years. Tier governance is not a one-time onboarding decision, because the two-month flip shows the tiers themselves are not one-time decisions 16.

Notice, too, that the durable artifact here is not the boundary but the gate. The formal trusted-access program — the standing machinery that will decide who gets Mythos 5 and every less-restricted model after it — is the piece Anthropic says is still being built 17. Boundaries will come and go with each release; the vetting regime is what accumulates. An enterprise should plan for the regime, not the release: the discipline in the next section is written to survive the day Fable and Mythos merge, because its subject is the decision structure, not the product names.

Section 06

The Enterprise Mirror: RBAC Mapped to Capability Tiers

Everything in this section is the discipline we recommend, not shipped tooling. No published framework mapping role-based access control onto model capability tiers surfaced in our research for this series as of June 9, 2026, so what follows is a proposal built from controls your organization already runs. The design intuition: a capability tier is a new data-sensitivity axis. You already decide who may touch production, who may see regulated data, who may approve a payment above a threshold. Deciding who may invoke a frontier-unrestricted model is the same decision on a new column.

The tier registry. Enumerate the model tiers your organization consumes the way you enumerate data classifications. The Fable/Mythos release alone yields three capability surfaces inside one vendor relationship: the public safeguarded tier (claude-fable-5), the restricted safeguards-lifted tier (claude-mythos-5), and the fallback target (Claude Opus 4.8) that answers routed requests — a model your workload can receive without ever naming it 23. A registry that omits the fallback target is not a registry; it is a list of the tiers you chose, missing the one the vendor chooses for you.
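A minimal Python sketch of such a registry, keyed by the identifier a response reports as its serving model. Assumptions: the tier labels and the chosen_by field are hypothetical, and the source gives no API identifier for Claude Opus 4.8, so OPUS_48_ID below is a hypothetical placeholder to replace with the real value. The check at the end surfaces any serving model that appears in traffic but not in the registry.

```python
from dataclasses import dataclass
from typing import Iterable, Set

OPUS_48_ID = "claude-opus-4-8"  # hypothetical placeholder; the source publishes no identifier


@dataclass(frozen=True)
class TierEntry:
    model: str      # identifier as reported in the response's top-level "model" field
    tier: str       # internal label, hypothetical
    chosen_by: str  # "us" if we selected it, "vendor" if it arrives via fallback


REGISTRY = {
    entry.model: entry
    for entry in (
        TierEntry("claude-fable-5", "public-safeguarded", "us"),
        TierEntry("claude-mythos-5", "restricted-safeguards-lifted", "us"),
        TierEntry(OPUS_48_ID, "fallback-target", "vendor"),
    )
}


def unregistered(served_models: Iterable[str]) -> Set[str]:
    """Serving models seen in traffic that the registry does not classify."""
    return {m for m in served_models if m not in REGISTRY}


def vendor_chosen() -> Set[str]:
    """Tiers your workloads can receive without ever naming them."""
    return {m for m, e in REGISTRY.items() if e.chosen_by == "vendor"}


print(unregistered(["claude-fable-5", OPUS_48_ID, "some-new-model"]))  # {'some-new-model'}
```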

The role mapping. Define which roles and which service principals may invoke each tier, and record who granted the mapping. A Glasswing partner that holds claude-mythos-5 credentials has passed Anthropic's enhanced due diligence; the internal grant of those credentials to a team or an agent pipeline is the enterprise-side mirror of that vetting, and it deserves the same two questions Anthropic's policy asks — trustworthiness and beneficial use 2 11. Map service principals as deliberately as people: an agent pipeline holding unrestricted-tier credentials is a role, and the April-to-June lesson applies here too — a mapping reviewed only at grant time silently inherits every capability the tier gains afterward, so the review cycle, not the grant, is the control.
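The review-cycle point can be checked mechanically. A minimal Python sketch, assuming you record for each grant its principal, whether it is a human or a service principal, who granted it, and when it was last reviewed, plus the date each tier last gained capability. The 90-day cadence, principal names, and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RoleGrant:
    principal: str       # a person or a service principal such as an agent pipeline
    kind: str            # "human" or "service"
    model: str           # tier the grant allows the principal to invoke
    granted_by: str
    last_reviewed: date


def grants_needing_review(
    grants: List[RoleGrant],
    tier_changed_on: Dict[str, date],
    today: date,
    max_age: timedelta = timedelta(days=90),  # hypothetical cadence
) -> List[Tuple[str, str]]:
    """Flag grants reviewed before their tier last changed, or not reviewed recently."""
    flagged = []
    for g in grants:
        changed = tier_changed_on.get(g.model)
        if changed is not None and g.last_reviewed < changed:
            flagged.append((g.principal, "tier gained capability since last review"))
        elif today - g.last_reviewed > max_age:
            flagged.append((g.principal, "review overdue"))
    return flagged


grants = [
    RoleGrant("ir-agent-pipeline", "service", "claude-mythos-5", "ciso", date(2026, 4, 20)),
    RoleGrant("j.doe", "human", "claude-mythos-5", "ciso", date(2026, 6, 15)),
]
print(grants_needing_review(grants, {"claude-mythos-5": date(2026, 6, 9)}, today=date(2026, 6, 20)))
# [('ir-agent-pipeline', 'tier gained capability since last review')]
```

Comparing the review date against the tier's last capability change, rather than only against a calendar cadence, is what catches the grant that was sound when made and silently widened afterward.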

The approval workflow. Treat tier escalation as change management, because your existing control language already describes it. SOC 2's CC8.1 criterion requires that the entity "authorizes, designs, develops or acquires, configures, documents, tests, approves, and implements changes to infrastructure, data, software, and procedures" 26. SR 11-7 — the Federal Reserve's model-risk guidance, superseded in April 2026 by the revised SR 26-2 — defines effective challenge as "critical analysis by objective, informed parties who can identify model limitations and assumptions and produce appropriate changes" 24. A workload moving from a safeguarded tier to an unrestricted one is a change in both senses, and the approval gate should already exist in your change-management system; it only needs the new column.
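A sketch of that gate as a pre-implementation check, in Python. It illustrates the idea under stated assumptions and is not a rendering of either standard: we assume a tier escalation must carry documentation and test evidence (two of the CC8.1 verbs quoted above) and at least one approval from an independent reviewer other than the requester, which is our reading of effective challenge. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

REQUIRED_EVIDENCE = ("documented", "tested")  # assumed subset of the CC8.1 verbs


@dataclass
class TierChangeRequest:
    workload: str
    from_model: str
    to_model: str
    requested_by: str
    approvals: List[str] = field(default_factory=list)
    evidence: Dict[str, bool] = field(default_factory=dict)


def may_implement(req: TierChangeRequest, independent_reviewers: Set[str]) -> Tuple[bool, List[str]]:
    """Return (allowed, reasons_blocked) for a tier-escalation change."""
    blocked = []
    missing = [item for item in REQUIRED_EVIDENCE if not req.evidence.get(item)]
    if missing:
        blocked.append("missing evidence: " + ", ".join(missing))
    # Effective challenge: the requester cannot approve their own escalation.
    challengers = [a for a in req.approvals
                   if a != req.requested_by and a in independent_reviewers]
    if not challengers:
        blocked.append("no independent approver")
    return (not blocked, blocked)


req = TierChangeRequest("threat-intel-batch", "claude-fable-5", "claude-mythos-5",
                        requested_by="alice", approvals=["alice"],
                        evidence={"documented": True})
print(may_implement(req, independent_reviewers={"mrm-validator"}))
# (False, ['missing evidence: tested', 'no independent approver'])
```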

The incident path. A tier change you did not initiate is an incident, and this release makes two kinds observable and one kind not. A Fable workload answered by Opus 4.8 through the disclosed fallback is fully visible per-request: the top-level model field names the serving model, a fallback content block marks the boundary, usage.iterations records every attempt, and a refusal carries stop_details.category 4. Consume those fields, alert on them, and name an escalation owner. Instrument refusals as their own signal, because a refusal arrives as a successful HTTP 200 and monitoring built on error rates will never see it; Anthropic's own integration guidance recommends emitting one event per refusal and one per fallback-served response, then alerting on the gap between the two counts 4. That gap is your unhandled-tier-change rate, and it belongs on the same dashboard as your authentication failures.

The hidden end of the spectrum cannot be instrumented: Anthropic's system card reports that the frontier-LLM-development safeguard degrades capability with no fallback and no notification at ~0.03% of traffic, so the honest entry in your risk register is an accepted, unmonitorable vendor behavior: recorded, owned, and reviewed, precisely because no alert will ever fire on it 5.
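The two-counter alert from the incident path can be sketched in a few lines of Python. We assume, consistent with the integration guidance described above, that every refused request emits one refusal event and that a refusal rescued by fallback also emits one fallback-served event, so the difference counts declines that reached the caller with no answer. The event shape, the zero-gap threshold, and the request IDs are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TierEvent:
    kind: str        # "refusal" or "fallback_served"
    request_id: str


def summarize(events: List[TierEvent]) -> Dict[str, int]:
    counts = Counter(e.kind for e in events)
    return {
        "refusals": counts["refusal"],
        "fallback_served": counts["fallback_served"],
        # Declines that no fallback answered: the unhandled-tier-change count.
        "gap": counts["refusal"] - counts["fallback_served"],
    }


def should_page(summary: Dict[str, int], max_gap: int = 0) -> Tuple[bool, str]:
    gap = summary["gap"]
    return (gap > max_gap, f"{gap} refusal(s) with no fallback-served response")


window = [
    TierEvent("refusal", "req-1"),
    TierEvent("fallback_served", "req-1"),  # req-1 was rescued by the fallback model
    TierEvent("refusal", "req-2"),          # req-2 reached the caller as a bare refusal
]
print(summarize(window))  # {'refusals': 2, 'fallback_served': 1, 'gap': 1}
print(should_page(summarize(window)))  # (True, '1 refusal(s) with no fallback-served response')
```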

The incident path is not hypothetical, because the classifier boundary is broad. SANS' Rob T. Lee reported that routine incident-response, detection, and basic forensics workflows were auto-routed from Fable 5 to Opus 4.8 in initial testing, with classifiers that "broadly identify cybersecurity-related requests" rather than distinguishing benign from malicious; Anthropic's response is that the tuning is intentionally conservative 23. Read that as a design input: the roles most likely to need the unrestricted tier — your defenders — are the most likely to be routed off it 23.

ISO/IEC 42001:2023 supplies the umbrella. It specifies requirements for establishing, implementing, maintaining, and continually improving an AI management system within an organization 25. Under an AIMS, the tier registry, the role mappings, the approval records, and the incident log stop being ad-hoc artifacts and become conformance evidence — which is what turns a recommended discipline into an auditable one 25.

Section 07

The Regulated Overlay: GPAI Systemic-Risk Obligations

The tier boundary is not only vendor policy; parts of it are law. EU AI Act Article 55(1)(c) requires providers of general-purpose AI models with systemic risk to "keep track of, document, and report, without undue delay, to the AI Office and, as appropriate, to national competent authorities, relevant information about serious incidents and possible corrective measures" 20. Article 51 presumes systemic risk when cumulative training compute exceeds 10²⁵ floating-point operations 20. Whether Mythos 5 meets that presumption cannot be computed from public information — no training-compute figure for the model has been published — so we treat applicability as a threshold analysis, not an established fact, and we flag the absence deliberately 20. The connection to Section 6 is direct: if the provider must track and report serious incidents, the deployer's incident path is the upstream feed — a tier-change event your organization never detects is one the reporting chain never sees 20.
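The threshold analysis has three outcomes, not two, and a short Python sketch makes that explicit. Article 51 presumes systemic risk when cumulative training compute is greater than 10²⁵ FLOP; a model can also be designated on other grounds, so a figure at or below the line means only that the compute presumption does not apply. The function name and return strings are hypothetical.

```python
from typing import Optional

ART51_PRESUMPTION_FLOP = 1e25  # cumulative training compute above which systemic risk is presumed


def compute_presumption(training_flop: Optional[float]) -> str:
    """Three-way reading of the Article 51 compute presumption."""
    if training_flop is None:
        # The Mythos 5 case as of June 9, 2026: no figure has been published.
        return "undetermined: no published training-compute figure"
    if training_flop > ART51_PRESUMPTION_FLOP:
        return "presumed systemic risk"
    return "not presumed on compute alone"


print(compute_presumption(None))  # undetermined: no published training-compute figure
```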

The General-Purpose AI Code of Practice gives the obligation operational shape. Its Safety and Security chapter commits signatories to systematic identification, assessment, and mitigation of systemic risks, and to tracking, documenting, and reporting serious incidents on staggered severity deadlines — as short as two days for critical-infrastructure disruption 21. Whether Anthropic is a signatory is a fact we could not verify: the Code's overview page lists the commitments but, as of June 9, 2026, our check found no confirmation of Anthropic's signatory status, and we cite that absence rather than assume either answer 21.

Article 72 — post-market monitoring — requires providers of high-risk AI systems to "actively and systematically collect, document and analyse relevant data" on performance throughout the system's lifetime, and it binds high-risk-system providers, not GPAI model providers as such 19. Fable 5 maps to it by analogy and through downstream deployers who embed it in high-risk systems, and we keep the analogy honest: if you are such a deployer, the monitoring plan is yours to write, and tier-change events belong in it 19.

The NIST AI RMF says the quiet part as a checklist item. MANAGE 4.1 calls for post-deployment monitoring plans including "mechanisms for capturing and evaluating input from users and other relevant AI actors, appeal and override, decommissioning, incident response, recovery, and change management" 22. Every element of Section 6's mirror is named in that sentence. The tier boundary is a managed change surface, and the framework your risk team already cites expects you to manage it 22.

For regulated financial institutions the mapping is older still. Under SR 11-7's model-risk discipline — carried forward into the Federal Reserve's April 2026 revision — granting a workload access to an unrestricted tier is a model change, and a model change demands independent validation and ongoing monitoring by parties with the incentives, competence, and influence to challenge it effectively 24. None of these frameworks mention capability tiers by name. All of them already govern the shape of the decision a capability tier forces.

Section 08

Honest Limits

The evidence base for tier governance is dominated by vendor statements about the vendor's own controls, and you should weigh this paper accordingly. The headline safeguard-effectiveness claim — that an external partner found Fable 5's cyber safeguards "the most robust of any model tested," with zero harmful single-turn compliance across requests using 30 public jailbreak techniques — names no partner and publishes no evaluation 9. The vetting criteria that the RSP makes load-bearing are unenumerated, so the claim that vetting compensates for lifted safeguards cannot be audited from outside 10 13. We can also say what would change that verdict: published vetting criteria of the kind OpenAI describes, a third party attesting the due-diligence process, or reproduction steps for the program's headline discoveries 15 28. None of the three existed as of June 9, 2026, across the sources this series checked 10 13 28.

The program's flagship discovery narrative has been independently challenged. Security researchers at flyingpenguin document that CVE-2026-4747, a FreeBSD flaw attributed to the Glasswing effort, had a fix published in 2007 that was present in the model's training data, and that no reproduction steps were published with the launch materials — a finding consistent with backlog recovery rather than novel discovery 28. Independent corroboration of the underlying capability does exist, but it is partial: the UK AI Security Institute's evaluation of Claude Mythos Preview backed Anthropic's cyber-capability claims, as Simon Willison's April 14 analysis relays 27. Capability is corroborated; the safety controls around it are not.

Two further limits are structural. The RSP threshold self-assessment — "near the border" — is unverifiable by construction, because only Anthropic can run its own framework against its own model 8. And the hidden frontier-LLM-development safeguard is unmeasurable by design: it leaves no user-visible trace, so its reported ~0.03% incidence rests entirely on the system card that discloses it 5.

Finally, our own contribution carries its own flag. The enterprise mirror in Section 6 is a recommended discipline; no third party can yet verify any enterprise running it, and we have not claimed otherwise. We wrote this paper with the model family it describes, which is why every Anthropic claim above is attributed rather than asserted — the discipline we ask of the vendor's disclosures is the one we owe you in ours. When the safeguard can be lifted for the right customer, the vetting is the control. Govern your side of it.

A short, non-technical version of this argument is available as When Access Is the Safeguard, the executive brief companion to this paper.

End of paper


References
  1. Anthropic. "Introducing Claude Fable 5 and Claude Mythos 5." anthropic.com/news/claude-fable-5-mythos-5, June 9, 2026; Claude Fable 5 & Claude Mythos 5 System Card (PDF), www-cdn.anthropic.com, June 9, 2026.
  2. Anthropic. "Introducing Claude Fable 5 and Claude Mythos 5" (Mythos 5 restriction statement); Anthropic Platform documentation, "Introducing Claude Fable 5 and Claude Mythos 5," platform.claude.com/docs/en/about-claude/models/introducing-claude-fable-5-and-claude-mythos-5, June 9, 2026.
  3. Anthropic. "Introducing Claude Fable 5 and Claude Mythos 5," fallback disclosure: "Users will be informed whenever this occurs." anthropic.com/news/claude-fable-5-mythos-5, June 9, 2026.
  4. Anthropic Platform documentation. "Refusals and fallback." platform.claude.com/docs/en/build-with-claude/refusals-and-fallback, June 9, 2026; Claude Fable 5 & Claude Mythos 5 System Card (surface-specific fallback behavior), June 9, 2026.
  5. Claude Fable 5 & Claude Mythos 5 System Card (frontier-LLM-development safeguard, ~0.03% of traffic, no fallback, no notification), www-cdn.anthropic.com, June 9, 2026; independent coverage at digg.com/ai/qle3xf2z, June 2026.
  6. Anthropic. "Introducing Claude Fable 5 and Claude Mythos 5," 30-day retention policy for Mythos-class models. anthropic.com/news/claude-fable-5-mythos-5, June 9, 2026.
  7. AWS News Blog. "Anthropic Claude Fable 5 on AWS: Mythos-class capabilities with built-in safeguards now available." aws.amazon.com/blogs/aws, June 9, 2026.
  8. Claude Fable 5 & Claude Mythos 5 System Card (RSP/FCF threshold self-assessment), June 9, 2026; Handy AI, "Model drop: Fable 5 / Mythos 5" (no-ASL-designation observation), handyai.substack.com, June 9, 2026.
  9. Anthropic. "Introducing Claude Fable 5 and Claude Mythos 5," cyber-safeguard effectiveness claims (unnamed external partner, unpublished evaluation). anthropic.com/news/claude-fable-5-mythos-5, June 9, 2026.
  10. Anthropic. "Project Glasswing: Securing critical software for the AI era." anthropic.com/glasswing, April 7, 2026.
  11. Anthropic. "Responsible Scaling Policy" (tiered access system; enhanced due diligence; vetting as compensating control). anthropic.com/responsible-scaling-policy, accessed June 9, 2026.
  12. Anthropic. "Responsible Scaling Policy" (Capability Thresholds; ASL-3 Security and Deployment Standards conditional commitment). anthropic.com/responsible-scaling-policy, accessed June 9, 2026.
  13. Anthropic. "Expanding Project Glasswing." anthropic.com/news/expanding-project-glasswing, June 2, 2026.
  14. OpenAI. "Introducing Trusted Access for Cyber." openai.com/index/trusted-access-for-cyber, February 5, 2026.
  15. OpenAI. "Trusted access for the next era of cyber defense." openai.com/index/scaling-trusted-access-for-cyber-defense, April 14, 2026.
  16. Capoot, A. "Anthropic releases Mythos-like AI model to the public two months after private rollout rocked Wall Street." CNBC, cnbc.com/2026/06/09/anthropic-mythos-claude-fable-5.html, June 9, 2026.
  17. Axios. "Anthropic and OpenAI are now cybersecurity's kingmakers." axios.com/2026/06/09/anthropic-openai-mythos-ai-model-access, June 9, 2026.
  18. Cybersecurity Dive. Reporting on the Project Glasswing expansion and Mythos-class release timeline. cybersecuritydive.com/news/ai-anthropic-claude-mythos-project-glasswing-expand/821714, June 2026.
  19. European Union. AI Act, Article 72: Post-Market Monitoring by Providers. artificialintelligenceact.eu/article/72, accessed June 9, 2026.
  20. European Union. AI Act, Article 51 (classification of general-purpose AI models with systemic risk; 10^25-FLOP presumption) and Article 55(1)(c) (serious-incident obligations). artificialintelligenceact.eu/article/51 and /article/55, accessed June 9, 2026.
  21. EU AI Office. General-Purpose AI Code of Practice, Safety and Security chapter. artificialintelligenceact.eu/code-of-practice-overview, accessed June 9, 2026.
  22. NIST. "Artificial Intelligence Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, January 2023; MANAGE 4.1 text via the NIST AIRC Playbook, airc.nist.gov/airmf-resources/playbook/manage.
  23. CSO Online. "Anthropic releases Mythos-class Fable 5 model with safeguards for cyber risks" (SANS' Rob T. Lee on classifier breadth). csoonline.com/article/4183094, June 2026.
  24. Board of Governors of the Federal Reserve System. SR 11-7: "Guidance on Model Risk Management," April 4, 2011, federalreserve.gov/boarddocs/srletters/2011/sr1107.htm; superseded by SR 26-2, "Revised Guidance on Model Risk Management," April 17, 2026.
  25. ISO/IEC 42001:2023. "Information technology: Artificial intelligence: Management system." iso.org/standard/81230.html.
  26. AICPA Trust Services Criteria, CC8.1 (change management): "The entity authorizes, designs, develops or acquires, configures, documents, tests, approves, and implements changes to infrastructure, data, software, and procedures to meet its objectives."
  27. Willison, S. "Cybersecurity Looks Like Proof of Work Now." simonwillison.net/2026/Apr/14/cybersecurity-proof-of-work, April 14, 2026; UK AI Security Institute, "Our evaluation of Claude Mythos Preview's cyber capabilities," aisi.gov.uk, April 2026.
  28. flyingpenguin. "Executive Summary for Claude Mythos Project Glasswing: June 2026 Verification Status." flyingpenguin.com/executive-summary-for-claude-mythos-project-glasswing-june-2026-verification-status, June 2026.