Skip to main content
kellerai.blog

The Blast Radius of Opus 4.8

A modest model release still ripples through 94 plugins, 53 agent bundles, and a deferred-tool gateway built on the prior model.

KellerAI White Paper · In-Depth · Model Governance & Upgrades · May 2026 · ~16 min read

Context

Anthropic shipped Claude Opus 4.8 on 2026-05-28 as "a modest but tangible improvement" with narrow headline deltas: a few benchmark points, a behavioral honesty refinement, an unchanged sticker price. The argument of this paper is that the size of a model release is the wrong lens for an operator. Opus 4.8 arrived alongside a Claude Code platform update (v2.1.154–2.1.156) that changed the default effort tier to xhigh, added Dynamic Workflows, shipped alwaysLoad and defaultEnabled plugin flags, and introduced mid-conversation system messages — a payload that is not modest at all.

For an ecosystem of 94 marketplace plugins—53 shipping agents, 25 shipping hooks, 7 shipping MCP servers, and 11 hardcoding claude-sonnet-4-6—a model-default change is not a single switch but a propagation path through agents, hooks, and tool surfaces. The model is the trigger; the platform changes are the payload; and the ecosystem built on the prior model is the surface that absorbs it.

The Finding

The vendor-reported benchmarks show consistent but narrow gains (SWE-bench Pro +4.9pp, Terminal-Bench +8.5pp, OSWorld +0.6pp), and the headline behavioral claim—"4 times fewer unflagged code flaws"—is an internal coding evaluation with no independent replication as of 2026-05-29. A benchmark score is a function of the scaffolding around the model, not just the model; an operator cannot adopt the vendor's number as its own validation.

The blast radius is not that Opus 4.8 is risky; it is that a modest release still demands a deliberate, governed adoption pass: run the 11 Sonnet-hardcoded plugins through a model-default migration pass before 2026-06-15, wire effort-to-model routing in the creator family, stand up per-model surface profiles (haiku_deferred / sonnet_mixed / opus_all) so tool loading is appropriate to the model in use, and complete the agent-creator to kellerai-agent-creator rename while the namespace cost is lowest.

Tags:
Model Upgrade GovernanceAgent Ecosystem PropagationPlatform Surface Mapping
Cite this paper

KellerAI. (2026, May 29). The Blast Radius of a Modest Release: Governing the Opus 4.8 Upgrade Across a 94-Plugin Agent Ecosystem. KellerAI. https://kellerai.blog/opus-4-8-blast-radius-in-depth

Paper Details
CategoryModel Governance & Upgrades
AudienceEngineering teams, platform architects, and compliance leads governing agent ecosystem upgrades
MethodComprehensive platform audit (Claude Code v2.1.154–2.1.156, 94-plugin marketplace composition) + vendor benchmark analysis + propagation-path tracing across agents, hooks, and MCP surfaces + six-week planning window analysis + sequencing against 2026-06-15 Agent SDK credit change deadline
Length~3,950 · ~16 min
Reading levelTechnical
Sections8
References61
Versionv1.0 · Updated May 2026
PublishedMay 2026
Key Takeaways
  • A modest model release has a non-trivial blast radius when the model ships alongside platform changes that propagate through agents, hooks, and deferred-tool surfaces.
  • The vendor's benchmark is not your validation; harness divergence (87.6% vendor vs 82.0% independent on prior model) and selective omission of competitors' higher scores are structural integrity problems, not fabrication.
  • The 2026-06-15 Agent SDK credit change anchors sequencing: four P0 enhancements (model-default migration, effort-to-model routing, per-model surface profiles, kellerai-agent-creator rename) must land before that date; four P1 enhancements follow within four weeks.
Related
Placeholder — pending analytics
Section 01

Abstract

Anthropic shipped Claude Opus 4.8 on 2026-05-28 and described it in its own materials as “a modest but tangible improvement on its predecessor” 1 . The headline deltas are narrow — a few benchmark points, a behavioral honesty refinement, an unchanged sticker price 14 . The argument of this paper is that the size of the model release is the wrong lens for an operator. What matters is the surface the release touches. Opus 4.8 arrived alongside a Claude Code platform update (v2.1.154–2.1.156) that changed the default effort tier, added Dynamic Workflows, shipped alwaysLoad and defaultEnabled plugin flags, and introduced mid-conversation system messages 791415 . For an ecosystem of 94 marketplace plugins — 53 of them shipping agents, 25 shipping hooks, 7 shipping MCP servers, and 11 hardcoding claude-sonnet-4-6 — a model-default change is not a single switch but a propagation path through agents, hooks, and tool surfaces 3738394041 . This review catalogs what actually changed, separates vendor-reported evaluation numbers from anything an operator can independently trust, traces the blast radius through the marketplace, and proposes a prioritized enhancement and sequencing plan anchored to the 2026-06-15 Agent SDK credit change 266 . The conclusion is not that Opus 4.8 is risky; it is that a modest release still demands a deliberate, governed adoption pass.

A modest release still demands a deliberate, governed adoption pass. The model is the trigger; the platform changes are the payload; and the ecosystem built on the prior model is the surface that absorbs it.

The thesis
Section 02

The Trigger

Claude Opus 4.8 reached general availability on 2026-05-28 under the API model id claude-opus-4-8 12 . The release closed a 41-to-42-day gap from Opus 4.7, which shipped 2026-04-16 — the shortest gap so far between Opus releases, against prior gaps of 70 to 75 days 3 . Anthropic's own framing for the release was deliberately understated: “a modest but tangible improvement on its predecessor” 1 . Press coverage attributed the compressed cadence to competitive pressure from OpenAI's GPT-5.5 and Codex line and Google's Gemini 3.1, following what reporting called a “chilly reception” for Opus 4.7 16 .

The vendor's modesty is, for once, accurate at the level of the model. The benchmark deltas are narrow and the architecture appears unchanged; the genuine advance is behavioral rather than a capability jump 1 . But “modest” is a statement about the model, not about the operator's exposure. An operator who has built an agent ecosystem on the prior model inherits a different question: not “is the new model better” but “what in my system assumed the old defaults, and what breaks or drifts when those defaults move.”

That distinction is the thesis of this paper. A modest release has a non-trivial blast radius precisely because the model is not shipped alone. Opus 4.8 went out the same day as a Claude Code platform update that changed the default effort tier to xhigh, co-launched Dynamic Workflows as a research preview, and added plugin and MCP flags that change how tools load 971415 . The model is the trigger; the platform changes are the payload; and the ecosystem built on the prior model is the surface that absorbs it.

This review covers the window 2026-03-02 through 2026-05-29, with focus on the 2026-04-16 (Opus 4.7) through 2026-05-28 (Opus 4.8) transition 5 . It treats three sweep areas as one connected system: the creator plugin family that authors skills and agents, the 94-plugin marketplace those creators populate, and the tool-search / MCP / gateway surface through which every plugin's tools reach the model 193755 .

Section 03

What Actually Changed

2.1 The model

Opus 4.8 carries a 1M-token context window by default on the Claude API, Amazon Bedrock, and Google Vertex AI 6 . Microsoft Foundry is the exception at 200k, a limit the marketing headline of “1M context window” omits 617 . Maximum output remains 128k, unchanged from 4.7 6 . The 1M context itself is not new to 4.8 — it went generally available at standard pricing on 2026-03-13, ending the earlier long-context surcharge 6 .

Two behavioral controls matter for agent operators. Adaptive thinking triggers reasoning only when needed, and 4.8 narrows the problem space relative to 4.6 and 4.7 10 . The xhigh effort tier — introduced with Opus 4.7 on 2026-04-16 and recommended for coding and agentic work — sits above low, medium, and high, with max removing token constraints entirely 11 . An independent test measured a 2.7× token-cost spread between the cheapest and most expensive effort tier on 4.7, which sets the stakes for any default-effort change 11 .

The flagship behavioral claim is that Opus 4.8 is “around four times less likely than 4.7 to allow flaws in code it has written to pass unremarked” 1 . This is the most-cited line in the release, and it carries a hard caveat: it is an internal coding-evaluation result with no independent replication as of 2026-05-29 1 . The alignment claims — “new highs on prosocial traits,” substantially lower misaligned behavior — are similarly unverifiable without the system card 1 . The honest reading is that the genuine advance is behavioral honesty: a model trained to flag its own uncertainty is qualitatively different from one that is merely more capable, and it is hard to benchmark independently 1 . An operator can value that direction without treating “4× fewer unflagged flaws” as a number it has verified.

2.2 Pricing

Standard API pricing is unchanged: $5 per million input tokens and $25 per million output tokens, holding flat across Opus 4.5, 4.6, 4.7, and 4.8 4 . Batch processing is 50% off ($2.50 / $12.50), cache reads are 90% off, and the minimum prompt-cache length for 4.8 dropped to 1,024 tokens, lower than 4.7 4 . That lower cache floor is a direct lever for the creator family, which re-reads the same skill bodies and reference files on every invocation 45 .

The sticker price hides an effective-cost caveat. The Opus 4.7 tokenizer emits up to 35% more tokens for equivalent input than 4.6, raising effective cost above the flat sticker rate, and this is likely inherited by 4.8 12 . None of the 4.8 launch materials disclose this 12 .

Fast Mode is a new pricing tier: $10 / $50 per million tokens (2× the standard rate) for roughly 2.5× faster output, marketed as three times cheaper than the prior Opus 4.6 fast tier of $30 / $150 4 . It is a research preview, Claude-API-only — not on Bedrock, Vertex, or Foundry — and waitlisted, with a /fast toggle available in Claude Code 414 . In Claude Code, fast mode defaulted to Opus 4.7 as of v2.1.142 before the 4.8 toggle landed 14 .

2.3 The Claude Code platform

The platform changes shipped across 27 point releases in 87 days, with the Opus 4.8 release landing in v2.1.154 and a thinking-block fix following in v2.1.156 13 . The current versions an operator should pin against are Claude Code 2.1.156, kellerai-skill-creator 2.1.8, agent-creator 1.6.1, skill-creator 1.6.1, and a marketplace of 94 plugins 1320212237 .

Effort and fast mode. The default effort for Opus 4.8 in Claude Code is now xhigh (v2.1.154); UI labels were renamed from “Speed”/“Intelligence” to “Faster”/“Smarter,” and a /fast toggle delivers 2.5× speed at identical quality, three times cheaper than Opus 4.7 fast mode 9 . User-facing effort control also launched on claude.ai and Cowork on 2026-05-28 11 . A lean system prompt is now the default for Claude Code on all models except Haiku, Sonnet, and Opus ≤4.7, reducing startup token cost 27 .

Dynamic Workflows. This is the most consequential new orchestration capability for the creator family. Claude writes a JavaScript workflow script that coordinates tens to hundreds of parallel subagents in a single session, verifying outputs before reporting 7 . The runtime caps concurrency at 16 simultaneous agents and 1,000 agents total per run, with live progress tracking 15 . Availability spans Claude Code CLI, Desktop, and VS Code for Max/Team/Enterprise, plus API/Bedrock/Vertex/Foundry, activated by an ultracode setting or automatically 7 . The warning attached is that workflows “can consume substantially more tokens than a typical Claude Code session” 7 . The workflow script syntax and control-flow primitives are not yet publicly documented 28 .

Tool search and deferred loading. The tool_search_tool_regex_20251119 and tool_search_tool_bm25_20251119 variants support precise name matching and natural-language discovery respectively, returning the top three to five tools per query 29 . Tools can opt into defer_loading: true, where the name is visible but the full definition loads only on discovery, saving roughly 20–30K tokens at session start 29 . New in v2.1.154, MCP servers can declare alwaysLoad: true to skip deferral, and gateway_session_surface_apply(project_path?) recomputes the session tool surface 1530 .

Skills unification. Commands and skills are unified: a file at .claude/commands/deploy.md and .claude/skills/deploy/SKILL.md both create /deploy and behave identically, backwards-compatible 31 . Skills gained a disallowed-tools frontmatter field (v2.1.152) for read-only or restricted-scope skills, a /reload-skills command, and the ability for SessionStart hooks to return reloadSkills: true to make new skills available mid-session 32 .

Plugin flags. Plugins can declare defaultEnabled: false (v2.1.154) to skip auto-enable on install 33 . The /plugin browse pane now shows projected context cost per plugin, lists each plugin's components, and enforces dependency relationships 34 . The plugin manifest still lives at .claude-plugin/plugin.json and must not declare agents, skills, commands, or hooks arrays — those are auto-discovered from directory structure 35 .

Mid-conversation system messages and MCP. The Messages API now accepts role:"system" inside the messages array, allowing instructions, permissions, or token budget to be updated mid-run without breaking the prompt cache (2026-05-28) 8 . New hook events MessageDisplay (v2.1.152) and InstructionsLoaded arrived alongside effort-level exposure in hooks via effort.level and $CLAUDE_EFFORT (v2.1.133) 3642 . Stdio MCP subprocesses now receive CLAUDE_CODE_SESSION_ID, CLAUDE_PROJECT_DIR, and CLAUDECODE=1 1843 .

The Agent SDK credit change. From 2026-06-15, Agent SDK usage on a subscription draws from a separate monthly Agent SDK credit bucket rather than standard interactive limits 26 . This is the single dated deadline in the window and it anchors the sequencing in §6.

Section 04

Evaluation Integrity

The vendor-reported benchmarks show consistent but narrow gains. SWE-bench Pro rises from 64.3% to 69.2% (+4.9pp), OSWorld-Verified from 82.8% to 83.4% (+0.6pp), and Terminal-Bench 2.1 from 66.1% to 74.6% (+8.5pp); Online-Mind2Web posts 84.0% and Humanity's Last Exam 49.8% without tools, 57.9% with tools 51 . None of these has been independently replicated for 4.8 as of 2026-05-29 51 .

The central integrity problem is harness divergence. On the prior model, Anthropic reported SWE-bench Verified at 87.6% while the independent vals.ai harness measured 82.0% — a 5.6-point gap that is undisclosed and likely applies to 4.8 37 . The mechanism matters more than the number: a benchmark score is a function of the scaffolding around the model, not just the model, so a vendor's harness and an operator's harness can legitimately produce different results on the identical model. An operator cannot adopt the vendor's 87.6% as its own validation; it can only adopt the number its own harness produces.

The headline framing is also selective. Anthropic's positioning that 4.8 wins “across almost all benchmarks” omits Terminal-Bench 2.1, where GPT-5.5 scores 78.2% against Opus 4.8's 74.6% — a regression relative to the competitor that the headline drops 36 . On SWE-bench Pro, Opus 4.8 (69.2) does lead GPT-5.5 (58.6) and Gemini 3.1 Pro (54.2), and on OSWorld it leads GPT-5.5 (78.7) 36 . The “most capable model” framing is further complicated by Claude Mythos Preview, a restricted model not in the public API that reportedly outscores 4.8 (SWE-bench Pro 77.8%, Verified 93.9%, Terminal-Bench 82.0%) — making 4.8 the most capable publicly available Anthropic model, not the most capable model 5253 .

The “4× fewer unflagged code flaws” claim is internal-eval-only with no independent replication 1 . The independent signals that do exist are modest: CodeRabbit measured +4% recall versus 4.7 with flat precision 24 , and a Hebbia CTO testimonial claiming 61% cheaper per-task token cost came with no methodology and was vendor-curated 23 . The conclusion for evaluation integrity is narrow and firm: the trust concern here is harness-specificity and selective omission, not fabrication 1 . A downstream operator must run its own evals — which is exactly what the creator family is built to do 49 .

A benchmark score is a function of the scaffolding around the model, not just the model. An operator cannot adopt the vendor's number as its own validation; it can only adopt the number its own harness produces.

Evaluation integrity
Section 05

Blast Radius — The Ecosystem

The marketplace under review contains 94 plugins spanning agent development, code quality, issue tracking, reasoning, research, frontend, dev tools, workflow, hooks, compliance, and content 3755 . The exposure is not uniform; it concentrates in three structural counts.

Surface

Count

Why it matters for Opus 4.8 / CC 2.1.156

Plugins hardcoding claude-sonnet-4-611Pinned to a Sonnet alias; exposed to default-mapping shifts [38]
Plugins shipping /agents53Each agent inherits or pins a model; effort default moved to xhigh [39]
Plugins with 5+ agents6Candidate surfaces for Dynamic Workflows orchestration [39]
Plugins shipping /hooks25Exposed to new hook events and effort-in-hooks changes [40]
Plugins shipping MCP servers7Exposed to deferred-loading and ToolSearch masking [41]
Blast-radius shortlist (immediate testing)15Highest-leverage combinations of the above [54]

The 11 model-hardcoded plugins are the most concrete fragility. They pin claude-sonnet-4-6 (frontend-mobile additionally references the Sonnet 4.5 preview claude-sonnet-4-20250514), and they include fresh-eyes, frontend-mobile, grok-imagine-browser, kellerai-chief-of-staff, kellerai-repo-audit, magic, morphllm-sdk, plugin-healer (documentation examples only), quartermaster, the-collector, and thoughtbox 38 . A hardcoded Sonnet pin is not broken by an Opus release, but it is silently excluded from any Opus 4.8 upside and is exposed if Claude Code shifts the Sonnet alias mapping 38 .

The propagation path explains why a single default matters. A model-default change does not stop at the /model setting; it ripples outward through three layers. At the agent layer, the 53 agent-bearing plugins either pin a model or inherit the session default, so the new xhigh effort default and any model-alias shift propagate into every inherited agent simultaneously 399 . At the hook layer, the 25 hook-bearing plugins receive effort level via effort.level and $CLAUDE_EFFORT, so a changed default reaches hook logic that branches on effort, and the new MessageDisplay and InstructionsLoaded events expand the hook contract those plugins must satisfy 424036 . At the MCP layer, the 7 MCP-bearing plugins expose tools through a gateway that defers loading; a model whose context budget or tool-accuracy threshold differs changes which tools should be surfaced, and ToolSearch can mask tool availability if it is not tuned for the new tier 4155 . One default at the top thus fans out to agents, hooks, and tool surfaces without any single plugin being explicitly edited.

The 15-plugin shortlist for immediate testing concentrates this exposure. The highest-ranked are keller-pr-review (0.57.0), which carries an explicit Opus fallback and must be tested for 4.8 speedup and instruction fidelity; morphllm-sdk (3.1.2), the central gateway with deferred MCP loading and ToolSearch exposure; thoughtbox (0.29.2), with six reasoning agents hardcoded to Sonnet; and cross-discipline-research (1.12.1), a multi-model plugin with 23 MCP tools subject to ToolSearch deferral 54 . The shortlist continues through fresh-eyes, kellerai-chief-of-staff, the-collector, quartermaster, kellerai-skill-creator, ralph-loop-sica, live-code-qa, frontend-mobile, kellerai-grc, magic, and vc-pitch-prep — each combining model hardcoding, multi-agent coordination, hook timing, or MCP exposure 54 . The blast-radius headline is that Opus 4.8's speed and instruction precision are upside for the reasoning-heavy plugins, but the 11 Sonnet-pinned plugins need model-routing updates and all 53 agent-bearing plus 25 hook-bearing plugins need a CC 2.1.156 compatibility pass within four weeks 55 .

Section 06

Enhancement Catalog

The creator family is unusually well-positioned for this upgrade because it has no model-version hardcoding: kellerai-skill-creator, agent-creator, and skill-creator all default to model: sonnet with no pinned 4.5/4.6/4.7 ids in any agent frontmatter 4450 . The opportunity is therefore not remediation but exploitation — the family leaves prompt caching, structured outputs, mid-conversation overrides, and effort-based routing on the table 45464748 . The catalog below spans all three sweep areas plus governance, prioritized P0 (land before 2026-06-15), P1 (within four weeks), P2 (next quarter).

Pri

Area

Enhancement

Source

P0MarketplaceModel-default migration pass for the 11 claude-sonnet-4-6-hardcoded plugins[38]
P0Tool-search/MCPPer-model surface profiles (haiku_deferred / sonnet_mixed / opus_all)[60]
P0Creator familyEffort→model routing wired from the existing SQLite effort telemetry[48]
P0GovernanceRename agent-creator → kellerai-agent-creator (folder + internal) for namespace distance[21]
P1Creator familyPrompt caching on SKILL.md / reference reads in the improve loop[45]
P1Creator familyStructured outputs for grading / comparison / analysis JSON[46]
P1MarketplaceCC 2.1.156 hook-event compatibility pass for the 25 hook plugins[40]
P1MarketplaceToolSearch / deferred-loading audit for the 7 MCP plugins[41]
P1Tool-search/MCPplugin.json alwaysLoad / conditional-by-model flags[58]
P2Creator familyGraceful Dynamic-Workflow ↔ inline fallback for subagent fan-out[44]
P2MarketplaceDynamic-Workflow routing for the 6 plugins with 5+ agents[39]
P2Tool-search/MCPContext-budget-driven re-surfacing at turn boundaries[59]
P2Tool-search/MCPMid-conversation system messages for per-turn surface recompute[61]

Creator family. The improve loop re-reads SKILL.md bodies and reference files (schemas.md, mode-diagrams.md, agent instructions) on every invocation; prompt caching them is estimated at roughly 20% token savings on multi-iteration loops, and the 1,024-token cache floor on 4.8 makes more of them cacheable 454 . The grader, comparator, and analyzer hand-curate grading.json, comparison-N.json, and analysis.json today; routing these through structured outputs would enforce schema compliance without post-processing 46 . The SQLite agent-spawn hook already captures effort per dispatch but does not use it for model selection — wiring effort inference to auto-select Haiku, Sonnet, or Opus by eval complexity is a P0 because it lands cost control before the Agent SDK credit change 4826 . The architecture already assumes Dynamic Workflows for parallel fan-out and degrades to inline single-run mode without subagents, so the workflow ↔ inline fallback needs to be made explicit and graceful against the new 16/1,000-agent caps 4415 .

Marketplace. The 11 Sonnet-pinned plugins need an explicit model-default migration pass; this is P0 because it determines whether reasoning-heavy plugins see any 4.8 upside at all 38 . The 25 hook-bearing plugins need a CC 2.1.156 compatibility pass covering the new MessageDisplay and InstructionsLoaded events and the effort-in-hooks fields 403642 . The 7 MCP plugins need a ToolSearch / deferred-loading audit to confirm tools are not masked under the new tier 4155 . The 6 plugins with 5+ agents are the natural first candidates for Dynamic-Workflow routing once the syntax is documented 3928 .

Tool-search / MCP. The gateway config is one-size-fits-all today, with a static core_tools split that does not vary by model tier 5660 . Per-model surface profiles — haiku_deferred, sonnet_mixed, opus_all — are P0 because Opus 4.8's larger context window changes the right core/deferred split, and the current config gives no per-model guidance 60 . The alwaysLoad and conditional-by-model plugin flags (new in v2.1.154) let plugins like thoughtbox and morphllm-sdk declare themselves always-loaded, which closes the gap where deferred tools stay hidden if ToolSearch fails 5830 . Context-budget-driven re-surfacing and mid-conversation system messages are P2 because they require new turn-boundary integration the gateway does not yet have, but they are the mechanism by which the surface could become dynamic rather than fixed at SessionStart 5961 .

Governance. The agent-creator plugin is an upstream Anthropic name carried verbatim, and it pairs with the already-renamed kellerai-skill-creator; renaming it kellerai-agent-creator (folder and internal references) gives the KellerAI fork namespace distance from the upstream plugins it extends 2120 . This is P0 not because it is hard but because the namespace decision is cheapest to make before further forks accumulate against the upstream name.

Section 07

Recommendations & Sequencing

The window has exactly one hard deadline: the Agent SDK credit change on 2026-06-15, which moves SDK usage on a subscription to a separate monthly credit bucket 26 . Everything cost-sensitive that depends on SDK-driven subagent fan-out should land before that date, because after it the cost model for the creator family's parallel runs and any Dynamic-Workflow adoption changes 267 . This makes the four P0 items the first sprint.

Land before 2026-06-15 (the P0 set). First, run the model-default migration pass on the 11 Sonnet-pinned plugins so reasoning-heavy work is eligible for Opus 4.8 routing rather than silently excluded 38 . Second, wire effort→model routing in the creator family from the existing SQLite effort telemetry, so cheap evals run on Haiku and expensive ones on Opus before the credit accounting changes underneath them 4826 . Third, stand up the per-model surface profiles ( haiku_deferred / sonnet_mixed / opus_all) so tool loading is appropriate to the model actually in use, rather than a one-size-fits-all config 60 . Fourth, complete the agent-creator kellerai-agent-creator rename while the namespace cost is lowest 21 .

Within four weeks (the P1 set). Add prompt caching on SKILL.md and reference reads in the improve loop, capturing the estimated ~20% token savings made cheaper by the 1,024-token cache floor 454 . Move grading, comparison, and analysis JSON to structured outputs to eliminate post-processing 46 . Run the CC 2.1.156 hook-event compatibility pass across the 25 hook plugins and the ToolSearch / deferred-loading audit across the 7 MCP plugins 4041 . Add the alwaysLoad / conditional-by-model flags to the plugins that warrant always-loaded status 58 . Use the 15-plugin shortlist as the test ordering for all of this, starting with keller-pr-review, morphllm-sdk, thoughtbox, and cross-discipline-research 54 .

Next quarter (the P2 set). Make the Dynamic-Workflow ↔ inline fallback explicit in the creator family against the 16/1,000-agent caps, then extend Dynamic-Workflow routing to the 6 plugins with 5+ agents once the workflow syntax is publicly documented 443928 . Build context-budget-driven re-surfacing and per-turn surface recompute on mid-conversation system messages, the two mechanisms that turn the static SessionStart surface into a dynamic one 5961 .

Each recommendation ties back to a catalog row, and the ordering is driven by the single dated deadline rather than by perceived model risk. The model upgrade is the occasion; the governance pass is the work.

Section 08

Honest Limits

This review cannot confirm Opus 4.8's headline behavioral claim. The “4× fewer unflagged code flaws” result is an internal coding evaluation with no independent replication as of 2026-05-29, and the alignment claims are unverifiable without the system card 1 . The vendor benchmarks in §3 are vendor-reported and none has been independently replicated for 4.8 51 .

The platform surface carries several unverified or gated items. The Dynamic Workflows script syntax and control-flow primitives are not yet publicly documented — the feature is described at a high level only, and subagent-routing mechanics inside a workflow are unknown 28 . Fast Mode is a research preview, Claude-API-only and waitlisted, so its pricing is real but not generally accessible 4 . The Microsoft Foundry 1M-context exception (200k) is documented but the marketing headline omits it 617 . The harness divergence number (87.6% vendor versus 82.0% vals.ai) is measured on the prior model and only “likely” applies to 4.8 37 .

Several gateway-integration claims are forward-looking rather than confirmed. The current session is running Claude Code 2.1.145+ against a claude_code_local profile, and Dynamic Workflows usage in the gateway is “NONE found” — the migration items in §5 and §6 are proposals against the documented feature surface, not observations of working integrations 5557 . The per-model surface profiles, alwaysLoad flags, context-budget re-surfacing, and mid-conversation re-surfacing are all gap-analysis recommendations whose blockers (schema updates, turn-boundary hooks, profile expansion) are documented but unbuilt 60585961 . Finally, the Agent SDK credit change is dated 2026-06-15 but its cost-per-token-per-credit details are not documented, so the sequencing in §6 optimizes around the deadline without modeling the exact post-deadline cost 26 .

References
  1. 1Anthropic. "Claude Opus 4.8." anthropic.com, 2026-05-28. https://www.anthropic.com/news/claude-opus-4-8
  2. 2Coursiv. "Claude Opus 4.8." coursiv.io, 2026-05-28. https://coursiv.io/blog/claude-opus-4-8
  3. 3Hidekazu Konishi. "Claude Opus Release Timeline." hidekazu-konishi.com, 2026-05-28. https://hidekazu-konishi.com
  4. 4Finout. "Claude API Pricing." finout.io, 2026-05-28. https://finout.io
  5. 5cross-discipline-research:technical-claims-agent. Review window and 4.7→4.8 focus. Per Anthropic Opus 4.8 release notes / model card and public benchmark reporting at release (2026-05-28). https://www.anthropic.com/news/claude-opus-4-8
  6. 6Releasebot. "Anthropic Updates — Claude Opus 4.8." releasebot.io, 2026-05-28. https://releasebot.io/updates/anthropic
  7. 7Anthropic / Claude. "Introducing Dynamic Workflows in Claude Code." claude.com, 2026-05-28. https://claude.com/blog/introducing-dynamic-workflows-in-claude-code
  8. 8Anthropic. "Mid-conversation system messages (Messages API)." anthropic.com, 2026-05-28. https://www.anthropic.com/news/claude-opus-4-8
  9. 9Claude Code Changelog. "v2.1.154." code.claude.com, 2026-05-28. https://code.claude.com/docs/en/changelog
  10. 10CodeRabbit. "Opus 4.8 adaptive thinking test." youtube.com (LzgPzQud0zA), 2026-05-28. https://www.youtube.com/watch?v=LzgPzQud0zA
  11. 11Claude Platform. "Effort controls." platform.claude.com, 2026-05-28. https://platform.claude.com/effort
  12. 12Finout. "Claude Opus 4.7 tokenizer / effective-cost story." finout.io, 2026-04-16. https://finout.io
  13. 13Claude Code Changelog. "v2.1.130–v2.1.156 — 27 point releases in 87 days." Per public Claude Code changelog. code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
  14. 14Claude Code Changelog. "v2.1.142 — fast mode default model." code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
  15. 15MarkTechPost. "Anthropic Ships Claude Opus 4.8 alongside Dynamic Workflows and cheaper Fast Mode, workflows capped at 1,000 subagents." marktechpost.com, 2026-05-28. https://www.marktechpost.com/2026/05/28/anthropic-ships-claude-opus-4-8-alongside-dynamic-workflows-and-cheaper-fast-mode-with-workflows-capped-at-1000-subagents/
  16. 16TechCrunch. "Anthropic releases Claude Opus 4.8." techcrunch.com, 2026-05-28. https://techcrunch.com
  17. 17ReleaseBot / Coursiv. "1M context default; Foundry 200k exception." releasebot.io, 2026-05-28. https://releasebot.io/updates/anthropic
  18. 18Claude Platform. "MCP connector — stdio environment variables." platform.claude.com, 2026-05. https://platform.claude.com/mcp-connector
  19. 19Anthropic. "Claude Code plugins documentation — creator plugin family." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
  20. 20KellerAI internal codebase. "kellerai-skill-creator plugin identity (v2.1.8)." KellerAI's kellerai-skill-creator plugin, plugin.json manifest (non-public). See also /kellerai-skill-creator. Illustrative; not verbatim: { "name": "kellerai-skill-creator", "version": "2.1.8", "description": "..." }
  21. 21KellerAI internal codebase. "agent-creator plugin identity (v1.6.1)." KellerAI's agent-creator plugin, plugin.json manifest (non-public). Illustrative; not verbatim: { "name": "agent-creator", "version": "1.6.1", "description": "..." }
  22. 22KellerAI internal codebase. "skill-creator plugin identity (v1.6.1)." KellerAI's skill-creator plugin, plugin.json manifest (non-public). Illustrative; not verbatim: { "name": "skill-creator", "version": "1.6.1", "description": "..." }
  23. 23Anthropic / Claude. "Hebbia CTO testimonial — 61% cheaper per-task token cost." claude.com, 2026-05-28. https://claude.com/opus
  24. 24CodeRabbit. "Independent Opus 4.8 test — +4% recall, flat precision." youtube.com (LzgPzQud0zA), 2026-05-28. https://www.youtube.com/watch?v=LzgPzQud0zA
  25. 25Every. "Vibe check — Senior Engineer Benchmark." every.to, 2026-05-28. https://every.to/vibe-check
  26. 26Claude Code. "Agent SDK overview — SDK billing change June 15, 2026." code.claude.com, 2026-05. https://code.claude.com/docs/en/agent-sdk/overview.md
  27. 27Claude Code Changelog. "v2.1.154 — lean system prompt default." code.claude.com, 2026-05-28. https://code.claude.com/docs/en/changelog
  28. 28Claude Code Changelog. "Dynamic Workflow script syntax — not yet publicly documented." Per public Claude Code changelog and Anthropic release materials; workflow control-flow primitives unverified as of 2026-05-29. code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
  29. 29Anthropic. "Tool-search variants and deferred-loading behavior." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
  30. 30Claude Code Changelog. "v2.1.154 — alwaysLoad MCP flag; gateway session surface apply." code.claude.com, 2026-05-28. https://code.claude.com/docs/en/changelog
  31. 31Claude Code. "Skills documentation — commands/skills unification." code.claude.com, 2026-05. https://code.claude.com/docs/en/skills
  32. 32Claude Code Changelog. "v2.1.152 — disallowed-tools, /reload-skills, reloadSkills hook." code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
  33. 33Claude Code Changelog. "v2.1.154 — defaultEnabled plugin flag." code.claude.com, 2026-05-28. https://code.claude.com/docs/en/changelog
  34. 34Claude Code Changelog. "v2.1.143-2.1.145 — /plugin projected context cost, components, dependency enforcement." code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
  35. 35Claude Code. "Plugins documentation — plugin.json schema and auto-discovery." code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
  36. 36Anthropic / public benchmark reporting. "Terminal-Bench 2.1 regression vs GPT-5.5; SWE-bench Pro / OSWorld competitive standing." Per Anthropic Opus 4.8 release materials and SWE-bench Verified / Terminal-Bench public leaderboard data, 2026-05-28. https://www.anthropic.com/news/claude-opus-4-8
  37. 37SWE-bench Verified leaderboard / vals.ai. "Harness divergence: 87.6% vendor vs 82.0% vals.ai (prior model)." Per SWE-bench Verified public leaderboard and Anthropic Opus 4.8 release notes / model card. https://www.anthropic.com/news/claude-opus-4-8
  38. 38Internal review artifacts (KellerAI). "11 plugins hardcoding claude-sonnet-4-6." Per internal marketplace audit, 2026-05-29.
  39. 39Internal review artifacts (KellerAI). "53 plugins with /agents; 6 with 5+ agents." Per internal marketplace audit, 2026-05-29.
  40. 40Internal review artifacts (KellerAI). "25 plugins with /hooks." Per internal marketplace audit, 2026-05-29.
  41. 41Internal review artifacts (KellerAI). "7 plugins with MCP servers." Per internal marketplace audit, 2026-05-29.
  42. 42Claude Code Changelog. "v2.1.133 — effort.level and $CLAUDE_EFFORT in hooks." code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
  43. 43Claude Code Changelog. "v2.1.139,v2.1.154 — CLAUDE_PROJECT_DIR, CLAUDECODE=1 for stdio MCP." code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
  44. 44Anthropic. "Claude Code plugins documentation — Dynamic Workflows; creator plugin architecture." Per public Anthropic plugins documentation; Dynamic Workflows assumed for parallel fan-out; graceful degradation to inline; no model hardcoding. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
  45. 45Anthropic. "Claude Code plugins documentation — prompt caching on SKILL.md / reference reads (~20% savings)." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
  46. 46Anthropic. "Claude Code plugins documentation — structured outputs for grading / comparison / analysis JSON." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
  47. 47Anthropic. "Claude Code plugins documentation — mid-conversation system message overrides per phase." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
  48. 48Anthropic. "Claude Code plugins documentation — agent-spawn hook captures effort; effort→model routing opportunity." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
  49. 49Anthropic. "Claude Code plugins documentation — eval / improve / create / benchmark modes (operator-run evals)." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
  50. 50Anthropic. "Claude Code plugins documentation — baked-in assumptions: model: sonnet, no version hardcoding." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
  51. 51Anthropic. "Opus 4.8 vendor benchmark table (SWE-bench Pro, OSWorld, Terminal-Bench, HLE, Online-Mind2Web)." Per Anthropic Opus 4.8 release notes / model card and SWE-bench Verified / Terminal-Bench public leaderboard reporting; none independently replicated for 4.8 as of 2026-05-29. https://www.anthropic.com/news/claude-opus-4-8
  52. 52Anthropic. "Claude Mythos Preview leads 4.8; \"most capable publicly available\" framing." Per Anthropic Opus 4.8 release notes / model card, 2026-05-28. https://www.anthropic.com/news/claude-opus-4-8
  53. 53Vellum. "Claude Mythos Preview benchmarks." vellum.ai, 2026-05. https://www.vellum.ai
  54. 54Internal review artifacts (KellerAI). "15-plugin blast-radius shortlist with ranks and actions." Per internal marketplace audit, 2026-05-29.
  55. 55Internal review artifacts (KellerAI). "Executive summary: counts, fragility, 4-week compatibility-pass headline." Per internal marketplace audit, 2026-05-29.
  56. 56Anthropic. "Tool-search / MCP documentation — deferred-loading config and per-server core overrides." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
  57. 57Anthropic. "Tool-search / MCP documentation — Dynamic Workflows gateway integration gap; migration requirements." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
  58. 58Anthropic. "Tool-search / MCP documentation — missing alwaysLoad / defaultEnabled / onlyFor plugin flags." Per public Anthropic tool-search / MCP and plugins documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
  59. 59Anthropic. "Tool-search / MCP documentation — context-budget signaling not modeled at turn boundaries." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
  60. 60Anthropic. "Tool-search / MCP documentation — no per-model tool strategies; haiku_deferred / sonnet_mixed / opus_all profiles." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
  61. 61Anthropic. "Tool-search / MCP documentation — mid-conversation system messages not leveraged for surface recompute." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search