The Blast Radius of a Modest Release: Governing the Opus 4.8 Upgrade Across a 94-Plugin Agent Ecosystem

KellerAI

The Blast Radius of Opus 4.8

A modest model release still ripples through 94 plugins, 53 agent bundles, and a deferred-tool gateway built on the prior model.

KellerAI White Paper · In-Depth · Model Governance & Upgrades · May 2026 · ~16 min read

Context

Anthropic shipped Claude Opus 4.8 on 2026-05-28 as "a modest but tangible improvement" with narrow headline deltas: a few benchmark points, a behavioral honesty refinement, an unchanged sticker price. The argument of this paper is that the size of a model release is the wrong lens for an operator. Opus 4.8 arrived alongside a Claude Code platform update (v2.1.154–2.1.156) that changed the default effort tier to xhigh, added Dynamic Workflows, shipped alwaysLoad and defaultEnabled plugin flags, and introduced mid-conversation system messages — a payload that is not modest at all.

For an ecosystem of 94 marketplace plugins—53 shipping agents, 25 shipping hooks, 7 shipping MCP servers, and 11 hardcoding claude-sonnet-4-6—a model-default change is not a single switch but a propagation path through agents, hooks, and tool surfaces. The model is the trigger; the platform changes are the payload; and the ecosystem built on the prior model is the surface that absorbs it.

The Finding

The vendor-reported benchmarks show consistent but narrow gains (SWE-bench Pro +4.9pp, Terminal-Bench +8.5pp, OSWorld +0.6pp), and the headline behavioral claim—"4 times fewer unflagged code flaws"—is an internal coding evaluation with no independent replication as of 2026-05-29. A benchmark score is a function of the scaffolding around the model, not just the model; an operator cannot adopt the vendor's number as its own validation.

The blast radius is not that Opus 4.8 is risky; it is that a modest release still demands a deliberate, governed adoption pass: run the 11 Sonnet-hardcoded plugins through a model-default migration pass before 2026-06-15, wire effort-to-model routing in the creator family, stand up per-model surface profiles (haiku_deferred / sonnet_mixed / opus_all) so tool loading is appropriate to the model in use, and complete the agent-creator to kellerai-agent-creator rename while the namespace cost is lowest.

Tags:

Model Upgrade GovernanceAgent Ecosystem PropagationPlatform Surface Mapping

Share:LinkedIn Reddit X

Cite this paper

KellerAI. (2026, May 29). The Blast Radius of a Modest Release: Governing the Opus 4.8 Upgrade Across a 94-Plugin Agent Ecosystem. KellerAI. https://kellerai.blog/opus-4-8-blast-radius-in-depth

Paper Details

CategoryModel Governance & Upgrades

AudienceEngineering teams, platform architects, and compliance leads governing agent ecosystem upgrades

MethodComprehensive platform audit (Claude Code v2.1.154–2.1.156, 94-plugin marketplace composition) + vendor benchmark analysis + propagation-path tracing across agents, hooks, and MCP surfaces + six-week planning window analysis + sequencing against 2026-06-15 Agent SDK credit change deadline

Length~3,950 · ~16 min

Reading levelTechnical

Sections8

References61

Versionv1.0 · Updated May 2026

PublishedMay 2026

Key Takeaways

—A modest model release has a non-trivial blast radius when the model ships alongside platform changes that propagate through agents, hooks, and deferred-tool surfaces.
—The vendor's benchmark is not your validation; harness divergence (87.6% vendor vs 82.0% independent on prior model) and selective omission of competitors' higher scores are structural integrity problems, not fabrication.
—The 2026-06-15 Agent SDK credit change anchors sequencing: four P0 enhancements (model-default migration, effort-to-model routing, per-model surface profiles, kellerai-agent-creator rename) must land before that date; four P1 enhancements follow within four weeks.

When a "Modest" Model Release Isn't
Brief·In-depth
What Changes When the Model Changes
Brief·In-depth
Capability Convergence and the Vendor Dependency Trap
Brief·In-depth

Placeholder — pending analytics

Section 01

Abstract

Anthropic shipped Claude Opus 4.8 on 2026-05-28 and described it in its own materials as “a modest but tangible improvement on its predecessor” ¹ . The headline deltas are narrow — a few benchmark points, a behavioral honesty refinement, an unchanged sticker price ¹⁴ . The argument of this paper is that the size of the model release is the wrong lens for an operator. What matters is the surface the release touches. Opus 4.8 arrived alongside a Claude Code platform update (v2.1.154–2.1.156) that changed the default effort tier, added Dynamic Workflows, shipped alwaysLoad and defaultEnabled plugin flags, and introduced mid-conversation system messages ⁷⁹¹⁴¹⁵ . For an ecosystem of 94 marketplace plugins — 53 of them shipping agents, 25 shipping hooks, 7 shipping MCP servers, and 11 hardcoding claude-sonnet-4-6 — a model-default change is not a single switch but a propagation path through agents, hooks, and tool surfaces ³⁷³⁸³⁹⁴⁰⁴¹ . This review catalogs what actually changed, separates vendor-reported evaluation numbers from anything an operator can independently trust, traces the blast radius through the marketplace, and proposes a prioritized enhancement and sequencing plan anchored to the 2026-06-15 Agent SDK credit change ²⁶⁶ . The conclusion is not that Opus 4.8 is risky; it is that a modest release still demands a deliberate, governed adoption pass.

A modest release still demands a deliberate, governed adoption pass. The model is the trigger; the platform changes are the payload; and the ecosystem built on the prior model is the surface that absorbs it.

The thesis

Section 02

The Trigger

Claude Opus 4.8 reached general availability on 2026-05-28 under the API model id claude-opus-4-8 ¹² . The release closed a 41-to-42-day gap from Opus 4.7, which shipped 2026-04-16 — the shortest gap so far between Opus releases, against prior gaps of 70 to 75 days ³ . Anthropic's own framing for the release was deliberately understated: “a modest but tangible improvement on its predecessor” ¹ . Press coverage attributed the compressed cadence to competitive pressure from OpenAI's GPT-5.5 and Codex line and Google's Gemini 3.1, following what reporting called a “chilly reception” for Opus 4.7 ¹⁶ .

The vendor's modesty is, for once, accurate at the level of the model. The benchmark deltas are narrow and the architecture appears unchanged; the genuine advance is behavioral rather than a capability jump ¹ . But “modest” is a statement about the model, not about the operator's exposure. An operator who has built an agent ecosystem on the prior model inherits a different question: not “is the new model better” but “what in my system assumed the old defaults, and what breaks or drifts when those defaults move.”

That distinction is the thesis of this paper. A modest release has a non-trivial blast radius precisely because the model is not shipped alone. Opus 4.8 went out the same day as a Claude Code platform update that changed the default effort tier to xhigh, co-launched Dynamic Workflows as a research preview, and added plugin and MCP flags that change how tools load ⁹⁷¹⁴¹⁵ . The model is the trigger; the platform changes are the payload; and the ecosystem built on the prior model is the surface that absorbs it.

This review covers the window 2026-03-02 through 2026-05-29, with focus on the 2026-04-16 (Opus 4.7) through 2026-05-28 (Opus 4.8) transition ⁵ . It treats three sweep areas as one connected system: the creator plugin family that authors skills and agents, the 94-plugin marketplace those creators populate, and the tool-search / MCP / gateway surface through which every plugin's tools reach the model ¹⁹³⁷⁵⁵ .

Section 03

What Actually Changed

2.1 The model

Opus 4.8 carries a 1M-token context window by default on the Claude API, Amazon Bedrock, and Google Vertex AI ⁶ . Microsoft Foundry is the exception at 200k, a limit the marketing headline of “1M context window” omits ⁶¹⁷ . Maximum output remains 128k, unchanged from 4.7 ⁶ . The 1M context itself is not new to 4.8 — it went generally available at standard pricing on 2026-03-13, ending the earlier long-context surcharge ⁶ .

Two behavioral controls matter for agent operators. Adaptive thinking triggers reasoning only when needed, and 4.8 narrows the problem space relative to 4.6 and 4.7 ¹⁰ . The xhigh effort tier — introduced with Opus 4.7 on 2026-04-16 and recommended for coding and agentic work — sits above low, medium, and high, with max removing token constraints entirely ¹¹ . An independent test measured a 2.7× token-cost spread between the cheapest and most expensive effort tier on 4.7, which sets the stakes for any default-effort change ¹¹ .

The flagship behavioral claim is that Opus 4.8 is “around four times less likely than 4.7 to allow flaws in code it has written to pass unremarked” ¹ . This is the most-cited line in the release, and it carries a hard caveat: it is an internal coding-evaluation result with no independent replication as of 2026-05-29 ¹ . The alignment claims — “new highs on prosocial traits,” substantially lower misaligned behavior — are similarly unverifiable without the system card ¹ . The honest reading is that the genuine advance is behavioral honesty: a model trained to flag its own uncertainty is qualitatively different from one that is merely more capable, and it is hard to benchmark independently ¹ . An operator can value that direction without treating “4× fewer unflagged flaws” as a number it has verified.

2.2 Pricing

Standard API pricing is unchanged: $5 per million input tokens and $25 per million output tokens, holding flat across Opus 4.5, 4.6, 4.7, and 4.8 ⁴ . Batch processing is 50% off ($2.50 / $12.50), cache reads are 90% off, and the minimum prompt-cache length for 4.8 dropped to 1,024 tokens, lower than 4.7 ⁴ . That lower cache floor is a direct lever for the creator family, which re-reads the same skill bodies and reference files on every invocation ⁴⁵ .

The sticker price hides an effective-cost caveat. The Opus 4.7 tokenizer emits up to 35% more tokens for equivalent input than 4.6, raising effective cost above the flat sticker rate, and this is likely inherited by 4.8 ¹² . None of the 4.8 launch materials disclose this ¹² .

Fast Mode is a new pricing tier: $10 / $50 per million tokens (2× the standard rate) for roughly 2.5× faster output, marketed as three times cheaper than the prior Opus 4.6 fast tier of $30 / $150 ⁴ . It is a research preview, Claude-API-only — not on Bedrock, Vertex, or Foundry — and waitlisted, with a /fast toggle available in Claude Code ⁴¹⁴ . In Claude Code, fast mode defaulted to Opus 4.7 as of v2.1.142 before the 4.8 toggle landed ¹⁴ .

2.3 The Claude Code platform

The platform changes shipped across 27 point releases in 87 days, with the Opus 4.8 release landing in v2.1.154 and a thinking-block fix following in v2.1.156 ¹³ . The current versions an operator should pin against are Claude Code 2.1.156, kellerai-skill-creator 2.1.8, agent-creator 1.6.1, skill-creator 1.6.1, and a marketplace of 94 plugins ¹³²⁰²¹²²³⁷ .

Effort and fast mode. The default effort for Opus 4.8 in Claude Code is now xhigh (v2.1.154); UI labels were renamed from “Speed”/“Intelligence” to “Faster”/“Smarter,” and a /fast toggle delivers 2.5× speed at identical quality, three times cheaper than Opus 4.7 fast mode ⁹ . User-facing effort control also launched on claude.ai and Cowork on 2026-05-28 ¹¹ . A lean system prompt is now the default for Claude Code on all models except Haiku, Sonnet, and Opus ≤4.7, reducing startup token cost ²⁷ .

Dynamic Workflows. This is the most consequential new orchestration capability for the creator family. Claude writes a JavaScript workflow script that coordinates tens to hundreds of parallel subagents in a single session, verifying outputs before reporting ⁷ . The runtime caps concurrency at 16 simultaneous agents and 1,000 agents total per run, with live progress tracking ¹⁵ . Availability spans Claude Code CLI, Desktop, and VS Code for Max/Team/Enterprise, plus API/Bedrock/Vertex/Foundry, activated by an ultracode setting or automatically ⁷ . The warning attached is that workflows “can consume substantially more tokens than a typical Claude Code session” ⁷ . The workflow script syntax and control-flow primitives are not yet publicly documented ²⁸ .

Tool search and deferred loading. The tool_search_tool_regex_20251119 and tool_search_tool_bm25_20251119 variants support precise name matching and natural-language discovery respectively, returning the top three to five tools per query ²⁹ . Tools can opt into defer_loading: true, where the name is visible but the full definition loads only on discovery, saving roughly 20–30K tokens at session start ²⁹ . New in v2.1.154, MCP servers can declare alwaysLoad: true to skip deferral, and gateway_session_surface_apply(project_path?) recomputes the session tool surface ¹⁵³⁰ .

Skills unification. Commands and skills are unified: a file at .claude/commands/deploy.md and .claude/skills/deploy/SKILL.md both create /deploy and behave identically, backwards-compatible ³¹ . Skills gained a disallowed-tools frontmatter field (v2.1.152) for read-only or restricted-scope skills, a /reload-skills command, and the ability for SessionStart hooks to return reloadSkills: true to make new skills available mid-session ³² .

Plugin flags. Plugins can declare defaultEnabled: false (v2.1.154) to skip auto-enable on install ³³ . The /plugin browse pane now shows projected context cost per plugin, lists each plugin's components, and enforces dependency relationships ³⁴ . The plugin manifest still lives at .claude-plugin/plugin.json and must not declare agents, skills, commands, or hooks arrays — those are auto-discovered from directory structure ³⁵ .

Mid-conversation system messages and MCP. The Messages API now accepts role:"system" inside the messages array, allowing instructions, permissions, or token budget to be updated mid-run without breaking the prompt cache (2026-05-28) ⁸ . New hook events MessageDisplay (v2.1.152) and InstructionsLoaded arrived alongside effort-level exposure in hooks via effort.level and $CLAUDE_EFFORT (v2.1.133) ³⁶⁴² . Stdio MCP subprocesses now receive CLAUDE_CODE_SESSION_ID, CLAUDE_PROJECT_DIR, and CLAUDECODE=1 ¹⁸⁴³ .

The Agent SDK credit change. From 2026-06-15, Agent SDK usage on a subscription draws from a separate monthly Agent SDK credit bucket rather than standard interactive limits ²⁶ . This is the single dated deadline in the window and it anchors the sequencing in §6.

Section 04

Evaluation Integrity

The vendor-reported benchmarks show consistent but narrow gains. SWE-bench Pro rises from 64.3% to 69.2% (+4.9pp), OSWorld-Verified from 82.8% to 83.4% (+0.6pp), and Terminal-Bench 2.1 from 66.1% to 74.6% (+8.5pp); Online-Mind2Web posts 84.0% and Humanity's Last Exam 49.8% without tools, 57.9% with tools ⁵¹ . None of these has been independently replicated for 4.8 as of 2026-05-29 ⁵¹ .

The central integrity problem is harness divergence. On the prior model, Anthropic reported SWE-bench Verified at 87.6% while the independent vals.ai harness measured 82.0% — a 5.6-point gap that is undisclosed and likely applies to 4.8 ³⁷ . The mechanism matters more than the number: a benchmark score is a function of the scaffolding around the model, not just the model, so a vendor's harness and an operator's harness can legitimately produce different results on the identical model. An operator cannot adopt the vendor's 87.6% as its own validation; it can only adopt the number its own harness produces.

The headline framing is also selective. Anthropic's positioning that 4.8 wins “across almost all benchmarks” omits Terminal-Bench 2.1, where GPT-5.5 scores 78.2% against Opus 4.8's 74.6% — a regression relative to the competitor that the headline drops ³⁶ . On SWE-bench Pro, Opus 4.8 (69.2) does lead GPT-5.5 (58.6) and Gemini 3.1 Pro (54.2), and on OSWorld it leads GPT-5.5 (78.7) ³⁶ . The “most capable model” framing is further complicated by Claude Mythos Preview, a restricted model not in the public API that reportedly outscores 4.8 (SWE-bench Pro 77.8%, Verified 93.9%, Terminal-Bench 82.0%) — making 4.8 the most capable publicly available Anthropic model, not the most capable model ⁵²⁵³ .

The “4× fewer unflagged code flaws” claim is internal-eval-only with no independent replication ¹ . The independent signals that do exist are modest: CodeRabbit measured +4% recall versus 4.7 with flat precision ²⁴ , and a Hebbia CTO testimonial claiming 61% cheaper per-task token cost came with no methodology and was vendor-curated ²³ . The conclusion for evaluation integrity is narrow and firm: the trust concern here is harness-specificity and selective omission, not fabrication ¹ . A downstream operator must run its own evals — which is exactly what the creator family is built to do ⁴⁹ .

A benchmark score is a function of the scaffolding around the model, not just the model. An operator cannot adopt the vendor's number as its own validation; it can only adopt the number its own harness produces.

Evaluation integrity

Section 05

Blast Radius — The Ecosystem

The marketplace under review contains 94 plugins spanning agent development, code quality, issue tracking, reasoning, research, frontend, dev tools, workflow, hooks, compliance, and content ³⁷⁵⁵ . The exposure is not uniform; it concentrates in three structural counts.

Surface	Count	Why it matters for Opus 4.8 / CC 2.1.156
Plugins hardcoding claude-sonnet-4-6	11	Pinned to a Sonnet alias; exposed to default-mapping shifts [38]
Plugins shipping /agents	53	Each agent inherits or pins a model; effort default moved to xhigh [39]
Plugins with 5+ agents	6	Candidate surfaces for Dynamic Workflows orchestration [39]
Plugins shipping /hooks	25	Exposed to new hook events and effort-in-hooks changes [40]
Plugins shipping MCP servers	7	Exposed to deferred-loading and ToolSearch masking [41]
Blast-radius shortlist (immediate testing)	15	Highest-leverage combinations of the above [54]

The 11 model-hardcoded plugins are the most concrete fragility. They pin claude-sonnet-4-6 (frontend-mobile additionally references the Sonnet 4.5 preview claude-sonnet-4-20250514), and they include fresh-eyes, frontend-mobile, grok-imagine-browser, kellerai-chief-of-staff, kellerai-repo-audit, magic, morphllm-sdk, plugin-healer (documentation examples only), quartermaster, the-collector, and thoughtbox ³⁸ . A hardcoded Sonnet pin is not broken by an Opus release, but it is silently excluded from any Opus 4.8 upside and is exposed if Claude Code shifts the Sonnet alias mapping ³⁸ .

The propagation path explains why a single default matters. A model-default change does not stop at the /model setting; it ripples outward through three layers. At the agent layer, the 53 agent-bearing plugins either pin a model or inherit the session default, so the new xhigh effort default and any model-alias shift propagate into every inherited agent simultaneously ³⁹⁹ . At the hook layer, the 25 hook-bearing plugins receive effort level via effort.level and $CLAUDE_EFFORT, so a changed default reaches hook logic that branches on effort, and the new MessageDisplay and InstructionsLoaded events expand the hook contract those plugins must satisfy ⁴²⁴⁰³⁶ . At the MCP layer, the 7 MCP-bearing plugins expose tools through a gateway that defers loading; a model whose context budget or tool-accuracy threshold differs changes which tools should be surfaced, and ToolSearch can mask tool availability if it is not tuned for the new tier ⁴¹⁵⁵ . One default at the top thus fans out to agents, hooks, and tool surfaces without any single plugin being explicitly edited.

The 15-plugin shortlist for immediate testing concentrates this exposure. The highest-ranked are keller-pr-review (0.57.0), which carries an explicit Opus fallback and must be tested for 4.8 speedup and instruction fidelity; morphllm-sdk (3.1.2), the central gateway with deferred MCP loading and ToolSearch exposure; thoughtbox (0.29.2), with six reasoning agents hardcoded to Sonnet; and cross-discipline-research (1.12.1), a multi-model plugin with 23 MCP tools subject to ToolSearch deferral ⁵⁴ . The shortlist continues through fresh-eyes, kellerai-chief-of-staff, the-collector, quartermaster, kellerai-skill-creator, ralph-loop-sica, live-code-qa, frontend-mobile, kellerai-grc, magic, and vc-pitch-prep — each combining model hardcoding, multi-agent coordination, hook timing, or MCP exposure ⁵⁴ . The blast-radius headline is that Opus 4.8's speed and instruction precision are upside for the reasoning-heavy plugins, but the 11 Sonnet-pinned plugins need model-routing updates and all 53 agent-bearing plus 25 hook-bearing plugins need a CC 2.1.156 compatibility pass within four weeks ⁵⁵ .

Section 06

Enhancement Catalog

The creator family is unusually well-positioned for this upgrade because it has no model-version hardcoding: kellerai-skill-creator, agent-creator, and skill-creator all default to model: sonnet with no pinned 4.5/4.6/4.7 ids in any agent frontmatter ⁴⁴⁵⁰ . The opportunity is therefore not remediation but exploitation — the family leaves prompt caching, structured outputs, mid-conversation overrides, and effort-based routing on the table ⁴⁵⁴⁶⁴⁷⁴⁸ . The catalog below spans all three sweep areas plus governance, prioritized P0 (land before 2026-06-15), P1 (within four weeks), P2 (next quarter).

Pri	Area	Enhancement	Source
P0	Marketplace	Model-default migration pass for the 11 claude-sonnet-4-6-hardcoded plugins	[38]
P0	Tool-search/MCP	Per-model surface profiles (haiku_deferred / sonnet_mixed / opus_all)	[60]
P0	Creator family	Effort→model routing wired from the existing SQLite effort telemetry	[48]
P0	Governance	Rename agent-creator → kellerai-agent-creator (folder + internal) for namespace distance	[21]
P1	Creator family	Prompt caching on SKILL.md / reference reads in the improve loop	[45]
P1	Creator family	Structured outputs for grading / comparison / analysis JSON	[46]
P1	Marketplace	CC 2.1.156 hook-event compatibility pass for the 25 hook plugins	[40]
P1	Marketplace	ToolSearch / deferred-loading audit for the 7 MCP plugins	[41]
P1	Tool-search/MCP	plugin.json alwaysLoad / conditional-by-model flags	[58]
P2	Creator family	Graceful Dynamic-Workflow ↔ inline fallback for subagent fan-out	[44]
P2	Marketplace	Dynamic-Workflow routing for the 6 plugins with 5+ agents	[39]
P2	Tool-search/MCP	Context-budget-driven re-surfacing at turn boundaries	[59]
P2	Tool-search/MCP	Mid-conversation system messages for per-turn surface recompute	[61]

Creator family. The improve loop re-reads SKILL.md bodies and reference files (schemas.md, mode-diagrams.md, agent instructions) on every invocation; prompt caching them is estimated at roughly 20% token savings on multi-iteration loops, and the 1,024-token cache floor on 4.8 makes more of them cacheable ⁴⁵⁴ . The grader, comparator, and analyzer hand-curate grading.json, comparison-N.json, and analysis.json today; routing these through structured outputs would enforce schema compliance without post-processing ⁴⁶ . The SQLite agent-spawn hook already captures effort per dispatch but does not use it for model selection — wiring effort inference to auto-select Haiku, Sonnet, or Opus by eval complexity is a P0 because it lands cost control before the Agent SDK credit change ⁴⁸²⁶ . The architecture already assumes Dynamic Workflows for parallel fan-out and degrades to inline single-run mode without subagents, so the workflow ↔ inline fallback needs to be made explicit and graceful against the new 16/1,000-agent caps ⁴⁴¹⁵ .

Marketplace. The 11 Sonnet-pinned plugins need an explicit model-default migration pass; this is P0 because it determines whether reasoning-heavy plugins see any 4.8 upside at all ³⁸ . The 25 hook-bearing plugins need a CC 2.1.156 compatibility pass covering the new MessageDisplay and InstructionsLoaded events and the effort-in-hooks fields ⁴⁰³⁶⁴² . The 7 MCP plugins need a ToolSearch / deferred-loading audit to confirm tools are not masked under the new tier ⁴¹⁵⁵ . The 6 plugins with 5+ agents are the natural first candidates for Dynamic-Workflow routing once the syntax is documented ³⁹²⁸ .

Tool-search / MCP. The gateway config is one-size-fits-all today, with a static core_tools split that does not vary by model tier ⁵⁶⁶⁰ . Per-model surface profiles — haiku_deferred, sonnet_mixed, opus_all — are P0 because Opus 4.8's larger context window changes the right core/deferred split, and the current config gives no per-model guidance ⁶⁰ . The alwaysLoad and conditional-by-model plugin flags (new in v2.1.154) let plugins like thoughtbox and morphllm-sdk declare themselves always-loaded, which closes the gap where deferred tools stay hidden if ToolSearch fails ⁵⁸³⁰ . Context-budget-driven re-surfacing and mid-conversation system messages are P2 because they require new turn-boundary integration the gateway does not yet have, but they are the mechanism by which the surface could become dynamic rather than fixed at SessionStart ⁵⁹⁶¹ .

Governance. The agent-creator plugin is an upstream Anthropic name carried verbatim, and it pairs with the already-renamed kellerai-skill-creator; renaming it kellerai-agent-creator (folder and internal references) gives the KellerAI fork namespace distance from the upstream plugins it extends ²¹²⁰ . This is P0 not because it is hard but because the namespace decision is cheapest to make before further forks accumulate against the upstream name.

Section 07

Recommendations & Sequencing

The window has exactly one hard deadline: the Agent SDK credit change on 2026-06-15, which moves SDK usage on a subscription to a separate monthly credit bucket ²⁶ . Everything cost-sensitive that depends on SDK-driven subagent fan-out should land before that date, because after it the cost model for the creator family's parallel runs and any Dynamic-Workflow adoption changes ²⁶⁷ . This makes the four P0 items the first sprint.

Land before 2026-06-15 (the P0 set). First, run the model-default migration pass on the 11 Sonnet-pinned plugins so reasoning-heavy work is eligible for Opus 4.8 routing rather than silently excluded ³⁸ . Second, wire effort→model routing in the creator family from the existing SQLite effort telemetry, so cheap evals run on Haiku and expensive ones on Opus before the credit accounting changes underneath them ⁴⁸²⁶ . Third, stand up the per-model surface profiles ( haiku_deferred / sonnet_mixed / opus_all) so tool loading is appropriate to the model actually in use, rather than a one-size-fits-all config ⁶⁰ . Fourth, complete the agent-creator → kellerai-agent-creator rename while the namespace cost is lowest ²¹ .

Within four weeks (the P1 set). Add prompt caching on SKILL.md and reference reads in the improve loop, capturing the estimated ~20% token savings made cheaper by the 1,024-token cache floor ⁴⁵⁴ . Move grading, comparison, and analysis JSON to structured outputs to eliminate post-processing ⁴⁶ . Run the CC 2.1.156 hook-event compatibility pass across the 25 hook plugins and the ToolSearch / deferred-loading audit across the 7 MCP plugins ⁴⁰⁴¹ . Add the alwaysLoad / conditional-by-model flags to the plugins that warrant always-loaded status ⁵⁸ . Use the 15-plugin shortlist as the test ordering for all of this, starting with keller-pr-review, morphllm-sdk, thoughtbox, and cross-discipline-research ⁵⁴ .

Next quarter (the P2 set). Make the Dynamic-Workflow ↔ inline fallback explicit in the creator family against the 16/1,000-agent caps, then extend Dynamic-Workflow routing to the 6 plugins with 5+ agents once the workflow syntax is publicly documented ⁴⁴³⁹²⁸ . Build context-budget-driven re-surfacing and per-turn surface recompute on mid-conversation system messages, the two mechanisms that turn the static SessionStart surface into a dynamic one ⁵⁹⁶¹ .

Each recommendation ties back to a catalog row, and the ordering is driven by the single dated deadline rather than by perceived model risk. The model upgrade is the occasion; the governance pass is the work.

Section 08

Honest Limits

This review cannot confirm Opus 4.8's headline behavioral claim. The “4× fewer unflagged code flaws” result is an internal coding evaluation with no independent replication as of 2026-05-29, and the alignment claims are unverifiable without the system card ¹ . The vendor benchmarks in §3 are vendor-reported and none has been independently replicated for 4.8 ⁵¹ .

The platform surface carries several unverified or gated items. The Dynamic Workflows script syntax and control-flow primitives are not yet publicly documented — the feature is described at a high level only, and subagent-routing mechanics inside a workflow are unknown ²⁸ . Fast Mode is a research preview, Claude-API-only and waitlisted, so its pricing is real but not generally accessible ⁴ . The Microsoft Foundry 1M-context exception (200k) is documented but the marketing headline omits it ⁶¹⁷ . The harness divergence number (87.6% vendor versus 82.0% vals.ai) is measured on the prior model and only “likely” applies to 4.8 ³⁷ .

Several gateway-integration claims are forward-looking rather than confirmed. The current session is running Claude Code 2.1.145+ against a claude_code_local profile, and Dynamic Workflows usage in the gateway is “NONE found” — the migration items in §5 and §6 are proposals against the documented feature surface, not observations of working integrations ⁵⁵⁵⁷ . The per-model surface profiles, alwaysLoad flags, context-budget re-surfacing, and mid-conversation re-surfacing are all gap-analysis recommendations whose blockers (schema updates, turn-boundary hooks, profile expansion) are documented but unbuilt ⁶⁰⁵⁸⁵⁹⁶¹ . Finally, the Agent SDK credit change is dated 2026-06-15 but its cost-per-token-per-credit details are not documented, so the sequencing in §6 optimizes around the deadline without modeling the exact post-deadline cost ²⁶ .

References

1Anthropic. "Claude Opus 4.8." anthropic.com, 2026-05-28. https://www.anthropic.com/news/claude-opus-4-8
2Coursiv. "Claude Opus 4.8." coursiv.io, 2026-05-28. https://coursiv.io/blog/claude-opus-4-8
3Hidekazu Konishi. "Claude Opus Release Timeline." hidekazu-konishi.com, 2026-05-28. https://hidekazu-konishi.com
4Finout. "Claude API Pricing." finout.io, 2026-05-28. https://finout.io
5cross-discipline-research:technical-claims-agent. Review window and 4.7→4.8 focus. Per Anthropic Opus 4.8 release notes / model card and public benchmark reporting at release (2026-05-28). https://www.anthropic.com/news/claude-opus-4-8
6Releasebot. "Anthropic Updates — Claude Opus 4.8." releasebot.io, 2026-05-28. https://releasebot.io/updates/anthropic
7Anthropic / Claude. "Introducing Dynamic Workflows in Claude Code." claude.com, 2026-05-28. https://claude.com/blog/introducing-dynamic-workflows-in-claude-code
8Anthropic. "Mid-conversation system messages (Messages API)." anthropic.com, 2026-05-28. https://www.anthropic.com/news/claude-opus-4-8
9Claude Code Changelog. "v2.1.154." code.claude.com, 2026-05-28. https://code.claude.com/docs/en/changelog
10CodeRabbit. "Opus 4.8 adaptive thinking test." youtube.com (LzgPzQud0zA), 2026-05-28. https://www.youtube.com/watch?v=LzgPzQud0zA
11Claude Platform. "Effort controls." platform.claude.com, 2026-05-28. https://platform.claude.com/effort
12Finout. "Claude Opus 4.7 tokenizer / effective-cost story." finout.io, 2026-04-16. https://finout.io
13Claude Code Changelog. "v2.1.130–v2.1.156 — 27 point releases in 87 days." Per public Claude Code changelog. code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
14Claude Code Changelog. "v2.1.142 — fast mode default model." code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
15MarkTechPost. "Anthropic Ships Claude Opus 4.8 alongside Dynamic Workflows and cheaper Fast Mode, workflows capped at 1,000 subagents." marktechpost.com, 2026-05-28. https://www.marktechpost.com/2026/05/28/anthropic-ships-claude-opus-4-8-alongside-dynamic-workflows-and-cheaper-fast-mode-with-workflows-capped-at-1000-subagents/
16TechCrunch. "Anthropic releases Claude Opus 4.8." techcrunch.com, 2026-05-28. https://techcrunch.com
17ReleaseBot / Coursiv. "1M context default; Foundry 200k exception." releasebot.io, 2026-05-28. https://releasebot.io/updates/anthropic
18Claude Platform. "MCP connector — stdio environment variables." platform.claude.com, 2026-05. https://platform.claude.com/mcp-connector
19Anthropic. "Claude Code plugins documentation — creator plugin family." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
20KellerAI internal codebase. "kellerai-skill-creator plugin identity (v2.1.8)." KellerAI's kellerai-skill-creator plugin, plugin.json manifest (non-public). See also /kellerai-skill-creator. Illustrative; not verbatim: { "name": "kellerai-skill-creator", "version": "2.1.8", "description": "..." }
21KellerAI internal codebase. "agent-creator plugin identity (v1.6.1)." KellerAI's agent-creator plugin, plugin.json manifest (non-public). Illustrative; not verbatim: { "name": "agent-creator", "version": "1.6.1", "description": "..." }
22KellerAI internal codebase. "skill-creator plugin identity (v1.6.1)." KellerAI's skill-creator plugin, plugin.json manifest (non-public). Illustrative; not verbatim: { "name": "skill-creator", "version": "1.6.1", "description": "..." }
23Anthropic / Claude. "Hebbia CTO testimonial — 61% cheaper per-task token cost." claude.com, 2026-05-28. https://claude.com/opus
24CodeRabbit. "Independent Opus 4.8 test — +4% recall, flat precision." youtube.com (LzgPzQud0zA), 2026-05-28. https://www.youtube.com/watch?v=LzgPzQud0zA
25Every. "Vibe check — Senior Engineer Benchmark." every.to, 2026-05-28. https://every.to/vibe-check
26Claude Code. "Agent SDK overview — SDK billing change June 15, 2026." code.claude.com, 2026-05. https://code.claude.com/docs/en/agent-sdk/overview.md
27Claude Code Changelog. "v2.1.154 — lean system prompt default." code.claude.com, 2026-05-28. https://code.claude.com/docs/en/changelog
28Claude Code Changelog. "Dynamic Workflow script syntax — not yet publicly documented." Per public Claude Code changelog and Anthropic release materials; workflow control-flow primitives unverified as of 2026-05-29. code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
29Anthropic. "Tool-search variants and deferred-loading behavior." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
30Claude Code Changelog. "v2.1.154 — alwaysLoad MCP flag; gateway session surface apply." code.claude.com, 2026-05-28. https://code.claude.com/docs/en/changelog
31Claude Code. "Skills documentation — commands/skills unification." code.claude.com, 2026-05. https://code.claude.com/docs/en/skills
32Claude Code Changelog. "v2.1.152 — disallowed-tools, /reload-skills, reloadSkills hook." code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
33Claude Code Changelog. "v2.1.154 — defaultEnabled plugin flag." code.claude.com, 2026-05-28. https://code.claude.com/docs/en/changelog
34Claude Code Changelog. "v2.1.143-2.1.145 — /plugin projected context cost, components, dependency enforcement." code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
35Claude Code. "Plugins documentation — plugin.json schema and auto-discovery." code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
36Anthropic / public benchmark reporting. "Terminal-Bench 2.1 regression vs GPT-5.5; SWE-bench Pro / OSWorld competitive standing." Per Anthropic Opus 4.8 release materials and SWE-bench Verified / Terminal-Bench public leaderboard data, 2026-05-28. https://www.anthropic.com/news/claude-opus-4-8
37SWE-bench Verified leaderboard / vals.ai. "Harness divergence: 87.6% vendor vs 82.0% vals.ai (prior model)." Per SWE-bench Verified public leaderboard and Anthropic Opus 4.8 release notes / model card. https://www.anthropic.com/news/claude-opus-4-8
38Internal review artifacts (KellerAI). "11 plugins hardcoding claude-sonnet-4-6." Per internal marketplace audit, 2026-05-29.
39Internal review artifacts (KellerAI). "53 plugins with /agents; 6 with 5+ agents." Per internal marketplace audit, 2026-05-29.
40Internal review artifacts (KellerAI). "25 plugins with /hooks." Per internal marketplace audit, 2026-05-29.
41Internal review artifacts (KellerAI). "7 plugins with MCP servers." Per internal marketplace audit, 2026-05-29.
42Claude Code Changelog. "v2.1.133 — effort.level and $CLAUDE_EFFORT in hooks." code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
43Claude Code Changelog. "v2.1.139,v2.1.154 — CLAUDE_PROJECT_DIR, CLAUDECODE=1 for stdio MCP." code.claude.com, 2026-05. https://code.claude.com/docs/en/changelog
44Anthropic. "Claude Code plugins documentation — Dynamic Workflows; creator plugin architecture." Per public Anthropic plugins documentation; Dynamic Workflows assumed for parallel fan-out; graceful degradation to inline; no model hardcoding. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
45Anthropic. "Claude Code plugins documentation — prompt caching on SKILL.md / reference reads (~20% savings)." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
46Anthropic. "Claude Code plugins documentation — structured outputs for grading / comparison / analysis JSON." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
47Anthropic. "Claude Code plugins documentation — mid-conversation system message overrides per phase." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
48Anthropic. "Claude Code plugins documentation — agent-spawn hook captures effort; effort→model routing opportunity." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
49Anthropic. "Claude Code plugins documentation — eval / improve / create / benchmark modes (operator-run evals)." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
50Anthropic. "Claude Code plugins documentation — baked-in assumptions: model: sonnet, no version hardcoding." Per public Anthropic plugins / marketplace documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/plugins
51Anthropic. "Opus 4.8 vendor benchmark table (SWE-bench Pro, OSWorld, Terminal-Bench, HLE, Online-Mind2Web)." Per Anthropic Opus 4.8 release notes / model card and SWE-bench Verified / Terminal-Bench public leaderboard reporting; none independently replicated for 4.8 as of 2026-05-29. https://www.anthropic.com/news/claude-opus-4-8
52Anthropic. "Claude Mythos Preview leads 4.8; \"most capable publicly available\" framing." Per Anthropic Opus 4.8 release notes / model card, 2026-05-28. https://www.anthropic.com/news/claude-opus-4-8
53Vellum. "Claude Mythos Preview benchmarks." vellum.ai, 2026-05. https://www.vellum.ai
54Internal review artifacts (KellerAI). "15-plugin blast-radius shortlist with ranks and actions." Per internal marketplace audit, 2026-05-29.
55Internal review artifacts (KellerAI). "Executive summary: counts, fragility, 4-week compatibility-pass headline." Per internal marketplace audit, 2026-05-29.
56Anthropic. "Tool-search / MCP documentation — deferred-loading config and per-server core overrides." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
57Anthropic. "Tool-search / MCP documentation — Dynamic Workflows gateway integration gap; migration requirements." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
58Anthropic. "Tool-search / MCP documentation — missing alwaysLoad / defaultEnabled / onlyFor plugin flags." Per public Anthropic tool-search / MCP and plugins documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
59Anthropic. "Tool-search / MCP documentation — context-budget signaling not modeled at turn boundaries." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
60Anthropic. "Tool-search / MCP documentation — no per-model tool strategies; haiku_deferred / sonnet_mixed / opus_all profiles." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search
61Anthropic. "Tool-search / MCP documentation — mid-conversation system messages not leveraged for surface recompute." Per public Anthropic tool-search / MCP documentation. code.claude.com, 2026-05. https://code.claude.com/docs/en/tool-search