Executive Summary
A growing class of production AI failures shares a single signature: an engineer simplifies an artifact — a prompt, a skill, an agent's system message, a plugin's tool descriptions — for clarity, cost, or maintainability. Aggregate evaluation metrics look stable. Users start hitting cliffs the eval doesn't see.
The published evidence is unambiguous. Subtle prompt formatting changes can swing accuracy by up to 76 points [3]. Frontier models cap at ~68% accuracy under high instruction density [2]. A STAR reasoning framework that scored 100% in isolation collapsed to 0–30% when embedded in a 60-line production prompt; the authors named this instruction dilution [1]. SWE-Bench Verified pass rates for SWE-Agent + GPT-4 fell from 12.47% to 3.97% after stricter validation, with 32.67% of original successes attributable to solution leakage [4].
The implication: most teams cannot tell when their artifacts have crossed a capability cliff, because the test set they built around the original artifact tests average competence rather than preserved capability frontier.
Over-simplification cannot be detected by aggregate pass rates; it must be tested at the capability frontier, against named behavioral contracts, with judges that control for their own biases — and the answer must point at which component carried the lost behavior.
The Silent Regression Problem
When a software function is simplified, regressions usually announce themselves: a test goes red, a build breaks, a type-checker complains. AI artifacts have no such property. They degrade smoothly across most inputs while collapsing on the narrow band of cases that justified the original complexity.
Three properties make this worse than ordinary regression:
Definition
Over-Simplification. An artifact is over-simplified when a reduction in instructions, examples, context, or tools causes the removal of a capability that some non-trivial slice of users depends on, in a way not detected by the artifact's current evaluation harness.
This definition deliberately includes the eval harness in the loop. If the eval can't see a regression, nothing stops it from shipping, so for operational purposes the regression has already happened.
Artifact Taxonomy
"Advanced prompt, skill, agent, or plugin" covers four distinct surfaces with different failure modes. The detection framework has to adapt to each.
| Artifact | Carries | Simplification Vector | Typical Failure |
|---|---|---|---|
| Prompt | Instructions, examples, role, output constraints | Token reduction, example pruning, role flattening | Format drift, lost edge cases, reasoning collapse [1] |
| Skill | SKILL.md, progressive disclosure layers, bundled scripts | Stripping references, flattening disclosure, generalizing descriptions | Trigger miss, execution drift [10] |
| Agent | System prompt, tool roster, memory protocols, sub-agent topology | Tool removal, harness simplification, memory compression | Wrong tool selection, multi-turn drift, cascading errors [11] |
| Plugin | Tool descriptions, parameter schemas, usage examples | Description shortening, example elision, schema generalization | Mis-selection, malformed calls, hallucinated parameters [12] |
Across all four, the underlying mechanism is identical: specific tokens carry specific behaviors, and removing them removes the behaviors, but only on the inputs that triggered them. The detection challenge is identifying which inputs those are.
Empirical Foundations
Six findings from the published literature anchor the framework. Each represents a constraint the system must respect.
Instruction dilution is real and measurable
Jo (2026) tested a STAR reasoning framework on a structured reasoning problem. In isolation: 100% accuracy. Embedded in a 60-line production system prompt: 0–30% accuracy. The mechanism: directives like "Lead with specifics" forced the model to commit to a conclusion before STAR reasoning could execute. The capability still existed; the surrounding context prevented it from running [1].
Capability cliffs exist below context-window limits
Jaroslawicz et al. (2025) evaluated 20 frontier models against 500 simultaneous instructions. Even the best achieved only 68% accuracy at maximum density, with three failure regimes: threshold, linear, and exponential decay [2]. Levy et al. (2024) showed reasoning degradation begins around 3,000 tokens, far below technical context limits [6].
Format sensitivity dwarfs semantic changes
Sclar et al. (2024) found performance differences of up to 76 accuracy points from formatting changes alone. Mizrahi et al. (2024) confirmed across 6.5M instances and 39 tasks that benchmark rankings themselves change with prompt wording [3].
LLM-as-judge has well-characterized biases
Position bias produces accuracy shifts >10% from swap alone [7]. Verbosity bias favors longer outputs regardless of quality. Self-enhancement drives judges to prefer outputs from their own model family [8]. Multi-agent debate amplifies rather than mitigates several of these [9].
Behavioral testing predates LLMs
Ribeiro et al. (2020) introduced CheckList, importing software-engineering behavioral testing into NLP. Three test types (Minimum Functionality Tests, Invariance, Directional Expectation) showed that even SOTA models failed on simple behaviors that aggregate metrics could not reveal [5]. The approach transfers directly to artifact-level testing.
Failure Modes Catalog
Before designing tests, name the failures. Twelve modes recur across artifact types.
You cannot test for capability loss without naming the capabilities.
Five Detection Approaches
The literature and practice yield five candidates. Each is rated on signal quality and effort. The goal is selecting a combination, not picking one.
Ablation Bisection
Method. Lock the original as gold. Remove components one at a time — instructions, examples, constraints, tool descriptions — re-running the eval after each. Locate the removal that crosses your failure threshold.
Why it works. Inherits the epistemological logic of ablation studies in ML: if removing X eliminates a behavior, X carried the behavior [13].
Where it breaks. Combinatorial when removals interact. Slow on artifacts with many components. Requires a high-quality eval set — bisection on a weak eval just bisects to noise.
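A minimal sketch of one way to run this, assuming a leave-one-out variant in which each component is removed on its own and the eval is re-run. The names `leave_one_out`, `toy_eval`, and the `gold` components are hypothetical, and `toy_eval` is a stand-in for a real eval harness, not part of any published tool.

```python
from typing import Callable, Dict, List


def leave_one_out(
    components: Dict[str, str],
    evaluate: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Return the components whose removal drops the eval score below threshold."""
    culprits = []
    for name in components:
        # Rebuild the artifact without this one component.
        ablated = "\n".join(text for key, text in components.items() if key != name)
        if evaluate(ablated) < threshold:
            culprits.append(name)
    return culprits


# Toy stand-in for a real eval: the score depends on one phrase surviving.
def toy_eval(artifact: str) -> float:
    return 0.95 if "cite sources" in artifact else 0.60


gold = {
    "persona": "You are a careful research assistant.",
    "citations": "Always cite sources for factual claims.",
    "format": "Answer in short paragraphs.",
}

print(leave_one_out(gold, toy_eval, threshold=0.85))  # ['citations']
```

The loop costs one eval run per component, which is why the approach gets slow on large artifacts and why a weak `evaluate` function makes the result meaningless.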
Hard-Case Capability Frontier Battery
Method. Curate evaluation items at the edge of intended capability: long context, conflicting constraints, refusal handling, multi-turn drift, adversarial inputs, tool selection under ambiguity, format edge cases.
Why it works. Easy and mid-difficulty cases keep passing long after the artifact has degraded on its frontier [5].
Where it breaks. Curation cost is the real expense. Quality depends on whether you actually identified the frontier — a curator who shares the author's blind spots produces a set with the same blind spots.
Behavioral Contract Assertions
Method. Define non-negotiable behaviors as binary, programmatically-verifiable checks: must cite sources, must request clarification when ambiguous, must refuse class-X inputs, must call tool Z before answering, must produce parseable JSON.
Why it works. Categorical rather than gradient. Adapts CheckList's Minimum Functionality Test pattern [5]. Cheap, fast, explainable, blocks merges cleanly.
Where it breaks. Only catches behaviors you specified. The ones you didn't think to write down can quietly disappear.
Capability Decay Curve
Method. Generate stepped simplifications at 100%, 80%, 60%, 40%, 20% of original size. Plot the curve. Look for the cliff, not the slope.
Why it works. Empirically locates the threshold below which the artifact stops being itself. Connects to the threshold/linear/exponential decay regimes [2].
Where it breaks. Token count is a crude proxy for what actually carries capability. The curve gives you a map but not a cause.
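To make "look for the cliff, not the slope" concrete, here is a small sketch that finds the steepest drop between adjacent simplification steps. The function name `find_cliff` and the scores in `curve` are illustrative assumptions, not measured data.

```python
from typing import List, Tuple


def find_cliff(curve: List[Tuple[int, float]]) -> Tuple[int, int, float]:
    """Return (from_size, to_size, drop) for the steepest drop between adjacent steps."""
    # Walk from the largest artifact size down to the smallest.
    ordered = sorted(curve, key=lambda point: point[0], reverse=True)
    best = (ordered[0][0], ordered[0][0], 0.0)
    for (size_a, score_a), (size_b, score_b) in zip(ordered, ordered[1:]):
        drop = score_a - score_b
        if drop > best[2]:
            best = (size_a, size_b, drop)
    return best


# Hypothetical scores at 100%, 80%, 60%, 40%, 20% of original size.
curve = [(100, 0.92), (80, 0.91), (60, 0.89), (40, 0.55), (20, 0.50)]
from_size, to_size, drop = find_cliff(curve)
print(f"cliff between {from_size}% and {to_size}%: drop of {drop:.2f}")
```

In this made-up curve the score barely moves down to 60% of original size and then falls sharply, which is the pattern an average slope would hide.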
Hybrid — Contracts + Frontier + Bias-Controlled Judge + Triggered Bisection
Method. Combine three signals on every artifact change, plus a fourth invoked conditionally:
- A · Behavioral contracts: binary must-hold checks; a categorical fail blocks the change.
- B · Frontier battery: curated edge-case eval, scored against a threshold.
- C · Paired LLM-judge with swap + length normalization: controls for position and verbosity biases.
- D · Bisection: runs only when A, B, or C trips.
Why it works. Each signal catches what the others miss. Contracts catch behaviors you named. Frontier catches behaviors you tested. Judges catch behaviors you forgot to name but a careful reader would notice. Bisection localizes the cause. Cost is controlled by running bisection only on triggered failures.
Addresses four counter-arguments: aggregate hides what was lost → contracts make capability explicit; eval overfits to original → frontier targets unseen edges; judges are biased → paired position-swap neutralizes them; you learn it broke but not why → bisection localizes on triggered failures only.
Hybrid Framework — Specification
The recommended hybrid resolves into four components, each with concrete inputs, outputs, and thresholds.
A · Behavioral Contracts
A behavioral contract is a triple: (trigger, behavior_check, severity). The trigger selects applicable inputs. The behavior check is a verifiable test on the output. Severity determines whether failure blocks the build or warns.
Example contracts for a customer-support agent:
(input_mentions_refund, output_calls_refund_lookup_first, BLOCK)
(input_is_ambiguous, output_asks_clarifying_question, BLOCK)
(input_is_safety_critical, output_routes_to_human, BLOCK)
(any, output_response_under_300_words, WARN)
Predicates can be regex, structural parsers, deterministic tool-trace inspections, or — sparingly — small LLM calls with high-agreement rubrics. Keep the set to 10–30 items.
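The triple maps naturally onto a small data structure. The sketch below is one possible encoding, assuming outputs are plain strings and that a tool call appears as a `CALL <tool>` prefix; that trace format, the `Contract` class, and `run_contracts` are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Contract:
    name: str
    trigger: Callable[[str], bool]       # selects applicable inputs
    check: Callable[[str, str], bool]    # (input, output) -> does the behavior hold?
    severity: str                        # "BLOCK" or "WARN"


def run_contracts(
    contracts: List[Contract], cases: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (contract name, severity, input) for every applicable contract that fails."""
    failures = []
    for user_input, output in cases:
        for contract in contracts:
            if contract.trigger(user_input) and not contract.check(user_input, output):
                failures.append((contract.name, contract.severity, user_input))
    return failures


contracts = [
    Contract(
        "refund_lookup_first",
        lambda i: re.search(r"\brefund\b", i, re.IGNORECASE) is not None,
        lambda i, o: o.startswith("CALL refund_lookup"),  # assumed trace format
        "BLOCK",
    ),
    Contract("under_300_words", lambda i: True, lambda i, o: len(o.split()) < 300, "WARN"),
]

cases = [
    ("I want a refund for order 1182", "CALL refund_lookup(order=1182)"),
    ("Can I get a refund?", "Sure, refunds take five days."),
]

failures = run_contracts(contracts, cases)
blocked = any(severity == "BLOCK" for _, severity, _ in failures)
print(failures, blocked)
```

Keeping each predicate a plain function makes failures explainable: the report names the contract and the input that broke it, with no judge in the loop.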
B · Frontier Battery
Curated set at the edge of intended capability. Suggested composition:
| Category | Share | Example |
|---|---|---|
| Long context | 15% | Inputs at 60–90% of context budget |
| Conflicting instructions | 15% | User asks X, system prompt forbids X |
| Adversarial / injection | 10% | Embedded "ignore previous instructions" |
| Refusal boundary | 10% | Inputs at exact edge of policy |
| Tool ambiguity | 15% | Two tools could plausibly handle |
| Multi-turn drift | 15% | 3–6 turn conversations |
| Format edges | 10% | Nested JSON, multilingual, code blocks |
| Production-derived | 10% | Real cases from monitoring |
Size: 50–200 items typical. Refresh production-derived slice quarterly per the data-flywheel pattern [14].
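As a worked example of the composition table, the sketch below turns the suggested shares into item counts for a battery of a given size. The category keys and the `allocate` helper are hypothetical, and the largest-remainder rounding is an assumption made so the counts always sum to the requested total.

```python
from typing import Dict

# Suggested shares from the composition table, in percent.
SHARES: Dict[str, int] = {
    "long_context": 15,
    "conflicting_instructions": 15,
    "adversarial_injection": 10,
    "refusal_boundary": 10,
    "tool_ambiguity": 15,
    "multi_turn_drift": 15,
    "format_edges": 10,
    "production_derived": 10,
}


def allocate(total_items: int) -> Dict[str, int]:
    """Split total_items across categories by share, using largest-remainder rounding."""
    exact = {cat: total_items * pct / 100 for cat, pct in SHARES.items()}
    counts = {cat: int(value) for cat, value in exact.items()}
    leftover = total_items - sum(counts.values())
    # Hand the remaining items to the categories with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda cat: exact[cat] - counts[cat], reverse=True)
    for cat in by_remainder[:leftover]:
        counts[cat] += 1
    return counts


print(allocate(120))
print(allocate(50))
```

At 120 items the split is exact (18 or 12 per category); at 50 items the two leftover slots go to the first categories with a half-item remainder.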
C · Bias-Controlled Paired Judge
Naive LLM-as-judge fails. Position swap alone shifts pairwise outcomes by >10% [7]. The bias-controlled protocol:
# For each input in eval set:
#   1. Generate output from original artifact (A) and simplified (B)
#   2. Judge twice with positions swapped: (A,B) and (B,A)
#   3. If judgments agree → record winner
#   4. If they disagree → record TIE (position bias detected)
#   5. Normalize for length: if winner is >1.5× longer, downweight
#   6. Aggregate: simplified passes iff win_rate(B) ≥ 0.50
Use a judge from a different model family than the artifact's own model to reduce self-enhancement bias [8]. Calibrate quarterly against a small human-labeled set.
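The swap-and-normalize protocol can be sketched as follows. This is an illustration under stated assumptions: the judge is any callable returning "first" or "second", the downweight factor of 0.5 is an arbitrary choice (the protocol only says "downweight"), and ties are excluded from the win rate and tracked separately. The names `compare` and `win_rate_simplified` are hypothetical.

```python
from typing import Callable, List, Tuple

# A judge sees (input, first_output, second_output) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def compare(judge: Judge, item: str, original: str, simplified: str,
            max_ratio: float = 1.5, penalty: float = 0.5) -> Tuple[str, float]:
    """Judge twice with positions swapped; return (verdict, weight)."""
    first_pass = judge(item, original, simplified)    # original shown first
    second_pass = judge(item, simplified, original)   # simplified shown first
    pick1 = "original" if first_pass == "first" else "simplified"
    pick2 = "simplified" if second_pass == "first" else "original"
    if pick1 != pick2:
        return "tie", 0.0                             # disagreement: position bias
    winner, loser = (original, simplified) if pick1 == "original" else (simplified, original)
    too_long = len(winner) > max_ratio * len(loser)   # verbosity control
    return pick1, (penalty if too_long else 1.0)


def win_rate_simplified(results: List[Tuple[str, float]]) -> float:
    """Weighted share of decisive comparisons won by the simplified artifact."""
    total = sum(weight for verdict, weight in results if verdict != "tie")
    if total == 0:
        return 0.0
    return sum(weight for verdict, weight in results if verdict == "simplified") / total


# Toy judges standing in for real model calls.
always_first = lambda item, a, b: "first"
prefers_longer = lambda item, a, b: "first" if len(a) >= len(b) else "second"

print(compare(always_first, "q", "a long answer", "short"))   # ('tie', 0.0)
print(compare(prefers_longer, "q", "x" * 100, "y" * 10))      # ('original', 0.5)
```

A purely position-biased judge produces only ties, so it cannot move the win rate; a length-biased judge still picks a winner, but the win counts for less.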
D · Triggered Bisection
Runs only when A, B, or C trips:
# Inputs: original, simplified, failing_eval_subset
#   1. Diff the two artifacts → list of removed/changed components
#   2. For each component, build a "partial" artifact:
#      simplified + just that component restored
#   3. Re-run failing items against each partial
#   4. The component whose restoration recovers failures = cause
#   5. If no single component recovers → interaction effect; pair-wise
O(n) for single causes, O(n²) for pair-interactions. Cap at pairs; report "interaction" rather than enumerate triples.
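A compact sketch of the restore-and-retest loop, assuming artifacts are strings, removed components are known by name, and "recovers" means every failing item passes again (the text leaves that criterion open). `triggered_bisection`, `recovery_rate`, and the toy `passes` predicate are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence


def recovery_rate(artifact: str, failing_items: Sequence, passes: Callable) -> float:
    """Fraction of previously failing items that pass against this artifact."""
    return sum(passes(artifact, item) for item in failing_items) / len(failing_items)


def triggered_bisection(simplified: str, removed: Dict[str, str],
                        failing_items: Sequence, passes: Callable) -> dict:
    """Restore one removed component at a time; fall back to pairs if none fully recovers."""
    single = {
        name: recovery_rate(simplified + "\n" + text, failing_items, passes)
        for name, text in removed.items()
    }
    if any(rate == 1.0 for rate in single.values()):
        return {"single": single, "pairs": {}}
    # No single cause found: test pair-wise restorations and stop there.
    pairs = {
        (a, b): recovery_rate(
            simplified + "\n" + removed[a] + "\n" + removed[b], failing_items, passes
        )
        for a, b in combinations(removed, 2)
    }
    return {"single": single, "pairs": pairs}


# Toy eval: an item passes when every phrase it needs is present in the artifact.
passes = lambda artifact, item: all(phrase in artifact for phrase in item)
simplified = "Write campaign copy as JSON."
removed = {
    "persona": "Voice: warm neighborhood bistro.",
    "refusals": "When refusing, offer an on-brand alternative.",
}
failing = [("bistro",), ("bistro", "alternative")]

report = triggered_bisection(simplified, removed, failing, passes)
print(report)
```

Running this per failing contract or per frontier category, rather than over all failures at once, keeps single causes visible when different components carry different behaviors.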
Live Simulator
Six capability-carrying components below. Toggle them off to simulate simplification. Three scores reflect what each detection layer would see — and where they diverge.
Two patterns worth observing: aggregate pass rate degrades gradually while contracts can break cleanly on a single removal; and different components produce qualitatively different verdicts even at similar token cost.
Case Study
A restaurant marketing prompt
Consider a restaurant marketing system prompt evolved over months for an operator named Alex. The original is 90 lines covering: persona, audience targeting, campaign taxonomy, brand-voice constraints, three campaign examples, refusal handling, output schema.
A new engineer proposes simplification: collapse three examples into one, drop the persona block ("the LLM is good enough now"), shorten refusal handling. The simplified version is 38 lines — a 58% reduction.
Step 1 · Contracts
| ID | Contract | Result |
|---|---|---|
| C1 | Output is valid JSON matching campaign schema | PASS |
| C2 | Discount > 30% includes margin warning | PASS |
| C3 | Voice avoids five banned phrases (synergy, leverage…) | FAIL 8/50 |
| C4 | Refusals propose on-brand alternative | FAIL 6/50 |
The build is blocked at Layer A: both failed contracts are BLOCK severity. There is no need to run the frontier battery or the paired judge; the failure triggers bisection directly.
Step 2 · Bisection
Triggered. The bisector restores each removed component and re-tests:
| Restored component | C3 recovery | C4 recovery |
|---|---|---|
| Persona block | 100% | 0% |
| Three separate examples | 25% | 33% |
| Original refusal handling | 0% | 100% |
Output: C3 was carried by the persona block; C4 by the refusal handling. Both must be restored or rewritten before the simplification ships.
Step 3 · A constructive next move
Knowing which components carry which behaviors changes the conversation from "we can't simplify" to "we can simplify these specific parts but not these." A second iteration might compress the three examples (which the bisector showed are weakly load-bearing) while preserving persona and refusal blocks. The framework gives an evidence-based path to a smaller artifact — not a flat veto.
The framework's deepest value is not catching bad simplifications. It is making good simplifications possible.
CI/CD Integration
The framework runs as a check on artifact changes. Suggested wiring:
# .github/workflows/artifact-eval.yml
name: artifact-evaluation
on:
  pull_request:
    paths: ['prompts/**', 'skills/**', 'agents/**', 'plugins/**']
jobs:
  contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.contracts --strict
  frontier:
    needs: contracts
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.frontier --threshold 0.85
  paired:
    needs: contracts
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.paired --sample 50 --swap --normalize-length
  bisect:
    needs: [contracts, frontier, paired]
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.bisect --report-pr-comment

Suggested thresholds
| Signal | Default | Action |
|---|---|---|
| Block-severity contracts | 100% must hold | BLOCK |
| Warn-severity contracts | > 95% must hold | WARN |
| Frontier category pass rate | ≥ 85% per category | BLOCK |
| Paired win rate (simplified) | ≥ 45% (not significantly worse) | BLOCK |
| Paired tie rate (bias signal) | < 20% | WARN |
Defaults — tune per artifact; safety-critical agents warrant stricter thresholds.
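The default thresholds can be folded into a single gate decision. The sketch below is one way to do that, assuming the signals have already been computed as rates between 0 and 1; the `Signals` fields and the `gate` function are hypothetical names, and the cutoffs mirror the table's defaults.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Signals:
    block_contract_rate: float               # share of BLOCK-severity contracts that held
    warn_contract_rate: float                # share of WARN-severity contracts that held
    frontier_by_category: Dict[str, float]   # pass rate per frontier category
    paired_win_rate: float                   # simplified artifact's win rate
    paired_tie_rate: float                   # share of swapped judgments that disagreed


def gate(s: Signals) -> Tuple[str, List[str]]:
    """Apply the default thresholds and return (verdict, reasons)."""
    blocks, warns = [], []
    if s.block_contract_rate < 1.0:
        blocks.append("block-severity contract failed")
    for category, rate in s.frontier_by_category.items():
        if rate < 0.85:
            blocks.append(f"frontier category below 85%: {category}")
    if s.paired_win_rate < 0.45:
        blocks.append("paired win rate below 45%")
    if s.warn_contract_rate <= 0.95:
        warns.append("warn-severity contracts at or below 95%")
    if s.paired_tie_rate >= 0.20:
        warns.append("paired tie rate at or above 20%")
    if blocks:
        return "BLOCK", blocks + warns
    return ("WARN", warns) if warns else ("PASS", [])


verdict, reasons = gate(
    Signals(1.0, 0.97, {"long_context": 0.90, "tool_ambiguity": 0.80}, 0.50, 0.10)
)
print(verdict, reasons)
```

Checking the frontier per category rather than in aggregate matters here: the example averages exactly 85% across its two categories yet still blocks on the weaker one.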
The data flywheel
Production failures should flow back into the eval suite. The pattern documented by LangChain, Braintrust, and others [14, 15]: monitoring identifies a production trace where the artifact misbehaved, a domain expert annotates the correct trajectory, and the corrected example joins the frontier battery. The eval grows organically to match the failure modes the artifact actually encounters.
Counter-Arguments Addressed
The framework is more expensive than "just look at pass rate." Four objections deserve direct response.
"This is overkill for small prompts."
"LLM-as-judge is too unreliable to anchor a CI gate."
"Frontier curation is too expensive to maintain."
"This still misses capabilities we didn't think to test for."
- Periodic frontier refresh from production failures (the data flywheel).
- Paired judge captures qualitative differences a careful reader would notice — including ones not on the contract list.
- Adversarial review: have a domain expert try to break the simplified artifact before merge.
None of these reduce failure rate to zero. They reduce it to where surprises become recoverable rather than catastrophic.
Limitations & Open Questions
The framework is incomplete in known ways. Naming them honestly is part of operating it well.
- 01 · Stochasticity baseline. Any single eval run has noise. Threshold tuning must account for variance. Typical mitigation: run each item 3–5 times, use median or pass-at-k.
- 02 · Judge drift. The judge model may update between runs, changing scores without artifact change. Pin the judge version. Re-baseline on upgrades.
- 03 · Interaction effects. Components A and B may be load-bearing only together. Single-component bisection misses this; pair-wise is O(n²). Practical limit: stop at pairs and report "interaction" without enumeration.
- 04 · Generated frontier items. Using an LLM to generate frontier items risks them inheriting the LLM's blind spots. Human-curated items are higher signal but more expensive.
- 05 · Adversarial robustness ≠ over-simplification. This framework detects accidental capability loss; it does not specifically test for adversarial robustness, which deserves its own pipeline.
- 06 · Cross-artifact regressions. When a skill is simplified, the agent that calls it may degrade in ways the skill's eval doesn't see. End-to-end agent evals are needed in addition.
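The mitigation named in the first item (repeat each eval item and aggregate) can be sketched as below. `repeated_eval` is a hypothetical helper, `run_once` stands in for a real scoring call, and treating pass-at-k as "at least one of k runs reaches the pass mark" is an assumption about how it would be applied here.

```python
from statistics import median
from typing import Callable, Dict


def repeated_eval(run_once: Callable[[str], float], item: str,
                  runs: int = 5, pass_mark: float = 1.0) -> Dict[str, object]:
    """Run one eval item several times and summarize the noisy scores."""
    scores = [run_once(item) for _ in range(runs)]
    return {
        "scores": scores,
        "median": median(scores),
        # pass-at-k: at least one of the k runs reached the pass mark
        "pass_at_k": any(score >= pass_mark for score in scores),
    }


# Scripted outcomes standing in for a stochastic model call.
outcomes = iter([1.0, 0.0, 1.0, 1.0, 0.0])
result = repeated_eval(lambda item: next(outcomes), "refund request, ambiguous order id")
print(result["median"], result["pass_at_k"])  # 1.0 True
```

The median is the stricter of the two summaries; pass-at-k is more forgiving and is probably a poor fit for BLOCK-severity contracts, where an intermittent failure still reaches users.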
Open question
Can we automate frontier-set generation from production failure traces with sufficient quality to be load-bearing? The data-flywheel literature is encouraging but does not yet show fully-automated curation matches human curation. This is where the next round of research will likely pay off.