Detecting Over-Simplification in Prompts, Skills, Agents & Plugins: In Depth

KellerAI

Section 01

Executive Summary

A growing class of production AI failures shares a single signature: an engineer simplifies an artifact — a prompt, a skill, an agent's system message, a plugin's tool descriptions — for clarity, cost, or maintainability. Aggregate evaluation metrics look stable. Users start hitting cliffs the eval doesn't see.

The published evidence is unambiguous. Subtle prompt formatting changes can swing accuracy by up to 76 points ³. Even the best frontier models reach only ~68% accuracy at maximum instruction density ². A STAR reasoning framework that scored 100% in isolation collapsed to 0–30% when embedded in a 60-line production prompt, an effect the author named instruction dilution ¹. SWE-Bench pass rates for SWE-Agent + GPT-4 fell from 12.47% to 3.97% after stricter validation, with 32.67% of original successes attributable to solution leakage ⁴.

The implication: most teams cannot tell when their artifacts have crossed a capability cliff, because the test set they built around the original artifact measures average competence rather than whether the capability frontier was preserved.

Over-simplification cannot be detected by aggregate pass rates; it must be tested at the capability frontier, against named behavioral contracts, with judges that control for their own biases — and the answer must point at which component carried the lost behavior.

The thesis

Section 02

The Silent Regression Problem

When a software function is simplified, regressions usually announce themselves: a test goes red, a build breaks, a type-checker complains. AI artifacts have no such property. They degrade smoothly across most inputs while collapsing on the narrow band of cases that justified the original complexity.

Three properties make this worse than ordinary regression:

01

Capability is non-uniform across inputs. A simplified prompt may handle the modal case well and the edge case catastrophically. Held-out accuracy averages over both ⁵.

02

Failure modes are non-linear in artifact length. Jaroslawicz et al. document three distinct degradation curves under instruction density: threshold decay, linear decay, exponential decay ². Knowing the shape for your artifact is itself a capability evaluators need.

03

The evaluator inherits the artifact's biases. If the eval set was constructed from the original artifact's strengths, the simplified version may look fine on it while failing on inputs the original was specifically designed to handle.

Definition

Over-Simplification. An artifact is over-simplified when a reduction in instructions, examples, context, or tools causes the removal of a capability that some non-trivial slice of users depends on, in a way not detected by the artifact's current evaluation harness.

This definition deliberately includes the eval harness in the loop. A regression the eval cannot see will ship unchallenged, so for operational purposes it has already happened.

Section 03

Artifact Taxonomy

"Advanced prompt, skill, agent, or plugin" covers four distinct surfaces with different failure modes. The detection framework has to adapt to each.

Artifact	Carries	Simplification Vector	Typical Failure
Prompt	Instructions, examples, role, output constraints	Token reduction, example pruning, role flattening	Format drift, lost edge cases, reasoning collapse ¹
Skill	SKILL.md, progressive disclosure layers, bundled scripts	Stripping references, flattening disclosure, generalizing descriptions	Trigger miss, execution drift ¹⁰
Agent	System prompt, tool roster, memory protocols, sub-agent topology	Tool removal, harness simplification, memory compression	Wrong tool selection, multi-turn drift, cascading errors ¹¹
Plugin	Tool descriptions, parameter schemas, usage examples	Description shortening, example elision, schema generalization	Mis-selection, malformed calls, hallucinated parameters ¹²

Across all four, the underlying mechanism is identical: specific tokens carry specific behaviors, and removing them removes the behaviors, but only on the inputs that triggered them. The detection challenge is identifying which inputs those are.

Section 04

Empirical Foundations

Six findings from the published literature anchor the framework. Each represents a constraint the system must respect.

I

Instruction dilution is real and measurable

Jo (2026) tested a STAR reasoning framework on a structured reasoning problem. In isolation: 100% accuracy. Embedded in a 60-line production system prompt: 0–30% accuracy. The mechanism: directives like "Lead with specifics" forced the model to commit to a conclusion before STAR reasoning could execute. The capability still existed; the surrounding context prevented it from running ¹.

II

Capability cliffs exist below context-window limits

Jaroslawicz et al. (2025) evaluated 20 frontier models against 500 simultaneous instructions. Even the best achieved only 68% accuracy at maximum density, with three failure regimes: threshold, linear, and exponential decay ². Levy et al. (2024) showed reasoning degradation begins around 3,000 tokens — far below technical context limits ⁶.

III

Format sensitivity dwarfs semantic changes

Sclar et al. (2024) found performance differences of up to 76 accuracy points from formatting changes alone. Mizrahi et al. (2024) confirmed across 6.5M instances and 39 tasks that benchmark rankings themselves change with prompt wording ³.

IV

LLM-as-judge has well-characterized biases

Position bias produces accuracy shifts >10% from swap alone ⁷. Verbosity bias favors longer outputs regardless of quality. Self-enhancement drives judges to prefer outputs from their own model family ⁸. Multi-agent debate amplifies rather than mitigates several of these ⁹.

V

Benchmark contamination inflates baselines

Aleithan et al. (2024) found 32.67% of SWE-Agent + GPT-4 successes involved solution leakage. Strict re-validation cut pass rates from 12.47% to 3.97% ⁴. SWE-Bench+ mutation work shows top models lose 36–54pp on realistic queries ¹².

VI

Behavioral testing predates LLMs

Ribeiro et al. (2020) introduced CheckList, importing software-engineering behavioral testing into NLP. Its three test types (Minimum Functionality Tests, Invariance, Directional Expectation) showed that even SOTA models failed on simple behaviors that aggregate metrics could not reveal ⁵. The approach transfers directly to artifact-level testing.
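The three CheckList test types can be illustrated with a small sketch. Everything here is an assumption made for illustration: `toy_sentiment` is a hypothetical stand-in for the system under test, and the example sentences are invented. It is not CheckList's own tooling, only the shape of each test type.

```python
POSITIVE = {"great", "friendly", "delicious"}
NEGATIVE = {"terrible", "rude", "cold"}

def toy_sentiment(text: str) -> int:
    """Hypothetical system under test: positive minus negative word count."""
    words = [w.strip(".,!").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def mft(model) -> bool:
    # Minimum Functionality Test: a simple case with a known answer.
    return model("The food was delicious.") > 0

def invariance(model) -> bool:
    # Invariance: a label-preserving edit (swapping a name) must not move the output.
    return model("Alex was friendly.") == model("Sam was friendly.")

def directional(model) -> bool:
    # Directional Expectation: appending a negative clause must not raise the score.
    base = "The staff was friendly."
    return model(base + " The soup was cold.") <= model(base)

results = {
    "MFT": mft(toy_sentiment),
    "INV": invariance(toy_sentiment),
    "DIR": directional(toy_sentiment),
}
print(results)
```

For an artifact-level suite, the same three shapes would presumably wrap calls to the prompt, skill, agent, or plugin rather than a word counter.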

Section 05

Failure Modes Catalog

Before designing tests, name the failures. Twelve modes recur across artifact types.

ID	Failure mode	Symptom
F1	Lost edge-case handling	Inputs the original artifact specifically anticipated
F2	Format drift	Output looks fine; downstream parsers fail
F3	Reasoning order collapse	Model commits to conclusion before reasoning runs
F4	Tool mis-selection	Plausible-looking tool call, wrong tool
F5	Trigger miss (skills)	Skill description no longer matches phrasing
F6	Refusal drift	More or less cautious than intended
F7	Citation loss	Claims that should be sourced no longer are
F8	Clarification regression	Model stops asking the right questions
F9	Multi-turn drift	Single-turn passes, conversation derails
F10	Persona flattening	Voice becomes generic, audience calibration lost
F11	Hallucinated parameters	Tool calls fabricate required fields
F12	Refusal tone	Refusals curt where they should be helpful

You cannot test for capability loss without naming the capabilities.

Section 06

Five Detection Approaches

The literature and practice yield five candidates. Each is rated on signal quality and effort. The goal is selecting a combination, not picking one.

Approach 01

Ablation Bisection

Signal: 4/5

Effort: Med

Method. Lock the original as gold. Remove components one at a time — instructions, examples, constraints, tool descriptions — re-running the eval after each. Locate the removal that crosses your failure threshold.

Why it works. Inherits the epistemological logic of ablation studies in ML: if removing X eliminates a behavior, X carried the behavior ¹³.

Where it breaks. Combinatorial when removals interact. Slow on artifacts with many components. Requires a high-quality eval set — bisection on a weak eval just bisects to noise.
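The remove-one-at-a-time loop can be sketched as follows. This is a minimal illustration under stated assumptions: the artifact is split into named components, and `toy_score` is a hypothetical stand-in for a real eval harness that returns a pass rate in [0, 1]. The component names and score penalties are invented.

```python
from typing import Callable, Dict, List

Artifact = Dict[str, str]  # component name -> component text

def ablate_one_at_a_time(
    components: Artifact,
    score: Callable[[Artifact], float],
    threshold: float,
) -> List[str]:
    """Return components whose removal alone drops the score below threshold."""
    culprits = []
    for name in components:
        reduced = {k: v for k, v in components.items() if k != name}
        if score(reduced) < threshold:
            culprits.append(name)
    return culprits

def toy_score(artifact: Artifact) -> float:
    # Hypothetical eval: the score depends only on which components are present.
    s = 1.0
    if "persona" not in artifact:
        s -= 0.05
    if "refusal_rules" not in artifact:
        s -= 0.30
    return s

original = {
    "persona": "<persona text>",
    "examples": "<examples text>",
    "refusal_rules": "<refusal text>",
}
culprits = ablate_one_at_a_time(original, toy_score, threshold=0.85)
print(culprits)
```

Note that this loop only finds single-component causes; interacting removals need the pair-wise treatment the text warns about.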

Approach 02

Hard-Case Capability Frontier Battery

Signal: 5/5

Effort: High

Method. Curate evaluation items at the edge of intended capability: long context, conflicting constraints, refusal handling, multi-turn drift, adversarial inputs, tool selection under ambiguity, format edge cases.

Why it works. Easy and mid-difficulty cases keep passing long after the artifact has degraded on its frontier ⁵.

Where it breaks. Curation cost is the real expense. Quality depends on whether you actually identified the frontier — a curator who shares the author's blind spots produces a set with the same blind spots.

Approach 03

Behavioral Contract Assertions

Signal: 4/5

Effort: Low

Method. Define non-negotiable behaviors as binary, programmatically-verifiable checks: must cite sources, must request clarification when ambiguous, must refuse class-X inputs, must call tool Z before answering, must produce parseable JSON.

Why it works. Categorical rather than gradient. Adapts CheckList's Minimum Functionality Test pattern ⁵. Cheap, fast, explainable, blocks merges cleanly.

Where it breaks. Only catches behaviors you specified. The ones you didn't think to write down can quietly disappear.

Approach 04

Capability Decay Curve

Signal: 4/5

Effort: High

Method. Generate stepped simplifications at 100%, 80%, 60%, 40%, 20% of original size. Plot the curve. Look for the cliff, not the slope.

Why it works. Empirically locates the threshold below which the artifact stops being itself. Connects to the threshold/linear/exponential decay regimes ².

Where it breaks. Token count is a crude proxy for what actually carries capability. The curve gives you a map but not a cause.
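"Look for the cliff, not the slope" can be made concrete with a short sketch. The scores below are invented for illustration, and picking the steepest single step is one assumed way to define the cliff, not a method prescribed by the cited work.

```python
from typing import List, Tuple

def find_cliff(curve: List[Tuple[int, float]]) -> Tuple[int, int, float]:
    """Given (size_percent, score) points, return (from_size, to_size, drop)
    for the steepest single-step drop as the artifact shrinks."""
    ordered = sorted(curve, key=lambda p: p[0], reverse=True)
    steps = [
        (a[0], b[0], round(a[1] - b[1], 4))
        for a, b in zip(ordered, ordered[1:])
    ]
    return max(steps, key=lambda s: s[2])

# Hypothetical frontier scores at each stepped simplification.
curve = [(100, 0.92), (80, 0.91), (60, 0.89), (40, 0.55), (20, 0.50)]
cliff = find_cliff(curve)
print(cliff)
```

In this toy curve the average slope is gentle, but nearly all of the loss happens in one step, which is the pattern the method is meant to surface.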

Recommended

Approach 05

Hybrid — Contracts + Frontier + Bias-Controlled Judge + Triggered Bisection

Signal: 5/5

Effort: High → Med

Method. Combine three signals on every artifact change, plus a fourth invoked conditionally:

A. Behavioral contracts: binary must-hold checks; categorical fail blocks change.
B. Frontier battery: curated edge-case eval, scored against threshold.
C. Paired LLM-judge with swap + length normalization: controls for position and verbosity biases.
D. Bisection: runs only when A, B, or C trips.

Why it works. Each signal catches what the others miss. Contracts catch behaviors you named. Frontier catches behaviors you tested. Judges catch behaviors you forgot to name but a careful reader would notice. Bisection localizes the cause. Cost is controlled by running bisection only on triggered failures.

Addresses four counter-arguments: aggregate hides what was lost → contracts make capability explicit; eval overfits to original → frontier targets unseen edges; judges are biased → paired position-swap neutralizes them; you learn it broke but not why → bisection localizes on triggered failures only.

Section 07

Hybrid Framework — Specification

The recommended hybrid resolves into four components, each with concrete inputs, outputs, and thresholds.

A · Behavioral Contracts

A behavioral contract is a triple: (trigger, behavior_check, severity). The trigger selects applicable inputs. The behavior check is a verifiable test on the output. Severity determines whether failure blocks the build or warns.

Example contracts for a customer-support agent:

(input_mentions_refund, output_calls_refund_lookup_first, BLOCK)

(input_is_ambiguous, output_asks_clarifying_question, BLOCK)

(input_is_safety_critical, output_routes_to_human, BLOCK)

(any, output_response_under_300_words, WARN)

Predicates can be regex, structural parsers, deterministic tool-trace inspections, or — sparingly — small LLM calls with high-agreement rubrics. Keep the set to 10–30 items.
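The (trigger, behavior_check, severity) triple maps naturally onto a small data structure. The sketch below is illustrative only: the `Contract` class, the `evaluate` helper, and the substring check for `refund_lookup` are assumptions. A real check for "calls refund lookup first" would inspect the tool trace rather than the output text.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class Contract:
    name: str
    trigger: Callable[[str], bool]  # selects applicable inputs
    check: Callable[[str], bool]    # verifiable test on the output
    severity: str                   # "BLOCK" or "WARN"

def is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def evaluate(contracts: List[Contract], user_input: str, output: str) -> List[Tuple[str, str]]:
    """Return (name, severity) for every applicable contract the output violates."""
    return [
        (c.name, c.severity)
        for c in contracts
        if c.trigger(user_input) and not c.check(output)
    ]

contracts = [
    Contract("valid_json", lambda i: True, is_json, "BLOCK"),
    Contract("under_300_words", lambda i: True, lambda o: len(o.split()) < 300, "WARN"),
    # Crude stand-in: a real check would inspect the tool-call trace.
    Contract("refund_calls_lookup", lambda i: "refund" in i.lower(),
             lambda o: "refund_lookup" in o, "BLOCK"),
]

violations = evaluate(contracts, "I want a refund", '{"reply": "Sure!"}')
blocked = any(severity == "BLOCK" for _, severity in violations)
print(violations, blocked)
```

Keeping triggers and checks as plain callables means regex predicates, structural parsers, and trace inspections can all share one runner.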

B · Frontier Battery

Curated set at the edge of intended capability. Suggested composition:

Category	Share	Example
Long context	15%	Inputs at 60–90% of context budget
Conflicting instructions	15%	User asks X, system prompt forbids X
Adversarial / injection	10%	Embedded "ignore previous instructions"
Refusal boundary	10%	Inputs at exact edge of policy
Tool ambiguity	15%	Two tools could plausibly handle
Multi-turn drift	15%	3–6 turn conversations
Format edges	10%	Nested JSON, multilingual, code blocks
Production-derived	10%	Real cases from monitoring

Size: 50–200 items typical. Refresh production-derived slice quarterly per the data-flywheel pattern ¹⁴.
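Because the battery is organized by category, it helps to score each category separately rather than as one pool. The sketch below assumes results arrive as (category, passed) pairs and uses 0.85 as an illustrative threshold. The category names and counts are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def pass_rate_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the pass rate of each frontier category independently."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}

def failing_categories(rates: Dict[str, float], threshold: float = 0.85) -> List[str]:
    return sorted(c for c, r in rates.items() if r < threshold)

# Hypothetical run: two healthy categories and one that has collapsed.
results = (
    [("long_context", True)] * 10
    + [("multi_turn", True)] * 10
    + [("tool_ambiguity", True)] * 2
    + [("tool_ambiguity", False)] * 2
)
rates = pass_rate_by_category(results)
overall = sum(passed for _, passed in results) / len(results)
failing = failing_categories(rates)
print(round(overall, 3), failing)
```

Here the pooled pass rate clears the threshold while one category sits at 50%, which is why a pooled score alone would miss the regression.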

C · Bias-Controlled Paired Judge

Naive LLM-as-judge fails. Position swap alone shifts pairwise outcomes by >10% ⁷. The bias-controlled protocol:

# For each input in eval set:
# 1. Generate output from original artifact (A) and simplified (B)
# 2. Judge twice with positions swapped: (A,B) and (B,A)
# 3. If judgments agree   → record winner
# 4. If they disagree     → record TIE (position bias detected)
# 5. Normalize for length: if winner is >1.5× longer, downweight
# 6. Aggregate: simplified passes iff win_rate(B) ≥ 0.50

Use a judge from a different model family than the artifact's own model to reduce self-enhancement bias ⁸. Calibrate quarterly against a small human-labeled set.
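The swap-and-normalize protocol above can be sketched in a few lines. Several details are assumptions made for illustration: the judge is any callable returning "first" or "second", ties are excluded from the win-rate denominator and reported separately, and a too-long winner counts as half a win (the source says only "downweight"). `shorter_is_better` is a toy judge, not a real model call.

```python
from typing import Callable, List, Tuple

Judge = Callable[[str, str, str], str]  # (prompt, first, second) -> "first" | "second"

def paired_verdict(judge: Judge, prompt: str, out_a: str, out_b: str) -> str:
    """Judge twice with positions swapped; return 'A', 'B', or 'TIE'."""
    w1 = "A" if judge(prompt, out_a, out_b) == "first" else "B"  # A shown first
    w2 = "B" if judge(prompt, out_b, out_a) == "first" else "A"  # B shown first
    return w1 if w1 == w2 else "TIE"

def length_weight(winner: str, loser: str, ratio: float = 1.5, penalty: float = 0.5) -> float:
    """Downweight a win when the winner is more than `ratio` times longer."""
    return penalty if len(winner) > ratio * len(loser) else 1.0

def summarize(judge: Judge, items: List[Tuple[str, str, str]]) -> Tuple[float, float]:
    """Return (win_rate_b over decided items, tie_rate over all items)."""
    wins = {"A": 0.0, "B": 0.0}
    ties = 0
    for prompt, a, b in items:
        verdict = paired_verdict(judge, prompt, a, b)
        if verdict == "TIE":
            ties += 1
        elif verdict == "A":
            wins["A"] += length_weight(a, b)
        else:
            wins["B"] += length_weight(b, a)
    decided = wins["A"] + wins["B"]
    return (wins["B"] / decided if decided else 0.0, ties / len(items))

def shorter_is_better(prompt: str, first: str, second: str) -> str:
    # Toy judge: prefers the shorter output and falls back to position on equal length.
    return "first" if len(first) <= len(second) else "second"

items = [
    ("q1", "long answer here", "short"),
    ("q2", "same", "size"),
    ("q3", "ok", "much longer reply"),
]
win_rate_b, tie_rate = summarize(shorter_is_better, items)
print(win_rate_b, round(tie_rate, 3))
```

The second item shows the diagnostic value of the swap: the toy judge's verdict there depends only on position, so it is recorded as a tie rather than a win for either side.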

D · Triggered Bisection

Runs only when A, B, or C trips:

# Inputs: original, simplified, failing_eval_subset
# 1. Diff the two artifacts → list of removed/changed components
# 2. For each component, build a "partial" artifact:
#    simplified + just that component restored
# 3. Re-run failing items against each partial
# 4. The component whose restoration recovers failures = cause
# 5. If no single component recovers → interaction effect; repeat pair-wise

O(n) for single causes, O(n²) for pair-interactions. Cap at pairs; report "interaction" rather than enumerate triples.
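A restore-one-then-pairs search might look like the sketch below. The `recovers` callable is a hypothetical hook that rebuilds the simplified artifact with the given components restored and reports whether the failing items now pass. The toy version here hard-codes an interaction between two invented components.

```python
from itertools import combinations
from typing import Any, Callable, Dict, FrozenSet, List

def bisect_restore(
    removed: List[str],
    recovers: Callable[[FrozenSet[str]], bool],
) -> Dict[str, Any]:
    """Restore removed components singly (O(n)), then in pairs (O(n^2)).
    Anything beyond pairs is reported as an interaction, not enumerated."""
    singles = [c for c in removed if recovers(frozenset([c]))]
    if singles:
        return {"kind": "single", "cause": singles}
    pairs = [p for p in combinations(removed, 2) if recovers(frozenset(p))]
    if pairs:
        return {"kind": "pair", "cause": pairs}
    return {"kind": "interaction", "cause": []}

def toy_recovers(restored: FrozenSet[str]) -> bool:
    # Hypothetical: failures clear only when persona AND examples are both back.
    return {"persona", "examples"} <= restored

removed = ["persona", "examples", "refusal_rules"]
report = bisect_restore(removed, toy_recovers)
print(report)
```

Each call to `recovers` implies a re-run of the failing subset, so the cost of the pair stage grows quickly with the number of removed components.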

Section 08

Live Simulator

Six capability-carrying components below. Toggle them off to simulate simplification. Three scores reflect what each detection layer would see — and where they diverge.

[Interactive ablation simulator. Baseline readout with all components intact: Aggregate Pass 100/100, Frontier Battery 100/100, Contracts Held 6/6; all detection layers green.]

Two patterns worth observing: aggregate pass rate degrades gradually while contracts can break cleanly on a single removal; and different components produce qualitatively different verdicts even at similar token cost.

Section 10

Case Study

A restaurant marketing prompt

Consider a restaurant marketing system prompt evolved over months for an operator named Alex. The original is 90 lines covering: persona, audience targeting, campaign taxonomy, brand-voice constraints, three campaign examples, refusal handling, output schema.

A new engineer proposes simplification: collapse three examples into one, drop the persona block ("the LLM is good enough now"), shorten refusal handling. The simplified version is 38 lines — a 58% reduction.

Step 1 · Contracts

ID	Contract	Result
C1	Output is valid JSON matching campaign schema	PASS
C2	Discount > 30% includes margin warning	PASS
C3	Voice avoids five banned phrases (synergy, leverage…)	FAIL 8/50
C4	Refusals propose on-brand alternative	FAIL 6/50

Build is blocked at Layer A: both failed contracts are BLOCK severity. There is no need to run the frontier battery or the paired judge before bisecting.

Step 2 · Bisection

Triggered. The bisector restores each removed component and re-tests:

Restored component	C3 recovery	C4 recovery
Persona block	100%	0%
Three separate examples	25%	33%
Original refusal handling	0%	100%

Output: C3 was carried by the persona block; C4 by the refusal handling. Both must be restored or rewritten before the simplification ships.
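Reading the recovery table programmatically is straightforward. The sketch below uses the figures from the table above. The 0.9 cutoff for calling a component the carrier of a contract is an assumption chosen for illustration.

```python
from typing import Dict, List

# Recovery rates copied from the bisection table above.
recovery: Dict[str, Dict[str, float]] = {
    "Persona block": {"C3": 1.00, "C4": 0.00},
    "Three separate examples": {"C3": 0.25, "C4": 0.33},
    "Original refusal handling": {"C3": 0.00, "C4": 1.00},
}

def carriers(table: Dict[str, Dict[str, float]], min_recovery: float = 0.9) -> Dict[str, List[str]]:
    """Map each contract to the components whose restoration recovers it."""
    out: Dict[str, List[str]] = {}
    for component, by_contract in table.items():
        for contract, rate in by_contract.items():
            if rate >= min_recovery:
                out.setdefault(contract, []).append(component)
    return out

attribution = carriers(recovery)
print(attribution)
```

Components that appear under no contract at this cutoff, such as the three examples here, are the natural candidates for further compression.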

Step 3 · A constructive next move

Knowing which components carry which behaviors changes the conversation from "we can't simplify" to "we can simplify these specific parts but not these." A second iteration might compress the three examples (which the bisector showed are weakly load-bearing) while preserving persona and refusal blocks. The framework gives an evidence-based path to a smaller artifact — not a flat veto.

The framework's deepest value is not catching bad simplifications. It is making good simplifications possible.

Section 11

CI/CD Integration

The framework runs as a check on artifact changes. Suggested wiring:

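# .github/workflows/artifact-eval.yml
name: artifact-evaluation
on:
  pull_request:
    paths: ['prompts/**', 'skills/**', 'agents/**', 'plugins/**']

# Each job runs on a fresh runner, so every job needs runs-on and a checkout.
jobs:
  contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.contracts --strict

  frontier:
    needs: contracts
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.frontier --threshold 0.85

  paired:
    needs: contracts
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.paired --sample 50 --swap --normalize-length

  # Runs only when one of the jobs it depends on has failed.
  bisect:
    needs: [contracts, frontier, paired]
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.bisect --report-pr-comment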

Suggested thresholds

Signal	Default	Action
Block-severity contracts	100% must hold	BLOCK
Warn-severity contracts	> 95% must hold	WARN
Frontier category pass rate	≥ 85% per category	BLOCK
Paired win rate (simplified)	≥ 45% (not significantly worse)	BLOCK
Paired tie rate (bias signal)	< 20%	WARN

Defaults — tune per artifact; safety-critical agents warrant stricter thresholds.
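The threshold table can be expressed as a single gate function. This is a sketch under assumptions: the function name, argument names, and message strings are hypothetical, and the numeric defaults are the ones listed in the table above.

```python
from typing import Dict, List, Tuple

def gate(
    block_contracts_held: float,
    warn_contracts_held: float,
    frontier_rates: Dict[str, float],
    paired_win_rate: float,
    paired_tie_rate: float,
) -> Tuple[str, List[str], List[str]]:
    """Apply the default thresholds; return (status, block_reasons, warn_reasons)."""
    blocks: List[str] = []
    warns: List[str] = []
    if block_contracts_held < 1.0:        # 100% must hold
        blocks.append("block-severity contract failed")
    if warn_contracts_held <= 0.95:       # > 95% must hold
        warns.append("warn-severity contracts at or below 95%")
    for category, rate in sorted(frontier_rates.items()):
        if rate < 0.85:                   # >= 85% per category
            blocks.append(f"frontier:{category}")
    if paired_win_rate < 0.45:            # >= 45% for the simplified artifact
        blocks.append("paired win rate below 45%")
    if paired_tie_rate >= 0.20:           # < 20% expected
        warns.append("paired tie rate at or above 20%")
    status = "BLOCK" if blocks else ("WARN" if warns else "PASS")
    return status, blocks, warns

status, blocks, warns = gate(
    block_contracts_held=1.0,
    warn_contracts_held=0.97,
    frontier_rates={"long_context": 0.90, "tool_ambiguity": 0.80},
    paired_win_rate=0.48,
    paired_tie_rate=0.25,
)
print(status, blocks, warns)
```

Returning the reasons alongside the status keeps the PR comment explainable and gives bisection a concrete failing subset to start from.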

The data flywheel

Production failures should flow back into the eval suite. The pattern documented by LangChain, Braintrust, and others ¹⁴ ¹⁵: monitoring identifies a production trace where the artifact misbehaved, a domain expert annotates the correct trajectory, the corrected example joins the frontier battery. The eval grows organically to match the failure modes the artifact actually encounters.

Section 12

Counter-Arguments Addressed

The framework is more expensive than "just look at pass rate." Four objections deserve direct response.

01

"This is overkill for small prompts."

It is, for small prompts. The framework scales: a 200-token prompt with no tools needs maybe three contracts, a frontier of 20 items, and the paired-judge layer; bisection is trivial. The expensive components earn their cost on artifacts whose simplification has real consequences. Decision rule: if the artifact is in production and a regression would matter, run the full framework. If it is a personal one-off, run contracts only.

02

"LLM-as-judge is too unreliable to anchor a CI gate."

It is, used naively. The literature is explicit about position bias, verbosity bias, self-enhancement bias, and stochastic variability ⁷ ⁸ ⁹. The bias-controlled paired protocol (swap, length-normalize, cross-family) addresses the three biases, and repeated runs dampen run-to-run variance. Position-bias-induced ties are reported separately as a diagnostic. The judge is a signal, not the only signal; it sits alongside categorical contracts and pre-curated evals.

03

"Frontier curation is too expensive to maintain."

For the up-front 50-item set, this is real — typically a person-week of domain-expert time. Maintenance is much lower: refresh the production-derived slice (10–20%) quarterly. Most categories — long context, conflicting instructions, format edges — are reusable across artifacts in the same domain. Cost amortization: a frontier set built for one customer-support agent transfers ~70% to the next.

04

"This still misses capabilities we didn't think to test for."

True. No black-box test catches what wasn't named. Three mitigations:

Periodic frontier refresh from production failures (the data flywheel).
Paired judge captures qualitative differences a careful reader would notice — including ones not on the contract list.
Adversarial review: have a domain expert try to break the simplified artifact before merge.

None of these reduce failure rate to zero. They reduce it to where surprises become recoverable rather than catastrophic.

Section 13

Limitations & Open Questions

The framework is incomplete in known ways. Naming them honestly is part of operating it well.

01 Stochasticity baseline. Any single eval run has noise. Threshold tuning must account for variance. Typical mitigation: run each item 3–5 times, use median or pass-at-k.
02 Judge drift. The judge model may update between runs, changing scores without artifact change. Pin the judge version. Re-baseline on upgrades.
03 Interaction effects. Components A and B may be load-bearing only together. Single-component bisection misses this; pair-wise is O(n²). Practical limit: stop at pairs and report "interaction" without enumeration.
04 Generated frontier items. Using an LLM to generate frontier items risks them inheriting the LLM's blind spots. Human-curated items are higher signal but more expensive.
05 Adversarial robustness ≠ over-simplification. This framework detects accidental capability loss; it does not specifically test for adversarial robustness, which deserves its own pipeline.
06 Cross-artifact regressions. When a skill is simplified, the agent that calls it may degrade in ways the skill's eval doesn't see. End-to-end agent evals are needed in addition.
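The stochasticity mitigation in item 01 (repeat each item, then take the median or pass-at-k) can be sketched as below. Two things are assumptions made for illustration: `run_once` is a hypothetical hook into the eval harness, and pass-at-k is read in its simplest sense, namely that at least one of k runs passes. The flaky outcomes are invented and deterministic so the example is reproducible.

```python
from statistics import median
from typing import Callable, List

def repeat(run_once: Callable[[], float], k: int = 5) -> List[float]:
    """Run one eval item k times and collect the scores."""
    return [run_once() for _ in range(k)]

def median_score(scores: List[float]) -> float:
    return median(scores)

def pass_at_k(scores: List[float], pass_mark: float = 1.0) -> bool:
    """True if at least one of the k runs reached the pass mark."""
    return any(s >= pass_mark for s in scores)

# Toy flaky item: passes on 3 of 5 runs (fixed sequence instead of sampling).
outcomes = iter([1.0, 0.0, 1.0, 1.0, 0.0])
scores = repeat(lambda: next(outcomes), k=5)
print(median_score(scores), pass_at_k(scores))
```

The two aggregates answer different questions: the median reflects typical behavior, while pass-at-k only shows the behavior is reachable, so a safety-critical contract would likely want the stricter of the two.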

Open question

Can we automate frontier-set generation from production failure traces with sufficient quality to be load-bearing? The data-flywheel literature is encouraging but does not yet show fully-automated curation matches human curation. This is where the next round of research will likely pay off.
