kellerai.blog

Detecting Over-Simplification in Four Artifact Types

When you compress an artifact, aggregate pass rates lie.

KellerAI White Paper · In-Depth · Prompt & Artifact Engineering · May 2026 · ~29 min read

Context

A growing class of production AI failures shares a single signature: an engineer simplifies an artifact—a prompt, a skill, an agent's system message, a plugin's tool descriptions—for clarity, cost, or maintainability. Aggregate evaluation metrics look stable. Users start hitting cliffs the eval doesn't see. Subtle prompt formatting changes can swing accuracy by up to 76 points; frontier models cap at approximately 68% accuracy under high instruction density; a STAR reasoning framework that scored 100% in isolation collapsed to 0–30% when embedded in a 60-line production system prompt.

The implication: most teams cannot tell when their artifacts have crossed a capability cliff, because the test set they built around the original artifact tests average competence rather than preserved capability frontier. The framework must operate at the boundary where capability actually lives—not where it averages well.

The Finding

Over-simplification cannot be detected by aggregate pass rates; it must be tested at the capability frontier, against named behavioral contracts, with judges that control for their own biases—and the answer must point at which component carried the lost behavior. This analytical framework traces six empirical foundations (instruction dilution, capability cliffs, format sensitivity, judge bias, benchmark contamination, behavioral testing), catalogs twelve failure modes across four artifact types, and proposes a recommended hybrid: behavioral contracts as binary categorical checks; a hard-case frontier battery curated at the edge of intended capability; a paired LLM judge with position swap and length normalization; and triggered bisection to localize the cause.

The framework's deepest value is not catching bad simplifications—it is making good simplifications possible. When you can see which components carry which behaviors, the conversation stops being 'we can't simplify this' and becomes 'we can simplify these specific parts but not those,' grounded in evidence.

Tags:
Evaluation Framework Design · Behavioral Testing · Capability Frontier
Cite this paper

KellerAI. (2026, May 1). Detecting Over-Simplification in Prompts, Skills, Agents & Plugins: In Depth. KellerAI. https://kellerai.blog/over-simplification-in-depth

Paper Details
Category: Prompt & Artifact Engineering
Audience: Engineering teams, platform architects, and AI safety researchers governing artifact changes in production
Method: Published literature synthesis (six empirical foundations) + failure-mode taxonomy + five-approach analysis + hybrid framework specification + CI/CD wiring + restaurant-prompt case study + counter-argument analysis
Length: ~7,000 · ~29 min
Reading level: Technical
Sections: 12
References: 17
Version: v1.0 · Updated May 2026
Published: May 2026
Key Takeaways
  • Capability is non-uniform across inputs—a simplified artifact handles the common case well and the rare case catastrophically, and the average hides the regression.
  • Behavioral contracts, frontier batteries, bias-controlled judges, and triggered bisection together catch failures that aggregate pass rates cannot see, and localize the cause to a specific component.
  • Continuous behavioral fingerprinting and frontier refresh from production failures close the gaps aggregate metrics leave open, making simplification evidence-based rather than guessed.
Section 01

Executive Summary

A growing class of production AI failures shares a single signature: an engineer simplifies an artifact — a prompt, a skill, an agent's system message, a plugin's tool descriptions — for clarity, cost, or maintainability. Aggregate evaluation metrics look stable. Users start hitting cliffs the eval doesn't see.

The published evidence is unambiguous. Subtle prompt formatting changes can swing accuracy by up to 76 points [3]. Frontier models cap at ~68% accuracy under high instruction density [2]. A STAR reasoning framework that scored 100% in isolation collapsed to 0–30% when embedded in a 60-line production prompt; the authors named this instruction dilution [1]. SWE-Bench pass rates for SWE-Agent + GPT-4 fell from 12.47% to 3.97% after stricter validation, with 32.67% of original successes attributable to solution leakage [4].

The implication: most teams cannot tell when their artifacts have crossed a capability cliff, because the test set they built around the original artifact tests average competence rather than preserved capability frontier.

Over-simplification cannot be detected by aggregate pass rates; it must be tested at the capability frontier, against named behavioral contracts, with judges that control for their own biases — and the answer must point at which component carried the lost behavior.

The thesis
Section 02

The Silent Regression Problem

When a software function is simplified, regressions usually announce themselves: a test goes red, a build breaks, a type-checker complains. AI artifacts have no such property. They degrade smoothly across most inputs while collapsing on the narrow band of cases that justified the original complexity.

Three properties make this worse than ordinary regression:

01 · Capability is non-uniform across inputs. A simplified prompt may handle the modal case well and the edge case catastrophically. Held-out accuracy averages over both [5].
02 · Failure modes are non-linear in artifact length. Jaroslawicz et al. document three distinct degradation curves under instruction density: threshold decay, linear decay, and exponential decay [2]. Knowing the shape for your artifact is itself a capability evaluators need.
03 · The evaluator inherits the artifact's biases. If the eval set was constructed from the original artifact's strengths, the simplified version may look fine on it while failing on inputs the original was specifically designed to handle.
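The first property can be made concrete with a small calculation. The sketch below uses invented numbers (95 common-case items, 5 edge-case items) and a hypothetical `pass_rates` helper; it only illustrates how an aggregate can stay high while one slice collapses.

```python
from collections import defaultdict

def pass_rates(results):
    """results: iterable of (slice_name, passed) -> (aggregate rate, per-slice rates)."""
    totals, passes = defaultdict(int), defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        passes[slice_name] += int(passed)
    aggregate = sum(passes.values()) / sum(totals.values())
    per_slice = {name: passes[name] / totals[name] for name in totals}
    return aggregate, per_slice

# Hypothetical eval set: the original passes everything; the simplified
# artifact keeps the common case but fails 4 of the 5 edge cases.
original = [("common", True)] * 95 + [("edge", True)] * 5
simplified = [("common", True)] * 95 + [("edge", True)] + [("edge", False)] * 4

agg_orig, slices_orig = pass_rates(original)
agg_simp, slices_simp = pass_rates(simplified)
print(f"aggregate: {agg_orig:.2f} -> {agg_simp:.2f}")
print(f"edge slice: {slices_orig['edge']:.2f} -> {slices_simp['edge']:.2f}")
```

With these numbers the aggregate moves by four points while the edge slice loses four fifths of its pass rate, which is the divergence the section describes.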

Definition

Over-Simplification. An artifact is over-simplified when a reduction in instructions, examples, context, or tools causes the removal of a capability that some non-trivial slice of users depends on, in a way not detected by the artifact's current evaluation harness.

This definition deliberately includes the eval harness in the loop. If an eval can't see the regression, the regression has, for operational purposes, happened.

Section 03

Artifact Taxonomy

"Advanced prompt, skill, agent, or plugin" covers four distinct surfaces with different failure modes. The detection framework has to adapt to each.

Artifact | Carries | Simplification Vector | Typical Failure
Prompt | Instructions, examples, role, output constraints | Token reduction, example pruning, role flattening | Format drift, lost edge cases, reasoning collapse [1]
Skill | SKILL.md, progressive disclosure layers, bundled scripts | Stripping references, flattening disclosure, generalizing descriptions | Trigger miss, execution drift [10]
Agent | System prompt, tool roster, memory protocols, sub-agent topology | Tool removal, harness simplification, memory compression | Wrong tool selection, multi-turn drift, cascading errors [11]
Plugin | Tool descriptions, parameter schemas, usage examples | Description shortening, example elision, schema generalization | Mis-selection, malformed calls, hallucinated parameters [12]

Across all four, the underlying mechanism is identical: specific tokens carry specific behaviors, and removing them removes the behaviors, but only on the inputs that triggered them. The detection challenge is identifying which inputs those are.

Section 04

Empirical Foundations

Six findings from the published literature anchor the framework. Each represents a constraint the system must respect.

I

Instruction dilution is real and measurable

Jo (2026) tested a STAR reasoning framework on a structured reasoning problem. In isolation: 100% accuracy. Embedded in a 60-line production system prompt: 0–30% accuracy. The mechanism: directives like "Lead with specifics" forced the model to commit to a conclusion before STAR reasoning could execute. The capability still existed; the surrounding context prevented it from running [1].

II

Capability cliffs exist below context-window limits

Jaroslawicz et al. (2025) evaluated 20 frontier models against 500 simultaneous instructions. Even the best achieved only 68% accuracy at maximum density, with three failure regimes: threshold, linear, and exponential decay [2]. Levy et al. (2024) showed reasoning degradation begins around 3,000 tokens, far below technical context limits [6].

III

Format sensitivity dwarfs semantic changes

Sclar et al. (2024) found performance differences of up to 76 accuracy points from formatting changes alone. Mizrahi et al. (2024) confirmed across 6.5M instances and 39 tasks that benchmark rankings themselves change with prompt wording [3].

IV

LLM-as-judge has well-characterized biases

Position bias produces accuracy shifts >10% from swap alone [7]. Verbosity bias favors longer outputs regardless of quality. Self-enhancement drives judges to prefer outputs from their own model family [8]. Multi-agent debate amplifies rather than mitigates several of these [9].

V

Benchmark contamination inflates baselines

Aleithan et al. (2024) found 32.67% of SWE-Agent + GPT-4 successes involved solution leakage. Strict re-validation cut pass rates from 12.47% to 3.97% [4]. Benchmark-mutation work on SWE-Bench (Garg et al., 2025) shows top models lose 36–54pp on realistic queries [12].

VI

Behavioral testing predates LLMs

Ribeiro et al. (2020) introduced CheckList, importing software-engineering behavioral testing into NLP. Three test types (Minimum Functionality Tests, Invariance, Directional Expectation) showed that even SOTA models failed on simple behaviors that aggregate metrics could not reveal [5]. The approach transfers directly to artifact-level testing.
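To show how the three CheckList test types might look at artifact level, here is a minimal sketch. It does not use the CheckList library; `run_artifact` is a hypothetical stand-in for a real model call, and the inputs are invented.

```python
def run_artifact(user_input: str) -> str:
    """Stand-in for a model call (hypothetical): routes refund-like requests."""
    text = user_input.lower()
    return "refund_flow" if "refund" in text or "money back" in text else "general"

def minimum_functionality(cases):
    """MFT: each simple input must produce its expected output."""
    return all(run_artifact(x) == expected for x, expected in cases)

def invariance(pairs):
    """INV: a meaning-preserving perturbation must not change the output."""
    return all(run_artifact(a) == run_artifact(b) for a, b in pairs)

def directional(pairs, before, after):
    """DIR: a targeted edit must move the output from `before` to `after`."""
    return all(run_artifact(a) == before and run_artifact(b) == after
               for a, b in pairs)

results = {
    "MFT": minimum_functionality([("I want a refund", "refund_flow")]),
    "INV": invariance([("I want a refund", "I WANT A REFUND, please")]),
    "DIR": directional(
        [("What are your hours?",
          "What are your hours? Also, I want my money back.")],
        before="general", after="refund_flow"),
}
print(results)
```

Each check returns a boolean, which is what lets the same pattern feed a categorical gate rather than an averaged score.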

Section 05

Failure Modes Catalog

Before designing tests, name the failures. Twelve modes recur across artifact types.

F1 · Lost edge-case handling. Inputs the original artifact specifically anticipated
F2 · Format drift. Output looks fine; downstream parsers fail
F3 · Reasoning order collapse. Model commits to conclusion before reasoning runs
F4 · Tool mis-selection. Plausible-looking tool call, wrong tool
F5 · Trigger miss (skills). Skill description no longer matches phrasing
F6 · Refusal drift. More or less cautious than intended
F7 · Citation loss. Claims that should be sourced no longer are
F8 · Clarification regression. Model stops asking the right questions
F9 · Multi-turn drift. Single-turn passes, conversation derails
F10 · Persona flattening. Voice becomes generic, audience calibration lost
F11 · Hallucinated parameters. Tool calls fabricate required fields
F12 · Refusal tone. Refusals curt where they should be helpful

You cannot test for capability loss without naming the capabilities.

Section 06

Five Detection Approaches

The literature and practice yield five candidates. Each is rated on signal quality and effort. The goal is selecting a combination, not picking one.

Approach 01

Ablation Bisection

Signal: 4/5
Effort: Med

Method. Lock the original as gold. Remove components one at a time — instructions, examples, constraints, tool descriptions — re-running the eval after each. Locate the removal that crosses your failure threshold.

Why it works. Inherits the epistemological logic of ablation studies in ML: if removing X eliminates a behavior, X carried the behavior [13].

Where it breaks. Combinatorial when removals interact. Slow on artifacts with many components. Requires a high-quality eval set — bisection on a weak eval just bisects to noise.
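One reading of the method is a cumulative removal loop that stops at the first removal crossing the threshold. The sketch below assumes that reading; `first_breaking_removal`, the component names, and the toy `evaluate` function are all hypothetical.

```python
def first_breaking_removal(components, removal_order, evaluate, threshold):
    """Remove components cumulatively; return (name, score) for the first
    removal that pushes the eval score below `threshold`, else (None, score)."""
    remaining = list(components)
    for name in removal_order:
        remaining.remove(name)
        score = evaluate(remaining)
        if score < threshold:
            return name, score
    return None, evaluate(remaining)

# Toy eval (assumption): each item lists the components it needs to pass.
EVAL_ITEMS = [{"output_schema"}] * 8 + [{"persona"}, {"refusal_handling"}]

def evaluate(remaining):
    present = set(remaining)
    return sum(needs <= present for needs in EVAL_ITEMS) / len(EVAL_ITEMS)

components = ["persona", "examples", "refusal_handling", "output_schema"]
culprit, score = first_breaking_removal(
    components, ["examples", "persona", "refusal_handling"],
    evaluate, threshold=0.85)
print(culprit, score)
```

In this toy run the threshold is crossed on the third removal, even though the second removal already cost a tenth of the score. That is a small instance of the interaction problem noted above: the removal that crosses the line is not always the only one that mattered.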

Approach 02

Hard-Case Capability Frontier Battery

Signal: 5/5
Effort: High

Method. Curate evaluation items at the edge of intended capability: long context, conflicting constraints, refusal handling, multi-turn drift, adversarial inputs, tool selection under ambiguity, format edge cases.

Why it works. Easy-middle cases pass long after the artifact has degraded on its frontier [5].

Where it breaks. Curation cost is the real expense. Quality depends on whether you actually identified the frontier — a curator who shares the author's blind spots produces a set with the same blind spots.

Approach 03

Behavioral Contract Assertions

Signal: 4/5
Effort: Low

Method. Define non-negotiable behaviors as binary, programmatically-verifiable checks: must cite sources, must request clarification when ambiguous, must refuse class-X inputs, must call tool Z before answering, must produce parseable JSON.

Why it works. Categorical rather than gradient. Adapts CheckList's Minimum Functionality Test pattern [5]. Cheap, fast, explainable, blocks merges cleanly.

Where it breaks. Only catches behaviors you specified. The ones you didn't think to write down can quietly disappear.

Approach 04

Capability Decay Curve

Signal: 4/5
Effort: High

Method. Generate stepped simplifications at 100%, 80%, 60%, 40%, 20% of original size. Plot the curve. Look for the cliff, not the slope.

Why it works. Empirically locates the threshold below which the artifact stops being itself. Connects to the threshold/linear/exponential decay regimes [2].

Where it breaks. Token count is a crude proxy for what actually carries capability. The curve gives you a map but not a cause.
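"Look for the cliff, not the slope" can be sketched as finding the largest single-step drop along the curve. The scores below are invented for illustration, and `find_cliff` is a hypothetical helper, not part of any published tooling.

```python
def find_cliff(curve):
    """curve: list of (size_fraction, score), ordered from largest to smallest.
    Returns (from_size, to_size, drop) for the step with the largest score drop."""
    steps = [
        (a_size, b_size, a_score - b_score)
        for (a_size, a_score), (b_size, b_score) in zip(curve, curve[1:])
    ]
    return max(steps, key=lambda step: step[2])

# Hypothetical frontier-battery scores at 100%, 80%, 60%, 40%, 20% of original size.
curve = [(1.0, 0.94), (0.8, 0.93), (0.6, 0.91), (0.4, 0.62), (0.2, 0.55)]
from_size, to_size, drop = find_cliff(curve)
print(f"cliff between {from_size:.0%} and {to_size:.0%} of original size "
      f"(drop of {drop:.2f})")
```

The average slope over the whole curve would understate what happens between 60% and 40%; the per-step view isolates it. As the text notes, this locates the cliff but says nothing about which component caused it.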

Recommended
Approach 05

Hybrid — Contracts + Frontier + Bias-Controlled Judge + Triggered Bisection

Signal: 5/5
Effort: High → Med

Method. Combine three signals on every artifact change, plus a fourth invoked conditionally:

  1. A · Behavioral contracts: binary must-hold checks; a categorical fail blocks the change.
  2. B · Frontier battery: curated edge-case eval, scored against a threshold.
  3. C · Paired LLM judge with swap + length normalization: controls for position and verbosity biases.
  4. D · Bisection: runs only when A, B, or C trips.

Why it works. Each signal catches what the others miss. Contracts catch behaviors you named. Frontier catches behaviors you tested. Judges catch behaviors you forgot to name but a careful reader would notice. Bisection localizes the cause. Cost is controlled by running bisection only on triggered failures.

Addresses four counter-arguments: aggregate hides what was lost → contracts make capability explicit; eval overfits to original → frontier targets unseen edges; judges are biased → paired position-swap neutralizes them; you learn it broke but not why → bisection localizes on triggered failures only.

Section 07

Hybrid Framework — Specification

The recommended hybrid resolves into four components, each with concrete inputs, outputs, and thresholds.

A · Behavioral Contracts

A behavioral contract is a triple: (trigger, behavior_check, severity). The trigger selects applicable inputs. The behavior check is a verifiable test on the output. Severity determines whether failure blocks the build or warns.

Example contracts for a customer-support agent:

(input_mentions_refund, output_calls_refund_lookup_first, BLOCK)

(input_is_ambiguous, output_asks_clarifying_question, BLOCK)

(input_is_safety_critical, output_routes_to_human, BLOCK)

(any, output_response_under_300_words, WARN)

Predicates can be regex, structural parsers, deterministic tool-trace inspections, or — sparingly — small LLM calls with high-agreement rubrics. Keep the set to 10–30 items.
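The (trigger, behavior_check, severity) triple maps naturally onto a small data structure. The following is a minimal sketch under stated assumptions: the `Contract` class, the output shape (a dict with `text` and `tool_calls`), and the sample cases are hypothetical, and the two contracts are simplified versions of the examples above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Contract:
    name: str
    trigger: Callable[[str], bool]   # selects applicable inputs
    check: Callable[[dict], bool]    # verifiable test on the output
    severity: str                    # "BLOCK" or "WARN"

CONTRACTS = [
    Contract("refund_lookup_first",
             trigger=lambda inp: "refund" in inp.lower(),
             check=lambda out: out["tool_calls"][:1] == ["refund_lookup"],
             severity="BLOCK"),
    Contract("response_under_300_words",
             trigger=lambda inp: True,
             check=lambda out: len(out["text"].split()) < 300,
             severity="WARN"),
]

def run_contracts(contracts, cases):
    """Return (contract name, severity, input) for every violated contract."""
    return [
        (c.name, c.severity, inp)
        for inp, out in cases
        for c in contracts
        if c.trigger(inp) and not c.check(out)
    ]

# Hypothetical (input, output) pairs captured from the artifact under test.
cases = [
    ("I need a refund for order 1", {"text": "Checking.", "tool_calls": ["refund_lookup"]}),
    ("Refund please", {"text": "Sure, refunded!", "tool_calls": []}),
    ("What are your hours?", {"text": "9 to 5.", "tool_calls": []}),
]
violations = run_contracts(CONTRACTS, cases)
blocked = any(severity == "BLOCK" for _, severity, _ in violations)
print(violations, blocked)
```

Because each check is a plain predicate, a failure names the contract and the input that tripped it, which is what makes the result explainable in a merge gate.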

B · Frontier Battery

Curated set at the edge of intended capability. Suggested composition:

Category | Share | Example
Long context | 15% | Inputs at 60–90% of context budget
Conflicting instructions | 15% | User asks X, system prompt forbids X
Adversarial / injection | 10% | Embedded "ignore previous instructions"
Refusal boundary | 10% | Inputs at exact edge of policy
Tool ambiguity | 15% | Two tools could plausibly handle
Multi-turn drift | 15% | 3–6 turn conversations
Format edges | 10% | Nested JSON, multilingual, code blocks
Production-derived | 10% | Real cases from monitoring

Size: 50–200 items typical. Refresh the production-derived slice quarterly per the data-flywheel pattern [14].
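Turning the suggested shares into item counts for a given battery size is simple arithmetic, but rounding needs care so the counts still sum to the target. The sketch below uses largest-remainder rounding, which is an assumption of this example rather than something the framework prescribes; the category keys are hypothetical identifiers for the rows above.

```python
SHARES = {  # percentage shares from the suggested composition
    "long_context": 15, "conflicting_instructions": 15,
    "adversarial_injection": 10, "refusal_boundary": 10,
    "tool_ambiguity": 15, "multi_turn_drift": 15,
    "format_edges": 10, "production_derived": 10,
}

def allocate(total_items, shares=SHARES):
    """Split total_items across categories by share, using largest-remainder
    rounding so the counts always sum to total_items."""
    exact = {name: total_items * pct / 100 for name, pct in shares.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

small = allocate(50)
large = allocate(200)
print(small)
print(large)
```

At 50 items the 15% categories land on 7 or 8 items each, which is a reminder that a small battery gives very coarse per-category pass rates.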

C · Bias-Controlled Paired Judge

Naive LLM-as-judge fails. Position swap alone shifts pairwise outcomes by >10% [7]. The bias-controlled protocol:

# For each input in eval set:
# 1. Generate output from original artifact (A) and simplified (B)
# 2. Judge twice with positions swapped: (A,B) and (B,A)
# 3. If judgments agree   → record winner
# 4. If they disagree     → record TIE (position bias detected)
# 5. Normalize for length: if winner is >1.5× longer, downweight
# 6. Aggregate: simplified passes iff win_rate(B) ≥ 0.50

Use a judge from a different model family than the artifact's own model to reduce self-enhancement bias [8]. Calibrate quarterly against a small human-labeled set.
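The protocol steps can be sketched in a few lines. Several details here are assumptions, not part of the protocol as stated: the judge is any callable returning "first" or "second", the length downweight is 0.5, and the win rate is computed over decided (non-tie) pairs with the tie rate reported separately. The threshold is left as a parameter because the protocol above states 0.50 while the CI defaults later in this paper use 45%. `toy_judge` is a deliberately position-biased stand-in, not a real model.

```python
def paired_verdict(judge, prompt, out_a, out_b, length_ratio=1.5, downweight=0.5):
    """Judge (A,B) and (B,A). Returns (winner, weight); winner is 'A', 'B', or 'TIE'."""
    forward = judge(prompt, out_a, out_b)   # A shown first
    swapped = judge(prompt, out_b, out_a)   # B shown first
    a_wins = forward == "first" and swapped == "second"
    b_wins = forward == "second" and swapped == "first"
    if not (a_wins or b_wins):
        return "TIE", 1.0                   # orders disagree: position bias detected
    winner, loser = (out_a, out_b) if a_wins else (out_b, out_a)
    weight = downweight if len(winner) > length_ratio * len(loser) else 1.0
    return ("A" if a_wins else "B"), weight

def summarize(verdicts, threshold=0.50):
    """Weighted win rate of B (simplified) over decided pairs, plus tie rate."""
    decided = [(v, w) for v, w in verdicts if v != "TIE"]
    total = sum(w for _, w in decided)
    win_rate_b = sum(w for v, w in decided if v == "B") / total if total else 0.0
    return {"win_rate_b": win_rate_b,
            "tie_rate": 1 - len(decided) / len(verdicts),
            "passes": win_rate_b >= threshold}

def toy_judge(prompt, first, second):
    """Hypothetical judge: prefers a cited answer, else defaults to the first slot."""
    if "source:" in second and "source:" not in first:
        return "second"
    return "first"

pairs = [  # (output from original A, output from simplified B), invented
    ("It is Paris, I think.", "Paris (source: atlas)"),
    ("Maybe Lyon.", "Maybe Nice."),
    ("Rome (source: atlas)", "Rome, probably."),
]
verdicts = [paired_verdict(toy_judge, "capital?", a, b) for a, b in pairs]
summary = summarize(verdicts)
print(verdicts, summary)
```

The second pair shows the point of the swap: the toy judge picks whichever answer is shown first, so the two orders disagree and the item is recorded as a tie instead of a win.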

D · Triggered Bisection

Runs only when A, B, or C trips:

# Inputs: original, simplified, failing_eval_subset
# 1. Diff the two artifacts → list of removed/changed components
# 2. For each component, build a "partial" artifact:
#    simplified + just that component restored
# 3. Re-run failing items against each partial
# 4. The component whose restoration recovers failures = cause
# 5. If no single component recovers → interaction effect; pair-wise

O(n) for single causes, O(n²) for pair-interactions. Cap at pairs; report "interaction" rather than enumerate triples.
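A minimal sketch of the restore-and-retest loop follows, capped at pairs as described. It assumes artifacts can be modeled as sets of component names and that the caller supplies a `passes(artifact, item)` check; both are simplifications for illustration, and the component names are hypothetical.

```python
from itertools import combinations

def bisect(simplified, removed, failing_items, passes):
    """Restore removed components one at a time, then in pairs, and report
    which restorations make every failing item pass again."""
    def recovers(restored):
        candidate = simplified | set(restored)
        return all(passes(candidate, item) for item in failing_items)

    singles = [c for c in removed if recovers([c])]          # O(n) partial artifacts
    if singles:
        return {"kind": "single", "cause": singles}
    pairs = [p for p in combinations(removed, 2) if recovers(p)]  # O(n^2)
    if pairs:
        return {"kind": "interaction", "cause": pairs}
    return {"kind": "unresolved", "cause": []}               # do not enumerate triples

# Toy setup (assumption): each failing item names the components it needs.
def passes(artifact, item):
    return item <= artifact

simplified = {"output_schema"}
removed = ["persona", "examples", "refusal_handling"]

single = bisect(simplified, removed, [{"persona"}], passes)
interaction = bisect(simplified, removed, [{"persona", "examples"}], passes)
print(single)
print(interaction)
```

The pair search only runs when no single restoration recovers the failures, which keeps the common case at one eval pass per removed component.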

Section 08

Live Simulator

Six capability-carrying components below. Toggle them off to simulate simplification. Three scores reflect what each detection layer would see — and where they diverge.

[Interactive ablation simulator. Toggle components off and watch aggregate stay high while contracts can break cleanly. Readouts: Aggregate Pass (out of 100), Frontier Battery (out of 100), Contracts Held (out of 6). Baseline state: all components intact, 100/100, 100/100, 6/6; all detection layers green.]

Two patterns worth observing: aggregate pass rate degrades gradually while contracts can break cleanly on a single removal; and different components produce qualitatively different verdicts even at similar token cost.

Section 10

Case Study

A restaurant marketing prompt

Consider a restaurant marketing system prompt evolved over months for an operator named Alex. The original is 90 lines covering: persona, audience targeting, campaign taxonomy, brand-voice constraints, three campaign examples, refusal handling, output schema.

A new engineer proposes simplification: collapse three examples into one, drop the persona block ("the LLM is good enough now"), shorten refusal handling. The simplified version is 38 lines — a 58% reduction.

Step 1 · Contracts

ID | Contract | Result
C1 | Output is valid JSON matching campaign schema | PASS
C2 | Discount > 30% includes margin warning | PASS
C3 | Voice avoids five banned phrases (synergy, leverage…) | FAIL 8/50
C4 | Refusals propose on-brand alternative | FAIL 6/50

Build is blocked at Layer A. Both failed contracts are BLOCK severity. No need to proceed.

Step 2 · Bisection

Triggered. The bisector restores each removed component and re-tests:

Restored component | C3 recovery | C4 recovery
Persona block | 100% | 0%
Three separate examples | 25% | 33%
Original refusal handling | 0% | 100%

Output: C3 was carried by the persona block; C4 by the refusal handling. Both must be restored or rewritten before the simplification ships.

Step 3 · A constructive next move

Knowing which components carry which behaviors changes the conversation from "we can't simplify" to "we can simplify these specific parts but not these." A second iteration might compress the three examples (which the bisector showed are weakly load-bearing) while preserving persona and refusal blocks. The framework gives an evidence-based path to a smaller artifact — not a flat veto.

The framework's deepest value is not catching bad simplifications. It is making good simplifications possible.

Section 11

CI/CD Integration

The framework runs as a check on artifact changes. Suggested wiring:

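# .github/workflows/artifact-eval.yml
name: artifact-evaluation
on:
  pull_request:
    paths: ['prompts/**', 'skills/**', 'agents/**', 'plugins/**']

jobs:
  # Layer A: binary behavioral contracts; a failure stops the layers below.
  contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.contracts --strict

  # Layer B: frontier battery, scored against the pass-rate threshold.
  frontier:
    needs: contracts
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.frontier --threshold 0.85

  # Layer C: paired judge with position swap and length normalization.
  paired:
    needs: contracts
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.paired --sample 50 --swap --normalize-length

  # Layer D: bisection, run only when an upstream job has failed.
  bisect:
    needs: [contracts, frontier, paired]
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m eval.bisect --report-pr-comment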

Suggested thresholds

Signal | Default | Action
Block-severity contracts | 100% must hold | BLOCK
Warn-severity contracts | > 95% must hold | WARN
Frontier category pass rate | ≥ 85% per category | BLOCK
Paired win rate (simplified) | ≥ 45% (not significantly worse) | BLOCK
Paired tie rate (bias signal) | < 20% | WARN

Defaults — tune per artifact; safety-critical agents warrant stricter thresholds.
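The default thresholds can be expressed as one gate function that turns eval metrics into a verdict. This is a sketch: the `gate` function and the metric key names are hypothetical, and the numbers are the defaults from the table.

```python
def gate(metrics):
    """Map eval metrics to (verdict, reasons) using the default thresholds."""
    blocks, warns = [], []
    if metrics["block_contract_rate"] < 1.0:
        blocks.append("block-severity contract violated")
    if metrics["warn_contract_rate"] <= 0.95:
        warns.append("warn-severity contracts at or below 95%")
    for category, rate in metrics["frontier_by_category"].items():
        if rate < 0.85:
            blocks.append(f"frontier category below 85%: {category}")
    if metrics["paired_win_rate"] < 0.45:
        blocks.append("paired win rate below 45%")
    if metrics["paired_tie_rate"] >= 0.20:
        warns.append("paired tie rate at or above 20% (position-bias signal)")
    verdict = "BLOCK" if blocks else "WARN" if warns else "PASS"
    return verdict, blocks + warns

# Hypothetical metrics for two artifact changes.
healthy = {
    "block_contract_rate": 1.0, "warn_contract_rate": 0.98,
    "frontier_by_category": {"long_context": 0.90, "tool_ambiguity": 0.88},
    "paired_win_rate": 0.50, "paired_tie_rate": 0.10,
}
regressed = dict(healthy,
                 frontier_by_category={"long_context": 0.80, "tool_ambiguity": 0.88},
                 paired_tie_rate=0.25)

print(gate(healthy))
print(gate(regressed))
```

Checking each frontier category separately, rather than the battery average, keeps the gate from repeating the aggregate-pass-rate mistake at a smaller scale.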

The data flywheel

Production failures should flow back into the eval suite. The pattern documented by LangChain, Braintrust, and others [14, 15]: monitoring identifies a production trace where the artifact misbehaved, a domain expert annotates the correct trajectory, and the corrected example joins the frontier battery. The eval grows organically to match the failure modes the artifact actually encounters.

Section 12

Counter-Arguments Addressed

The framework is more expensive than "just look at pass rate." Four objections deserve direct response.

01

"This is overkill for small prompts."

It is, for small prompts. The framework scales: a 200-token prompt with no tools needs maybe three contracts, a frontier of 20 items, and the paired-judge layer; bisection is trivial. The expensive components earn their cost on artifacts whose simplification has real consequences. Decision rule: if the artifact is in production and a regression would matter, run the full framework. If it is a personal one-off, run contracts only.
02

"LLM-as-judge is too unreliable to anchor a CI gate."

It is, used naively. The literature is explicit about position bias, verbosity bias, self-enhancement bias, and stochastic variability [7, 8, 9]. The bias-controlled paired protocol (swap, length-normalize, cross-family) addresses the first three; stochastic variability is handled by repeated runs, as noted under Limitations. Position-bias-induced ties are reported separately as a diagnostic. The judge is a signal, not the only signal; it sits alongside categorical contracts and pre-curated evals.
03

"Frontier curation is too expensive to maintain."

For the up-front 50-item set, this is real — typically a person-week of domain-expert time. Maintenance is much lower: refresh the production-derived slice (10–20%) quarterly. Most categories — long context, conflicting instructions, format edges — are reusable across artifacts in the same domain. Cost amortization: a frontier set built for one customer-support agent transfers ~70% to the next.
04

"This still misses capabilities we didn't think to test for."

True. No black-box test catches what wasn't named. Three mitigations:
  • Periodic frontier refresh from production failures (the data flywheel).
  • Paired judge captures qualitative differences a careful reader would notice — including ones not on the contract list.
  • Adversarial review: have a domain expert try to break the simplified artifact before merge.

None of these reduce failure rate to zero. They reduce it to where surprises become recoverable rather than catastrophic.

Section 13

Limitations & Open Questions

The framework is incomplete in known ways. Naming them honestly is part of operating it well.

  • 01 · Stochasticity baseline. Any single eval run has noise. Threshold tuning must account for variance. Typical mitigation: run each item 3–5 times, use median or pass-at-k.
  • 02 · Judge drift. The judge model may update between runs, changing scores without artifact change. Pin the judge version. Re-baseline on upgrades.
  • 03 · Interaction effects. Components A and B may be load-bearing only together. Single-component bisection misses this; pair-wise is O(n²). Practical limit: stop at pairs and report "interaction" without enumeration.
  • 04 · Generated frontier items. Using an LLM to generate frontier items risks them inheriting the LLM's blind spots. Human-curated items are higher signal but more expensive.
  • 05 · Adversarial robustness ≠ over-simplification. This framework detects accidental capability loss; it does not specifically test for adversarial robustness, which deserves its own pipeline.
  • 06 · Cross-artifact regressions. When a skill is simplified, the agent that calls it may degrade in ways the skill's eval doesn't see. End-to-end agent evals are needed in addition.
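The stochasticity mitigation in item 01 can be sketched briefly. The scores below are invented, `run_repeated` is a hypothetical helper, and "at least one of k runs passes" is only one lenient reading of pass-at-k; a stricter estimator may suit some artifacts better.

```python
from statistics import median

def run_repeated(score_once, item, runs=5):
    """Score the same eval item several times to expose run-to-run noise."""
    return [score_once(item) for _ in range(runs)]

# Stand-in for a noisy model-backed scorer (assumption): replays fixed scores.
fake_scores = iter([0.9, 0.4, 0.8, 0.85, 0.3])
scores = run_repeated(lambda item: next(fake_scores), "hard-case-17")

stable_score = median(scores)                 # robust to a few outlier runs
passed_any = any(s >= 0.85 for s in scores)   # lenient "any of k runs" reading
print(scores, stable_score, passed_any)
```

Here the median (0.8) and the any-of-k reading disagree about an 0.85 bar, which is why the aggregation rule should be fixed before thresholds are tuned.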

Open question

Can we automate frontier-set generation from production failure traces with sufficient quality to be load-bearing? The data-flywheel literature is encouraging but does not yet show fully-automated curation matches human curation. This is where the next round of research will likely pay off.

References
  1. Jo, H. (2026). Prompt Complexity Dilutes Structured Reasoning: A Follow-Up Study on the Car Wash Problem. arXiv:2603.13351.
  2. Jaroslawicz, D., et al. (2025). How Many Instructions Can LLMs Follow at Once? arXiv:2507.11538. IFScale benchmark; threshold/linear/exponential decay regimes.
  3. Sclar, M., et al. (2024). Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design. See also Mizrahi et al. (2024): 6.5M instances showing benchmark rankings shift with prompt wording.
  4. Aleithan, R., et al. (2024). SWE-Bench+: Enhanced Coding Benchmark for LLMs. Solution leakage 32.67%; pass rate revised 12.47% → 3.97%.
  5. Ribeiro, M. T., et al. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. ACL 2020. MFT/INV/DIR taxonomy.
  6. Levy, M., et al. (2024). Same Task, More Tokens: The Impact of Input Length on the Reasoning Performance of LLMs. FLenQA benchmark; reasoning degradation ~3,000 tokens.
  7. Shi, L., et al. (2025). Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. IJCNLP-AACL 2025.
  8. Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Position, verbosity, self-enhancement biases.
  9. Huang, Y., et al. (2024–2025). Judging with Many Minds: Do More Perspectives Mean Less Prejudice? arXiv:2505.19477.
  10. Anthropic (2025–2026). Agent Skills Documentation; Effective Context Engineering for AI Agents. agentskills.io · platform.claude.com.
  11. Yehudai, A., et al. (2025). Evaluation and Benchmarking of LLM Agents: A Survey. arXiv:2507.21504.
  12. Garg, S., et al. (2025). Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation. arXiv:2510.08996.
  13. Newell, A. (1975). A tutorial on speech understanding systems. Origin of ablation methodology in CS.
  14. LangChain (2026). LLM Evaluation Framework: Trajectories vs. Outputs. Production monitoring → annotation queue → regression dataset flywheel.
  15. Braintrust (2026). What is LLM Evaluation? A Practical Guide. Component-level vs end-to-end distinction.
  16. Mu, J., et al. (2024–2025). ManyIFEval / StyleMBPP. arXiv:2509.21051.
  17. Jiang et al. (2024). Boosting Instruction Following at Scale. AAAI 2025.