Skip to main content
kellerai.blog

Detecting Over-Simplification

When you compress an artifact, aggregate pass rates lie.

KellerAI White Paper · Prompt & Artifact Engineering · May 2026

Context

Almost every team building with AI simplifies prompts, skills, agents, and plugins for clarity, cost, and maintainability. An engineer looks at a 90-line prompt, collapses three examples into one, drops a persona block, trims the refusal section, and ships it—because the evaluation suite still passes. This is not carelessness; it is good engineering. The problem is that the signal everyone trusts—the green eval—is not measuring the thing they think it is.

The Finding

AI artifacts degrade smoothly across most inputs while collapsing on the narrow band of cases that justified the original complexity. The framework builds a layered approach: behavioral contracts define non-negotiable behaviors as binary, machine-checkable rules; a frontier battery is an evaluation set deliberately curated at the edge of intended capability; a bias-controlled judge runs twice with positions swapped to catch position bias; and bisection, run only when cheaper layers trip, restores each removed component one at a time to localize the cause.

Tags:
Artifact SimplificationEvaluation MethodologyCapability Loss Detection
Paper Details
CategoryPrompt & Artifact Engineering
AudienceEngineering teams building production AI systems with prompts, skills, agents, or plugins
MethodStructured-reasoning framework research + prompt simplification case studies + evaluation methodology analysis
Length~1,000 · 4 min
Sections6
DateMay 2026
AuthorsKellerAI
Read the full paper
Related
Placeholder — pending analytics
Section 01

The edit that looks free

Almost every team building with AI does the same thing. A prompt grows over months until it is ninety lines of persona, constraints, examples, and edge-case handling. An engineer looks at it and simplifies — collapses three examples into one, drops a persona block because “the model is good enough now,” trims the refusal section. The artifact is now dramatically shorter and easier to maintain. The evaluation suite still passes. The change ships.

This is not carelessness. Simplifying for clarity, cost, and maintainability is good engineering. The problem is that the signal everyone trusts — the green eval — is not measuring the thing they think it is. When a normal software function regresses, a test goes red, a build breaks, a type-checker complains. AI artifacts have no such property. They degrade smoothly across most inputs while collapsing on the narrow band of cases that justified the original complexity.

Section 02

Why aggregate pass rates lie

An eval reports one number: how often the artifact succeeds across the test set. That number averages over inputs, and the averaging is exactly where the regression hides. Capability is not uniform — a simplified prompt can handle the common case perfectly and the rare case catastrophically, and the average looks fine because the rare cases are rare. Worse, most test sets are built from the original artifact's strengths, so they measure average competence rather than preserved capability frontier. The eval inherits the artifact's blind spots and cannot see what the simplification removed.

The published evidence makes the size of this gap concrete. A structured-reasoning framework that scored 100% in isolation collapsed to 0–30% once embedded in a sixty-line production prompt — the capability still existed, but surrounding instructions stopped it from running. Subtle formatting changes alone can swing accuracy by up to 76 points. Frontier models cap near 68% accuracy under high instruction density. The lesson is the same each time: small edits move capability far more than intuition expects, and the headline metric does not show it.

Section 03

Four surfaces, one mechanism

Over-simplification is not a prompt-only problem. It shows up across four artifact types. Prompts lose edge cases and reasoning order when instructions and examples are pruned. Skills miss their triggers when descriptions are generalized. Agents pick the wrong tool when their roster or harness is trimmed. Plugins produce malformed calls when tool descriptions and schemas are shortened. The surfaces differ, but the underlying mechanism is identical: specific tokens carry specific behaviors, and removing those tokens removes the behaviors — but only on the inputs that triggered them. The detection challenge is figuring out which inputs those are.

Section 04

Test the frontier, not the average

If the average hides the regression, the fix is to stop testing the average. The technical paper builds a layered approach, and three ideas carry most of the weight. First, behavioral contracts: non-negotiable behaviors written as binary, machine-checkable rules — must cite sources, must ask a clarifying question when the input is ambiguous, must call a lookup tool before answering. A contract either holds or it does not; there is no average to hide behind. Second, a frontier battery: an evaluation set deliberately curated at the edge of intended capability — long context, conflicting instructions, tool ambiguity, multi-turn conversations — because the easy and middle cases keep passing long after the artifact has broken on its hard ones. Third, a bias-controlled judge: when an AI grades the comparison, it is run twice with the two outputs swapped and normalized for length, so position and verbosity biases are caught instead of trusted.

Section 05

Point at the component that broke

Knowing that something regressed is useful. Knowing which part of the artifact carried the lost behavior is what makes the finding actionable. The fourth layer, bisection, does exactly that — and because it is the expensive step, it runs only when one of the cheaper layers has already tripped. It restores each removed component one at a time and re-tests the failing cases; the component whose restoration recovers the failures is the cause. In the paper's restaurant-prompt case study, bisection showed one failing contract was carried by the deleted persona block and another by the trimmed refusal handling — while the collapsed examples turned out to be only weakly load-bearing.

Section 06

The point

That last result is the one worth keeping. The framework's deepest value is not catching bad simplifications — it is making good ones possible. When you can see which components carry which behaviors, the conversation stops being “we can't simplify this” and becomes “we can simplify these specific parts but not those.” You get an evidence-based path to a smaller, cleaner artifact instead of a flat veto. None of this is free, and none of it catches a capability nobody thought to name. But it moves the failure mode from a silent cliff users discover in production to a visible signal the team sees before merge — which is the whole point of having a test at all.

This brief is the short version. The full argument — the empirical foundations, the failure-mode catalog, the four-layer framework specification, a reference implementation, the CI/CD wiring, and the counter-arguments — lives in the companion technical whitepaper, Detecting Over-Simplification: In Depth .