The edit that looks free
Almost every team building with AI does the same thing. A prompt grows over months until it is ninety lines of persona, constraints, examples, and edge-case handling. An engineer looks at it and simplifies — collapses three examples into one, drops a persona block because “the model is good enough now,” trims the refusal section. The artifact is now dramatically shorter and easier to maintain. The evaluation suite still passes. The change ships.
This is not carelessness. Simplifying for clarity, cost, and maintainability is good engineering. The problem is that the signal everyone trusts, the green eval, is not measuring what they think it measures. When a normal software function regresses, a test goes red, a build breaks, or a type-checker complains. AI artifacts offer no such hard failure signal. They degrade smoothly across most inputs while collapsing on the narrow band of cases that justified the original complexity.
Why aggregate pass rates lie
An eval reports one number: how often the artifact succeeds across the test set. That number averages over inputs, and the averaging is exactly where the regression hides. Capability is not uniform: a simplified prompt can handle the common case perfectly and the rare case catastrophically, and the average looks fine because the rare cases are rare. Worse, most test sets are built around the original artifact's strengths, so they measure average competence rather than whether the capability frontier was preserved. The eval inherits the artifact's blind spots and cannot see what the simplification removed.
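The averaging effect is easy to see with a small worked example. The sketch below uses invented numbers (95 common-case inputs, 5 edge-case inputs, and a 0.95 merge gate), none of which come from the paper; the slice names and the `pass_rates` helper are likewise hypothetical.

```python
from collections import defaultdict

def pass_rates(results):
    """results: list of (slice_name, passed) pairs.
    Returns (overall_rate, {slice_name: rate})."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for slice_name, passed in results:
        totals[slice_name][0] += int(passed)
        totals[slice_name][1] += 1
    all_passed = sum(p for p, _ in totals.values())
    all_total = sum(t for _, t in totals.values())
    per_slice = {name: p / t for name, (p, t) in totals.items()}
    return all_passed / all_total, per_slice

# Illustrative data only: the simplified artifact still passes every
# common case but now fails 4 of the 5 edge cases.
before = [("common", True)] * 95 + [("edge", True)] * 5
after = [("common", True)] * 95 + [("edge", True)] + [("edge", False)] * 4

overall_before, slices_before = pass_rates(before)
overall_after, slices_after = pass_rates(after)

GATE = 0.95  # assumed merge threshold
print(f"overall: {overall_before:.2f} -> {overall_after:.2f}")
print(f"edge slice: {slices_before['edge']:.2f} -> {slices_after['edge']:.2f}")
print("gate passes:", overall_after >= GATE)
```

With these numbers the overall rate moves from 1.00 to 0.96 and still clears the gate, while the edge slice falls from 1.00 to 0.20. Reporting per-slice rates alongside the aggregate is one simple way to make that drop visible.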
The published evidence makes the size of this gap concrete. A structured-reasoning framework that scored 100% in isolation collapsed to 0–30% once embedded in a sixty-line production prompt — the capability still existed, but surrounding instructions stopped it from running. Subtle formatting changes alone can swing accuracy by up to 76 points. Frontier models cap near 68% accuracy under high instruction density. The lesson is the same each time: small edits move capability far more than intuition expects, and the headline metric does not show it.
Four surfaces, one mechanism
Over-simplification is not a prompt-only problem. It shows up across four artifact types. Prompts lose edge cases and reasoning order when instructions and examples are pruned. Skills miss their triggers when descriptions are generalized. Agents pick the wrong tool when their roster or harness is trimmed. Plugins produce malformed calls when tool descriptions and schemas are shortened. The surfaces differ, but the underlying mechanism is identical: specific tokens carry specific behaviors, and removing those tokens removes the behaviors — but only on the inputs that triggered them. The detection challenge is figuring out which inputs those are.
Test the frontier, not the average
If the average hides the regression, the fix is to stop testing the average. The technical paper builds a layered approach, and three ideas carry most of the weight. First, behavioral contracts: non-negotiable behaviors written as binary, machine-checkable rules — must cite sources, must ask a clarifying question when the input is ambiguous, must call a lookup tool before answering. A contract either holds or it does not; there is no average to hide behind. Second, a frontier battery: an evaluation set deliberately curated at the edge of intended capability — long context, conflicting instructions, tool ambiguity, multi-turn conversations — because the easy and middle cases keep passing long after the artifact has broken on its hard ones. Third, a bias-controlled judge: when an AI grades the comparison, it is run twice with the two outputs swapped and normalized for length, so position and verbosity biases are caught instead of trusted.
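Two of these ideas can be sketched in a few lines. The code below is a minimal illustration, not the paper's reference implementation: the contract names, the regex used to detect a citation, and the `judge` callable (assumed to return "first" or "second") are all hypothetical, and length normalization is left out for brevity.

```python
import re
from typing import Callable, Dict, List

# Hypothetical contracts: each is a binary check over one output string.
CONTRACTS: Dict[str, Callable[[str], bool]] = {
    # Assumes citations look like [1], [2], ...
    "cites_source": lambda out: re.search(r"\[\d+\]", out) is not None,
    # Crude proxy: the reply ends by asking something.
    "asks_clarifying_question": lambda out: out.rstrip().endswith("?"),
}

def check_contracts(output: str, required: List[str]) -> Dict[str, bool]:
    """Return pass/fail per required contract; nothing is averaged."""
    return {name: CONTRACTS[name](output) for name in required}

def swapped_verdict(judge: Callable[[str, str], str], a: str, b: str) -> str:
    """Run the judge twice with positions swapped.
    A winner counts only if both orderings agree."""
    first_run = judge(a, b)   # a shown first
    second_run = judge(b, a)  # b shown first
    if first_run == "first" and second_run == "second":
        return "a"
    if first_run == "second" and second_run == "first":
        return "b"
    return "inconsistent"  # the verdict followed position, not content

# Toy judges standing in for a model call.
def position_biased(x: str, y: str) -> str:
    return "first"

def prefers_cited(x: str, y: str) -> str:
    return "first" if "[1]" in x else "second"

report = check_contracts("Paris is the capital [1].", ["cites_source"])
biased = swapped_verdict(position_biased, "out A", "out B")
fair = swapped_verdict(prefers_cited, "Answer [1]", "Answer")
```

Here `report` marks `cites_source` as passing, the always-first judge is flagged as `"inconsistent"`, and the content-sensitive judge consistently picks `"a"`. A real gate would fail the change on any contract returning False rather than folding it into a score.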
Point at the component that broke
Knowing that something regressed is useful. Knowing which part of the artifact carried the lost behavior is what makes the finding actionable. The fourth layer, bisection, does exactly that — and because it is the expensive step, it runs only when one of the cheaper layers has already tripped. It restores each removed component one at a time and re-tests the failing cases; the component whose restoration recovers the failures is the cause. In the paper's restaurant-prompt case study, bisection showed one failing contract was carried by the deleted persona block and another by the trimmed refusal handling — while the collapsed examples turned out to be only weakly load-bearing.
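The restore-one-at-a-time loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `run_case` is a hypothetical callable that returns True when a case passes against a given prompt, components are restored by appending them to the simplified prompt (a real harness would reinsert them at their original position), and the toy `run_case` below replaces an actual model call with a keyword check.

```python
from typing import Callable, Dict, List

def bisect_components(
    simplified: str,
    removed: Dict[str, str],
    failing_cases: List[str],
    run_case: Callable[[str, str], bool],
) -> Dict[str, List[str]]:
    """Restore each removed component on its own and record which
    previously failing cases pass again."""
    recovered: Dict[str, List[str]] = {}
    for name, text in removed.items():
        candidate = simplified + "\n" + text  # assumption: append to restore
        recovered[name] = [c for c in failing_cases if run_case(candidate, c)]
    return recovered

# Toy stand-in for a model call: a case passes only when the prompt
# contains the phrase that (hypothetically) carries that behavior.
NEEDS = {"tone_case": "warm host", "off_topic_case": "decline"}

def run_case(prompt: str, case: str) -> bool:
    return NEEDS[case] in prompt

removed = {
    "persona": "You are a warm host.",
    "refusal": "Politely decline off-topic requests.",
    "examples": "Example: a table for two at 7pm.",
}
result = bisect_components(
    "Answer restaurant questions.", removed,
    ["tone_case", "off_topic_case"], run_case,
)
print(result)
```

In this toy setup the persona block recovers `tone_case`, the refusal block recovers `off_topic_case`, and the examples recover nothing, which is the shape of evidence that separates load-bearing components from safely removable ones. Because each candidate costs a full re-run of the failing cases, the loop is best reserved for changes that have already tripped a cheaper check.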
The point
That last result is the one worth keeping. The framework's deepest value is not catching bad simplifications — it is making good ones possible. When you can see which components carry which behaviors, the conversation stops being “we can't simplify this” and becomes “we can simplify these specific parts but not those.” You get an evidence-based path to a smaller, cleaner artifact instead of a flat veto. None of this is free, and none of it catches a capability nobody thought to name. But it moves the failure mode from a silent cliff users discover in production to a visible signal the team sees before merge — which is the whole point of having a test at all.
This brief is the short version. The full argument (the empirical foundations, the failure-mode catalog, the four-layer framework specification, a reference implementation, the CI/CD wiring, and the counter-arguments) lives in the companion technical whitepaper, Detecting Over-Simplification: In Depth.