kellerai.blog

Why Your AI Skills Need Evidence

An AI skill that demos well can still be quietly useless — because a stochastic model run once proves nothing.

KellerAI White Paper · Engineering Discipline & Verification · May 2026

Context

You write a skill—a reusable instruction module that shapes how an AI agent behaves—test it once, and ship it. That single demonstration proved it can work in one case. It did not prove that it works on thousands of inputs in a stochastic system, where giving the same input twice can produce two different answers. This gap between 'I tested it' and 'it is reliable' is the quiet problem underneath a lot of AI tooling right now, and it is worth being honest about.

The Finding

A measured approach to skill authoring is not complicated: define what working means before you write the skill, judge it on many runs with a reported pass rate, compare competing versions blind, keep the best version as a baseline, and always check against no skill at all. This is the thinking behind kellerai-skill-creator—a tool built to make the measured path the easy path, so that defining expectations, testing across many runs, and comparing versions blind is simply what happens by default.

Tags: AI Skill Authoring, Measurement & Evidence, Quality Gates
Paper Details
Category: Engineering Discipline & Verification
Audience: Engineering teams, platform architects, and skill authors building AI agent systems at scale
Method: Synthesis of experimental design, software-engineering testing practices, evaluation science, and cognitive load theory (grounded in flaky-test literature, peer-review randomized trials, and LLM evaluation variance research)
Length: ~1,050 words · 4 min read
Sections: 6
Date: May 2026
Authors: KellerAI
Section 01

You can't see a skill working

You write a skill — a reusable instruction module that tells an AI agent how to handle a recurring task. You give it a test prompt. The agent does exactly what you wanted. You ship it.

That skill will now run thousands of times, on inputs you never saw, inside a model that is stochastic: give it the same input twice and you can get two different answers. The demonstration you ran proved that the skill can work once. It did not prove that it works.

This is the quiet problem underneath a lot of AI tooling right now, and it is worth being honest about.

A skill is text. You can read it, and it will look reasonable. That is the trap.

A skill's quality is not a property of its words. It is a property of how a model behaves when it reads them, across every input the skill will meet in use. You cannot check that by reading. You can only check it by running the skill — many times — and measuring what comes out.

So a confident “this looks good” is not evidence. It is a guess wearing the costume of a decision.

Section 02

Three traps that catch almost everyone

The single demo. One good run, on a system that is random by design, is an anecdote — not evidence. Software engineers learned this the hard way with “flaky” tests, which pass and fail with no change to the code. Researchers ended up running tests hundreds of times just to learn what a single result was worth. The answer turned out to be: not much. A skill that passed once may have simply gotten a lucky draw.
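To put a number on how little one passing run says, here is a minimal sketch. It assumes each run is an independent pass/fail trial and uses the Wilson score interval, a standard choice for small samples. The function name and the 95% level (z = 1.96) are illustrative assumptions, not part of any KellerAI tool.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a true pass rate.

    Assumes independent pass/fail runs; z = 1.96 is roughly a 95% level.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passes <= runs:
        raise ValueError("passes must be between 0 and runs")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    for passes, runs in [(1, 1), (9, 10), (90, 100)]:
        low, high = wilson_interval(passes, runs)
        print(f"{passes}/{runs} passed: plausible pass rate {low:.2f} to {high:.2f}")
```

Under these assumptions, one pass out of one run is consistent with a true pass rate anywhere from roughly 21% to 100%, while 90 passes out of 100 narrows that to roughly 83% to 94%.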

You can't grade your own work. You wrote the skill. You know which version is the new, improved one — so when you read its output, you are already expecting it to be better, and that expectation bends your judgment. This is not a personal failing. It is the reason medicine runs double-blind trials and serious journals blind their peer review: the person who made the thing cannot be its neutral judge. Skill versions have to be compared blind — by an evaluator who does not know which output came from which version.
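A blind comparison can be sketched in a few lines. This is an illustration under stated assumptions, not the method of any particular tool: `blind_compare` and the `judge` callable are hypothetical names, and the judge may be a person or another model, as long as it only ever sees two unlabeled texts in random order.

```python
import random
from typing import Callable


def blind_compare(
    output_old: str,
    output_new: str,
    judge: Callable[[str, str], int],
    rng: random.Random,
) -> str:
    """Return 'old' or 'new' for the output the judge preferred.

    The judge receives the two texts in random order with no labels and
    returns 0 for the first text or 1 for the second.
    """
    pair = [("old", output_old), ("new", output_new)]
    rng.shuffle(pair)  # hide which version produced which text
    choice = judge(pair[0][1], pair[1][1])
    if choice not in (0, 1):
        raise ValueError("judge must return 0 or 1")
    return pair[choice][0]  # unblind only after the verdict is in


if __name__ == "__main__":
    rng = random.Random(7)

    def always_first(first: str, second: str) -> int:
        # A judge with pure position bias: it always prefers the first text.
        return 0

    wins = [blind_compare("draft A", "draft B", always_first, rng) for _ in range(1000)]
    print("new preferred:", wins.count("new"), "of", len(wins))
```

Shuffling the order also means a judge that merely favors whichever text it sees first ends up splitting its votes between versions, instead of handing a false win to one of them.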

Weak tests feel like safety. Suppose you do write a checklist of things the output should contain. Real progress — until a check passes for the wrong reason. The output names the right file, but the contents are wrong. The check goes green anyway. The skill looks validated. It is not. A weak test is worse than no test, because no test leaves you honestly uncertain, while a weak test sells you confidence you never earned.
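The file example above can be made concrete. In this sketch the agent's result is assumed to be a mapping from file name to file contents, and the file name `summary.json` and its `total` field are hypothetical stand-ins for whatever the skill is actually supposed to produce.

```python
import json


def weak_check(files: dict[str, str]) -> bool:
    """Passes whenever the expected file name appears, whatever it contains."""
    return "summary.json" in files


def strong_check(files: dict[str, str], expected_total: int) -> bool:
    """Passes only if the file exists, parses as JSON, and holds the right value."""
    raw = files.get("summary.json")
    if raw is None:
        return False
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("total") == expected_total


if __name__ == "__main__":
    wrong_contents = {"summary.json": '{"total": 0}'}
    print("weak check:", weak_check(wrong_contents))         # True
    print("strong check:", strong_check(wrong_contents, 3))  # False
```

The weak check goes green on the right file name alone; the strong check fails the same output because it inspects the contents the expectation was really about.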

Section 03

Why this is about to matter much more

For a long time a skill was a personal thing — a prompt you used yourself. That is changing fast. Organizations are building libraries of skills, shared across teams and invoked by fleets of agents that call other agents. A skill is becoming infrastructure.

And an unmeasured piece of infrastructure is unmanaged risk — multiplied by every single time it runs. Skills rarely fail loudly. They produce a plausible answer that is subtly wrong, or one that is right most of the time and wrong in a way nobody is watching for. The cost is real but diffuse, spread thin across thousands of runs, which is exactly the kind of cost that never gets traced back to its source.

Section 04

A transition software already made

Software used to be considered done when it compiled and ran once. Then the industry adopted automated testing, and “it ran on my machine” stopped counting as finished. That shift took years and was resisted the whole way.

Prompt and skill authoring is early in the same transition — from “it worked when I tried it” to “here is the evidence that it works.”

Section 05

What good looks like

A measured approach to skill authoring is not complicated. It comes down to a few habits:

  • Decide what “working” means before you write the skill, as concrete, checkable expectations.

  • Judge the skill on many runs and a reported pass rate, never on a single demo.

  • Compare versions blind, so the verdict is not your hope talking.

  • Keep the best version as a baseline, and make every new version beat it on the evidence.

  • Always check the skill against no skill at all: if it does not beat the model running without it, it is not earning its place in the context.
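The habits in this list can be sketched as a small promotion gate. Everything here is a hypothetical illustration rather than the behavior of kellerai-skill-creator: `pass_rate`, `earns_its_place`, and the simulated agent are assumed names, and the simulated agent simply succeeds with a fixed probability in place of a real model call.

```python
import random
from typing import Callable, Sequence


def pass_rate(
    run: Callable[[str], str],
    check: Callable[[str, str], bool],
    inputs: Sequence[str],
    repeats: int,
) -> float:
    """Run every input `repeats` times and return the fraction of passing runs."""
    total = 0
    passed = 0
    for prompt in inputs:
        for _ in range(repeats):
            total += 1
            if check(prompt, run(prompt)):
                passed += 1
    if total == 0:
        raise ValueError("need at least one input and one repeat")
    return passed / total


def earns_its_place(candidate: float, baseline: float, no_skill: float,
                    margin: float = 0.0) -> bool:
    """A candidate must beat both the stored baseline and the bare model."""
    return candidate > baseline + margin and candidate > no_skill + margin


def simulated_agent(p_success: float, rng: random.Random) -> Callable[[str], str]:
    """Stand-in for a stochastic model call: returns 'ok' with probability p_success."""
    return lambda prompt: "ok" if rng.random() < p_success else "bad"


if __name__ == "__main__":
    rng = random.Random(42)
    inputs = ["case 1", "case 2", "case 3"]
    check = lambda prompt, output: output == "ok"
    rates = {
        name: pass_rate(simulated_agent(p, rng), check, inputs, repeats=100)
        for name, p in [("candidate", 0.9), ("baseline", 0.8), ("no skill", 0.5)]
    }
    print(rates)
    print("promote:", earns_its_place(rates["candidate"], rates["baseline"], rates["no skill"]))
```

The `margin` parameter is a crude guard against promoting on noise; a stricter gate would require the gap between versions to exceed the sampling uncertainty of the measured pass rates.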

Section 06

The point

This is the thinking behind kellerai-skill-creator — a tool built to make the measured path the easy path, so that defining expectations, testing across many runs, and comparing versions blind is simply what happens by default. The goal is not more process. It is the ability to answer one plain question honestly: does this skill actually work?

For the full argument, with the research from evaluation science, experimental design, and software engineering behind it, read the companion technical whitepaper, From Craft to Discipline: The Case for Measured AI Skill Authoring.