kellerai.blog

From Craft to Discipline: Measured AI Skill Authoring

Why skill authoring must become a measured discipline, not a craft.

KellerAI White Paper · In-Depth · Engineering Discipline & Verification · May 2026 · ~31 min read

Context

A skill is a reusable instruction module that shapes how an AI agent behaves: a written specification of how a model should approach a recurring class of task. Skills are fast becoming the unit in which organizations package and share AI capability. Yet most skills are authored the way software was written before automated testing existed—by hand, validated by a single demonstration, and shipped on the strength of one good run. This whitepaper argues that the practice is unsound, and that it fails for reasons already documented at length in adjacent fields.

A large language model is stochastic, so a single passing run is an anecdote rather than evidence. The author of a skill cannot be its impartial judge, because knowing which version you produced bends evaluation. Model instructions are sensitive to changes in wording and formatting that carry no meaning, so hand-tuning does not reliably generalize. And uncontrolled iterative improvement can quietly regress without an external evaluation gate.

The Finding

Drawing on research in LLM and agent evaluation, the reliability of using models as judges, experimental blinding, reproducibility under nondeterminism, test-driven development, prompt sensitivity, and cognitive load theory, we make the case that skill authoring must become a measured discipline: expectations defined before authoring, behavior tested across many runs with reported variance, competing versions compared blind, and improvement governed by a quality gate.

This is not a plea for more process. It is the ability to answer one plain question honestly: does the skill you just shipped—and will ship ten thousand more times—actually work?

Tags:
Skill Authoring Discipline · Experimental Blinding & Evaluation · Quality Gates & Baselines
Cite this paper

KellerAI. (2026, May 21). From Craft to Discipline: The Case for Measured AI Skill Authoring. KellerAI. https://kellerai.blog/kellerai-skill-creator-in-depth

Paper Details
Category: Engineering Discipline & Verification
Audience: Engineering leaders, platform architects, and skill governance teams building measured AI skill authoring infrastructure
Method: Synthesis across four fields: software engineering (flaky tests, test-driven development), medicine & peer review (blinding, randomized trials), machine learning evaluation (variance reporting, reliability assessment), and cognitive load theory; grounded in primary literature with regulatory overlay (NIST AI RMF, EU AI Act, ISO 42001)
Length: ~7,500 words · ~31 min
Reading level: Technical
Sections: 11
References: 32
Version: v1.0 · Updated May 2026
Published: May 2026
Key Takeaways
  • A single demonstration on a stochastic system is an anecdote, not evidence: the flaky-test literature and LLM evaluation variance research both show that reliability requires multiple runs, reported pass rates, and variance quantification.
  • The author of a skill cannot be its impartial judge: experimental psychology, medicine, and peer review all adopted blinding as the countermeasure to expectation-driven bias; skill evaluation must compare versions blind to the evaluator.
  • Skill authoring must shift from craft to discipline: expectations before authoring, measurement across runs, blind comparison, a defended baseline, comparison against no skill at all, evaluation of the evaluation itself, and layered artifact design are the principles that make skill quality measurable and trustworthy.
Section 01

Abstract

A skill is a reusable instruction module that shapes how an AI agent behaves: a written specification of how a model should approach a recurring class of task. Skills are fast becoming the unit in which organizations package and share AI capability. Yet most skills are authored the way software was written before automated testing existed — by hand, validated by a single demonstration, and shipped on the strength of one good run.

This whitepaper argues that the practice is unsound, and that it fails for reasons already documented at length in adjacent fields. A large language model is stochastic, so a single passing run is an anecdote rather than evidence — a fact the software-engineering literature on flaky tests established through experiments that ran each test hundreds of times. The author of a skill cannot be its impartial judge, because knowing which version you produced bends evaluation; this is the reason medicine and scholarly peer review adopted blinding decades ago. Model instructions are sensitive to changes in wording and formatting that carry no meaning, so hand-tuning does not reliably generalize. And uncontrolled iterative “improvement” can quietly regress without an external evaluation gate.

Drawing on research in LLM and agent evaluation, the reliability of using models as judges, experimental blinding, reproducibility under nondeterminism, test-driven development, prompt sensitivity, and cognitive load theory, we make the case that skill authoring must become a measured discipline: expectations defined before authoring, behavior tested across many runs with reported variance, competing versions compared blind, and improvement governed by a quality gate. We close by describing, at the level of principle, what such a discipline requires — and why the cost of not adopting it compounds with every agent an organization deploys.

Section 02

The Skill Is the New Unit of Delegated Capability

For most of the short history of applied large language models, the unit of work was the prompt: a block of text, written for one task, used by the person who wrote it. That is changing. The unit is becoming the skill — a named, reusable instruction module that tells an agent how to approach a whole class of task. A skill for drafting a pull request description. A skill for triaging an incident. A skill for producing a financial summary in a house format. When an agent picks up a matching task, the skill is loaded into its context and shapes what it does.

The shift matters because of who and what now depends on a skill. A skill written once is invoked by many people, by many agents, on many inputs the author never anticipated. It is shared across teams. It is composed with other skills. It is, in every meaningful sense, infrastructure.

That reframes a question that sounds trivial and is not: does this skill actually work? When a skill was a personal prompt, the answer could be “well, it worked for me.” When a skill is infrastructure, that answer is a reliability claim about a component other people and other systems are building on. This whitepaper is about why that claim is far harder to substantiate than it appears, and what substantiating it actually requires.

Section 03

The Problem in One Sentence: You Cannot See a Skill Working

A skill is text. You can read it, and it will look reasonable. That is the trap.

The quality of a skill is not a property of its text. It is a property of the behavior the text produces in a model, across the distribution of inputs the skill will face in use. Two characteristics make that quality invisible to inspection.

The first is that it is behavioral. A skill's value shows up only in what a model does downstream — the actions it takes, the output it produces — and not in any feature of the skill you can examine directly. A skill can be well written, internally consistent, and persuasive on the page while still steering the model badly. It can also be terse and inelegant while steering the model well. The page does not tell you which.

The second is that it is stochastic. The same skill, on the same input, run twice, can produce different behavior. The model samples; the output is a draw from a distribution, not a fixed value.

Together these mean a skill cannot be validated the way a configuration file is validated — by careful reading. It can only be validated by running it and measuring what happens, and because the behavior is a distribution rather than a point, it has to be run many times. Every section that follows is a specific, well-documented way that the intuitive shortcut — run it once, look at the result, ship it — fails.

Section 04

Why a Single Good Demo Proves Nothing

Start with the most common move in skill authoring: write the skill, give it a test prompt, watch the agent do the right thing, and conclude the skill works.

The conclusion does not follow, because the system is stochastic by design. The 2025 survey Evaluation and Benchmarking of LLM Agents states the consequence without hedging: “Because LLM-based agents are inherently stochastic, measuring consistency requires executing the same task multiple times and observing the variation in outcomes” [29]. The same survey draws a distinction that single-run testing erases entirely: task performance is the best case of what an agent can do, while reliability is how it behaves on average and in the worst case. A demonstration shows you something close to the best case. It tells you almost nothing about reliability.

The closest engineering analogue is the flaky test: a test that passes and fails across runs with no change to the code under it. Software engineering took flakiness seriously enough to study exactly the question skill authors should be asking: how many runs do I need before one outcome means anything? The empirical answer is sobering. Lam et al. (2020) ran each test 500 times, far beyond the handful of reruns earlier work assumed was sufficient, because fewer runs simply did not characterize the failure distribution [11]. The FlakeFlagger study asked directly what confidence a small number of reruns provides and found it low [1]. Luo et al. (2014), in the foundational empirical analysis of flaky tests, gave the field its taxonomy of causes [13]. Dutta et al. (2020) examined flakiness in probabilistic and machine-learning systems specifically and found the dominant cause to be algorithmic nondeterminism, and the most common fix to be abandoning exact-match assertions in favor of threshold-based ones [6]. That last detail is the important one: the remedy for nondeterminism is to model the variance explicitly rather than pretend it away.
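The move from exact-match to threshold-based assertions can be sketched in a few lines. This is a minimal illustration rather than code from any cited tool: the function names, the 20-run record, and the 0.8 threshold are all assumptions chosen for the example.

```python
from typing import Sequence


def exact_match_check(outcomes: Sequence[bool]) -> bool:
    """Brittle check: demands that every single run pass."""
    return all(outcomes)


def threshold_check(outcomes: Sequence[bool], min_pass_rate: float) -> bool:
    """Threshold-based check: passes when the observed pass rate meets the bar."""
    if not outcomes:
        raise ValueError("at least one run is required")
    return sum(outcomes) / len(outcomes) >= min_pass_rate


# Hypothetical record of 20 runs of one skill on one input: 17 passes, 3 failures.
runs = [True] * 17 + [False] * 3

print(exact_match_check(runs))     # False: one failing run is enough to fail
print(threshold_check(runs, 0.8))  # True: 17/20 = 0.85 meets the 0.8 bar
```

The threshold itself is a judgment the author has to defend; the sketch only shows that the assertion is written against a rate, not against a single outcome.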

The same lesson has now been measured inside language-model evaluation. Madaan et al. (2025) quantified benchmark “seed variance” and showed that differences in benchmark scores are frequently not meaningful unless the variance behind them is reported [15]. The broader evaluation literature points in the same direction: HELM argued that models must be assessed across many scenarios and metrics because no single number captures behavior [12].

The conclusion for skill authoring is blunt. A skill that passed once may simply have landed on a favorable draw from a wide output distribution. “It worked when I tried it” is the single most common — and most misleading — sentence in skill development. A reliability claim requires multiple runs, a reported pass rate, and the spread around it.
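A reported pass rate with its spread can be as simple as a proportion plus a confidence interval. The sketch below uses the Wilson score interval, one common choice for a binomial proportion; that choice is our assumption, since nothing above prescribes a particular interval. The function name and the run counts are illustrative.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95% coverage)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passes <= runs:
        raise ValueError("passes must be between 0 and runs")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


single_demo = wilson_interval(passes=1, runs=1)      # one good run
many_runs = wilson_interval(passes=90, runs=100)     # ninety passes in a hundred runs
```

Under these assumptions, one pass out of one run yields an interval whose lower end is roughly 0.21, while 90 passes out of 100 yields roughly 0.83 to 0.94. The first interval is the statistical form of the claim that a single passing run is an anecdote.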

Section 05

Why the Author Cannot Be the Judge

Suppose an author takes Section 04 to heart, runs the skill many times, and grades the outputs. A second problem now appears, and it is structural rather than a matter of carelessness: the author wrote the skill, and therefore is not a neutral grader of it.

The mechanism is well understood. You know which version is the new one, the one you just improved, the one you expect to be better. That expectation shapes how you read its output: what you notice, what you forgive, what you count as a pass. Experimental psychology calls this observer bias and confirmation bias, and the documented countermeasure is blinding. A single-blind design, as the standard reference literature notes, “does not prevent observer bias, confirmation bias, or bias due to demand characteristics”, which is precisely why double-blind designs, in which the evaluator also does not know the condition, became the standard for rigorous trials [5].

The most transferable evidence does not come from drug trials but from scholarly peer review, because a manuscript is an authored artifact judged by an evaluator, structurally the same situation as a skill. The field tested blinding on itself, with randomized controlled trials. McNutt et al. (1990) and Godlee, Gale & Martyn (1998), both in JAMA, randomized manuscripts to blinded and unblinded review [16, 9]. Godlee's design is worth describing precisely, because it doubles as a template: the researchers took a real accepted paper, introduced eight deliberate weaknesses, and measured how many each reviewer caught under each condition. A later systematic review found that double-blind review mitigates provenance-driven bias, such as the tendency to favor prestigious institutions and known authors [8]. Across thirty years, a field whose entire job is evaluating authored work concluded that the evaluation has to be insulated from knowledge of provenance.

The lesson imports directly. When two versions of a skill are compared, the comparison must be blind: the evaluator should not know which output came from which version, and should not know which version is the incumbent and which is the challenger.
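A blind comparison of this kind needs very little machinery: shuffle the two outputs behind neutral labels, let the evaluator pick a label, and consult the key only afterward. The sketch below is one possible shape, with hypothetical function names and placeholder output strings; it is not a description of any particular tool.

```python
import random
from typing import Dict, Tuple


def blind_pair(
    incumbent_output: str, challenger_output: str, rng: random.Random
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (outputs keyed by neutral label, sealed key mapping label to version)."""
    versions = [("incumbent", incumbent_output), ("challenger", challenger_output)]
    rng.shuffle(versions)  # the evaluator cannot infer provenance from label order
    labels = ["A", "B"]
    blinded = {label: text for label, (_, text) in zip(labels, versions)}
    key = {label: name for label, (name, _) in zip(labels, versions)}
    return blinded, key


def unblind(preferred_label: str, key: Dict[str, str]) -> str:
    """Reveal which version the evaluator preferred, after the judgment is recorded."""
    return key[preferred_label]


blinded, key = blind_pair("output from v1", "output from v2", random.Random(0))
# The evaluator is shown only `blinded`; `key` stays sealed until a label is chosen.
```

Keeping the key in a separate object makes the blinding auditable: anything handed to the evaluator can be checked for the absence of version names.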

There is one more wrinkle, because many skill-evaluation loops now use a model to do the grading. This helps: Zheng et al. (2023), in the foundational MT-Bench and Chatbot Arena work, showed that a strong LLM judge agrees with human experts roughly 80% of the time, about the rate at which two human experts agree with each other [31]. But the same work, and a wave of studies after it, documented the judge's own biases. LLM judges exhibit position bias, systematically preferring the answer in a particular slot [27, 24]. They exhibit verbosity bias, preferring longer answers independent of quality [22]. They exhibit self-preference bias: an evaluator can recognize its own generations and rate them higher [20]. And the most dangerous property of all is quiet: a judge always returns a score, even when it lacks the context to judge well, so a generic judge can match average human ratings while missing every genuine failure.

None of this disqualifies model-based grading. It disciplines it. An LLM judge must be controlled the way a human evaluator is controlled: swap the positions of the answers it compares, ground it with explicit reference criteria, calibrate it against human spot-checks, and never let the model that produced an output be the sole judge of that output.
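The position-swap control is the easiest of these to mechanize. The sketch below assumes a hypothetical judge interface, a callable that takes two answers and returns "first" or "second"; the two toy judges stand in for a real model call and exist only to show the control working.

```python
from typing import Callable, Optional

# Assumed interface: judge(first_answer, second_answer) returns "first" or "second".
Judge = Callable[[str, str], str]


def position_controlled_verdict(judge: Judge, answer_x: str, answer_y: str) -> Optional[str]:
    """Ask twice with positions swapped; return "x" or "y" only if both orderings agree."""
    first_pass = judge(answer_x, answer_y)   # x shown in the first slot
    second_pass = judge(answer_y, answer_x)  # y shown in the first slot
    for verdict in (first_pass, second_pass):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
    winner_when_x_first = "x" if first_pass == "first" else "y"
    winner_when_y_first = "y" if second_pass == "first" else "x"
    # A verdict that flips with slot order reflects position, not quality.
    return winner_when_x_first if winner_when_x_first == winner_when_y_first else None


def always_first(first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Toy judge that ignores position (and shows verbosity bias instead)."""
    return "first" if len(first) >= len(second) else "second"
```

With these toy judges, `always_first` yields `None` for any pair, so its verdicts are discarded, while `prefers_longer` yields a stable winner. The second case is a reminder that swapping positions controls position bias only; verbosity and self-preference need their own controls.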

Section 06

Why Weak Tests Are Worse Than No Tests

Suppose, again, that an author has done the work so far — many runs, blind comparison — and has also written down a set of expectations the output is checked against. This is real progress. It also introduces a failure mode of its own, and it is a quiet one.

An expectation can be satisfied for the wrong reason. The check says the output must reference a particular file; the output names the file, but the surrounding content is wrong. The check says a summary must contain a named section; the summary mentions the heading and says nothing true about it. The expectation passes. The dashboard goes green. The skill looks validated.

A weak test that passes is worse than no test at all. No test leaves an author honestly uncertain, and honest uncertainty invites caution. A weak passing test manufactures confidence that has not been earned, and confidence suppresses caution exactly where caution was needed. The danger is not the absence of a signal; it is a false signal that reads as a true one.

This means evaluating a skill and evaluating the evaluation are two distinct jobs, and the second is the one most often skipped. The expectations are themselves an authored artifact. They can be trivial — checking outcomes that any output would satisfy. They can be incomplete — leaving the failure modes that actually matter entirely unchecked. A disciplined process treats the test suite as something to be critiqued and strengthened, not a fixed backdrop. The question is not only “did the skill pass?” but “would these expectations have caught the skill failing?”
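The question “would these expectations have caught the skill failing?” can be asked mechanically by running the expectations against outputs that are known to be bad. The sketch below is a minimal version of that idea; the expectation names, the file name `config.yaml`, and the sample outputs are all hypothetical.

```python
from typing import Callable, Dict, List

Expectation = Callable[[str], bool]


def escaped_defects(
    expectations: Dict[str, Expectation], known_bad_outputs: Dict[str, str]
) -> List[str]:
    """Names of known-bad outputs that satisfy every expectation, i.e. go uncaught."""
    return [
        name
        for name, text in known_bad_outputs.items()
        if all(check(text) for check in expectations.values())
    ]


# Weak suite: only checks that a file is mentioned.
weak_suite: Dict[str, Expectation] = {
    "mentions_file": lambda out: "config.yaml" in out,
}
# Stronger suite: also checks the fact the output is supposed to convey.
stronger_suite: Dict[str, Expectation] = {
    **weak_suite,
    "states_port": lambda out: "port 8080" in out,
}

known_bad = {
    "names_file_wrong_content": "See config.yaml. The service listens on port 22.",
    "omits_file": "The service listens on port 8080.",
}

print(escaped_defects(weak_suite, known_bad))      # ['names_file_wrong_content']
print(escaped_defects(stronger_suite, known_bad))  # []
```

Any name the function returns is a failure the suite would have reported as a pass, which is exactly the false signal described above.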

The evaluation literature makes the same point from the metrics side. HELM's argument for many metrics rather than one is, at heart, an argument that a single number hides failure modes [12]. A recent benchmarking study of agent systems in a medical setting found that prompt-engineering and tool-augmentation gains were “modest” and “consistently accompanied by persistent hallucinations” [28]: the headline metric moved while a serious failure stayed exactly where it was. A metric going up is not the same event as a problem being solved, and only an evaluation designed to catch the real failure can tell the two apart.

Section 07

Why Instructions Are More Fragile Than They Look

There is a deeper reason inspection fails, beneath stochasticity and beneath bias: the relationship between what a skill says and how a model behaves is not as stable as a careful reader assumes.

Models are sensitive to changes in their instructions that carry no semantic content at all. Sclar et al. (2024) demonstrated that prompts which are semantically identical, differing only in formatting choices such as casing, separators, and spacing, can produce large swings in accuracy [23]. The brittleness is real enough that researchers have built instruments to measure it: POSIX, a prompt sensitivity index that treats fragility as a quantifiable property of a model rather than an anecdote [3], and dedicated benchmarks for prompt sensitivity besides.
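One way to probe this kind of sensitivity is to generate meaning-preserving variants of the same instruction and score each one. The sketch below only builds the variants and summarizes a set of scores; running a model is out of scope, the particular casings and separators are our choice, and the accuracy figures are made-up placeholders, not results.

```python
from itertools import product
from typing import List, Sequence


def format_variants(field_name: str, value: str) -> List[str]:
    """Build prompts that differ only in casing and separator, never in meaning."""
    casings = (str.lower, str.title, str.upper)
    separators = (": ", " - ", ":\n")
    return [f"{case(field_name)}{sep}{value}" for case, sep in product(casings, separators)]


def score_spread(scores: Sequence[float]) -> float:
    """Gap between the best and worst score across formatting variants."""
    if not scores:
        raise ValueError("at least one score is required")
    return max(scores) - min(scores)


variants = format_variants("question", "What is the capital of France?")
# Placeholder accuracies for three variants; real values would come from measured runs.
placeholder_accuracies = [0.61, 0.74, 0.58]
spread = score_spread(placeholder_accuracies)
```

A large spread across variants that a human reader would call identical is a sign that the skill's measured quality depends on formatting accidents rather than on its content.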

This has two consequences for anyone authoring a skill. The first is that you cannot reason your way to a good skill from first principles. A phrasing you are confident is clearer, or firmer, or better organized may simply not produce better behavior, and the only way to find out is to measure. The second is that a skill hand-tuned against a few examples may be tuned to spurious features of those particular examples, and fail to generalize to the inputs it will actually meet. Tuning to the test is not improvement; it is overfitting wearing improvement's clothes.

The field's own response is the strongest available evidence here. Because hand-tuning instructions is unreliable and does not scale, an entire subfield of automatic prompt optimization now exists: APE framed the model itself as a prompt optimizer [32], APO performed a gradient-descent-style search over prompt space [21], and a 2025 survey catalogs the resulting landscape as a mature area of work [4]. The people closest to the problem stopped trusting hand-tuning and started measuring. Skill authoring is the same problem at a larger grain.

Section 08

Why Improvement Without a Gate Goes Backward

The natural response to everything above is a loop: run the skill, see what failed, revise it, run it again. The loop is sound, and it is well precedented. Self-Refine (generate, critique, refine) improved performance by roughly 20 points on average across seven tasks with no additional training [14]. Reflexion gave agents a memory of their own reflections so they improved across trials [25]. DSPy went further and reframed prompt construction itself as something a compiler optimizes rather than something a human hand-tunes [10].

But the loop has a sharp edge that authors should not walk past. Recent work found that language models show “no clear evidence of inherent self-refinement” and can degrade in quality after a self-refinement step when the loop lacks the right structure [30]. Iteration is not the same thing as improvement. A loop that revises a skill based on the skill's own self-assessment can wander steadily downhill while reporting progress at every step.

What keeps iteration honest is an external gate. Each candidate version must beat the current best on a measured, blind comparison before it is accepted — and the survivor is the best version, not the latest. The best version is kept as a baseline and defended against every challenger. Newness earns no privilege. The loop only moves in one direction because the gate, not the author's optimism, decides which way is up.
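The gate described here reduces to a small rule: keep the best, not the latest. The sketch below assumes each candidate already carries a pass rate measured on a blind, multi-run comparison; the class, the function names, the version labels, and the 0.02 margin are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    name: str
    pass_rate: float  # assumed to come from a blind, multi-run evaluation


def gate(best: Candidate, challenger: Candidate, min_margin: float = 0.02) -> Candidate:
    """Accept the challenger only if it beats the current best by at least min_margin."""
    if challenger.pass_rate - best.pass_rate >= min_margin:
        return challenger
    return best


def defend_baseline(
    baseline: Candidate, challengers: Iterable[Candidate], min_margin: float = 0.02
) -> Candidate:
    """Run every challenger through the gate in order; the survivor is the best version."""
    best = baseline
    for challenger in challengers:
        best = gate(best, challenger, min_margin)
    return best


survivor = defend_baseline(
    Candidate("v1", 0.70),
    [Candidate("v2", 0.78), Candidate("v3", 0.74), Candidate("v4", 0.79)],
)
print(survivor.name)  # v2
```

In this example v3 is a regression and v4 improves on v2 by less than the margin, so v2 survives even though v4 is the most recent. In practice the margin would be set from the measured run-to-run variance rather than picked by hand.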

This is the insight test-driven development imported into software, and it is worth being precise about the evidence rather than overselling it. The strongest industrial result is Nagappan et al. (2008): four teams at Microsoft and IBM saw defect density fall by 40–90% under TDD, at a cost of 15–35% more initial development time [18]. But the evidence is genuinely mixed, and a whitepaper that hid that would not deserve to be believed on anything else. A rigorous multi-site industrial study by Fucci et al. (2016) found no significant advantage from test-first ordering specifically, and attributed the observed benefits instead to the process being incremental and steered by explicit checks [7]. The defensible claim is therefore not “test-first wins.” It is narrower and sturdier: writing down an explicit, checkable specification of the desired behavior before authoring, and then iterating in small steps each governed by that specification, is what produces the gains. That is exactly the move skill authoring needs to make.

Section 09

The Pattern Other Fields Already Found

Step back from the individual failure modes and a single pattern becomes visible across four fields that rarely cite one another.

  • Medicine and scholarly peer review learned that the author of an artifact cannot impartially evaluate it, and built blinding into the method.
  • Software engineering learned that one run of a nondeterministic test carries almost no information, and built multi-run statistics and threshold-based assertions into the method.
  • Machine learning evaluation learned that benchmark numbers without variance are not comparable, and built variance reporting into the method.
  • Educational psychology and human-computer interaction learned that working memory is sharply capacity-limited, and built progressive disclosure into the design of instructions.

That last field deserves a moment, because it addresses something the others do not: a skill is also a document a reader must hold in mind under a hard memory limit, and that reader is a model with a bounded context window. Cognitive Load Theory (Sweller, 1988) [26], building on Miller's classic finding that working memory holds only a handful of items at once [17], holds that comprehension and learning degrade when working memory is overloaded with detail that does not serve the immediate task. Progressive disclosure, the HCI pattern of layering information so only the essentials appear first, traceable to Carroll's “training wheels” research [2] and later codified as a usability principle [19], is the design response. A skill that front-loads every edge case, caveat, and reference into one undifferentiated wall of text makes the same mistake a badly written textbook makes. A well-built skill is layered: a short description that lets the reader decide whether the skill is even relevant, core instructions next, and reference detail available on demand. The theory was developed for human learners with bounded working memory; it transfers cleanly to a model with a bounded context window. Same constraint, same remedy.
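The three layers named above map naturally onto a small data structure. The sketch below is a hypothetical shape, not the format of any existing skill system: the class name, field names, and sample text are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Sequence


@dataclass
class LayeredSkill:
    description: str        # layer 1: enough to decide whether the skill is relevant
    core_instructions: str  # layer 2: what the reader needs for the common case
    references: Dict[str, str] = field(default_factory=dict)  # layer 3: on demand

    def context_for(self, requested_refs: Sequence[str] = ()) -> str:
        """Assemble only the layers needed; reference detail is included when asked for."""
        parts = [self.description, self.core_instructions]
        parts += [self.references[name] for name in requested_refs]
        return "\n\n".join(parts)


skill = LayeredSkill(
    description="Drafts pull request descriptions in the house format.",
    core_instructions="Summarize the change, list risks, link the ticket.",
    references={"edge_cases": "For reverts, lead with the reverted commit."},
)
default_context = skill.context_for()
extended_context = skill.context_for(["edge_cases"])
```

The design choice is that the default call never pays for reference detail; edge cases enter the context only when a task actually needs them.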

Two cross-disciplinary connections are worth naming explicitly, because they turn rhetoric into something testable and something usable.

The first: the peer-review randomized controlled trial is a ready-made experiment for skill evaluation. The Godlee and McNutt design — seed an artifact with known defects, randomize evaluators to blinded versus author-aware conditions, and measure the detection rate — applies to skills with almost no translation. It converts the question “does blind evaluation of skills actually help?” from a rhetorical flourish into a falsifiable experiment any team can run.

The second: Cognitive Load Theory, a 1988 result from educational psychology, becomes a quantitative design constraint for AI instruction modules. The “learner” is now a model, the “working memory” is now the context window, and the prescription — minimize extraneous load, layer the information, defer non-essential detail — is unchanged. A theory built for human students becomes an engineering rule for skills.
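The first of these connections, the seeded-defect trial, needs only random assignment and a detection rate. The sketch below assumes eight seeded defects, following the Godlee design; the evaluator names and defect counts are hypothetical placeholders, not data.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence

SEEDED_DEFECTS = 8  # number of deliberate weaknesses planted in the artifact


def assign_conditions(evaluators: Sequence[str], rng: random.Random) -> Dict[str, List[str]]:
    """Randomly split evaluators into blinded and author-aware groups."""
    shuffled = list(evaluators)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"blinded": shuffled[:half], "author_aware": shuffled[half:]}


def detection_rate(defects_found: Dict[str, int], group: Sequence[str]) -> float:
    """Mean fraction of seeded defects caught by the evaluators in one group."""
    return mean(defects_found[name] / SEEDED_DEFECTS for name in group)


groups = assign_conditions(["e1", "e2", "e3", "e4", "e5", "e6"], random.Random(0))
# Placeholder counts for two evaluators, to show the calculation only.
example_rate = detection_rate({"e1": 4, "e2": 2}, ["e1", "e2"])  # (0.5 + 0.25) / 2
```

Comparing the detection rate of the two groups, with an interval around each, is what turns the question of whether blinding helps into something a team can measure.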

The point of the section is deflationary on purpose. None of what skill authoring needs is novel. It is the disciplined import of practices that adjacent fields already paid for in their own failures.

Section 10

From Craft to Discipline: What Measured Skill Authoring Requires

The threads pull together into a small set of requirements. They are stated here as principles rather than procedures — the point is what a sound discipline must guarantee, not the particular machinery any one tool uses to guarantee it.

  1. Expectations before authoring. Before a skill is written or revised, the behavior it must produce is written down as explicit, checkable expectations. Without them there is no definition of “working,” and with no definition of working there is nothing to improve toward.
  2. Measurement across runs, with variance. A skill is judged on repeated runs and a reported pass rate with its spread, never on a single demonstration. A 70% pass rate with wide run-to-run variance is a different, and worse, thing than a steady 70%.
  3. Blind comparison. Competing versions of a skill are judged by an evaluator that does not know which output came from which version, and does not know which version is the incumbent.
  4. A defended baseline. Every candidate version must beat the current best on a measured comparison before it is accepted. The best version is kept; the most recent version is given no privilege for being recent.
  5. Always compare against no skill. A skill occupies space in a model’s context, and that space has a cost. A skill is only worth keeping if it beats the same model running without it. The true baseline is not the previous skill; it is the skill’s absence.
  6. Critique the expectations, not only the output. The test suite is itself an artifact and can be weak. Expectations that pass for trivial reasons, and important failure modes that nothing checks, are found and corrected.
  7. Layer the artifact. A skill is structured for progressive disclosure, so it stays within the reader’s working memory, whether that reader is a human maintainer or a model with a finite context window.
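The second principle's point about a steady 70% versus an erratic 70% is easy to see with numbers. The sketch below summarizes two hypothetical skills whose per-batch pass rates have the same mean; the batch values are invented for illustration, and population standard deviation is just one reasonable measure of spread.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def summarize(batch_pass_rates: Sequence[float]) -> Tuple[float, float]:
    """Return (mean pass rate, population standard deviation across batches)."""
    return mean(batch_pass_rates), pstdev(batch_pass_rates)


# Hypothetical pass rates from five evaluation batches per skill.
steady = [0.70, 0.70, 0.70, 0.70, 0.70]
erratic = [1.00, 0.40, 0.90, 0.30, 0.90]

steady_mean, steady_spread = summarize(steady)
erratic_mean, erratic_spread = summarize(erratic)
```

Both skills average 0.70, but the second has a spread of about 0.29 across batches against zero for the first. Reporting the mean alone would make them look identical.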

This is the discipline that kellerai-skill-creator was built to make routine. It is an extension of an existing skill-creation tool, and its purpose is deliberately narrow: to make the measured path the path of least resistance — so that defining expectations, running many trials, comparing blind, holding a baseline, and gating each change on evidence is simply what happens by default, rather than what a conscientious author has to remember to do.

We are deliberately not documenting the tool's internals here. This whitepaper is about why the discipline is necessary, not how any one implementation delivers it. The point that matters is the shift the discipline represents: from a skill as prose — written once, admired, and shipped — to a skill as a measured, versioned, evidence-backed artifact with a performance history that can be inspected, defended, and trusted.

Section 11

Why This Matters Now: Skills as Shared Infrastructure

The argument would be largely academic if skills were still one-off prompts used by their authors. They are not, and the gap between how skills are built and what now depends on them is the reason this is urgent rather than interesting.

Three forces sharpen the timing.

The scale of invocation. A skill's defects are not paid for once. They are paid every time the skill runs, by every agent that loads it, on every input it meets. An unmeasured skill is unmanaged risk multiplied by its invocation count — and across an organization moving real work onto agents, that multiplier only grows.

Silent, distributed failure. A skill rarely fails loudly. It produces a plausible output that is subtly wrong, or an output that is right most of the time and wrong in a way no one is watching for. The cost is diffuse: a slightly worse outcome, spread thinly across thousands of runs. That is precisely the kind of cost that is never traced back to its source, because no single instance is bad enough to investigate.

Composition. Agents increasingly invoke other agents, and skills are composed with other skills. An unmeasured skill three layers down in a workflow is a foundation no one is checking. Reliability does not compose upward out of components that were never measured; a system is only as trustworthy as the least examined skill it stands on.
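Two of these forces have simple arithmetic behind them. The sketch below computes the expected number of failures at a given invocation count, and the reliability of a chain of skills under the strong simplifying assumptions that steps fail independently and that every step must succeed. The rates and counts are illustrative, not measurements.

```python
import math
from typing import Sequence


def expected_failures(failure_rate: float, invocations: int) -> float:
    """Scale: a per-run failure rate multiplied by how often the skill runs."""
    return failure_rate * invocations


def chain_reliability(step_success_rates: Sequence[float]) -> float:
    """Composition: success rate of a workflow that needs every step to succeed.

    Assumes independent failures, which real workflows may not satisfy.
    """
    return math.prod(step_success_rates)


failures = expected_failures(failure_rate=0.02, invocations=10_000)
three_layer = chain_reliability([0.95, 0.95, 0.95])
```

Under these assumptions, a 2% failure rate over ten thousand invocations is about 200 bad outputs, and three composed skills that each succeed 95% of the time succeed together only about 86% of the time.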

Software made this transition once already. For years, software was considered done when it compiled and ran once. Then the industry adopted automated testing, and “it ran on my machine” stopped counting as evidence of anything. The shift took years and was resisted at nearly every step. Prompt and skill authoring is early in the same transition — moving from “it worked when I tried it” toward “here is the evidence that it works.” The fields skill authoring borrows from have already run the experiments and absorbed the failures. The only open question for any given organization is whether it adopts the discipline deliberately, or waits for an unmeasured skill to make the case the expensive way.

Conclusion

A skill is easy to write and easy to admire. That is the danger in it. Its quality lives in behavior, not in text, and behavior in a stochastic system is invisible both to inspection and to a single demonstration. Every field that has had to evaluate an authored artifact under uncertainty — medicine, peer review, software testing, machine learning, instructional design — arrived independently at the same short list of countermeasures: define success in advance, measure repeatedly, report the variance, blind the evaluation, defend a baseline, and layer the artifact to the reader's capacity.

Skill authoring needs all of them. Treating skill creation as a measured discipline rather than a craft is not bureaucratic caution, and it is not process for its own sake. It is the minimum required to answer one plain question honestly: does the thing you just shipped — and will ship ten thousand more times — actually work?

References
  1. Alshammari, A.; Morris, C.; Hilton, M.; Bell, J. (2021). FlakeFlagger: Predicting Flakiness Without Rerunning Tests. International Conference on Software Engineering (ICSE).
  2. Carroll, J. M. (1990). The Nurnberg Funnel: Designing Minimalist Instruction for Practical Computer Skill. MIT Press.
  3. Chatterjee, A.; Renduchintala, H. S. V. N. S. K.; Bhatia, S.; Chakraborty, T. (2024). POSIX: A Prompt Sensitivity Index for Large Language Models. Findings of EMNLP.
  4. Cui, W.; et al. (2025). A Systematic Survey of Automatic Prompt Optimization Techniques. EMNLP.
  5. David, S.; Khandhar, P. B. (2023). Double-Blind Study. StatPearls.
  6. Dutta, S.; Selvam, A.; Jain, A.; Misailovic, S. (2020). Detecting Flaky Tests in Probabilistic and Machine Learning Applications (FLASH). International Symposium on Software Testing and Analysis (ISSTA).
  7. Fucci, D.; et al. (2016). An External Replication on the Effects of Test-Driven Development Using a Multi-Site Blind Analysis Approach. Empirical Software Engineering.
  8. Garg, A. K.; et al. (2022). The Impact of Double-Blind Peer Review on Gender Bias in Scientific Publishing: A Systematic Review. American Journal of Obstetrics & Gynecology.
  9. Godlee, F.; Gale, C. R.; Martyn, C. N. (1998). Effect on the Quality of Peer Review of Blinding Reviewers and Asking Them to Sign Their Reports: A Randomized Controlled Trial. JAMA, 280(3), 237–240.
  10. Khattab, O.; Singhvi, A.; Maheshwari, P.; Zhang, Z.; Santhanam, K.; et al. (2024). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. International Conference on Learning Representations (ICLR).
  11. Lam, W.; et al. (2020). A Study on the Lifecycle of Flaky Tests. International Conference on Software Engineering (ICSE).
  12. Liang, P.; Bommasani, R.; Lee, T.; et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110.
  13. Luo, Q.; Hariri, F.; Eloussi, L.; Marinov, D. (2014). An Empirical Analysis of Flaky Tests. ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE).
  14. Madaan, A.; Tandon, N.; Gupta, P.; et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems (NeurIPS).
  15. Madaan, L.; Singh, A. K.; Schaeffer, R.; et al. (2025). Quantifying Variance in Evaluation Benchmarks. International Conference on Learning Representations (ICLR).
  16. McNutt, R. A.; Evans, A. T.; Fletcher, R. H.; Fletcher, S. W. (1990). The Effects of Blinding on the Quality of Peer Review: A Randomized Trial. JAMA, 263(10), 1371–1376.
  17. Miller, G. A. (1956). The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. Psychological Review, 63(2), 81–97.
  18. Nagappan, N.; Maximilien, E. M.; Bhat, T.; Williams, L. (2008). Realizing Quality Improvement Through Test Driven Development: Results and Experiences of Four Industrial Teams. Empirical Software Engineering, 13(3), 289–302.
  19. 19Nielsen, J. (2006). Progressive Disclosure. Nielsen Norman Group.
  20. 20Panickssery, A.; Bowman, S. R.; Feng, S. (2024). LLM Evaluators Recognize and Favor Their Own Generations. Advances in Neural Information Processing Systems (NeurIPS).
  21. 21Pryzant, R.; Iter, D.; Li, J.; Lee, Y. T.; Zhu, C.; Zeng, M. (2023). Automatic Prompt Optimization with “Gradient Descent” and Beam Search. Conference on Empirical Methods in Natural Language Processing (EMNLP).
  22. 22Saito, K.; Wachi, A.; Wataoka, K.; Akimoto, Y. (2023). Verbosity Bias in Preference Labeling by Large Language Models. arXiv:2310.10076.
  23. 23Sclar, M.; Choi, Y.; Tsvetkov, Y.; Suhr, A. (2024). Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design. International Conference on Learning Representations (ICLR).
  24. 24Shi, L.; et al. (2025). A Systematic Study of Position Bias in LLM-as-a-Judge. IJCNLP-AACL.
  25. 25Shinn, N.; Cassano, F.; Berman, E.; Gopinath, A.; Narasimhan, K.; Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS).
  26. 26Sweller, J. (1988). Cognitive Load During Problem Solving: Effects on Learning. Cognitive Science, 12(2), 257–285.
  27. 27Wang, P.; Li, L.; Chen, L.; et al. (2023). Large Language Models Are Not Fair Evaluators. arXiv:2305.17926.
  28. 28Yan; et al. (2026). Benchmarking Large Language Model-Based Agent Systems. npj Digital Medicine.
  29. 29Yehudai, A.; et al. (2025). Evaluation and Benchmarking of LLM Agents: A Survey. arXiv:2507.21504.
  30. 30Yu; et al. (2025). Evolving LLMs’ Self-Refinement Capability via Synergistic Training-Inference Optimization. arXiv:2502.05605.
  31. 31Zheng, L.; Chiang, W.-L.; Sheng, Y.; et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS).
  32. 32Zhou, Y.; Muresanu, A. I.; Han, Z.; et al. (2023). Large Language Models Are Human-Level Prompt Engineers (APE). International Conference on Learning Representations (ICLR).