kellerai.blog

KOTH and Oracle: Selection and Improvement at Scale

When an ecosystem grows to hundreds of agents, picking the right one and making it better require infrastructure, not intuition.

KellerAI White Paper · Earned Autonomy & Agents · May 2026

Context

When an organization runs hundreds of active agents, naive routing heuristics (recency, habit, alphabetical order) never accumulate evidence, so their errors compound as the agent pool grows. Selection breaks down. At the same time, the developer who built an agent reads its output through the lens of what they intended, seeing confirmation of their design rather than evidence of failure. That anchoring bias is structural, not personal, and it reliably produces agents that plateau. Both problems share a root: replacing systematic measurement with human judgment at the exact point where human judgment is least reliable, namely pattern-matching across hundreds of performers and self-evaluation of work you authored.

The Finding

KOTH ranks agents competitively using ELO scores derived from production telemetry, while Oracle routes tasks via Thompson Sampling over per-agent Beta distributions, automatically balancing exploration of newer agents against exploitation of proven performers. Together they close a feedback loop: every execution produces telemetry, telemetry updates rankings, rankings shift routing decisions. The evaluation pipeline removes anchoring bias through blind A/B comparison, where a comparator judges two versions without knowing which is the candidate, eliminating the structural bias that makes self-evaluation unreliable.

Tags: Agent Ranking & Selection, Feedback Systems & Measurement, Organizational Learning
Paper Details
Category: Earned Autonomy & Agents
Audience: Engineering teams, product leaders, and platform architects managing agent ecosystems and selection systems
Method: Closed-loop system design (KOTH-Oracle-evaluation-improvement pipeline) + Thompson Sampling mechanics + blind comparison methodology + organizational-capital framing
Length: ~1,600 · 7 min
Sections: 5
Date: May 2026
Authors: KellerAI
Section 01

The twin bottlenecks

Build enough agents and two problems arrive together. The first is selection. When an organization runs 888 active agents — some specialist, some generalist, some overlapping in scope — naive heuristics stop working. Routing to the newest agent favors recency over quality. Routing to the last-used agent favors habit over fit. Alphabetical ordering is, at best, a joke. None of these approaches accumulate evidence. None of them get better with use.

The second problem is quality. Improving an agent by hand sounds straightforward until you try it. The same person who wrote the agent reads its output through the lens of what they intended it to produce. They see confirmation of their design, not evidence of its failures. That anchoring bias is not a character flaw — it is a structural feature of self-evaluation, and it reliably produces agents that plateau.

Both problems share a root. They are the result of replacing systematic measurement with human judgment at the point where human judgment is least reliable: pattern matching across hundreds of performers, and incremental evaluation of work you authored yourself. The thesis of KOTH and Oracle is that a closed-loop system of measurement, ranking, and structured improvement beats manual tuning because it accumulates evidence — and evidence compounds where intuition does not.

Section 02

KOTH and Oracle — the measurement and routing layer

KOTH stands for King of the Hill. The name describes the mechanism: agents compete on their track records, and the rankings reflect sustained performance rather than a single lucky run. The implementation is a three-stage pipeline. Telemetry is extracted from every agent execution — latency, success signals, task category tags, structured output quality markers. That raw telemetry is aggregated by task category so that a specialized agent is measured only against peers in its domain, not against generalists it was never designed to beat. The aggregated data is then fed into an ELO algorithm, the same competitive ranking mechanism used in chess and competitive gaming, producing a stable total ordering that is responsive to sustained performance change but resistant to single-run noise.
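As a rough illustration of the final ranking stage, the sketch below applies the standard ELO update to head-to-head results and keeps a separate rating table per task category, so a specialist is only compared with peers in its domain. The K-factor of 32, the starting rating of 1500, the `record_matchup` helper, and the idea that aggregated telemetry is reduced to pairwise outcomes are all assumptions made for this example; the source does not specify them.

```python
from collections import defaultdict

K_FACTOR = 32.0          # assumed: how strongly one result moves a rating
INITIAL_RATING = 1500.0  # assumed: rating given to an agent with no history


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard ELO expected score of A against B, in the range 0..1."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = K_FACTOR) -> tuple[float, float]:
    """Return new (rating_a, rating_b). score_a is 1.0 win, 0.5 draw, 0.0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# ratings[category][agent] -> rating; one independent table per task category
ratings: dict[str, dict[str, float]] = defaultdict(
    lambda: defaultdict(lambda: INITIAL_RATING)
)


def record_matchup(category: str, agent_a: str, agent_b: str,
                   score_a: float) -> None:
    """Hypothetical helper: apply one pairwise outcome within a category."""
    table = ratings[category]
    table[agent_a], table[agent_b] = update_elo(
        table[agent_a], table[agent_b], score_a
    )
```

A smaller K-factor gives more of the resistance to single-run noise that the text describes, at the cost of slower response to a genuine change in performance.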

The ELO scores are useful on their own, but routing purely by rank has a well-known failure mode: a new agent that happens to be better than anything currently deployed can never demonstrate that fact, because it is never selected. Oracle solves this with Thompson Sampling. Each of the 1,300-plus tracked agents carries a Beta distribution representing its estimated success probability for a given task type. A Beta distribution has two parameters — one counting observed successes, one counting observed failures — and it starts wide, encoding genuine uncertainty about an agent that has not been tested much, and narrows as evidence accumulates.
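To make "starts wide and narrows" concrete, the sketch below computes the mean and standard deviation of a Beta distribution from success and failure counts, using the standard closed-form expressions. It assumes a uniform Beta(1, 1) prior, which is a common default but is not stated in the source; the function names are illustrative.

```python
import math


def beta_mean(successes: int, failures: int) -> float:
    """Mean of Beta(successes + 1, failures + 1).

    The +1 terms encode an assumed uniform prior.
    """
    a, b = successes + 1, failures + 1
    return a / (a + b)


def beta_std(successes: int, failures: int) -> float:
    """Standard deviation of the same distribution; shrinks as evidence grows."""
    a, b = successes + 1, failures + 1
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
```

Under these assumptions an untested agent has a mean of 0.5 with a standard deviation of roughly 0.29, while an agent with 80 successes and 20 failures has a mean near 0.79 with a standard deviation of roughly 0.04.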

When a routing request arrives, Oracle samples from each candidate agent's Beta distribution and routes the task to the agent whose sampled value is highest. This single operation automatically balances exploration against exploitation. A new agent with a wide distribution gets sampled aggressively — its uncertainty means it might outperform, and the only way to know is to try. A mature agent with a tight, high distribution gets selected consistently because the evidence is strong. No tuning knob, no exploration rate to configure: the distributions handle it.
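The routing step described above can be sketched in a few lines with the standard library. This is a minimal illustration rather than Oracle's actual code: `AgentPosterior`, `route`, and `record_outcome` are hypothetical names, and the uniform Beta(1, 1) prior is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class AgentPosterior:
    """Success and failure counts for one agent on one task type."""
    successes: int = 0
    failures: int = 0

    def sample(self, rng: random.Random) -> float:
        # +1 on each parameter is the assumed uniform Beta(1, 1) prior
        return rng.betavariate(self.successes + 1, self.failures + 1)


def route(candidates: dict[str, AgentPosterior], rng: random.Random) -> str:
    """Draw one sample per candidate and pick the agent with the highest draw."""
    draws = {name: post.sample(rng) for name, post in candidates.items()}
    return max(draws, key=draws.get)


def record_outcome(post: AgentPosterior, success: bool) -> None:
    """Fold one observed result back into the agent's counts."""
    if success:
        post.successes += 1
    else:
        post.failures += 1
```

With a pool containing a mature agent at 80 successes and 20 failures and an untested agent, the untested agent wins a noticeable minority of draws. That is the exploration behavior the paragraph describes, and it arises without a separate exploration-rate parameter.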

The feedback loop closes itself. Every task Oracle routes produces telemetry; that telemetry updates the ELO pipeline; updated ELO scores shift the Beta distribution parameters; Oracle's next routing decision is marginally sharper. The loop is self-tightening. It is also self-revealing about failures: if any stage is broken — if telemetry stops arriving, if aggregation silently drops records, if ELO updates stall — the system does not crash. It degrades. Bad agents keep receiving traffic because their distributions were last updated with stale evidence. Good agents are starved because the evidence of their quality stopped flowing. That silent degradation is the most dangerous failure mode in a measurement system, and operating KOTH and Oracle correctly requires treating the loop's health as a first-class concern.
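One way to treat loop health as a first-class concern is a freshness check on each stage, so that a stalled stage raises an alert instead of degrading silently. The sketch below is only an illustration: the stage names, the freshness budgets, and the `stale_stages` helper are assumptions, not part of the described system.

```python
from datetime import datetime, timedelta

# Assumed stage names and freshness budgets; real values depend on pipeline cadence.
MAX_AGE = {
    "telemetry": timedelta(minutes=5),
    "aggregation": timedelta(hours=1),
    "elo_update": timedelta(hours=6),
}


def stale_stages(last_success: dict[str, datetime], now: datetime) -> list[str]:
    """Return stages whose last successful run is missing or older than its budget."""
    stale = []
    for stage, budget in MAX_AGE.items():
        seen = last_success.get(stage)
        if seen is None or now - seen > budget:
            stale.append(stage)
    return stale
```

A stage that has never reported is treated as stale, because a missing heartbeat is exactly the silent failure the paragraph warns about.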

Section 03

The improvement engines

KOTH identifies which agents are underperforming. Oracle deprioritizes them in routing. Neither tells you how to make them better. That is the work of the evaluation pipeline, which is composed of four building blocks designed to remove the anchoring bias that makes self-evaluation unreliable.

The first building block is the Executor: it runs the agent against a fixed, reproducible test harness. The same input, the same environment, every time. The Executor produces output; it does not judge it. The second is the Grader: it evaluates the Executor's output against a set of structured assertions — typed criteria that produce pass, fail, or partial-credit evidence, not a score from a gut feeling. The Grader's output is structured, not narrative.
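The Grader's "typed criteria" might look something like the sketch below, which evaluates an output string against named assertions and returns structured evidence rather than a single score. The `Verdict`, `Assertion`, and `Evidence` types, the two example checks, and the required keys `summary` and `confidence` are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    PASS = "pass"
    PARTIAL = "partial"
    FAIL = "fail"


@dataclass(frozen=True)
class Assertion:
    name: str
    check: Callable[[str], Verdict]


@dataclass(frozen=True)
class Evidence:
    assertion: str
    verdict: Verdict


def valid_json(output: str) -> Verdict:
    """Pass if the output parses as JSON at all."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return Verdict.FAIL
    return Verdict.PASS


def has_required_keys(output: str) -> Verdict:
    """Partial credit when only one of the two assumed keys is present."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return Verdict.FAIL
    if not isinstance(data, dict):
        return Verdict.FAIL
    present = sum(key in data for key in ("summary", "confidence"))
    return (Verdict.FAIL, Verdict.PARTIAL, Verdict.PASS)[present]


ASSERTIONS = [
    Assertion("valid_json", valid_json),
    Assertion("has_required_keys", has_required_keys),
]


def grade(output: str, assertions: list[Assertion]) -> list[Evidence]:
    """Run every assertion and return one piece of evidence per assertion."""
    return [Evidence(a.name, a.check(output)) for a in assertions]
```

Keeping the result as a list of per-assertion verdicts, instead of collapsing it into a number, preserves the information a later stage needs to identify a specific failure mode.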

The third building block is the Comparator. This is where blind comparison happens. The Comparator receives two versions of an agent (the current production version and a candidate improvement), runs them side by side, and resolves the contest via majority vote across three independent Comparator instances. The key design decision is that the Comparator never knows which version is the candidate and which is the control. Labels are stripped before the contest begins. This single constraint eliminates the anchoring bias: the Comparator has no prior to confirm. It resolves the contest on evidence.
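A minimal sketch of label-stripped comparison with a three-way majority vote follows. It makes several assumptions: the two versions have already been run and only their outputs are compared, each judge is a function that sees two unlabeled outputs and returns the index of the one it prefers, and presentation order is shuffled per judge so that position carries no hint. The names `Judge` and `blind_compare` are hypothetical.

```python
import random
from collections import Counter
from typing import Callable

# A judge sees two unlabeled outputs and returns 0 or 1 for the one it prefers.
Judge = Callable[[str, str], int]


def blind_compare(control_output: str, candidate_output: str,
                  judges: list[Judge],
                  rng: random.Random) -> tuple[str, list[str]]:
    """Return (winning label, per-judge votes); judges never see the labels."""
    if len(judges) % 2 == 0:
        raise ValueError("use an odd number of judges so a majority always exists")
    votes = []
    for judge in judges:
        pair = [("control", control_output), ("candidate", candidate_output)]
        rng.shuffle(pair)  # randomize position independently for each judge
        picked = judge(pair[0][1], pair[1][1])
        votes.append(pair[picked][0])  # map the pick back to its hidden label
    winner, _ = Counter(votes).most_common(1)[0]
    return winner, votes
```

The labels exist only inside `blind_compare`; each judge receives two bare strings, which is the property that removes the prior the text refers to.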

The fourth building block is the Analyzer. After the Comparator produces a result, the Analyzer examines the evidence and proposes ranked improvement suggestions. The output is not “this agent is better.” It is: here is the rubric, here are the votes from each independent comparator, here is the specific failure mode the losing version exhibited, here are three changes ordered by estimated impact. Structured evidence in, structured improvement candidates out.
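The shape of the Analyzer's output could be captured with a small record type like the one below. The field names, the numeric `estimated_impact` scale, and the `ranked` helper are assumptions for illustration; the source only lists what the report contains.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Suggestion:
    change: str
    estimated_impact: float  # assumed scale: larger means more expected benefit


@dataclass
class AnalyzerReport:
    rubric: list[str]        # criteria the contest was judged on
    votes: list[str]         # one vote per independent comparator
    failure_mode: str        # what the losing version got wrong
    suggestions: list[Suggestion] = field(default_factory=list)

    def ranked(self, limit: int = 3) -> list[Suggestion]:
        """Top suggestions ordered by estimated impact, highest first."""
        ordered = sorted(self.suggestions,
                         key=lambda s: s.estimated_impact, reverse=True)
        return ordered[:limit]
```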

Section 04

Acceleration and autonomy

The evaluation pipeline improves existing agents. The agent-evolution-framework accelerates the creation of new ones. It does this through pattern extraction: scanning mature production codebases for recurring structural idioms — decorator chains, validation patterns, hook sequences, typed context objects — and packaging them as reusable agent components. A new specialist agent assembled from extracted patterns starts significantly closer to production quality than one built from scratch, because the patterns carry the accumulated engineering decisions of systems that have already been running in production.

The more consequential capability is the autonomous improvement cycle. The loop runs five phases without manual phase gates: Discovery surfaces underperforming agents by monitoring KOTH ranking drift and Oracle's routing depression signals. Filter removes noise — agents that appear underperforming due to data sparsity rather than genuine quality decline. Experiment generates candidate improvements by applying the Analyzer's ranked suggestions. Evaluate runs the full four-building-block pipeline against each candidate. Integrate ships the winner by updating the production agent and triggering an ELO recalculation.
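The five phases can be read as a single pass over the agent pool, as in the sketch below. Each phase is passed in as a plain function so the control flow is visible; the function names, their signatures, and the choice to ship the first candidate that passes evaluation are assumptions for this example, not details from the source.

```python
from typing import Callable, Iterable


def run_cycle(
    discover: Callable[[], Iterable[str]],
    is_real_decline: Callable[[str], bool],
    experiment: Callable[[str], list[str]],
    evaluate: Callable[[str, str], bool],
    integrate: Callable[[str, str], None],
) -> list[tuple[str, str]]:
    """Run one pass of the loop and return the (agent, candidate) pairs shipped."""
    shipped = []
    for agent in discover():                 # Discovery: flagged agents
        if not is_real_decline(agent):       # Filter: drop sparse-data noise
            continue
        for candidate in experiment(agent):  # Experiment: candidate versions
            if evaluate(agent, candidate):   # Evaluate: blind pipeline verdict
                integrate(agent, candidate)  # Integrate: ship and re-rank
                shipped.append((agent, candidate))
                break                        # assumed: first winner ships
    return shipped
```

In this sketch the Filter phase is a single predicate; a simple version might require a minimum number of recent runs before an agent's apparent decline is trusted.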

Each phase closes itself. Discovery does not wait for a human to notice a ranking drift. Filter does not require a manual review of which signals are meaningful. Integrate does not require a deployment approval for an improvement that the blind Comparator already validated. The loop turns continuously, and its cadence is bounded only by the time required to run evaluations.

Section 05

The compounding effect

The pieces are individually useful. Together they create a virtuous cycle. Better agents get more Oracle traffic because their Beta distributions have tighter, higher concentrations of probability mass. More traffic produces more telemetry. More telemetry sharpens ELO scores. Sharper ELO scores drive more precise routing. More precise routing surfaces more nuanced failure modes, which the evaluation pipeline converts into more targeted improvements. The cycle accelerates as the evidence base grows.

Cold-start agents enter at the beginning of this curve. A brand new agent carries a wide Beta distribution encoding genuine uncertainty, and Oracle samples it aggressively — not as a courtesy, but because uncertainty is statistically indistinguishable from potential. As the agent accumulates runs, the distribution narrows to reflect its actual quality. If the quality is high, the distribution settles at the top of the range and the agent earns consistent routing. If it is low, the distribution settles at the bottom and the evaluation pipeline targets it for improvement. No human makes these determinations. The math does.

The organizational-capital framing applies here as it does to reasoning traces. The moat is not any single agent in the ecosystem. It is the accumulated telemetry from every execution, the ELO history encoding which agents outperformed under which conditions, the extracted patterns distilled from years of production-hardened code, and the closed-loop infrastructure that continues to learn from every execution that flows through it. A competitor can copy the mechanism. They cannot copy the history. And the history can only be built by running the loop — forward, continuously, over time.

For the complete architecture (the extract-aggregate-ELO pipeline, Thompson Sampling Beta distribution mechanics, the four-building-block evaluation framework with blind A/B comparison, the agent-evolution-framework's eight production patterns and KOTH integration, and the autonomous Discovery-Filter-Experiment-Evaluate-Integrate loop), read the companion technical whitepaper, KOTH and Oracle: Technical Deep Dive.