kellerai.blog

KOTH and Oracle: Technical Deep Dive

Selection, measurement, and improvement in agent ecosystems at scale — ELO-based ranking, Thompson Sampling adaptive routing, blind A/B evaluation, and autonomous improvement loops across 1,300+ tracked agents.

KellerAI White Paper · In-Depth · Earned Autonomy & Agents · May 2026 · ~31 min read

Context

Plugin ecosystems for Claude Code expose a scaling problem that does not exist when a developer works with a small, hand-curated set of agents. As the number of available agents and skills grows into the hundreds—and the marketplace's Oracle currently tracks 1,300 agents with 888 active—two bottlenecks become critical: selection and quality improvement. The first is selection: given a task description, which agent is most likely to succeed? Answering this correctly requires historical performance data, a model of uncertainty about newly registered agents, and a mechanism that improves as more outcomes are recorded.

The second is quality improvement: once an agent is selected and executes, how does the system detect that its behavior has regressed, or that a competing skill formulation would produce better outputs? Without a structured evaluation pipeline, quality improvement depends entirely on the developer's subjective assessment, which is subject to anchoring bias and inconsistent application.

The Finding

KOTH (King of the Hill) is the performance analytics layer that transforms raw execution telemetry into ranked, comparable agent scores via a three-stage extract-aggregate-ELO pipeline. Oracle consumes KOTH scores and maintains per-agent Beta distributions, applying Thompson Sampling to balance exploration of newer agents against exploitation of proven performers across the active tracked set. Two families of enhancement plugins close the quality loop: the skill-creator and agent-creator plugins provide composable evaluation pipelines (executor, grader, comparator, analyzer) that apply blind A/B comparison to eliminate evaluator bias; the agent-evolution-framework and thoughtbox plugins operate at higher abstraction, extracting production-proven patterns from mature codebases and coordinating autonomous discovery-filter-experiment-evaluate-integrate cycles.

Together, these systems form a closed feedback loop where Oracle routes tasks to the best-performing agents, those agents execute and produce telemetry, KOTH scores the telemetry, and Oracle updates its Beta distributions to reflect new evidence. This self-reinforcing architecture produces compounding quality gains without requiring manual intervention after initial configuration.

Tags: Agent Performance & Telemetry · Bayesian Ranking & Selection · Autonomous Quality Improvement
Cite this paper

KellerAI. (2026, May 22). KOTH and Oracle: Technical Deep Dive. KellerAI. https://kellerai.blog/koth-oracle-in-depth

Paper Details
Category: Earned Autonomy & Agents
Audience: Engineering teams, platform architects, security leads, and compliance officers responsible for AI system governance, agent evaluation, and quality assurance infrastructure
Method: Protocol-level architecture analysis (extract-aggregate-ELO, Thompson Sampling, blind A/B comparison) + multi-system integration design (KOTH, Oracle, skill-creator, agent-creator, agent-evolution-framework, thoughtbox) + autonomy-loop design (five-phase discovery-filter-experiment-evaluate-integrate) + regulatory overlay (NIST AI RMF, SR 11-7, EU AI Act, ISO/IEC 42001)
Length: ~7,500 · ~31 min
Reading level: Technical
Sections: 13
References: 47
Version: v1.0 · Updated May 2026
Published: May 2026
Key Takeaways
  • A closed-loop feedback architecture—from task routing through telemetry collection, ELO scoring, and blind comparative evaluation—produces compounding quality gains where each execution sharpens the next routing decision.
  • Thompson Sampling over Beta distributions eliminates the need for manual exploration-rate tuning, automatically balancing aggressively sampling uncertain agents against consistently selecting proven performers.
  • Blind A/B comparison, where a comparator judges two agent versions without knowing which is the candidate, structurally eliminates anchoring bias in evaluation—the bias that makes self-evaluation of work you authored unreliable.
Section 01

Abstract

As plugin ecosystems for large language model (LLM) orchestration platforms grow, two problems emerge that resist ad-hoc solutions: selecting the right agent for an incoming task, and systematically improving agent and skill quality over time. This paper documents the architecture, mechanisms, and integration points of four cooperating systems in the kellerai-dev-marketplace that address these problems.

KOTH (King of the Hill) provides an ELO-based competitive leaderboard derived from telemetry data, implementing an extract-aggregate-ELO pipeline that surfaces performance regressions and ranks agents objectively. Oracle consumes KOTH scores and maintains per-agent Beta distributions, applying Thompson Sampling to balance exploration of newer agents against exploitation of proven performers across 888 active tracked agents. 1

Two families of enhancement plugins close the quality loop. The skill-creator and agent-creator plugins provide composable evaluation pipelines — executor, grader, comparator, and analyzer agents — that apply blind A/B comparison to eliminate evaluator bias and drive iterative improvement. 2 The agent-evolution-framework and thoughtbox plugins operate at higher abstraction: the former extracts production-proven patterns from mature codebases and generates KOTH-wired framework components, reducing agent creation time from two to three days to two to three hours; the latter coordinates autonomous Discovery-Filter-Experiment-Evaluate-Integrate cycles that require no manual phase-by-phase orchestration. 3

Together, these systems form a closed feedback loop: Oracle routes tasks to the best-performing agents, those agents execute and produce telemetry, KOTH scores the telemetry, and Oracle updates its Beta distributions to reflect new evidence. This self-reinforcing architecture produces compounding quality gains without requiring manual intervention after initial configuration.

Section 02

Introduction

Plugin ecosystems for Claude Code expose a scaling problem that does not exist when a developer works with a small, hand-curated set of agents. As the number of available agents and skills grows into the hundreds — and kellerai-dev-marketplace's Oracle currently tracks 1,300 agents with 888 active 4 — two bottlenecks become critical.

The Selection Problem

The first is selection: given a task description, which agent is most likely to succeed? Answering this question correctly requires historical performance data, a model of uncertainty about newly registered agents, and a mechanism that improves as more outcomes are recorded. Naive approaches — alphabetical ordering, manual tagging, or always selecting the most recently installed agent — fail quickly at scale.

Consider a simple illustrative scenario: a platform has 800 agents across 40 domains. Without performance data, a task routing system must guess. With Oracle's Thompson Sampling over Beta distributions, each agent's probability of success is estimated from historical outcomes, uncertainty is quantified, and selections improve measurably with every completed task.

The Quality-Improvement Problem

The second is quality improvement: once an agent is selected and executes, how does the system detect that its behavior has regressed, or that a competing skill formulation would produce better outputs? Without a structured evaluation pipeline, quality improvement depends entirely on the developer's subjective assessment, which is subject to anchoring bias (favoring the version they most recently worked on) and inconsistent application. 5

Why Naive Heuristics Fail

Three naive heuristics fail systematically at scale. First, recency-based selection (always pick the newest agent) ignores accumulated evidence and treats every update as an improvement by assumption. Second, manual A/B comparison falls victim to the evaluator knowing which version they wrote, introducing structural anchoring bias. Third, threshold-based routing (pick the agent with the highest historical win rate) ignores uncertainty: an agent with 2 wins out of 2 is less trustworthy than one with 90 wins out of 100, yet a naive win-rate comparison ranks the first agent higher, at 100% versus 90%.
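
The gap between those two track records can be made concrete with a short sketch. It assumes a uniform Beta(1, 1) prior (the same prior the paper later describes for new agents) and uses the closed-form mean and standard deviation of the Beta posterior. The helper name `posterior_summary` is illustrative and not part of Oracle.

```python
import math

def posterior_summary(wins: int, trials: int) -> tuple[float, float]:
    """Mean and standard deviation of the Beta posterior under a Beta(1, 1) prior."""
    a = 1 + wins                 # prior pseudo-success plus observed wins
    b = 1 + (trials - wins)      # prior pseudo-failure plus observed losses
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std

small = posterior_summary(2, 2)      # raw win rate 100%
large = posterior_summary(90, 100)   # raw win rate 90%
print(f"2/2:    mean={small[0]:.3f}, std={small[1]:.3f}")
print(f"90/100: mean={large[0]:.3f}, std={large[1]:.3f}")
```

Under these assumptions the 90-of-100 agent has both the higher posterior mean and a far narrower spread, which is the ordering a raw win-rate comparison gets wrong.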

The four systems documented in this paper address these bottlenecks in a unified architecture. Sections 03 and 04 cover KOTH and Oracle, the measurement and routing layer. Section 05 traces the end-to-end feedback loop. Sections 06 through 08 cover the skill-creator evaluation pipeline. Sections 09 and 10 cover agent-creator and agent-evolution-framework. Sections 11 and 12 cover Thoughtbox and the autonomous improvement loop. Sections 13 and 14 document system integration and references.

Section 03

KOTH Architecture

KOTH (King of the Hill) is the performance analytics layer that transforms raw execution telemetry into ranked, comparable agent scores. It is implemented as an MCP server accessible via the Archangel gateway at the koth_ tool prefix. 6

Extract-Aggregate-ELO Pipeline

KOTH implements a three-stage pipeline declared in its tool interface:

KOTH pipeline stages
Python
extract → aggregate → elo

The pipeline is invoked on demand rather than continuously, accepting a configurable timeout (default 120 seconds, call timeout 180 seconds):

Pipeline invocation
Python
koth_koth_run_pipeline(timeout_seconds=120)

Telemetry data is stored at ~/.claude/telemetry/ and serves as the raw input to the extract phase. The aggregate phase consolidates per-execution records by agent identity, computing win rates and confidence intervals. The ELO phase applies a standard competitive rating algorithm, producing a total ordering of agents that is stable under small sample fluctuations but responsive to sustained performance changes. 7
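
The paper does not specify which rating variant KOTH uses, so the following is only a sketch of the textbook Elo update for a single pairwise comparison. The K-factor of 32, the 400-point scale, and the function names are conventional defaults assumed for illustration, not values taken from KOTH.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one head-to-head result."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)          # zero-sum: what A gains, B loses
    return rating_a + delta, rating_b - delta

# Two equally rated agents; A wins the comparison
new_a, new_b = elo_update(1500.0, 1500.0, a_won=True)
```

Because each update is bounded by K, a single unlucky result moves a rating only slightly, while a sustained run of wins or losses accumulates. That is consistent with the stability and responsiveness described above.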

Four Tools

KOTH exposes four tools that cover the full analytics surface:

| Tool | Purpose |
| --- | --- |
| koth_koth_run_pipeline | Execute the full extract-aggregate-ELO pipeline |
| koth_koth_status | Retrieve current standings and top 10 agents by ELO |
| koth_koth_agent_stats | Detailed per-agent performance with trend analysis |
| koth_koth_check_regressions | Detect agents with declining ELO, win rate, or confidence |

Regression detection accepts a severity threshold (warning or critical), making it suitable for integration into CI pipelines or pre-commit hooks that should alert on quality degradation before deployment. 8

Telemetry Sources

Telemetry flows into KOTH from two sources. Agents generated by the agent-evolution-framework carry a @koth_tracked decorator that records success or failure automatically on every execution. All other agents rely on explicit oracle_oracle_feedback calls from the orchestrating code after task completion.

This dual-source design reflects the practical reality of a mixed ecosystem: legacy agents and agents created before the framework shipped still contribute telemetry if callers record feedback correctly, while newly generated agents wire themselves in from the first execution. The extract phase normalizes both formats into a unified record schema before aggregation.

Relationship to Oracle

KOTH and Oracle are complementary but separate systems. KOTH handles the analytics and leaderboard side: it processes historical data into scores. Oracle handles the selection side: it uses those scores, combined with Bayesian uncertainty estimates, to route incoming tasks. This separation of concerns allows either system to be updated independently. 9

Section 04

Oracle & Thompson Sampling

Oracle is the agent selection recommender. It maintains a Beta distribution for each tracked agent and applies Thompson Sampling to solve the multi-armed bandit problem inherent in agent selection: how to balance trying newer, less-tested agents against relying on proven performers. 10

Beta(α, β) Distributions

Thompson Sampling treats each agent's success probability as a Beta(α, β) distribution. Each time Oracle receives feedback that an agent succeeded, α is incremented. Each failure increments β; a partial success either increments β or, when a quality score is supplied, increments α by that score. When a query arrives, Oracle samples a probability from each candidate agent's Beta distribution and returns the agents with the highest samples.

The Beta distribution is analytically tractable as the conjugate prior to the Bernoulli likelihood. This means posterior updates are closed-form: no numerical integration is required to update the distribution after each observed outcome, making per-task feedback updates computationally negligible. 11
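
A minimal sketch of that closed-form update follows. The success and failure branches mirror the description above. The fractional split for a partial outcome (quality score to α, the remainder to β) is one plausible reading of the text and is an assumption, as are the class and function names.

```python
from dataclasses import dataclass

@dataclass
class BetaPosterior:
    alpha: float = 1.0   # Beta(1, 1) prior: one pseudo-success
    beta: float = 1.0    # and one pseudo-failure

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

def apply_feedback(post: BetaPosterior, outcome: str, quality_score: float = 1.0) -> BetaPosterior:
    """Conjugate Beta-Bernoulli update: add counts, no integration required."""
    if outcome == "success":
        return BetaPosterior(post.alpha + 1.0, post.beta)
    if outcome == "failure":
        return BetaPosterior(post.alpha, post.beta + 1.0)
    if outcome == "partial":
        # ASSUMPTION: split one unit of evidence by the quality score
        q = min(max(quality_score, 0.0), 1.0)
        return BetaPosterior(post.alpha + q, post.beta + (1.0 - q))
    raise ValueError(f"unknown outcome: {outcome!r}")

post = BetaPosterior()
for outcome in ["success", "success", "failure"]:
    post = apply_feedback(post, outcome)
```

Each update is two additions, which is why per-task feedback stays cheap no matter how much history an agent has accumulated.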

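Oracle Thompson Sampling (runnable sketch)
Python
# Oracle Thompson Sampling: p_success[agent] ~ Beta(alpha[agent], beta[agent])
import random

def select_agent(candidates, alpha, beta):
  # Draw one plausible success probability per candidate from its Beta posterior
  samples = {agent: random.betavariate(alpha[agent], beta[agent]) for agent in candidates}
  # Route the task to the agent with the highest sampled probability
  selected = max(samples, key=samples.get)
  return selected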

Cold-Start Mechanics

A new agent with no history has a flat Beta(1, 1) distribution, which is equivalent to a uniform distribution over [0, 1]. This means it is occasionally sampled and given a chance to demonstrate its capabilities, but it does not crowd out established performers with proven track records.

As feedback accumulates, confidence concentrates and the distribution narrows. An agent with 100 successes and 5 failures has a Beta(101, 6) distribution — highly concentrated around 0.94. Its Thompson Sampling draws are almost always near that value, so it reliably beats newer agents in head-to-head sampling comparisons. But it can be displaced by an agent that demonstrates even higher success rates over sufficient trials. 12
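
The head-to-head behavior can be checked with a small simulation. This sketch uses only Python's standard `random.betavariate`; the trial count and seed are arbitrary choices, and the function name is illustrative.

```python
import random

def newcomer_win_rate(trials: int = 20_000, seed: int = 0) -> float:
    """Fraction of Thompson draws in which a Beta(1, 1) newcomer outsamples Beta(101, 6)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        newcomer = rng.betavariate(1, 1)      # no history: uniform on [0, 1]
        veteran = rng.betavariate(101, 6)     # 100 successes, 5 failures
        if newcomer > veteran:
            wins += 1
    return wins / trials

rate = newcomer_win_rate()
exact = 6 / 107   # P(U > X) = 1 - E[X] for U uniform and independent of X
print(f"simulated={rate:.3f}, exact={exact:.3f}")
```

The exact value is about 0.056, so the newcomer wins roughly one head-to-head draw in eighteen: enough to gather evidence, not enough to displace the veteran.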

1,300+ Agents at Scale

Oracle currently tracks 1,300 total agents, of which 888 are active. Top performers as of the current dataset include:

  • Bash:bd — 95.4% success rate

  • Edit — 91.4% success rate

  • Read — 87.4% success rate

These statistics are from live production data reported via oracle_oracle_status. At this scale, the Thompson Sampling algorithm remains efficient because sampling from a Beta distribution is O(1) per agent, and the argmax over candidates is linear in the candidate set size. 13

Oracle Tool Interface

Oracle tool surface
Python
oracle_oracle_query(task_description, limit=3, min_confidence=0)
oracle_oracle_feedback(agent_name, task_type, outcome, quality_score=1.0)
oracle_oracle_search(query, search_type="hybrid", limit=10)
oracle_oracle_status()

The oracle_oracle_feedback tool is the critical link in the feedback loop. It accepts outcome as "success", "failure", or "partial", with an optional quality_score for nuanced feedback on partial completions. The database is stored at ~/.claude/oracle/oracle.db. 14

Oracle's oracle_oracle_search tool supports three search modes — keyword, semantic, and hybrid — making it useful both for exact-match routing (find agents with a specific tool in their name) and semantic routing (find agents that handle "data visualization tasks" even if those words do not appear verbatim in the agent name).

Section 05

The Feedback Loop

The end-to-end feedback cycle operates across eight steps that span all four major systems. Understanding this cycle is prerequisite to understanding why the individual system designs take the forms they do.

Eight-Step Cycle

  1. Task arrives. oracle_oracle_query samples from per-agent Beta distributions and returns the top-ranked candidates for the incoming task description.

  2. Agent executes. The selected agent runs the task. If it carries a @koth_tracked decorator, the outcome is recorded to ~/.claude/telemetry/ automatically.

  3. Feedback is recorded. oracle_oracle_feedback is called with the outcome. Oracle immediately updates the agent's Beta(α, β) distribution.

  4. ELO recalculation runs. Periodically (or on demand), koth_koth_run_pipeline processes accumulated telemetry through the extract-aggregate-ELO pipeline, updating the global leaderboard.

  5. Regressions are detected. koth_koth_check_regressions surfaces agents with declining ELO, win rate, or confidence intervals.

  6. Improvement is triggered. For flagged agents, skill-creator or agent-creator is invoked to run the evaluation-improvement pipeline.

  7. Improved agents re-enter Oracle. The winning version replaces the previous best. Oracle's existing Beta distribution for the agent name continues accumulating feedback; improved performance shifts α upward.

  8. Cycle compounds. Better agents get more tasks, more tasks produce more telemetry, more telemetry sharpens ELO scores, sharper scores drive better routing.

The Criticality of oracle_feedback

The loop is not fully automatic — calling oracle_oracle_feedback after each task completion is the developer's responsibility, and it is marked "CRITICAL" in the Oracle reference documentation. 15 Failing to call feedback breaks the loop and causes Oracle's distributions to stagnate. An agent that executes 1,000 tasks without feedback remains at Beta(1, 1) — indistinguishable from a brand-new, untested agent in Oracle's eyes.

This is the single most common integration failure in practice: teams that route through Oracle but neglect the feedback call see no improvement in routing quality over time, and typically conclude (incorrectly) that Oracle "doesn't work." In these cases the root cause is the missing feedback step, not the sampling algorithm.
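
One way to make the feedback call hard to forget is to wrap routing, execution, and feedback in a single helper. The sketch below is hypothetical: `query`, `execute`, and `feedback` are caller-supplied stand-ins for however the Oracle tools and the agent are actually invoked, and the assumption that a query returns a list of agent names is for illustration only.

```python
from typing import Any, Callable, Sequence

def run_with_feedback(
    task_description: str,
    task_type: str,
    query: Callable[[str], Sequence[str]],        # stand-in for oracle_oracle_query
    execute: Callable[[str, str], Any],           # hypothetical: runs the chosen agent
    feedback: Callable[[str, str, str], None],    # stand-in for oracle_oracle_feedback
) -> Any:
    """Route a task, run it, and always report the outcome, even on failure."""
    agent_name = query(task_description)[0]       # take the top-ranked candidate
    try:
        result = execute(agent_name, task_description)
    except Exception:
        feedback(agent_name, task_type, "failure")
        raise                                     # callers still see the original error
    feedback(agent_name, task_type, "success")
    return result
```

Keeping the feedback call inside the helper means a raised exception still produces a "failure" record, so the agent's distribution keeps moving even when tasks go wrong.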

What Happens When the Loop Breaks

When feedback is missing, Thompson Sampling degrades to approximate uniform random selection from the uninitialized pool. When KOTH pipeline runs are skipped, ELO scores freeze and Oracle loses its signal from the leaderboard dimension. When skill-creator improvement is not triggered on regressions, the quality floor erodes silently. Each break is individually tolerable but collectively they eliminate the compounding advantage the architecture was designed to provide.

Section 06

Skill-Creator Building Blocks

The skill-creator plugin provides a structured pipeline for creating, evaluating, and iteratively improving Claude Code skills. It is distributed as a marketplace plugin (KellerAI's skill-creator plugin) and installed via the kellerai-dev-marketplace. 16

The skill-creator's design is modular. Four composable building blocks, each implemented as a dedicated subagent, form the foundation of all higher-level workflows.

Executor (Transcript + Metrics)

The executor runs a single eval case: it reads the skill's SKILL.md and referenced files, stages input files, executes the skill's instructions against the eval prompt, saves outputs, and produces three artifacts:

  • transcript.md — a detailed step-by-step execution log

  • metrics.json — tool call counts, step count, output character count, and error count

  • user_notes.md — uncertainties, workarounds, and suggestions for human review

The executor is explicitly designed to surface problems that might otherwise be masked by a "successful" execution. The user_notes.md file is mandatory even when execution is clean, with the instruction to write "No uncertainties, issues, or suggestions to report" rather than omitting the file. 17

Grader (Structured Assertions)

The grader has two responsibilities: grade the outputs, and critique the evals themselves. For grading, the grader reads the transcript and output files, then evaluates each assertion from evals.json as PASS or FAIL — no partial credit. Evidence is required for every verdict.

The second responsibility — eval critique — is specifically designed to prevent false confidence. The grader flags assertions that would pass even for clearly wrong outputs, outcomes that no assertion checks, and assertions that cannot be verified from the available evidence. 18

Grading output schema
JSON
{
  "expectations": [{"text": "...", "passed": true, "evidence": "..."}],
  "summary": {"passed": 2, "failed": 1, "total": 3, "pass_rate": 0.67},
  "claims": [{"claim": "...", "type": "factual|process|quality", "verified": true, "evidence": "..."}],
  "user_notes_summary": {"uncertainties": [], "needs_review": [], "workarounds": []},
  "eval_feedback": {"suggestions": [], "overall": "..."}
}
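
The summary block can be derived mechanically from the per-assertion verdicts. A minimal sketch, assuming the field names shown in the schema above and two-decimal rounding for pass_rate (the rounding rule is an assumption inferred from the 0.67 in the example):

```python
def summarize(expectations: list[dict]) -> dict:
    """Build the grader's summary block from PASS/FAIL verdicts (no partial credit)."""
    total = len(expectations)
    passed = sum(1 for e in expectations if e["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }

# Hypothetical verdicts for a three-assertion eval case
summary = summarize([
    {"text": "output file exists", "passed": True, "evidence": "..."},
    {"text": "header row present", "passed": True, "evidence": "..."},
    {"text": "totals are correct", "passed": False, "evidence": "..."},
])
```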

Comparator (3-Run Majority Vote)

The comparator is the bias elimination mechanism. It receives two output directories labeled only "A" and "B" — it does not know which version produced which output — and judges quality using a structured rubric covering content (correctness, completeness, accuracy) and structure (organization, formatting, usability). Each criterion is scored 1-5, dimension averages are computed, and an overall score scaled to 1-10 is produced. Assertion pass rates are treated as secondary evidence, not the primary decision factor. 19
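
The scoring arithmetic can be sketched as follows. The paper does not say how the two dimension averages are combined or rescaled, so averaging them with equal weight and doubling the result is an assumption, as is the function name.

```python
from statistics import fmean

def rubric_score(content: dict[str, int], structure: dict[str, int]) -> dict[str, float]:
    """Average 1-5 criterion scores per dimension, then rescale to an overall score."""
    for name, score in {**content, **structure}.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} must be scored 1-5, got {score}")
    content_avg = fmean(content.values())
    structure_avg = fmean(structure.values())
    # ASSUMPTION: equal weighting; doubling maps the 1-5 average onto 2-10
    overall = 2 * fmean([content_avg, structure_avg])
    return {"content": content_avg, "structure": structure_avg, "overall": round(overall, 1)}

# Hypothetical criterion scores for one blinded output
scores = rubric_score(
    {"correctness": 5, "completeness": 4, "accuracy": 4},
    {"organization": 3, "formatting": 4, "usability": 5},
)
```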

Analyzer (Ranked Suggestions)

After the comparator declares a winner, the analyzer "unblinds" the comparison: it now reads both skill versions and both transcripts, knowing which was the winner and which was the loser. Its purpose is to extract actionable improvement suggestions by explaining why the winner won.

The analyzer categorizes suggestions by priority (high, medium, low) and type (instructions, tools, examples, error_handling, structure, references). High-priority suggestions are those that would likely have changed the outcome of the comparison. 20

Section 07

Four Evaluation Modes

The four building blocks combine into four higher-level workflows that address different phases of the skill lifecycle. 21

| Mode | Workflow | Subagents |
| --- | --- | --- |
| Eval | Executor → Grader → Results | Optional |
| Improve | Executor → Grader → Comparator → Analyzer → Apply | Recommended |
| Create | Interview → Research → Draft → Run → Refine | Optional |
| Benchmark | 3× runs per config → Aggregate → Analyze | Required |

Eval Mode

Eval mode runs a single eval case and grades expectations against it. It supports a comparison between "with skill" and "without skill" execution to measure the skill's contribution to the output quality.

Improve Mode

Improve mode is the core iteration loop. For each iteration, three executor runs are launched in parallel (when subagents are available), each independently graded. The comparator then blindly evaluates the new version against the current best. If the new version wins majority vote, it becomes the new best; otherwise the previous best is retained. The coordinator tracks a history.json file recording version progression, pass rates, and win/loss results across all iterations. 22

Create Mode

Create mode is interactive: the coordinator interviews the user to capture intent, researches edge cases and best practices, drafts the skill, creates test cases, and runs the first few evals in the main agent loop (not as subagents) so the user can observe the transcript directly. This design prioritizes transparency during the initial creation phase.

Benchmark Mode

Benchmark mode requires subagents. It runs every eval case three times per configuration, produces variance statistics (mean, stddev, min, max) across runs, always includes a no-skill baseline for comparison, and uses the most capable available model for analysis. The variance statistics are the key output: a skill with high mean performance but high variance is less reliable than one with slightly lower mean but tightly concentrated outcomes.
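
The variance statistics are straightforward to compute with the standard library. In this sketch the sample standard deviation is used; whether the plugin reports sample or population deviation is not stated, so that choice and the example pass rates are assumptions.

```python
from statistics import mean, stdev

def run_stats(pass_rates: list[float]) -> dict[str, float]:
    """Summarize repeated runs of one configuration."""
    return {
        "mean": mean(pass_rates),
        "stddev": stdev(pass_rates),   # sample standard deviation; needs >= 2 runs
        "min": min(pass_rates),
        "max": max(pass_rates),
    }

steady = run_stats([0.80, 0.80, 0.85])    # hypothetical three-run results
erratic = run_stats([1.00, 0.50, 1.00])   # slightly higher mean, far wider spread
```

Here the erratic configuration has the higher mean but roughly ten times the standard deviation, the trade-off the paragraph above describes.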

History Tracking

Across all modes, the coordinator maintains history.json as an append-only audit trail. Each entry records the version number, timestamp, pass rate, comparator result, and the diff summary from the analyzer. This provides complete auditability of the improvement trajectory and enables rollback to any prior version if a subsequent iteration regresses.

Section 08

Blind Comparison & Bias Elimination

The blind comparison mechanism is the most important innovation in skill-creator's design. Without it, the evaluating agent has inherent access bias: it knows which version it just wrote and is structurally predisposed to find merit in it. Even a well-intentioned evaluator will anchor to recently generated content. 23

Randomization Protocol

The randomization protocol is explicit: the coordinator randomly assigns whether the new version is "A" or "B" (50% probability each), records the assignment in assignment.json, and passes only the paths to the comparator — never the version names. The comparator receives two directories and a rubric. It has no access to commit history, version metadata, or authorship information.
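
A minimal sketch of that protocol is shown below. The JSON layout inside assignment.json, the neutral directory names, and the function name are hypothetical; only the 50/50 assignment, the recorded mapping, and the paths-only handoff come from the description above.

```python
import json
import random
import tempfile
from pathlib import Path

def assign_blind(new_dir: str, best_dir: str, out_dir: Path, rng: random.Random) -> dict[str, str]:
    """Randomly label the two output directories and record the mapping for later unblinding."""
    new_is_a = rng.random() < 0.5                      # 50% chance the candidate is "A"
    labels = {"A": "new", "B": "best"} if new_is_a else {"A": "best", "B": "new"}
    paths = {"new": new_dir, "best": best_dir}
    # The mapping is written for the coordinator and analyzer, not the comparator
    (out_dir / "assignment.json").write_text(json.dumps(labels, indent=2))
    # The comparator receives paths keyed by label, never version names
    return {label: paths[version] for label, version in labels.items()}

with tempfile.TemporaryDirectory() as tmp:
    comparator_input = assign_blind("runs/7f3a", "runs/c21e", Path(tmp), random.Random())
    recorded = json.loads((Path(tmp) / "assignment.json").read_text())
```

The example uses opaque directory names on purpose: a path such as `outputs/v1` would leak the version identity even though the label does not.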

Workspace Anonymization

The workspace structure preserves the chain of custody:

Workspace structure
Python
skill-name-workspace/
└── grading/
  └── v1-vs-v0/
      ├── assignment.json      # Which version is A vs B
      ├── comparison-1.json    # Run 1 blind comparison
      ├── comparison-2.json    # Run 2 blind comparison
      ├── comparison-3.json    # Run 3 blind comparison
      └── analysis.json        # Post-hoc analysis (with identities)

Double-Blind Principle

This two-phase design structurally separates the quality judgment from knowledge of authorship, the closest practical equivalent to a double-blind study in an automated evaluation context. The comparator produces its verdict knowing only the labels "A" and "B", not which version each label denotes. Only the analyzer, in a second pass, reads both the verdict and the version identities together. 24

Majority Vote

The winner is determined by majority vote across three independent comparator runs. Running three comparators rather than one reduces the influence of single-run stochasticity in the LLM evaluation process:

Majority vote rule
Python
If 2+ comparators prefer A → A wins
If 2+ comparators prefer B → B wins
Otherwise → TIE

A TIE result retains the previous best. This conservative default, preferring the known-good version under uncertainty, reflects the asymmetric cost of regression: a false improvement that degrades quality is worse than a missed improvement, which merely leaves quality where it was.
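
The vote rule translates directly into code. This sketch assumes each comparator run reports "A", "B", or "TIE"; the function name is illustrative.

```python
from collections import Counter

def majority_vote(verdicts: list[str]) -> str:
    """Return "A" or "B" if at least two runs agree on it, otherwise "TIE"."""
    counts = Counter(verdicts)
    for label in ("A", "B"):
        if counts[label] >= 2:
            return label
    return "TIE"

result = majority_vote(["A", "B", "A"])
```

Under that assumption, three verdicts produce a TIE only when at least one comparator abstains, for example one vote each for A, B, and TIE.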

Why This Destroys Anchoring Bias

Anchoring bias in LLM evaluation arises because the model has in-context access to the most recently generated content when it evaluates it. Blinding prevents this by interposing an information barrier between generation and evaluation. The comparator cannot anchor to "the version I just wrote" because it has no information about which version was written when. Its judgment is forced to derive entirely from the observable quality of the outputs themselves.

Section 09

Agent-Creator: Dual Dimensions

The agent-creator plugin is a direct analogue of skill-creator for Claude Code agents rather than skills. It shares the same four-building-block architecture (executor, grader, comparator, analyzer) and four operating modes, but with a critical distinction: agent evaluation has two independent quality dimensions that must be measured separately. 25

Triggering Accuracy

An agent's description field in its YAML frontmatter is the sole mechanism by which Claude decides whether to invoke the agent for a given user message. An agent with a perfectly written system prompt is worthless if it never triggers on the prompts it should handle, or degrades user experience by triggering on prompts it should ignore.

The agent-creator's executor explicitly records a Triggering Decision section in every transcript, quoting from the description's triggering conditions or examples that match or fail to match the prompt. A pass requires not just that the triggering decision matched the expectation, but that the reasoning cited genuine matches between the description's examples and the prompt — not superficial keyword coincidence. 26

Instruction-Following Quality

The second dimension measures what happens after the agent triggers: does it follow its system prompt? Does it produce output in the expected format? Does it complete the task?

The grader's triggering_summary field tracks whether the triggering decision was correct, while behavior expectations capture instruction-following quality independently. This separation is deliberate: an agent that always triggers but ignores its system prompt requires a different fix than one that rarely triggers. 27

evals.json Schema

The agent-creator's eval schema includes a should_trigger boolean that is absent from skill-creator's schema, reflecting this additional dimension:

evals.json with triggering dimension
JSON
{
  "agent_name": "example-agent",
  "evals": [
    {
      "id": 1,
      "prompt": "User's triggering prompt",
      "expected_behavior": "Agent should trigger and do X",
      "should_trigger": true,
      "assertions": [
        "The agent triggered on this prompt",
        "The agent followed its system prompt instructions"
      ]
    }
  ]
}

Comparison Rubric

The blind comparison rubric for agent-creator extends the skill-creator rubric with triggering-specific dimensions, producing separate scores for triggering accuracy and behavior quality that combine into an overall score. 28

The grading output includes a dedicated triggering_summary block:

triggering_summary schema
JSON
{
  "triggering_summary": {
    "triggered": true,
    "expected_trigger": true,
    "trigger_correct": true,
    "trigger_reasoning": "Description example #2 closely matched the prompt phrasing"
  }
}
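
The trigger_correct flag is a comparison of what happened against what the eval expected. The sketch below derives it and aggregates a triggering accuracy across several evals; the aggregate metric and the function names are illustrative additions, not fields the plugin is documented to emit.

```python
def triggering_summary(triggered: bool, expected_trigger: bool, reasoning: str) -> dict:
    """Build a triggering_summary block for one eval case."""
    return {
        "triggered": triggered,
        "expected_trigger": expected_trigger,
        "trigger_correct": triggered == expected_trigger,
        "trigger_reasoning": reasoning,
    }

def triggering_accuracy(summaries: list[dict]) -> float:
    """Share of eval cases whose triggering decision matched the expectation."""
    return sum(s["trigger_correct"] for s in summaries) / len(summaries)

# Hypothetical eval outcomes: two correct decisions and one false trigger
cases = [
    triggering_summary(True, True, "matched description example"),
    triggering_summary(False, False, "prompt outside described scope"),
    triggering_summary(True, False, "keyword overlap only"),
]
accuracy = triggering_accuracy(cases)
```

Tracking this number separately from behavior pass rates keeps the two failure modes apart: a low triggering accuracy points at the description field, not the system prompt.
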
Section 10

Agent-Evolution-Framework

While agent-creator addresses iterative improvement of existing agents, the agent-evolution-framework addresses agent creation velocity. Its stated mission is direct: "Reduce agent creation time from 2-3 days to 2-3 hours." 29

Three-Component Architecture

The framework consists of three components with distinct roles and model assignments:

  • evolution-worker-intake (user-invocable skill): The entry point. Accepts a codebase path and optional flags (--analyze, --epic-only, --generate-beads), detects the framework type (Phidata, LangChain, AutoGen, CrewAI, DSPy, Instructor), and dispatches to evolution-worker.

  • evolution-worker (fork-context skill, Opus model): The deep analysis worker. Runs in a fork context with the Opus model for maximum reasoning capability. Not user-invocable — dispatched by intake.

  • evolution-orchestrator-agent (not user-invocable, Opus model): The master architect. Drives reconnaissance, deep analysis, framework generation, KOTH wiring, beads creation, and validation. 30

Eight Production Patterns

The framework's knowledge base consists of eight production-proven patterns extracted from major AI framework codebases, organized by priority and mapped to generated components:

| Priority | Pattern | Generated Component |
| --- | --- | --- |
| P0 | Two-Knowledge Architecture | TwoKnowledgeAgent base class |
| P0 | Multi-Mode Evaluation | EvaluationPipeline framework |
| P1 | Context Layer System | ContextLayerManager |
| P1 | Factory Tools with DI | ToolFactory pattern |
| P1 | JSON Knowledge Schema | Schema + KnowledgeLoader |
| P2 | Test Case Evolution | TestCaseEvolution system |
| P2 | Pattern Discovery | PatternDiscovery service |
| P2 | Agent Config DSL | AgentConfigLoader |

Detect + Extract Pattern

Each pattern has a detection signal and an extraction target. For example, the Two-Knowledge Architecture pattern detects separate knowledge and learnings stores with different update patterns, then extracts a TwoKnowledgeAgent base class with separated static_knowledge and dynamic_learnings fields, plus a LearningMachine that handles promotion between them. 31

@koth_tracked Decorator

Every agent generated by the framework receives a @koth_tracked decorator, wiring it into the KOTH-Oracle feedback loop from the moment of creation. The decorator wraps the agent's run method to record outcomes to the Thompson Sampling telemetry system:

@koth_tracked decorator implementation
Python
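import functools

# track_agent_outcome is assumed to be supplied by the framework's telemetry layer
def koth_tracked(category: str):
    def decorator(cls):
        original_run = cls.run

        @functools.wraps(original_run)  # keep the wrapped method's name and docstring
        async def tracked_run(self, *args, **kwargs):
            try:
                result = await original_run(self, *args, **kwargs)
            except Exception as e:
                # Failure: record the error, then re-raise so callers still see it
                track_agent_outcome(
                    agent=cls.__name__, category=category,
                    success=False, metadata={"error": str(e)}
                )
                raise
            # Success is recorded outside the try block so that a telemetry
            # error is never misreported as an agent failure
            track_agent_outcome(
                agent=cls.__name__, category=category,
                success=True, metadata={"args": args}
            )
            return result

        cls.run = tracked_run
        return cls
    return decorator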

This means a newly generated agent begins feeding performance data into the Oracle/KOTH feedback loop immediately, without requiring manual telemetry integration. 32
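To make the behavior concrete, here is a self-contained usage sketch. The track_agent_outcome stub, the OUTCOMES list, and the SummaryAgent class are hypothetical stand-ins for illustration; the real telemetry function writes to the telemetry layer rather than to an in-memory list.

```python
import asyncio
import functools

OUTCOMES: list[dict] = []  # in-memory stand-in for the telemetry store


def track_agent_outcome(agent, category, success, metadata):
    """Hypothetical stub for the real telemetry call."""
    OUTCOMES.append({"agent": agent, "category": category,
                     "success": success, "metadata": metadata})


def koth_tracked(category: str):
    def decorator(cls):
        original_run = cls.run

        @functools.wraps(original_run)
        async def tracked_run(self, *args, **kwargs):
            try:
                result = await original_run(self, *args, **kwargs)
            except Exception as exc:
                track_agent_outcome(cls.__name__, category, False, {"error": str(exc)})
                raise
            track_agent_outcome(cls.__name__, category, True, {"args": args})
            return result

        cls.run = tracked_run
        return cls
    return decorator


@koth_tracked(category="summarization")
class SummaryAgent:
    async def run(self, text: str) -> str:
        if not text:
            raise ValueError("empty input")
        return text[:20]


async def demo() -> None:
    agent = SummaryAgent()
    await agent.run("a long document body")
    try:
        await agent.run("")
    except ValueError:
        pass  # the failure was recorded before the exception propagated


asyncio.run(demo())
```

After demo runs, OUTCOMES holds one success record and one failure record, both attributed to SummaryAgent, and the agent class itself contains no telemetry code.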

Jinja2 Templates & Dependency Ordering

Framework components are generated from Jinja2 templates organized by component type. Generation follows a strict dependency order: protocols (no dependencies) are generated first, then base classes (depend on protocols), then factories (depend on base classes), then CLI scaffolding (depends on all), then tests (depend on all). This ordering prevents import failures in generated code. 33
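The ordering described above is a topological sort of the component dependency graph. The sketch below derives it with the standard library's graphlib; the DEPENDENCIES mapping and its key names are an illustrative encoding of the prose, not the framework's actual configuration.

```python
from graphlib import TopologicalSorter

# component type -> component types it depends on (encoded from the order described above)
DEPENDENCIES = {
    "protocols": set(),
    "base_classes": {"protocols"},
    "factories": {"base_classes"},
    "cli": {"protocols", "base_classes", "factories"},
    "tests": {"protocols", "base_classes", "factories", "cli"},
}


def generation_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every component follows all of its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = generation_order(DEPENDENCIES)
print(order)  # ['protocols', 'base_classes', 'factories', 'cli', 'tests']
```

Deriving the order from declared dependencies rather than hard-coding it has one practical benefit: TopologicalSorter raises graphlib.CycleError if a template change ever introduces a circular dependency.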

Beads Epics & Parallel Analysis

The framework automatically generates a structured beads issue hierarchy for each extracted pattern, with labels applied for priority and component type: agent-evolution, p0-critical / p1-high / p2-medium, and core / enhancement / docs / tooling. 34

Pattern analysis runs in parallel: one Explore subagent per pattern category is launched simultaneously, and their outputs are synthesized by the orchestrator using Thoughtbox for architectural trade-off evaluation. Framework generation then runs sequentially due to dependency ordering. This hybrid parallel-then-sequential architecture maximizes analysis throughput while preserving generation correctness.
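The parallel-then-sequential shape can be sketched with asyncio. Both analyze_category and generate_component are hypothetical placeholders for the Explore subagents and the template renderer; only the control flow is the point here.

```python
import asyncio


async def analyze_category(category: str) -> dict:
    """Placeholder for one Explore subagent; real analysis would run here."""
    await asyncio.sleep(0)  # yield control, as an I/O-bound subagent call would
    return {"category": category, "patterns": [f"{category}-pattern"]}


def generate_component(name: str, findings: list[dict]) -> str:
    """Placeholder for rendering one component from the synthesized findings."""
    return f"{name} (from {len(findings)} analyses)"


async def run_pipeline(categories: list[str], components: list[str]) -> list[str]:
    # Parallel phase: all category analyses are launched at once.
    findings = list(await asyncio.gather(*(analyze_category(c) for c in categories)))
    # Sequential phase: components are generated one at a time, in the given order.
    return [generate_component(name, findings) for name in components]


generated = asyncio.run(run_pipeline(
    ["knowledge", "evaluation", "tooling"],
    ["protocols", "base_classes", "factories"],
))
```

asyncio.gather returns results in the order the awaitables were passed, so the synthesized findings stay aligned with the category list even though the analyses overlap in time.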

Section 11

Structured Reasoning: Thoughtbox

The thoughtbox plugin provides six specialized agents powered by the Thoughtbox MCP server's session-based reasoning engine. The server implements branching thought trees, mental models, critique validation, and learning extraction — capabilities that augment LLM reasoning by making the reasoning process explicit, auditable, and persistent across a session. 35

Cipher Notation

All thoughtbox agents use a cipher notation format that provides 2-4× token compression for structured reasoning:

Cipher notation format
Text
{num}|{type}|{refs}|{content}

Types: O (Observation), A (Assumption), H (Hypothesis), E (Evidence),
     Q (Question), P (Plan), C (Conclusion), R (Revision), X (Rejected)

All Conclusion (C-type) thoughts must have "critique": true — the Thoughtbox server challenges the conclusion and the agent must address the critique before finalizing. Dead hypotheses are explicitly killed with X-type thoughts; understanding revisions are captured with R-type thoughts. 36
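A small parser shows how compact the format is to consume. This is a sketch under stated assumptions: the refs field is treated as a comma-separated list of earlier thought numbers, which the format line does not specify, and the Thought class and needs_critique helper are hypothetical names.

```python
from dataclasses import dataclass

THOUGHT_TYPES = {"O", "A", "H", "E", "Q", "P", "C", "R", "X"}


@dataclass(frozen=True)
class Thought:
    num: int
    type: str
    refs: tuple[int, ...]
    content: str


def parse_thought(line: str) -> Thought:
    """Parse one '{num}|{type}|{refs}|{content}' line."""
    # maxsplit=3 keeps any "|" characters inside the content field intact.
    parts = line.split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    num, kind, refs, content = parts
    if kind not in THOUGHT_TYPES:
        raise ValueError(f"unknown thought type {kind!r}")
    # Assumption: refs is a comma-separated list of earlier thought numbers.
    ref_ids = tuple(int(r) for r in refs.split(",") if r.strip())
    return Thought(int(num), kind, ref_ids, content)


def needs_critique(thought: Thought) -> bool:
    """Conclusion thoughts must be submitted with critique enabled."""
    return thought.type == "C"
```

For example, parse_thought("3|H|1,2|cache miss|stale key") yields thought 3 of type H, referencing thoughts 1 and 2, with the content "cache miss|stale key" preserved whole.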

Six Agents

All six agents are configured with user-invocable: false except loop-orchestrator-agent, reflecting their design as building blocks to be composed rather than invoked directly.

  • strategic-reasoner-agent (Sonnet): Orchestrates deep strategic decision-making using four mental models: trade-off-matrix, pre-mortem, steelmanning, and inversion. Frames the problem with O-A-Q thoughts, creates at least two branches for competing options, and converges with a critique-validated Conclusion thought. 37

  • improvement-reasoner-agent (Sonnet): Analyzes a single discovery (bug, vulnerability, or performance issue) and produces a structured improvement plan with 2-3 branched approaches, each scored on FEASIBILITY (1-10), RISK (1-10), and ESTIMATED_COST (tokens). 38

  • loop-orchestrator-agent (Sonnet, user-invocable): The entry point for autonomous improvement. Coordinates the full five-phase self-improvement cycle, running the other agents as subagents. 39

  • architecture-planner-agent (Opus): The most capable reasoner in the thoughtbox suite. Applies six mental models and produces three mandatory deliverables: an Architecture Decision Record (ADR), an implementation blueprint, and an exported reasoning trace from Thoughtbox for audit trail purposes. 40

  • debug-investigator-agent (Sonnet): Interleaves codebase investigation with structured hypothesis tracking. Uses the rubber-duck and five-whys mental models. Root cause conclusions require critique validation before finalization. 41

  • claude-md-updater-agent (Haiku): The most lightweight agent, using the Haiku model for cost efficiency. Extracts lessons from completed improvement loop iterations and persists them to CLAUDE.md and MEMORY.md. Only learnings with confidence above 0.6 are included, preventing low-signal noise from accumulating in project memory. 42
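The confidence gate used by claude-md-updater-agent is simple to state in code. The sketch below is illustrative only: the Lesson record and select_for_memory function are hypothetical names, and only the 0.6 threshold comes from the description above.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.6  # from the text: only learnings above 0.6 are persisted


@dataclass(frozen=True)
class Lesson:
    summary: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def select_for_memory(lessons: list[Lesson],
                      threshold: float = CONFIDENCE_THRESHOLD) -> list[Lesson]:
    """Keep only lessons whose confidence is strictly above the threshold."""
    return [lesson for lesson in lessons if lesson.confidence > threshold]


candidates = [
    Lesson("Pin dependency versions before running the test tier", 0.9),
    Lesson("Integration tests may be flaky on cold caches", 0.6),
    Lesson("Renaming helpers possibly confused the grader", 0.3),
]
kept = select_for_memory(candidates)
```

The comparison is strict, so a lesson at exactly 0.6 is dropped, matching the "above 0.6" wording.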

Section 12

Autonomous Improvement Cycles

The loop-orchestrator implements a five-phase cycle that requires no manual phase-by-phase orchestration. Each phase has explicit success criteria and failure conditions that determine whether to proceed, retry, or abort. 43

Phase 1: Discovery

Scans the codebase for performance bottlenecks, security vulnerabilities, code quality issues, bug patterns, and missing tests. Uses bv --robot-triage to identify the highest-leverage starting points by graph centrality (PageRank and betweenness) in the dependency graph: highly central files are prioritized because changes there have the broadest impact. Produces a structured list of discoveries with type, severity, and source location.
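To illustrate why centrality is a useful triage signal, the sketch below runs a plain power-iteration PageRank over a toy import graph. It is not how bv computes its ranking; the graph, file names, damping factor, and iteration count are all assumptions chosen for the example.

```python
def pagerank(graph: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """graph maps each file to the files it imports (edges point at dependencies)."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Files with no outgoing imports spread their rank evenly over all files.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in graph.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


IMPORTS = {
    "api.py": ["core.py"],
    "cli.py": ["core.py"],
    "worker.py": ["core.py", "utils.py"],
    "core.py": [],
    "utils.py": [],
}
ranks = pagerank(IMPORTS)
most_central = max(ranks, key=ranks.get)  # the file most others depend on
```

In this toy graph, core.py is imported by every other module, so it receives the highest rank and would be the first place to look for high-leverage issues.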

Phase 2: Filter

Ranks discoveries by impact, feasibility, and risk. Explicitly rejects discoveries that are "too complex for autonomous fix," keeping only actionable items. The filter is deliberately aggressive: quality over quantity. A discovery that passes the filter must have a clear, bounded fix that can be implemented and verified without human oversight.

Phase 3: Experiment

Implements the recommended fix from the improvement-reasoner-agent's plan. The rules are strict: minimal changes, no refactoring of unrelated code, no feature additions beyond scope. If uncertain, err toward doing less. The experiment phase is intentionally conservative to preserve the ability to evaluate cleanly in Phase 4.

Phase 4: Evaluate (Strict Tiers)

Tiered verification runs in order:

  • Tier 1: Syntax check (does the code compile/parse?)

  • Tier 2: Unit tests (do existing tests pass?)

  • Tier 3: Integration tests (does it work end-to-end?)

Evaluation is strict: any doubt fails the tier. A failed tier aborts the current iteration and increments the failure counter. The orchestrator does not attempt to fix evaluation failures within the same iteration — that would risk compounding changes and muddying the signal.
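A minimal sketch of the tiered gate, assuming each tier is a zero-argument callable. EvaluationResult, evaluate, and the lambda checks are hypothetical; a real tier would invoke a compiler or test runner.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvaluationResult:
    passed: bool
    failed_tier: Optional[str]  # name of the first failing tier, or None
    tiers_run: list[str]


def evaluate(tiers: list[tuple[str, Callable[[], bool]]]) -> EvaluationResult:
    """Run tiers in order and stop at the first one that does not clearly pass."""
    tiers_run: list[str] = []
    for name, check in tiers:
        tiers_run.append(name)
        try:
            ok = check() is True  # strict: anything but an explicit True fails
        except Exception:
            ok = False  # a crashing check counts as a failed tier
        if not ok:
            return EvaluationResult(False, name, tiers_run)
    return EvaluationResult(True, None, tiers_run)


failure_count = 0
result = evaluate([
    ("syntax", lambda: True),
    ("unit", lambda: None),        # inconclusive result: treated as a failure
    ("integration", lambda: True),  # never reached in this run
])
if not result.passed:
    failure_count += 1  # abort the iteration; no in-place repair is attempted
```

Stopping at the first failed tier keeps the expensive integration run from executing on code that already failed a cheaper check, and the strict comparison encodes the "any doubt fails" rule.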

Phase 5: Integrate (Beads Lifecycle)

Persists results via the beads issue lifecycle and delegates to claude-md-updater-agent to capture learnings in project memory:

Beads loop integration
Shell
bd update <issue-id> --status in_progress
bd close <issue-id>
bd sync

The integration with beads provides a durable record of every autonomous improvement attempt, whether successful or failed. Failed experiments are closed with a failure note rather than silently abandoned, preserving the audit trail. 44

Discovery Filtering in Practice

The filter phase is the most consequential for system reliability. Overly permissive filtering leads to attempted fixes that fail evaluation and waste iteration budget. Overly strict filtering means the loop does nothing. The current implementation targets a 30-50% pass-through rate from Discovery to Experiment: between one in three and one in two discovered issues should be actionable without human assistance.

Items filtered out are not discarded — they are logged as "deferred" discoveries in the iteration output. A human reviewing the iteration log can identify patterns in what gets filtered and improve the detection signals or the agent's capabilities accordingly.
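The filter and its pass-through target can be sketched as a partition plus a ratio. The Discovery fields, the 1-10 scales, and the feasibility and risk thresholds below are hypothetical; only the 30-50% target band and the "deferred" handling come from the text.

```python
from dataclasses import dataclass

TARGET_PASS_THROUGH = (0.30, 0.50)  # target band stated above


@dataclass(frozen=True)
class Discovery:
    title: str
    impact: int       # 1-10 (hypothetical scale)
    feasibility: int  # 1-10 (hypothetical scale)
    risk: int         # 1-10 (hypothetical scale)


def is_actionable(d: Discovery, min_feasibility: int = 7, max_risk: int = 4) -> bool:
    """Hypothetical rule: a bounded fix must be feasible and low risk."""
    return d.feasibility >= min_feasibility and d.risk <= max_risk


def filter_discoveries(discoveries: list[Discovery]):
    """Split discoveries into actionable items (highest impact first) and deferred ones."""
    actionable = sorted((d for d in discoveries if is_actionable(d)),
                        key=lambda d: d.impact, reverse=True)
    deferred = [d for d in discoveries if not is_actionable(d)]  # logged, not discarded
    return actionable, deferred


def pass_through_rate(actionable: list, deferred: list) -> float:
    total = len(actionable) + len(deferred)
    return len(actionable) / total if total else 0.0


found = [
    Discovery("unbounded retry loop", impact=8, feasibility=9, risk=2),
    Discovery("missing null check", impact=5, feasibility=9, risk=1),
    Discovery("rewrite scheduler", impact=9, feasibility=3, risk=8),
    Discovery("migrate ORM", impact=7, feasibility=2, risk=9),
    Discovery("race in cache warmup", impact=8, feasibility=5, risk=6),
]
actionable, deferred = filter_discoveries(found)
rate = pass_through_rate(actionable, deferred)
low, high = TARGET_PASS_THROUGH
in_band = low <= rate <= high
```

Here two of five discoveries pass, a rate of 0.4, which sits inside the target band. Tracking this ratio per iteration is one way to notice a filter drifting toward either extreme.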

Section 13

System Integration & Cooperation

The four systems are not isolated tools — they form an integrated quality ecosystem where each system's outputs feed the next system's inputs. 45

Full Cooperation Cycle

  1. Task Routing: oracle_oracle_query samples from per-agent Beta distributions and returns top candidates. Oracle's 888 active tracked agents include skills built with skill-creator and agents generated by agent-evolution-framework, meaning the routing system benefits from the quality improvements produced by the other systems.

  2. Task Execution: The selected agent executes. If it carries a @koth_tracked decorator, success or failure is recorded to the telemetry layer automatically.

  3. Feedback Recording: oracle_oracle_feedback is called with the outcome. Oracle immediately updates the agent's Beta distribution.

  4. Quality Improvement: When an agent's performance plateaus or regresses (detected via koth_koth_check_regressions), the skill-creator or agent-creator pipeline is invoked. Blind A/B comparison and grader critique identify exactly what needs to change; the analyzer produces actionable suggestions.

  5. New Agent Creation: When a new domain requires a new agent, agent-evolution-framework analyzes existing production codebases, extracts relevant patterns, and generates a base implementation already including KOTH telemetry.

  6. Architectural Validation: Before committing to a new agent architecture, thoughtbox's strategic-reasoner-agent or architecture-planner-agent evaluates competing designs with branching mental models and critique-validated conclusions.

  7. Knowledge Persistence: claude-md-updater-agent captures patterns that worked and patterns that failed into CLAUDE.md and MEMORY.md. Oracle's Beta distributions persist in oracle.db; KOTH's ELO scores persist in the telemetry layer.

  8. ELO Recalculation: koth_koth_run_pipeline processes accumulated telemetry, updates ELO scores, and the cycle begins again with improved routing distributions.
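Steps 1 through 3 of the cycle above amount to a Thompson Sampling loop. The sketch below is a stand-in, not Oracle's implementation: ThompsonRouter, its query and feedback methods, the agent names, and the simulated success rates are all hypothetical, chosen to mirror the roles of oracle_oracle_query and oracle_oracle_feedback.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class BetaPosterior:
    alpha: float = 1.0  # prior pseudo-count plus recorded successes
    beta: float = 1.0   # prior pseudo-count plus recorded failures


class ThompsonRouter:
    def __init__(self, seed: Optional[int] = None):
        self.posteriors: dict[str, BetaPosterior] = {}
        self.rng = random.Random(seed)

    def register(self, agent: str) -> None:
        """New agents start at Beta(1,1), the uniform prior."""
        self.posteriors.setdefault(agent, BetaPosterior())

    def query(self, top_k: int = 3) -> list[str]:
        """Draw one sample per agent and return the agents with the highest draws."""
        draws = {name: self.rng.betavariate(p.alpha, p.beta)
                 for name, p in self.posteriors.items()}
        return sorted(draws, key=draws.get, reverse=True)[:top_k]

    def feedback(self, agent: str, success: bool) -> None:
        """Conjugate update: a success increments alpha, a failure increments beta."""
        posterior = self.posteriors[agent]
        if success:
            posterior.alpha += 1
        else:
            posterior.beta += 1


# Simulated environment with made-up success rates, for illustration only.
TRUE_RATES = {"strong-agent": 0.9, "weak-agent": 0.2, "new-agent": 0.5}
router = ThompsonRouter(seed=7)
outcome_rng = random.Random(11)
picks = {name: 0 for name in TRUE_RATES}
for name in TRUE_RATES:
    router.register(name)
for _ in range(500):
    chosen = router.query(top_k=1)[0]
    picks[chosen] += 1
    router.feedback(chosen, outcome_rng.random() < TRUE_RATES[chosen])
```

Because every query draws a fresh sample from each posterior, weaker agents are still selected occasionally while their distributions remain wide. That is the exploration half of the trade-off; the feedback call is what lets exploitation take over as evidence accumulates.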

Dependency Graph

System dependency map
Text
oracle.db (Beta distributions)
← koth_koth_run_pipeline (ELO computation)
  ← ~/.claude/telemetry/ (raw execution data)
    ← @koth_tracked decorators (agent-evolution-framework)
    ← manual oracle_oracle_feedback calls (after any agent execution)

skill-creator → improved skills → better oracle routing
agent-creator → improved agents → better oracle routing

thoughtbox → architectural decisions → better system design
thoughtbox loop-orchestrator → autonomous fixes → fewer regressions

Cold-Start vs. Warm-Start

When a new agent or skill is first created, the agent-evolution-framework generates a structurally sound base implementation from proven patterns. The first few Oracle queries return the new agent occasionally (Beta(1,1) sampling). Early outcomes update the Beta distribution rapidly — each result has high influence when the sample size is small. As the agent proves itself, Oracle routes it more frequently; if it underperforms, it is naturally deprioritized. This cold-start behavior is a structural property of Thompson Sampling, not a tuned parameter — it emerges from the mathematical properties of Beta distribution sampling.
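The claim that early results carry more weight follows directly from the Beta posterior formulas. The short calculation below compares the effect of one additional success on a brand-new agent at Beta(1,1) and on an established agent at Beta(51,51); the second agent's counts are made up for illustration.

```python
def posterior_mean(alpha: float, beta: float) -> float:
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


def posterior_variance(alpha: float, beta: float) -> float:
    """Variance of Beta(alpha, beta)."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


def shift_from_one_success(alpha: float, beta: float) -> float:
    """How far the posterior mean moves after one more recorded success."""
    return posterior_mean(alpha + 1, beta) - posterior_mean(alpha, beta)


new_agent_shift = shift_from_one_success(1, 1)      # 2/3 - 1/2 = 1/6, about 0.167
veteran_shift = shift_from_one_success(51, 51)      # 52/103 - 1/2, about 0.005
new_agent_variance = posterior_variance(1, 1)       # 1/12, about 0.083
veteran_variance = posterior_variance(51, 51)       # about 0.0024
```

A single success moves the new agent's expected success rate by roughly 17 percentage points but the established agent's by about half a point, and the new agent's much larger variance is what makes its sampled values swing widely enough to be picked occasionally.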

When an existing agent is improved via skill-creator or agent-creator, the improved version replaces the previous best only after it wins the blind A/B comparison. Oracle's existing Beta distribution for that agent name keeps accumulating feedback, so if the new version performs better, its success rate rises and the distribution shifts toward higher alpha values. This warm-start behavior means improvements compound: an agent already trusted by Oracle is routed frequently, so its improved performance is reflected in the Beta distribution faster than a new agent's would be. 46

The Compounding Effect

The compounding effect is not a claim about some future state — it is an observable consequence of the architecture as implemented. Each feedback call narrows a Beta distribution. Each KOTH pipeline run sharpens ELO rankings. Each improvement cycle raises the quality floor. Each pattern extracted by agent-evolution-framework reduces the time to create the next agent. The systems compound on each other as well as internally.

The moat is not any individual agent's quality. It is the accumulated telemetry, the refined patterns, and the closed-loop infrastructure that learns from every execution.

References
  1. KOTH tool interface. ~/.claude/configs/references/archangel-koth.md, 1-55
  2. skill-creator building blocks. KellerAI's skill-creator plugin (internal codebase), SKILL.md 30-71
  3. agent-evolution-framework mission. KellerAI's agent-evolution-framework plugin (internal codebase), README.md 7
  4. Oracle scale: 1,300 agents tracked. ~/.claude/configs/references/archangel-oracle.md, 54-60
  5. Anchoring bias in LLM evaluation. KellerAI's skill-creator plugin (internal codebase), SKILL.md 478-497
  6. KOTH MCP server prefix. ~/.claude/configs/references/archangel-koth.md, 1-8
  7. KOTH pipeline stages and timeout. ~/.claude/configs/references/archangel-koth.md, 9-15
  8. KOTH regression detection severity. ~/.claude/configs/references/archangel-koth.md, 17-24
  9. KOTH/Oracle separation of concerns. ~/.claude/configs/references/archangel-koth.md, 51-54
  10. Oracle Thompson Sampling description. ~/.claude/configs/references/archangel-oracle.md, 1-15
  11. Beta distribution conjugate prior. Thompson, W.R. (1933). On the likelihood that one unknown probability exceeds another. Biometrika 25(3/4): 285–294.
  12. Beta(1,1) cold-start behavior. ~/.claude/configs/references/archangel-oracle.md, 54-60
  13. Oracle scale: active agents and top performers. ~/.claude/configs/references/archangel-oracle.md, 20-35
  14. oracle_oracle_feedback criticality. ~/.claude/configs/references/archangel-oracle.md, 55-56
  15. Feedback loop break failure mode. ~/.claude/configs/references/archangel-oracle.md, 55-56
  16. skill-creator plugin: executor, grader, comparator, analyzer design. KellerAI's skill-creator plugin (internal codebase), SKILL.md 1-30
  17. Executor user_notes mandatory. KellerAI's skill-creator plugin (internal codebase), agents/executor.md 1-183
  18. Grader eval critique. KellerAI's skill-creator plugin (internal codebase), agents/grader.md 1-228
  19. Comparator rubric and randomization. KellerAI's skill-creator plugin (internal codebase), agents/comparator.md 1-206
  20. Analyzer priority categories. KellerAI's skill-creator plugin (internal codebase), agents/analyzer.md 1-189
  21. Four evaluation mode workflows. KellerAI's skill-creator plugin (internal codebase), SKILL.md 88-95
  22. Improve mode history.json tracking. KellerAI's skill-creator plugin (internal codebase), SKILL.md 290-340
  23. Blind comparison design rationale. KellerAI's skill-creator plugin (internal codebase), SKILL.md 478-497
  24. Workspace chain of custody. KellerAI's skill-creator plugin (internal codebase), SKILL.md 694-702
  25. agent-creator dual dimensions. KellerAI's agent-creator plugin (internal codebase), SKILL.md 1-452
  26. agent-creator triggering transcript section. KellerAI's agent-creator plugin (internal codebase), agents/executor.md 82-96
  27. agent-creator grader triggering separation. KellerAI's agent-creator plugin (internal codebase), agents/grader.md 8-14
  28. agent-creator comparison rubric. KellerAI's agent-creator plugin (internal codebase), references/schemas.md 306-338
  29. agent-evolution-framework mission statement. KellerAI's agent-evolution-framework plugin (internal codebase), README.md 7
  30. Three-component framework architecture. KellerAI's agent-evolution-framework plugin (internal codebase), README.md 28-76
  31. Two-Knowledge Architecture pattern. KellerAI's agent-evolution-framework plugin (internal codebase), agents/evolution-orchestrator-agent.md 48-66
  32. @koth_tracked decorator implementation. KellerAI's agent-evolution-framework plugin (internal codebase), skills/evolution-worker/SKILL.md 92-119
  33. Jinja2 templates and generation dependency order. KellerAI's agent-evolution-framework plugin (internal codebase), skills/evolution-worker/SKILL.md 78-84, 251-257
  34. Beads epic structure and labels. KellerAI's agent-evolution-framework plugin (internal codebase), README.md 149-166
  35. thoughtbox plugin overview. KellerAI's thoughtbox plugin (internal codebase), README.md 1-12
  36. Cipher notation and critique validation. KellerAI's thoughtbox plugin (internal codebase), agents/strategic-reasoner-agent.md 42-60
  37. strategic-reasoner workflow. KellerAI's thoughtbox plugin (internal codebase), agents/strategic-reasoner-agent.md 64-154
  38. improvement-reasoner scoring dimensions. KellerAI's thoughtbox plugin (internal codebase), agents/improvement-reasoner-agent.md 60-96
  39. loop-orchestrator user-invocable status. KellerAI's thoughtbox plugin (internal codebase), README.md 62-83
  40. architecture-planner-agent Opus model. KellerAI's thoughtbox plugin (internal codebase), README.md 88-118
  41. debug-investigator mental models. KellerAI's thoughtbox plugin (internal codebase), README.md 120-147
  42. claude-md-updater confidence threshold. KellerAI's thoughtbox plugin (internal codebase), README.md 149-175
  43. loop-orchestrator five phases. KellerAI's thoughtbox plugin (internal codebase), agents/loop-orchestrator-agent.md 60-246
  44. beads integration in loop-orchestrator. KellerAI's thoughtbox plugin (internal codebase), agents/loop-orchestrator-agent.md 212-228
  45. Full cooperation cycle. /koth-oracle-in-depth §13, System Integration & Cooperation
  46. Warm-start behavior. /koth-oracle-in-depth §13, Cold-Start vs. Warm-Start
  47. Chapelle, O., & Li, L. (2011). An Empirical Evaluation of Thompson Sampling. NIPS 24. proceedings.neurips.cc/paper/2011/hash/e53a0a2978c28872a4505bdb51db06dc