Introduction — anything else is a guess
A single charitable read of a target document by a single language model is a stylistic match against a corpus that does not contain the system in front of it. 1 That sentence is the diagnosis under all five joints catalogued by the Series 1 critique on this site, 2 and it is the dynamic the Series 2 methodology — Trust but Verify — was constructed to interrupt. 3 This paper is the operationalization step. It documents, at the file-and-line level, the two pieces of plugin source that turn the Trust but Verify posture into mechanism: the /5pass adversarial-review skill 4 and the /spec-init workflow inside the kellerai-feature-spec plugin. 5
The argument is short. Reviewing a finished document with one lens and one model is the failure mode the Series 1 bill was paid against. Reviewing the same document with five sequential lenses — where each lens has a defined attacker model and where the third lens reads the prior passes and surfaces contradictions between them — collapses the obvious surface area of single-lens failure. Reviewing a specification in fourteen numbered phases — three of which dispatch four parallel model tracks per phase, with synthesis blocked until every track returns — collapses the surface area further, at the upstream side of the work where the cost of a missed gap is lowest. And gating both workflows on a single verbatim rule — every factual claim must be cited, no exceptions 6 — means the artifacts produced are not opinions about the system; they are decision traces that survive the LLM's statelessness, the reviewer's rotation, and the next regulator's inquiry.
The promise of the paper is concrete. By section 9 the reader will have walked every one of the five passes by its name in the skill source, every one of the fourteen phases by its identifier in the phase manifest, and the exact text of the citation rule the firm enforces at the boundary. By section 10 the reader will see each of the five joints from Series 1 mapped to the specific pass or phase that would have caught it. By the closing sentence the reader will have a one-line standard that fits on a sticky note above the merge button.
A self-reflexive note. This paper is itself an exercise in the standard it argues for. Every numbered footnote points to a real artifact — a file path, a published paper, a verbatim quote. Where the prompting framing for this paper named a thing the source does not contain, the paper uses the source's identifier and notes the discrepancy in a footnote. 7 A document arguing for citation discipline without citations would be a guess wearing the costume of an argument. The whole point is to refuse that costume.
The single-pass failure mode, measured
The dynamic under all of this is empirical. Single-pass LLM review is not bad because reviewers are lazy or because the technology is immature; it is bad because the underlying behaviour of a reward-aligned language model under a friendly read is to validate the surface of the document in front of it. The literature has measured the magnitude.
Sharma et al., presented at ICLR 2024, showed that human-feedback-aligned language models learn sycophancy as a structural property of reward optimisation — across five frontier-class systems, the dominant pattern under friendly prompting is preference for agreement over truth when the two conflict. 8 A subsequent 2026 University of Florida / ACM CHI study on opinion adaptation hardened the finding: LLMs frequently start with one position and shift toward the user's position when the user pushes back, even when the user is wrong. 9 A 2026 Northeastern University analysis added the inverter: sycophancy is role-dependent, and the more peer-like the framing the user establishes, the more strongly the model agrees with the user. 10 The peer-reviewer framing — the framing the productivity narrative recommends most often — is the highest-sycophancy configuration the literature has measured.
The ensemble literature points the other way. Khan et al. (2024), in Debating with More Persuasive LLMs Leads to More Truthful Answers, demonstrated that two language models, instructed to debate opposite sides of a claim under a third model judge, surface disagreements that any single charitable read misses. 11 Diao et al. (2024), in Active Prompting with Chain-of-Thought for Large Language Models, established that ensemble disagreement is a more reliable signal of unresolved difficulty than any single model's confidence score. 12 The intuition is structural. Each model family has a different training distribution; each has a different sycophancy profile; the failure mode of one is not necessarily the failure mode of another. A finding that survives three lenses across four model families is a finding; a finding that surfaces in only one of them is exactly the case the workflow is built to catch and resolve.
This is the empirical anchor for the engineering that follows. The five-pass skill is not aesthetic. The cross-model dispatcher is not redundancy theater. Both exist because the single-pass review by a single model has a measurable failure rate against exactly the structural absences a production system needs caught before the merge button — and the only known mitigation, short of having a second senior engineer with infinite attention, is to operationalise multiple lenses across multiple models with citations as the acceptance criterion.
/5pass — the architecture
The /5pass skill lives at ~/.claude/skills/5pass/SKILL.md. 13 Its self-description, verbatim from the YAML frontmatter, is to “run a rigorous five-pass adversarial critique against any target document — plan, RFC, ADR, design doc, or similar.” 14 The skill is invoked, by the user, against a specific target file. It dispatches five sequential passes through the thoughtbox:loop-orchestrator-agent and produces six durable artifact files on disk: one per pass, plus a v2 of the target document with the synthesis pass's replacement prose applied. 15
The five lenses are fixed. The skill source enumerates them in a single table, with the focus statement attached to each. 16 What follows reproduces the table's content and then adds, for each lens, the failure mode it is constructed to surface.
Pass 1 — Red-Team
The first pass operates under the assumption that the target is wrong somewhere. The lens, verbatim from the skill: “Assume the target is wrong somewhere. Find blunders, attack vectors, plan-killers. Run live-state Bash checks where the target is a plan against a real repo.” 17
The pass is dispatched as RedTeamCritic. Its workflow brief instructs the agent to read the target document in full, run live-state Bash checks against any real repository or filesystem the target references, and write findings to {OUTPUT_DIR}/CRITIQUE-pass1-redteam.md before any orchestration message is sent. 18 Each finding cites a file and a line attacked, with a labelled severity of CRITICAL, HIGH, MEDIUM, or LOW. 19 A red-team pass that returns “looks good to me” is not a red-team pass; it is a failed pass.
The failure mode this pass surfaces is the smoking-gun pattern: a claim in the target document that is contradicted by the live repository the document is about. Series 1's Joint 5 (the _run_task file-poll retry loop where the SDK's can_use_tool primitive should have been) is the canonical example of what Pass 1 catches. 20 A Bash check against the agent module in the KellerAI internal codebase (grep can_use_tool <agent-module>.py) would have surfaced the primitive's existence in the same file as the file-poll contract, three function definitions earlier; any honest red-team agent would have flagged the asymmetry inside the first two minutes.
Pass 2 — Errors of Omission
The second pass hunts absence. The lens, verbatim: “Hunt absence, not presence. What is NOT in the target that should be?” 21
The pass is dispatched as OmissionCritic in parallel with Pass 1 and Pass 4 (Wave 1). 22 Its brief instructs the agent to read the target in full and write findings keyed by what is missing — preconditions assumed but never verified, failure modes unaddressed, rollback paths absent. 23 The pass does not see the Wave 1 siblings' findings because, by design, the synthesis pass (Pass 5) handles deduplication of topical overlap. 24 Omission is its own lens precisely because the presence-oriented review reads only what is there; the absence-oriented review forces the question “what should be there that isn't?”
The failure mode this pass surfaces is the missing primitive — the SDK call that wasn't made, the audit table that wasn't created, the reaper that wasn't wired. Series 1's Joint 1 (the missing pod-crash recovery reaper) and Joint 2 (the missing thread_status_history table) are exactly the shape of finding Pass 2 is built to produce. 25 The PR record in Series 1 includes a verbatim author acknowledgment of the deferral (“a separate problem we've deferred”), 26 which is itself evidence that the omission was recognised but uncaught at the review boundary. Pass 2 is the boundary that catches it.
Pass 3 — Sloppy Thinking and Logical Flaws
The third pass attacks the reasoning structure itself. The lens, verbatim: “Internal contradictions, fabricated convergence claims, undefended assumptions, unjustified leaps, recommendation-label hygiene.” 27
Pass 3 is the only pass with cross-finding visibility. Its brief specifies that it runs in Wave 2 — after Pass 1, Pass 2, and Pass 4 have all written their files — and that the agent's job is to read those three prior outputs and the target document and surface contradictions between them. 28 The pass attacks the reasoning, not the conclusions: fabricated convergence (“all reviewers agreed” when they did not), undefended assumptions, recommendation-label hygiene (a thing labelled RECOMMENDED when it is actually REQUIRED). 29 A finding under this lens reads like “the target claims X on the basis of Y, but Pass 4 verified that Y is false in the live repo.”
The failure mode this pass surfaces is the inflated convergence claim — the most dangerous category of single-pass output, because it gives the surface of agreement while the substance disagrees. The Trust Dial incident catalogued in Earned Autonomy — where a model returned exactly the target score and conceded one point of a five-point inflation under self-audit — is what an uncaught Pass 3 failure looks like in production. 30
Pass 4 — Misconceptions About Ground Truth
The fourth pass re-verifies every factual claim against live state. The lens, verbatim: “Re-verify every factual claim against live state via Bash. Find every place the target describes reality wrongly.” 31
The pass is dispatched as GroundTruthCritic in parallel with Pass 1 and Pass 2 (Wave 1). 32 Its brief is the most mechanical of the five. For every factual claim in the target document, the agent runs the minimal Bash command that confirms or refutes it against the live system, then records four columns: claim, command, output, verdict. 33 A claim not testable via Bash is flagged “UNVERIFIABLE — cannot confirm live,” not silently retained. 34
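The four-column record described above can be sketched in a few lines. This is a minimal illustration, not the skill's implementation: the names ClaimCheck and check_claim are hypothetical, the verdict labels CONFIRMED and REFUTED are assumptions (the source only fixes the UNVERIFIABLE flag), and the probe command is a stand-in that calls the Python interpreter so the example runs anywhere.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ClaimCheck:
    claim: str
    command: str  # the command as run, or "" when no command exists
    output: str
    verdict: str  # CONFIRMED, REFUTED, or UNVERIFIABLE (labels are assumptions)


def check_claim(claim: str, argv: Optional[Sequence[str]], expected: str) -> ClaimCheck:
    """Run one minimal command for one claim and record the four columns."""
    if argv is None:
        # No command can test this claim: flag it instead of keeping it silently.
        return ClaimCheck(claim, "", "", "UNVERIFIABLE")
    result = subprocess.run(list(argv), capture_output=True, text=True)
    output = (result.stdout + result.stderr).strip()
    verdict = "CONFIRMED" if expected in output else "REFUTED"
    return ClaimCheck(claim, " ".join(argv), output, verdict)


if __name__ == "__main__":
    # Stand-in probe; a real run would grep or read the live repository.
    probe = [sys.executable, "-c", "print('cost_cents=0')"]
    rows = [
        check_claim("finalize_drift_run hardcodes cost_cents=0", probe, "cost_cents=0"),
        check_claim("a claim with no testable command", None, ""),
    ]
    for row in rows:
        print(f"{row.claim} | {row.command} | {row.output} | {row.verdict}")
```

The design choice worth noting is that an untestable claim still produces a row. The table records what could not be checked as well as what was checked.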
The failure mode this pass surfaces is the stale fact — a claim the document inherits from an earlier draft, an earlier deployment, or a fabricated detail in a single-model first draft. Series 1's Joint 3 (the hardcoded cost_cents=0 in finalize_drift_run, with the comment “Kai doesn't yet expose token usage” that was true months ago and is no longer) 35 is exactly the kind of fact a Pass 4 Bash check resolves in one command: read the SDK changelog, observe that the usage telemetry surface has stabilised, mark the comment refuted, propose the corrected line.
Pass 5 — Synthesis and Revised Target
The fifth pass writes the deliverable. The lens, verbatim: “Produce: (a) what survives, (b) what must change with replacement prose, (c) what is unresolved, (d) the revised v2 document as a standalone file, (e) confidence verdict: SAFE / NEEDS-MORE-WORK / FUNDAMENTALLY-BROKEN.” 36
Pass 5 is dispatched as SynthesisCritic in Wave 3, after all four prior passes have written their files. 37 The agent reads the target and the four pass files, then writes two artifacts: a synthesis file in four sections (what survives, what must change, what is unresolved, a one-token confidence verdict) and a standalone v2 of the target document with all Section-2 replacements applied. 38 The synthesis rule, quoted from the brief, is unambiguous: “Every Section 2 item must include the concrete replacement prose, not just ‘fix this’. The v2 file is the deliverable; the synthesis is the audit trail.” 39
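The wave ordering falls out of the read dependencies between passes. The sketch below derives the waves from a dependency map; the map restates what the text says each pass reads, while the function name plan_waves and the data layout are illustrative assumptions rather than anything taken from the skill source.

```python
from typing import Dict, List, Set

# Which prior pass files each pass reads, as described in the text:
# passes 1, 2 and 4 read only the target, pass 3 reads 1, 2 and 4,
# and pass 5 reads all four prior passes.
PASS_DEPENDENCIES: Dict[int, Set[int]] = {
    1: set(),
    2: set(),
    4: set(),
    3: {1, 2, 4},
    5: {1, 2, 3, 4},
}


def plan_waves(deps: Dict[int, Set[int]]) -> List[List[int]]:
    """Group passes into waves; a pass joins a wave once all its inputs exist."""
    done: Set[int] = set()
    remaining = dict(deps)
    waves: List[List[int]] = []
    while remaining:
        ready = sorted(p for p, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError("dependency cycle: no pass can start")
        waves.append(ready)
        done.update(ready)
        for p in ready:
            del remaining[p]
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(plan_waves(PASS_DEPENDENCIES), start=1):
        print(f"Wave {number}: passes {wave}")
```

Passes inside one wave have no dependency on each other, which is what allows Wave 1 to run in parallel.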
The output format matters. A 5pass run that produces a synthesis without a v2 file has produced critique without a deliverable; a v2 file without the synthesis file has produced a deliverable without the audit trail; either alone is a failed pass, and the workflow specifies six mandatory artifacts before the run is declared complete. 40 The confidence verdict is similarly bounded. There are exactly three legal values, and the difference between them is a workflow gate: SAFE permits execution against the v2 document, NEEDS-MORE-WORK forbids execution until specific gaps are resolved, FUNDAMENTALLY-BROKEN requires a full rewrite. 41 A reviewer cannot fudge the verdict; the brief enumerates the three values and the workflow contract requires one of them.
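The verdict gate is small enough to express as a closed enumeration. The following is a sketch under stated assumptions: the three token values come from the text, but the names Verdict, parse_verdict, and may_execute are hypothetical and do not appear in the skill source.

```python
from enum import Enum


class Verdict(Enum):
    SAFE = "SAFE"
    NEEDS_MORE_WORK = "NEEDS-MORE-WORK"
    FUNDAMENTALLY_BROKEN = "FUNDAMENTALLY-BROKEN"


# What each verdict permits, paraphrased from the paragraph above.
NEXT_STEP = {
    Verdict.SAFE: "execute against the v2 document",
    Verdict.NEEDS_MORE_WORK: "resolve the listed gaps before executing",
    Verdict.FUNDAMENTALLY_BROKEN: "rewrite the target in full",
}


def parse_verdict(token: str) -> Verdict:
    """Accept exactly one of the three legal tokens; anything else raises ValueError."""
    return Verdict(token.strip())


def may_execute(verdict: Verdict) -> bool:
    """Only SAFE opens the gate."""
    return verdict is Verdict.SAFE


if __name__ == "__main__":
    verdict = parse_verdict("NEEDS-MORE-WORK")
    print(may_execute(verdict), "->", NEXT_STEP[verdict])
```

Parsing into an enumeration is what makes the verdict hard to fudge: a hedged token such as "MOSTLY-SAFE" fails at the boundary rather than being interpreted.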
The non-overlap design choice
The five lenses are deliberately orthogonal. Pass 1 attacks the presence of bad content; Pass 2 attacks the absence of necessary content; Pass 3 attacks the reasoning connecting content; Pass 4 attacks the factual grounding of content; Pass 5 integrates the prior four into a single deliverable with a workflow-binding verdict. The skill's pass-brief contract instructs each pass agent to focus on its lens — “do not pre-empt other passes; P5 handles deduplication.” 42 The discipline is not “find every problem at once.” The discipline is “find every problem your lens can find, then trust the synthesis pass to handle the boundaries.”
Skipping any one of the five passes is not a 20% reduction in review depth; it is a category-shaped hole in coverage. A run without Pass 2 catches everything in the document but nothing missing from it; a run without Pass 4 catches structural critique but not factual error; a run without Pass 3 catches each finding in isolation but no contradiction across findings. The five-pass count is not a number; it is a category set.
From /5pass to /spec-init — the leverage step
The five passes are powerful but reactive. They take a target document as input and produce a v2. The leverage step is moving the same discipline upstream — applying multiple lenses across multiple models to a specification before any code is written, so the bill catalogued in Series 1 is paid in spec-edits rather than in production incident response.
That leverage is the kellerai-feature-spec plugin's /spec-init skill. 43 The skill initialises a new spec polish workflow from a draft specification file: it derives a spec slug, creates an artifacts directory, copies the canonical phase manifest to artifacts/{spec_slug}/phase-manifest.yaml, and emits an initial snapshot. 44 The default phase manifest defines fourteen numbered phases, grouped into seven stages, with cross-model as the default review mode. 45 Each phase has an identifier (a dotted number like 1.1 or 2.2), a human-readable name, a phase type, an assigned agent, a model identifier, and a dependency on prior phases that must complete before it starts. 46
The fourteen phase identifiers, in source order, are: 1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 4.1, 4.2, 5.1, 5.2, 6.1, 6.2, 7.1. 47 The names attached to those identifiers, verbatim from the manifest, are: Codebase Reconnaissance, Deepen with Ambition, Inversion Analysis, OODA Loop, Red Team, Steelman, Error Scrub, Background Sections, TDD Anchoring, Second Error Scrub, De-slopify, Validate Beads Format, Import and Wire Dependencies, OODA Closure Loop. 48 Five of the fourteen run multiple iterations or tracks under the same phase identifier: Phase 3.1 Error Scrub and Phase 5.1 Second Error Scrub as iterative phases, and Phases 2.1, 2.2, and 2.3 each as a cross-model phase. The headline count is fourteen phases, but the runtime expansion across iterations and tracks is materially larger. 49
phase-manifest.yaml: stage groupings

```yaml
# Phase 1: Investigation
phases: [1.1, 1.2, 1.3]

# Phase 2: Adversarial Review
phases: [2.1, 2.2, 2.3]
parallel: true
barrier: all_complete

# Phase 3: Error Scrubbing (5 passes)
phases: [3.1]
iterations: 5

# Phase 4: Self-Containment
phases: [4.1, 4.2]

# Phase 5: Second Error Scrub + De-slopify
phases: [5.1, 5.2]  # 5.1 iterations: 3

# Phase 6: Beads Import
phases: [6.1, 6.2]

# Phase 7: Final Validation
phases: [7.1]
```
The seven stages, by phase group, are: Investigation (1.1, 1.2, 1.3); Adversarial Review (2.1, 2.2, 2.3); Error Scrubbing (3.1, with five iterations); Self-Containment (4.1, 4.2); Second Error Scrub plus De-slopify (5.1, 5.2); Beads Import (6.1, 6.2); Final Validation (7.1). 50 The architectural shape of the workflow (investigate, then adversarially review, then scrub, then anchor to tests and background, then scrub again, then import to the issue tracker, then close) is the same shape an aviation-grade safety review applies to a flight plan before the engine starts. 51 The point is not the names of the stages. The point is that fourteen named gates with explicit dependency edges produce an audit trail by construction.
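The gap between the headline phase count and the runtime expansion can be tallied directly from the stage groupings. The arithmetic below is a rough sketch under two assumptions that the source does not state: each iteration and each model track is counted as one unit of work, and synthesis steps are left out of the count.

```python
# Phase identifiers grouped by how they execute, taken from the stage groupings.
SINGLE_RUN_PHASES = ["1.1", "1.2", "1.3", "4.1", "4.2", "5.2", "6.1", "6.2", "7.1"]
ITERATIVE_PHASES = {"3.1": 5, "5.1": 3}   # phase id -> iterations
CROSS_MODEL_PHASES = ["2.1", "2.2", "2.3"]
TRACKS_PER_CROSS_MODEL_PHASE = 4          # four parallel model tracks per phase


def headline_count() -> int:
    """Number of phase identifiers in the manifest."""
    return len(SINGLE_RUN_PHASES) + len(ITERATIVE_PHASES) + len(CROSS_MODEL_PHASES)


def runtime_units() -> int:
    """Units of work once iterations and tracks are expanded (assumed counting rule)."""
    single = len(SINGLE_RUN_PHASES)
    iterations = sum(ITERATIVE_PHASES.values())
    tracks = len(CROSS_MODEL_PHASES) * TRACKS_PER_CROSS_MODEL_PHASE
    return single + iterations + tracks


if __name__ == "__main__":
    print(f"headline phases: {headline_count()}")  # 9 + 2 + 3 = 14
    print(f"runtime units:   {runtime_units()}")   # 9 + 8 + 12 = 29
```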
The plugin's spec-orchestrator-agent reads the manifest at every invocation, checks progress.jsonl for the current state, and determines the next ready phase by verifying every depends_on entry is satisfied. 52 Single-track phases satisfy with status: complete; cross-model phases satisfy with status: synthesized; iterative phases satisfy with status: complete at the required pass number. 53 The orchestrator does not advance to Phase 3.1 until Phase 2.3 is synthesized; it does not advance to Phase 7.1 until Phase 6.2 has completed. A phase that fails three consecutive times is refused for re-launch by the orchestrator's error handler. 54
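The readiness check can be sketched as follows. The three status rules and the three-failure refusal come from the text; everything else is an assumption made for illustration, including the record field names (phase, status, pass), the trimmed manifest and its dependency edges, and the function names.

```python
import json
from typing import Dict, List, Optional

# Illustrative slice of a manifest. Field layout and edges are assumptions.
MANIFEST: Dict[str, dict] = {
    "2.3": {"type": "cross-model", "depends_on": []},
    "3.1": {"type": "iterative", "iterations": 5, "depends_on": ["2.3"]},
    "4.1": {"type": "single", "depends_on": ["3.1"]},
}


def parse_progress(jsonl_text: str) -> List[dict]:
    """Parse progress.jsonl content: one JSON record per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def is_satisfied(phase_id: str, manifest: Dict[str, dict], records: List[dict]) -> bool:
    spec = manifest[phase_id]
    mine = [r for r in records if r["phase"] == phase_id]
    if spec["type"] == "cross-model":
        return any(r["status"] == "synthesized" for r in mine)
    if spec["type"] == "iterative":
        # Complete only at the required pass number.
        return any(
            r["status"] == "complete" and r.get("pass") == spec["iterations"]
            for r in mine
        )
    return any(r["status"] == "complete" for r in mine)


def consecutive_failures(phase_id: str, records: List[dict]) -> int:
    """Count trailing 'failed' records for a phase."""
    count = 0
    for record in reversed([r for r in records if r["phase"] == phase_id]):
        if record["status"] != "failed":
            break
        count += 1
    return count


def next_ready_phase(
    manifest: Dict[str, dict], records: List[dict], max_failures: int = 3
) -> Optional[str]:
    for phase_id, spec in manifest.items():
        if is_satisfied(phase_id, manifest, records):
            continue
        if consecutive_failures(phase_id, records) >= max_failures:
            continue  # refused for re-launch
        if all(is_satisfied(dep, manifest, records) for dep in spec["depends_on"]):
            return phase_id
    return None
```

Because the state is re-derived from the log on every call, the orchestrator holds nothing in memory that a new session would lose.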
The leverage is dimensional. Running five passes against a finished design document catches the failure modes that are present in the document. Running fourteen phases against a specification before the work has been done catches the failure modes the document would otherwise have inherited. The Series 1 bill was paid against the absence of exactly this upstream review.
The four model tracks
Three of the fourteen phases — Phase 2.1 OODA Loop, Phase 2.2 Red Team, and Phase 2.3 Steelman — are typed as cross-model in the manifest. 55 A cross-model phase does not run as a single agent against the spec. It dispatches four parallel tracks, each operating under the same lens but driven by a different model family, and blocks synthesis until every track has completed. 56
The four model tracks, as declared in the manifest, are claude, codex, grok, and gemini. 57 Each track has, in the manifest, a separate output path (artifacts/{spec_slug}/phase-{id}-{label}-{track}.md), a launch policy (immediate for the claude track, auto for the three external tracks), and either an in-process agent assignment (the claude track maps to thoughtbox:strategic-reasoner-agent) or a prompt-template path (the external tracks map to a rendered prompt file under skills/spec-external-prompts/assets/). 58 The synthesis block at the bottom of each cross-model phase declares the synthesis prompt template, the four required inputs, and the blocked_until: all_tracks_complete gate. 59
The external tracks are driven by dedicated reviewer agents. The Grok track is dispatched by grok-spec-reviewer-agent, which resolves the local grok CLI at runtime, validates the API key with a canary command, reads a rendered prompt file at artifacts/{spec_slug}/external/phase-{phase_id}-grok-prompt.txt, invokes the CLI in headless mode (grok --no-sandbox -m grok-4-1-fast --prompt "$PROMPT"), and writes output to the phase's track output path via tee. 60 The Codex and Gemini tracks are dispatched by analogously structured agents: codex-spec-reviewer-agent and gemini-spec-reviewer-agent. 61 All three external tracks are launched in parallel by cross-model-dispatcher-agent, in a single message with three Agent tool calls, “to maximize parallelism.” 62
The synthesis pass is mechanical and bounded. /spec-synthesize reads all four track outputs and runs a synthesis prompt that must:
1. Merge duplicate findings across models (keep highest severity).
2. Preserve unique findings from any single model.
3. Discard false positives that contradict codebase reality.
4. For steelman findings: identify arguments that change recommendations.
5. Apply accepted changes directly to the spec file.
Graceful degradation is bounded by the same skill: synthesis can proceed if the Claude track and at least one external track completed, and must stop if the Claude track and all three external tracks are missing. 64
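The degradation rule can be written as a small decision function. This is a sketch, not the skill's code: the function name and return labels are hypothetical, and the "undetermined" result marks combinations the text does not cover (for example, external tracks present but the Claude track missing).

```python
from typing import Iterable

EXTERNAL_TRACKS = ("codex", "grok", "gemini")


def synthesis_decision(completed_tracks: Iterable[str]) -> str:
    """Classify a set of completed tracks as proceed, stop, or undetermined."""
    completed = set(completed_tracks)
    claude_done = "claude" in completed
    externals_done = [t for t in EXTERNAL_TRACKS if t in completed]
    if claude_done and externals_done:
        # Claude plus at least one external track: synthesis can proceed.
        return "proceed"
    if not claude_done and not externals_done:
        # Nothing to synthesize from: must stop.
        return "stop"
    # Partial combinations the text leaves unspecified.
    return "undetermined"


if __name__ == "__main__":
    print(synthesis_decision({"claude", "codex", "grok", "gemini"}))
    print(synthesis_decision({"claude", "grok"}))
    print(synthesis_decision(set()))
```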
The reason this matters is the same reason the ensemble literature in section 2 matters. Each of the four model families has a different training distribution. Each has a different sycophancy profile. Each fails differently. A spec that survives Phase 2.1 OODA Loop, Phase 2.2 Red Team, and Phase 2.3 Steelman across four model families with no surfaced contradiction is a spec where the most obvious failure modes of any one model have been canceled out by the others. A spec where one track flags an issue the others miss is exactly the case /spec-synthesize is built to escalate — the synthesis rule explicitly preserves unique findings from any single model, because a single model's unique flag is the highest-information signal the ensemble produces. 65
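The first two synthesis rules, merging duplicates at the highest severity and preserving unique findings, can be illustrated with a short merge routine. In the workflow this judgment is made by a synthesis prompt, not by code, so the sketch is only an analogy. The Finding fields, the dedup key, and the reuse of the CRITICAL/HIGH/MEDIUM/LOW scale from /5pass are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


@dataclass(frozen=True)
class Finding:
    key: str       # hypothetical dedup key, e.g. "spec.md:120"
    severity: str
    track: str     # claude, codex, grok, or gemini
    summary: str


def merge_findings(findings: Iterable[Finding]) -> List[Tuple[Finding, List[str]]]:
    """Return one finding per key at its highest severity, with the tracks that raised it."""
    merged: Dict[str, Finding] = {}
    raised_by: Dict[str, List[str]] = {}
    for finding in findings:
        raised_by.setdefault(finding.key, []).append(finding.track)
        current = merged.get(finding.key)
        if current is None or SEVERITY_RANK[finding.severity] > SEVERITY_RANK[current.severity]:
            merged[finding.key] = finding
    return [(merged[key], sorted(set(raised_by[key]))) for key in merged]


if __name__ == "__main__":
    sample = [
        Finding("spec.md:12", "MEDIUM", "claude", "unbounded retry"),
        Finding("spec.md:12", "HIGH", "grok", "unbounded retry"),
        Finding("spec.md:40", "LOW", "gemini", "missing rollback note"),
    ]
    for finding, tracks in merge_findings(sample):
        marker = "unique" if len(tracks) == 1 else "shared"
        print(finding.key, finding.severity, tracks, marker)
```

A finding raised by a single track is kept rather than voted down. That is the point of the second rule.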
The leverage of four tracks is not redundancy. It is variance reduction at the lens boundary, exactly as the multi-model judging literature predicts. 66
The three lenses — OODA, Red Team, Steelman
The Phase 2 adversarial review is structured around three named lenses. The lens names appear in the manifest as the name field of each Phase 2 entry and as the suffix of each external prompt template path. 67 The lenses, in manifest order: OODA Loop, Red Team, Steelman.
Phase 2.1 — OODA Loop
The OODA Loop lens is borrowed from John Boyd, whose 1976 essay Destruction and Creation laid the conceptual groundwork for the four-step Observe–Orient–Decide–Act cycle that he later developed in his briefings as a model for adversarial competition under uncertainty. 68 The manifest's Phase 2.1 specifies thinking_directive: "Think at maximum depth. Use interleaved thinking between each mental model. After evaluating each branch, reflect on what the evidence reveals." 69 The Claude track runs in-process via thoughtbox:strategic-reasoner-agent at opus/max effort; 70 the three external tracks are rendered via prompt templates at skills/spec-external-prompts/assets/2.1-ooda-{codex,grok,gemini}.md and dispatched by their reviewer agents. 71
OODA's contribution to the review surface is reflective. It asks the spec to articulate its own observation–orientation–decision–action sequence: what does the system observe, how does it orient against that observation, what decision rules fire, what action is taken. A spec that cannot pass an OODA review is a spec whose decision points are implicit: the reviewer cannot trace, from input to action, how a particular outcome arises. Series 1's Joint 4 (the cache key missing model_version) is an OODA failure under a generous reading: the system observes a request, orients against (workspace_id, codebase_id, pack_id, pack_version, commit_sha), decides on a cache hit, and acts by serving the cached verdict, but the orient step omits the model identifier that determines whether the cached verdict is still valid. 72 An OODA pass that asks “what do we observe about the cached verdict's provenance” surfaces the gap before the cache row is written.
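The cache-key gap is easy to reproduce in miniature. The sketch below uses the six field names from the text; the request dictionaries, the helper names legacy_key and versioned_key, and the sample values are hypothetical.

```python
from typing import Dict, Tuple


def legacy_key(request: Dict[str, str]) -> Tuple[str, ...]:
    """The five-field key described in the text, with no model identifier."""
    return (
        request["workspace_id"],
        request["codebase_id"],
        request["pack_id"],
        request["pack_version"],
        request["commit_sha"],
    )


def versioned_key(request: Dict[str, str]) -> Tuple[str, ...]:
    """The same key with model_version appended."""
    return legacy_key(request) + (request["model_version"],)


if __name__ == "__main__":
    base = {
        "workspace_id": "ws-1",
        "codebase_id": "cb-1",
        "pack_id": "pack-1",
        "pack_version": "3",
        "commit_sha": "abc123",
    }
    old_model = {**base, "model_version": "model-a"}
    new_model = {**base, "model_version": "model-b"}

    cache = {legacy_key(old_model): "verdict from model-a"}
    # Stale hit: the new model is served the old model's verdict.
    print(legacy_key(new_model) in cache)

    cache = {versioned_key(old_model): "verdict from model-a"}
    # Miss: the verdict is recomputed for the new model.
    print(versioned_key(new_model) in cache)
```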
Phase 2.2 — Red Team
The Red Team lens is borrowed from the security and military-planning red-teaming literature, where an independent team is assigned to attack a plan with the explicit objective of finding the failure that the planning team has missed. 73 The manifest's Phase 2.2 specifies thinking_directive: "Think at maximum depth. For each attack vector, exhaust all failure modes. Do not satisfice." 74 The phase declares the /scan-security skill in its skills list, anchoring the review to the security surface of the spec. 75
The output discipline matches /5pass's Pass 1. Each finding is an attack: an actor with a defined capability, an entry point the spec exposes, a result the attack achieves, a blast radius for the resulting compromise. A Red Team pass that returns “no significant attack vectors found” is, like its 5pass counterpart, a failed pass — the lens contract is to find attacks, not to declare safety.
Across the four model tracks of Phase 2.2, the attacks vary by training distribution. One track may emphasise injection attacks against the LLM surface; another may emphasise schema-level constraints that allow privilege escalation; another may emphasise the deployment topology. The synthesis rule preserves unique findings, so the spec is shaped by the union of the four attack lenses, not by the lowest common denominator.
Phase 2.3 — Steelman
The Steelman lens is borrowed from the rationalist literature on argumentation, where “steelmanning” denotes the practice of constructing the strongest possible version of a position the steelmanner intends to argue against, rather than the weakest. 76 The manifest's Phase 2.3 specifies thinking_directive: "Apply the steelmanning mental model. Construct the STRONGEST possible argument for each alternative." 77 The effort is high rather than max, and the synthesis input list includes the steelman track outputs as a peer category, weighted in the synthesis rule by the criterion “for steelman findings: identify arguments that change recommendations.” 78
The contribution of the steelman lens to the review surface is the protection of the spec's alternatives. A spec that has selected approach A over approach B is a spec where approach B was, by definition, considered and rejected. A steelman of approach B — constructing the strongest possible version of B — is the test of whether the rejection was earned. If the strongest B is dominated by the selected A, the decision is robust. If the strongest B is not dominated, the spec's selection of A is unearned and the recommendation must change.
The three Phase 2 lenses are not interchangeable. OODA tests the spec's internal reasoning sequence. Red Team tests the spec's resistance to adversarial attack. Steelman tests the spec's rejected alternatives. A spec that passes all three across four model tracks is a spec where the obvious internal contradictions, the obvious attack surfaces, and the obvious unearned rejections have all been surfaced and either resolved or escalated to the synthesis verdict.
The other phase-level lenses
Phase 1.3 — Inversion Analysis — applies a fourth lens that the Phase 2 group does not. 79 Inversion asks what the design looks like if every deferral is permanent — a habit borrowed from Charlie Munger's lattice of mental models, where inverting the problem (“how do we lose?”) is a structurally distinct test from approaching it directly. 80 Phase 7.1 — OODA Closure Loop — applies the OODA lens a second time at the workflow's terminus, after the spec has been scrubbed, anchored, and imported to the issue tracker, as a last quality gate before implementation begins. 81
The lenses are not exhaustive. They are the operationalised subset the plugin's authors have committed to as the default manifest, with the explicit affordance that custom manifests can extend or replace them. 82 What the lenses are not is arbitrary. Each has a published intellectual lineage; each has a workflow gate that prevents the spec from advancing without producing the lens's artifact.
Citations as the boundary contract
The acceptance criterion that gates every artifact produced by /5pass and /spec-init is a single rule. The rule lives at ~/.claude/rules/core/citations.md and is short enough to quote in full. 83 What follows is the verbatim text of that file, every word.
EVERY factual claim in a generated document MUST be cited. No exceptions.
What Requires a Citation
• Any security finding or vulnerability
• Any HIPAA/compliance assessment
• Any DORA metric or score
• Any architectural claim (“uses STI”, “no audit log”, etc.)
• Any reference to a file, function, or configuration value
• Any metric, count, or measurement (commit count, file size, line count)
Citation Format
Format reference: ~/.claude/configs/references/citation-formats.md
Enforcement
• Never write “no audit log exists” — cite the absence: checked Gemfile (no paper_trail, audited, activerecord-auditable gems present)
• Never write “plaintext storage” — cite the schema migration or model file with line number
• Never write “disabled in config” — quote the exact commented-out line with a link
The rule is short, but every clause does work. Walk it.
Every factual claim must be cited. The rule is universal. There is no carve-out for claims the author believes to be common knowledge, no carve-out for claims the author has carried over from a prior draft, no carve-out for claims a senior reviewer has approved. The claim is cited or the claim is rejected at the boundary.
Any security finding or vulnerability. A security finding without a citation is a guess wearing the costume of a vulnerability report. The rule does not distinguish a CVSS-9.8 finding from a low-severity audit suggestion; both require the same anchoring.
Any HIPAA/compliance assessment. The compliance surface is the highest-cost surface for a non-cited claim. A claim that a system “is HIPAA-compliant” or “fails HIPAA obligation X-3.2” carries regulatory weight; the citation requirement is what converts the claim from opinion to a defensible position.
Any DORA metric or score. The DevOps Research and Assessment metrics — deployment frequency, lead time, mean time to restore, change failure rate — are quantitative. 84 A quantitative claim without a measurement source is a number invented for plausibility.
Any architectural claim. The example clauses in the rule — “uses STI”, “no audit log” — are exactly the failure mode of the single-pass review. “No audit log” is a negative claim about the system, and negative claims are the easiest to make and the hardest to verify; the rule's enforcement section converts this from a freebie into a verbatim anchor: cite the absence by listing the gems the Gemfile does not contain.
Any reference to a file, function, or configuration value. This is the granular clause that makes the rule operationally tractable. Every file path mentioned in a finding must be a real file path. Every function name must be a real function name. Every configuration value must point to the configuration file and line where it lives.
Any metric, count, or measurement. The catch-all. “The codebase is 16.9k lines of Python” is a measurement. “The branch has 394 commits” is a count. Both require a source — a tokei output, a git rev-list --count invocation — and the source is the citation.
The enforcement section is what converts the rule from aspiration to mechanism. “Never write ‘no audit log exists’ — cite the absence” forces the author into a specific kind of artifact: a Bash check (grep -r 'paper_trail' Gemfile) and the output (empty), which is itself a decision trace. “Never write ‘plaintext storage’ — cite the schema migration or model file with line number” forces the same: open the file, find the line, point at it. “Never write ‘disabled in config’ — quote the exact commented-out line with a link” forces the same again.
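Citing an absence can be mechanised along these lines. The sketch parses Gemfile text for declared gems and emits a sentence in the shape the rule quotes. The function name, the regular expression, and the wording of the "present" branch are assumptions, and a real check would also need to account for Gemfile.lock and for gems pulled in by groups or gemspecs.

```python
import re
from typing import Sequence

# The three gem names listed in the rule's enforcement example.
AUDIT_GEMS = ("paper_trail", "audited", "activerecord-auditable")

# Matches uncommented lines such as: gem 'rails', '~> 7.0'
GEM_LINE = re.compile(r"^\s*gem\s+['\"]([^'\"]+)['\"]", flags=re.MULTILINE)


def cite_gem_absence(
    gemfile_text: str, gems: Sequence[str] = AUDIT_GEMS, source: str = "Gemfile"
) -> str:
    """Return a citation sentence stating which of the gems were checked and found or not found."""
    declared = set(GEM_LINE.findall(gemfile_text))
    present = [gem for gem in gems if gem in declared]
    if present:
        return f"checked {source} ({', '.join(present)} present)"
    return f"checked {source} (no {', '.join(gems)} gems present)"


if __name__ == "__main__":
    sample = "source 'https://rubygems.org'\ngem 'rails'\n# gem 'paper_trail'\n"
    print(cite_gem_absence(sample))
```

The commented-out line in the sample does not count as a declaration, so the negative claim is made only after the parse confirms it.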
A finding that survives this rule is a finding that another engineer, six months from now, can replay. A finding that does not survive this rule is, by the rule's definition, not a finding.
The rule applies to every artifact the workflow produces. A /5pass Pass 1 finding without a file:line anchor for the attack it claims to mount is a failed finding. A /5pass Pass 4 finding without a Bash command and output is a failed finding. A /spec-init Phase 2.1 OODA artifact without a citation for every claim it makes about the spec is a failed phase. A /spec-init Phase 6.1 beads-validation artifact whose claims about the spec's beads-format conformance are not anchored to specific spec line numbers is a failed validation.
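A mechanical first filter for the file:line requirement might look like the sketch below. The anchor shape (a path with an extension, a colon, a line number) is an assumption made for illustration; the rule as quoted does not define a regular expression, and both function names are hypothetical.

```python
import re

# Assumed anchor shape: path-like token with an extension, then ":<line>",
# for example "db/schema.rb:118". The real rule may accept other forms.
ANCHOR = re.compile(r"[\w./-]+\.\w+:\d+")


def has_anchor(finding: str) -> bool:
    """Return True if the finding text contains at least one file:line anchor."""
    return ANCHOR.search(finding) is not None


def failed_findings(findings: list[str]) -> list[str]:
    """Return the findings that carry no anchor and therefore fail the rule."""
    return [finding for finding in findings if not has_anchor(finding)]
```

A check like this only establishes that an anchor is present. Whether the anchor points at a real line that supports the claim still has to be verified against the repository.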
The verbatim text above is the entire rule. It is not a heuristic; it is the boundary contract. The work of the firm is to refuse anything that does not satisfy it.
Decision traces — the second boundary
Citations close the loop for facts. The complementary mechanism — decision traces — closes the loop for choices. A factual claim is anchored to its source; a decision is anchored to the alternatives considered, the evidence weighed, and the reason for the selection. Both guard against the same failure under different names: a claim with no audit path back to the artifact that grounds it.
The decision-trace pattern is established in the prior whitepaper The Koth Oracle, In Depth, which catalogued the KellerAI internal practice of recording, for every non-trivial design decision, a four-field artifact: the options considered, the evidence weighed for each option, the selected option, and the rationale. 85 The pattern is mechanical. Where a git commit records the change, a decision trace records the reasoning that produced the change. Where a PR description records the what, the decision trace records the why.
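The four-field artifact described above maps naturally onto a small record type. The sketch below is one possible shape, with hypothetical class and field names; the prior whitepaper's actual file format is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionTrace:
    """Hypothetical four-field decision trace; names are illustrative."""

    options: list[str]        # the options considered
    evidence: dict[str, str]  # the evidence weighed, keyed by option
    selected: str             # the selected option
    rationale: str            # why that option was selected

    def __post_init__(self) -> None:
        # A trace that selects an unlisted option, or that weighs no
        # evidence for a listed one, is incomplete by construction.
        if self.selected not in self.options:
            raise ValueError("selected option must be one of the options considered")
        missing = [o for o in self.options if o not in self.evidence]
        if missing:
            raise ValueError(f"no evidence recorded for: {missing}")

    def to_text(self) -> str:
        """Render the trace as plain text suitable for a file on disk."""
        lines = ["Options considered:"]
        lines += [f"  - {o}: {self.evidence[o]}" for o in self.options]
        lines += [f"Selected: {self.selected}", f"Rationale: {self.rationale}"]
        return "\n".join(lines)
```

Validating at construction time means a half-filled trace cannot be written to disk and later mistaken for a complete one.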
The /spec-init workflow operationalises decision traces structurally. Phase 1.3 — Inversion Analysis — is typed as reasoning and its output is written with strategy: agent_write, which means the agent that performs the inversion writes its own output file — a durable artifact at artifacts/{spec_slug}/phase-1.3-inversion.md — and that artifact contains the alternative formulations of the spec the inversion produced. 86 The artifact survives the LLM session that produced it; it survives the engineer's rotation off the project; it survives the agent's eventual replacement by a successor. A future engineer reading the artifact can reconstruct, from durable evidence, what alternatives the spec considered and why each was rejected.
The same pattern repeats across the manifest. Phase 2.3 Steelman writes a track output per model family; 87 the steelman synthesis preserves the strongest version of each rejected alternative. Phase 5.1 Second Error Scrub runs three iterations, each writing a numbered file artifacts/{spec_slug}/phase-5-errorscrub-{n}.md, so the convergence (or divergence) of the scrub across iterations is itself a decision trace about the spec's stability. 88 Phase 7.1 OODA Closure Loop writes the final quality gate's reasoning to artifacts/{spec_slug}/phase-7.1-ooda-closure.md — the last decision trace before implementation begins. 89
The decision-trace discipline is also encoded in the orchestrator's spec-versioning rule. For every phase with modifies_spec: true, the orchestrator computes a SHA-256 hash of the spec at phase start, stores it in the progress entry, creates a snapshot at snapshots/post-phase-{id}-pass-{n}.md after the phase completes, and emits a git commit with the message chore({spec_slug}): Phase {id} {name}. 90 The progression of snapshots and commits is itself a decision trace: a future reader can walk the spec backward through its phases, observe what changed at each gate, and reconstruct the reasoning that produced the final document.
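The three computed values in the versioning rule (the hash, the snapshot path, and the commit message) can be sketched with the standard library. The path and message templates are the ones quoted above; the function names are hypothetical, and the orchestrator's own code is not reproduced here.

```python
import hashlib


def spec_hash(spec_text: str) -> str:
    """SHA-256 hex digest of the spec contents, taken at phase start."""
    return hashlib.sha256(spec_text.encode("utf-8")).hexdigest()


def snapshot_path(phase_id: str, pass_n: int) -> str:
    """Snapshot location, following the template quoted in the text."""
    return f"snapshots/post-phase-{phase_id}-pass-{pass_n}.md"


def commit_message(spec_slug: str, phase_id: str, name: str) -> str:
    """Commit message, following the template quoted in the text."""
    return f"chore({spec_slug}): Phase {phase_id} {name}"
```

Comparing the stored hash with a fresh hash of the spec at the next phase start is one way to detect an edit made outside the workflow, although the source does not say whether the orchestrator performs that comparison.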
Decision traces convert tacit reasoning into durable artifact. The LLM is stateless across sessions; the engineer rotates off the project; the regulator's question arrives months after the decision. The only thing that survives all three is the file on disk. The only thing that makes the file on disk useful is the discipline that produced it: every claim cited, every decision traced.
Cost, friction, and the honest accounting
Running fourteen phases across four model tracks is not free. The honest accounting matters because the workflow is the answer to a critique that itself made a cost argument — Series 1's claim was that the absence of upstream review compounds into the production bill. 91 The reply has to take its own cost seriously.
The dominant cost is token spend. Each cross-model phase dispatches four parallel tracks; each track invokes a frontier-class model under a prompt that includes the full spec; each spec iteration through the workflow runs three cross-model phases (2.1, 2.2, 2.3); each iterative phase (3.1, 5.1) runs multiple passes (five for 3.1, three for 5.1). 92 A single full pass through the manifest dispatches, by the manifest's structure, on the order of a dozen frontier-model invocations across multiple model families. A naïve cost model — assume parity across the four tracks, assume the average phase uses 10% of a model's context window — puts a typical spec polish in the tens of dollars of token spend per iteration.
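The structural part of that accounting is simple arithmetic and can be written down directly. In the sketch below, the phase identifiers, track names, and pass counts come from the text; the context-window size and per-token price are left as parameters because the source gives neither, and the function names are hypothetical.

```python
# Structure taken from the paragraph above.
CROSS_MODEL_PHASES = ("2.1", "2.2", "2.3")
TRACKS = ("claude", "codex", "grok", "gemini")
ITERATIVE_PASSES = {"3.1": 5, "5.1": 3}


def cross_model_dispatches() -> int:
    """Track dispatches in one trip through the cross-model phases."""
    return len(CROSS_MODEL_PHASES) * len(TRACKS)


def estimate_usd(invocations: int, context_window_tokens: int,
                 fraction_used: float, usd_per_million_tokens: float) -> float:
    """Naive cost model: every invocation uses the same share of the window.

    All four arguments are assumptions supplied by the caller; none of
    them is stated in the source.
    """
    tokens = invocations * context_window_tokens * fraction_used
    return tokens / 1_000_000 * usd_per_million_tokens
```

The three cross-model phases alone account for twelve track dispatches. Any dollar figure depends entirely on the window size and price passed in, so the estimate should be recomputed against a current price sheet instead of quoted from memory.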
The wall-clock cost is real but parallelizable. The Phase 2 group is declared parallel: true with a barrier at all_complete, 93 which means the three Phase 2 phases run simultaneously and the workflow advances only after the last one synthesises. Within each cross-model phase, the four tracks run in parallel; within each subagents_first phase (Phase 1.1, Phase 1.2, Phase 3.1), the sub-recon agents launch in parallel and the primary agent waits for the union of their outputs. 94 A spec polish that takes a senior engineer a week of wall-clock time in solo review can complete in hours of orchestrator wall-clock time — at the cost of the token spend above.
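The barrier behaviour described here, where nothing downstream starts until every parallel branch has returned, is what asyncio.gather provides in Python. The sketch below uses a stand-in for the model call and a placeholder for synthesis; it illustrates the control flow only and is not the orchestrator's dispatch code. The function names are hypothetical.

```python
import asyncio

TRACKS = ("claude", "codex", "grok", "gemini")


async def run_track(phase_id: str, track: str) -> str:
    """Stand-in for one model track; a real track would invoke a model."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"{phase_id}/{track}"


async def run_cross_model_phase(phase_id: str) -> list[str]:
    # gather() resolves only after every track has returned, so the
    # synthesis step below never sees a partial set of outputs.
    outputs = await asyncio.gather(*(run_track(phase_id, t) for t in TRACKS))
    return sorted(outputs)  # placeholder for the synthesis step


async def run_phase_group(phase_ids: tuple[str, ...]) -> dict[str, list[str]]:
    # Group-level barrier: advance only when all phases have synthesised.
    results = await asyncio.gather(*(run_cross_model_phase(p) for p in phase_ids))
    return dict(zip(phase_ids, results))
```

Calling asyncio.run(run_phase_group(("2.1", "2.2", "2.3"))) returns one synthesised entry per phase, each built from all four track outputs.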
The engineering attention cost is the smallest but the most underestimated. The orchestrator does the work, but the orchestrator is not an oracle; it is a coordinator that dispatches specialist agents and synthesises their outputs. The engineer responsible for the spec still reviews the synthesis artifacts, still resolves the contradictions surfaced by Pass 3 (in /5pass) or by the synthesis pass (in /spec-init), and still decides whether the verdict permits advancement. The workflow does not replace the engineer; it produces, for the engineer, an audit-trail-bearing artifact that the engineer can review in a fraction of the time a from-scratch review would take.
Set against this cost: the Series 1 bill. A single one of the five joints — the cache key missing model_version — produces, in production, a stream of stale verdicts served as authoritative on the next request, indistinguishable from fresh verdicts, with regulatory exposure on the compliance surface. 95 The cost of catching it in a Phase 2.2 Red Team pass across four model tracks is the cost of one Red Team phase. The cost of not catching it is paid by the customer the next time the customer trusts an outdated finding.
The honest accounting reads two ways. If the spec is small, the workflow is overkill — six pass-files for a one-page RFC is theatre. If the spec is load-bearing, the workflow is the cheapest insurance the firm can buy against the bill in Series 1. The standard is to apply the workflow where the cost of an undetected gap exceeds the cost of catching it; the rule is to apply the citation requirement everywhere, because citations cost nothing to produce while reasoning is fresh and cost almost nothing more than that to maintain.
Mapping back to the bill
The promise made in the introduction was that each of the five joints catalogued by Series 1 would be mapped to the specific pass or phase that would have caught it. That mapping follows, in the order Series 1 listed the joints.
Joint 1 — Durability under SIGKILL
5pass: Pass 2 · spec-init: Phase 1.3
Series 1's first joint was the in-memory doc_contents rollback buffer, which restores documents only on asyncio.CancelledError and is freed by the kernel on a pod SIGKILL, leaving the project in THINKING forever with no orphan reaper to recover it. 96
The pass that catches this is /5pass Pass 2 — Errors of Omission. The lens question — “what failure modes are unaddressed? what rollback paths are missing?” 97 — points directly at the absence: the revision orchestrator has no reap_orphans equivalent, although the sibling drift orchestrator does, in the next directory over. 98 A Pass 2 agent reading the revision orchestrator alongside the drift orchestrator would have flagged the asymmetry in its first pass over the file system.
The phase that catches this upstream is /spec-init Phase 1.3 — Inversion Analysis. The lens question — what does the design look like if every deferral is permanent? — converts the PR author's verbatim acknowledgement (“Doesn't cover pod-crash recovery — that's a separate problem we've deferred”) from a deferral into a design constraint. 99 A Phase 1.3 artifact that holds the deferral as permanent surfaces the savepoint-and-reaper pattern as the required version, not as a future improvement.
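The reaper half of the savepoint-and-reaper pattern can be sketched as a pure function over project records. The name reap_orphans echoes the drift orchestrator's function mentioned above, but the signature, the Project type, the timeout, and the IDLE recovery state are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Project:
    """Hypothetical project record; field names are illustrative."""

    id: str
    status: str           # for example "THINKING" or "IDLE"
    updated_at: datetime


def reap_orphans(projects: list[Project], now: datetime,
                 timeout: timedelta) -> list[str]:
    """Reset projects stuck in THINKING for longer than `timeout`.

    Returns the ids that were reaped so the caller can log them.
    """
    reaped = []
    for project in projects:
        if project.status == "THINKING" and now - project.updated_at > timeout:
            project.status = "IDLE"  # assumed recovery state
            project.updated_at = now
            reaped.append(project.id)
    return reaped
```

A reaper of this kind recovers the stuck status only. Restoring document contents after a SIGKILL additionally needs the savepoint to live somewhere that outlives the process, which an in-memory buffer does not.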
Joint 2 — Audit-trail completeness
5pass: Pass 2 + Pass 4 · spec-init: Phase 6.1
Series 1's second joint was the missing thread_status_history table — the current status is overwritten in place every transition, with Kai identified through the magic author_user_id IS NULL check rather than through an actor_kind enum. 100
The pass that catches this is /5pass Pass 2 — Errors of Omission, with reinforcement from Pass 4. The lens question for Pass 2 — “what is NOT in the target that should be?” — points at the missing audit table; the lens question for Pass 4 — re-verify every factual claim against live state — points at the magic-NULL convention by running a Bash check against the database layer of the KellerAI internal codebase (searching for the actor-identity pattern) that surfaces the convention's prevalence and the absence of a typed alternative. 101
The phase that catches this upstream is /spec-init Phase 6.1 — Validate Beads Format. The phase's typing is validation, and its job is to assert that the spec's structural requirements are met before the spec is imported as an issue. 102 A spec that claims to introduce a state machine but does not introduce a history table for that state machine fails the validation; the artifact records the failure with a verbatim citation to the line where the state machine was declared without its companion history table.
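The two repairs named for this joint, an append-only history and a typed actor, fit in a short in-memory sketch. The table name thread_status_history and the actor_kind enum come from the text; the class names, enum members, and status strings below are hypothetical, and a real implementation would persist the history in the database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class ActorKind(Enum):
    """Typed actor identity, replacing the author_user_id IS NULL check."""

    USER = "user"
    AGENT = "agent"


@dataclass(frozen=True)
class StatusTransition:
    from_status: str
    to_status: str
    actor_kind: ActorKind
    at: datetime


@dataclass
class Thread:
    status: str
    history: list[StatusTransition] = field(default_factory=list)

    def transition(self, to_status: str, actor_kind: ActorKind) -> None:
        # Append the transition before overwriting, so the prior state
        # and the actor that changed it are never lost.
        self.history.append(
            StatusTransition(self.status, to_status, actor_kind,
                             datetime.now(timezone.utc))
        )
        self.status = to_status
```

With the history in place, the question of who moved a thread into a given state becomes a lookup, not an inference from a NULL column.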
Joint 3 — Cost metering
5pass: Pass 4 · spec-init: Phase 7.1
Series 1's third joint was the hardcoded cost_cents=0, tokens=0 in finalize_drift_run, with the in-code comment “Kai doesn't yet expose token usage” that was true months ago and is no longer. 103
The pass that catches this is /5pass Pass 4 — Misconceptions About Ground Truth. The lens question — for every factual claim in the target, run the minimal Bash command that confirms or refutes it — applied to the in-code comment surfaces the refutation in a single command: the claude-agent-sdk exposes usage on response objects; the comment is stale. 104 The Pass 4 finding records the claim (“SDK does not expose usage”), the command (a check against the SDK's current type annotations), the output (usage fields are present), and the verdict (refuted).
The phase that catches this upstream is /spec-init Phase 7.1 — OODA Closure Loop. The phase is the last quality gate before implementation, with thinking_directive: "Think at maximum depth. This is the last quality gate before implementation begins." 105 A spec that ships with cost_cents=0 is a spec whose decision rules do not include a cost-control branch; a Phase 7.1 artifact that traverses the spec's OODA loop surfaces the absent decision rule as an unresolved gap that must not pass the gate.
Joint 4 — Model provenance
5pass: Pass 4 · spec-init: Phase 2.2
Series 1's fourth joint was the cache key on drift_findings_cache that omits model_version — every cached verdict produced by a deprecated snapshot continues to be served as authoritative on the next request. 106
The pass that catches this is /5pass Pass 4 — Misconceptions About Ground Truth. The lens question — re-verify every factual claim against live state — applied to the cache row surfaces the absent column with a Bash check against the cache migration in the KellerAI internal codebase (inspecting the UniqueConstraint columns) that lists the columns in the unique constraint and exposes the omission. 107
The phase that catches this upstream is /spec-init Phase 2.2 — Red Team, across the four model tracks. The lens question — for each attack vector, exhaust all failure modes; do not satisfice — is the lens under which a model identifier omitted from the cache key is an attack: a deployment that updates the model parameter without invalidating the cache rows produces stale verdicts served as fresh, which is a category of compromise that at least one of the four model tracks will surface. 108 The synthesis preserves the unique finding from whichever track caught it, the spec is amended before the cache schema is finalised, and the bill is not paid in production.
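The failure mode is easy to reproduce with a toy cache whose only variable is whether model_version participates in the key. The class below is an in-memory illustration with hypothetical names (VerdictCache, doc_hash); it does not reflect the actual columns of drift_findings_cache beyond the omission the text describes.

```python
from typing import Optional


class VerdictCache:
    """Toy in-memory cache; the key layout is the only thing under study."""

    def __init__(self, key_includes_model_version: bool) -> None:
        self._key_includes_model_version = key_includes_model_version
        self._rows: dict[tuple, str] = {}

    def _key(self, doc_hash: str, model_version: str) -> tuple:
        if self._key_includes_model_version:
            return (doc_hash, model_version)
        return (doc_hash,)  # the omission: model_version is not in the key

    def put(self, doc_hash: str, model_version: str, verdict: str) -> None:
        self._rows[self._key(doc_hash, model_version)] = verdict

    def get(self, doc_hash: str, model_version: str) -> Optional[str]:
        return self._rows.get(self._key(doc_hash, model_version))
```

With the version left out of the key, a verdict stored under one model is returned for a request made under another. With the version included, the same request is a cache miss and triggers a fresh evaluation.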
Joint 5 — Transactional semantics vs. ad-hoc state machines
5pass: Pass 1 · spec-init: Phase 2.2 + Phase 4.2
Series 1's fifth joint was the headline: the _run_task retry loop using a file-poll contract for the structured output of classify_kai_mention and generate_revision, three function definitions after the point in the same file where can_use_tool is correctly wired for AskUserQuestion. 109
The pass that catches this is /5pass Pass 1 — Red-Team. The lens question — assume the target is wrong somewhere; run live-state Bash checks where the target is a plan against a real repo — applied to the spec's claim of structured-output enforcement surfaces the contradiction in a single check against the agent module in the KellerAI internal codebase: searching for the SDK primitive shows it wired for one call (AskUserQuestion) and absent at the two structured-output call sites three definitions later. 110 The absence of analogous wiring at those sites is the asymmetry the red-team finding is built to flag. 111
The phase that catches this upstream is /spec-init Phase 2.2 — Red Team, complemented by Phase 4.2 TDD Anchoring. The Phase 2.2 lens flags the file-poll contract as an attack surface; the Phase 4.2 lens — anchor every claim to a test — flags the absence of a test that exercises the retry path under a missing-file failure mode and asserts the SDK primitive is used instead. 112 A spec that does not anchor classify_kai_mention to a test of structured output via can_use_tool fails Phase 4.2; the failure is surfaced before the implementation begins, not after the retry loop has paid three rounds of fifty-turn subprocess cost in production.
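The asymmetry check for this joint, finding which functions in a module reference a given primitive and which do not, can be approximated with Python's ast module. The sketch assumes the target module is Python source and uses a hypothetical helper name; it inspects names, attributes, and keyword arguments only, so it is a coarse first pass and not a substitute for reading the call sites.

```python
import ast


def functions_referencing(source: str, name: str) -> dict[str, bool]:
    """Map each function defined in `source` to whether it mentions `name`.

    A mention is a bare name, an attribute access, or a keyword argument
    spelled `name`. Strings and comments are deliberately ignored.
    """
    tree = ast.parse(source)
    result: dict[str, bool] = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            result[node.name] = any(
                (isinstance(n, ast.Name) and n.id == name)
                or (isinstance(n, ast.Attribute) and n.attr == name)
                or (isinstance(n, ast.keyword) and n.arg == name)
                for n in ast.walk(node)
            )
    return result
```

Run against a module with several agent entry points, the functions mapped to False are the candidates for the kind of finding described above: places where the primitive is available in the file but not wired in.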
The mapping is exhaustive. Every joint in the Series 1 bill maps to at least one specific pass in /5pass and at least one specific phase in /spec-init. The reader who arrived at this paper from the Series 1 critique sees, here, the bill that does not come — provided the workflow is applied at the gate where the joint is introduced, not after the joint has shipped.
What this paper is, plain
Every claim in this paper is anchored to a footnote. Every footnote points to a real artifact: a file path on disk, a published paper, a verbatim quote from the plugin source, or a section of a prior whitepaper in this series. Where the prompt that initiated this paper named a thing the source does not contain — model version strings like “Claude Opus 4.7” or a phase count of “fourteen rounds” rather than “fourteen phases” — the paper used the source's identifier and noted the discrepancy in a footnote. 113
This is the standard the paper is arguing for, applied to the paper itself. The five-pass skill and the fourteen-phase manifest were designed by their authors as durable decision traces — artifacts that survive the LLM session that produced them, the engineer who wrote them, and the eventual model rollover that retires the model that helped draft them. The same is true of this whitepaper. A future engineer reading this paper, three months from now, with the model that produced the first draft long since deprecated, can replay every claim against the source and either ratify it or refute it. A claim without a footnote is not a claim in this paper. A claim with a footnote that does not anchor to a real artifact is, by the rule's own enforcement clause, a guess.
The paper does not name any individual. The convention of the series is to discuss patterns, not people; failures are diagnostic, not punitive. 114 The engineers whose work is referenced in Series 1 wrote competent code under real time pressure, with the most reasonable review structures available to them at the time. The point of /5pass and /spec-init is to give those engineers — and their successors — a better review structure than was available when the joints in Series 1 were written. The workflow is a gift to the future version of the team, written by the present version, expressly to make the bill smaller.
The discipline scales down as well as up. The full fourteen-phase workflow is overkill for a one-paragraph RFC. The full five-pass workflow is overkill for a thirty-line diff. The citation rule scales all the way down: every factual claim in every artifact, regardless of size, requires the same anchoring. The workflow is the heavy machinery; the citation rule is the standing order.
Closing — the rule, restated
The argument is finished. The five passes, by their source names — Red-Team, Errors of Omission, Sloppy Thinking and Logical Flaws, Misconceptions About Ground Truth, Synthesis — operate sequentially against a finished document and produce six durable artifacts plus a confidence verdict. The fourteen phases, by their source identifiers — 1.1 through 7.1 — operate upstream against a draft specification and produce, at each phase, a durable artifact with a dependency edge to the next phase. The four model tracks — claude, codex, grok, gemini — collapse the single-model failure surface by running the same lens through different training distributions and blocking synthesis until every track returns. The citation rule converts every factual claim into a verifiable artifact, every architectural assertion into a citation, every absence into a recorded check. The decision-trace discipline converts every choice into a durable record of the alternatives weighed, the evidence considered, and the rationale selected.
The whole standard fits on a sticky note above the merge button.
KellerAI only accepts Citations and Decision Traces. Anything else is just a guess.