Super RAG v2.1: Structured Retrieval for Compound Reasoning

KellerAI

Section 01

BLUF

The dominant retrieval-augmented generation pipeline — embed the query, retrieve by cosine similarity, hand the top-k chunks to the model — fails quietly and systematically on compound queries, contradictory sources, and long-horizon reasoning tasks. It fails not because the embeddings are wrong but because the architecture treats context selection as a side effect of similarity rather than as a deliberate computational stage with its own inputs, constraints, and quality signals.

This paper argues for a different model: structured retrieval, in which the retriever is an explicit reasoner that decomposes queries into subqueries, applies prefilters based on source authority, document type, and temporal lineage before consulting vector similarity, pins every retrieved fragment to an auditable citation, and assembles the context window under an explicit token budget. The evidence base is a production AI governance agent — the KellerAI Kai agent — whose prompts and supporting code demonstrate structural retrieval principles at each stage of a multi-step compliance analysis: from initial research through hypothesis formation to final generation. ¹

The companion brief, Super RAG v2, names the thesis. This paper demonstrates it.

Section 02

The Naive Pipeline's Failure Mode

The naive RAG pipeline is seductively simple: convert the user's query to an embedding, retrieve the top-k chunks whose embeddings are nearest in cosine distance, concatenate them into the prompt, generate. Each step is technically correct. The failure is architectural — the pipeline was designed for factual lookup, and it is being applied to compound reasoning.

Compound queries decompose naturally into multiple information needs that have different optimal source types. A query such as "does our current authentication design comply with the NIST 800-53 AC-3 control, given that the user session model changed in the last sprint?" contains at least four distinct retrieval targets: the canonical NIST control text, the current authentication design, the session model as it existed before the sprint, and the delta introduced by the sprint. A single embedding of the full query string produces a vector that is semantically between all four targets and precisely aligned with none of them. The top-k results will contain fragments from each domain, ranked by proximity to the aggregate rather than by relevance to the sub-question each fragment must answer.
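The "between all four targets" effect can be illustrated with a toy calculation. The sketch below is not a real embedding model: it assumes, purely for illustration, that the four retrieval targets sit on orthogonal axes and that the compound query embeds as their mean. The target names are hypothetical labels for the four information needs listed above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Assumption: idealized embeddings, one orthogonal axis per retrieval target.
targets = {
    "nist_control_text": [1.0, 0.0, 0.0, 0.0],
    "current_auth_design": [0.0, 1.0, 0.0, 0.0],
    "pre_sprint_session_model": [0.0, 0.0, 1.0, 0.0],
    "sprint_delta": [0.0, 0.0, 0.0, 1.0],
}

# Assumption: the compound query embeds as the mean of the four targets.
aggregate = [sum(column) / len(targets) for column in zip(*targets.values())]

# Similarity of the aggregate query vector to each individual target.
similarities = {name: cosine(aggregate, vec) for name, vec in targets.items()}
```

Under these assumptions the aggregate vector scores 0.5 against every target, whereas a subquery scoped to a single target would score 1.0 against it. Real embeddings are not orthogonal, so the actual gap will differ, but the direction of the effect is the same.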

Contradictory sources compound the problem. A compliance corpus inevitably contains both the original specification and its amendments, both the vendor's initial documentation and a later errata notice, both a design document and the code that diverged from it. Cosine similarity ranks all of these sources by their proximity to the query vector; it has no mechanism for preferring the more recent amendment over the original, the errata over the initial documentation, or the code over the design claim. The retriever surfaces both versions of each pair and leaves the model to resolve the contradiction. The model will attempt to do so, and will sometimes get it wrong in ways that are invisible without a citation trail.

Long-horizon tasks expose the third failure mode: context pollution. A multi-step analysis that must hold intermediate conclusions across several retrieval rounds cannot afford to fill the context window with high-similarity but low-relevance fragments at each step. The Kai agent addresses this directly by instructing the retriever to prefer _SUMMARY.md files over verbose governance-and-compliance source documents specifically to avoid context pollution. ² The preference is not encoded in an embedding; it is an explicit structural rule applied before similarity is consulted. That is the distinction this paper is built on.

Section 03

Query Decomposition as Retrieval Architecture

The first structural intervention is decomposition: replace the monolithic query with a sequence of subqueries, each scoped to a single information need, each assigned its own context budget and its own ranking signal. Decomposition is not a preprocessing step applied to a single retrieval call; it is the retrieval architecture.

The Kai agent's initial research prompt makes this architecture explicit. It distinguishes between an interaction_path — the explicit, named inputs the current task requires — and a workspace_path — the background workspace materials consulted for context. ³ These are not two similarity searches over the same corpus; they are two structurally distinct retrieval targets with different scoping rules, different token allocations, and different authority signals. The interaction path sources are high-authority, task-specific, retrieved first and held intact. The workspace path sources are contextual, lower priority, and subject to compression when the context window is constrained.

Decomposition also makes the retrieval plan inspectable. When the retrieval stage is a single similarity call, there is no artifact that records which information need each retrieved chunk was meant to satisfy, and therefore no basis for evaluating whether the retrieval succeeded at the sub-question level. When the retrieval stage is a sequence of named subqueries, each subquery is a checkable claim: "I needed the NIST AC-3 control text, and I retrieved this fragment." The audit trail for the final generation follows directly from the decomposition plan.
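One way to make the decomposition plan an inspectable artifact is sketched here. The class names, fields, budgets, and the citation string are all hypothetical; the paper does not show the Kai agent's data structures, so this is an illustration of the idea rather than its implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Subquery:
    name: str          # the information need this subquery serves
    text: str          # the scoped query text
    source_type: str   # hypothetical labels: "regulation", "design", "code"
    token_budget: int  # context budget assigned to this need
    fragments: list = field(default_factory=list)  # citations retrieved so far

@dataclass
class RetrievalPlan:
    query: str
    subqueries: list

    def unsatisfied(self):
        """Names of information needs that have no retrieved fragment yet."""
        return [s.name for s in self.subqueries if not s.fragments]

    def total_budget(self):
        """Sum of the per-subquery token budgets."""
        return sum(s.token_budget for s in self.subqueries)

plan = RetrievalPlan(
    query="Does the current authentication design comply with NIST 800-53 AC-3?",
    subqueries=[
        Subquery("nist_control_text", "NIST 800-53 AC-3 control text", "regulation", 800),
        Subquery("current_auth_design", "current authentication design", "design", 600),
        Subquery("pre_sprint_session_model", "session model before the sprint", "code", 400),
        Subquery("sprint_delta", "session model changes in the last sprint", "code", 400),
    ],
)

# Record a (hypothetical) fragment against the need it was retrieved for.
plan.subqueries[0].fragments.append("nist_800_53.md:lines[120,134]")
```

Because every fragment is attached to a named subquery, `plan.unsatisfied()` gives a direct answer to "which information needs are still open", which a single similarity call cannot provide.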

The classification prompt demonstrates a limiting case of decomposition: when the raw document corpus is absent, the decision trace itself becomes the context index. ⁴ Rather than retrieving from documents, the prompt retrieves from the structured record of prior decisions. This is decomposition applied recursively — the prior decision was itself the product of a retrieval and reasoning step, and its trace is a compressed, high-fidelity representation of the evidence that would otherwise require a full corpus retrieval to reconstruct.

Section 04

Re-Ranking: Structural Before Similarity

Vector similarity is a useful but insufficient ranking signal. The structural retrieval model applies prefilters — by source authority, document type, temporal lineage, and path — before cosine distance is consulted. These prefilters eliminate candidates that would rank highly on similarity but fail on structural grounds: the outdated version of a control, the vendor draft that was superseded, the design document that the implementation disproved.

The Kai agent's compliance drift prompt demonstrates prefiltering by retrieval mechanism. The prompt instructs the agent to use Grep and Glob keyword retrieval to locate candidate files, then apply a selective Read to retrieve content — not a similarity search, but a keyword-bounded structural query over a known path space. ⁵ The ordering is load-bearing: Grep and Glob operate over the file system with explicit path constraints; they cannot return a file outside the scoped directory tree. The prefilter is the path constraint itself.
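The path constraint acting as a prefilter can be approximated in a few lines. This sketch uses Python's standard `pathlib` and `re` modules as a stand-in for the Grep and Glob tools named above; it is not the Kai agent's tooling, and the function name, directory layout, and file contents are invented for the demonstration.

```python
import re
import tempfile
from pathlib import Path

def scoped_keyword_search(root, glob_pattern, keyword):
    """Return (relative_path, line_number, line) hits for keyword under root only."""
    root = Path(root)
    needle = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    # Glob step: candidates can only come from inside the scoped tree.
    for path in sorted(root.glob(glob_pattern)):
        if not path.is_file():
            continue
        # Grep step, then a selective read of matching lines only.
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if needle.search(line):
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    scoped = base / "policies"
    scoped.mkdir()
    (scoped / "access.md").write_text(
        "Intro\nAC-3 access enforcement applies.\n", encoding="utf-8"
    )
    # This file mentions the keyword but lives outside the scoped directory.
    (base / "notes.md").write_text("AC-3 mentioned elsewhere.\n", encoding="utf-8")
    hits = scoped_keyword_search(scoped, "**/*.md", "AC-3")
```

The file outside `policies/` never appears in `hits`, regardless of how well it matches the keyword, because it was never a candidate.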

Design drift detection adds a second structural prefilter: the prefer-prefilter-paths-over-full-tree-traversal rule. ⁶ Full-tree traversal is the structural equivalent of a full-corpus similarity search: it considers every document and ranks by the query signal alone. Prefilter paths narrow the candidate set to documents whose provenance, type, or location makes them structurally relevant before any content-based ranking is applied. The savings are not merely computational — a narrower candidate set means fewer contradictory sources in the final context window and a lower probability of the model synthesizing evidence from structurally incompatible documents.

Source authority ranking — a principle the OWASP Top 10 for LLM Applications articulates in the context of prompt injection defense, where the trust level of an input source must be established before its content is processed — maps directly onto the prefilter model. ⁷ A retrieved fragment from a canonical regulatory specification carries higher authority than a retrieved fragment from a vendor summary; a fragment from the current code carries higher authority than a fragment from an architectural design that predates the last two releases. The prefilter encodes these authority relationships structurally, before similarity ranking can override them.
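A minimal sketch of "structural before similarity" follows. It assumes a numeric authority scale and a `superseded` flag, both hypothetical, and takes similarity scores as given by whatever vector index is already in place. The document identifiers and scores are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    doc_id: str
    authority: int     # hypothetical scale: higher means more authoritative
    superseded: bool   # True if a later version replaces this document
    similarity: float  # cosine score supplied by an existing vector index

def prefilter_then_rank(candidates, min_authority):
    """Drop structurally ineligible candidates, then rank the rest."""
    eligible = [
        c for c in candidates
        if not c.superseded and c.authority >= min_authority
    ]
    # Authority is the primary key; similarity only breaks ties within a tier.
    return sorted(eligible, key=lambda c: (-c.authority, -c.similarity))

candidates = [
    Candidate("vendor_summary", authority=1, superseded=False, similarity=0.93),
    Candidate("nist_original", authority=3, superseded=True, similarity=0.91),
    Candidate("nist_amended", authority=3, superseded=False, similarity=0.84),
    Candidate("design_doc", authority=2, superseded=False, similarity=0.88),
]

ranked = prefilter_then_rank(candidates, min_authority=2)
ranked_ids = [c.doc_id for c in ranked]
```

In this example the two highest-similarity candidates, the vendor summary and the superseded original, are both excluded before similarity is consulted, so they cannot outrank the amended control text.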

Section 05

Citation Grounding and Trace Binding

Structural retrieval without an auditable citation chain is still a black box. The retriever may have decomposed the query, applied prefilters, and assembled context under a budget — but if the provenance of each fragment is not preserved through to the generated output, the reasoning is unverifiable and the generation is unauditable.

The Kai agent's design-code drift prompt specifies a citation schema that enforces provenance at the fragment level: the smallest contiguous span of text that supports a claim, expressed as file:lines[a,b] with a mandatory plus-or-minus five-line context window for hashing. ⁸ The schema is not a documentation convention; it is a retrieval constraint. The citation must be computable — the hash of the cited span at the cited line range must be reproducible from the source file — which means the citation is also a tamper-evidence mechanism. If the source changes and the hash no longer matches, the citation is broken and the claim it supports requires re-verification.
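The tamper-evidence property can be sketched as follows. The `file:lines[a,b]` format and the five-line context window come from the schema described above; the choice of SHA-256, the decision to hash the cited span together with its context window, and the function names are assumptions made for this illustration.

```python
import hashlib

CONTEXT_LINES = 5  # plus-or-minus window around the cited span

def cite(file_name, lines, a, b):
    """Return (citation, digest) for 1-indexed, inclusive lines a..b."""
    if not (1 <= a <= b <= len(lines)):
        raise ValueError("line range outside file")
    # Clamp the context window to the bounds of the file.
    start = max(1, a - CONTEXT_LINES)
    end = min(len(lines), b + CONTEXT_LINES)
    window = "\n".join(lines[start - 1:end])
    digest = hashlib.sha256(window.encode("utf-8")).hexdigest()
    return f"{file_name}:lines[{a},{b}]", digest

def verify(lines, a, b, expected_digest):
    """True if the span and its context still hash to the recorded digest."""
    try:
        _, digest = cite("", lines, a, b)
    except ValueError:
        return False
    return digest == expected_digest

source = [f"line {i}" for i in range(1, 21)]
citation, digest = cite("design.md", source, 8, 9)

edited_inside = list(source)
edited_inside[11] = "changed"   # line 12, inside the hashed window (lines 3..14)
edited_outside = list(source)
edited_outside[19] = "changed"  # line 20, outside the hashed window
```

An edit inside the window breaks verification, while an edit elsewhere in the file does not. One consequence of this design is that edits which shift line numbers also break the citation, which is conservative: the claim is flagged for re-verification rather than silently re-bound.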

Trace binding extends citation grounding across retrieval stages. In a multi-stage pipeline — initial research, hypothesis formation, refinement, execution — each stage produces intermediate conclusions that are inputs to the next stage. If those intermediate conclusions carry their citation provenance, the final generation can be traced back through every stage to the original source fragments. The Kai agent implements this through a staged prompt assembly: the initial research stage produces a structured summary with citations; the master design stage consumes that summary as a named input; the final generation stage consumes the master design output. ⁹ The citation chain from the final answer to the initial source is recoverable because each stage's inputs and outputs are named and structured, not embedded in free-form prompt text.
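Trace binding across stages can be sketched with named, structured stage outputs. The stage names follow the staged assembly described above; the class, its fields, and the citation strings are hypothetical and are not taken from the Kai agent's code.

```python
from dataclasses import dataclass, field

@dataclass
class StageOutput:
    stage: str
    summary: str
    citations: list = field(default_factory=list)  # citations introduced here
    inputs: list = field(default_factory=list)     # upstream StageOutput objects

def trace(output):
    """Collect every citation reachable from this output, upstream first."""
    collected = []
    for upstream in output.inputs:
        for citation in trace(upstream):
            if citation not in collected:
                collected.append(citation)
    for citation in output.citations:
        if citation not in collected:
            collected.append(citation)
    return collected

research = StageOutput(
    stage="initial_research",
    summary="Control text and current design located.",
    citations=["nist.md:lines[10,14]", "auth_design.md:lines[40,52]"],
)
design = StageOutput(
    stage="master_design",
    summary="Session handling diverges from the design.",
    citations=["session.py:lines[88,97]"],
    inputs=[research],
)
final = StageOutput(
    stage="final_generation",
    summary="Gap identified against AC-3.",
    inputs=[design],
)

provenance = trace(final)
```

The final stage introduces no citations of its own, yet `trace(final)` recovers all three source fragments because each stage names its inputs instead of pasting them into free-form prompt text.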

The decision tracing store demonstrates the endpoint of this discipline: a schema-driven, filtered retrieval layer that stores decision records with their supporting evidence, not as unstructured logs but as typed, queryable artifacts. ¹⁰ The store is not a similarity index; it is a citation database. Querying it for prior decisions on a given policy is a structured lookup, not a semantic search over decision prose. The structural retrieval model and the decision trace model converge at this point: both treat evidence as a first-class artifact with schema, provenance, and verifiable lineage.
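The difference between a structured lookup and a semantic search can be shown with an in-memory SQLite table. The schema, column names, and rows below are invented for illustration; the paper does not describe the actual storage engine or schema of the decision tracing store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE decisions (
        id INTEGER PRIMARY KEY,
        policy TEXT NOT NULL,
        outcome TEXT NOT NULL,
        decided_on TEXT NOT NULL,  -- ISO 8601 date, so text order is date order
        citation TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO decisions (policy, outcome, decided_on, citation) VALUES (?, ?, ?, ?)",
    [
        ("AC-3", "compliant", "2024-01-15", "auth_design.md:lines[40,52]"),
        ("AC-3", "gap_found", "2024-06-03", "session.py:lines[88,97]"),
        ("AC-2", "compliant", "2024-02-20", "accounts.md:lines[5,9]"),
    ],
)

def prior_decisions(connection, policy):
    """Typed lookup by policy, newest first; no similarity score involved."""
    cursor = connection.execute(
        "SELECT outcome, decided_on, citation FROM decisions "
        "WHERE policy = ? ORDER BY decided_on DESC",
        (policy,),
    )
    return cursor.fetchall()

ac3_history = prior_decisions(conn, "AC-3")
```

The query returns exactly the records filed under the policy, each with its citation, and nothing that merely resembles the policy in wording.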

Section 06

Iterative Retrieval and Refinement

Single-pass retrieval is appropriate for factual lookup. Compound reasoning requires iterative retrieval: each stage of the reasoning process retrieves against a hypothesis that was not fully formed at query time, because the hypothesis was produced by earlier retrieval results. The retriever must be embedded in the reasoning loop, not placed before it.

The Kai agent implements a four-stage pipeline that makes iterative retrieval explicit: initial research retrieves background context and establishes the workspace; hypothesis formation consumes that context to identify the specific claims requiring verification; refinement retrieves targeted evidence for each claim; execution assembles the final generation from the refined evidence set. ⁹ Each stage re-ranks: the initial research stage ranks by relevance to the task specification; the refinement stage ranks by relevance to the specific hypothesis under investigation, which is a narrower and more precise signal than the original query.

Re-ranking at each stage is the structural analogue of iterative search in human research practice: the analyst does not identify all sources before reading any of them; they read, update their understanding of what they need, and retrieve again. The difference is that iterative retrieval in a structured pipeline produces an auditable record of what was retrieved at each stage and why, whereas human iterative search does not. The audit trail is a byproduct of the structure, not an additional mechanism bolted on afterward.

Iterative retrieval also manages the contradiction problem more effectively than single-pass retrieval. When a contradiction surfaces during the hypothesis stage — two retrieved fragments making incompatible claims about the same policy — the refinement stage can issue a targeted retrieval specifically for the resolution: the amendment date of the conflicting documents, the canonical source that adjudicates between them, the implementation record that establishes which version is operative. Single-pass retrieval surfaces the contradiction in the final context window and leaves the resolution to the model; iterative retrieval makes contradiction resolution a retrievable fact.
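A rough sketch of contradiction handling in the refinement stage follows. It assumes fragments have already been tagged with a topic, a claim, and an effective date, which is itself a nontrivial step not shown here, and it resolves conflicts by temporal lineage only. The fragment contents and dates are invented.

```python
from collections import defaultdict
from datetime import date

fragments = [
    {"topic": "session_timeout", "claim": "30 minutes",
     "source": "spec_v1.md", "effective": date(2023, 1, 10)},
    {"topic": "session_timeout", "claim": "15 minutes",
     "source": "spec_amendment.md", "effective": date(2024, 3, 2)},
    {"topic": "mfa", "claim": "required",
     "source": "spec_v1.md", "effective": date(2023, 1, 10)},
]

def find_conflicts(items):
    """Group fragments by topic and keep topics with incompatible claims."""
    by_topic = defaultdict(list)
    for item in items:
        by_topic[item["topic"]].append(item)
    return {
        topic: group for topic, group in by_topic.items()
        if len({item["claim"] for item in group}) > 1
    }

def resolve_by_lineage(conflicting):
    """Assumption: the fragment with the latest effective date is operative."""
    return max(conflicting, key=lambda item: item["effective"])

conflicts = find_conflicts(fragments)
operative = {topic: resolve_by_lineage(group) for topic, group in conflicts.items()}
```

The point of the sketch is that the resolution is computed from retrieved metadata before generation. Other adjudication rules, such as deferring to a canonical source or an implementation record, would replace `resolve_by_lineage` without changing the shape of the loop.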

Section 07

The Context Window as Explicit Constraint

The context window is a hard budget. A structural retrieval pipeline treats it as an explicit constraint that governs assembly decisions: which fragments are included at full text, which are summarized, which are compressed to a citation reference, and which are excluded. The naive pipeline treats the context window as a consequence of retrieval — the top-k results are concatenated until the window is full — which means the assembly policy is implicit in the chunk size and the value of k rather than in any explicit reasoning about information value per token.

The Kai agent's logging truncation utility enforces a hard length ceiling on individual evidence fragments: a max_length=500 character limit applied at the logging layer, not at the retrieval layer. ¹¹ The limit is measured in characters, so it bounds token cost only approximately. The truncation is structural evidence compression: the 500-character limit is not an arbitrary display constraint but a deliberate decision that a log fragment longer than 500 characters is not yielding proportional information value for its token cost. The utility enforces this decision mechanically, before the fragment enters any downstream context window.
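A character-length truncation of this kind might look like the sketch below. Only the `max_length=500` default comes from the source; the function name, the trailing marker, and the choice to count the marker against the limit are assumptions.

```python
def truncate(text, max_length=500, marker="..."):
    """Return text cut to at most max_length characters, marker included."""
    if max_length < len(marker):
        raise ValueError("max_length must be at least the marker length")
    if len(text) <= max_length:
        return text
    # Reserve room for the marker so the result never exceeds the ceiling.
    return text[:max_length - len(marker)] + marker

short_fragment = truncate("policy check passed")
long_fragment = truncate("x" * 2000)
```

Counting the marker against the limit keeps the ceiling strict: no fragment leaving the utility is longer than `max_length` characters.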

Budget-aware assembly follows from the prefilter and decomposition stages. When the retrieval plan is a sequence of named subqueries with authority rankings, the assembly stage can make principled token allocation decisions: high-authority, task-specific sources from the interaction path receive full-text inclusion; background workspace sources receive summary inclusion; low-authority or contradicted fragments receive citation-only inclusion. The total token cost of the assembled context is predictable from the assembly policy, not emergent from the similarity distribution of the corpus.

The practical consequence is that a structural retrieval pipeline can degrade gracefully under context pressure. When the token budget is tight — because the prompt template is long, because the task requires a long generation, or because prior iterative stages have consumed budget — the assembly stage has a principled priority ordering for what to drop: citation-only fragments first, then workspace summaries, then interaction-path summaries, with interaction-path full text as the last thing to truncate. The naive pipeline has no such ordering; it truncates from the end of the concatenated chunks, which may cut the most relevant fragment because it happened to rank fifth rather than first.
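The drop ordering described above can be sketched as a small assembly function. The tier names mirror the prose; the class, the item names, and the token counts are hypothetical, and token counts are taken as given rather than computed by a tokenizer.

```python
from dataclasses import dataclass

# Tiers listed in the order they are dropped under context pressure.
DROP_ORDER = (
    "citation_only",
    "workspace_summary",
    "interaction_summary",
    "interaction_full",
)

@dataclass(frozen=True)
class ContextItem:
    name: str
    tier: str
    tokens: int

def assemble(items, budget):
    """Drop items tier by tier, lowest priority first, until the budget fits."""
    kept = list(items)
    for tier in DROP_ORDER:
        while sum(item.tokens for item in kept) > budget:
            positions = [i for i, item in enumerate(kept) if item.tier == tier]
            if not positions:
                break  # nothing left in this tier; move to the next one
            kept.pop(positions[-1])
    return kept

items = [
    ContextItem("nist_ac3_text", "interaction_full", 900),
    ContextItem("auth_design_summary", "interaction_summary", 300),
    ContextItem("workspace_overview", "workspace_summary", 250),
    ContextItem("old_vendor_note", "citation_only", 40),
    ContextItem("superseded_draft", "citation_only", 40),
]

context = assemble(items, budget=1300)
kept_names = [item.name for item in context]
```

With a 1,300-token budget against 1,530 tokens of candidates, both citation-only items and the workspace summary are dropped, and the interaction-path material survives intact. The original order of the kept items is preserved.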

Section 08

Production Governance

A structural retrieval pipeline is measurable in ways a similarity-only pipeline is not. Because each stage has defined inputs, outputs, and ranking criteria, the pipeline can be instrumented to report retrieval quality signals at the stage level rather than only at the final-generation level.

Coverage measures whether the subqueries derived from the decomposition stage collectively addressed all information needs the original query implied. A decomposition that produces three subqueries when the query contains five distinct information needs will produce a final generation with two unsupported claims — not because the similarity search failed but because the decomposition was incomplete. Coverage is measurable by comparing the decomposition plan against the information need taxonomy of the domain; it does not require a human evaluator to read every retrieved fragment.
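Coverage as described here reduces to a set comparison. The sketch assumes the domain's information needs can be enumerated as labels, which is the "information need taxonomy" the paragraph refers to; the labels themselves are hypothetical.

```python
def coverage(required_needs, planned_needs):
    """Return (fraction of required needs covered, set of missing needs)."""
    required = set(required_needs)
    if not required:
        return 1.0, set()
    missing = required - set(planned_needs)
    covered = len(required) - len(missing)
    return covered / len(required), missing

required = [
    "nist_control_text",
    "current_auth_design",
    "pre_sprint_session_model",
    "sprint_delta",
]
planned = ["nist_control_text", "current_auth_design", "pre_sprint_session_model"]

score, missing = coverage(required, planned)
```

Here the plan covers three of four needs, and the missing label identifies which claim in the final generation would be unsupported.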

Precision at the stage level measures whether the fragments retrieved for a given subquery were relevant to that subquery rather than to the aggregate query. When retrieval is decomposed, precision is a per-subquery metric: the fragments retrieved for the NIST AC-3 subquery should be about AC-3, not about AC-2 or about the authentication implementation in general. The citation schema enforces a minimum precision standard mechanically — a citation that cannot be hashed against the source file at the specified line range is a retrieval failure, and that failure is detectable without a human reading the output. ⁸
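Per-subquery precision can be computed once each retrieved fragment carries a relevance judgment for the subquery it was retrieved under. Where those judgments come from (a labeled set, a citation check, a reviewer) is outside this sketch, and the data below is invented.

```python
def precision_by_subquery(retrieved):
    """Map each subquery to relevant/retrieved, or None if nothing was retrieved."""
    result = {}
    for name, fragments in retrieved.items():
        if not fragments:
            result[name] = None  # undefined: no fragments to judge
            continue
        relevant = sum(1 for _, is_relevant in fragments if is_relevant)
        result[name] = relevant / len(fragments)
    return result

# Each entry: (fragment_id, judged relevant to this subquery?)
retrieved = {
    "nist_ac3": [("f1", True), ("f2", True), ("f3", False), ("f4", True)],
    "auth_design": [("f5", True), ("f6", False)],
    "sprint_delta": [],
}

precision = precision_by_subquery(retrieved)
```

Reporting `None` for an empty subquery keeps a retrieval miss distinct from a precision of zero, which matters when the two are handled by different stages.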

Latency in an iterative retrieval pipeline is a multi-stage measurement. The total latency is the sum of the latencies of each retrieval stage plus the generation stage; the bottleneck may be in any of them. Structural prefilters reduce latency at the candidate-retrieval stage by narrowing the corpus before similarity is computed; the cost is an additional prefilter computation that is almost always cheaper than computing similarity over the full corpus. Industry practice in latency-sensitive retrieval pipelines uses prefilters to achieve sub-second retrieval over corpora where full-corpus similarity search would exceed latency budgets. ¹²

The retriever itself must be auditable under the ISO/IEC 42001 AI management system framework, which requires that AI system components whose outputs inform decisions be documented, their design rationale recorded, and their operational behavior monitored. ¹³ A retriever whose only specification is "top-k by cosine similarity" does not meet this standard; a retriever whose decomposition rules, prefilter criteria, authority rankings, and budget policies are documented as explicit design decisions does. The documentation burden is lower for the structural pipeline than for the similarity-only pipeline, because the structural pipeline's design decisions are already explicit — they are the pipeline.

Section 09

Adoption Patterns

Structural retrieval is not a replacement for vector similarity in all contexts. The two models have different strength profiles, and production adoption requires clarity about which profile fits which task.

Vector-only retrieval wins when the query is atomic and the corpus is semantically uniform. Factual lookup — "what does the policy say about data retention?" — over a well-chunked, single-domain corpus is an appropriate similarity-search problem. The query maps to a single information need; the corpus does not contain structurally incompatible versions of the same document; the answer does not require multi-step reasoning over retrieved fragments. When these three conditions hold, the structural overhead of decomposition and prefiltering adds latency and complexity without a corresponding quality gain.

Structural retrieval wins when any of those three conditions fails: compound queries, heterogeneous corpora, or reasoning tasks. In governance and compliance contexts — the domain the Kai agent operates in — all three conditions routinely fail. Compliance queries are compound by nature (they reference a control, an implementation, and a compliance gap simultaneously). Compliance corpora are heterogeneous (they contain regulatory specifications, vendor documentation, design artifacts, code, and prior decision records, all at different authority levels). Compliance reasoning is multi-step (it requires identifying the gap between the specification and the implementation, which is a two-source comparison, not a single-fragment lookup).

The migration path from similarity-only to structural retrieval does not require replacing the vector index. The prefilter model adds structural constraints before the similarity call; the decomposition model replaces a single similarity call with a sequence of similarity calls each scoped to a subquery; the citation schema adds a post-processing step that verifies fragment provenance. None of these changes requires reindexing the corpus. The vector index remains the retrieval mechanism for each scoped subquery; the structural pipeline is the orchestration layer above it. Teams can adopt the structural model incrementally, beginning with the prefilter (the lowest-cost, highest-precision gain) and adding decomposition and citation grounding as their pipelines mature.

Section 10

Honest Limits

The platform under study — the KellerAI Kai agent — demonstrates structural retrieval principles in its prompt design and supporting utilities. It does not contain substantial vector-embedding code. The prefilters the Kai agent applies are path-based and keyword-based, not embedding-based; the authority rankings are encoded as explicit prompt instructions, not as learned representations; the context budget is enforced by a character-length truncation utility, not by a learned token-allocation policy.

This makes the Kai agent an exhibit for structure-over-embeddings as a deliberate architectural choice rather than an exhibit for the full structural retrieval stack as described in this paper. The choice is defensible — for a governance domain where source authority, document lineage, and path provenance are well-defined and stable, explicit structural rules may outperform learned embeddings because the relevant signals are not latent in the text but explicit in the file system and the schema. The paper's thesis does not require embeddings to be absent; it requires structure to be primary and similarity to be secondary. The Kai agent satisfies that requirement.

What is not measured: the paper does not report retrieval precision, recall, or coverage numbers for the Kai agent's pipeline. The platform under study does not expose benchmark results for its retrieval stages, and the paper does not have access to a labeled evaluation set against which to compute those numbers. The structural retrieval principles described here are grounded in the platform's prompt and code evidence, not in empirical retrieval quality measurements. Practitioners adopting these patterns should instrument their pipelines to collect stage-level precision and latency data; the governance argument for structural retrieval does not substitute for empirical validation of retrieval quality in a specific production context.

The external citation pool for this paper is narrow by design: the paper does not cite specific research paper titles, authors, or DOIs for structural retrieval methods that it has not verified against current sources. Where the paper refers to industry practice — latency-sensitive prefilter patterns, OWASP trust-level principles, ISO/IEC 42001 documentation requirements — it cites the standard or the framework rather than a specific implementation study. Practitioners seeking empirical benchmarks for RAG architectures should consult the current research literature, which is moving rapidly enough that any specific paper citation here would risk misrepresenting a finding that has since been revised.
