
Structured Retrieval Outranks Embeddings for Compound Reasoning

Retrieval-augmented generation beyond the naive pipeline.

KellerAI White Paper · Prompt & Artifact Engineering · May 2026

Context

A RAG system that retrieves by vector similarity alone breaks on compound queries, contradictory sources, and long-horizon tasks. When an agent must reason across documents, similarity treats each query as independent — missing that sources contradict, or that early findings change which sources matter next. Mature retrieval is a structured computation, not a single ranking pass.

The Finding

Three patterns separate production RAG from the embedding-centric baseline: query decomposition (compound questions split into staged subqueries), structural ranking before similarity (prefilter by authority, type, date, lineage), and citation-trace binding (pin retrieved context to source so retrieval is auditable). Retrieve by structure first, then by similarity — and always record the trace.

Tags: Query Decomposition, Structural Ranking, Citation Grounding, Context Budget Management

Paper Details
Category: Prompt & Artifact Engineering
Audience: Engineering leaders, prompt architects, and AI teams building compound-reasoning agents in regulated or audit-sensitive domains.
Method: Analytical · evidence-based
Date: May 2026
Authors: KellerAI

Section 01

Where the baseline breaks

The embed → retrieve → generate pipeline has three failure modes that tend to surface only under production workloads.

Compound queries. A question like "compare the Q3 2024 guidance with what the CFO said in the earnings call and flag any divergence" is not one retrieval. It is at least three. A single-shot nearest-neighbor lookup collapses the sub-questions into one vector and returns chunks that partially address each — none well. The generator then synthesizes across fragments that were never selected to cohere.
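The collapse described above can be illustrated with a toy calculation. This is a minimal sketch under strong assumptions: the three-dimensional vectors are hand-made stand-ins for embeddings (one axis per sub-question), and the compound query is modeled as the sum of its sub-question vectors. Real embedding models do not behave this cleanly; the point is only that one blended vector sits at a middling distance from every relevant chunk.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings": one orthogonal axis per sub-question.
sub_queries = {
    "guidance": [1.0, 0.0, 0.0],
    "earnings_call": [0.0, 1.0, 0.0],
    "divergence": [0.0, 0.0, 1.0],
}
# Assume each sub-question has one ideal chunk with the same direction.
chunks = dict(sub_queries)

# Single-shot retrieval: the whole question becomes one blended vector.
compound = [sum(axis) for axis in zip(*sub_queries.values())]

single_shot = {name: round(cosine(compound, vec), 3) for name, vec in chunks.items()}
decomposed = {name: round(cosine(sub_queries[name], chunks[name]), 3) for name in chunks}

print(single_shot)  # every chunk scores 0.577
print(decomposed)   # every chunk scores 1.0
```

In this toy setup the blended query scores about 0.577 against each ideal chunk, while each dedicated sub-query scores 1.0 against its own target.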

Contradictory sources. When two retrieved chunks assert incompatible facts, such as a policy document from February and a superseding amendment from August, the generator has no signal about which to trust. Similarity scoring does not encode recency, authority, or version lineage. Both chunks land in context with equal weight, and the model tends to blend them or pick one arbitrarily.

Vector similarity answers "what is near this query?" It does not answer "what should a system trust, and in what order?"

The core problem

Long-horizon tasks. A task that requires building up context across multiple retrievals — a due-diligence sweep, a regulatory gap analysis, a contract comparison — cannot be served by a stateless top-k call. Each retrieval step needs to know what the prior step found, what is still missing, and what retrieval strategy to apply next.
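One way to picture a stateful alternative to a top-k call is a loop that carries its findings forward. The sketch below is illustrative only: `SweepState`, `choose_strategy`, `run_sweep`, the strategy names, and the toy corpus are all hypothetical and do not come from the paper. Each step sees what earlier steps found, what is still missing, and picks a strategy accordingly.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class SweepState:
    needed: set[str]                                   # topics the task must cover
    found: dict[str, str] = field(default_factory=dict)
    unresolved: set[str] = field(default_factory=set)  # searched, nothing found

    def missing(self) -> set[str]:
        return self.needed - set(self.found) - self.unresolved

def choose_strategy(state: SweepState) -> str:
    # Hypothetical policy: scan broadly first, then target remaining gaps.
    return "broad_scan" if not state.found else "targeted_lookup"

Retriever = Callable[[str, str, dict[str, str]], Optional[str]]

def run_sweep(state: SweepState, retrieve: Retriever, max_steps: int = 10):
    """Retrieve one missing topic per step, passing prior findings along."""
    history = []
    for _ in range(max_steps):
        missing = state.missing()
        if not missing:
            break
        topic = min(missing)  # deterministic pick for the example
        strategy = choose_strategy(state)
        result = retrieve(topic, strategy, state.found)
        history.append((topic, strategy))
        if result is None:
            state.unresolved.add(topic)
        else:
            state.found[topic] = result
    return history

# Toy retriever backed by a dict; a real one would query an index.
CORPUS = {"contract_a": "Term: 24 months", "contract_b": "Term: 36 months"}

def toy_retrieve(topic, strategy, found):
    return CORPUS.get(topic)

state = SweepState(needed={"contract_a", "contract_b", "side_letter"})
history = run_sweep(state, toy_retrieve)
```

After the run, `state.found` holds both contracts, `state.unresolved` holds `side_letter`, and `history` records which strategy each step used. That record is exactly what a stateless call discards.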

Section 02

Retrieval as structured computation

Super RAG v2.1 reframes retrieval not as a lookup but as a computation with explicit stages, each producing a typed artifact that feeds the next.

Query decomposition. An incoming query is parsed into a directed acyclic graph of sub-queries. Each node carries a retrieval intent — lookup, comparison, temporal scan, aggregation. The graph determines the retrieval order and how results are merged before generation.
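The sub-query graph can be sketched with Python's standard `graphlib` module. This is a hand-written plan for the earnings-call example, not the decomposition grammar the paper refers to; the `SubQuery` type, its field names, and the node ids are assumptions made for illustration. Only the four intent labels come from the text.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter

@dataclass(frozen=True)
class SubQuery:
    id: str
    intent: str  # "lookup", "comparison", "temporal scan", or "aggregation"
    text: str
    depends_on: tuple[str, ...] = ()

# Hypothetical decomposition of the compound question from Section 01.
plan = [
    SubQuery("guidance", "lookup", "Q3 2024 guidance figures"),
    SubQuery("cfo_remarks", "lookup", "CFO statements in the earnings call"),
    SubQuery(
        "divergence",
        "comparison",
        "Where do the guidance and the CFO remarks differ?",
        depends_on=("guidance", "cfo_remarks"),
    ),
]

def execution_order(sub_queries):
    """Return sub-query ids so that every dependency runs first.

    TopologicalSorter raises graphlib.CycleError if the plan is not a DAG.
    """
    graph = {q.id: set(q.depends_on) for q in sub_queries}
    return list(TopologicalSorter(graph).static_order())

order = execution_order(plan)
print(order)  # the two lookups in some order, then "divergence"
```

The two lookups have no dependencies and could run in parallel; the comparison node runs only once both results exist, which is where the merge step would sit.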

Structural ranking before similarity. Before similarity scoring is consulted, a structural filter applies hard constraints: document authority tier, content type (policy, transcript, amendment), effective date range, and version lineage. Chunks that fail a structural constraint are excluded regardless of embedding proximity. Similarity operates within the surviving candidate set, not across the full corpus.
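A minimal sketch of filter-then-rank, under stated assumptions: the `Chunk` and `StructuralFilter` types and their field names are hypothetical, authority tier 1 is taken to be the most authoritative, version lineage is reduced to a single "superseded by" pointer, and similarity scores are precomputed. The structural ranking schema itself is not given in this brief.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Chunk:
    id: str
    authority_tier: int           # assumed: 1 is most authoritative
    content_type: str             # e.g. "policy", "transcript", "amendment"
    effective: date
    superseded_by: Optional[str]  # id of a newer version, if any
    similarity: float             # precomputed similarity to the query

@dataclass(frozen=True)
class StructuralFilter:
    max_tier: int
    allowed_types: frozenset[str]
    effective_from: date
    effective_to: date

    def admits(self, c: Chunk) -> bool:
        return (
            c.authority_tier <= self.max_tier
            and c.content_type in self.allowed_types
            and self.effective_from <= c.effective <= self.effective_to
            and c.superseded_by is None
        )

def retrieve(chunks, flt: StructuralFilter, k: int):
    """Apply hard constraints first, then rank survivors by similarity."""
    survivors = [c for c in chunks if flt.admits(c)]
    return sorted(survivors, key=lambda c: c.similarity, reverse=True)[:k]

corpus = [
    Chunk("policy-2024-02", 1, "policy", date(2024, 2, 1), "amend-2024-08", 0.95),
    Chunk("amend-2024-08", 1, "amendment", date(2024, 8, 1), None, 0.80),
    Chunk("forum-post", 3, "transcript", date(2024, 6, 1), None, 0.90),
]
flt = StructuralFilter(
    max_tier=2,
    allowed_types=frozenset({"policy", "amendment"}),
    effective_from=date(2024, 1, 1),
    effective_to=date(2024, 12, 31),
)
top = retrieve(corpus, flt, k=2)
print([c.id for c in top])  # ['amend-2024-08']
```

The February policy has the highest similarity score but is excluded because it has been superseded. A similarity-only ranking would have placed it first.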

Citation-trace binding. Every chunk selected for context is tagged with a provenance record: source identifier, version, retrieval stage, and the structural rules that admitted it. The generator receives context with its selection rationale attached. The final output can be traced back, chunk by chunk, to a deterministic retrieval decision.
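The provenance record can be sketched as a small data type carrying the four fields named above. The class names, the rendering format, and the sample values are assumptions for illustration; the paper's citation-trace data model is not specified in this brief.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Provenance:
    source_id: str
    version: str
    retrieval_stage: str          # which sub-query or stage selected the chunk
    admitted_by: tuple[str, ...]  # structural rules the chunk passed

@dataclass(frozen=True)
class ContextChunk:
    text: str
    provenance: Provenance

def render_context(chunks):
    """Build the generator's context with the selection rationale inline."""
    lines = []
    for i, c in enumerate(chunks, start=1):
        p = c.provenance
        lines.append(
            f"[{i}] source={p.source_id} version={p.version} "
            f"stage={p.retrieval_stage} rules={','.join(p.admitted_by)}"
        )
        lines.append(c.text)
    return "\n".join(lines)

def trace(chunks):
    """Serialize the provenance records so they can be stored with the answer."""
    return json.dumps([asdict(c.provenance) for c in chunks], indent=2)

selected = [
    ContextChunk(
        text="Remote access requires hardware tokens.",  # invented sample text
        provenance=Provenance(
            source_id="policy-7",
            version="2024-08",
            retrieval_stage="lookup:current_policy",
            admitted_by=("tier<=2", "type=amendment", "not_superseded"),
        ),
    )
]
context = render_context(selected)
trace_json = trace(selected)
```

Keeping the record immutable (`frozen=True`) is a deliberate choice here: once a chunk is admitted, its rationale should not change between retrieval and audit.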

Section 03

Why auditability is the forcing function

In a general-purpose chatbot, a retrieval failure is an inconvenience. In a regulated domain — clinical decision support, loan underwriting, compliance review — it is a liability event. The question is not only whether the answer is correct. It is whether the retrieval choices that produced it are defensible to a supervisor, an auditor, or a court.

Vector similarity is not defensible in that sense. It produces a ranked list with no explicit rationale. A structural retrieval pipeline — with decomposed queries, typed filters, and citation-trace records — produces a decision log. Every context chunk has an admission reason. Every sub-query has a result. The generation step operates on a reconstructible input.
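One hedged way to make "reconstructible input" checkable is to fingerprint the decision log, so a reviewer can confirm that a replayed retrieval produced the same decisions. The entry fields and function names below are hypothetical; only `hashlib` and `json` from the Python standard library are relied on.

```python
import hashlib
import json

def log_entry(sub_query_id: str, chunk_id: str, reason: str) -> dict:
    """One admission decision: which sub-query admitted which chunk, and why."""
    return {"sub_query": sub_query_id, "chunk": chunk_id, "reason": reason}

def fingerprint(entries: list) -> str:
    """SHA-256 over a canonical JSON encoding of the log."""
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

original = [
    log_entry("guidance", "amend-2024-08", "tier<=2; not superseded"),
    log_entry("cfo_remarks", "call-q3-017", "type=transcript; in date range"),
]
replayed = [dict(e) for e in original]  # same decisions, rebuilt independently
tampered = [dict(e) for e in original]
tampered[0]["chunk"] = "policy-2024-02"

same = fingerprint(original) == fingerprint(replayed)      # True
changed = fingerprint(original) != fingerprint(tampered)   # True
```

Storing the fingerprint alongside the generated answer gives an auditor a cheap equality check before reading the full log.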

That traceability is not a reporting convenience. In settings where an adverse output triggers regulatory review, the retrieval log is the evidence. Systems that cannot produce it are non-compliant by design, regardless of how accurate their average output is.

Section 04

What the in-depth companion covers

This brief names the failure modes and sketches the architecture. The in-depth companion works through each component in full: the query decomposition grammar, the structural ranking schema, the citation-trace data model, and the integration points where v2.1 differs from v1.x pipelines already in production.

It also covers the cases where simpler retrieval is still correct — not every RAG deployment needs a decomposition layer — and the cost model for adding structural ranking to an existing embedding-based system.

Read the full analysis: Super RAG v2.1 — In Depth.
