RAG failures at scale tend to be retrieval failures, even when the model takes the blame. The retrieval layer gets a pass, and the misdiagnosis leads to the wrong fix: teams end up on their third prompt rewrite when what they actually need is a redesigned retrieval layer.

Why It Works in the Demo

RAG — retrieval-augmented generation — is an architecture pattern where a model’s responses are grounded in documents retrieved at query time rather than relying solely on training data. It works reliably under demo conditions because the demo retrieval problem is easy. The corpus is small, the queries are clean, the relevant documents are obvious, and the person running the demo knows what to search for. The model receives well-matched context and produces good output. Everything works.

In production, none of those conditions hold. The corpus grows. Users submit queries the system was not designed for. Relevant documents become entangled with noise. Documents that were current when indexed become stale. The retrieval quality degrades in ways that are invisible to the model — because the model does not know what it was not given. It produces outputs that look plausible and are wrong, and nobody can diagnose why because the retrieval layer was never instrumented.

The demo is not representative of the retrieval problem. It is representative of the best-case retrieval problem. The architecture works under best-case conditions and breaks under everything else.

The Retrieval Layer as Architectural Seam

The boundary between retrieval and generation is an architectural seam. It is the point where two fundamentally different systems meet: a deterministic information retrieval system and a probabilistic language model. That seam requires a contract — a specification of what the retrieval layer guarantees to deliver, what the generation layer can assume about what it receives, and what happens when retrieval quality degrades below a threshold.

Almost no RAG architectures have this contract. The retrieval layer is treated as a plumbing detail rather than a first-class system component. It has no explicit quality guarantees, no defined failure modes, no circuit breakers. When it degrades, the generation layer receives poor context and produces poor output, and the failure is attributed to the model.
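To make the idea concrete, here is a minimal sketch of what such a contract could look like. Everything in it is an assumption for illustration: the score scale, the `MIN_RELEVANCE` and `MIN_RESULTS` thresholds, and the `search_fn` callable standing in for whatever vector or keyword search the system actually uses. The point is structural: retrieval either meets its contract or reports that it did not, instead of silently handing weak context to the model.

```python
from dataclasses import dataclass
from typing import Callable

MIN_RELEVANCE = 0.75   # assumed similarity floor; tune per corpus and embedder
MIN_RESULTS = 2        # below this, treat retrieval as failed, not "good enough"

@dataclass
class RetrievalResult:
    passages: list[str]
    scores: list[float]
    degraded: bool      # True when the contract could not be met

def retrieve_with_contract(
    query: str,
    search_fn: Callable[[str, int], list[tuple[str, float]]],
    k: int = 5,
) -> RetrievalResult:
    """Enforce an explicit quality contract on the retrieval layer."""
    candidates = search_fn(query, k)
    kept = [(p, s) for p, s in candidates if s >= MIN_RELEVANCE]
    if len(kept) < MIN_RESULTS:
        # Surface the failure explicitly; the caller decides whether to
        # fall back, refuse to answer, or answer with a caveat.
        return RetrievalResult(passages=[], scores=[], degraded=True)
    passages, scores = zip(*kept)
    return RetrievalResult(list(passages), list(scores), degraded=False)
```

The generation layer can then branch on `degraded` rather than discovering retrieval failure indirectly through bad answers.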

For example, if a retrieval layer returns documents that are topically adjacent but not actually relevant to the query, the model will attempt to answer using that context. The answer will be coherent and wrong. Nothing in the system signaled that retrieval had failed. The model did exactly what it was designed to do with what it was given.

This misattribution is consequential. When the model is blamed for a retrieval failure, the team improves the prompt, fine-tunes the model, or switches to a more capable model. None of these interventions address the retrieval problem. The retrieval problem persists. The model improvements provide marginal gains that are swamped by continued retrieval degradation. The system improves slowly until it hits a ceiling — not because the model cannot do better, but because the context it receives is unreliable.

Chunking and Indexing as Technical Debt

Chunking and indexing decisions are made once, early, under demo conditions, and then never revisited. They are the most consequential technical decisions in a RAG architecture and receive the least architectural attention.

Chunking strategy determines what the retrieval layer can return. Chunks that are too large dilute relevant signal. Chunks that are too small lose the context needed to answer the query. Chunks that cut across document boundaries destroy semantic coherence. Every chunking strategy makes a claim about the structure of the corpus and the shape of the queries.

In the demo, that claim is usually correct. In production, the corpus grows in ways that violate those early assumptions, and the claims become wrong. Indexing decisions compound this. An index built for a small, clean corpus behaves differently at scale. Approximate nearest-neighbor search degrades as corpus size grows, particularly when the embedding space becomes crowded.

Freshness constraints that were irrelevant in the demo become urgent when documents are updated frequently. Metadata filtering that was not needed for forty documents becomes essential for four hundred thousand. Teams that treat chunking and indexing as one-time decisions accumulate retrieval debt. By the time the degradation is visible in output quality, the corpus is large enough that re-chunking and re-indexing is a significant operation with unclear tradeoffs. The team faces an expensive migration without a clear diagnosis of what went wrong.
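Freshness and metadata constraints can be enforced as an explicit post-retrieval filter. This is a sketch under assumed conventions: candidates are dicts carrying an `indexed_at` timestamp and a `tags` set, and the 90-day staleness bound is an arbitrary placeholder.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=90)  # assumed freshness bound for illustration

def filter_candidates(candidates, now=None, required_tags=frozenset()):
    """Drop retrieved chunks that are stale or lack required metadata.

    Each candidate is assumed to be a dict with an `indexed_at`
    (timezone-aware datetime) and a `tags` (set of str) key.
    """
    now = now or datetime.now(timezone.utc)
    kept = []
    for c in candidates:
        if now - c["indexed_at"] > MAX_AGE:
            continue  # stale: indexed too long ago to trust
        if not required_tags <= c["tags"]:
            continue  # missing metadata the query requires
        kept.append(c)
    return kept
```

At forty documents this filter is dead code; at four hundred thousand it is the difference between answering from the current policy document and answering from the one it replaced.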

Latency Compounding

Retrieval adds latency. At scale, that latency compounds in ways that are not visible in the demo. A retrieval operation that takes 150ms against a ten-thousand-document corpus may take 600ms against a million-document corpus, depending on index structure and query complexity. The generation step adds its own latency. Reranking adds more. Multi-hop retrieval — retrieving, generating an intermediate query, retrieving again — multiplies all of these.

The architectural problem is that latency is never part of the retrieval contract. Teams build for capability without building for latency bounds. When the system slows down under production load, the team adds caching, parallelizes retrieval calls, or reduces context length. These are local fixes to a structural problem. They address symptoms rather than causes. Each fix introduces its own tradeoffs — caching introduces staleness, context reduction reduces answer quality, parallelism increases infrastructure cost.

Latency budgets should be explicit from the start. The retrieval layer should have a defined latency SLA. When that SLA cannot be met, the failure should be visible and the system should respond predictably — returning degraded results, falling back, or failing explicitly. A system that silently exceeds its latency budget is not resilient. It is broken in a way that is difficult to observe.
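One way to make the budget enforceable is to run retrieval under an explicit deadline and fail visibly when it is exceeded. The sketch below uses a thread pool and a timeout; the 500ms budget and the `search_fn`/`fallback` callables are assumptions, and a production system would likely push the deadline into the search service itself rather than wrapping it client-side.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

RETRIEVAL_SLA_S = 0.5  # assumed budget; derive from the product's latency target

def retrieve_within_sla(search_fn, query, fallback=lambda q: []):
    """Run retrieval under an explicit deadline.

    Returns (results, degraded). On an SLA breach the breach is flagged
    and the fallback is returned, instead of silently blocking the
    generation step.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(search_fn, query)
    try:
        result = future.result(timeout=RETRIEVAL_SLA_S)
        degraded = False
    except FutureTimeout:
        result = fallback(query)
        degraded = True  # visible signal: the latency contract was broken
    finally:
        pool.shutdown(wait=False)
    return result, degraded
```

The `degraded` flag is the important part: it turns a silent slowdown into an observable event that can be counted, alerted on, and handled predictably.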

What Scale Actually Means for RAG

Scale in the context of RAG has three independent dimensions, and each breaks the architecture in a different way.

Query volume tests the infrastructure: whether the index can serve concurrent requests, whether the embedding models can process query volume without becoming a bottleneck, whether latency holds under load. This is the most visible failure mode because it surfaces directly in user experience.

Corpus size tests the retrieval architecture: whether the chunking strategy remains appropriate, whether the index structure maintains quality at volume, whether freshness constraints can be met. This failure mode is insidious because it degrades gradually. Quality erodes rather than collapsing. The team may not notice until the degradation is significant.

Query diversity tests the retrieval contract: whether the system was designed for the queries users actually submit, or only for the queries the team anticipated. Every RAG deployment is optimized for a distribution of queries. Production query distributions drift. The system that worked for early adopters may not work for a broader user base asking different questions about different aspects of the corpus.

These three dimensions interact. High query volume combined with large corpus size and diverse queries is where RAG architectures fail hardest. Teams build for one dimension and discover that the other two were architectural assumptions masquerading as solved problems.
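All three dimensions can be instrumented cheaply from day one. The sketch below is one assumed approach, not a standard: query diversity is proxied by Shannon entropy over a coarse query key (here, just the first token), where a real system would use embeddings or intent labels.

```python
from collections import Counter
import math

class RagScaleMonitor:
    """Track the three scale dimensions: query volume, corpus size,
    and query diversity (entropy over coarse query keys)."""

    def __init__(self):
        self.query_count = 0
        self.corpus_size = 0
        self.query_keys = Counter()

    def record_query(self, query: str):
        self.query_count += 1
        tokens = query.split()
        # Coarse stand-in for a query-intent label.
        key = tokens[0].lower() if tokens else ""
        self.query_keys[key] += 1

    def record_index(self, num_docs: int):
        self.corpus_size = num_docs

    def query_diversity(self) -> float:
        """Shannon entropy in bits; rises as the query mix broadens."""
        total = sum(self.query_keys.values())
        if total == 0:
            return 0.0
        return -sum((n / total) * math.log2(n / total)
                    for n in self.query_keys.values())
```

Watching these three numbers over time shows which dimension is moving before output quality makes the drift obvious.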

The fix is not a better model. The fix is treating the retrieval layer as a first-class system: with explicit contracts, defined failure modes, quality metrics, latency bounds, and a plan for corpus growth. The model can only work with what retrieval gives it. Design the retrieval layer accordingly.


If you’re building an AI-driven product and want a second opinion on architecture or scaling risks, I offer Architecture Discussions — focused sessions for founders and technical teams working through real decisions.

me@andrewphunter.com