
RAG hallucinations are usually a retrieval failure you never measured

When a RAG system answers confidently and wrongly, the reflex is to fix the prompt or swap the model. The fault usually sits upstream in retrieval, and it stays invisible because almost no team measures retrieval as its own stage.

May 27, 2026 · 7 min read

A retrieval-augmented system returns an answer that is fluent, specific, and wrong. The reflex is almost always the same: tighten the prompt, add an output guardrail, move to a stronger model. Each of those is a change to the generation layer, and the generation layer is rarely where the fault is. A language model asked to answer from a set of passages will produce coherent, confident text from whatever it was handed. If the passages were wrong, incomplete, or missing the one that mattered, the model does its job and the answer is still wrong. Broken retrieval produces confident wrong answers; solid retrieval produces useful ones. The difficulty is that, from the answer alone, the two are indistinguishable, and almost no team measures retrieval on its own to tell them apart.

RAG was supposed to fix hallucination, and only partly did

Grounding a model in retrieved context was meant to be the cure for hallucination. It is not a complete one. Models still produce statements unsupported by, or in direct contradiction to, the very context they were given. A corpus of nearly 18,000 RAG responses, annotated for exactly these two modes (fabrication beyond the context and contradiction of it), documents both at scale (Niu et al., 2024).

That finding cuts in two directions, and both matter. Because these failures appear even when the retrieved context is correct, retrieval quality is not the sole cause of an unfaithful answer. But it is a major and routinely under-attended driver of one: model accuracy drops sharply once the retrieved set contains noise or irrelevant passages (Chen et al., 2024). If a noisy candidate set is enough to derail an otherwise capable model, then the quality of what retrieval returns is one of the strongest levers a team has on whether the final answer can be trusted. Experience reports from teams shipping production RAG converge on the same place. Across three separate enterprise deployments, several of the seven recurring failure points sit in the retrieval pipeline rather than in the model: content missing from the index, the relevant passage ranked below the cutoff, and the right document present but chunked so the answer never arrives intact (Barnett et al., 2024).

The failure is invisible because you measure the answer, not the retrieval

Here is the structural reason these failures persist. Most RAG evaluation scores the output. It asks whether the final answer looks right, whether it is faithful to the context, whether an LLM judge approves of it. That tells a team that something went wrong. It does not tell them where. When an answer is wrong, an output score cannot separate the two possibilities that demand opposite fixes: retrieval surfaced the wrong context and the model faithfully summarized it, or retrieval surfaced the right context and the model fumbled it. Those are different bugs with different owners, and an end-to-end number collapses them into one.
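The split between the two bugs can be sketched in a few lines, assuming each evaluation case records which passages a human judged necessary alongside what the retriever actually returned. The `EvalCase` fields, the `attribute_failure` helper, and the rule that any missing necessary passage counts as a retrieval miss are illustrative assumptions, not a standard API.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One evaluated query. All field names are hypothetical."""
    query: str
    relevant_ids: set        # passages a human judged necessary to answer
    retrieved_ids: list      # what the retriever returned, in rank order
    answer_correct: bool     # the end-to-end verdict on the final answer


def attribute_failure(case: EvalCase, k: int = 5) -> str:
    """Label a case as 'ok', 'retrieval', or 'generation'.

    Assumption: a wrong answer is a retrieval failure if any necessary
    passage is missing from the top k, and a generation failure otherwise.
    """
    if case.answer_correct:
        return "ok"
    top_k = set(case.retrieved_ids[:k])
    if case.relevant_ids <= top_k:
        # The right context was handed over and the model fumbled it.
        return "generation"
    # The model never saw everything it needed.
    return "retrieval"


cases = [
    EvalCase("is water damage covered?", {"B"}, ["A", "C"], False),
    EvalCase("how long is the warranty?", {"B"}, ["B", "A"], False),
    EvalCase("what does the warranty cover?", {"A"}, ["A"], True),
]
labels = [attribute_failure(c) for c in cases]
```

Counting the labels across a query set gives the separable answer an end-to-end score cannot: how many wrong answers belong to the retriever and how many to the model.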

The diagnostic move is to measure retrieval as its own stage, before changing anything downstream. That means scoring what the retriever returns against a set of graded relevance judgments built independently of the system, so the question becomes "did the right passages come back, in the right order" rather than "did the answer read well." Doing that credibly has two requirements worth naming, both easy to skip. The first is statistical: a single retrieval score means little without a confidence interval around it, because the variance across queries routinely swamps the gap between two configurations. (The mechanics, and why a bare point estimate is closer to a coin flip than a verdict, are the subject of a companion piece on bootstrap confidence intervals for NDCG.) The second is independence: when the audience for the number is procurement, diligence, or an investor, a score the building team produced on its own judgments answers a different question than one produced by a party with no stake in the result. (That distinction is treated at length in first-party eval tools versus independent audit.)
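As a rough illustration of scoring retrieval in isolation, the sketch below computes NDCG@k per query from graded judgments and then puts a percentile bootstrap interval around the mean. It assumes linear gain (the grade itself rather than 2^grade - 1) and resampling at the query level. The function names and data layout are made up for the example, and this is one common way to do it rather than the method behind any particular audit.

```python
import math
import random


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(retrieved_ids, grades, k=5):
    """NDCG@k for one query.

    grades maps passage id -> graded relevance (0 means irrelevant);
    passages absent from grades are treated as irrelevant (an assumption).
    """
    gains = [grades.get(pid, 0) for pid in retrieved_ids[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-query scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample queries with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical judgments for a single query: grade 3 is the best match.
grades = {"a": 3, "b": 2, "c": 0}
perfect = ndcg_at_k(["a", "b", "c"], grades)
swapped = ndcg_at_k(["c", "b", "a"], grades)

# Made-up per-query scores, only to show the interval's shape.
per_query = [0.9, 0.2, 0.7, 0.4, 1.0, 0.0, 0.6, 0.8]
low, high = bootstrap_ci(per_query)
```

With only a handful of queries the interval is wide, which is the point: two configurations whose intervals overlap heavily have not been shown to differ.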

The reason the diagnostic is worth the trouble is that the highest-leverage fix is frequently retrieval-only, with no change to the model at all. Enriching each chunk with a short description of its place in the parent document before indexing, then indexing those enriched chunks for both vector and keyword retrieval, reduced failed retrievals by 49 percent in one set of experiments, rising to 67 percent once a reranker was added on top (Anthropic, 2024). No prompt was rewritten and no model was swapped. The improvement came entirely from giving retrieval better material to work with.
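The enrichment step described above might look roughly like the following. The `describe` callable stands in for whatever produces the short situating description (in practice probably an LLM call); it, `enrich_chunks`, and the stub describer are hypothetical names for illustration, and the indexing calls are left out because they depend on the search stack in use.

```python
from typing import Callable, List


def enrich_chunks(
    document: str,
    chunks: List[str],
    describe: Callable[[str, str], str],
) -> List[str]:
    """Prepend a short description of each chunk's place in its document.

    describe(document, chunk) is an assumed hook that returns one or two
    sentences situating the chunk; an empty result leaves the chunk as-is.
    """
    enriched = []
    for chunk in chunks:
        context = describe(document, chunk).strip()
        enriched.append(f"{context}\n\n{chunk}" if context else chunk)
    return enriched


def stub_describe(document: str, chunk: str) -> str:
    """Toy describer for the demo; a real one would summarize with a model."""
    return "From the warranty section of the product manual."


doc = "The warranty covers parts and labor for 24 months, excluding water damage."
chunks = ["The warranty covers parts and labor", "for 24 months, excluding water damage."]
enriched = enrich_chunks(doc, chunks, stub_describe)
# The same enriched strings would then be fed to both the embedding index
# and the keyword index, so both retrievers see the added context.
```

The design point is that the chunk text itself is unchanged; only the string that gets indexed carries the extra context, so the generator can still be shown the original passage.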

A retrieval bug that looks exactly like a model bug

Chunking is the clearest case of a retrieval-side fault that gets misread as a generation defect. The mechanism is mechanical. A passage that explains one thing gets split at an arbitrary boundary: the first half of the explanation lands in one chunk, the second half in the next. A query about that topic retrieves the chunk that matches its wording, the model receives half of an explanation, and it either completes the thought by inventing the rest or returns a partial answer that omits the part that mattered. To anyone watching the output, this reads as the model hallucinating or dodging. It is neither. The model never saw the whole answer.

Figure: the chunk-boundary failure mode. One warranty explanation is split across two chunks: Chunk A holds "the warranty covers parts and labor" and Chunk B holds "for 24 months, excluding water damage." A query asking whether water damage is covered retrieves Chunk A, the top match, and misses Chunk B. The model, seeing only Chunk A, answers confidently that the item is covered. The exclusion that would have answered the question correctly lived in the chunk that was never retrieved.

A coherent answer split at a chunk boundary: retrieval returns the matching half and misses the half that holds the answer.
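The warranty example can be reproduced with a toy comparison of the two cutting strategies. This is a minimal sketch: the 40-character window, the second paragraph about returns, and both helper functions are invented for the illustration, and real paragraph-aware chunkers handle many more cases.

```python
def fixed_size_chunks(text: str, size: int) -> list:
    """Cut every `size` characters, ignoring sentence and paragraph boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str) -> list:
    """Cut on blank lines so each paragraph stays in one piece."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


text = (
    "The warranty covers parts and labor for 24 months, excluding water damage."
    "\n\n"
    "Returns are accepted within 30 days."
)

fixed = fixed_size_chunks(text, 40)
paragraphs = paragraph_chunks(text)

# With fixed-size cutting, the coverage claim and the exclusion land in
# different chunks; with paragraph cutting they stay together.
coverage_chunk = fixed[0]
exclusion_in_same_chunk = "water damage" in coverage_chunk
```

A retriever that returns only `fixed[0]` hands the model the coverage claim without the exclusion, while the first element of `paragraphs` carries both.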

The size of the effect is easy to underrate. The largest controlled comparison of chunking strategies to date, spanning thirty-six segmentation methods across six domains and five embedding models, found that naive fixed-size character chunking scored below 0.244 nDCG@5 while structure-aware paragraph chunking reached roughly 0.459 (Shaukat, Adnan & Kuhn, 2026). That is close to a doubling of retrieval quality from a change to where the text is cut, with the same embedding model and the same generator. A team that responds to the resulting bad answers by upgrading the model is paying to make a retrieval problem slightly less visible.
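For reference, the arithmetic behind "close to a doubling", using the two figures quoted above and treating 0.244 as the fixed-size score (the source says the score fell below that, so the true ratio is at least this large):

```latex
\[
\frac{\text{nDCG@5}_{\text{paragraph}}}{\text{nDCG@5}_{\text{fixed-size}}}
\approx \frac{0.459}{0.244} \approx 1.88,
\qquad
\frac{0.459 - 0.244}{0.244} \approx 0.88
\]
```

That is a relative gain of roughly 88 percent from the cut points alone.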

Why this is expensive for an AI startup specifically

For a team building on RAG, misattributing retrieval failures to the model is not only wasted engineering. It is wasted at the worst possible time. Weeks spent tuning prompts and trialing larger models are weeks not spent on the fix that would have worked, and they tend to get spent in exactly the window before a fundraise or a first enterprise pilot, when the system is under the most scrutiny and has the least slack.

The deeper cost surfaces in the room. A technically literate reviewer, whether a partner running diligence or an enterprise customer's ML lead, eventually asks how the team knows its retrieval works. A team that has only ever measured end-to-end answer quality has no separable answer. It cannot say where its system is strong or weak, because it never measured the layer where most of its failures live. That is precisely the gap such a reviewer probes for, and the things they look for in the response are independence, statistical significance, and a reproducible method. The teams that answer cleanly are the ones that measured retrieval as its own stage, on their own queries, before anyone asked, when a weak number was still a problem they had time to fix rather than a fact they had to disclose. Doing that early enough to remediate is itself a timing decision, and the usable window is four to twelve weeks ahead of the conversation.

The takeaway

Before the next prompt rewrite or model upgrade, measure what retrieval actually returned. Score the retrieved passages against graded judgments, attach a confidence interval, and find out whether the model was handed the right context or set up to fail. Most of the time the answer reorders the roadmap, and the model turns out to have been the part that was already working.


TensorOpt runs independent, fixed-fee retrieval audits that measure retrieval as its own stage: graded relevance scored with NDCG, MRR, and hit-rate@k, bootstrap confidence intervals, and paired significance tests against a documented baseline. Download the sample report to see retrieval measured in isolation on a public benchmark, or read the methodology in full in Designing Hybrid Search Systems.

Laszlo Csontos

Author of Designing Hybrid Search Systems (Leanpub, 2026). Practitioner background in production hybrid search, embeddings, cross-encoder reranking, and retrieval evaluation.
