Tags: RAG, chunking, retrieval, query-logging, production

RAG That Works in Production

Most RAG hallucinations trace back to retrieval failures, not the LLM. Fix chunking, logging, and evaluation before upgrading models.

March 17, 2026 · 12 min read

Abstract

Retrieval-Augmented Generation (RAG) has become the default architecture for grounding large language models in organizational knowledge. It has also become the default architecture for producing confident, authoritative wrong answers. Most RAG debugging focuses on the generation side: better prompts, output guardrails, stronger LLMs. Most of the time, the root cause is upstream. The retrieved context was wrong, incomplete, or irrelevant, and the LLM did exactly what it was trained to do: produce coherent, confident text from whatever it was given. This article covers why RAG hallucination is primarily a retrieval problem, why chunking strategy has more impact than LLM selection, why query logging is the most underrated feature in the stack, and how to layer a production-grade search system from the ground up.

RAG Hallucination Is a Retrieval Problem

When a RAG system makes things up, the instinct is always the same: "We need better prompts." "We need guardrails." "We should switch to a stronger LLM." Most of the time, that is the wrong target.

Consider what RAG actually does: retrieve context from a knowledge base, stuff that context into a prompt, and ask an LLM to answer based on that context. If the retrieved context is wrong, incomplete, or irrelevant, the LLM fills in the gaps. An analysis of roughly 18,000 annotated RAG hallucinations found two primary types: fabrication (unsupported gap-filling, where the model invents information) and contradiction (where the model's output opposes the retrieved evidence) (Niu et al., 2024). Both types are downstream consequences of retrieval quality.

Of the seven failure points identified across three enterprise RAG case studies, the majority were attributable to the retrieval pipeline: missing content, wrong chunks in the top results, and wrong granularity. Retrieval failures accounted for 15 to 40% of the queries examined (Barnett et al., 2024).

Anthropic's contextual retrieval research demonstrated that improving retrieval alone reduced retrieval failure rates by 49% (Anthropic, 2024). No prompt changes. No model upgrades. Just better retrieval.

The common retrieval failure patterns:

  • Wrong chunks retrieved (the retriever found thematically related but factually incorrect passages)
  • Right document, wrong section (the relevant document was in the index, but the chunk boundaries cut the answer in half)
  • Key information split across chunk boundaries (the first half of an explanation in chunk A, the second half in chunk B, retriever finds only A)
  • Context window packed with noise (enough semi-relevant passages to crowd out the one that actually answers the question)

LLMs struggle significantly with noisy or irrelevant retrieved content. Retrieval quality is the bottleneck for RAG faithfulness (Chen et al., 2024). The generation layer amplifies whatever the retrieval layer gives it, good or bad. Broken retrieval produces confident wrong answers. Solid retrieval produces useful answers.

Before investing in output validation, citation checking, or model upgrades, the diagnostic step is: log retrieved chunks alongside generated answers, compare what the LLM received versus what it produced, and identify retrieval failure patterns. Fix the retrieval. The hallucinations will follow.
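A minimal sketch of that diagnostic, assuming hypothetical retrieve and generate functions standing in for your own retriever and LLM call: append one JSON line per request so retrieval failures can be inspected offline.

```python
# Minimal diagnostic sketch: log what the retriever returned next to what
# the LLM produced. `retrieve` and `generate` are hypothetical stand-ins
# for your own retrieval and generation calls.
import json
import time

def answer_with_logging(query, retrieve, generate, log_path="rag_log.jsonl"):
    chunks = retrieve(query)             # e.g. list of {"id": ..., "text": ...}
    answer = generate(query, chunks)     # LLM call over the retrieved context
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved": [{"id": c["id"], "text": c["text"][:500]} for c in chunks],
        "answer": answer,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return answer
```

A weekly pass over this log, comparing the "retrieved" field against the "answer" field, is usually enough to surface the failure patterns listed above.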

Chunking Strategy Matters More Than LLM Choice

Teams regularly spend hours debating GPT-4 versus Claude versus Llama while their documents are chunked at fixed 512-token boundaries with 50-token overlap. That chunking strategy is destroying their RAG performance and nobody is looking at it.

The failure mode is mechanical. A paragraph explaining a concept gets split at an arbitrary boundary. First half in chunk A, second half in chunk B. User asks about that concept. Retriever finds chunk A. LLM gets half the explanation. It either hallucinates the rest or gives a partial answer.

The performance difference is not subtle. Fixed-size character chunking achieved nDCG@5 below 0.244, while structure-aware paragraph chunking reached approximately 0.459 (Bhat et al., 2025). That is nearly double the retrieval quality from a chunking strategy change alone.

Three chunking decisions drive most of the impact:

Chunk Boundaries Should Respect Document Structure

Token counts are the wrong unit for determining where to split. Paragraphs, sections, headings, and logical units are the right ones. Element-type-based chunking (splitting on structural boundaries like headings, paragraphs, and table boundaries) achieves the highest retrieval scores with half the number of chunks compared to structure-agnostic methods (Jimeno Yepes et al., 2024). Fewer chunks means less noise in retrieval results and less computation at query time.
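A minimal sketch of the idea, assuming Markdown-like input (real documents need a proper parser): split on headings and blank lines rather than a fixed token count, so no chunk ends mid-paragraph.

```python
# Structure-aware chunking sketch: headings start new chunks, paragraphs
# are never split, and oversize chunks break at paragraph boundaries.
import re

def structure_aware_chunks(text, max_chars=2000):
    chunks, current, heading = [], [], ""

    def flush():
        if current:
            chunks.append({"heading": heading, "text": "\n\n".join(current)})
            current.clear()

    for block in re.split(r"\n\s*\n", text):     # paragraph boundaries
        block = block.strip()
        if not block:
            continue
        if re.match(r"^#{1,6}\s", block):        # Markdown heading: new chunk
            flush()
            heading = block
            continue
        if current and sum(len(b) for b in current) + len(block) > max_chars:
            flush()                              # split only between paragraphs
        current.append(block)
    flush()
    return chunks
```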

Chunk Size Should Match Query Patterns

A short factual query ("What is the return policy?") needs a precise, small chunk. A complex analytical query ("How does the compensation structure compare across regions?") needs broader context. No single chunk size is optimal for all query types.

A router that dynamically determines optimal granularity based on input queries outperforms any fixed strategy (Chung et al., 2024). Dense X Retrieval proposes using propositions (atomic, self-contained statements) as the retrieval unit, producing more precise chunks that improve retrieval accuracy for fact-seeking queries (Chen et al., 2024b).
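As an illustrative heuristic only (not the learned router from Chung et al., 2024), here is a sketch that sends short fact-seeking queries to a fine-grained index and longer analytical queries to a coarse one; the cue words, thresholds, and index names are assumptions.

```python
# Heuristic granularity router sketch: pick a retrieval unit per query.
def pick_granularity(query):
    tokens = query.lower().split()
    fact_cues = {"what", "when", "who", "how", "is", "does"}
    if len(tokens) <= 8 and fact_cues & set(tokens):
        return "propositions"   # small, precise chunks for fact lookups
    return "sections"           # broader context for analytical queries

print(pick_granularity("What is the return policy?"))           # propositions
print(pick_granularity("How does the compensation structure "
                       "compare across regions over time?"))    # sections
```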

Metadata Matters as Much as Content

Chunk text alone loses the context of where it came from. Adding metadata (source document, section title, document hierarchy, date, author) allows the retrieval system to filter and boost based on structural information, not just semantic similarity. Metadata-enriched approaches achieve 82.5% precision versus 73.3% for content-only baselines in enterprise settings (Alkhalaf et al., 2024).
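A minimal sketch of what this looks like in practice, with illustrative field names: each chunk carries structural metadata, and retrieval can filter on it before scoring.

```python
# Metadata-enriched chunk sketch: structure travels with the text.
chunk = {
    "text": "Refunds are issued within 14 days of receiving the return.",
    "source": "policies/returns.md",
    "section": "Refund Timing",
    "hierarchy": ["Customer Service", "Returns", "Refund Timing"],
    "updated": "2026-01-12",
}

def filter_chunks(chunks, source_prefix=None, updated_after=None):
    # Narrow the candidate set structurally before semantic scoring.
    out = chunks
    if source_prefix:
        out = [c for c in out if c["source"].startswith(source_prefix)]
    if updated_after:
        out = [c for c in out if c["updated"] >= updated_after]
    return out
```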

Anthropic's contextual retrieval research demonstrated that adding context to chunks (prepending a brief explanation of what each chunk is about within its parent document) reduced retrieval failure rates by 49%, and combining contextual embeddings with BM25 reduced failures by 67% (Anthropic, 2024). Chunk quality is foundational.
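A minimal sketch of that contextualization step, assuming a hypothetical summarize_context function standing in for the LLM call that produces the per-chunk description:

```python
# Contextual chunk sketch: prepend a short generated description of where
# the chunk sits in its parent document before embedding it.
def contextualize(chunk, doc_title, summarize_context):
    context = summarize_context(chunk["text"], doc_title)  # e.g. one sentence
    return {**chunk, "text_for_embedding": f"{context}\n\n{chunk['text']}"}
```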

Query Logging: The Most Underrated Feature

Every query users type is simultaneously free training data for retrieval models, product feedback about what users need, and a bug report about what the system cannot handle.

Most teams capture none of this systematically. Only 7% of companies report learning from their site search data and applying those insights elsewhere in their business (Algolia, 2024).

The maturity curve for query analysis typically progresses through three stages:

Stage 1: "We don't log queries." Surprisingly common. The system processes and forgets. Every user interaction is lost.

Stage 2: "We log queries but haven't analyzed them." More common. There is a table somewhere in a data warehouse. Nobody looks at it regularly.

Stage 3: "We analyze queries weekly and feed patterns back into the system." Rare. And consistently correlated with the best search quality. Incorporating implicit feedback from user behavior (clicks, dwell time, scrolling, reformulations) improved ranking accuracy by up to 31% (Agichtein, Brill & Dumais, 2006).

Systematic query analysis reveals specific, actionable patterns:

Vocabulary gaps. Users type terms the system does not understand. These become training data for embedding fine-tuning. If "RMA" appears in queries 500 times per week and produces poor results, that is a specific, fixable gap.

Intent patterns. Clusters of queries reveal what users are actually trying to do, which often differs from what the system assumes they want.

Failure modes. Queries followed by reformulations are implicit failure signals. A query-based satisfaction model (using reformulation patterns with no click information) can indicate user satisfaction more accurately than click-based models (Hassan et al., 2013).

Seasonal patterns. Query distributions shift over time: query popularity, relevant documents, and user intent all change (Kulkarni et al., 2011). An evaluation set that represented the query distribution six months ago may no longer reflect current production traffic; the evaluation set needs to track this drift.
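A heuristic sketch of reformulation detection; the time window and similarity thresholds are assumptions, not values from Hassan et al. (2013):

```python
# Reformulation sketch: a quick follow-up query that heavily overlaps the
# previous one is treated as an implicit failure signal.
from difflib import SequenceMatcher

def is_reformulation(prev_query, next_query, seconds_apart,
                     max_gap=60.0, min_similarity=0.5):
    if seconds_apart > max_gap:
        return False
    sim = SequenceMatcher(None, prev_query.lower(), next_query.lower()).ratio()
    return min_similarity <= sim < 1.0   # similar but not identical

print(is_reformulation("rma process", "rma return process", 12.0))  # True
```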

The logging infrastructure is trivial: a message queue and a table. The analysis can start as a weekly manual review and evolve into automated monitoring. The hard part is not the technology. It is making it a habit.
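A minimal sketch of the "a table" half in SQLite; the message queue in front of it is an operational detail not shown here, and the schema fields are illustrative.

```python
# Query log sketch: one row per query, enough to compute zero-result rates,
# click positions, and reformulation chains later.
import sqlite3
import time

conn = sqlite3.connect("search_logs.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS query_log (
        ts REAL, session_id TEXT, query TEXT,
        n_results INTEGER, clicked_rank INTEGER
    )
""")

def log_query(session_id, query, n_results, clicked_rank=None):
    conn.execute("INSERT INTO query_log VALUES (?, ?, ?, ?, ?)",
                 (time.time(), session_id, query, n_results, clicked_rank))
    conn.commit()
```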

The Search Stack to Build From Scratch

If building a retrieval system from scratch today, here is the layered architecture in order of priority. Each layer builds on the previous one. Skipping layers creates the failure modes described throughout this article.

Layer 1: Foundation

BM25 on a properly configured index. Good tokenization. Language-specific analyzers. Field boosting. BM25 remains a strong baseline that many dense models fail to beat on out-of-domain tasks (Thakur et al., 2021). BM25 outperforms neural passage retrieval on benchmarks like Robust04 and TREC-COVID (Chen et al., 2022). Ship this first. Get it in front of users. Start logging queries immediately.
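A minimal sketch of the baseline using the rank_bm25 package (pip install rank-bm25); a production deployment would use Elasticsearch or OpenSearch with language-specific analyzers, but the ranking behavior is the same idea.

```python
# BM25 baseline sketch over a toy corpus.
from rank_bm25 import BM25Okapi

docs = [
    "Refunds are issued within 14 days of receiving the return.",
    "Returns require an RMA number issued by customer support.",
    "Shipping costs are non-refundable for international orders.",
]
bm25 = BM25Okapi([d.lower().split() for d in docs])

query = "how do I get an rma number"
scores = bm25.get_scores(query.lower().split())
ranked = sorted(zip(scores, docs), reverse=True)
print(ranked[0][1])   # the RMA document should rank first
```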

Layer 2: Evaluation

Before changing anything else. An NDCG evaluation pipeline with 200 to 300 labeled queries. Automated runs on every change. The original NDCG metric introduced position-based discounting and graded relevance evaluation (Järvelin & Kekäläinen, 2002). This is the compass. Without it, every subsequent decision is guesswork.
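A minimal sketch of the metric, using the common exponential-gain variant of NDCG@k with log2 position discounting; ranked_rels holds the graded relevance labels of the results in the order the system returned them.

```python
# NDCG@k sketch: graded relevance with position-based discounting.
import math

def dcg(rels, k):
    return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg(ranked_rels, k=5):
    ideal = dcg(sorted(ranked_rels, reverse=True), k)
    return dcg(ranked_rels, k) / ideal if ideal > 0 else 0.0

# A highly relevant doc at rank 1, a partially relevant one at rank 3.
print(ndcg([2, 0, 1, 0, 0]))   # ~0.96
```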

Layer 3: Semantic

Fine-tuned embeddings trained on domain-specific query-document pairs. Not general-purpose embeddings. When further customized for downstream tasks, E5 achieves superior fine-tuned performance compared to existing embedding models with 40x more parameters (Wang et al., 2022). Run in parallel with BM25. Fuse with RRF. Tune the k parameter (Bruch, Gai & Ingber, 2023).
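A minimal sketch of the fusion step; k = 60 is the commonly used default, and the document ids are illustrative.

```python
# Reciprocal Rank Fusion sketch: combine BM25 and dense rankings by summed
# reciprocal ranks. The constant k dampens the influence of lower ranks.
from collections import defaultdict

def rrf(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:                  # each is a ranked list of doc ids
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ["d3", "d1", "d7"]
dense_ids = ["d1", "d9", "d3"]
print(rrf([bm25_ids, dense_ids]))   # d1 and d3 appear in both lists, rank first
```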

Layer 4: Reranking

Cross-encoder reranker on the top 100 hybrid results (Nogueira & Cho, 2019). Start off-the-shelf. Fine-tune on domain data. Databricks research reports reranking can improve retrieval quality by up to 48%. The monoBERT reranking stage provides the largest single improvement over the BM25 baseline (Nogueira et al., 2019). Typically the largest single quality improvement in the stack.
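A minimal sketch of the reranking stage using sentence-transformers (pip install sentence-transformers); the checkpoint named here is a public MS MARCO cross-encoder, a reasonable off-the-shelf starting point before domain fine-tuning.

```python
# Cross-encoder reranking sketch: rescore the top hybrid candidates.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=10):
    # Score each (query, document) pair jointly, then reorder.
    scores = reranker.predict([(query, doc) for doc in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```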

Layer 5: Query Understanding

Lightweight query classification. Per-query retrieval strategy selection outperforms uniform application (Arabzadeh, Yan & Clarke, 2021). Detect query types and adjust fusion weights accordingly. Entity recognition for structured queries. Spell correction. Query expansion where it helps.
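As an illustrative heuristic (the rules and weights are assumptions, not from Arabzadeh, Yan & Clarke, 2021), here is a sketch that shifts fusion weights toward lexical matching for code-like queries and toward semantic matching for long natural-language asks.

```python
# Query-type routing sketch: adjust fusion weights per query.
import re

def fusion_weights(query):
    if re.search(r"\b[A-Z]{2,}\b|\d", query):      # codes, SKUs, numbers
        return {"bm25": 0.8, "dense": 0.2}         # favor exact lexical match
    if len(query.split()) >= 8:                    # long natural-language ask
        return {"bm25": 0.3, "dense": 0.7}         # favor semantic similarity
    return {"bm25": 0.5, "dense": 0.5}

print(fusion_weights("RMA 4821 status"))   # lexical-heavy
```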

Layer 6: Monitoring

Online evaluation metrics. Online evaluation complements offline approaches and is often more realistic when measuring actual user experience (Hofmann, Li & Radlinski, 2016). Zero-result rate tracking. Query reformulation detection. Automated alerts on quality degradation. Dashboards that the team actually looks at.
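A minimal sketch of one such monitor, building on the query_log table sketched earlier; the 5% threshold is an assumption to tune against your own baseline.

```python
# Zero-result rate monitor sketch over the query_log table.
import sqlite3
import time

conn = sqlite3.connect("search_logs.db")

def zero_result_rate(conn, since_ts):
    total, = conn.execute(
        "SELECT COUNT(*) FROM query_log WHERE ts >= ?", (since_ts,)).fetchone()
    zero, = conn.execute(
        "SELECT COUNT(*) FROM query_log WHERE ts >= ? AND n_results = 0",
        (since_ts,)).fetchone()
    return zero / total if total else 0.0

rate = zero_result_rate(conn, since_ts=time.time() - 86400)
if rate > 0.05:   # illustrative alert threshold
    print(f"ALERT: zero-result rate {rate:.1%} over the last 24 hours")
```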

Layer 7: RAG (If Needed)

Only after layers 1 through 6 are solid.

RAG amplifies retrieval quality, good or bad. The majority of RAG failure points are retrieval-side (Barnett et al., 2024). Broken retrieval produces confident wrong answers. Solid retrieval produces useful answers. Adding RAG on top of a weak retrieval stack does not fix the stack. It makes the failures more visible and more consequential.

Notice what is not in this stack: chasing the latest model, premature optimization, or starting with RAG. Build measurement and data infrastructure first. The models are the easy part.

Conclusion

RAG in production is a retrieval engineering problem, not a prompt engineering problem. The majority of hallucinations trace back to retrieval failures: wrong chunks, missing context, poor granularity, noisy candidate sets. Fixing these upstream issues eliminates most downstream generation problems.

The practical priorities: structure-aware chunking that respects document boundaries and metadata, systematic query logging that feeds patterns back into retrieval optimization, evaluation infrastructure that measures quality continuously, and a layered architecture that builds each capability on a solid foundation.

The models will continue to improve. The fundamental architecture will not change: fast, approximate first-stage retrieval; precise cross-encoder reranking; evaluation at every layer; and RAG only after retrieval proves itself. Teams that build this infrastructure compound improvements over time. Teams that skip layers spend their time debugging symptoms instead of causes.

References

  • Agichtein, E., Brill, E., & Dumais, S. (2006). "Improving Web Search Ranking by Incorporating User Behavior Information." SIGIR 2006, pp. 19-26. https://dl.acm.org/doi/10.1145/1148170.1148177
  • Algolia (2024). "40+ Stats on E-Commerce Search and KPIs." https://algolia.com/blog/ecommerce/e-commerce-search-and-kpis-statistics
  • Alkhalaf, H., et al. (2024). "A Systematic Framework for Enterprise Knowledge Retrieval." arXiv:2512.05411. https://arxiv.org/abs/2512.05411
  • Anthropic (2024). "Introducing Contextual Retrieval." https://anthropic.com/news/contextual-retrieval
  • Arabzadeh, N., Yan, X., & Clarke, C. L. A. (2021). "Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection." CIKM 2021, pp. 2862-2866. https://arxiv.org/abs/2109.10739
  • Barnett, S., Kurniawan, S., Thudumu, S., Brannelly, Z., & Abdelrazek, M. (2024). "Seven Failure Points When Engineering a Retrieval Augmented Generation System." IEEE/ACM CAIN 2024, pp. 194-199. https://arxiv.org/abs/2401.05856
  • Bhat, A., et al. (2025). "A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity." arXiv:2603.06976. https://arxiv.org/abs/2603.06976
  • Bruch, S., Gai, S., & Ingber, A. (2023). "An Analysis of Fusion Functions for Hybrid Retrieval." ACM TOIS, 2023. https://arxiv.org/abs/2210.11934
  • Chen, J., Lin, H., Han, X., & Sun, L. (2024). "Benchmarking Large Language Models in Retrieval-Augmented Generation." AAAI 2024, Vol. 38(16), pp. 17754-17762. https://arxiv.org/abs/2309.01431
  • Chen, J., et al. (2024b). "Dense X Retrieval: What Retrieval Granularity Should We Use?" EMNLP 2024. https://arxiv.org/abs/2312.06648
  • Chen, X., Zhang, N., Lu, K., Bendersky, M., & Najork, M. (2022). "Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models." ECIR 2022, pp. 95-110. https://arxiv.org/abs/2201.10582
  • Chung, J., et al. (2024). "Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation." arXiv:2406.00456. https://arxiv.org/abs/2406.00456
  • Hassan, A., Shi, X., Craswell, N., & Ramsey, B. (2013). "Beyond Clicks: Query Reformulation as a Predictor of Search Satisfaction." CIKM 2013, pp. 2019-2028.
  • Hofmann, K., Li, L., & Radlinski, F. (2016). "Online Evaluation for Information Retrieval." Foundations and Trends in Information Retrieval, 10(1), 1-117. https://dl.acm.org/doi/10.1561/1500000051
  • Järvelin, K., & Kekäläinen, J. (2002). "Cumulated Gain-Based Evaluation of IR Techniques." ACM Transactions on Information Systems, 20(4), 422-446. https://dl.acm.org/doi/10.1145/582415.582418
  • Jimeno Yepes, A., et al. (2024). "Financial Report Chunking for Effective Retrieval Augmented Generation." arXiv:2402.05131. https://arxiv.org/abs/2402.05131
  • Kulkarni, A., Teevan, J., Svore, K. M., & Dumais, S. T. (2011). "Understanding Temporal Query Dynamics." WSDM 2011, pp. 167-176. https://dl.acm.org/doi/10.1145/1935826.1935862
  • Niu, X., Wu, J., Zhu, Y., Xu, S., Shum, K., Zhong, H., Song, J., & Zhang, T. (2024). "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models." ACL 2024, pp. 10862-10878. https://aclanthology.org/2024.acl-long.585/
  • Nogueira, R., & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv:1901.04085. https://arxiv.org/abs/1901.04085
  • Nogueira, R., Yang, W., Cho, K., & Lin, J. (2019). "Multi-Stage Document Ranking with BERT." arXiv:1910.14424. https://arxiv.org/abs/1910.14424
  • Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." NeurIPS 2021. https://arxiv.org/abs/2104.08663
  • Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., & Wei, F. (2022). "Text Embeddings by Weakly-Supervised Contrastive Pre-training." arXiv:2212.03533. https://arxiv.org/abs/2212.03533

Laszlo Csontos

Search & Retrieval Engineer: Hybrid Search, RAG, Embeddings. Building systems that actually find what users want.

Building RAG? I can help you get to production.

Book a Discovery Call