Bootstrap Confidence Intervals for NDCG: The Rigor Most RAG Evals Skip
A single NDCG number looks like a verdict but is often a coin flip. The variance across query subsets routinely exceeds the difference between two systems, and the IR community solved this in 1997. Most LLM-era eval tooling has not inherited the fix.
Open almost any RAG evaluation report and you will find a number that looks like a verdict. NDCG@10 = 0.42 for the current retriever. NDCG@10 = 0.44 for the candidate reranker. The candidate wins, so it ships. The decision feels objective because it is attached to a metric.
It is not objective. It is a coin flip dressed up as a measurement, and the reason is something the information retrieval community settled almost three decades ago and the LLM-era tooling has, so far, declined to inherit.
A single number hides the thing you actually need to know
A reported NDCG@10 of 0.42 is an average over a set of queries. Each query in your evaluation set produces its own per-query score, and those per-query scores vary enormously. Some queries are easy and score near 1.0 under any configuration. Some are hard and score near zero no matter what you do. The headline number is the mean of that spread, and the mean alone tells you nothing about how wide the spread is or how stable the comparison between two systems would be if you had drawn a slightly different set of queries.
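To make the spread concrete, here is a minimal sketch that computes NDCG@10 per query and then compares the headline mean with the per-query standard deviation. It assumes linear gain with a log2 rank discount, which is one common formulation; the function name `ndcg_at_k`, the query IDs, and the relevance grades are all hypothetical and made up for illustration.

```python
import math
from statistics import mean, pstdev


def ndcg_at_k(ranked_gains, all_gains, k=10):
    """NDCG@k for a single query (linear gain, log2 discount).

    ranked_gains: graded relevance of the returned documents, in rank order.
    all_gains:    graded relevance of every judged document for the query,
                  used to build the ideal ranking.
    """
    def dcg(gains):
        # rank is 0-based, so the discount is log2(rank + 2)
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


# Hypothetical toy data: (grades in returned order, grades of all judged docs)
queries = {
    "q_easy": ([3, 2, 1, 0], [3, 2, 1, 0]),   # already in ideal order
    "q_mixed": ([0, 3, 0, 2], [3, 2, 0, 0]),  # relevant docs ranked too low
    "q_hard": ([0, 0, 0, 0], [2, 1, 0, 0]),   # nothing relevant retrieved
}

per_query = {qid: ndcg_at_k(ranked, judged) for qid, (ranked, judged) in queries.items()}
headline = mean(per_query.values())   # the number that gets reported
spread = pstdev(per_query.values())   # the number that usually does not

for qid, score in per_query.items():
    print(f"{qid}: {score:.3f}")
print(f"mean = {headline:.3f}, std dev across queries = {spread:.3f}")
```

With only three toy queries the numbers mean little in themselves; the point is that the per-query array, not the mean, is the raw material for everything that follows.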
This is not a subtle effect. Comparative retrieval behavior is highly variable across topics, and that variability frequently dominates the difference between the systems you are trying to compare (Voorhees and Buckley, 2002). The variance contributed by the choice of query set routinely exceeds the variance contributed by the choice of system, which means that two evaluation runs on different query samples can flip the ranking of the same two systems (Zobel, 1998). When the topic effect is larger than the system effect, a point estimate is reporting mostly noise and presenting it as signal.
The practical consequence: a +0.02 difference in NDCG@10 between two configurations might be real, or it might be entirely an artifact of which queries happened to land in your evaluation set. The point estimate cannot distinguish the two cases. It was never designed to.
The IR community fixed this in 1997
The fix is a confidence interval, and a standard way to compute one for a retrieval metric is the bootstrap. The method is old, well understood, and almost embarrassingly simple to describe.
Start with the set of per-query NDCG scores. Resample that set with replacement, drawing a new set of the same size where some queries appear more than once and some not at all. Recompute the mean NDCG on the resampled set. Repeat several thousand times. The result is a distribution of plausible NDCG values, and the 2.5th and 97.5th percentiles of that distribution give a 95 percent confidence interval. The interval answers the question the point estimate cannot: across the kinds of query sets that might have been drawn, where does this metric actually sit?
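The steps above can be sketched in a few lines of NumPy. This is a plain percentile bootstrap written out by hand, under the assumption that per-query scores are already available as an array; the function name `percentile_bootstrap_ci` and the synthetic scores are hypothetical stand-ins for real evaluation output.

```python
import numpy as np


def percentile_bootstrap_ci(scores, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-query scores."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = scores.size
    # Each row holds n query indices drawn with replacement: one resampled query set.
    idx = rng.integers(0, n, size=(n_resamples, n))
    resampled_means = scores[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    low, high = np.quantile(resampled_means, [alpha, 1.0 - alpha])
    return float(low), float(high)


# Synthetic per-query NDCG@10 scores for 50 queries (made up for illustration).
rng = np.random.default_rng(42)
per_query_ndcg = rng.beta(2.0, 3.0, size=50)

low, high = percentile_bootstrap_ci(per_query_ndcg)
print(f"mean = {per_query_ndcg.mean():.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The fixed seed is there only so that repeated runs report the same interval; the resample count is a tunable assumption, not a recommendation from the cited literature.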
This approach was proposed for retrieval evaluation specifically because it makes no assumption about the distribution of the observations, can assess the accuracy of virtually any statistic, and can build approximate confidence intervals even with a relatively small sample size (Savoy, 1997). It was extended into a full framework for evaluating arbitrary IR metrics, including graded-relevance metrics like NDCG, through bootstrap hypothesis testing (Sakai, 2006). Comparisons against the other standard significance tests found little practical difference between the bootstrap, randomization, and t-test approaches, establishing the bootstrap as a fully legitimate choice (Smucker, Allan, and Carterette, 2007). The method is treated as standard pedagogy in the reference textbook on laboratory IR experiments (Sakai, 2018).
None of this is exotic. It is the boring, settled, decades-old practice of a field that learned the hard way that single numbers lie.
What the interval changes
Consider two reranker configurations. Configuration A reports NDCG@10 of 0.42. Configuration B reports 0.44. On the point estimates alone, B is the obvious choice.
Now attach the intervals. Suppose the 95 percent bootstrap confidence interval for A is [0.39, 0.45] and for B is [0.41, 0.47]. (These figures are illustrative, chosen to show the mechanism.) The intervals overlap heavily: every value from 0.41 to 0.45 lies inside both.
The overlap is the warning sign. A could truly sit at 0.45 and B at 0.41, which would reverse the ranking the point estimates suggested. The +0.02 gap is inside the range the two configurations share.
One caution, because it is the exact error this article exists to correct: comparing the two marginal intervals by eye is a rough heuristic, not the rigorous test. Two overlapping marginal intervals can still hide a real, consistent per-query difference, and two barely separated ones can still be noise. The correct procedure compares the configurations on the same queries and builds a confidence interval on the paired per-query difference, which uses the fact that an easy query is easy for both systems and cancels that shared difficulty out. (That paired procedure is the subject of a companion piece on paired significance tests for retrieval changes.) The marginal intervals above are enough to make the point here: a +0.02 point-estimate gap is not, on its own, evidence of anything. Shipping B on the strength of the point estimate alone is not a data-driven decision; it is a guess with a metric stapled to it.
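As a rough illustration of the paired idea, the sketch below bootstraps the per-query difference between two systems scored on the same queries. It is a minimal percentile version under stated assumptions, not necessarily the procedure the companion piece uses; the function name `paired_bootstrap_ci` and the synthetic scores (a shared per-query difficulty plus small system-specific noise) are hypothetical.

```python
import numpy as np


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(B - A), both systems scored on the same queries."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both systems must be scored on the same queries")
    diffs = b - a  # one difference per query; shared query difficulty cancels here
    rng = np.random.default_rng(seed)
    # Resample whole queries, so each query's A and B scores stay paired.
    idx = rng.integers(0, diffs.size, size=(n_resamples, diffs.size))
    boot_means = diffs[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    low, high = np.quantile(boot_means, [alpha, 1.0 - alpha])
    return float(diffs.mean()), float(low), float(high)


# Synthetic data: 200 queries whose difficulty is shared by both systems.
rng = np.random.default_rng(7)
difficulty = rng.beta(2.0, 3.0, size=200)
scores_a = np.clip(difficulty + rng.normal(0.00, 0.03, size=200), 0.0, 1.0)
scores_b = np.clip(difficulty + rng.normal(0.02, 0.03, size=200), 0.0, 1.0)

delta, low, high = paired_bootstrap_ci(scores_a, scores_b)
print(f"mean difference (B - A) = {delta:+.3f}, 95% CI = [{low:+.3f}, {high:+.3f}]")
```

If the interval on the difference excludes zero, the data support a real change; if it straddles zero, the point-estimate gap is not distinguishable from sampling noise. Because the synthetic scores were built with a large shared difficulty term, the interval on the difference should come out far narrower than either marginal interval, which is the reason to pair in the first place.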
How big does the difference need to be before it is trustworthy? The answer depends on how many queries are evaluated, and the dependence is unforgiving at the sample sizes most teams actually use. On a 50-topic TREC collection, an absolute difference on the order of several points is required before one can be 95 percent confident the ranking of two runs would survive on a different query sample (Voorhees and Buckley, 2002; the finding is reported for MAP and generalizes in magnitude to other effectiveness metrics). Larger query sets tighten the interval and make smaller differences detectable; smaller sets widen it and swallow them (Urbano, Lima, and Hanjalic, 2019).
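The dependence on query count is easy to see in a small simulation. For the mean of independent per-query scores, the interval width shrinks roughly with the square root of the number of queries, so quadrupling the query set roughly halves the interval. The sketch below assumes synthetic scores drawn from a fixed distribution; `ci_width` is a hypothetical helper, and real query sets will have their own variance.

```python
import numpy as np


def ci_width(scores, n_resamples=2000, seed=0):
    """Width of a 95% percentile bootstrap interval for the mean of `scores`."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, scores.size, size=(n_resamples, scores.size))
    low, high = np.quantile(scores[idx].mean(axis=1), [0.025, 0.975])
    return float(high - low)


# Synthetic pool of per-query scores (made up); take prefixes of growing size.
rng = np.random.default_rng(123)
pool = rng.beta(2.0, 3.0, size=5000)

widths = {n: ci_width(pool[:n]) for n in (50, 200, 800)}
for n, width in widths.items():
    print(f"{n:4d} queries -> 95% CI width of about {width:.3f}")
```

Reading the output against the size of the improvement you hope to detect gives a quick sanity check on whether the evaluation set is large enough to see it at all.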
Modern leaderboards fail this test
This is not a hypothetical concern confined to old TREC data. The leaderboards practitioners actually consult run into exactly the same wall.
On the French MTEB leaderboard, the top nine models are statistically equivalent at p = 0.05, and separating them would require more evaluation datasets (Ciancone et al., 2024). The leaderboard prints nine distinct ranks. The statistics say those nine collapse into a single indistinguishable group. Choosing the rank-one model over the rank-nine model, on that evidence, is choosing on noise. Many of the individual datasets inside heterogeneous retrieval benchmarks carry small query counts, well inside the regime where a couple of points of NDCG sits comfortably within sampling noise (Thakur et al., 2021).
The benchmark establishment says the same thing about itself. There is currently no agreed-upon methodology for significance testing across BEIR's aggregate score: per-dataset tests are possible, but a significance test on the final macro-average is not meaningful, because the constituent datasets diverge too much in corpus size, query count, and judgment count, among other dimensions (Kamalloo et al., 2024). The practical reading is narrow and worth stating exactly: a single headline BEIR average is not a quantity you can attach a cross-domain significance claim to. Per-dataset confidence intervals are both possible and advisable, which is precisely the discipline a point-estimate report skips.
Why the eval tools you already use will not save you
The natural objection is that this must already be handled by the evaluation platforms. It is not.
A survey of the documentation for the major RAG and LLM evaluation tools, conducted in May 2026, found that none of them natively ship bootstrap confidence intervals or paired significance tests on retrieval metrics. Ragas documents context precision and context recall but no significance testing. Arize Phoenix is the most explicit of the group, recommending traditional retrieval metrics including MRR, Precision@K, and NDCG, yet its documentation contains no mention of confidence intervals, paired tests, or p-values. Galileo, Patronus AI, DeepEval, Braintrust, Vectara, and FutureAGI follow the same pattern: graded and LLM-judge metrics are first-class features, while statistical uncertainty on those metrics is absent from the product documentation (tool documentation audit, May 2026).
This is not negligence on the vendors' part. These tools are excellent at what they are built for: fast dev-time iteration, continuous observability, and catching regressions across production traffic. The point is narrower and more important. The field of LLM evaluation has not yet inherited the IR-evaluation community's tooling, so the burden of distinguishing a real improvement from a lucky query sample currently falls on whoever reads the report. If your evaluation stack hands you a point estimate, the statistical work has not been done. It has been deferred to you.
What this means when someone is grading your evaluation
For a team running internal experiments, the cost of skipping confidence intervals is occasionally shipping a change that did nothing. That is recoverable. The cost rises sharply the moment the audience for your evaluation is external: an enterprise procurement team, a due-diligence reviewer, or an investor who has started asking how you measure retrieval quality. (What those audiences are actually checking for is the subject of a companion piece on what investors look for when they ask how you evaluate retrieval.)
The expectation that quantitative claims arrive with an uncertainty estimate is no longer confined to academic IR. The most influential reproducibility checklist in machine learning asks directly whether a submission reports error bars, confidence intervals, or statistical significance tests for its main experiments, and treats their presence as the default (NeurIPS Paper Checklist, 2022). The broader move toward reporting central tendency alongside variation has become a checklist item across the field (Pineau et al., 2021). The IR community has been calling for effect sizes and confidence intervals reported alongside system comparisons for over a decade (Sakai, 2014).
A retrieval evaluation that reports confidence intervals signals that the team understands the difference between a measurement and an average. One that reports a bare point estimate invites the next question, and the question after that, from exactly the audience a startup least wants to look unprepared in front of. The presence of an interval is not decoration. It is the difference between a number a reviewer can act on and a number a reviewer has to interrogate.
The methodology to produce these intervals is not proprietary and not difficult. Per-query scores come out of any standard IR evaluation library (pytrec_eval, ir_measures, or ranx), and a bootstrap interval is a single function call away in widely used scientific Python tooling:

    import numpy as np
    from scipy.stats import bootstrap

    # per_query_ndcg: a 1-D array of NDCG@10 scores, one per evaluation query
    per_query_ndcg = np.array([...])

    result = bootstrap(
        (per_query_ndcg,),
        np.mean,
        confidence_level=0.95,
        n_resamples=9999,
        method="BCa",  # bias-corrected and accelerated
    )

    ci = result.confidence_interval
    print(f"NDCG@10 = {per_query_ndcg.mean():.3f} "
          f"(95% CI [{ci.low:.3f}, {ci.high:.3f}])")

That is the whole procedure for a single configuration. scipy.stats.bootstrap has shipped since SciPy 1.7.0 and defaults to bias-corrected and accelerated (BCa) intervals (SciPy documentation). Comparing two configurations rigorously means computing the per-query difference between them and building the interval on that single array of differences. The reason most RAG evaluation reports omit any of this is not technical difficulty. It is that the tools default to point estimates, the leaderboards display point estimates, and the habit of treating a single number as a verdict is hard to break until someone shows you the interval around it.
The sample retrieval audit report uses bootstrap confidence intervals throughout, on a public benchmark, so you can see what a procurement-grade evaluation statement looks like next to the dashboard screenshot it replaces. It is email-gated and free.
For the underlying methodology, including how confidence intervals fit into a complete search-quality measurement framework, see Chapter 11 (Search Quality Metrics) of Designing Hybrid Search Systems.
Laszlo Csontos
Author of Designing Hybrid Search Systems (Leanpub, 2026). Practitioner background in production hybrid search, embeddings, cross-encoder reranking, and retrieval evaluation.