Paired Significance Tests for Retrieval Changes: When NDCG Went Up Isn't Enough
A reranker change moves NDCG@10 by +0.015 and the dashboard goes green. The paired bootstrap returns a 95% CI of [-0.003, +0.033], crossing zero. The point estimate said ship; the evidence says wait.
A team changes one thing in the retrieval stack. Maybe the reranker's top-k goes from 5 to 10, an embedding model gets swapped, or the hybrid fusion weights get adjusted. The evaluation set runs, and NDCG@10 moves from 0.412 to 0.432. A +0.02 gain. The dashboard is green. The number is up. Someone asks whether to ship it.
The instinct is yes. The number went up, the change was cheap, and there is a release on Friday. But "the number went up on the eval set" and "this change improved retrieval" are not the same statement, and the gap between them is where a lot of retrieval work quietly fails to compound. A point estimate is a single draw from a noisy process. The question that actually matters is whether the improvement would survive being measured on a different set of queries, and a single number cannot answer that. A significance test can, but only the right one for the way retrieval evaluation actually produces data.
The improvement that doesn't add up
This is not a hypothetical risk. When roughly a decade of reported ad-hoc retrieval gains was re-examined across SIGIR and CIKM publications, baselines turned out to be generally weak, often below the median original TREC system, and only a handful of experiments ever beat the best automatic TREC run (Armstrong et al., 2009). The field accumulated a long run of reported improvements that, measured against strong baselines and tested properly, did not stack up into the cumulative progress the individual results implied. Each gain looked real on its own eval set. In aggregate, many were noise.
The same dynamic operates inside a single team's CI pipeline, only faster. Every config change that ships on an unverified point estimate is a small bet that the gain is real. Some are. The ones that are not get baked into the baseline, the next change is measured against an inflated reference, and the team slowly loses the ability to tell motion from progress.
The starting point is appreciating how noisy these metrics are. When a 50-topic TREC collection was split into two disjoint halves and system rankings were compared on each, an absolute difference of roughly 8 to 9 percent in mean average precision was required on a 25-topic set before both halves agreed on the ordering more than 95 percent of the time (Voorhees & Buckley, 2002). The same study fitted an exponential trend to those error rates and projected it forward to larger topic sets; carried to 50 topics, an observed absolute difference above roughly 5 percent is enough to be 95 percent confident the ordering will hold (Sanderson & Zobel, 2005). Set that against the green dashboard: a 2-point NDCG move on a few hundred queries sits inside the range where two systems can swap rankings on a different draw of queries. The point estimate is not lying. It is just not, by itself, evidence.
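The split-half idea is easy to try on a team's own per-query scores. What follows is a simplified sketch of it, not the cited studies' protocol (those compared many system pairs and binned them by score difference). The function name `swap_rate` and the synthetic scores are assumptions made for illustration.

```python
import random
from statistics import mean

def swap_rate(scores_a, scores_b, trials=2_000, seed=0):
    """Fraction of random disjoint half-splits whose halves disagree on the winner."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    half = len(diffs) // 2
    swaps = 0
    for _ in range(trials):
        rng.shuffle(diffs)
        first, second = diffs[:half], diffs[half:2 * half]
        # A swap: the sign of the mean difference flips between the two halves.
        if mean(first) * mean(second) < 0:
            swaps += 1
    return swaps / trials

def clip(x):
    """Keep a synthetic score inside the valid NDCG range [0, 1]."""
    return min(1.0, max(0.0, x))

# Synthetic per-query NDCG@10 for 50 queries (hypothetical data, not a benchmark).
rng = random.Random(42)
difficulty = [rng.uniform(0.1, 0.8) for _ in range(50)]
ndcg_a = [clip(d + rng.gauss(0, 0.08)) for d in difficulty]
ndcg_b = [clip(d + 0.02 + rng.gauss(0, 0.08)) for d in difficulty]

print(f"overall gain: {mean(ndcg_b) - mean(ndcg_a):+.3f}")
print(f"swap rate on 25-query halves: {swap_rate(ndcg_a, ndcg_b):.1%}")
```

A swap rate well above a few percent means the observed ordering depends heavily on which queries happened to be in the set.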
The discipline reinforces the trap in two ways. It often does not test at all: roughly two thirds of experimental long papers at a major NLP venue reported no significance testing whatsoever (Dror et al., 2018). And it treats a single fixed evaluation split as ground truth when it is nothing of the kind. Re-running nine published part-of-speech taggers under randomly regenerated train-test splits failed to reproduce several of the rankings that had held on the standard split (Gorman & Bedrick, 2019). The ordering of two systems is a property of the evaluation sample, not only of the systems, and a point estimate hides that completely. The lesson from the machine learning side is older still: a test has to match where the variance actually comes from, because the wrong test inflates error rates regardless of how clean the headline numbers look (Dietterich, 1998).
Why pairing changes everything
Here is the structural fact that most retrieval evaluation gets right by accident and reasons about wrong. Configuration A and configuration B are evaluated on the same query set. Query 1 gets an NDCG value under A and an NDCG value under B. Query 2 gets two values. Every query yields a matched pair. This paired design is standard for batch-style IR experiments, and every appropriate significance test is built around it (Smucker et al., 2007).
The unit of analysis is the per-query difference. For each query q, the quantity that matters is d = NDCG_B(q) − NDCG_A(q), one number per query. The test is then a single question about those differences: is their mean distinguishable from zero?
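Written out, the quantities involved are few. The notation below is a standard textbook formulation, with $d_q$ standing for the per-query difference $d$ defined above and $n$ for the number of queries; the symbols $\bar{d}$, $s_d$, and $t$ are introduced here for illustration.

```latex
\begin{align*}
d_q     &= \mathrm{NDCG}_B(q) - \mathrm{NDCG}_A(q), \qquad q = 1, \dots, n \\
\bar{d} &= \frac{1}{n} \sum_{q=1}^{n} d_q \\
s_d^2   &= \frac{1}{n-1} \sum_{q=1}^{n} \left( d_q - \bar{d} \right)^2 \\
H_0     &: \ \mathbb{E}[d_q] = 0 \\
t       &= \frac{\bar{d}}{s_d / \sqrt{n}}
\end{align*}
```

The last line is the paired t statistic. The bootstrap and the randomization test ask the same question about $\bar{d}$ without assuming a particular distribution for it.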
Pairing matters because queries are wildly unequal. Some are easy and score high under any reasonable system. Some are hard and score low regardless of configuration. That between-query variation is enormous, and it is shared between A and B because they answer the same queries. The question that matters is not whether B's average is higher than A's average, which blends the real effect together with all that query-to-query noise. It is whether, on the same query, B tends to beat A, and by how much. Taking the per-query difference cancels the part of the variance that A and B share, leaving only the signal worth measuring.
Treating the two runs as independent samples throws this away. An unpaired test compares A's mean against B's mean relative to the full spread of each set of scores, and asks whether the two clouds are distinguishable. But the clouds overlap heavily, because both contain the same easy queries scoring high and the same hard queries scoring low. The shared structure that a paired test exploits to gain power is exactly what an unpaired test discards. The result is a test that needs a far larger query set to detect the same true effect. Paired designs require substantially fewer topics than unpaired designs for equivalent statistical power (Sakai, 2018). Ignore the pairing and the evidence for real changes gets systematically understated.
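The cancellation can be seen directly by comparing standard errors. This is a minimal sketch on synthetic scores: the shared per-query difficulty, the 0.02 true gain, and the noise level are all assumptions chosen for illustration, not measurements from any real system.

```python
import random
from math import sqrt
from statistics import mean, variance

def clip(x):
    """Keep a synthetic score inside the valid NDCG range [0, 1]."""
    return min(1.0, max(0.0, x))

rng = random.Random(7)
n = 200
# Hypothetical data: each query has a difficulty shared by both systems,
# plus a small amount of independent noise per system.
difficulty = [rng.uniform(0.1, 0.9) for _ in range(n)]
ndcg_a = [clip(d + rng.gauss(0, 0.05)) for d in difficulty]
ndcg_b = [clip(d + 0.02 + rng.gauss(0, 0.05)) for d in difficulty]

diffs = [b - a for a, b in zip(ndcg_a, ndcg_b)]

# Paired: only the variance of the per-query differences matters,
# because the shared difficulty cancels in b - a.
paired_se = sqrt(variance(diffs) / n)

# Unpaired: the variances of the two score sets add,
# so the shared difficulty is counted as noise twice.
unpaired_se = sqrt(variance(ndcg_a) / n + variance(ndcg_b) / n)

print(f"mean difference:     {mean(diffs):+.4f}")
print(f"paired std error:    {paired_se:.4f}")
print(f"unpaired std error:  {unpaired_se:.4f}")
print(f"query-set inflation: {(unpaired_se / paired_se) ** 2:.1f}x")
```

Because standard error shrinks with the square root of the query count, the squared ratio of the two standard errors approximates how many times more queries the unpaired analysis would need for the same precision.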
Two tests built for paired data
Once the per-query difference is the unit of analysis, the question becomes how to decide whether the average difference is distinguishable from zero. Two procedures are the standard answers in IR, and they agree closely at realistic query-set sizes.
The paired bootstrap treats the query set as a stand-in for the population of queries that could have been drawn. It repeatedly resamples queries with replacement, recomputes the mean per-query difference on each resample, and assembles those thousands of recomputed differences into a confidence interval (the single-configuration version of the same resampling logic is covered in bootstrap confidence intervals for NDCG). If the 95 percent interval on the per-query difference straddles zero, the data are consistent with B being no better than A. The whole procedure is short:
```python
# illustrative, not paste-and-run: ndcg_a and ndcg_b are per-query score arrays
import numpy as np
from sklearn.utils import resample

diffs = ndcg_b - ndcg_a                      # one difference per query
means = [resample(diffs).mean()              # resample: draw len(diffs) queries
         for _ in range(10_000)]             # with replacement
lo, hi = np.percentile(means, [2.5, 97.5])   # 95% CI on the mean per-query difference
```

The method rests on the broader foundation of resampling-based inference (Efron & Tibshirani, 1993) and has a settled history of use for IR system comparison (Sakai, 2006).
The randomization test (also called a permutation test) attacks the question from the null hypothesis directly. If B is truly no different from A, then for any given query it should not matter which label is attached to which score. So the test randomly swaps the A and B labels per query thousands of times, recomputing the mean difference each time, and builds the distribution of differences expected if the labels were meaningless. The observed difference is then read against that null distribution. If it sits far out in the tail, the swap-invariance assumption is implausible and the difference is significant. The randomization test is the recommended default for IR evaluation, with a test statistic matching the one used to measure the difference between systems (Smucker et al., 2007).
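The label-swapping procedure can be sketched in a few lines. Swapping the A and B labels on a query is the same as flipping the sign of that query's difference, which is the formulation used here. The function name `randomization_test`, the two-sided reading, and the synthetic scores are assumptions for illustration, and the mean difference is the test statistic because it is the quantity reported.

```python
import random

def randomization_test(scores_a, scores_b, trials=10_000, seed=0):
    """Two-sided paired randomization test on the mean per-query difference."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for _ in range(trials):
        # Swapping the A/B labels on a query flips the sign of its difference.
        permuted_sum = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance so floating-point rounding does not hide exact ties.
        if abs(permuted_sum / n) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (trials + 1)

# Synthetic per-query NDCG@10 (hypothetical data, not a benchmark).
rng = random.Random(3)
difficulty = [rng.uniform(0.1, 0.9) for _ in range(200)]
ndcg_a = [d + rng.gauss(0, 0.05) for d in difficulty]
ndcg_b = [d + 0.02 + rng.gauss(0, 0.05) for d in difficulty]

print(f"p-value: {randomization_test(ndcg_a, ndcg_b):.4f}")
```

The returned value is the share of label assignments that produce a mean difference at least as large as the observed one, so a small value means the observed gap is rare under the no-difference hypothesis.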
The choice among these is less fraught than it looks. The paired t-test, the bootstrap, and the randomization test largely agree at TREC sample sizes, and the t-test is robust to the assumption violations practitioners once worried about (Smucker et al., 2007; Sakai, 2014). What matters is avoiding the wrong tool. The Wilcoxon signed-rank and sign tests should be discontinued for IR evaluation: they test a different quantity than the one actually reported, and they both miss true differences and raise false alarms relative to the randomization test (Smucker et al., 2007). An eval harness that defaults to Wilcoxon has a defect.
What a real verdict looks like
Return to the reranker change, but make it concrete. Raising the reranker's top-k from 5 to 10 moves the point estimate by +0.015 NDCG@10. Encouraging. The paired bootstrap then returns a 95 percent confidence interval on the per-query difference of [-0.003, +0.033].
That interval is the verdict. It is mostly positive, its center of mass favors the change, and a careless reading calls it a win. But it crosses zero. The data are consistent with the change doing nothing, and they are also consistent with a +0.03 gain. Which one is true is unknown. The point estimate said "up." The interval says what the evidence actually supports: promising, unproven, not shippable on this query set alone.
The correct response is neither to ship nor to discard, but to expand the judgment set until the interval can resolve the question. How many more queries that takes depends on the size of the effect worth caring about and the variance of the chosen metric, and that variance differs sharply from metric to metric (Sakai, 2016). A small true effect on a noisy metric simply needs more queries to detect. No amount of staring at a point estimate substitutes for that.
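A rough way to size the expansion is the textbook normal-approximation formula for a paired comparison: the number of queries grows with the square of the ratio between the standard deviation of the per-query differences and the smallest effect worth detecting. This is a rule of thumb under a normality assumption, not the topic-set-size procedure from the cited work. The function name `queries_needed`, the 80 percent power target, and the standard deviation of 0.13 are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def queries_needed(sd_diff, min_effect, alpha=0.05, power=0.80):
    """Approximate query count for a two-sided paired test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_power = z.inv_cdf(power)           # quantile for the desired power
    return ceil(((z_alpha + z_power) * sd_diff / min_effect) ** 2)

# Hypothetical inputs: per-query differences with standard deviation 0.13,
# and a smallest gain worth shipping of 0.015 NDCG@10.
print(queries_needed(sd_diff=0.13, min_effect=0.015))

# The interval from the example above, [-0.003, +0.033], has half-width 0.018
# around a point estimate of 0.015. Interval width shrinks with sqrt(n), so for
# the lower bound to just reach zero, n must grow by (half_width / point)^2,
# assuming the point estimate and the variance stay where they are.
half_width = (0.033 - (-0.003)) / 2
point_estimate = 0.015
growth = (half_width / point_estimate) ** 2
print(f"query set must grow by about {growth:.2f}x")
```

The second calculation only brings the interval to the edge of zero if the estimate does not move. A set sized by `queries_needed` is the safer target, because it budgets for the estimate itself shifting on new queries.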
What this means for a retrieval stack in production
A point estimate is a fine signal for an engineer iterating locally. It is the wrong gate for a deployment decision. Once retrieval changes flow through CI/CD, the natural home for a paired significance test is exactly where the unit test lives: a config change does not merge unless the paired test clears a threshold set in advance. That threshold is a policy choice. A high-stakes legal or financial retrieval system might require the lower bound of the confidence interval to sit above zero before anything ships. A fast-moving consumer product might accept a looser bar. Either way the gate is configurable, it is paired, and it replaces "the number went up" with "the change is distinguishable from noise at the confidence level the team committed to."
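Such a gate might look like the following sketch. The names `bootstrap_ci` and `gate`, the `min_lower_bound` and `level` parameters, and the synthetic differences are assumptions for illustration, not any tool's API. In a real pipeline the "hold" verdict would map to a non-zero exit code so the merge is blocked.

```python
import random

def bootstrap_ci(diffs, level=0.95, resamples=10_000, seed=0):
    """Percentile bootstrap interval for the mean per-query difference."""
    rng = random.Random(seed)
    n = len(diffs)
    # Each resample draws n queries with replacement and records the mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(resamples))
    k = int((1 - level) / 2 * resamples)   # number of means cut from each tail
    return means[k], means[-k - 1]

def gate(diffs, min_lower_bound=0.0, level=0.95, resamples=10_000):
    """Return ('ship' or 'hold', lo, hi) under a threshold fixed in advance."""
    lo, hi = bootstrap_ci(diffs, level=level, resamples=resamples)
    verdict = "ship" if lo > min_lower_bound else "hold"
    return verdict, lo, hi

# Synthetic per-query differences (hypothetical data, not a benchmark).
rng = random.Random(11)
diffs = [0.015 + rng.gauss(0, 0.13) for _ in range(300)]

# Strict policy: the 95% interval must sit entirely above zero.
print("strict:", gate(diffs, min_lower_bound=0.0, level=0.95))
# Looser policy: an 80% interval above zero is accepted.
print("loose: ", gate(diffs, min_lower_bound=0.0, level=0.80))
```

Both policy knobs are set before the experiment runs. Choosing the confidence level after seeing the interval would defeat the purpose of the gate.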
This is not how the popular first-party evaluation tools work today. Across the public documentation of the major RAG-evaluation platforms reviewed in May 2026 (Ragas, Arize Phoenix, DeepEval, Vectara Open RAG Eval, Braintrust, Galileo, Patronus AI, and FutureAGI), the common pattern is point metrics plus side-by-side experiment comparison, and several report NDCG, MRR, recall@k, and hit-rate natively. As of this writing, none of their public documentation describes a built-in paired significance test (a paired bootstrap or randomization test) on the per-query differences between two configurations. These tools report that A scored 0.412 and B scored 0.432. They do not report whether that gap survives the noise in a given query set. That last step, the one that turns a number into a decision, is left to the team (why that gap is structural, and not something a tool roadmap can close, is the subject of first-party eval tools versus independent audit).
The sample audit report runs this example end to end on a public benchmark. The paired significance section, on page 14, shows the per-query difference distribution, the bootstrap interval, and the ship-or-hold verdict for a real configuration change.
Download the sample report. The full evaluation-pipeline treatment, including how to wire paired significance into a CI gate, is Chapter 12 of Designing Hybrid Search Systems.
Laszlo Csontos
Author of Designing Hybrid Search Systems (Leanpub, 2026). Practitioner background in production hybrid search, embeddings, cross-encoder reranking, and retrieval evaluation.