Measuring Search Quality: A Practical Guide to Reranking and Evaluation
Cross-encoder reranking delivers 25-48% retrieval quality gains. NDCG baselines turn search optimization from vibes into engineering.
Abstract
Two of the highest-leverage investments in search quality are also two of the most commonly skipped: cross-encoder reranking and systematic evaluation. Reranking re-scores the top candidates from first-stage retrieval with dramatically higher accuracy, often delivering the largest single relevance improvement in the entire pipeline. Evaluation infrastructure (NDCG baselines, online metrics, automated regression testing) transforms search optimization from guesswork into engineering. This article covers the retrieve-then-rerank architecture, why cross-encoders produce outsized gains for minimal effort, how NDCG works and why it matters, and why offline and online evaluation serve different purposes that are both essential.
The Retrieve-Then-Rerank Architecture
First-stage retrieval (BM25, embeddings, hybrid fusion) is fast but imprecise. It casts a wide net across millions of documents in milliseconds. The goal is recall: get the right documents into the candidate set. Missing a relevant document at this stage means it is gone forever. Speed matters. Precision is secondary.
A cross-encoder reranker takes the top 100 to 200 candidates from first-stage retrieval and re-scores each one against the original query with much higher accuracy. Instead of comparing pre-computed embeddings (which compress information into fixed-size vectors), the cross-encoder processes the query and document together through a transformer. It sees the full token-level interaction between them.
The analogy is hiring: you receive 10,000 applications and scan each resume quickly, using keyword matching, years of experience, and education, at about 30 seconds per resume, narrowing 10,000 to 100 candidates. Then you read those 100 carefully, give each a full review, and rank them thoughtfully. The top 10 go to interviews.
In practice:
- BM25 + semantic retrieval returns top 100 candidates (~20ms)
- Cross-encoder reranks to top 10 results (~200-400ms)
- Total latency: under 500ms
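The pipeline above can be sketched as a generic two-stage function. The scoring callables are placeholders for BM25 and a cross-encoder; the function name, parameters, and defaults are illustrative, not any particular library's API.

```python
from typing import Callable

def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    first_stage_score: Callable[[str, str], float],  # cheap, recall-oriented
    rerank_score: Callable[[str, str], float],       # expensive, precise
    candidates: int = 100,
    top_k: int = 10,
) -> list[str]:
    # Stage 1: score the whole corpus cheaply and keep a wide candidate pool.
    pool = sorted(
        corpus, key=lambda d: first_stage_score(query, d), reverse=True
    )[:candidates]
    # Stage 2: the expensive scorer runs only `candidates` times per query.
    return sorted(
        pool, key=lambda d: rerank_score(query, d), reverse=True
    )[:top_k]
```

The structure makes the recall/precision split explicit: a miss in stage 1 is unrecoverable, while stage 2's cost is bounded by the `candidates` parameter regardless of corpus size.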
The multi-stage design's major advantage is the ability to trade off quality against latency by controlling candidate admission at each stage (Nogueira et al., 2019). Stage 1 optimizes for recall at the top 100. Stage 2 optimizes for precision at the top 10. Trying to optimize one model for both is why monolithic approaches underperform.
Why Cross-Encoders Produce Outsized Gains
Cross-encoders perform full self-attention over the concatenated query-document pair, which produces substantially better relevance judgments than bi-encoder similarity but is too slow for first-stage retrieval at scale (Humeau et al., 2020). The key difference: a bi-encoder produces independent embeddings for query and document, then computes similarity via dot product. Information is lost in the compression to fixed vectors. A cross-encoder sees both texts simultaneously and can attend to fine-grained interactions between specific query terms and document passages.
To make the distinction concrete: a bi-encoder computes $\mathrm{sim}(q, d) = f(q) \cdot f(d)$, where $f$ maps each text independently to a vector. The interaction between query and document happens only at the final similarity computation, which is a single number. A cross-encoder computes $s = g([q; d])$, where $[q; d]$ is the concatenation of query and document tokens, and the transformer processes them jointly. Every token in the query can attend to every token in the document through multiple layers of self-attention. This is why cross-encoders are more accurate: they see the full interaction. It is also why they are slower: they must run the transformer for every query-document pair individually, rather than pre-computing document embeddings once.
For first-stage retrieval over millions of documents, bi-encoders are necessary because pre-computed embeddings allow sub-millisecond lookup. For reranking 100 candidates, cross-encoders are practical because the computational cost is bounded.
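A back-of-envelope way to see why each stage gets the model it does is to count transformer forward passes, the dominant cost. The functions below are toy accounting, not real models; the corpus and candidate sizes are illustrative.

```python
def bi_encoder_cost(num_docs: int, num_queries: int) -> int:
    # Documents are embedded once, offline; each query is embedded once.
    # Similarity is then a cheap vector lookup, not a forward pass.
    return num_docs + num_queries

def cross_encoder_cost(num_candidates: int, num_queries: int) -> int:
    # One joint forward pass per (query, candidate) pair; nothing can be
    # pre-computed because the model must see both texts together.
    return num_queries * num_candidates

# Cross-encoding a million-document corpus for every query is infeasible,
# but re-scoring 100 candidates per query is bounded and cheap.
per_query_full = cross_encoder_cost(num_candidates=1_000_000, num_queries=1)
per_query_rerank = cross_encoder_cost(num_candidates=100, num_queries=1)
```

The asymmetry is the whole argument: bi-encoder cost is amortized across all future queries, while cross-encoder cost recurs per pair, so it only fits where the pair count is capped.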
The initial application of BERT to passage re-ranking achieved a 27% relative MRR@10 improvement over previous state-of-the-art on MS MARCO (Nogueira & Cho, 2019). That was 2019. The gains have only compounded with larger models and better training.
Industry benchmarks reinforce the pattern. Databricks research reports reranking can improve retrieval quality by up to 48%. ZeroEntropy reports +28% NDCG@10 over baseline retrievers (ZeroEntropy, 2025). Enterprise adoption data shows 25 to 40% improvement in Precision@5 and NDCG@5 depending on baseline and domain (RAGAboutIt, Jan 2026).
Dense-only approaches severely underperform BM25 plus reranking pipelines on many out-of-domain datasets (Thakur et al., 2021). Cross-encoders largely outperform dense retrievers of equivalent size across out-of-domain tasks on the BEIR benchmark (Rosa et al., 2022).
The cost-benefit comparison is stark. Fine-tuning a bi-encoder embedding model takes weeks of data collection, curation, hard negative mining, and training. Adding an off-the-shelf cross-encoder reranker takes days. Re-ranking and late-interaction models on average achieve the best zero-shot performance across diverse tasks (Thakur et al., 2021). Fine-tuning a cross-encoder on domain data pushes results further. Domain-specific fine-tuning on cross-encoders achieved MAP of 92% and 94.3% on WikiQA and TREC-QA, up from previous bests of 83.4% and 87.5% (Garg, Vu & Moschitti, 2020).
The common trajectory: teams start with BM25, add embeddings, spend months optimizing the embedding model, and then discover that adding a reranker in a week moves the needle more than all of it. This is not a hypothetical pattern. It repeats across organizations and domains.
NDCG: The Metric That Replaces Vibes
A common pattern in search teams: optimizing the system for months, then being asked "how do you know it's improving?" The answer is usually some version of: "We check a few queries and see if the results look better."
That is not measurement.
NDCG (Normalized Discounted Cumulative Gain) quantifies whether the right results are at the top and whether the best results are ranked higher than the merely acceptable ones. The metric introduced position-based discounting and graded relevance evaluation as an alternative to binary precision/recall (Järvelin & Kekäläinen, 2002).
The formula for DCG at position $p$:

$$\mathrm{DCG@}p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

where $rel_i$ is the graded relevance of the result at position $i$. NDCG normalizes this by dividing by the ideal DCG (the DCG of a perfectly ranked list):

$$\mathrm{NDCG@}p = \frac{\mathrm{DCG@}p}{\mathrm{IDCG@}p}$$
The intuition: a result at position 1 matters more than a result at position 10 (the logarithmic denominator discounts lower positions). A highly relevant result matters more than a marginally relevant one (the exponential numerator amplifies relevance differences). NDCG@10 captures whether the top 10 results are the right ones in the right order.
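The formula translates directly into a few lines of Python. This is a standard exponential-gain NDCG sketch; the input is the graded relevance of one query's results, in ranked order.

```python
import math

def dcg(relevances: list[int], k: int) -> float:
    # Exponential numerator amplifies relevance differences; the log
    # denominator discounts lower positions. Python lists are 0-indexed,
    # so position i in the formula becomes log2(i + 2) here.
    return sum(
        (2**rel - 1) / math.log2(i + 2)
        for i, rel in enumerate(relevances[:k])
    )

def ndcg(relevances: list[int], k: int = 10) -> float:
    """NDCG@k for one query, given graded relevance in ranked order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

One caveat: normalizing against a re-sort of the retrieved list assumes the candidate set contains all judged documents; production evaluation should compute the ideal DCG from the full judgment pool for the query.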
Setting up NDCG evaluation requires labeled queries: queries paired with relevance judgments for their results (typically on a scale of 0 to 3 or 0 to 4). Research on retrieval experiment error rates shows that error decreases as the topic set size increases, with 50 topics yielding approximately 5% error for MAP differences of 5% (Voorhees & Buckley, 2002). In practice, 200 to 300 labeled queries provide a usable baseline for most systems.
Once NDCG baselines exist, everything changes. Every system change can be measured. A/B tests have quantitative outcomes. Decisions are made with data instead of opinions.
The practical setup is not as burdensome as teams assume. Select 200 to 300 queries that represent the production distribution (weighted toward common query types, but including tail queries). For each query, have annotators judge the top 20 to 30 results on a graded scale: 0 (irrelevant), 1 (marginally relevant), 2 (relevant), 3 (highly relevant). Store these judgments. Run every candidate system change against this evaluation set before shipping. Automate the comparison. A regression in NDCG@10 is a blocking issue, just like a failing test in a CI pipeline.
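The ship/block decision can itself be automated. A minimal sketch, assuming each evaluation run is stored as a mapping from query ID to the graded relevance of its ranked results; the 1% tolerance is an illustrative choice, not a standard.

```python
import math

def ndcg_at_10(relevances: list[int]) -> float:
    # Exponential-gain NDCG at cutoff 10 for a single query.
    def dcg(rels):
        return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:10]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mean_ndcg(run: dict[str, list[int]]) -> float:
    # Average NDCG@10 across all queries in the evaluation set.
    return sum(ndcg_at_10(rels) for rels in run.values()) / len(run)

def regression_gate(baseline: dict[str, list[int]],
                    candidate: dict[str, list[int]],
                    tolerance: float = 0.01) -> bool:
    # Block the change if mean NDCG@10 drops by more than the tolerance.
    return mean_ndcg(candidate) >= mean_ndcg(baseline) - tolerance
```

Wired into CI, `regression_gate` returning `False` fails the build, which is exactly the "blocking issue" semantics described above.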
The evaluation set is not static. It should be refreshed quarterly with new queries sampled from production logs, because query distributions drift over time. But even a stale evaluation set is vastly better than no evaluation set.
Algolia reports that only 53% of retailers with advanced search capabilities have defined KPIs for search, compared to just 13% of those with basic search (Algolia, 2024). Evaluation maturity directly correlates with search quality. The teams that measure are the teams that improve.
Online vs. Offline Evaluation: Two Different Questions
Offline evaluation uses a fixed test set with human-labeled relevance judgments. It answers the question: does the new system rank known-relevant documents correctly for these specific queries? It is essential for controlled model comparison.
Online evaluation uses production signals: click-through rates, zero-result rates, query reformulation rates, session success metrics. It answers a different question: how do real users respond to the system on the queries they actually type today?
These questions have different answers more often than teams expect.
A scenario that plays out regularly: a team fine-tunes a new embedding model. Offline NDCG@10 improves by 8%. They ship to production. Two weeks later: click-through rates are flat. Session length unchanged. Conversion slightly down.
The problem is almost always the same: the offline evaluation set does not represent the production query distribution. Across 36 offline metrics evaluated for e-commerce search, offline improvements do not always translate to online gains (Li et al., 2023).
Offline evaluation is a snapshot of a distribution that moves. Query popularity, relevant documents, and user intent all change over time (Kulkarni et al., 2011). The test set that accurately represented production queries six months ago may over-represent head queries and under-represent the long tail today.
Online evaluation complements offline approaches: offline metrics tend to be more easily interpretable, but online measurement is more realistic as a gauge of actual user experience (Hofmann, Li & Radlinski, 2016).
The gap between offline and online results is itself diagnostic:
- Offline wins, online flat: the test set probably over-represents head queries and under-represents the long tail where the model improvement does not help.
- Online wins, offline flat: users are responding to something the labels do not capture, often freshness, diversity, or result presentation effects.
- Both improve: genuine quality gain.
- Offline wins, online drops: the model may be overfitting to the evaluation distribution at the expense of the production distribution.
The practical requirement is running both. Offline evaluation for controlled experiments and model comparison. Online evaluation for production validation and distribution drift detection. Trust neither alone.
Conclusion
Cross-encoder reranking and systematic evaluation are the two most accessible high-impact investments in search quality. Reranking takes the approximate results from first-stage retrieval and applies precise relevance scoring at a fraction of the latency cost of running a cross-encoder over the full index. The gains are well-documented: 25 to 48% improvement in retrieval quality metrics depending on baseline and domain.
NDCG evaluation infrastructure transforms search optimization from subjective judgment into quantitative engineering. Setting up a baseline requires 200 to 300 labeled queries and an automated evaluation pipeline. Once that exists, every change is measurable, every regression is detectable, and every design decision has evidence behind it.
Online metrics close the loop by validating that offline improvements translate to real user behavior changes. The combination of offline and online evaluation catches the failure modes that either alone would miss: overfitting to static test sets, distribution drift, and user experience factors that relevance labels do not capture.
References
- Algolia (2024). "40+ Stats on E-Commerce Search and KPIs." https://algolia.com/blog/ecommerce/e-commerce-search-and-kpis-statistics
- Garg, S., Vu, T., & Moschitti, A. (2020). "TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection." AAAI 2020. https://arxiv.org/abs/1911.04118
- Hofmann, K., Li, L., & Radlinski, F. (2016). "Online Evaluation for Information Retrieval." Foundations and Trends in Information Retrieval, 10(1), 1-117. https://dl.acm.org/doi/10.1561/1500000051
- Humeau, S., Shuster, K., Lachaux, M., & Weston, J. (2020). "Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring." ICLR 2020. https://arxiv.org/abs/1905.01969
- Järvelin, K., & Kekäläinen, J. (2002). "Cumulated Gain-Based Evaluation of IR Techniques." ACM Transactions on Information Systems, 20(4), 422-446. https://dl.acm.org/doi/10.1145/582415.582418
- Kulkarni, A., Teevan, J., Svore, K. M., & Dumais, S. T. (2011). "Understanding Temporal Query Dynamics." WSDM 2011, pp. 167-176. https://dl.acm.org/doi/10.1145/1935826.1935862
- Li, X., Zha, D., Chen, W., Wen, Z., Zhang, X., & Chua, T. (2023). "How Well do Offline Metrics Predict Online Performance of Product Ranking Models?" SIGIR 2023. https://dl.acm.org/doi/10.1145/3539618.3591865
- Nogueira, R., & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv:1901.04085. https://arxiv.org/abs/1901.04085
- Nogueira, R., Yang, W., Cho, K., & Lin, J. (2019). "Multi-Stage Document Ranking with BERT." arXiv:1910.14424. https://arxiv.org/abs/1910.14424
- Rosa, G. M., et al. (2022). "In Defense of Cross-Encoders for Zero-Shot Retrieval." arXiv:2212.06121. https://arxiv.org/abs/2212.06121
- RAGAboutIt (Jan 2026). Enterprise reranking adoption data. https://ragaboutit.com
- Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." NeurIPS 2021. https://arxiv.org/abs/2104.08663
- Voorhees, E. M., & Buckley, C. (2002). "The Effect of Topic Set Size on Retrieval Experiment Error." SIGIR 2002, pp. 316-323. https://dl.acm.org/doi/abs/10.1145/564376.564432
- ZeroEntropy (2025). "Ultimate Guide to Choosing the Best Reranking Model in 2025." https://zeroentropy.dev/articles/ultimate-guide-to-choosing-the-best-reranking-model-in-2025
Need help measuring search quality? Start with a baseline audit.