Measuring Search Quality: Reranking and the Evaluation Foundation

Two of the highest-leverage investments in search quality are also two of the most commonly skipped: cross-encoder reranking and systematic evaluation. Reranking re-scores the top candidates from first-stage retrieval with much higher accuracy, and it is often the single largest relevance improvement available in the entire pipeline. Evaluation infrastructure, an NDCG baseline plus production signals, is what turns search work from a sequence of guesses into something measurable. This piece covers the retrieve-then-rerank architecture, why cross-encoders produce outsized gains for minimal effort, what NDCG actually measures, and why offline and online evaluation answer two different questions that are both essential.

The Retrieve-Then-Rerank Architecture

First-stage retrieval (BM25, dense embeddings, or hybrid fusion) is fast but imprecise. It casts a wide net across millions of documents in milliseconds. The goal at this stage is recall: get every plausibly relevant document into the candidate set, because a relevant document missed here is gone for good. Speed matters; precision is secondary.

A cross-encoder reranker then takes the top candidates from that wide net and re-scores each one against the original query with much higher accuracy. Instead of comparing pre-computed embeddings, which compress each text into a fixed-size vector, the cross-encoder processes the query and document together through a transformer and sees the full token-level interaction between them.

The structure is the same one a hiring funnel uses. Ten thousand applications arrive. A fast first pass on keywords, years of experience, and education spends a few seconds on each application and narrows the pool to a hundred. Then those hundred get read carefully and ranked, and the top ten go to interviews. The cheap pass maximizes recall; the expensive pass maximizes precision on a bounded set.

Figure: Illustrative latencies for a two-stage pipeline. Stage 1 optimizes for recall at the top 100; Stage 2 optimizes for precision at the top 10. The split is what lets the pipeline trade quality against latency separately at each stage.

The major advantage of the multi-stage design is exactly this ability to trade quality against latency by controlling how many candidates each stage admits (Nogueira et al., 2019). Trying to make a single model optimize for both recall over millions of documents and precision over the final ten is why monolithic approaches tend to underperform a two-stage one.
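The two-stage flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: `retrieve_then_rerank`, `term_overlap`, and `toy_phrase_scorer` are hypothetical names, the first-stage scorer is a toy stand-in for BM25 or a vector index, and the "reranker" is a toy stand-in for a cross-encoder. The point is only the shape: a cheap scorer over everything, an expensive scorer over a bounded candidate set, and two knobs (`k_retrieve`, `k_final`) that control how many candidates each stage admits.

```python
from typing import Callable, Dict, List, Tuple

# A scorer takes (query, document text) and returns a relevance score.
Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: Dict[str, str],
    first_stage: Scorer,
    reranker: Scorer,
    k_retrieve: int = 100,
    k_final: int = 10,
) -> List[Tuple[str, float]]:
    """Return the top k_final (doc_id, reranker_score) pairs for a query."""
    # Stage 1: cheap scorer over the whole corpus, tuned for recall.
    # (A real system would query an index here instead of scanning.)
    candidates = sorted(
        corpus, key=lambda doc_id: first_stage(query, corpus[doc_id]), reverse=True
    )[:k_retrieve]
    # Stage 2: expensive scorer over the bounded candidate set only.
    rescored = [(doc_id, reranker(query, corpus[doc_id])) for doc_id in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:k_final]


def term_overlap(query: str, doc: str) -> float:
    """Toy first stage: number of distinct query terms present in the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_phrase_scorer(query: str, doc: str) -> float:
    """Toy reranker: term overlap plus a bonus when the exact phrase appears."""
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return term_overlap(query, doc) + bonus


corpus = {
    "router": "reset the router and the wifi password is printed on the label",
    "account": "how to do a password reset from the account settings page",
    "shipping": "shipping times for international orders",
}
top = retrieve_then_rerank(
    "password reset", corpus, term_overlap, toy_phrase_scorer, k_retrieve=2, k_final=1
)
print(top)
```

Both the router and the account document match the query terms equally at stage 1, so the cheap pass cannot separate them. Only the second stage, which looks at how the terms relate to each other, puts the account document first.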

Why Cross-Encoders Produce Outsized Gains

Reranking is often the highest-leverage move available in a retrieval pipeline, and the reason is mechanical. A cross-encoder performs full self-attention over the concatenated query-document pair, which yields substantially better relevance judgments than bi-encoder similarity but is too slow for first-stage retrieval at scale (Humeau et al., 2020).

The distinction is concrete. A bi-encoder computes $\text{score} = \text{sim}(\text{enc}(q), \text{enc}(d))$, mapping the query and the document to vectors independently and comparing them only at the final similarity step, which yields a single number. Information is lost in the compression to fixed vectors. A cross-encoder computes $\text{score} = f(\text{transformer}([q; d]))$, where $[q; d]$ is the query and document tokens concatenated and processed jointly, so every query token can attend to every document token across multiple layers. That joint attention is why cross-encoders are more accurate, and the need to run the transformer for every query-document pair individually is why they are too slow for the first stage and well suited to reranking a bounded candidate set.
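A rough per-query cost comparison makes the asymmetry explicit. The symbols are illustrative assumptions, not taken from the cited papers: $N$ is the corpus size, $k$ is the number of candidates reranked, $C_{\text{enc}}$ is the cost of one transformer pass over the query alone, $C_{\text{sim}}$ is the cost of one vector comparison, and $C_{\text{pair}}$ is the cost of one transformer pass over a concatenated query-document pair.

$\text{cost}_{\text{bi}} \approx C_{\text{enc}} + N \cdot C_{\text{sim}}$

$\text{cost}_{\text{cross}} \approx k \cdot C_{\text{pair}}$

Document vectors are computed once at index time, so the bi-encoder pays for a single transformer pass per query and the rest is cheap vector arithmetic, which an approximate nearest-neighbor index typically reduces further. The cross-encoder cannot precompute anything, because its input depends on the query, so it pays a full transformer pass per candidate. With $k = N$ that means millions of passes per query; with $k = 100$ the cost is bounded.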

The earliest application of BERT to passage re-ranking achieved a 27% relative improvement in MRR@10 over the prior state of the art on MS MARCO (Nogueira & Cho, 2019). That was 2019, and the gains have compounded with larger models and better training since.

The advantage holds up where it is hardest to hold up: out of domain. Dense-only retrieval underperforms a BM25-plus-reranking pipeline on many out-of-domain datasets (Thakur et al., 2021), and cross-encoders largely outperform dense retrievers of equivalent size across out-of-domain tasks on the BEIR benchmark, which was built specifically to test cross-domain generalization (Rosa et al., 2022).

The cost-benefit comparison is what makes that leverage real. Fine-tuning a bi-encoder embedding model is weeks of data collection, curation, hard-negative mining, and training. Adding an off-the-shelf cross-encoder reranker is days. Re-ranking and late-interaction models achieve, on average, the best zero-shot performance across diverse tasks (Thakur et al., 2021), so the off-the-shelf option is already near the top of what is achievable without in-domain training. Reaching that level of quality with a component that drops in within days, rather than weeks of embedding work, is what lets a reranker capture most of the available relevance gain for a fraction of the effort.

Fine-tuning a cross-encoder on in-domain data extends the gain further, though by how much is task-dependent. On answer-sentence-selection benchmarks specifically, a transfer-and-adapt fine-tuning approach reached a MAP of 92% on WikiQA and 94.3% on TREC-QA, up from prior bests of 83.4% and 87.5% (Garg, Vu & Moschitti, 2020). Those are answer-selection results rather than a general reranking number, but they illustrate how much headroom domain fine-tuning can add on top of a cross-encoder that already works well off the shelf.

The trajectory is a common one in practitioner accounts: teams start with BM25, add embeddings, spend months optimizing the embedding model, then find that adding a reranker in a week moves relevance more than the embedding work did.

NDCG: The Foundational Metric

Reranking changes the system. The next question is how to tell whether the change helped, and the standard answer in information retrieval is NDCG.

NDCG (Normalized Discounted Cumulative Gain) measures whether the right results sit at the top of the list and whether the best results outrank the merely acceptable ones. It introduced position-based discounting and graded relevance as an alternative to binary precision and recall (Järvelin & Kekäläinen, 2002).

Discounted Cumulative Gain at position $p$ is:

$\text{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$

where $rel_i$ is the graded relevance of the result at position $i$. NDCG normalizes DCG by the ideal DCG, the DCG of a perfectly ordered list:

$\text{NDCG}_p = \frac{\text{DCG}_p}{\text{IDCG}_p}$

Two ideas carry the metric. Position discounting, the logarithmic denominator, means a relevant result at position 1 counts for more than the same result at position 10. Relevance amplification, the exponential numerator, means a highly relevant result counts for far more than a marginally relevant one. NDCG@10 captures, in a single number, whether the top ten results are the right ones in the right order.
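The two formulas translate directly into code. The sketch below is a minimal implementation of the exponential-gain form given above; the function names and the optional `judged` argument are illustrative choices, not a reference implementation. When `judged` is omitted, the ideal ordering is built from the ranked list itself, which overstates NDCG if relevant documents were never retrieved. Passing the grades of all judged documents for the query avoids that.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[int], k: int) -> float:
    """DCG@k with gain 2^rel - 1 and discount log2(i + 1), positions from 1."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(
    ranked: Sequence[int], k: int, judged: Optional[Sequence[int]] = None
) -> float:
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    pool = ranked if judged is None else judged
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    # A query with no relevant documents has an ideal DCG of 0; report 0.0.
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0


# Same four graded results, two different orderings (3 = highly relevant).
good = ndcg_at_k([3, 2, 0, 1], k=4)  # best results near the top
bad = ndcg_at_k([1, 0, 2, 3], k=4)   # best results buried at the bottom
print(f"good ordering: {good:.3f}, bad ordering: {bad:.3f}")
```

Both lists contain the same documents, so a set-based metric like precision@4 scores them identically. NDCG separates them because the grade-3 result contributes a gain of 7 at position 1 but only about 3 at position 4.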

NDCG depends on graded relevance judgments: queries paired with labeled results, typically on a scale like 0 (irrelevant), 1 (marginally relevant), 2 (relevant), 3 (highly relevant). The graded scale is what lets the metric credit a system for ranking the best matches first, which a binary relevant-or-not flag cannot. A single NDCG@10 figure also needs an uncertainty estimate before two systems can be compared on it, because the variance across queries routinely swamps the gap between configurations (bootstrap confidence intervals for NDCG cover that step). Those judgments have to be built on a query set that reflects what the system actually serves in production, and that requirement is where offline measurement begins to diverge from what live traffic tells you.
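One common way to attach that uncertainty estimate is a percentile bootstrap over queries. The sketch below assumes per-query NDCG@10 scores have already been computed; `bootstrap_ci` and its parameters are illustrative names, and the sample scores are made up for the example. It resamples queries with replacement, recomputes the mean each time, and reads the interval off the sorted resampled means.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(
    per_query_scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for a percentile bootstrap CI of the mean."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(per_query_scores)
    point = sum(per_query_scores) / n
    # Resample queries with replacement and record the mean of each resample.
    means: List[float] = sorted(
        sum(rng.choices(per_query_scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, means[lo_idx], means[hi_idx]


# Hypothetical per-query NDCG@10 scores for one configuration.
scores = [0.91, 0.42, 0.77, 0.10, 0.88, 0.65, 0.30, 0.95, 0.58, 0.71]
mean_ndcg, lower, upper = bootstrap_ci(scores)
print(f"NDCG@10 = {mean_ndcg:.3f}  (95% CI {lower:.3f} to {upper:.3f})")
```

With only ten queries the interval is wide, which is the point: a gap of a few NDCG points between two configurations means little until the query set is large enough for the intervals to tighten.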

Online vs. Offline Evaluation: Two Different Questions

Offline evaluation runs a fixed test set with human-labeled relevance judgments. It answers one question: does the new system rank known-relevant documents correctly for these specific queries? It is the right tool for controlled comparison between two configurations.

Online evaluation reads production signals (click-through, result abandonment, query reformulation, session success) and answers a different question: how do real users respond to the system on the queries they actually type today?

These two questions have different answers more often than teams expect. The pattern that recurs: a team fine-tunes a new embedding model, offline NDCG@10 improves by several points, the change ships, and two weeks later the production signals are flat. No lift in click-through, no change in session success. The usual cause is that the offline judgment set does not represent the live query distribution. Offline gains do not reliably predict online performance; across dozens of offline metrics evaluated for product ranking, offline improvements did not consistently translate into online gains (Li et al., 2023).

The deeper reason is that an offline test set is a snapshot of a distribution that moves. Query popularity, which documents are relevant, and user intent all shift over time (Kulkarni et al., 2011). A judgment set that represented production six months ago can over-weight head queries and under-weight the long tail today. Online evaluation is the more realistic view of actual user experience, even though offline evaluation gives more interpretable, controlled outcomes (Hofmann, Li & Radlinski, 2016).

The gap between offline and online results is itself diagnostic. Reading the two together localizes the problem:

| | Online improves | Online flat or down |
|---|---|---|
| Offline improves | Genuine quality gain. | The eval set likely over-weights head queries the change does not help; a drop points to overfitting to the eval distribution. |
| Offline flat | Users are responding to something the labels do not capture: freshness, diversity, or result presentation. | No measurable change from either view. |

The practical requirement is to run both: offline evaluation for controlled experiments and model comparison, online evaluation for production validation and distribution-drift detection. Neither is trustworthy alone. Before an offline NDCG gain is treated as real enough to ship, it also has to clear a paired significance test on the per-query differences, not just a higher average.
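One way to run that paired test is a randomization (sign-flip) test on the per-query differences. The sketch below is one reasonable choice among several; a paired t-test or Wilcoxon signed-rank test are common alternatives. The function name, parameters, and sample scores are illustrative assumptions. It asks how often a difference at least as large as the observed one would appear if the labels "baseline" and "candidate" were swapped at random for each query.

```python
import random
from typing import Sequence


def paired_randomization_test(
    baseline: Sequence[float],
    candidate: Sequence[float],
    n_permutations: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the mean per-query difference being zero."""
    if len(baseline) != len(candidate):
        raise ValueError("both systems must be scored on the same queries")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null, each per-query difference is equally likely to be +d or -d.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    # Add-one smoothing keeps the estimate from ever being exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical per-query NDCG@10 for the same 12 queries under two systems.
baseline = [0.60, 0.45, 0.72, 0.30, 0.81, 0.55, 0.64, 0.49, 0.70, 0.38, 0.66, 0.52]
candidate = [0.68, 0.51, 0.75, 0.41, 0.85, 0.60, 0.71, 0.57, 0.74, 0.47, 0.69, 0.61]
p_value = paired_randomization_test(baseline, candidate)
print(f"p = {p_value:.4f}")
```

The test works on differences within each query, so queries that are hard for both systems do not inflate the variance the way they would in an unpaired comparison of two averages.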

Why Both Levers Matter

Reranking and systematic evaluation are the two most accessible high-leverage investments in search quality, and they reinforce each other. A reranker captures most of the available relevance gain for a fraction of the effort of training a custom embedding model. Evaluation, offline and online, is what proves the gain is real and catches the cases where an offline win does not survive contact with production. A reranker without measurement is a change shipped on faith; measurement without a reranker is rigor applied to a system leaving its largest easy gain on the table.

The TensorOpt sample retrieval audit shows what measured search quality looks like end to end on a public benchmark: graded NDCG measurements with confidence intervals and paired significance tests, in the format a procurement or diligence reviewer expects. Download the sample report to see it, or read the full measurement and reranking methodology in Designing Hybrid Search Systems.