E-commerce Search: search-relevance, search-measurement, ndcg, bm25, ecommerce-search

Your E-commerce Search Is Fast. That Says Nothing About Whether It Works.

The default search stack optimizes the metric teams can see (speed) and leaves the one that drives sales (relevance) unmeasured. Fast results that miss are still misses, and the failure stays invisible until someone measures relevance on the brand's own queries.

May 27, 2026 · 7 min read

Most e-commerce search systems are in excellent operational health. Queries return in well under a second. The cluster scales horizontally and survives Black Friday. The error rate is a rounding error, and the uptime dashboard is a wall of green. By every metric the engineering team watches, search is working.

None of those metrics measures whether the shopper found the product. Speed, uptime, throughput, and error rate describe the plumbing. Whether the right items came back in the right order is a separate property, called relevance, and on most teams it is not on any dashboard at all. A search that returns the wrong products in 84 milliseconds passes every check the team has instrumented and fails the only one that determines a sale.

That gap, between a system that is fast and a system that is right, is where mid-market search quietly leaks revenue. It persists because the stack and the monitoring around it were built to optimize the metric that is easy to see.

The default stack optimizes the metric you can see

Pull apart almost any production e-commerce search and the scoring function underneath it is BM25, a term-frequency ranking function first presented at TREC-3 in 1994 (Apache Lucene). It is now the default relevance algorithm in Elasticsearch, as it is in Solr and OpenSearch, which are built on the same Lucene core (Elastic). It is a good algorithm: fast, interpretable, and able to handle a large share of queries well enough that nobody questions it. The phrase "well enough" is doing the load-bearing work in that sentence.

BM25 became the default for reasons that have nothing to do with how well it serves any particular catalog. It is computationally cheap, it needs no training data, and it runs at scale on commodity infrastructure. Those are operational virtues, and they are precisely the virtues the surrounding monitoring was built to track. Latency, query throughput, and index health are instrumented out of the box by every search platform. Relevance is not, because relevance cannot be read off the engine. The engine has no ground truth: it can report that it ranked documents by its own formula, but it cannot tell you whether those were the documents the shopper wanted. Answering that requires a human judgment, query by query, about which results should have come back, and that judgment lives outside the system entirely.
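To make that concrete, here is a minimal sketch of BM25 scoring in its textbook form, assuming whitespace tokenization and the commonly used defaults k1 = 1.2 and b = 0.75. The function name and the three-product catalog are invented for illustration, and production engines differ in analysis and implementation details. What the sketch shows is that every input is a term statistic from the index; nothing in the formula knows what the shopper wanted.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each doc against the query with textbook BM25 (illustrative only)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long docs need more matches to score as high.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical catalog titles, made up for this example.
catalog = ["ionic hair dryer 1800w", "blow dryer brush", "hair brush detangling"]
scores = bm25_scores("blow dryer", catalog)
for title, score in zip(catalog, scores):
    print(f"{score:.3f}  {title}")
```

The scores come back instantly and in a defensible order, but whether the product ranked first is the one the shopper meant is a question the function has no way to answer.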

So the typical team measures what the platform hands it for free and never builds the thing that would reveal the actual problem. Industry surveys of e-commerce teams keep finding the same pattern: only a small minority put dedicated resources into optimizing site search, and fewer still feed what the search logs reveal back into the rest of the business (Algolia). For everyone else, search gets little or no dedicated attention, which leaves the operational metrics, the plumbing, as the only thing anyone is watching.

[Figure: a search monitoring dashboard with two panels. The top panel, labeled "what the team watches," lists four operational metrics in green, each with a checkmark: latency p99 at 84 milliseconds, uptime at 99.98 percent, throughput at 12,400 queries per second, and error rate at 0.01 percent. The bottom panel, labeled "what decides whether shoppers find products," lists a single metric, search relevance measured as NDCG at 10, greyed out and marked "not measured."]

The metrics a search platform reports by default are not the metric that determines whether a shopper finds the product. Relevance is the one that decides the sale, and it is the one most teams never instrument.

Fast and wrong is still wrong

The reason this is worth measuring is that search traffic is the traffic that converts. In an analysis of 609 million searches at more than 100 retailers, shoppers who used site search made up about a quarter of visitors but drove close to half of revenue, and converted at more than twice the rate of non-searchers (Constructor, 2025). When the search box returns the wrong thing quickly, the loss does not arrive as an outage. It arrives as a high-intent visitor who leaves, and the operational dashboard registers nothing, because operationally everything worked. There is no error to log: the shopper rarely complains, the request returned a valid 200, and the result count was healthy. A query that surfaces four thousand near-misses in 84 milliseconds looks identical, from the plumbing's vantage point, to a query that surfaces the right product first.

The strongest evidence that this is a measurement problem rather than an algorithm problem is how little has changed as the algorithms improved. A 2014 benchmark of the top 50 US e-commerce sites found that 70% could not return useful results when shoppers used product-type synonyms, forcing them to guess the site's own vocabulary (Baymard Institute, 2014). More than a decade later, after vector search and hybrid retrieval became standard, 56% of more than 170 benchmarked sites still deliver a mediocre or worse search experience (Baymard Institute, 2026). Better algorithms were available the entire time. The failures persisted because the teams running them had no relevance number to tell them anything was wrong.

The failure has more than one address

Once a brand does measure relevance, the next surprise is that the gap rarely sits where instinct says it does. The reflex is to blame the engine and shop for a new vendor, but the cause is usually somewhere cheaper and closer. It can be a catalog that stores "hair dryer" while shoppers type "blow dryer," or products whose distinguishing attributes were never indexed as searchable text; catalog data quality, not the algorithm, is the most common culprit. It can be the structural composition of the zero-result log, which decomposes into four distinct causes: a single 10% rate can be mostly typos at one store and mostly missing inventory at another, demanding completely different fixes. It can be the fusion weighting, or the absence of query routing, where exact model-number searches get diluted by a vector path that was never the right tool for them.
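As one illustration of the last of those causes, query routing can start as a very small rule. The sketch below is a crude heuristic, not a recommended production design: it assumes a model number looks like a single token of at least four characters that mixes letters and digits, and the names `MODEL_NUMBER` and `route_query` and the path labels "lexical" and "hybrid" are all hypothetical.

```python
import re

# Assumption: a model number is one token, four or more characters long,
# containing at least one letter and one digit (hyphens, underscores and
# slashes allowed). Real catalogs will need a rule fitted to their own SKUs.
MODEL_NUMBER = re.compile(r"^(?=.*\d)(?=.*[A-Za-z])[A-Za-z0-9][A-Za-z0-9\-_/]{3,}$")

def route_query(query: str) -> str:
    """Return which retrieval path a query should take (labels are hypothetical)."""
    tokens = query.strip().split()
    if any(MODEL_NUMBER.match(token) for token in tokens):
        # Exact identifiers go to keyword matching only, so the vector
        # path cannot dilute them with merely similar products.
        return "lexical"
    return "hybrid"

for q in ["WH-1000XM5", "sony WH-1000XM5 case", "blow dryer", "4k tv"]:
    print(q, "->", route_query(q))
```

A rule this blunt will misroute some queries (a wattage like "1800w" also matches), which is the argument for measuring relevance before and after turning it on rather than trusting the rule by inspection.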

Each of those is a distinct diagnostic question, with a distinct owner and a distinct timeline, and each deserves its own analysis. The point here is the one upstream of all of them: none of these causes is visible, and none can be assigned to the right layer, until relevance is measured at all. A team that has not measured it is not choosing between these explanations. It is guessing, and the most expensive guess (replace the vendor) is usually the wrong one.

The first fix is a number, not a model

This reframes the first move. It is not buying a more sophisticated engine, and it is not fine-tuning a neural model on the brand's own data, even though both get proposed early and both are expensive. The first move is to produce a relevance number that does not yet exist.

That number has a specific shape. It starts with a graded judgment set built on the brand's own top queries, where each result is rated against what a shopper on that query actually wanted, because product relevance is graded rather than binary: an exact match, a reasonable substitute, and an irrelevant result are three different outcomes, not two (Järvelin and Kekäläinen, 2002). From that set comes a metric such as NDCG@10, reported with a confidence interval rather than as a single figure, and measured independently of the vendor's own scoring, since a vendor's relevance report is graded against the vendor's own notion of relevance and tends to agree with itself. The output is a number that finally says how well search works on the queries that matter, and that attributes the gap to a layer: catalog, ranking, or query understanding.
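The shape of that number can be sketched in a few lines. The example below assumes a three-level grade scale (2 for an exact match, 1 for a reasonable substitute, 0 for irrelevant), a linear-gain form of DCG with a log2 rank discount, unjudged products counted as grade 0, and a percentile bootstrap over queries for the interval. The queries, product ids, grades, and function names are invented for illustration; a real judgment set would cover far more queries than three.

```python
import math
import random

def dcg(grades, k=10):
    # Discounted cumulative gain: each grade is divided by log2(rank + 1),
    # with ranks starting at 1 (so enumerate's 0-based index needs + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_k(ranking, judged, k=10):
    # ranking: product ids in the order search returned them.
    # judged: product id -> grade. Assumption: unjudged products count as 0.
    actual = dcg([judged.get(pid, 0) for pid in ranking], k)
    ideal = dcg(sorted(judged.values(), reverse=True), k)
    return actual / ideal if ideal > 0 else 0.0

def bootstrap_ci(per_query, n_resamples=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample queries with replacement, recompute the mean.
    rng = random.Random(seed)
    n = len(per_query)
    means = sorted(sum(rng.choices(per_query, k=n)) / n for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Hypothetical judgments and search results, made up for this example.
judgments = {
    "blow dryer": {"p1": 2, "p2": 1, "p3": 0},
    "wh-1000xm5": {"p7": 2, "p8": 0},
    "standing desk": {"p4": 2, "p5": 1},
}
rankings = {
    "blow dryer": ["p3", "p2", "p1"],
    "wh-1000xm5": ["p7", "p8"],
    "standing desk": ["p9", "p5", "p4"],
}

per_query = [ndcg_at_k(rankings[q], judgments[q]) for q in judgments]
mean_ndcg = sum(per_query) / len(per_query)
low, high = bootstrap_ci(per_query)
print(f"NDCG@10 = {mean_ndcg:.3f} (95% CI {low:.3f} to {high:.3f})")
```

With only a handful of queries the interval is very wide, which is the reason to report it: a single NDCG figure from a small judgment set looks more certain than it is.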

A domain-adapted embedding model is one possible remedy at the end of that process, for the specific case where the measurement shows the ranking model itself is the bottleneck. It is rarely the first thing the measurement points to, and for a mid-market brand without a dedicated ML team it is rarely the cheapest. Skipping the measurement and going straight to the model is how a team spends a quarter fine-tuning a system whose real problem was a missing synonym table.

Measure before you migrate

A search system that is fast, stable, and unmeasured is not a system known to be working. It is a system whose single most important property has never been checked. The cost of that blind spot does not appear on any operational dashboard, which is precisely why it survives: the metrics that are green were never the ones that mattered.

The TensorOpt e-commerce search diagnostic produces the number the operational dashboards leave out: graded judgments on a brand's own queries, NDCG@10 with confidence intervals, and a breakdown that names the layer to fix first. To see exactly what that looks like before any engagement, the sample report runs the full method on Wayfair's public WANDS product-search dataset.

Download the sample report to see relevance measured the way the operational dashboards never will, with the layer-by-layer breakdown that says where to fix search before spending on a migration.

Laszlo Csontos

Author of Designing Hybrid Search Systems (Leanpub, 2026). Practitioner background in production hybrid search, embeddings, cross-encoder reranking, and retrieval evaluation.
