Hybrid Search in E-Commerce: When Keyword Still Wins

Hybrid search is the right default for most e-commerce catalogs. Combining lexical retrieval (BM25 over an inverted index) with dense vector retrieval covers more query types than either method alone, which is why the pattern is now standard at large retailers. Walmart's production system, for example, deliberately blends a traditional inverted index with embedding-based neural retrieval to handle tail queries that keyword search alone misses (Magnani et al., 2022).

But "the right default" is not the same as "the right answer for every query." A substantial share of e-commerce searches are precise, lexical, and unambiguous: a model number, a part number, a SKU string, an exact brand-plus-spec phrase like "iPhone 15 Pro Max 256GB." Sending those queries through a vector path does not help. It costs latency and, worse, it costs accuracy. The failure is easy to miss because it hides inside aggregate relevance scores. A pipeline can post a strong overall NDCG and still be quietly wrong on the exact queries where shoppers have the highest purchase intent.

Not all e-commerce queries are the same query

E-commerce search traffic is not uniform. It is heavily skewed, and the skew has a predictable shape. Queries fall into rough classes: head queries (broad brand or category terms, "running shoes," "nike"), torso queries (descriptive multi-term phrases, "waterproof trail running shoes"), and tail queries (long, specific, often natural-language requests).

[Figure: a long-tail curve of e-commerce query volume, high on the left and falling to the right, with a separate phrasing axis running from terse and exact (left) to long and descriptive (right). Caption: Query volume follows a long tail, but the split that matters for retrieval is how the query is phrased. Keyword retrieval wins on terse, exact queries such as brand terms, model numbers, and SKUs (left); vector retrieval wins on long, descriptive, natural-language queries involving paraphrase, synonyms, and intent (right).]

The distribution is steep. In one large e-commerce query-classification dataset, head queries alone accounted for 69% of query volume, while head and torso queries together drew 81% of all clicks (eCom SIGIR, 2024). The exact percentages are dataset-specific and proprietary, but the long-tail shape is consistent across the field, and tail queries are the documented weak spot for retrievers trained on behavioral signals, precisely because there is little click data behind them.

Each retrieval method has a different comfort zone. Dense vector retrieval is strongest on the tail: paraphrases, descriptive language, and queries where the shopper's words do not match the catalog's words. Lexical retrieval is strongest on the opposite end: exact matches, rare terms, entity names, and alphanumeric tokens like serial numbers. This is not a controversial claim. Traditional lexical models are good at exact matching, especially for rare terms such as product serial numbers, while struggling with synonyms and morphological variation (Nigam et al., 2019). The complement holds for dense models, which generalize well to semantic variation but degrade on precise identifiers.

The robustness of lexical retrieval on out-of-domain and exact-match queries is one of the most replicated findings in modern information retrieval. The BEIR benchmark was built specifically to test how retrieval models generalize across domains they were not trained on (Thakur et al., 2021). Across 18 datasets and 10 systems, BM25 remained the third-best approach overall, beaten only by far more expensive neural re-ranking, and dense retrievers underperformed it substantially on several datasets. Dense embeddings have improved since 2021, so the precise numerical gap has narrowed, but the qualitative finding has held: without in-domain fine-tuning, dense retrievers tend to lose to keyword search on the queries that depend on exact tokens.

The structural failure mode: vectors on precise queries

The interesting case is not "vectors are worse on average." They are not. It is that vectors fail in a specific, structural way on the queries where being wrong is most expensive.

Consider an exact specification query. A dense embedding model converts "iPhone 15 Pro Max 256GB" into a single vector and looks for products near it in embedding space. The problem is that the 256GB model and the 128GB model sit extremely close together in that space, because their descriptions are identical in almost every token. A vector system asked for the 256GB model can return the 128GB one (Qdrant, 2025). To a relevance metric averaged over thousands of queries, that looks like a near miss. To the shopper who wanted 256GB, it is the wrong product.

The mechanism behind this is tokenization. Embedding models built on subword tokenizers shatter alphanumeric strings into fragments. A model-number query like "zx750" is split into the fragments z, ##x, ##75, ##0, and those fragments produce high similarity scores against thousands of unrelated items. In production, a query with around 300 genuine lexical matches can surface roughly 7,000 spurious neighbors (Vinted Engineering, 2025). The model number, the single most discriminating signal in the query, becomes noise.
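
To make the fragmentation concrete, here is a minimal sketch of greedy longest-match subword splitting in the WordPiece style. It is a toy, not any production tokenizer: the function name and the tiny vocabulary are hypothetical and chosen only so that "zx750" breaks apart the way the example above describes. Real vocabularies hold tens of thousands of entries, and the exact split depends on the model.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, WordPiece style."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for illustration only).
TOY_VOCAB = {"z", "##x", "##75", "##0", "##7", "##5", "running", "shoes"}

print(wordpiece_tokenize("zx750", TOY_VOCAB))    # ['z', '##x', '##75', '##0']
print(wordpiece_tokenize("running", TOY_VOCAB))  # ['running']
```

The ordinary word survives as one token, while the model number becomes four generic fragments, each of which also appears inside many unrelated product strings.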

This is not a quirk of one bad model. On entity-centric queries, dense retrievers generalize poorly to rare entities and remain sensitive to whether the question pattern was seen during training (Sciavolino et al., 2021). The same brittleness shows up on deceptively simple queries: encoders frequently fail at fine-grained matching regardless of training data or model size, missing specific entities and attributes that a literal match would catch immediately (Li et al., 2025). In product search, "fine-grained entities and attributes" is just another name for capacities, dimensions, model years, and part numbers. The things shoppers filter on most precisely are the things vector search handles worst.

Routing, not just blending

The common response to this is to run hybrid search with a fixed blend: retrieve from both the lexical and vector indexes, then fuse the two result lists with a single static formula such as reciprocal rank fusion. This is better than either method alone, but it leaves performance on the table, because the right balance between lexical and vector signal is not constant. It depends on the query.
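
For reference, a static blend of this kind can be sketched in a few lines. The function below is a plain reciprocal rank fusion over ranked lists of document IDs, using the commonly used constant k = 60. The document IDs and the two input rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for an exact query that asks for the 256GB model.
lexical = ["sku-256", "sku-128", "sku-case"]  # exact match ranked first
dense = ["sku-128", "sku-512", "sku-256"]     # near-duplicate ranked first

fused = reciprocal_rank_fusion([lexical, dense])
print(fused)  # ['sku-128', 'sku-256', 'sku-512', 'sku-case']
```

In this toy case the exact lexical hit drops to second place after fusion, because the formula treats both lists as equally trustworthy and has no way to know the query was an exact one.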

There remain queries for which sparse retrievers simply outperform dense ones, even when dense wins on average (Arabzadeh et al., 2021). A fixed blend dilutes the lexical signal on exactly the precise queries where it should dominate. The better design classifies the query first, by intent and specificity, and then routes it to the appropriate path: a lexical-heavy path for exact and head queries, a dense or balanced path for descriptive tail queries.
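
A rough sketch of what classify-then-route can look like is below. Everything in it is an assumption made for illustration: the regular expression, the two path names, the token-count threshold, and the weights are hypothetical, and a production router would more plausibly use a trained classifier than hand-written rules.

```python
import re
from dataclasses import dataclass

# Hypothetical heuristic: a token that mixes letters and digits
# (zx750, 256gb, 8oz) is treated as an identifier or hard spec.
IDENTIFIER = re.compile(r"^(?=.*[a-z])(?=.*\d)[a-z0-9-]+$")


@dataclass
class Route:
    path: str              # illustrative path name
    lexical_weight: float  # illustrative share of the score given to lexical


def route_query(query):
    tokens = query.lower().split()
    if any(IDENTIFIER.match(t) for t in tokens):
        return Route("lexical_heavy", 0.9)  # exact / spec query
    if len(tokens) <= 2:
        return Route("lexical_heavy", 0.7)  # short head query
    return Route("balanced", 0.4)           # descriptive query


for q in ["zx750", "iPhone 15 Pro Max 256GB", "nike",
          "waterproof trail running shoes"]:
    print(q, "->", route_query(q).path)
```

Rules this simple will misroute short but ambiguous queries, which is one reason the classification step usually ends up as a learned model rather than a handful of patterns.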

This is not theoretical. Instacart's production search adapts retrieval to the query, routing a specific query like "pesto pasta sauce 8oz" toward keyword retrieval while sending an ambiguous query like "healthy foods" toward semantic search. Doing so produced a 1.7% improvement in mean converting position (how high in the results purchased items rank) alongside a 1.5% reduction in latency (Gudla et al., 2024). The latency win is not incidental. Skipping the vector path on queries that do not need it removes work, which is the opposite of the usual assumption that more sophistication means more compute.

There is also a mathematical reason to prefer tuned fusion over a fixed formula. Convex-combination fusion, which weights the lexical and dense scores with a tunable parameter, is sample-efficient and consistently outperforms reciprocal rank fusion both in-domain and out-of-domain, because that single weight can be fit to a target domain from a small set of examples (Bruch et al., 2023). The fusion step itself, and why no default weight is correct in advance, is the focus of the separate article on how hybrid search works. The implication for a catalog owner is direct: the fusion weight that is correct for a catalog's descriptive queries is almost certainly wrong for its model-number queries, and a single global setting cannot satisfy both.
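
A minimal sketch of convex-combination fusion shows why one global weight cannot serve both query types. Two choices here are assumptions rather than anything prescribed by the cited work: scores are min-max normalized per query, and alpha is the weight on the lexical side. The scores themselves are invented.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} map to [0, 1] within one query."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def convex_fusion(lexical_scores, dense_scores, alpha):
    """Rank docs by alpha * lexical + (1 - alpha) * dense (normalized scores)."""
    lex = min_max_normalize(lexical_scores)
    den = min_max_normalize(dense_scores)
    docs = set(lex) | set(den)
    fused = {d: alpha * lex.get(d, 0.0) + (1 - alpha) * den.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical scores for a query that asks for the 256GB model.
bm25 = {"sku-256": 12.0, "sku-128": 7.0, "sku-case": 3.0}
cosine = {"sku-128": 0.91, "sku-256": 0.90, "sku-512": 0.80}

print(convex_fusion(bm25, cosine, alpha=0.8)[0])  # sku-256
print(convex_fusion(bm25, cosine, alpha=0.1)[0])  # sku-128
```

With a lexical-heavy weight the exact product ranks first; with a dense-heavy weight tuned for descriptive queries, the near-duplicate overtakes it.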

You cannot fix what your metrics are averaging away

The reason this problem persists in production systems is measurement. An aggregate relevance score computed across all queries will not reveal a head-query or exact-match failure, because the queries that fail are diluted by the larger volume of queries that pass. The failure is real and it is concentrated, but the average smooths it flat.

The fix is to stop trusting the average. Stratify the evaluation by query class, then measure each class separately. The standard graded-relevance metric for this is NDCG, which rewards placing the most relevant results highest and discounts relevance found lower in the list (Järvelin and Kekäläinen, 2002). Measuring NDCG@10 per query class, rather than once across all traffic, surfaces the structural gaps that a single number conceals. This is exactly how serious production teams report results: Walmart's semantic retrieval work measured NDCG@10 and ran human evaluation specifically on a sample of tail search traffic, rather than on traffic as a whole (Magnani et al., 2022).
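
Stratified evaluation is simple to sketch. The code below uses the common log2-discount form of NDCG and averages it per query class. The record layout (class label, graded gains of the returned results in rank order, graded gains of all judged items for the query) and the class names are assumptions for illustration, as are the judgments.

```python
import math
from collections import defaultdict


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the top k graded results."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, judged_gains, k=10):
    """DCG of the returned ranking divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def per_class_ndcg(records, k=10):
    """records: iterable of (query_class, ranked_gains, judged_gains)."""
    by_class = defaultdict(list)
    for query_class, ranked, judged in records:
        by_class[query_class].append(ndcg_at_k(ranked, judged, k))
    return {c: sum(v) / len(v) for c, v in by_class.items()}


# Hypothetical graded judgments (0 = irrelevant, 3 = exact product).
records = [
    ("exact", [0, 3, 0], [3]),              # right product shown second
    ("descriptive", [3, 2, 1], [3, 2, 1]),  # ideal ordering
    ("descriptive", [2, 3, 0], [3, 2]),     # top two swapped
]
result = per_class_ndcg(records)
overall = sum(ndcg_at_k(r, j) for _, r, j in records) / len(records)
print(result, overall)
```

On this toy data the exact class scores roughly 0.63 while the overall mean sits near 0.85, the same gap between class and aggregate that the paragraph above describes.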

A per-class number on its own is still not enough, because a difference between two methods on a single query class might be noise. Bootstrap resampling produces confidence intervals around each metric and is the established way to quantify that uncertainty in retrieval evaluation, resting on fewer assumptions than traditional significance tests (Sakai, 2006). The discipline is to report, for each query class, both the NDCG@10 and the confidence interval around it, so that a claimed improvement on exact-match queries is backed by evidence rather than by a single point estimate.
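
A percentile bootstrap over per-query scores is enough to illustrate the idea. This is a sketch under stated assumptions: it resamples queries with replacement within one class, uses 2,000 resamples and a fixed seed for repeatability, and the NDCG@10 values are invented. Comparing two systems would normally mean resampling the per-query differences instead.

```python
import random


def bootstrap_ci(per_query_scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-query scores."""
    rng = random.Random(seed)
    n = len(per_query_scores)
    means = []
    for _ in range(n_resamples):
        # Draw n queries with replacement and record the resampled mean.
        sample = [per_query_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


# Hypothetical per-query NDCG@10 values for one query class.
exact_scores = [0.2, 0.9, 0.4, 1.0, 0.0, 0.63, 0.5, 0.8, 0.3, 1.0, 0.1, 0.7]
mean = sum(exact_scores) / len(exact_scores)
low, high = bootstrap_ci(exact_scores)
print(f"NDCG@10 = {mean:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

With only a dozen queries the interval is wide, which is the point: a per-class improvement smaller than that interval is not yet evidence.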

Hybrid search alone does not close the gap

The deeper issue is that the precise-query problem is not only an algorithm problem. It is a language problem. Shoppers and catalogs do not use the same words.

This has been measurable for over a decade. A 2014 study of 50 top-grossing US e-commerce sites found that 70% of on-site search engines could not return relevant results when shoppers used product-type synonyms, effectively forcing users to search with the site's exact jargon (Baymard Institute, 2014). The natural assumption is that a decade of vector search and hybrid retrieval has solved this. It has not. The most recent benchmark of 170-plus sites finds that 56% still deliver mediocre or worse search experiences, with 20% failing to adequately support product-type queries and 54% mishandling abbreviations and symbols (Baymard Institute, 2026).

Hybrid search, by itself, does not fix this. Adding a vector index does not teach the system that "blow dryer" and "hair dryer" are the same product, and it does not teach the system that "256GB" is a hard constraint rather than a soft preference. Those are query-understanding problems: synonym handling, attribute extraction, intent and specificity classification. The vector index is a tool, not a substitute for understanding what the shopper actually asked for.
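
As one small illustration of attribute extraction, the sketch below pulls a storage capacity out of the query and applies it as a filter on retrieved candidates rather than leaving it to similarity scoring. The regular expression, the "capacity" field name, and the candidate records are hypothetical. A real catalog would need many attribute types and cleaner normalization.

```python
import re

# Hypothetical pattern: a number followed by GB or TB, e.g. "256GB" or "1 tb".
CAPACITY = re.compile(r"\b(\d+)\s?(gb|tb)\b", re.IGNORECASE)


def extract_capacity(query):
    """Return a normalized capacity such as '256GB', or None if absent."""
    match = CAPACITY.search(query)
    if match is None:
        return None
    return f"{int(match.group(1))}{match.group(2).upper()}"


def apply_hard_constraints(query, candidates):
    """Keep only candidates whose capacity equals the one the shopper typed."""
    capacity = extract_capacity(query)
    if capacity is None:
        return list(candidates)  # nothing to enforce
    return [c for c in candidates if c.get("capacity") == capacity]


# Hypothetical retrieved candidates, in retrieval order.
candidates = [
    {"id": "sku-128", "capacity": "128GB"},
    {"id": "sku-256", "capacity": "256GB"},
]
kept = apply_hard_constraints("iPhone 15 Pro Max 256GB", candidates)
print([c["id"] for c in kept])  # ['sku-256']
```

The filter runs regardless of which index produced the candidates, so the near-duplicate 128GB item is removed even if the vector path ranked it first.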

The payoff for getting this right is concrete. When a shopper types an exact model name and gets a no-results page, they tend to conclude the site does not carry the product and leave, even when it is in stock and reachable by browsing (Baymard Institute, 2026). That empty page is one of four structurally distinct zero-result causes, each with a different fix. Exact and product-type queries are also among the cheapest failures to fix, because they reward literal matching and clean attribute data rather than model retraining. The retailers winning on search are not the ones who simply turned on embeddings. They are the ones who classify the query, route it, and then measure each class to confirm it worked.

If your search vendor reports a strong relevance score but your conversion data on precise, high-intent queries tells a different story, the gap is almost certainly hiding in the aggregate. A diagnostic that stratifies your top queries by class and measures each one independently is the only way to see it.

Download the sample diagnostic report. It works through a public product-search benchmark using exactly this method: graded judgments, per-class NDCG@10, and bootstrap confidence intervals. You will see precisely where an aggregate score would have buried the failure.

(Want to see the same analysis run against your own catalog? The e-commerce search diagnostic covers that.)