Hybrid Search at Catalog Scale: Vector + Keyword for Large E-Commerce Catalogs

Most hybrid search tutorials are written against a toy catalog. They spin up a single vector index, hold it in memory, bolt on a keyword index, fuse the two with reciprocal rank fusion (the fusion step is where the real choice lives, even before scale enters the picture), and the demo looks great. The relevance numbers are good. The latency is a few milliseconds. The apparent conclusion is that hybrid search is a solved problem ready to copy into production.
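For reference, the fusion step those tutorials rely on fits in a few lines. The sketch below is a minimal version of reciprocal rank fusion, assuming each retriever returns an ordered list of SKU ids; the function name and the sample ids are illustrative, and k = 60 is the constant commonly used with this method rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; position 1 is the best rank."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a keyword index and a vector index.
keyword_hits = ["sku_lamp_01", "sku_lamp_07", "sku_desk_03"]
vector_hits = ["sku_lamp_07", "sku_chair_02", "sku_lamp_01"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

An id that appears in both lists outranks one that appears in only one, which is the whole appeal of the method: it needs ranks, not comparable scores.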

Then the same architecture meets a real catalog of several million SKUs, and the decisions that were invisible at small scale start to dominate. The failure does not usually show up as a relevance regression visible in a demo. It shows up as latency degradation under load, which means it shows up in Q4, on the highest-traffic days of the year, when it is most expensive and hardest to fix. An index rebuild scheduled into peak week, a facet that quietly starves the result set, a shard fan-out that grows with every node added: none of these register as a relevance bug, and all of them surface as slow pages when traffic is highest.

Latency at that moment is a revenue variable, not a vanity metric. The often-repeated claim that 100 milliseconds of added latency costs Amazon one percent of sales traces back to a single 2006 conference slide and a blog recollection rather than a published study, and should be set aside. The defensible primary numbers come from controlled experiments at Bing and Google; in the Bing experiment, a two-second slowdown reduced revenue per user by 4.3 percent (Schurman and Brutlag, 2009). The coefficient matters less than the direction: slower search costs money, and the decisions below are the ones that move search latency.

This is the gap between what tutorials surface and what catalog scale demands. Beyond roughly 500,000 SKUs, four decisions (index structure, how the index is partitioned and sharded, how filtering interacts with vector scoring, and how often the catalog is re-embedded) stop being independent knobs and start interacting. Getting the defaults wrong is cheap to do and expensive to discover.

The scale break

The first thing that breaks is the assumption that the index fits comfortably in memory. The default vector index in most engines is HNSW, a navigable graph held in RAM. Its memory footprint scales linearly with the number of vectors and with their dimensionality, and the graph structure itself adds overhead on top of the raw vectors. At typical text-embedding dimensions, a million vectors with a standard graph configuration costs several gigabytes of RAM before the keyword index, the catalog metadata, or any headroom for traffic is accounted for (OpenSearch documentation). For a 768-dimension float32 index of one million vectors with a moderately connected graph, the footprint is on the order of 4.8 GB, and it grows with every SKU added (Zilliz/Milvus reference).
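A rough version of this arithmetic can be sketched with the per-vector rule of thumb from the OpenSearch k-NN documentation, 1.1 × (4 × dimension + 8 × M) bytes, where M is the graph's maximum number of connections per node. The function name and the M = 16 setting are illustrative assumptions. Other engines and graph settings produce different totals, so the output will not match every published estimate.

```python
def hnsw_memory_bytes(num_vectors, dimension, m=16):
    """Estimate HNSW RAM use for float32 vectors (OpenSearch rule of thumb)."""
    # 4 bytes per float32 component, plus roughly 8 bytes per graph link,
    # plus about 10 percent overhead.
    bytes_per_vector = 1.1 * (4 * dimension + 8 * m)
    return num_vectors * bytes_per_vector


# Catalog sizes discussed in this article, at 768 dimensions.
for n in (50_000, 1_000_000, 5_000_000):
    gigabytes = hnsw_memory_bytes(n, 768) / 1e9
    print(f"{n:>9,} vectors -> {gigabytes:.2f} GB")
```

The estimate is linear in the vector count, so a catalog that grows tenfold needs roughly ten times the RAM for the vector index alone.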

At 50,000 SKUs this is a rounding error. At five million it is a provisioning decision with real cost, and a single in-memory node stops being the obvious answer. The research direction that defines the modern large-scale toolkit exists precisely because of this wall. Billion-point nearest-neighbor search became feasible on a single commodity machine only by moving the bulk of the graph to SSD and keeping a compressed representation in RAM, trading a small amount of latency for an order-of-magnitude reduction in memory (Subramanya et al., 2019). A later memory-disk hybrid that keeps cluster centroids in memory and posting lists on disk roughly doubled query throughput at the same recall level on billion-scale datasets (Chen et al., 2021). The same compression logic shows up inside engines as quantization: replacing full-precision vectors with product-quantized codes can shrink an index by more than two orders of magnitude, at a measurable cost to recall (Zilliz/Milvus reference).
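The size of the quantization saving follows from simple arithmetic. The sketch below compares the bytes needed for a raw float32 vector with the bytes needed for its product-quantized code; the function name and the 24-subquantizer configuration are hypothetical, and the calculation ignores the small fixed cost of storing the codebooks.

```python
def pq_compression_ratio(dimension, num_subquantizers, bits_per_code=8):
    """Ratio of raw float32 vector size to product-quantized code size."""
    raw_bytes = dimension * 4  # float32 components
    code_bytes = num_subquantizers * bits_per_code / 8
    return raw_bytes / code_bytes


# A 768-dimension vector stored as 24 one-byte codes (illustrative setting).
ratio = pq_compression_ratio(768, 24)
print(f"compression ratio: {ratio:.0f}x")
```

Fewer subquantizers mean a smaller index and coarser distances, which is where the recall cost comes from.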

None of this is exotic. It is the standard menu at scale, and the menu does not appear in a small-catalog tutorial because the small catalog never needs it. The point is not that one technique is correct. It is that at catalog scale the team is forced to choose, and each choice moves recall, latency, and cost in different directions. A team that copied the in-memory default without knowing it was a default has not made a decision. It has inherited one sized for a different problem.

Partitioning and sharding

Two different decisions get conflated under "scaling out," and keeping them separate matters. The first is how the index partitions vectors internally. The second is how those partitions, or whole copies of the index, are distributed across machines.

Internally, the choice is between a single navigable graph and a clustered structure. A graph index keeps one connected structure and walks it; a clustered index (the IVF family) groups vectors by similarity, holds the centroids in memory, and routes each query only to the few clusters nearest to it. Cluster routing is the mechanism that makes billion-scale search affordable, because any one query touches only a small fraction of the data, and assigning boundary points to more than one cluster compensates for the recall lost at the edges (Chen et al., 2021). This is an index-design choice rather than a data-placement one: it changes how a single index answers a query, independent of how many machines it runs on.
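Cluster routing can be sketched in a few lines of NumPy. This is a toy illustration rather than any engine's implementation: the data is random, the centroids are sampled from the data instead of being trained with k-means, and the names ivf_search and nprobe are assumptions borrowed from common IVF terminology.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 32)).astype(np.float32)

# Assumption: centroids are sampled data points, not trained with k-means.
n_clusters = 20
centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]

# Assign every vector to its nearest centroid and build the inverted lists.
dist_to_centroids = np.linalg.norm(
    vectors[:, None, :] - centroids[None, :, :], axis=2
)
assignments = dist_to_centroids.argmin(axis=1)
inverted_lists = {c: np.flatnonzero(assignments == c) for c in range(n_clusters)}


def ivf_search(query, top_k=10, nprobe=3):
    """Scan only the nprobe clusters whose centroids are nearest the query."""
    centroid_dist = np.linalg.norm(centroids - query, axis=1)
    probed = np.argsort(centroid_dist)[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probed])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    order = np.argsort(dists)[:top_k]
    # Return the ids found and how many vectors were actually scanned.
    return candidates[order], len(candidates)


query = rng.normal(size=32).astype(np.float32)
ids, scanned = ivf_search(query)
print(f"scanned {scanned} of {len(vectors)} vectors")
```

Raising nprobe scans more clusters and recovers more of the true neighbors, at proportionally higher cost; probing every cluster degenerates to a full scan.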

Across machines, the decision is where each SKU's vector physically lives, and here the common instincts diverge. The most intuitive is to shard by category, with furniture on one node and lighting on another, mirroring the merchandising structure. That makes some queries cheap to route, but a query rarely respects the taxonomy. A shopper searching for a reading lamp may be served well by results in lighting, furniture, and home office, so routing to a single category shard quietly drops the candidates that live elsewhere. Recency-based placement, with new and recently updated SKUs in a hot shard, suits high-churn catalogs but concentrates the most volatile and least-evaluated products in one place and can unbalance load when traffic skews toward new arrivals. A third option places vectors by their learned cluster, aligning the physical layout with the internal partitioning above; this helps relevance but decouples the layout from the catalog structure and makes operational reasoning harder.

Underneath every placement choice sits a mechanical cost that scaling out makes worse, not better. Vector queries execute per shard: every shard returns its local candidates and a coordinator merges them. Adding shards increases the fan-out and the merge overhead, so over-sharding degrades vector-search latency directly (Elastic documentation). Distributed nearest-neighbor search remains a thinly documented field, and handling skewed query load is an open problem to which selective replication, rather than naive partitioning, is one answer (Gottesbüren et al., 2025). There is also a graph-fragmentation hazard: carving a navigable graph along arbitrary metadata boundaries can disconnect it and degrade recall (Qdrant documentation). The correct strategy is workload-dependent, and the wrong one is not visible in a relevance evaluation. It is visible in a latency histogram under peak traffic.
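The fan-out cost is easy to see in a toy scatter-gather loop. In the sketch below each shard returns its local top k and a coordinator merges the sorted lists; the one-dimensional "vectors", the round-robin placement, and all function names are illustrative assumptions, not a model of any particular engine.

```python
import heapq
import itertools
import random

rng = random.Random(0)
# Toy catalog: each SKU has a single number standing in for its vector.
catalog = {f"sku_{i}": rng.random() for i in range(1000)}


def make_shards(catalog, n_shards):
    """Round-robin placement of SKUs across n_shards dictionaries."""
    items = list(catalog.items())
    return [dict(items[i::n_shards]) for i in range(n_shards)]


def search_shard(shard, query, k):
    """Local top-k as (distance, sku) pairs, sorted by ascending distance."""
    scored = ((abs(value - query), sku) for sku, value in shard.items())
    return heapq.nsmallest(k, scored)


def scatter_gather(shards, query, k):
    """Query every shard, then merge the sorted local lists at the coordinator."""
    local_results = [search_shard(shard, query, k) for shard in shards]
    top_k = list(itertools.islice(heapq.merge(*local_results), k))
    candidates_moved = sum(len(r) for r in local_results)
    return top_k, candidates_moved


for n_shards in (2, 4, 16):
    _, moved = scatter_gather(make_shards(catalog, n_shards), query=0.5, k=10)
    print(f"{n_shards:>2} shards -> {moved} candidates sent to the coordinator")
```

The final answer is identical at every shard count, but the number of candidates shipped and merged grows with the number of shards, and the query is only as fast as the slowest shard that answers it.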

Faceted-search interaction

E-commerce search is almost never an unconstrained query. It is a query plus a set of facets: brand, price range, color, availability, rating. The moment a vector query has to respect a filter, the interaction between scoring and filtering becomes one of the largest latency and correctness traps at scale, and it is the one that small-catalog tutorials almost never exercise.

There are three ways to combine a filter with vector search, and each fails in a different direction. The fast option, post-filtering, retrieves the nearest neighbors first and then discards the ones that fail the facet. When the filter is selective it can return far fewer than the requested number of results, sometimes almost nothing for a tight constraint (OpenSearch documentation). Pre-filtering inverts the order: it computes the matching set first and searches only within it. The result is correct, but the brute-force cost grows with both corpus size and the fraction of the catalog that passes the filter (Patel et al., 2024). The third approach restricts the graph traversal itself to nodes that satisfy the predicate. This is where the field has moved, though it is not a universal win either.

Figure: Three ways to combine a facet filter with vector search, and how each fails. Post-filtering can return too few results; pre-filtering is correct, but its cost scales with corpus size times match-set size; integrated filtering has no universal win and depends on selectivity.
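The first two failure modes can be reproduced on synthetic data. The sketch below uses exact nearest-neighbor search in NumPy as a stand-in for the vector index, with a random facet that about 2 percent of items pass; the data, the selectivity, and the function names are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
vectors = rng.normal(size=(n, 16)).astype(np.float32)
# Hypothetical facet (say, one brand and in stock) passed by about 2 percent.
passes_facet = rng.random(n) < 0.02
query = rng.normal(size=16).astype(np.float32)


def knn(query, ids, k):
    """Exact k nearest neighbors among the given ids (stand-in for the index)."""
    dists = np.linalg.norm(vectors[ids] - query, axis=1)
    return ids[np.argsort(dists)[:k]]


def post_filter(query, mask, k):
    """Search first, then discard hits that fail the facet."""
    nearest = knn(query, np.arange(n), k)
    return nearest[mask[nearest]]


def pre_filter(query, mask, k):
    """Compute the matching set first, then search only inside it."""
    return knn(query, np.flatnonzero(mask), k)


k = 10
print("post-filter returned:", len(post_filter(query, passes_facet, k)))
print("pre-filter returned: ", len(pre_filter(query, passes_facet, k)))
```

Post-filtering scans nothing extra but comes back short; pre-filtering fills the page but has to score every item that passes the facet, which is the cost that grows with the catalog.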

The numbers make the trade-off concrete. A predicate-aware traversal can roughly double query throughput at the same recall when filters are around 20 percent selective and uncorrelated with the query, yet at 50 percent selectivity a simpler sweeping strategy outperforms it (Weaviate, 2024). The maturity of this problem is visible in product defaults: a predicate-agnostic filter strategy became the default for new collections in a major open-source vector database only recently (Weaviate, 2025). Engines that handle filtering well do so by switching strategy based on filter cardinality rather than committing to one (Qdrant documentation).
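Switching on filter cardinality can be pictured as a small planner. The sketch below is a hypothetical policy, not the logic of any named engine: the strategy labels, the exact_scan_limit of 10,000 matches, and the 0.5 sweep threshold are placeholder assumptions that a team would have to measure against its own index and workload.

```python
def choose_filter_strategy(matching_count, corpus_size, k,
                           exact_scan_limit=10_000, sweep_fraction=0.5):
    """Pick a filtering strategy from the facet's estimated match count.

    All thresholds are illustrative placeholders, not recommended values.
    """
    # Small match sets: scoring every match exactly is cheap and always correct.
    if matching_count <= max(k, exact_scan_limit):
        return "exact_scan_of_matches"
    # Permissive filters: most neighbors pass, so a plain traversal that
    # discards non-matching hits wastes little work.
    if matching_count / corpus_size >= sweep_fraction:
        return "sweep_then_discard"
    # In between: keep the traversal on nodes that satisfy the predicate.
    return "filter_aware_traversal"


# A tight brand-plus-size facet versus a broad availability facet.
print(choose_filter_strategy(300, 5_000_000, k=40))
print(choose_filter_strategy(3_000_000, 5_000_000, k=40))
```

The useful property is not the specific thresholds but that the decision is made per query from an estimate of how many items pass the facet.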

The correctness trap is more dangerous than the latency trap because it is silent. A faceted query that quietly returns three results instead of forty does not throw an error. It looks like a thin catalog. The shopper sees a near-empty results page, assumes the brand does not carry what they want, and leaves. Aggregate relevance metrics computed on unfiltered queries will never surface this, because the failure lives in the interaction between the facet and the index, which the tutorial evaluation does not test.

Embedding refresh cadence

A catalog is not static. Products are added, prices and availability change, seasonal relevance shifts, and periodically the embedding model itself is upgraded. Each of these implies a re-embedding and reindexing decision, and the cadence (daily, weekly, or event-driven) is where the strongest evidence runs thin and the strongest opinions run loud.

What is well established is the cost of getting it wrong by defaulting to periodic full rebuilds. Rebuilding a large index from scratch is days of compute, not hours, at the high end of catalog scale; a billion-point graph build runs on the order of days even on a tuned single-node configuration (Subramanya et al., 2019). A global rebuild also causes large fluctuations in both search latency and accuracy while it runs (Xu et al., 2023). This is the mechanism behind the Q4 failure. A full reindex is not a quiet background task; it competes with live queries for memory, CPU, and disk bandwidth on the same hardware, and the tail of the latency distribution rises while it runs. A reindex that is invisible in the calm of summer becomes a tail-latency spike layered on top of record Q4 traffic, and the result is slow search at the moment when every millisecond is converting or failing to convert. The failure is not in the relevance model; it is in the operational decision about when the index is rebuilt. An incremental in-place update scheme avoids the rebuild entirely, sustaining a 1 percent daily update rate on a billion-scale index using roughly 1 percent of the peak memory and under 10 percent of the cores that a rebuild-based approach required (Xu et al., 2023).

What is not well established by primary evidence is the optimal cadence itself. There is no benchmark that prescribes daily over weekly over event-driven as a function of how fast a given catalog changes. The defensible position is that cadence should be driven by two signals rather than by habit. The first is catalog churn: how much of the catalog actually changes in a given window, which determines how much staleness an interval introduces. The second is drift, the gradual divergence between the distribution the embedding model was tuned on and the catalog and queries it now serves, which is detectable with statistical monitoring rather than guessed at (AWS Prescriptive Guidance; Evidently AI). A nightly full re-embed is a common instinct, but absent churn and drift measurement it is a guess, and at scale an expensive one.
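Both signals can be computed cheaply. The sketch below measures churn as the share of SKUs whose content hash changed, appeared, or disappeared between two snapshots, and approximates drift as the cosine distance between the mean embedding of a reference window and a recent one. The centroid comparison is one crude choice among many statistical tests, and the function names and snapshot format are assumptions, not what the cited guidance prescribes.

```python
import numpy as np


def churn_fraction(previous, current):
    """Share of the previous snapshot that changed, was added, or was removed.

    previous and current map SKU id -> hash of the text that gets embedded.
    """
    changed_or_new = sum(1 for sku, h in current.items() if previous.get(sku) != h)
    removed = sum(1 for sku in previous if sku not in current)
    return (changed_or_new + removed) / max(len(previous), 1)


def centroid_cosine_drift(reference_embeddings, recent_embeddings):
    """Cosine distance between the mean vectors of two embedding samples."""
    ref = np.asarray(reference_embeddings, dtype=np.float64).mean(axis=0)
    rec = np.asarray(recent_embeddings, dtype=np.float64).mean(axis=0)
    cosine = ref @ rec / (np.linalg.norm(ref) * np.linalg.norm(rec))
    return 1.0 - cosine


# Hypothetical snapshots: one SKU edited, one removed, one added.
last_week = {"sku_a": "h1", "sku_b": "h2", "sku_c": "h3", "sku_d": "h4"}
this_week = {"sku_a": "h1", "sku_b": "h9", "sku_c": "h3", "sku_e": "h5"}
print(f"churn: {churn_fraction(last_week, this_week):.2f}")
```

Tracked over time, the two numbers give a basis for the cadence decision: churn says how stale the index becomes between refreshes, and drift says whether the model still fits what it is embedding.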

What independent measurement surfaces at scale

The thread connecting all four problems is that they are invisible to the metric most teams report. A single aggregate NDCG computed over a representative query sample can look healthy while specific, commercially important segments of the catalog are failing, because the aggregate is dominated by the high-traffic, well-populated head of the catalog.

A ranking experiment in product search illustrates how completely a segment-level effect can vanish into an aggregate. New and tail products with sparse engagement data tend to be ranked as irrelevant, which keeps them from accumulating the engagement that would lift them. Correcting that bias produced a 13.53 percent increase in new-product impressions and an 11.14 percent increase in new-product purchases in a large A/B test, while overall purchases moved only 0.08 percent (Han et al., 2022). The intervention there was a ranking change rather than a measurement method, but the lesson for measurement is unmistakable: a change that transformed the experience for the part of the catalog that matters most for growth barely moved the headline number. A team watching only the aggregate would have seen almost nothing. The long-tail logic is the same in web search, where serving the rare-query tail slightly better produces a disproportionate gain in overall satisfaction (Goel et al., 2010).

This is why measurement at scale has to be stratified rather than pooled, and why the principle that "best" is a measurement question rather than a configuration one holds with even more force once the catalog is large. The questions that matter are subset questions. How does relevance look for newly added SKUs versus established ones? How does it differ between shallow, heavily merchandised category nodes and deep, sparse ones? What happens to result quality under common facet combinations rather than on unfiltered queries? Even a carefully built e-commerce relevance benchmark designed with stratified query selection still tends to report a single pooled relevance number (Chen et al., 2022), which means the disaggregated view that surfaces the real failures is usually nobody's default output. Mature search teams sample their evaluation sets with deliberate stratification for exactly this reason (Etsy Engineering).
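The difference between a pooled and a stratified number is easy to demonstrate. The sketch below computes NDCG per query from graded judgments and reports it both pooled and by segment. The segment labels and judgments are invented for illustration, and the ideal ranking is taken from the returned list only, which is a simplification of a full evaluation.

```python
import math
from collections import defaultdict
from statistics import mean


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains, k=10):
    """NDCG@k, using the returned list's own best ordering as the ideal."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


def stratified_ndcg(judged_queries, k=10):
    """judged_queries: iterable of (segment, graded gains in ranked order)."""
    by_segment = defaultdict(list)
    for segment, gains in judged_queries:
        by_segment[segment].append(ndcg(gains, k))
    pooled = mean(score for scores in by_segment.values() for score in scores)
    per_segment = {segment: mean(scores) for segment, scores in by_segment.items()}
    return pooled, per_segment


# Hypothetical sample: nine head queries ranked perfectly, one new-SKU query
# whose only relevant product sits in third position.
sample = [("established", [3, 2, 1])] * 9 + [("new_sku", [0, 0, 3])]
pooled, per_segment = stratified_ndcg(sample)
print(f"pooled NDCG: {pooled:.2f}")
print({segment: round(score, 2) for segment, score in per_segment.items()})
```

In this toy sample the pooled score stays close to perfect while the new-SKU segment scores half of that, which is the pattern a single aggregate hides.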

There is a deeper point about who produces the number. A vendor's relevance report is graded by the team that built the system, on the queries and segments they chose, and it is rarely stratified in the ways that expose the interactions above. The latency that degrades under Q4 load, the recall that collapses under a selective facet, the new-product cohort that never surfaces: these are precisely the things a self-graded aggregate is not built to reveal.

If your catalog has crossed the point where these interactions matter, the question is no longer whether your search works in a demo. It is whether you can show, with stratified evidence, where it works and where it quietly does not. That is what an independent measurement produces and a vendor dashboard does not.

The sample diagnostic report walks through exactly this kind of stratified relevance measurement on a public e-commerce dataset: graded judgments, NDCG with confidence intervals, and the subset breakdowns that aggregate scores hide. It uses the same methodology as a real engagement, on public data instead of your catalog.

Download the sample report →