How Hybrid Search Works (and Why the Architecture Is the Easy Part)
Running keyword and vector search together and fusing the results is now the default recommendation, and it is the right one. The architecture is simple to copy. What it leaves undefined is how to combine the two signals, and there is no setting that is correct in advance.
Ask how to build a search system today and the same recommendation comes back nearly every time: run keyword search and vector search side by side, then fuse the two ranked lists into one. The recommendation is correct. Lexical retrieval and dense retrieval fail in different places, so running both covers more of what people actually type than either does alone. The architecture is also cheap to assemble. Most search engines and vector databases ship the components, and wiring them together takes closer to a weekend than a quarter.
That ease is exactly what makes hybrid search misleading. Copying the architecture is not the same as getting a search system that works, because the architecture leaves one decision undefined, and that decision is the part that determines whether the results are any good.
Two retrievers, two failure modes
The case for running both methods starts with how each one fails.
Keyword retrieval, typically BM25 over an inverted index, is literal. It matches the tokens in the query against the tokens in the documents, which makes it precise on exact strings, rare terms, entity names, and alphanumeric identifiers, and blind to the fact that "couch" and "sofa" name the same thing. It is also durable in a way that is easy to underestimate. BM25 remains a strong baseline that many more complex dense models fail to beat once they leave the data they were trained on (Thakur et al., 2021). Out-of-domain is the normal condition for a production system, because the training distribution of an off-the-shelf embedding model rarely matches one operator's real query stream.
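The literal behavior is easy to see in a toy scorer. What follows is a minimal sketch of BM25, assuming whitespace tokenization, the commonly used defaults k1 = 1.2 and b = 0.75, and a non-negative IDF variant. The function name bm25_scores and the three sample documents are illustrative and do not come from any particular engine.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in a non-empty list against the query.

    Assumptions: lowercase whitespace tokenization, no stemming,
    and the IDF form log(1 + (N - df + 0.5) / (df + 0.5)).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no literal match, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical three-document catalog.
docs = [
    "grey three seat sofa",
    "oak coffee table",
    "replacement part ab-1234 for sofa",
]
couch_scores = bm25_scores("couch", docs)
part_scores = bm25_scores("AB-1234", docs)
```

In this sketch the query "couch" scores zero against every document, including the one describing a sofa, while "AB-1234" scores only the document that contains that exact token.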
Dense retrieval solves the opposite problem. It maps the query and the documents into a shared vector space and retrieves by proximity, which lets it connect "comfortable option for standing all day" to a product whose description never uses those words. That same proximity is its weakness. On entity-centric queries, dense retrievers tend to surface passages about similar-sounding but entirely different entities, mistaking related for relevant (Sciavolino et al., 2021). A query for one specific model, clause, or part can return a near neighbor that is close in the embedding space and wrong for the user.
The two failure modes are complements, which is the whole argument for hybrid search. The point is not that vectors replace keyword matching. Even strong dense retrievers benefit from being interpolated with BM25 to hold up across diverse tasks (Zhuang et al., 2021). Whether the system is grounding a RAG answer, powering an internal document search, or ranking a catalog, the literal signal and the semantic signal each cover the other's blind spot.
The architecture is the easy part
The mechanics of combining them are simple. One query fans out to both retrievers in parallel, each returns a ranked list, and a fusion step merges the two lists into the single ordering the user sees.
One query, two retrievers, one fused list. The components are standard. The fusion step is where the only real choice lives.
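The fan-out itself can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production client: the two retriever functions are stubs that return fixed lists, the names hybrid_search and interleave are hypothetical, and the fusion step is a deliberately naive placeholder passed in as a parameter.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest

def hybrid_search(query, keyword_search, vector_search, fuse):
    """Send one query to both retrievers in parallel, then fuse the lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        keyword_future = pool.submit(keyword_search, query)
        vector_future = pool.submit(vector_search, query)
        return fuse(keyword_future.result(), vector_future.result())

# Stub retrievers (assumptions): real ones would call a search engine
# and a vector index and return ranked document ids.
def toy_keyword_search(query):
    return ["sku-123", "sku-456"]

def toy_vector_search(query):
    return ["sku-456", "sku-789"]

def interleave(first, second):
    """Placeholder fusion: alternate between lists, skipping duplicates."""
    seen, fused = set(), []
    for pair in zip_longest(first, second):
        for doc_id in pair:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                fused.append(doc_id)
    return fused

results = hybrid_search("standing desk", toy_keyword_search,
                        toy_vector_search, interleave)
```

Keeping the fusion function as an argument makes the point of this section concrete: everything around it is fixed plumbing, and the one interchangeable part is the fusion step.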
The fusion step has a default that nearly everyone reaches for first: reciprocal rank fusion. It scores each document by summing the reciprocal of its rank in each list, offset by a smoothing constant, so documents ranked highly by both retrievers rise to the top. Because it operates on ranks rather than raw scores, it sidesteps the awkward problem of normalizing scores that two different systems produce on two different scales. It outperforms older rank-combination methods across diverse retrieval tasks and needs almost no tuning to deploy (Cormack et al., 2009). This is the reason hybrid search feels solved. A method exists that combines two lists with one formula and a default constant, and it works well enough to ship.
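For concreteness, reciprocal rank fusion can be sketched as follows. The sketch assumes 1-based ranks, the conventional constant k = 60, and an alphabetical tie-break on document id. The tie-break and the function name are choices made for this illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document ids.

    Each document scores the sum over lists of 1 / (k + rank),
    with rank starting at 1. Documents absent from a list simply
    receive no contribution from it.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

keyword_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

Here "b" finishes ahead of "a" because it sits at ranks 2 and 1 across the two lists, against ranks 1 and 3 for "a". Raw retriever scores never enter the calculation.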
Everything to this point is commodity. The retrievers come from the same handful of providers, the indexes are standard, and the fusion formula is one line. Two teams building independently will assemble nearly identical plumbing. None of it is where their systems will differ.
The fusion decision has no default that is correct
The difference hides inside that fusion step, in a choice the formula quietly makes for the team: how much to trust each retriever relative to the other.
Reciprocal rank fusion makes that choice implicitly. It gives both lists an equal say and exposes only a constant, conventionally set to 60, that controls how quickly a document's contribution falls off with rank, and most teams never revisit it. The convenience is real, but the assumption that the constant does not matter is false. Reciprocal rank fusion is sensitive to its parameter (Bruch et al., 2023). The alternative makes the choice explicit: a convex combination weights the two retrievers' normalized scores directly with a single tunable parameter. A tuned convex combination outperforms reciprocal rank fusion in both in-domain and out-of-domain settings, and its one weight can be fit from a small set of labeled queries drawn from the target domain (Bruch et al., 2023). The load-bearing word is tuned. The advantage comes from fitting the weight to the data, which is another way of saying the right weight cannot be read off in advance.
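A convex combination and a simple way to fit its weight might look like the sketch below. Several details are assumptions made for illustration and are not claimed to match Bruch et al.: min-max normalization per query, a score of zero for documents a retriever did not return, reciprocal rank of the first relevant document as the tuning metric, and a coarse grid over the weight alpha.

```python
def min_max(scores):
    """Rescale one retriever's scores for one query into [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def convex_fuse(lexical, dense, alpha):
    """Rank by alpha * dense + (1 - alpha) * lexical on normalized scores."""
    lex, den = min_max(lexical), min_max(dense)
    fused = {
        doc_id: alpha * den.get(doc_id, 0.0) + (1 - alpha) * lex.get(doc_id, 0.0)
        for doc_id in set(lex) | set(den)
    }
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0 if none is returned."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def fit_alpha(labeled_queries, grid=None):
    """Pick the grid value of alpha with the best mean reciprocal rank.

    labeled_queries: list of (lexical_scores, dense_scores, relevant_ids).
    """
    grid = grid or [i / 10 for i in range(11)]

    def mean_rr(alpha):
        total = sum(
            reciprocal_rank(convex_fuse(lex, den, alpha), relevant)
            for lex, den, relevant in labeled_queries
        )
        return total / len(labeled_queries)

    return max(grid, key=mean_rr)

# Hypothetical scores for a single labeled query.
lexical = {"a": 12.0, "b": 7.0, "c": 3.0}
dense = {"c": 0.9, "b": 0.8, "a": 0.1}
best_alpha = fit_alpha([(lexical, dense, {"c"})])
```

With alpha at 0 the ranking is purely lexical, and with alpha at 1 it is purely dense. A real fit would use many labeled queries from the target domain and would be checked on queries held out from the fit.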
It is not even one weight per system. The balance that serves precise, literal queries well is the wrong balance for descriptive, natural-language ones, and there is no single retrieval approach that answers every query type effectively (Arabzadeh et al., 2021). In e-commerce that gap is sharp enough that keyword retrieval still wins on exact queries and routing beats a fixed blend. The setting that maximizes relevance depends on the query mix the system actually receives, the corpus it searches over, and which queries carry the most weight for whoever runs it. None of those variables appear in the architecture diagram.
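One way to act on that observation is to route queries by type before choosing a weight. The sketch below is deliberately crude and entirely hypothetical: the identifier heuristic, the three class names, and the weights are placeholders, and in practice each class's weight would be fit and measured on labeled queries of that class.

```python
def looks_like_identifier(token):
    """Crude heuristic: a token of 4+ characters that contains a digit."""
    return len(token) >= 4 and any(ch.isdigit() for ch in token)

def classify_query(query):
    """Assign a query to one of three illustrative classes."""
    tokens = query.split()
    if any(looks_like_identifier(token) for token in tokens):
        return "exact"        # model numbers, SKUs, part codes
    if len(tokens) >= 4:
        return "descriptive"  # natural-language description
    return "short"

# Placeholder dense-retriever weights per class. These are not
# recommendations; they only show that the weight can vary by class.
DENSE_WEIGHT_BY_CLASS = {"exact": 0.1, "short": 0.5, "descriptive": 0.8}

def dense_weight_for(query):
    return DENSE_WEIGHT_BY_CLASS[classify_query(query)]

exact_weight = dense_weight_for("WH-1000XM5 headphones")
descriptive_weight = dense_weight_for("comfortable option for standing all day")
```

The value returned by dense_weight_for would serve as the weight of a convex combination, so an identifier-style query leans on the keyword signal while a descriptive one leans on the semantic signal.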
So two teams can run the identical hybrid stack and get opposite results. The components did not choose the setting for them, and the setting is where the relevance lives.
Why this becomes a measurement problem
If the correct setting is data-dependent and cannot be copied from a recipe, the only way to know which configuration works is to measure the candidates on the system's own queries. That measurement has a shape that a quick benchmark check does not.
Candidate settings usually score close together on average, while scores swing widely from one query to the next. A configuration that posts a higher mean can be no better than another, or quietly worse, once that query-to-query variance is taken into account. A single average cannot separate a real improvement from a lucky draw. The method that can is a confidence interval around each configuration's score, paired with a significance test on the per-query differences between configurations, an approach the information retrieval field settled on decades ago and has relied on since (Smucker et al., 2007).
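Both pieces can be sketched with the standard library alone. The example assumes per-query metric values (for instance NDCG@10) for two configurations on the same queries, a percentile bootstrap for the interval, and a paired sign-flip randomization test for significance. The per-query numbers are invented purely to make the code runnable.

```python
import random

def mean(values):
    return sum(values) / len(values)

def bootstrap_ci(values, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    tail = (1 - confidence) / 2
    low = means[int(tail * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return low, high

def paired_randomization_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided p-value for the mean per-query difference.

    Under the null hypothesis the sign of each per-query difference
    is arbitrary, so signs are flipped at random and the resulting
    mean is compared with the observed one.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Invented per-query scores for two fusion configurations.
config_a = [0.62, 0.40, 0.91, 0.55, 0.70, 0.85, 0.30, 0.66]
config_b = [0.42, 0.50, 0.61, 0.55, 0.60, 0.60, 0.35, 0.51]
diffs = [a - b for a, b in zip(config_a, config_b)]

ci_low, ci_high = bootstrap_ci(diffs)
p_value = paired_randomization_test(config_a, config_b)
```

If the interval on the per-query differences straddles zero, or the p-value is large, the higher mean is not evidence that one configuration is better.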
There is a second requirement that is structural rather than statistical: the number has to come from outside the system. A relevance score the search vendor or the building team reports is graded against judgments that team produced, on a query set that team selected. It answers whether the system agrees with itself, and it generally does. An independent measurement, on the operator's own queries, with the uncertainty made explicit, routinely lands on a different number, and that gap is the actionable part. It points at the queries where the system believes it is doing well and users disagree.
This holds wherever hybrid search runs. The architecture is shared across a RAG pipeline, an enterprise search box, and a product catalog alike. The fusion setting that is right for each, and the evidence that it is actually right, are not.
The question worth asking is therefore not "how do I build hybrid search." The components answer that, and they answer it the same way for everyone. The question is which configuration is right for this corpus and these queries, measured how, and with what confidence that the answer is more than noise. For an e-commerce catalog specifically, that reframing is the whole argument: the "best" hybrid implementation is a measurement question.
TensorOpt produces that measurement: graded relevance judgments built on an operator's own top queries, candidate fusion configurations compared with bootstrap confidence intervals and paired significance tests, and a documented method a third party can reproduce. The sample report runs the full procedure on a public benchmark, comparing hybrid weighting configurations and showing which one genuinely wins rather than which one happened to win a coin flip. Download the sample report, or read the methodology in full in Designing Hybrid Search Systems.
Laszlo Csontos
Author of Designing Hybrid Search Systems (Leanpub, 2026). Practitioner background in production hybrid search, embeddings, cross-encoder reranking, and retrieval evaluation.