The Hybrid Search Architecture We Use

Pure keyword search fails on synonyms. Pure semantic search fails on exact matches. The solution is to run both and combine the results. Here's the hybrid search architecture we deploy in production.

Why Hybrid?

Consider these queries:

| Query | BM25 | Semantic | Ideal |
|-------|------|----------|-------|
| "iPhone 15 Pro Max 256GB" | ✅ Exact match | ❌ Too broad | BM25 wins |
| "phone with best camera" | ❌ No keyword match | ✅ Understands intent | Semantic wins |
| "smartphone photography" | ⚠️ Partial match | ⚠️ Too abstract | Need both |

Neither approach dominates. You need both retrieval paths, intelligently combined.

The Architecture

                    ┌─────────────────┐
                    │   User Query    │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ Query Analysis  │
                    │ - Intent        │
                    │ - Entities      │
                    │ - Spelling      │
                    └────────┬────────┘
                             │
           ┌─────────────────┼─────────────────┐
           │                 │                 │
    ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
    │   BM25      │   │  Semantic   │   │   Sparse    │
    │  Retrieval  │   │  Retrieval  │   │  Retrieval  │
    └──────┬──────┘   └──────┬──────┘   └──────┬──────┘
           │                 │                 │
           └─────────────────┼─────────────────┘
                             │
                    ┌────────▼────────┐
                    │   Fusion Layer  │
                    │   (RRF / Linear)│
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │  Cross-Encoder  │
                    │    Reranking    │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │  Final Results  │
                    └─────────────────┘

Component Deep-Dive

1. Query Understanding

Before retrieval, understand what the user wants:

class QueryAnalyzer:
    def analyze(self, query: str) -> QueryFeatures:
        return QueryFeatures(
            intent=self.classify_intent(query),  # navigational, informational, transactional
            entities=self.extract_entities(query),  # brands, sizes, colors
            corrected_query=self.spell_correct(query),
            expansion_terms=self.expand_query(query)  # synonyms
        )
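As a rough illustration of what `analyze` might return, here is a minimal rule-based sketch. The `QueryFeatures` fields mirror the snippet above; the brand list, synonym table, and intent rules are hypothetical placeholders rather than the production models, and spell correction is left out entirely.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical lookup tables; a real system would use trained models or curated lexicons.
KNOWN_BRANDS = {"apple", "samsung", "google"}
SYNONYMS: Dict[str, List[str]] = {"phone": ["smartphone", "mobile"]}
TRANSACTIONAL_CUES = {"buy", "price", "cheap", "deal"}


@dataclass
class QueryFeatures:
    intent: str
    entities: Dict[str, List[str]]
    corrected_query: str
    expansion_terms: List[str] = field(default_factory=list)


def analyze(query: str) -> QueryFeatures:
    tokens = re.findall(r"[a-z0-9]+", query.lower())

    # Intent: a crude keyword rule standing in for a classifier.
    if any(t in TRANSACTIONAL_CUES for t in tokens):
        intent = "transactional"
    elif any(t in KNOWN_BRANDS for t in tokens):
        intent = "navigational"
    else:
        intent = "informational"

    entities = {
        "brand": [t for t in tokens if t in KNOWN_BRANDS],
        # Storage sizes such as "256gb" or "1tb".
        "size": [t for t in tokens if re.fullmatch(r"\d+(gb|tb)", t)],
    }

    # Synonym expansion, in query-token order.
    expansion = [syn for t in tokens for syn in SYNONYMS.get(t, [])]

    # Spell correction is omitted here; the query passes through unchanged.
    return QueryFeatures(intent, entities, query, expansion)
```

The entities and expansion terms can then steer retrieval, for example by boosting the brand field in the keyword query.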

2. BM25 Retrieval

The workhorse for exact matching. We use Elasticsearch with tuned parameters:

{
  "query": {
    "multi_match": {
      "query": "iPhone 15 Pro",
      "fields": ["title^3", "description", "brand^2"],
      "type": "best_fields",
      "fuzziness": "AUTO"
    }
  }
}

Key tuning parameters:

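Field boosting: A title match counts three times as much as a description match (title^3), and a brand match twice as much (brand^2)
Fuzziness: AUTO scales the allowed edit distance with term length, so typos are tolerated without over-matching short terms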
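To see why BM25 rewards exact term matches, here is a minimal pure-Python scorer. It is a sketch of the scoring formula only, not of how Elasticsearch executes queries: the whitespace tokenizer and the three-document corpus are illustrative assumptions, and `k1=1.2`, `b=0.75` are the usual defaults rather than our tuned values.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with the BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "iphone 15 pro max 256gb",
    "phone with a great camera",
    "iphone 15 case",
]
scores = bm25_scores("iphone 15 pro", docs)
```

The first document matches all three query terms and scores highest. The second shares no exact term with the query and scores zero, which is the gap semantic retrieval is meant to cover.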

3. Semantic Retrieval

Vector similarity search for conceptual matching:

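from typing import List, Optional

def semantic_retrieve(
    query: str,
    k: int = 100,
    metadata_filter: Optional[dict] = None,
) -> List[Document]:
    # embedding_model and vector_store are clients configured elsewhere
    query_embedding = embedding_model.encode(query)
    results = vector_store.search(
        query_embedding,
        k=k,
        filter=metadata_filter
    )
    return results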

We use HNSW indexes for fast approximate nearest-neighbor retrieval at scale. HNSW trades a small amount of recall for a large reduction in search time.
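For intuition, this is the exact search that an HNSW index approximates. The sketch uses hand-made 3-dimensional vectors as stand-ins for real embeddings, and `exact_top_k` is an illustrative helper, not part of any vector store API.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Exhaustive search: one comparison per indexed vector, which HNSW avoids."""
    scored = ((doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors standing in for embeddings.
index = {
    "camera-phone": [0.9, 0.1, 0.0],
    "phone-case": [0.2, 0.9, 0.1],
    "dslr-camera": [0.8, 0.0, 0.5],
}
top = exact_top_k([1.0, 0.0, 0.1], index, k=2)
```

Here `top` contains `camera-phone` followed by `dslr-camera`. An approximate index aims to return the same neighbors while visiting only a small fraction of the vectors.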

4. Fusion: Reciprocal Rank Fusion (RRF)

Combine ranked lists without needing score calibration:

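from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(
    ranked_lists: List[List[Document]],
    k: int = 60
) -> List[Document]:
    """
    RRF score = sum(1 / (k + rank_i)) over every list that contains the document
    """
    scores = defaultdict(float)
    docs_by_id: Dict[str, Document] = {}

    for ranked_list in ranked_lists:
        for rank, doc in enumerate(ranked_list, start=1):
            scores[doc.id] += 1 / (k + rank)
            docs_by_id[doc.id] = doc  # keep the object so we can return Documents, not IDs

    # Highest fused score first
    sorted_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs_by_id[doc_id] for doc_id in sorted_ids]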

RRF works because:

No score normalization needed
Naturally handles different list lengths
Robust to outlier scores
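A small worked example makes the behavior concrete. The sketch below applies the same formula to plain document IDs; the two input rankings are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_scores(ranked_ids: List[List[str]], k: int = 60) -> Dict[str, float]:
    scores: Dict[str, float] = defaultdict(float)
    for ids in ranked_ids:
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] += 1 / (k + rank)
    return dict(scores)


bm25_ids = ["a", "b", "c"]       # BM25 ranking, best first
semantic_ids = ["c", "a", "d"]   # semantic ranking, best first

scores = rrf_scores([bm25_ids, semantic_ids])
fused = sorted(scores, key=scores.get, reverse=True)
# fused == ["a", "c", "b", "d"]
```

Documents `a` and `c` appear in both lists, so each collects two contributions and rises above `b` and `d`, which appear in only one. Only ranks enter the formula, so the raw BM25 and cosine scores never need to be put on a common scale.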

5. Cross-Encoder Reranking

The final refinement. Cross-encoders see query and document together:

from typing import List

from sentence_transformers import CrossEncoder
 
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
 
def rerank(query: str, documents: List[Document], top_k: int = 10) -> List[Document]:
    pairs = [(query, doc.text) for doc in documents]
    scores = reranker.predict(pairs)
 
    ranked = sorted(zip(documents, scores), key=lambda x: -x[1])
    return [doc for doc, score in ranked[:top_k]]

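Cross-encoders are slower than embedding lookups because every score depends on the query, so nothing can be precomputed at index time. In exchange they are more accurate. Use them only on the top 50-100 candidates from fusion.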
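One way to enforce the candidate cap is to truncate the fused list before scoring. The sketch below takes the scoring function as a parameter so it runs without loading a model. `toy_overlap_score` is a stand-in for illustration only; in practice `score_fn` would wrap something like `reranker.predict`.

```python
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank_top_candidates(
    query: str,
    fused_texts: List[str],
    score_fn: ScoreFn,
    max_candidates: int = 50,
    top_k: int = 10,
) -> List[str]:
    # Only the head of the fused list is scored; the tail is dropped.
    candidates = fused_texts[:max_candidates]
    if not candidates:
        return []
    scores = score_fn([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]


def toy_overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer: counts shared words. Not a real cross-encoder."""
    return [float(len(set(q.lower().split()) & set(d.lower().split()))) for q, d in pairs]


results = rerank_top_candidates(
    "phone with best camera",
    ["leather phone case", "phone with the best camera sensor", "usb cable"],
    toy_overlap_score,
    top_k=2,
)
```

Reranking cost grows linearly with `max_candidates`, which makes it the main knob for trading accuracy against latency.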

The Latency Budget

In production, speed matters:

| Stage | Target Latency | Notes |
|-------|----------------|-------|
| Query Analysis | < 20ms | Can cache common patterns |
| BM25 Retrieval | < 30ms | Elasticsearch is fast |
| Semantic Retrieval | < 30ms | HNSW with good recall |
| Fusion | < 5ms | Pure computation |
| Reranking | < 50ms | Top 50 candidates |
| Total | < 135ms | P95 target; sum of all stages run back to back |
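The stage targets can be added up two ways, depending on whether the retrieval paths run one after another or side by side. The arithmetic below is a rough budgeting aid under that assumption, not a measured end-to-end P95.

```python
# Per-stage targets in milliseconds, taken from the table above.
budget_ms = {
    "query_analysis": 20,
    "bm25": 30,
    "semantic": 30,
    "fusion": 5,
    "rerank": 50,
}

# Every stage back to back.
sequential_ms = sum(budget_ms.values())  # 135

# BM25 and semantic retrieval overlap, so only the slower one is on the critical path.
parallel_ms = (
    budget_ms["query_analysis"]
    + max(budget_ms["bm25"], budget_ms["semantic"])
    + budget_ms["fusion"]
    + budget_ms["rerank"]
)  # 105
```

The 30ms difference is headroom that parallel retrieval buys against the 135ms target.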

Latency Optimization Tips

Parallelize retrieval: BM25 and semantic run simultaneously, so the retrieval stage costs the slower of the two rather than their sum
Limit candidates: Don't rerank 1000 documents
Quantize embeddings: int8 is often sufficient
Cache embeddings: Popular queries hit cache
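The first tip can be sketched with `asyncio.gather`. The two stub coroutines below simulate backend calls with a short sleep; their names and return values are placeholders, and real clients would need to be async or wrapped in a thread pool.

```python
import asyncio
from typing import List


async def bm25_search_stub(query: str) -> List[str]:
    await asyncio.sleep(0.03)  # simulated backend latency
    return ["doc-1", "doc-2"]


async def semantic_search_stub(query: str) -> List[str]:
    await asyncio.sleep(0.03)  # simulated backend latency
    return ["doc-2", "doc-3"]


async def retrieve_all(query: str) -> List[List[str]]:
    # gather() runs both coroutines concurrently and returns results in argument order.
    results = await asyncio.gather(
        bm25_search_stub(query),
        semantic_search_stub(query),
    )
    return list(results)


ranked_lists = asyncio.run(retrieve_all("phone with best camera"))
```

Both calls wait at the same time, so the wall-clock cost is roughly one sleep rather than two. `ranked_lists` can be passed straight to the fusion step.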

When Not to Use Hybrid

Hybrid search adds complexity. Skip it when:

Pure catalog lookup: SKU search doesn't need semantics
Extreme latency requirements: Every millisecond counts
Limited engineering resources: Start with one approach, add the other when needed
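The catalog-lookup case does not have to be all or nothing: a lightweight router can send SKU-like queries down the keyword path and everything else through the hybrid pipeline. This is a minimal sketch, and the SKU format in the regular expression is purely hypothetical; substitute whatever pattern your catalog uses.

```python
import re

# Hypothetical SKU format: 2-4 letters, a dash, then 4-8 digits (e.g. "AB-123456").
SKU_PATTERN = re.compile(r"[A-Za-z]{2,4}-\d{4,8}")


def choose_pipeline(query: str) -> str:
    """Return "bm25_only" for SKU-like lookups and "hybrid" for everything else."""
    if SKU_PATTERN.fullmatch(query.strip()):
        return "bm25_only"
    return "hybrid"
```

A query such as `"AB-123456"` would skip embedding and reranking entirely, while `"phone with best camera"` still gets the full pipeline.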

Results

On a recent e-commerce deployment, hybrid search outperformed both individual approaches:

| Approach | NDCG@10 | P95 Latency |
|----------|---------|-------------|
| BM25 only | 0.58 | 45ms |
| Semantic only | 0.62 | 55ms |
| Hybrid + Rerank | 0.79 | 130ms |

The 36% relative improvement in NDCG@10 over the BM25 baseline (0.58 to 0.79; 27% over semantic-only) translated to a 12% increase in click-through rate and an 8% improvement in conversion.
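For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k and of the relative-lift arithmetic applied to the table's numbers. It assumes linear gain and computes the ideal ordering from the returned list only; evaluation toolkits differ on both choices.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    # Position i (0-based) is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: List[float], k: int = 10) -> float:
    """relevances[i] is the graded relevance of the result shown at position i."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_lift(new: float, baseline: float) -> float:
    return (new - baseline) / baseline


lift_vs_bm25 = relative_lift(0.79, 0.58)      # about 0.36
lift_vs_semantic = relative_lift(0.79, 0.62)  # about 0.27
```

A perfectly ordered result list scores 1.0, and any misordering lowers the score, with mistakes near the top penalized most.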

Hybrid search isn't just technically superior. On this deployment it directly moved business metrics.