Blog

Technical deep-dives on search, embeddings, RAG, and retrieval systems.

RAG hallucinations are usually a retrieval failure you never measured

When a RAG system answers confidently and wrongly, the reflex is to fix the prompt or swap the model. The fault usually sits upstream in retrieval, and it stays invisible because almost no team measures retrieval as its own stage.

May 27, 2026

Hybrid Search · 11 min read

Hybrid Search at Catalog Scale: Vector + Keyword for Large E-Commerce Catalogs

Beyond 500K SKUs, hybrid search architecture decisions interact in ways small-catalog tutorials never surface. The wrong defaults show up as Q4 latency degradation, not relevance gaps.

May 27, 2026

E-commerce Search · 7 min read

Why E-commerce Search Fails and How Catalog Data Quality Fixes It

Most mid-market search-relevance gaps trace to the catalog, not the algorithm. Switching vendors carries the same missing attributes and synonym gaps forward, and the fix is often days of merchandising work.

May 27, 2026

E-commerce Search · 7 min read

Fixing Zero-Result Search Complaints in E-Commerce

Zero-result pages are the most visible search failure, so brands tune their vendor settings one query at a time. The higher-leverage fix is to read the log structurally: four causes, four owners, four timelines.

May 27, 2026

E-commerce Search · 7 min read

Your E-commerce Search Is Fast. That Says Nothing About Whether It Works.

The default search stack optimizes the metric teams can see (speed) and leaves the one that drives sales (relevance) unmeasured. Fast results that miss are still misses, and the failure stays invisible until someone measures relevance on the brand's own queries.

May 27, 2026

Evaluation · 8 min read

Bootstrap Confidence Intervals for NDCG: The Rigor Most RAG Evals Skip

A single NDCG number looks like a verdict but is often a coin flip. The variance across query subsets routinely exceeds the difference between two systems, and the IR community solved this in 1997. Most LLM-era eval tooling has not inherited the fix.

May 26, 2026

Evaluation · 9 min read

Paired Significance Tests for Retrieval Changes: When NDCG Went Up Isn't Enough

A reranker change moves NDCG@10 by +0.015 and the dashboard goes green. The paired bootstrap returns a 95% CI of [-0.003, +0.033], crossing zero. The point estimate said ship; the evidence says wait.

May 26, 2026

Evaluation · 8 min read

First-Party Eval Tools vs Independent Audit

Galileo, Patronus, Phoenix, and DeepEval make a RAG system better. They cannot prove it works to procurement, diligence, or an investor. Independence is structural, not a feature.

May 26, 2026

Evaluation · 8 min read

The 4-12 Week Window for a Pre-Funding Retrieval Audit

Run an audit too late and you cannot fix what it finds. Run it too early and the system drifts before the conversation. The usable window is 4 to 12 weeks out.

May 26, 2026

E-commerce Search · 9 min read

Best hybrid search implementation for e-commerce: what 'best' means, measured

"Best hybrid search" is a measurement question, not a configuration one. The right weighting depends on your own queries and catalog, and the winning config is the one with the highest confidence-interval lower bound, not the highest score.

May 26, 2026

Hybrid Search · 9 min read

Hybrid Search in E-Commerce: When Keyword Still Wins

Hybrid search is the right default, but exact queries (model numbers, SKUs, brand-plus-spec strings) are still best served by keyword retrieval. Routing beats a fixed blend, and only per-class measurement reveals where vectors quietly fail.

May 26, 2026

Evaluation · 8 min read

What investors actually look for when they ask how you evaluate retrieval

Internal eval dashboards report a number. Procurement and diligence need three things a dashboard cannot supply: independence, statistical significance, and a documented method a stranger can reproduce.

May 22, 2026

Evaluation · 7 min read

Measuring Search Quality: Reranking and the Evaluation Foundation

Cross-encoder reranking is often the single largest relevance gain in a retrieval pipeline, and adding one takes days, not weeks. Systematic evaluation, offline and online, is what tells you whether a change actually helped.

March 10, 2026

Embeddings · 9 min read

Why Generic Embeddings Stall in Production, and What Domain Adaptation Fixes

Off-the-shelf embeddings get a demo working. Domain-adapted models get production working. The gap between the two can reach 9.3 nDCG@10 points, and it is invisible until real users start typing.

March 3, 2026

Hybrid Search · 7 min read

How Hybrid Search Works (and Why the Architecture Is the Easy Part)

Running keyword and vector search together and fusing the results is now the default recommendation, and it is the right one. The architecture is simple to copy. What it leaves undefined is how to combine the two signals, and there is no setting that is correct in advance.

February 24, 2026