Blog
Technical deep-dives on search, embeddings, RAG, and retrieval systems.
When a RAG system answers confidently and wrongly, the reflex is to fix the prompt or swap the model. The fault usually sits upstream in retrieval, and it stays invisible because almost no team measures retrieval as its own stage.
May 27, 2026
Beyond 500K SKUs, hybrid search architecture decisions interact in ways small-catalog tutorials never surface. The wrong defaults show up as Q4 latency degradation, not relevance gaps.
May 27, 2026
Most mid-market search-relevance gaps trace to the catalog, not the algorithm. Switching vendors carries the same missing attributes and synonym gaps forward, and the fix is often days of merchandising work.
May 27, 2026
Zero-result pages are the most visible search failure, so brands tune their vendor settings one query at a time. The higher-leverage fix is to read the log structurally: four causes, four owners, four timelines.
May 27, 2026
The default search stack optimizes the metric teams can see (speed) and leaves the one that drives sales (relevance) unmeasured. Fast results that miss are still misses, and the failure stays invisible until someone measures relevance on the brand's own queries.
May 27, 2026
A single NDCG number looks like a verdict but is often a coin flip. The variance across query subsets routinely exceeds the difference between two systems, and the IR community solved this in 1997. Most LLM-era eval tooling has not inherited the fix.
May 26, 2026
A reranker change moves NDCG@10 by +0.015 and the dashboard goes green. The paired bootstrap returns a 95% CI of [-0.003, +0.033], crossing zero. The point estimate said ship; the evidence says wait.
May 26, 2026
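For readers who want to see the shape of the paired bootstrap mentioned in the entry above, here is a minimal sketch. It assumes you already have per-query NDCG@10 scores for the same queries under both systems. The function name, the percentile method, and the default of 10,000 resamples are illustrative choices, not the method used in the post.

```python
import random


def paired_bootstrap_ci(baseline, candidate, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-query score difference.

    baseline and candidate are per-query scores (e.g. NDCG@10) for the SAME
    queries in the SAME order. Names and defaults here are illustrative.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")

    # Pairing: work on per-query differences, not on the two means separately.
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    rng = random.Random(seed)

    # Resample queries with replacement and record the mean difference each time.
    boot_means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )

    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(diffs) / n, boot_means[lo_idx], boot_means[hi_idx]


if __name__ == "__main__":
    # Made-up scores purely to show the call; not results from any real system.
    old = [0.42, 0.61, 0.55, 0.30, 0.77, 0.49]
    new = [0.45, 0.60, 0.58, 0.33, 0.76, 0.52]
    point, low, high = paired_bootstrap_ci(old, new)
    print(f"delta={point:+.3f}  95% CI=[{low:+.3f}, {high:+.3f}]")
```

If the interval includes zero, the observed gain is not distinguishable from noise at that confidence level, which is the situation the entry describes.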
Galileo, Patronus, Phoenix, and DeepEval make a RAG system better. They cannot prove it works to procurement, diligence, or an investor. Independence is structural, not a feature.
May 26, 2026
Run an audit too late and you cannot fix what it finds. Run it too early and the system drifts before the conversation. The usable window is 4 to 12 weeks out.
May 26, 2026
"Best hybrid search" is a measurement question, not a configuration one. The right weighting depends on your own queries and catalog, and the winning config is the one with the highest confidence-interval lower bound, not the highest score.
May 26, 2026
Hybrid search is the right default, but exact queries (model numbers, SKUs, brand-plus-spec strings) are still best served by keyword retrieval. Routing beats a fixed blend, and only per-class measurement reveals where vectors quietly fail.
May 26, 2026
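As a rough illustration of the routing idea in the entry above, the sketch below sends queries that look like identifiers to keyword retrieval and everything else to the hybrid path. The regular expression and the `route_query` name are hypothetical. A real router would be tuned and measured per query class on your own logs.

```python
import re

# Hypothetical heuristic: a token that mixes letters and digits
# (e.g. "WH-1000XM5", "16gb") is treated as an exact identifier.
_ID_TOKEN = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9][A-Za-z0-9\-_/.]*$")


def route_query(query: str) -> str:
    """Return "keyword" for identifier-like queries, otherwise "hybrid"."""
    if any(_ID_TOKEN.match(token) for token in query.split()):
        return "keyword"
    return "hybrid"
```

A heuristic like this misses brand-plus-spec strings such as "dell xps 13", where no single token mixes letters and digits. That is one reason the routing rules themselves need per-class measurement.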
Internal eval dashboards report a number. Procurement and diligence need three things those dashboards cannot supply: independence, statistical significance, and a documented method a stranger can reproduce.
May 22, 2026
Cross-encoder reranking is often the single largest relevance gain in a retrieval pipeline, and adding one takes days, not weeks. Systematic evaluation, offline and online, is what tells you whether a change actually helped.
March 10, 2026
Off-the-shelf embeddings get a demo working. Domain-adapted models get production working. The gap between the two can reach 9.3 NDCG@10 points, and it is invisible until real users start typing.
March 3, 2026
Running keyword and vector search together and fusing the results is now the default recommendation, and it is the right one. The architecture is simple to copy. What it leaves undefined is how to combine the two signals, and there is no setting that is correct in advance.
February 24, 2026
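One common way to combine the two result lists described in the entry above is reciprocal rank fusion (RRF). The sketch below is an illustration of that technique, not necessarily the approach the post recommends. The constant `k=60` is a widely used default, not a setting known to be correct for any particular catalog.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["a", "b", "c"]  # hypothetical keyword ranking
    vector_hits = ["b", "c", "d"]   # hypothetical vector ranking
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

RRF uses only ranks, so it sidesteps score normalization between the two retrievers. It still leaves `k`, and any per-retriever weighting, as choices to be measured on your own queries.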