How TensorOpt audits retrieval systems
This page summarizes the public-facing version of the TensorOpt methodology. The full runbook stays internal; it is the engagement IP, not a free download. What I share below is enough for a buyer to evaluate whether the approach is rigorous and whether it would surface anything their team would not.
The 4-step diagnostic process
Every engagement (both the Retrieval Audit Sprint for AI startups, 7 business days, and the Search Relevance Diagnostic for mid-market e-commerce, 10 business days) runs the same four steps.
- Engagement. Discovery call, SOW, and 50% deposit. I get read-only access to the client retrieval pipeline (logs, vector store, search vendor, RAG service) plus a representative sample of query traffic. No production write access is requested.
- Judgment-set construction. I build a stratified evaluation dataset sized for the audited pipeline: 100-200 query/document pairs for the AI Startup audit, 500 head and torso queries for the e-commerce diagnostic. Where an internal judgment set already exists, I use it; where it doesn't, I build one and you keep the artifact for future vendor evaluation cycles.
- Measurement. I run the metrics that match the engagement type (NDCG@k, MRR, hit-rate@k, and faithfulness on a generation sample for the AI Startup audit; NDCG@10 plus click-position-1 and conversion-correlated relevance for the e-commerce diagnostic) and wrap each one in a bootstrap 95% confidence interval. Where two pipelines or two configurations are being compared, I run a paired significance test.
- Report. Written report (18 pages for AI Startups, 24 pages for e-commerce), walkthrough call, methodology code as a runnable Python notebook (AI Startups) or CSV deliverable (e-commerce), and three prioritized remediation recommendations with effort estimates.
Statistical rigor layer
This is the differentiation that matters for procurement buyers.
Evaluation-tooling vendors (Galileo, Patronus, Arize Phoenix, FutureAGI, Confident AI, Maxim AI) publish NDCG@k, MRR, Precision@k, and similar metrics as platform features. What they typically do not publish, and what I apply on top, is the classical IR statistical rigor layer:
- Bootstrap confidence intervals. Every reported metric ships with a 95% CI estimated by resampling the judgment set. An NDCG@10 of 0.62 with a CI of [0.58, 0.66] tells a different story from 0.62 with [0.50, 0.74]. Without the CI, a 2-point lift in a vendor comparison (0.62 to 0.64, say) cannot be distinguished from a sampling artifact.
- Paired significance tests. When the engagement compares two pipelines (e.g., current vendor vs. challenger, or current chunking strategy vs. alternative), I run paired tests (the same query against both pipelines) rather than independent samples. Pairing cancels out per-query difficulty, so the test can detect smaller differences than an independent-samples test on the same number of queries, and it reflects how A/B test reads are actually structured.
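The two techniques above can be sketched generically. What follows is a minimal textbook version, not the TensorOpt implementation (which stays in the internal runbook): a percentile bootstrap for the CI and a sign-flip randomization test for the paired comparison. The function names, the resample counts, and the synthetic scores are assumptions made for illustration. The choice of a randomization test is also an assumption, since this page does not name the specific paired test used.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query metric scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample queries with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), low, high


def paired_randomization_test(scores_a, scores_b, n_permutations=5000, seed=0):
    """Two-sided sign-flip randomization test on per-query differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("need one score per query from each pipeline")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return mean(diffs), (extreme + 1) / (n_permutations + 1)


# Synthetic per-query NDCG@10 scores, for illustration only.
rng = random.Random(42)
current = [rng.uniform(0.3, 0.9) for _ in range(150)]
challenger = [min(1.0, s + rng.gauss(0.03, 0.05)) for s in current]

point, low, high = bootstrap_ci(current)
delta, p_value = paired_randomization_test(current, challenger)
print(f"current NDCG@10 = {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
print(f"mean paired delta = {delta:+.3f}, p = {p_value:.4f}")
```

Both functions take one score per query, so the same per-query metric values feed the interval and the comparison. The only requirement for the paired test is that the two score lists are aligned query by query.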
TensorOpt is also an independent third party rather than a SaaS vendor: I am not grading my own platform, and I do not own the eval surface. Many of the metrics buyers most want to see (vendor-vs-vendor, vendor-vs-replacement) are by definition not produced by the incumbent vendor's own dashboard.
What gets measured
The metric mix depends on the engagement and the pipeline. Short notes on each:
- NDCG@k. Graded relevance, position-weighted. The standard IR metric for ranked retrieval. I typically report NDCG@10 and NDCG@20.
- MRR. Reciprocal rank of the first relevant result. Useful when a question has one canonical answer and the user stops scrolling once they find it.
- Hit-rate@k. Binary "is at least one relevant item in the top k". The cleanest metric to communicate to non-IR stakeholders.
- Faithfulness. For RAG generation, whether the generated answer is supported by the retrieved context. I score this on a sample of generated responses; methodology details (rubric, scoring model, sample size, human spot-check rate) live in the report.
- Click-position-1 and conversion-correlated relevance. E-commerce-specific. Position-1 click-through is a direct revenue proxy; conversion-correlated relevance ties judgment-set scoring to downstream conversion data where available.
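For readers who want the ranking metrics above in concrete form, here is a minimal per-query sketch of NDCG@k, reciprocal rank, and hit-rate@k. It uses the standard definitions and assumes the linear-gain form of DCG (some implementations use an exponential gain of 2^rel - 1 instead). The function names, the 0-3 label scale, and the example query are hypothetical. Averaging `reciprocal_rank` and `hit_at_k` over all queries gives MRR and hit-rate@k.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains, all_gains, k):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ordering.

    ranked_gains: graded labels of the returned results, in ranked order.
    all_gains: graded labels of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(ranked_gains):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, g in enumerate(ranked_gains, start=1):
        if g > 0:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_gains, k):
    """1.0 if at least one relevant result appears in the top k, else 0.0."""
    return 1.0 if any(g > 0 for g in ranked_gains[:k]) else 0.0


# Hypothetical judged query: labels on a 0-3 scale, in the order returned.
ranked = [0, 3, 0, 1, 2]
judged = [3, 2, 1, 0, 0, 0]
print(f"NDCG@5 = {ndcg_at_k(ranked, judged, k=5):.3f}")
print(f"RR     = {reciprocal_rank(ranked):.3f}")  # first relevant result is at rank 2
print(f"hit@1  = {hit_at_k(ranked, k=1):.0f}")    # nothing relevant at rank 1
```

The per-query values returned here are the inputs a bootstrap or paired test would resample, which is why the metrics are computed per query rather than only in aggregate.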
What stays internal
The internal methodology runbook is engagement IP. It covers the operational detail (judgment-set sampling strategy, prompt templates for LLM-as-judge scoring, edge cases on chunking review, the specific bootstrap implementation, how I triage remediations into the report's three recommendations) and it represents most of the value a client pays for in the engagement. I share it inside the engagement, not on this page.
I mention this explicitly because I'd rather a team read this page and decide to book a discovery call than read this page and decide they can replicate the audit over a weekend. The interesting parts of the methodology are the parts I don't publish.
Third-party LLM API consent
Some scoring tasks (faithfulness in particular, and some LLM-as-judge configurations) require sending client data to a third-party LLM API (OpenAI, Anthropic). Where the engagement uses these APIs, explicit pre-engagement consent is recorded in the SOW and the relevant data flows are listed by name. A self-hosted fallback (Llama, Mistral, or an equivalent) is available if the team cannot send data outside their environment; the methodology is identical, runtime is 2-3x slower, and accuracy on the faithfulness rubric is roughly within the same band but not strictly equivalent.
This is a procurement question that comes up in most mid-market discovery calls, so I publish the policy rather than negotiating it from scratch each time.
See the methodology in action
The full methodology, applied end-to-end on a public benchmark, is documented in the sample reports. Pick the one closest to your use case below.
AI Startups sample (BEIR/FiQA-2018, 18 pages)
Retrieval Audit Sprint deliverable on a public benchmark.
E-commerce sample (WANDS, 24 pages)
Search Relevance Diagnostic deliverable on a public benchmark.