What investors actually look for when they ask how you evaluate retrieval
Internal eval dashboards report a number. Procurement and diligence need three things those dashboards cannot supply: independence, statistical significance, and a documented method a stranger can reproduce.
There is a moment, usually four to twelve weeks before it matters, when someone technical asks a founder a deceptively simple question: how do you know your retrieval works? It comes from a partner doing technical diligence, or from an enterprise customer's procurement team, or from a prospect's internal ML lead who was pulled into the evaluation. The founder pulls up a dashboard. NDCG@10 is 0.71. MRR is 0.62. The numbers are good. The conversation does not get easier.
The reason it does not get easier is that the question was never about the number. It was about whether the number can be trusted by someone who did not produce it. Internal eval dashboards report the number but cannot establish that trust, and the gap between the two has become the thing that stalls deals and slows rounds.
The trigger: AI procurement has hardened
The honeymoon phase of enterprise AI buying is over. Enterprise buyers now put gen-AI vendors through the same gauntlet they apply to any other software purchase: structured evaluations, scrutiny of where the system runs, and benchmark comparisons, while the growing complexity of AI workflows raises the cost of switching later (Wang et al., 2025). That finding comes from a survey of 100 enterprise CIOs across 15 industries, and it marks a shift in how the buy side behaves. Experimentation budgets have collapsed: the share of AI spending coming from innovation budgets dropped from 25% to 7% in a single year (Wang et al., 2025). Spending that moves out of the innovation line and into the operating budget gets evaluated like any other operating expense.
The buying behavior has tightened in step. More than 60% of B2B buyers now run some form of trial before committing to a full purchase, and procurement teams report losing confidence in decisions specifically because of inaccurate AI output at a higher rate than buyers overall (Brohan, 2026). The major analyst houses have responded by publishing AI-specific RFP question templates, distinct from the generic SaaS security questionnaire, designed to make vendor assessments more evidence based (Forrester, 2025).
Regulation pushes in the same direction. Under the EU AI Act, high-risk systems must be tested against metrics and probabilistic thresholds defined in advance, appropriate to the system's intended purpose (European Union, 2024). The compliance calendar is staggered: stand-alone high-risk systems under Annex III were due to fall under the obligations on 2 August 2026, with systems embedded in regulated products under Annex I following on 2 August 2027. That near-term date is now expected to move. A provisional political agreement reached on 7 May 2026 defers the Annex III stand-alone obligations to 2 December 2027, pending formal adoption and publication in the Official Journal. Because the deferral is not yet adopted, the prudent reading is to treat any delay as bonus time, not a planning baseline.
None of this requires a startup to be selling into a regulated vertical for the pressure to land. The buyer's procurement function carries the standard, and the buyer's standard is now the vendor's problem.
Investor diligence asks the same question
Investor diligence and enterprise procurement are usually treated as separate conversations. On retrieval evaluation they collapse into one. Technical due diligence checklists for AI startups consistently list model performance metrics and model validation as line items, alongside data quality and model operations (Sphere, 2024; Heyman, 2025). No public investor framework specifies bootstrap confidence intervals by name, and a founder should not claim one does. But the logic runs in a single direction. When the system under diligence is a retrieval or RAG system, the technically literate reviewer, whether a partner with an ML background or an advisor brought in to assess the stack, ends up at the same question an enterprise buyer reaches: how do you know the difference between version A and version B is real and not noise? The rest of this piece is about answering that question, and the answer is identical whether the person asking sits in a venture partnership or a procurement function.
What internal eval dashboards usually look like
The typical internal retrieval evaluation produces a point estimate. NDCG@10, MRR, sometimes Recall@k and a faithfulness score, computed once over a judgment set that the same team curated, typically a modest set of queries with hand-labeled correct documents. The dashboard shows the number. It updates when the pipeline changes. It is genuinely useful for the work it was built for, which is iterating during development.
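For readers who have not computed these metrics by hand, a minimal sketch of how a dashboard arrives at its two headline numbers follows. It assumes the linear-gain form of DCG (some tools use an exponential gain instead), treats any grade above zero as relevant for MRR, and uses hypothetical function names; it is an illustration, not the implementation of any particular tool.

```python
import math

def dcg_at_k(gains, k):
    """Linear-gain DCG: the gain at 1-indexed rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, judged_gains, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal ranking.

    ranked_gains: relevance grades of the returned documents, in rank order.
    judged_gains: grades of every judged document for the query (any order).
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(ranked_gains):
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for rank, gain in enumerate(ranked_gains, start=1):
        if gain > 0:
            return 1.0 / rank
    return 0.0

def dashboard_numbers(results, k=10):
    """results: one (ranked_gains, judged_gains) pair per query.

    Returns (mean NDCG@k, MRR): two point estimates with no uncertainty attached.
    """
    ndcgs = [ndcg_at_k(ranked, judged, k) for ranked, judged in results]
    rrs = [reciprocal_rank(ranked) for ranked, _ in results]
    return sum(ndcgs) / len(ndcgs), sum(rrs) / len(rrs)

# Two illustrative queries with made-up grades on a 0-3 scale.
example = [([0, 2, 1], [2, 1, 0]), ([3, 0, 0], [3, 1])]
mean_ndcg, mrr = dashboard_numbers(example)
```

The per-query lists built inside dashboard_numbers are the part worth keeping: the averages are what the dashboard shows, but the per-query scores are what any later statistical test needs.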
What it does not contain is the part a third party needs. There is no uncertainty quantification around the metric, so a reader cannot tell whether 0.71 is meaningfully different from 0.68. There is no documented protocol for how the judgment set was sampled, so a reader cannot tell whether the queries represent production traffic or the queries the system already handles well. And there is no independence, because the team that built the system also chose the queries, wrote the labels, and ran the tool.
This is not a knock on the tools. It is a description of what they are for. The gap only becomes a problem when the audience changes from "us, deciding what to ship next" to "someone deciding whether to trust us with money or production traffic."
The three structural requirements
Procurement-grade and diligence-grade evaluation has three properties that an internal dashboard structurally cannot have. Each maps to a question the reader is actually asking.
Independence: the reviewer is not the system builder
The reader's first unspoken question is whether the people reporting the number had an incentive to make it look good. Every adjacent assurance regime answers this the same way. SOC 2, ISO 27001, and the newer ISO/IEC 42001 AI management system standard are all certified by an accredited third party precisely so that an independent party has verified the claims rather than the vendor asserting them (ISO/IEC, 2023). The independence is the product.
This is also the one requirement no eval tool can ship as a feature. Galileo, Patronus, Arize Phoenix, Confident AI's DeepEval, and Future AGI are all designed to be operated by the team that owns the AI system, which is the correct design for development-loop evaluation. By construction, an internally operated tool cannot supply independence, no matter how good its metrics are. Independence is structural, not a setting (the distinction between a first-party tool and an independent audit is treated in full elsewhere).
Statistical significance: confidence intervals and paired tests
The reader's second question is whether the number is real or noise. The information retrieval research community settled this two decades ago, and the methods have been stable since. There is little practical difference between the randomization test, the bootstrap, and the paired t-test for comparing retrieval systems, while the Wilcoxon and sign tests detect significance poorly and risk false positives (Smucker et al., 2007). Bootstrap techniques produce confidence intervals that quantify how much a metric like mean average precision would move under test-collection variability (Cormack & Lynam, 2006), and bootstrap hypothesis testing extends naturally to comparing evaluation metrics themselves (Sakai, 2006). The approach remains the live standard: empirical bootstrapping is still the baseline that newer retrieval confidence-interval methods are measured against (Oosterhuis et al., 2024).
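As a concrete illustration of one of the tests named above, here is a minimal sketch of a paired randomization test in its sign-flip form. It assumes per-query scores for both systems over the same queries, in the same order; the function name, the default of 10,000 permutations, and the fixed seed are illustrative choices, not values prescribed by the cited papers.

```python
import random

def paired_randomization_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided randomization test on per-query score differences.

    Under the null hypothesis the two systems are interchangeable, so the
    sign of each per-query difference is as likely to be + as -. The p-value
    is the share of random sign assignments whose mean difference is at
    least as extreme as the observed one.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_permutations):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # The +1 terms count the observed labeling itself, so p is never exactly 0.
    return (hits + 1) / (n_permutations + 1)
```

A small p-value says the observed difference would be rare if the two systems were interchangeable. It does not say how large the difference is, which is why a confidence interval is usually reported next to it.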
The practical version is short. A single metric deserves a confidence interval, computed by resampling the query set with replacement, usually over a large number of bootstrap replicates (the mechanics are the subject of a companion piece on bootstrap confidence intervals for NDCG). A comparison between two configurations deserves a paired test, because each query produces one score under configuration A and one under configuration B, and the per-query difference is the unit that carries the signal. Unpaired comparisons throw that pairing away, so query-to-query variation in difficulty is counted as noise and the test loses power.
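Both halves of that paragraph fit in a few lines. The sketch below assumes the percentile bootstrap (other interval constructions exist), per-query metric scores already computed, and hypothetical function names and defaults; it is meant to show the shape of the computation rather than a definitive implementation.

```python
import random

def _percentile(sorted_values, q):
    """Value at quantile q (0 to 1) of an already sorted list, nearest rank."""
    idx = round(q * (len(sorted_values) - 1))
    return sorted_values[idx]

def bootstrap_ci(per_query_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean per-query score.

    Each replicate redraws the whole query set with replacement and
    recomputes the mean; the interval is read off the sorted replicate means.
    """
    if not per_query_scores:
        raise ValueError("need at least one per-query score")
    rng = random.Random(seed)
    n = len(per_query_scores)
    means = sorted(
        sum(rng.choices(per_query_scores, k=n)) / n for _ in range(n_resamples)
    )
    return _percentile(means, alpha / 2), _percentile(means, 1 - alpha / 2)

def paired_bootstrap_diff(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean per-query difference (B minus A) with its bootstrap interval.

    Resampling the differences keeps each query's A and B scores together,
    which is what makes the comparison paired.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    low, high = bootstrap_ci(diffs, n_resamples, alpha, seed)
    return sum(diffs) / len(diffs), low, high
```

If the interval returned by paired_bootstrap_diff excludes zero, the difference is unlikely to be resampling noise at the chosen level; if it straddles zero, the point estimate alone does not justify the change.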
This is the layer the first-party tools do not ship. The retrieval metrics themselves are increasingly available natively: Arize Phoenix calculates NDCG and MRR on retrieved spans (Arize, 2025), and Recall@k, Precision@k, NDCG@k, MRR, and Hit Rate are documented as built-in metrics elsewhere in the category (Future AGI Docs, 2026). Galileo, Patronus, and DeepEval lean more on LLM-judged contextual metrics that resolve to point scores and pass-or-fail thresholds (Galileo, 2025; Patronus AI, 2025; Confident AI, 2025). Across all five, the comparison feature is a side-by-side diff of point estimates against a baseline, not a confidence interval or a paired significance test (Arize, 2025; Galileo, 2025; Patronus AI, 2025). The metric is there. The inferential statistics that tell a stranger whether to believe it are not.
Documented, reproducible methodology
The reader's third question is whether they could reproduce the result if they had to defend it to their own boss, their own auditor, or their own board. This is the requirement the regulators name most explicitly, and it is the same demand for prior-defined metrics that the EU AI Act makes (noted above). A mature measurement function, in the NIST framing, means evaluation processes that are objective, repeatable or scalable, and documented (NIST, 2023). ISO/IEC 42001 goes further on the paperwork: documentation must be versioned and traceable, and a procurement response lacking a current signed policy or an up-to-date register is treated as a silent failure (ISO/IEC, 2023).
For retrieval specifically, a documented methodology means three things a dashboard rarely surfaces. It means a judgment construction protocol: how relevance was defined, on what scale, by how many annotators, and with what agreement between them, following the pooled, graded-relevance approach the field has used since the early TREC collections (Voorhees, 2000). It means the sampling math: how queries were drawn, and whether the sample reflects the production distribution or a convenient subset. And it means a calibration procedure for any automated judgments, so a reader knows the labels were checked against human ground truth rather than assumed correct.
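Two of those items reduce to small, checkable computations. The sketch below shows an unweighted Cohen's kappa for two annotators and a proportional stratified draw from production queries. The function names and the strata are hypothetical, and for a graded relevance scale a weighted kappa may be the better choice; treat this as an illustration of what "documented" can mean in code.

```python
import random
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: agreement between two annotators beyond chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same grade
    # if each labeled independently at their own observed rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical grade throughout
    return (observed - expected) / (1.0 - expected)

def stratified_sample(queries_by_stratum, total, seed=0):
    """Draw queries from each stratum in proportion to its share of traffic.

    queries_by_stratum: dict mapping a stratum name to its list of queries.
    Rounding means the drawn total can differ slightly from `total`.
    """
    rng = random.Random(seed)
    population = sum(len(qs) for qs in queries_by_stratum.values())
    sample = {}
    for stratum, qs in queries_by_stratum.items():
        k = min(len(qs), max(1, round(total * len(qs) / population)))
        sample[stratum] = rng.sample(qs, k)
    return sample
```

The same cohens_kappa function can serve the calibration step: score a subset with the automated judge, score it again by hand, and report the agreement between the two label sets alongside the inter-annotator figure. Recording the seed and the stratum definitions is what lets a third party redraw the same sample.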
What a procurement-grade statement actually looks like
The difference is easiest to see side by side. The numbers below are illustrative, constructed to show format rather than to report a real system.
A dashboard screenshot says:
NDCG@10: 0.71 | MRR: 0.62
A procurement-grade statement says:
NDCG@10 = 0.71, 95% bootstrap confidence interval [0.66, 0.76], over n = 200 queries drawn by stratified sampling from 30 days of production traffic. Against the prior production retriever, the paired bootstrap difference in NDCG@10 is +0.03, 95% CI [0.005, 0.055], p = 0.013. Relevance was judged on a 4-point graded scale by two annotators over the pooled top-10 results of three retriever variants, with Cohen's kappa = 0.74. Labeling protocol and sampling code attached.
The first version answers no questions for a stranger. The second answers all three structural requirements in a paragraph: it is independent if a third party produced it, it carries the statistical layer, and it documents the method well enough to reproduce. Every parameter in it has a basis in standard practice (Smucker et al., 2007; Cormack & Lynam, 2006; Voorhees, 2000). A founder who can put the second version on the table changes the diligence conversation from "convince me" to "noted, what's next."
Practical sequencing: before the conversation, not during
An independent retrieval audit takes days. Remediation does not. The common findings, a chunking strategy change, a reranker swap, an expansion of the judgment set toward procurement scale, take weeks to implement and re-measure. Run the audit during the procurement conversation and any finding that requires a fix puts the team in a corner with no runway to act on it. Run it too early and the system, and the methodology around it, can drift before the conversation happens.
The window that absorbs this comfortably is four to twelve weeks ahead of the conversation. For a fundraise, that means the memo is ready before the data room opens. For an enterprise pilot, it means the evaluation is on the table by the second technical meeting rather than scrambled together after procurement asks for it. The point of running it ahead is not to have a document. It is to have time to act on what the document says.
The question "how do you evaluate retrieval" is going to get asked. The teams that treat it as a procurement artifact to be prepared in advance, rather than a dashboard to be screenshotted on demand, are the ones for whom the conversation gets easier rather than harder.
Related reading
- Paired significance tests for retrieval changes: why a per-query paired test, not the point estimate, decides whether a difference is real.
- The 4-12 week window for a pre-funding retrieval audit: when to run the audit so a finding can still be fixed before the conversation.
TensorOpt runs independent, fixed-fee retrieval audits that produce exactly the kind of statement above: NDCG, MRR, and hit-rate@k with bootstrap confidence intervals and paired significance tests, a judgment set you keep, and a documented methodology a third party can reproduce. Download the sample report to see the format on a public benchmark, or read the methodology in full in Designing Hybrid Search Systems.
Laszlo Csontos
Author of Designing Hybrid Search Systems (Leanpub, 2026). Practitioner background in production hybrid search, embeddings, cross-encoder reranking, and retrieval evaluation.