The 4-12 Week Window for a Pre-Funding Retrieval Audit
Run an audit too late and you cannot fix what it finds. Run it too early and the system drifts before the conversation. The usable window is 4 to 12 weeks out.
A team building a retrieval system tends to think about evaluation as something it does continuously, on a dashboard, forever. That is the right frame for development and the wrong frame for the moments that decide whether the company gets funded or lands the enterprise account. Those moments have a date attached: the investor meeting, the data room opening, the second technical call with a prospect's platform team. An evaluation that proves the system works is only useful if it lands before that date with enough runway to fix whatever it surfaces.
The window is narrow on both ends. Run the audit too late and the team discovers a remediation requirement it cannot finish in time, which is worse than not knowing, because now it is walking into the conversation with a documented weakness and no fix. Run it too early and the system drifts before anyone reads the memo, so the number being quoted no longer describes the system that ships. The usable window sits between those two failures, and for most teams it is four to twelve weeks before the conversation.
Why the conversation now demands evidence, not assertion
This timing matters because the buyers and investors on the other side of the table stopped accepting "our internal evals look good" as an answer.
Enterprise AI procurement now mirrors traditional software buying, with more rigorous evaluations and benchmark scrutiny, and the maturation of the market has pushed companies to reference external benchmarks as an initial filter much as they have long used analyst reports (a16z, 2025). The buyer is no longer asking whether a vendor believes its system is good; it is asking the vendor to demonstrate it in a form the buyer's own evaluation process can verify. (The specific properties that form needs to carry are the subject of a separate piece on what investors look for when they ask how you evaluate retrieval.) On the acquisition and investment side, the bar moved in the same direction. AI-focused transactions increasingly require deeper technical due diligence, with buyers asking how models were trained and how performance holds up under red-teaming, edge-case testing, and scaling (Skadden, 2026). Model performance has become a diligence line item, not a footnote.
There is a reason the scrutiny hardened. Over 40 percent of agentic AI projects are projected to be canceled by the end of 2027, driven by escalating costs, unclear business value, or inadequate risk controls (Gartner, 2025). Buyers and investors have watched this happen, and independent, statistically grounded evidence is what separates a system that works from a system that demos. The audit methodology TensorOpt uses, documented in the measurement chapters of Designing Hybrid Search Systems, is built to produce exactly that: graded judgments, bootstrap confidence intervals, and paired significance tests against a documented baseline.
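To make the paired test concrete, here is a minimal sketch of a percentile bootstrap over per-query score differences. It is an illustration under stated assumptions, not the audit's actual implementation: it assumes per-query NDCG@10 values have already been computed for a baseline and a candidate on the same queries, and the function name `paired_bootstrap_ci`, the resample count, and the sample scores are all hypothetical.

```python
import random
from statistics import mean


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-query difference (candidate - baseline).

    Both lists must hold one score per query, in the same query order.
    Returns (point_estimate, ci_low, ci_high).
    """
    if len(baseline) != len(candidate):
        raise ValueError("scores must be paired: one per query for each system")
    if not baseline:
        raise ValueError("need at least one query")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)

    # Resample queries with replacement and record the mean difference each time.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )

    low_idx = int(n_resamples * alpha / 2)
    high_idx = n_resamples - low_idx - 1
    return mean(diffs), resampled_means[low_idx], resampled_means[high_idx]


# Hypothetical per-query NDCG@10 scores for two configurations.
baseline_scores = [0.62, 0.48, 0.71, 0.55, 0.40, 0.66, 0.59, 0.73]
candidate_scores = [0.64, 0.47, 0.75, 0.58, 0.41, 0.65, 0.63, 0.74]

point, ci_low, ci_high = paired_bootstrap_ci(baseline_scores, candidate_scores)
print(f"mean difference {point:+.3f}, 95% CI [{ci_low:+.3f}, {ci_high:+.3f}]")
```

If the interval contains zero, the data do not distinguish the candidate from the baseline at that confidence level, however encouraging the point estimate looks.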
The math of the lower bound
The four-week floor is arithmetic. A focused retrieval audit runs about seven days; the recommendations it produces take roughly two to six weeks to remediate, depending on what it finds. Seven days plus two to six weeks is three to seven weeks of total work, so starting four weeks out leaves about three weeks for remediation, enough for the common case but not the worst one. The floor sits at four weeks rather than one because the audit's output splits into two groups with very different clocks: fast fixes need a few days, slow fixes need most of the window. An audit that starts with less than four weeks to go leaves no room for the slow group.
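The arithmetic can be written out as a short sketch. The seven-day audit and the two-to-six-week remediation range come from the paragraph above; the function names are hypothetical and exist only for illustration.

```python
AUDIT_DAYS = 7  # length of a focused audit, per the text


def total_weeks(remediation_weeks, audit_days=AUDIT_DAYS):
    """Audit plus remediation, expressed in weeks."""
    return audit_days / 7 + remediation_weeks


def remediation_weeks_left(weeks_before_date, audit_days=AUDIT_DAYS):
    """Weeks available for fixes once the audit itself has finished."""
    return max(0.0, weeks_before_date - audit_days / 7)


def can_finish(weeks_before_date, remediation_weeks, audit_days=AUDIT_DAYS):
    """True if audit plus remediation fits before the conversation date."""
    return total_weeks(remediation_weeks, audit_days) <= weeks_before_date


# Best and worst case from the text: 2 to 6 weeks of remediation.
print(total_weeks(2), total_weeks(6))   # 3.0 7.0

# Starting four weeks out leaves three weeks for fixes.
print(remediation_weeks_left(4))        # 3.0
print(can_finish(4, 2))                 # True: a two-week fix fits
print(can_finish(4, 6))                 # False: a six-week fix does not
```

The same check shows why one week out is not a floor at all: the audit consumes the entire week and nothing remains for even the fast fixes.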
Fixes that take days
The first group is configuration-level work that does not require retraining a model or re-indexing the whole corpus from scratch.
Chunking strategy is the clearest example. How a corpus is segmented accounts for substantial variance in retrieval quality, with smaller chunks of 64 to 128 tokens favoring concise, fact-based answers and larger chunks favoring queries that need broader context (Bhat et al., 2025, preprint). Re-chunking re-embeds the corpus with the same model, but it does not change the model, the vector dimensions, or the index schema, so the re-embed is the only heavy step. Adding or swapping a cross-encoder reranker is similarly contained, an integration change rather than a retraining project, and one that can move ranking quality materially: a production-oriented benchmark reported a 14 percent gain over the next-best reranker on a question-answering task (Meng et al., 2024, NVIDIA-authored, vendor benchmark). Query rewriting is the same shape, a step inserted in front of a frozen retriever; simple LLM-based rewriting has been measured to deliver the strongest aggregate reduction in retriever bias across six retrievers (Goyal et al., 2026, preprint). Retrieval top-k adjustments and prompt-template changes round out the group, and both iterate in hours.
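A minimal sketch shows why chunking counts as configuration-level work: chunk size is a parameter, and changing it leaves the embedding model untouched. The sketch assumes whitespace splitting as a stand-in for a real tokenizer (a production system would count tokens with the embedding model's own tokenizer), and the function name `chunk_tokens` and the overlap parameter are hypothetical.

```python
def chunk_tokens(tokens, chunk_size=128, overlap=0):
    """Split a token list into fixed-size chunks, optionally overlapping."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end, so no redundant tail is emitted.
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


# Whitespace splitting is an assumption made to keep the sketch self-contained.
document = "word " * 300
tokens = document.split()

small = chunk_tokens(tokens, chunk_size=64)    # favors concise, fact-based answers
large = chunk_tokens(tokens, chunk_size=128)   # keeps more surrounding context
print(len(small), len(large))                  # 5 3
```

Only the re-embed that follows a change to `chunk_size` is expensive; the chunking step itself is a single parameter swap.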
Fixes that take weeks
The second group is the reason the floor is four weeks rather than one.
Embedding model fine-tuning is the clearest case, and the reason it lands in this group rather than the fast one is the work that surrounds the re-embed. A chunking change re-embeds with the existing model; an embedding change first requires a fine-tune or training loop to produce the new model, can shift the vector dimensions, and then forces not just a re-embed but a rebuild of the nearest-neighbor index and a re-validation that the new space still ranks correctly (Vejendla, 2025, preprint). The training run can be cheap; the surrounding migration is what consumes the calendar time.
Expanding the judgment set to a credible scale is the item teams consistently underestimate. A stable evaluation needs enough graded query-document pairs that the metric measures signal rather than sampling noise. The classic guidance from information retrieval is that a sound experiment needs at least 25 queries and that 50 is better (Buckley and Voorhees, 2000). Build a graded set at that scale with multiple judgments per query and the result is on the order of a thousand assessed pairs, constructed by hand or under careful supervision. What counts as enough varies widely; the BEIR benchmark, built to test retrieval models across heterogeneous domains, spans datasets from roughly one relevant document per query to nearly 500 (Thakur et al., 2021). A judgment set sized for the system actually being sold is closer to a research artifact than a config change, and it cannot be conjured in the final week. Building the evaluation into continuous integration is the third slow item: an automated pipeline that runs representative and edge-case queries on every change is a multi-component build (Google Cloud, 2025), and not done in an afternoon.
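To show what the graded pairs feed into, here is a minimal sketch of NDCG@k computed from graded judgments, followed by the sizing arithmetic. It makes several assumptions: gains are linear in the grade (another common variant uses 2^grade - 1), unjudged documents count as grade 0, and the depth of 20 judged documents per query is a hypothetical figure chosen only because it lands on the thousand-pair order of magnitude mentioned above.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: document ids in the order the system returned them.
    judgments: dict mapping doc id -> graded relevance (e.g. 0 to 3).
    Unjudged documents are treated as grade 0, which is an assumption.
    """
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(judgments.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments for a single query.
judgments = {"doc_a": 3, "doc_b": 2, "doc_c": 0, "doc_d": 1}
print(ndcg_at_k(["doc_a", "doc_b", "doc_d", "doc_c"], judgments))  # ideal order
print(ndcg_at_k(["doc_c", "doc_d", "doc_b", "doc_a"], judgments))  # worst order

# Sizing: 50 queries at a hypothetical depth of 20 judged documents each.
queries, judged_per_query = 50, 20
print(queries * judged_per_query)  # 1000 graded pairs
```

Every one of those pairs needs a human or carefully supervised grade, which is why the judgment set belongs with the slow fixes.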
Stack the slow group against a seven-day audit and the floor resolves. If the memo surfaces a judgment-set gap or an embedding change, the team needs the runway to close it before the date. Less than four weeks forces a choice between shipping the fix half-finished and walking in with a known weakness.
The math of the upper bound
The twelve-week ceiling is a different problem. It is not about whether the team can fix things in time; it is about whether the thing it measured is still the thing it ships.
Retrieval systems drift. The foundational treatment in machine learning describes how model quality degrades as the underlying data distribution shifts, distinguishing sudden drift from the slower, gradual kind (Gama et al., 2014). In a retrieval system the drift is concrete. The corpus moves as new documents arrive and old ones expire, so a model tuned on last quarter's corpus can represent this quarter's corpus worse even with identical weights (TianPan, 2026). Even an embedding-model version bump produces measurable shifts in recall that have to be re-measured rather than assumed (Vejendla, 2025, preprint).
The consequence is a shelf life. Picture a knowledge-base RAG system whose support corpus refreshes monthly. An audit run against the January index reports one NDCG@10; by the time a data room opens three refreshes later, the index the system actually serves has changed enough that the January number describes a system that no longer exists. The figure is not wrong so much as expired. Twelve weeks is the working horizon for that staleness: an audit should be recent enough that the system under evaluation is recognizably the system in the room.
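Both bounds can be turned into calendar dates. The sketch below assumes the four-week floor and twelve-week ceiling described in this piece; the function names and the example date are hypothetical.

```python
from datetime import date, timedelta


def audit_window(conversation_date, min_weeks=4, max_weeks=12):
    """Return (earliest, latest) audit start dates for a given conversation date."""
    earliest = conversation_date - timedelta(weeks=max_weeks)
    latest = conversation_date - timedelta(weeks=min_weeks)
    return earliest, latest


def in_window(audit_start, conversation_date, min_weeks=4, max_weeks=12):
    """True if an audit starting on audit_start falls inside the usable window."""
    earliest, latest = audit_window(conversation_date, min_weeks, max_weeks)
    return earliest <= audit_start <= latest


# Hypothetical data room opening.
data_room_opens = date(2026, 9, 1)
earliest, latest = audit_window(data_room_opens)
print(earliest, latest)  # 2026-06-09 2026-08-04

print(in_window(date(2026, 7, 1), data_room_opens))   # True: inside the window
print(in_window(date(2026, 8, 20), data_room_opens))  # False: too late to remediate
print(in_window(date(2026, 4, 1), data_room_opens))   # False: the number will be stale
```

For a corpus that refreshes on a known schedule, the ceiling can be tightened by passing a smaller `max_weeks`.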
Sequencing for a fundraise
The fundraising calendar gives the audit a natural slot. The median interval between a seed round and a Series A reached about 616 days, a little over 20 months, in mid-2025 (Carta, 2025), so by the time a team is actively raising, the technical story has accumulated both progress and drift. The active raise has a shorter internal structure: preparation over months, then investment materials assembled over roughly one to two months, then the meetings and diligence that close the round (M13, n.d.).
The audit belongs at the start of that materials phase. Run it as the prep window opens, and the memo is ready before the data room is. When diligence reaches the technical questions, the evaluation evidence is already in the room, statistically framed and independently produced, rather than scrambled together while a partner waits. Twelve weeks out is early enough to remediate; four weeks out is the latest point at which a surfaced problem can still be closed before the documents go live.
Sequencing for an enterprise pilot
The enterprise pilot runs a tighter clock with the same logic. A well-scoped AI proof of concept is widely treated as a four-to-eight-week exercise, with shorter efforts unable to use real production data and longer ones signaling scope creep. The buyer's broader procurement cycle is longer still; enterprise software deals routinely run 90 to 180 days or more (Optifai, 2025), and AI procurement now front-loads evaluation, using external evidence as the initial filter before deeper technical engagement (a16z, 2025).
That front-loading is the opening. Run the audit the week the pilot conversation starts, and the memo is on the table by the second technical meeting, the point where the buyer's platform team starts probing whether the system holds up. Arriving with independent NDCG measurements and confidence intervals, rather than a dashboard screenshot from the team that built the system, changes the character of the conversation. It is the difference between asking the buyer to trust the vendor and handing the buyer something its own evaluators can check.
The point
The window is four to twelve weeks because both failure modes are real. Inside four weeks, an audit that finds a real problem leaves no room to fix it. Beyond twelve, the system drifts out from under the memo. The procurement and investment world has hardened enough that a team will be asked for this evidence whether or not it prepared it (a16z, 2025; Skadden, 2026). The only choice the team controls is whether the evidence is ready and recent when the date arrives.
Related reading
- Bootstrap confidence intervals for NDCG: the statistical layer the memo carries, and why a bare point estimate fails a skeptical reader.
- Paired significance tests for retrieval changes: how the memo shows a configuration change is a real improvement rather than noise.
The sample report shows what a procurement-grade retrieval memo looks like: an independent NDCG measurement on a public benchmark, with bootstrap confidence intervals and paired significance tests, in the form a buyer or investor's evaluators expect. Download it to see the artifact, then book a 30-minute call to scope the audit against a specific window, whether that window is a data room opening in ten weeks or a pilot conversation starting next week. The measurement chapters of Designing Hybrid Search Systems document the full methodology behind the memo.
Laszlo Csontos
Author of Designing Hybrid Search Systems (Leanpub, 2026). Practitioner background in production hybrid search, embeddings, cross-encoder reranking, and retrieval evaluation.