For AI Startups

Independent retrieval audits for vertical AI startups.

Built for founders, CTOs, and Heads of ML/Eval at 5-30 person AI startups (Series Seed through Series B) preparing for a fundraise or a first enterprise pilot.

You are within 4-12 weeks of one of these conversations

  • An investor in due diligence asking how you evaluate retrieval.
  • An enterprise customer's procurement team asking for evaluation documentation.
  • A competitive benchmark surfacing (LM Arena, internal customer-led comparison).
  • Your first major enterprise pilot.

Internal dashboards from the team that built the system answer none of these questions in a way the audience trusts.

$7,500 fixed
Retrieval Audit Sprint
7 business days
Founder, CTO, Head of ML/Eval at 5-30 person AI startups (Series Seed through Series B)

Independent retrieval pipeline audit for one production RAG system: chunking, embedding model, hybrid retrieval, reranking, and evaluation methodology.

Deliverables

  • Written report (18 pages): pipeline review, methodology review, benchmark results, three remediation recommendations with effort estimates
  • Custom benchmark dataset (100-200 query/document pairs with relevance judgments; client retains)
  • Measurement results: NDCG@10, MRR, hit-rate@k, and faithfulness scoring on a representative generation sample, each with bootstrap 95% confidence intervals
  • Methodology code (Python notebook, runnable against your pipeline post-delivery)
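
As a rough illustration of the measurement results listed above, the sketch below computes NDCG@10, MRR, and hit-rate@k per query and wraps each mean in a percentile bootstrap 95% confidence interval by resampling queries. It is a minimal standard-library sketch, not the delivered notebook: the function names, the toy `qrels` and `run` dictionaries, and the choice of a percentile bootstrap with 2,000 resamples are all assumptions made for illustration.

```python
import math
import random


def ndcg_at_k(ranked_ids, judgments, k=10):
    """NDCG@k for one query; judgments maps doc id -> graded gain (missing = 0)."""
    dcg = sum(judgments.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:k]))
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked_ids, judgments):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if judgments.get(doc, 0) > 0:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked_ids, judgments, k=10):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(judgments.get(doc, 0) > 0 for doc in ranked_ids[:k]) else 0.0


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling queries with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(scores) / n, lower, upper


# Hypothetical toy data: relevance judgments and one ranked list per query.
qrels = {
    "q1": {"d1": 2, "d3": 1},
    "q2": {"d4": 1},
    "q3": {"d8": 2},
    "q4": {"d5": 1},
}
run = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d9", "d4", "d5"],
    "q3": ["d6", "d7", "d8"],
    "q4": ["d2", "d1"],
}

metrics = {"NDCG@10": ndcg_at_k, "MRR": reciprocal_rank, "hit-rate@10": hit_at_k}
for name, metric in metrics.items():
    per_query = [metric(run[q], qrels[q]) for q in qrels]
    point, lower, upper = bootstrap_ci(per_query)
    print(f"{name}: {point:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

With only four toy queries the intervals are very wide; a judgment set in the 100-200 pair range would narrow them, which is why the interval is reported alongside each point estimate.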

Process

  1. 30-min discovery call
  2. SOW + 50% deposit
  3. Days 1-2: pipeline access and judgment set construction
  4. Days 3-5: measurement runs
  5. Days 6-7: report writing and walkthrough
  6. Final invoice on delivery
Payment: 50% on signature, 50% on delivery. Net-15.

Why this exists

AI procurement now mirrors traditional software buying, with more rigorous evaluations, hosting considerations, and security checklists (Andreessen Horowitz, 2025 CIO Survey). External benchmarks, LM Arena among those cited, are part of the gate. The audit is built to serve as the procurement-grade evaluation memo that goes into the data room, alongside model evals and security questionnaires.

Vertical AI startups hire ML, Backend, Forward Deployed, and Applied Research engineers, not dedicated IR or Search Relevance engineers (visible in public job postings at Hebbia, Vectara, and Glean). Even where founders bring deep IR backgrounds, IR is not a continuous headcount line item; 5-30 person teams need outside rigor on a periodic basis.

Procurement velocity matches founder velocity. At Series Seed through Series A, founder or CTO signature authority typically clears purchases below $25K in 24-72 hours. Existing AI infrastructure spend already covers the incremental eval workload during the audit; no new budget line is required.

How TensorOpt differs

Bootstrap confidence intervals and paired significance tests are not in the feature set of Galileo, Patronus, Arize, or FutureAGI. They publish NDCG, MRR, and Precision@k as platform metrics; I apply classical IR statistical rigor on top, and I'm not grading my own SaaS.
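
To make "paired significance test" concrete, here is a minimal sketch of one classical option from the IR literature: a paired randomization (sign-flip) test on per-query score differences between two pipeline variants. The function name, the 10,000-permutation default, and the per-query NDCG@10 values are hypothetical; the audit may use a different paired test.

```python
import random


def paired_randomization_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided p-value for the mean per-query difference (B minus A).

    Under the null hypothesis the two systems are interchangeable, so the
    sign of each per-query difference can be flipped at random.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / n) >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return sum(diffs) / n, p_value


# Hypothetical per-query NDCG@10 for a baseline and a reranked variant,
# scored on the same twelve queries in the same order.
baseline = [0.41, 0.55, 0.30, 0.62, 0.48, 0.71, 0.39, 0.58, 0.44, 0.66, 0.52, 0.35]
reranked = [0.47, 0.60, 0.38, 0.65, 0.55, 0.74, 0.45, 0.61, 0.52, 0.70, 0.59, 0.43]

delta, p_value = paired_randomization_test(baseline, reranked)
print(f"mean NDCG@10 delta = {delta:+.3f}, p = {p_value:.4f}")
```

Pairing matters because query difficulty varies far more than system quality does; comparing the two systems query by query removes that shared variance, so a small but consistent gain can be detected on a judgment set of modest size.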

Competitor: How TensorOpt differs

  • Galileo: Independent third-party audit, not first-party tooling. Bootstrap confidence intervals and paired significance tests are not in their feature set.
  • Patronus AI: Patronus is a tools vendor. TensorOpt produces outside-counsel-style assessments with statistical-significance discipline.
  • Arize Phoenix: Phoenix is observability + eval tooling. TensorOpt is a point-in-time independent audit with bootstrap CIs that Phoenix does not ship as a feature.
  • FutureAGI: Metrics-as-platform-features, not independent audit. No published bootstrap-CI or paired significance testing.
  • Confident AI / DeepEval: Library, not services. Useful tool to use inside a TensorOpt audit, not a competitor for the audit itself.
  • Vectara: Vendor self-eval is not independent third-party review.
  • OpenSource Connections, Sease, Pureinsights: Enterprise tier; not productized for the 5-30 person AI startup buyer at the $7,500-$20,000 SKU level.

Built by a practitioner

Built and delivered by Laszlo Csontos, author of Designing Hybrid Search Systems (Leanpub, 2026). Previously: production search at scale in a prior platform role, spanning embeddings, hybrid fusion, and taxonomy inference.

Sample report

Sample report on BEIR / FiQA-2018 (18 pages). Walks through pipeline review, judgment-set construction, bootstrap-CI measurement, and prioritized remediations.

Sample report built on public benchmark data. Real client engagements produce identical structure on the client's own retrieval pipeline.

Who this is not for

Not a fit for teams pre-product without a live retrieval pipeline, or for enterprise buyers seeking a multi-month build engagement. This is an independent diagnostic, not an implementation partner.

Available after the audit

  • Production-Grade Eval Harness
    $20,000, 4 weeks

    Stratified datasets, CI integration, and a statistical-rigor dashboard wired into your stack so retrieval regressions block deploys.

  • Fractional IR Advisor Retainer
    $7,500/mo, 3-month minimum, ongoing

    An independent IR reviewer kept on the team for monthly metric reviews and pre-pilot evaluation refreshes.

Ready for an independent audit?

30-minute discovery call covers your pipeline, your eval methodology, and the trigger that brought you here.