Services

Two fixed-fee engagements designed to land in 7-10 business days, then a ladder of remediation and retainer options once I know what your pipeline actually needs.

Start here: the diagnostic engagements

Pick the engagement that matches your buyer. Both are fixed-fee, time-boxed, and produce a written report plus a prioritized remediation roadmap.

$7,500Fixed-fee
Retrieval Audit Sprint
7 business daysFounder, CTO, Head of ML/Eval at 5-30 person AI startups (Series Seed through Series B)

Independent retrieval pipeline audit with bootstrap-CI metrics and a runnable Python notebook you keep.

$7,500-$9,500Fixed-fee
Search Relevance Diagnostic
10 business daysVP Ecommerce, Director of Digital, Head of Site Search at $20M-$200M GMV Shopify Plus / Adobe Commerce brands

Independent NDCG@10 + conversion-correlated relevance diagnostic against your existing search vendor.

What happens after the diagnostic

When the diagnostic surfaces work worth doing, these are the named remediation SKUs the engagement can extend into.

Tier 2$20,000
Production-Grade Eval Harness
4 weeksAI startups that need procurement-grade evaluation infrastructure wired into CI before fundraising or enterprise pilots.

Deliverables

  • Stratified dataset (500-1,500 query/document pairs by head/torso/tail and query intent)
  • CI integration (GitHub Actions or GitLab CI) with configurable thresholds that block deploys on regressions
  • Dashboard: NDCG@k, MRR, hit-rate@k, faithfulness, custom metrics, per-query-class breakdowns, regression alerts
  • Statistical rigor layer: bootstrap confidence intervals and paired significance tests
  • Methodology runbook plus 1-2 weeks of Slack/email handoff support

Why $20K

The deliverable is the procurement-grade evaluation infrastructure that investors and enterprise procurement increasingly ask for as part of AI due diligence (Andreessen Horowitz, 2025 CIO Survey). Against a Series B round or a first enterprise contract in six to seven figures, the fee is a rounding error on the outcome the harness defends. Fixed-fee, four weeks, deliverable ships on the SOW date.

Tier 2$25,000
Q4 Readiness Audit
4 weeksMid-market brands preparing for BFCM with one search vendor in place and an open question about whether to renew or replace.

Deliverables

  • Multi-vendor benchmark (NDCG@10, click-position-1, conversion-correlated relevance) on top 1,000-2,000 head and torso queries against 1-2 alternative search vendors
  • Full merchandising review: boost/bury rules, synonyms, redirects, banners, manual curation overlays
  • Q4 stress test against last year's BFCM query mix, promotion-aware relevance, out-of-stock handling
  • A/B test design for Q3 with hypothesis, sample-size math, and success criteria
  • 60-minute executive presentation

Why Q4 Readiness

Adobe Analytics reported double-digit YoY growth in BFCM 2025 dollars and a meaningful conversion lift among AI-influenced shoppers. A diagnostic delivered in Q1-Q2 feeds directly into the Q3 testing window before peak.

Tier 2$25,000 - $75,000
Search Relevance Optimization
4-8 weeksTeams with an existing production search stack that need a measurable lift in retrieval quality without rebuilding the platform.

Deliverables

  • Hybrid retrieval implementation (BM25 plus dense vectors with reciprocal rank fusion or learned fusion)
  • Query understanding layer covering intent classification and entity extraction on the client query log
  • Cross-encoder reranking pipeline over the top-N candidates from the hybrid stage
  • Relevance evaluation framework with NDCG@10, MRR, and recall reported with bootstrap confidence intervals (DHSS, statistical-rigor methodology)
  • A/B testing infrastructure with hypothesis, sample-size math, and stop conditions

Why $25K-$75K

Hybrid retrieval and cross-encoder reranking typically add 5-15 NDCG@10 points over a tuned BM25 baseline on out-of-domain benchmarks (Thakur et al., BEIR, NeurIPS 2021; Santhanam et al., ColBERTv2, NAACL 2022). The engagement delivers production code, an evaluation harness, and a documented handoff, not a notebook proof of concept. The price band tracks scope: query log size, number of indexed corpora, and whether reranking ships behind a feature flag or as default.

Tier 3$50,000 - $150,000
Custom Embedding Development
6-12 weeksTeams running generic API embeddings on a specialized corpus where domain terminology, product attributes, or user language is not well covered by general-purpose models.

Deliverables

  • Domain-adapted embedding model fine-tuned on client query-document pairs (sentence-transformers / SBERT methodology; Reimers & Gurevych, EMNLP 2019)
  • Training pipeline on proprietary data with held-out evaluation splits stratified by query class
  • Benchmark of the fine-tuned model against the prior generic baseline on a held-out test set, reported with bootstrap confidence intervals
  • Deployment artifacts: serving image, embedding-version registry, and rollback procedure
  • Retraining runbook with cadence and trigger criteria tied to data drift

Why $50K-$150K

Domain fine-tuning of embedding models is a published lift mechanism for specialized retrieval: SBERT (Reimers & Gurevych, EMNLP 2019) is the canonical reference, and the MTEB leaderboard (Muennighoff et al., EACL 2023) shows domain-specialized encoders consistently outperform general-purpose models on domain tasks. Scope covers training data curation, model selection, fine-tuning runs, evaluation, and a deployable serving path. The price band tracks dataset size, whether contrastive pairs need to be mined from logs or supplied as labels, and whether the model is served self-hosted or behind a managed inference endpoint.

Tier 4$75,000 - $150,000
RAG Pipeline Development
8-12 weeksTeams building an AI assistant, knowledge system, or document Q&A product where retrieval quality is the dominant lever on output quality and procurement now asks for the eval methodology.

Deliverables

  • End-to-end RAG architecture with documented component boundaries (indexing, retrieval, reranking, generation, guardrails)
  • Hybrid retrieval layer with cross-encoder reranking over the top candidate set
  • Chunking strategy and metadata schema derived from the client corpus, not a default
  • Evaluation framework covering retrieval quality (NDCG@k, recall, MRR) and generation quality (faithfulness, answer relevance) with bootstrap confidence intervals
  • Production deployment with observability hooks and a regression alerting path

Why $75K-$150K

Retrieval is the dominant failure mode in production RAG: a recent due-diligence review of RAG evaluation practice (Martinon et al., arXiv 2507.21753, 2025) frames retrieval rigor as the prerequisite to defensible generation metrics, and DHSS (the hybrid-retrieval methodology underlying this engagement) treats the retrieval layer as the testable contract that downstream generation depends on. Scope covers the full pipeline end-to-end, with the retrieval and evaluation layers built to the same standard as the Production-Grade Eval Harness. The price band tracks corpus complexity, number of document types, and whether generation runs against a hosted API or a self-hosted model.

Retainers

For clients who completed an audit or diagnostic and want ongoing oversight without hiring a full-time IR engineer.

Tier 5$7,500/mo, 3-mo min
Fractional IR Advisor Retainer
OngoingAI startups that completed a Retrieval Audit Sprint and want ongoing IR oversight without hiring a full-time retrieval engineer.

Deliverables

  • Monthly retrieval-metric review (NDCG@10, MRR, hit-rate@k, faithfulness) against the eval harness
  • Async architecture review of pipeline changes (chunking, embeddings, reranking)
  • Pre-fundraising or pre-enterprise-pilot evaluation memo refresh
  • Priority Slack / email response during business hours
Tier 5$5,000/mo, 3-mo min
Relevance Retainer
OngoingMid-market e-commerce brands that completed a Search Relevance Diagnostic and want monthly oversight of vendor relevance quality.

Deliverables

  • Monthly NDCG@10 + click-position-1 measurement against the existing search vendor
  • Quarterly judgment-set refresh for the top 500 head and torso queries
  • Quick-win recommendation review (config, synonyms, merchandising overlays)
  • Pre-peak-season and pre-vendor-renewal advisory

Compare All Engagements

EngagementVerticalPriceTimelineYou GetBest For
Retrieval Audit SprintAI Startups$7,5007 daysIndependent retrieval audit + report + judgment setPre-fundraise / pre-pilot
Search Relevance DiagnosticE-commerce$7.5-9.5K10 daysNDCG@10 + conversion-correlated diagnosticPost-BFCM / vendor renewal
Production-Grade Eval HarnessAI Startups$20K4 wksStratified dataset, CI integration, statistical-rigor dashboardProcurement-grade eval before fundraise
Q4 Readiness AuditE-commerce$25K4 wksMulti-vendor benchmark + merchandising review + Q3 A/B designPre-BFCM, vendor renew/replace
Relevance OptimizationBoth$25-75K4-8 wksHybrid retrieval + cross-encoder reranking in productionExisting search needs a measurable lift
Custom EmbeddingsBoth$50-150K6-12 wksDomain-adapted embedding model + training pipelineGeneric API embeddings miss domain terminology
RAG Pipeline DevelopmentAI Startups$75-150K8-12 wksEnd-to-end RAG with hybrid retrieval and rerankingRetrieval is the dominant lever on output quality
Fractional IR AdvisorAI Startups$7,500/mo3-mo minMonthly metric review + async advisoryPost-diagnostic IR oversight
Relevance RetainerE-commerce$5,000/mo3-mo minMonthly NDCG@10 + judgment-set refreshVendor-quality oversight

Frequently Asked Questions

Not sure which engagement fits?

Book a 30-minute discovery call. I'll walk through your pipeline and route you to the right engagement, or to a resource that does more for you than I would.