Services
Two fixed-fee engagements designed to land in 7-10 business days, then a ladder of remediation and retainer options once I know what your pipeline actually needs.
Start here: the diagnostic engagements
Pick the engagement that matches your buyer. Both are fixed-fee, time-boxed, and produce a written report plus a prioritized remediation roadmap.
Independent retrieval pipeline audit with bootstrap-CI metrics and a runnable Python notebook you keep.
Independent NDCG@10 + conversion-correlated relevance diagnostic against your existing search vendor.
What happens after the diagnostic
When the diagnostic surfaces work worth doing, these are the named remediation SKUs the engagement can extend into.
Deliverables
- Stratified dataset (500-1,500 query/document pairs by head/torso/tail and query intent)
- CI integration (GitHub Actions or GitLab CI) with configurable thresholds that block deploys on regressions
- Dashboard: NDCG@k, MRR, hit-rate@k, faithfulness, custom metrics, per-query-class breakdowns, regression alerts
- Statistical rigor layer: bootstrap confidence intervals and paired significance tests
- Methodology runbook plus 1-2 weeks of Slack/email handoff support
Why $20K
The deliverable is the procurement-grade evaluation infrastructure that investors and enterprise procurement increasingly ask for as part of AI due diligence (Andreessen Horowitz, 2025 CIO Survey). Against a Series B round or a first enterprise contract in six to seven figures, the fee is a rounding error on the outcome the harness defends. Fixed-fee, four weeks, deliverable ships on the SOW date.
Deliverables
- Multi-vendor benchmark (NDCG@10, click-position-1, conversion-correlated relevance) on top 1,000-2,000 head and torso queries against 1-2 alternative search vendors
- Full merchandising review: boost/bury rules, synonyms, redirects, banners, manual curation overlays
- Q4 stress test against last year's BFCM query mix, promotion-aware relevance, out-of-stock handling
- A/B test design for Q3 with hypothesis, sample-size math, and success criteria
- 60-minute executive presentation
Why Q4 Readiness
Adobe Analytics reported double-digit YoY growth in BFCM 2025 dollars and a meaningful conversion lift among AI-influenced shoppers. A diagnostic delivered in Q1-Q2 feeds directly into the Q3 testing window before peak.
Deliverables
- Hybrid retrieval implementation (BM25 plus dense vectors with reciprocal rank fusion or learned fusion)
- Query understanding layer covering intent classification and entity extraction on the client query log
- Cross-encoder reranking pipeline over the top-N candidates from the hybrid stage
- Relevance evaluation framework with NDCG@10, MRR, and recall reported with bootstrap confidence intervals (DHSS, statistical-rigor methodology)
- A/B testing infrastructure with hypothesis, sample-size math, and stop conditions
Why $25K-$75K
Hybrid retrieval and cross-encoder reranking typically add 5-15 NDCG@10 points over a tuned BM25 baseline on out-of-domain benchmarks (Thakur et al., BEIR, NeurIPS 2021; Santhanam et al., ColBERTv2, NAACL 2022). The engagement delivers production code, an evaluation harness, and a documented handoff, not a notebook proof of concept. The price band tracks scope: query log size, number of indexed corpora, and whether reranking ships behind a feature flag or as default.
Deliverables
- Domain-adapted embedding model fine-tuned on client query-document pairs (sentence-transformers / SBERT methodology; Reimers & Gurevych, EMNLP 2019)
- Training pipeline on proprietary data with held-out evaluation splits stratified by query class
- Benchmark of the fine-tuned model against the prior generic baseline on a held-out test set, reported with bootstrap confidence intervals
- Deployment artifacts: serving image, embedding-version registry, and rollback procedure
- Retraining runbook with cadence and trigger criteria tied to data drift
Why $50K-$150K
Domain fine-tuning of embedding models is a published lift mechanism for specialized retrieval: SBERT (Reimers & Gurevych, EMNLP 2019) is the canonical reference, and the MTEB leaderboard (Muennighoff et al., EACL 2023) shows domain-specialized encoders consistently outperform general-purpose models on domain tasks. Scope covers training data curation, model selection, fine-tuning runs, evaluation, and a deployable serving path. The price band tracks dataset size, whether contrastive pairs need to be mined from logs or supplied as labels, and whether the model is served self-hosted or behind a managed inference endpoint.
Deliverables
- End-to-end RAG architecture with documented component boundaries (indexing, retrieval, reranking, generation, guardrails)
- Hybrid retrieval layer with cross-encoder reranking over the top candidate set
- Chunking strategy and metadata schema derived from the client corpus, not a default
- Evaluation framework covering retrieval quality (NDCG@k, recall, MRR) and generation quality (faithfulness, answer relevance) with bootstrap confidence intervals
- Production deployment with observability hooks and a regression alerting path
Why $75K-$150K
Retrieval is the dominant failure mode in production RAG: a recent due-diligence review of RAG evaluation practice (Martinon et al., arXiv 2507.21753, 2025) frames retrieval rigor as the prerequisite to defensible generation metrics, and DHSS (the hybrid-retrieval methodology underlying this engagement) treats the retrieval layer as the testable contract that downstream generation depends on. Scope covers the full pipeline end-to-end, with the retrieval and evaluation layers built to the same standard as the Production-Grade Eval Harness. The price band tracks corpus complexity, number of document types, and whether generation runs against a hosted API or a self-hosted model.
Retainers
For clients who completed an audit or diagnostic and want ongoing oversight without hiring a full-time IR engineer.
Deliverables
- Monthly retrieval-metric review (NDCG@10, MRR, hit-rate@k, faithfulness) against the eval harness
- Async architecture review of pipeline changes (chunking, embeddings, reranking)
- Pre-fundraising or pre-enterprise-pilot evaluation memo refresh
- Priority Slack / email response during business hours
Deliverables
- Monthly NDCG@10 + click-position-1 measurement against the existing search vendor
- Quarterly judgment-set refresh for the top 500 head and torso queries
- Quick-win recommendation review (config, synonyms, merchandising overlays)
- Pre-peak-season and pre-vendor-renewal advisory
Compare All Engagements
| Engagement | Vertical | Price | Timeline | You Get | Best For |
|---|---|---|---|---|---|
| Retrieval Audit Sprint | AI Startups | $7,500 | 7 days | Independent retrieval audit + report + judgment set | Pre-fundraise / pre-pilot |
| Search Relevance Diagnostic | E-commerce | $7.5-9.5K | 10 days | NDCG@10 + conversion-correlated diagnostic | Post-BFCM / vendor renewal |
| Production-Grade Eval Harness | AI Startups | $20K | 4 wks | Stratified dataset, CI integration, statistical-rigor dashboard | Procurement-grade eval before fundraise |
| Q4 Readiness Audit | E-commerce | $25K | 4 wks | Multi-vendor benchmark + merchandising review + Q3 A/B design | Pre-BFCM, vendor renew/replace |
| Relevance Optimization | Both | $25-75K | 4-8 wks | Hybrid retrieval + cross-encoder reranking in production | Existing search needs a measurable lift |
| Custom Embeddings | Both | $50-150K | 6-12 wks | Domain-adapted embedding model + training pipeline | Generic API embeddings miss domain terminology |
| RAG Pipeline Development | AI Startups | $75-150K | 8-12 wks | End-to-end RAG with hybrid retrieval and reranking | Retrieval is the dominant lever on output quality |
| Fractional IR Advisor | AI Startups | $7,500/mo | 3-mo min | Monthly metric review + async advisory | Post-diagnostic IR oversight |
| Relevance Retainer | E-commerce | $5,000/mo | 3-mo min | Monthly NDCG@10 + judgment-set refresh | Vendor-quality oversight |
Frequently Asked Questions
Not sure which engagement fits?
Book a 30-minute discovery call. I'll walk through your pipeline and route you to the right engagement, or to a resource that does more for you than I would.