Generic Embeddings Failed Us
When we switched from OpenAI embeddings to custom fine-tuned models, retrieval quality jumped 35%.
Last year, a client came to us with a legal document retrieval system. They'd built it with OpenAI's text-embedding-ada-002—the standard choice—and it worked... poorly. Relevant documents ranked in positions 15-20 instead of the top 5. Users were frustrated. The AI assistant built on top hallucinated because it couldn't find the right context.
The Generic Embedding Problem
OpenAI's embeddings are trained on a massive corpus of internet text. They understand general language incredibly well. But legal documents aren't general language:
- Domain terminology: "consideration" means something very different in contract law
- Citation patterns: "See Smith v. Jones, 123 F.3d 456" is meaningful structure
- Document relationships: Amendments reference original contracts in specific ways
The embedding model had never seen legal documents during training. It was doing its best, but its best wasn't good enough.
Quantifying the Gap
We ran a proper evaluation using 500 queries with human-labeled relevant documents:
| Metric | OpenAI ada-002 | Gap to Ideal |
|--------|----------------|--------------|
| NDCG@10 | 0.62 | -0.38 |
| Recall@20 | 0.71 | -0.29 |
| MRR | 0.45 | -0.55 |
These numbers told a clear story: the system was capturing only about 60% of the ideal ranking quality in its top 10 results, and nearly three in ten relevant documents never appeared in the top 20.
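For readers who want to run this kind of evaluation themselves, the metrics are straightforward to compute from ranked results and relevance labels. Below is a minimal sketch with binary relevance, where `results` maps each query to a ranked list of document IDs and `labels` maps it to the set of labeled-relevant IDs (hypothetical structures, not the client's actual pipeline):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_dcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant_ids), k)))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of labeled-relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids):
        if doc_id in relevant_ids:
            return 1.0 / (rank + 1)
    return 0.0

# Average over all labeled queries.
# results: {query: ranked list of doc IDs}, labels: {query: set of relevant doc IDs}
scores = [ndcg_at_k(results[q], labels[q]) for q in results]
print(f"NDCG@10: {sum(scores) / len(scores):.2f}")
```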
The Fine-Tuning Approach
We fine-tuned a sentence-transformer model on the client's data using contrastive learning:
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses, InputExample

# Start with a strong base model
model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Create (query, positive, negative) triplets from click data and human labels
train_examples = [
    InputExample(texts=[query, positive_doc, negative_doc])
    for query, positive_doc, negative_doc in training_data
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Fine-tune with multiple negatives ranking loss
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
)
```

The key ingredients:
- Hard negatives: Documents that seem relevant but aren't (see the mining sketch after this list)
- Domain-specific queries: Real searches from users
- Positive pairs: Documents users actually found useful
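Hard negatives usually have to be mined rather than sampled at random. One common approach, sketched below under the assumption that you already have labeled (query, positive document) pairs plus the corpus (the variable names here are placeholders), is to retrieve the base model's top results for each query and keep a highly ranked document that is not labeled relevant:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed inputs: corpus_ids / corpus_texts, labeled (query, positive_id) pairs,
# relevant_ids mapping each query to its labeled set, and doc_text by ID.
base_model = SentenceTransformer('BAAI/bge-base-en-v1.5')
corpus_embeddings = base_model.encode(
    corpus_texts, convert_to_tensor=True, normalize_embeddings=True
)

training_data = []
for query, positive_id in labeled_pairs:
    query_embedding = base_model.encode(
        query, convert_to_tensor=True, normalize_embeddings=True
    )
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=20)[0]

    # A hard negative ranks highly under the base model but is not labeled relevant
    hard_negative_id = next(
        (corpus_ids[hit['corpus_id']] for hit in hits
         if corpus_ids[hit['corpus_id']] not in relevant_ids[query]),
        None,
    )
    if hard_negative_id is not None:
        training_data.append(
            (query, doc_text[positive_id], doc_text[hard_negative_id])
        )
```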
The Math Behind Contrastive Learning
Contrastive learning optimizes the embedding space so that similar items are close and dissimilar items are far apart. For a query $q$ with positive document $d^{+}$ and negative documents $d_{1}^{-}, \ldots, d_{n}^{-}$, the loss is:

$$\mathcal{L} = -\log \frac{\exp\left(\mathrm{sim}(q, d^{+}) / \tau\right)}{\exp\left(\mathrm{sim}(q, d^{+}) / \tau\right) + \sum_{i=1}^{n} \exp\left(\mathrm{sim}(q, d_{i}^{-}) / \tau\right)}$$

where $\mathrm{sim}(\cdot, \cdot)$ is the similarity between embeddings (typically cosine) and $\tau$ is a temperature parameter controlling the sharpness of the distribution.
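This is essentially the objective that MultipleNegativesRankingLoss optimizes, where the other positives in a batch act as the negatives. A minimal PyTorch sketch for a single query, assuming embeddings are already L2-normalized so dot products are cosine similarities:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, neg_embs, temperature=0.05):
    """InfoNCE-style loss: pull the positive toward the query, push negatives away.

    query_emb: (d,), pos_emb: (d,), neg_embs: (n, d); all L2-normalized,
    so the dot products below are cosine similarities.
    """
    pos_sim = query_emb @ pos_emb                    # similarity to the positive
    neg_sims = neg_embs @ query_emb                  # (n,) similarities to the negatives
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sims]) / temperature
    # Cross-entropy with the positive at index 0 equals -log softmax_0(logits),
    # i.e. the loss written above.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)
```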
Results
After fine-tuning on 50,000 query-document pairs:
| Metric | OpenAI ada-002 | Fine-tuned | Improvement |
|--------|----------------|------------|-------------|
| NDCG@10 | 0.62 | 0.84 | +35% |
| Recall@20 | 0.71 | 0.92 | +30% |
| MRR | 0.45 | 0.73 | +62% |
The improvement wasn't marginal—it was transformative. Users found what they needed. The AI assistant stopped hallucinating.
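If your retrieval stack is built on sentence-transformers end to end, its built-in InformationRetrievalEvaluator reports NDCG@k, MRR@k, and precision/recall@k on a held-out set. A sketch with placeholder paths and dictionaries:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

fine_tuned = SentenceTransformer('output/legal-bge-finetuned')  # hypothetical path

evaluator = InformationRetrievalEvaluator(
    queries=heldout_queries,        # {query_id: query_text}
    corpus=corpus,                  # {doc_id: doc_text}
    relevant_docs=relevant_docs,    # {query_id: set of relevant doc_ids}
    ndcg_at_k=[10],
    precision_recall_at_k=[20],
    mrr_at_k=[10],
    name='legal-heldout',
)
evaluator(fine_tuned)
```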
When to Fine-Tune
Fine-tuning isn't always necessary. Use generic embeddings when:
- Your domain is general-purpose
- You have limited training data
- Speed-to-market matters more than quality
Fine-tune when:
- Domain terminology differs from general language
- You have query-document pairs (clicks, ratings, labels)
- Retrieval quality directly impacts business metrics
The Investment
Fine-tuning isn't free:
- Data preparation: Cleaning and formatting training pairs
- Compute: GPU hours for training
- Evaluation: Rigorous testing on held-out data
- Deployment: Serving infrastructure for custom models (sketched below)
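For reference, the minimum viable deployment is self-hosting the fine-tuned model and doing query-time encoding plus similarity search yourself; the sketch below uses placeholder paths and corpus variables, and a managed vector database would slot in where the in-memory corpus embeddings sit:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('output/legal-bge-finetuned')  # hypothetical path

# Encode the corpus once; in production these vectors live in a vector database
corpus_embeddings = model.encode(
    corpus_texts, convert_to_tensor=True, normalize_embeddings=True
)

def search(query: str, top_k: int = 5):
    """Return the top_k (document, score) pairs for a query."""
    query_embedding = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
    return [(corpus_texts[hit['corpus_id']], hit['score']) for hit in hits]
```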
But for specialized domains, the ROI is clear. A 35% improvement in retrieval directly translates to better user experience and reduced hallucination in downstream AI applications.
Generic embeddings are a great starting point. But for serious retrieval systems, custom models are often the difference between "works" and "works well."
Generic embeddings not cutting it? Let's talk custom models.
Book a Discovery Call