Generic Embeddings Failed Us
Off-the-shelf embeddings get demos working. Domain-adapted models get production working. The gap is up to 9.3 nDCG@10 points.
Abstract
Off-the-shelf embedding models are the quickest path to a working semantic search demo and the slowest path to production-quality retrieval. General-purpose models trained on broad web data understand general language well but fail silently on the domain-specific terminology, jargon, and long-tail queries that real users actually type. The gap between demo and production is where most search projects stall. This article covers why generic embeddings fail, how fine-tuning on domain data produces a step change in quality, what the current embedding landscape looks like, and why training data quality matters more than model architecture.
The Demo-to-Production Gap
The trajectory plays out repeatedly. Day one: integrate OpenAI embeddings (or Cohere, Voyage, or any open-source model from HuggingFace). Build a quick prototype. Demo looks great. Stakeholders are impressed.
Month three: "Why isn't semantic search improving our metrics?"
The embedding model does not understand domain terminology. Dense retrievers trained on MS MARCO severely underperform BM25 on specialized domains like BioASQ, TREC-COVID, SciFact, and FiQA (Thakur et al., 2021). Industry jargon maps to wrong regions of the embedding space. Acronyms are meaningless to the model. Product-specific language is noise.
This is not an OpenAI-specific problem. It applies to any general-purpose embedding model. These models are trained on broad web data. They understand general language. They do not understand the vocabulary of a specific medical practice, a specific manufacturing process, or a specific product catalog.
Every domain has vocabulary that general models mishandle: medical terminology where abbreviations carry precise meaning, legal jargon where similar-sounding terms have distinct legal consequences, e-commerce product attributes where "12 gauge" means entirely different things in different categories, manufacturing specifications, financial instruments. The long tail of domain-specific terms is where general embeddings silently fail.
The failure is insidious because it does not show up in demos. Demo queries use common language: "Find similar products." "Search for relevant documents." These work fine. Production queries are different. Users type abbreviations, internal codes, domain shorthand. Dense retriever performance severely degrades under a domain shift, with gaps reaching up to 9.3 nDCG@10 points between zero-shot general models and domain-adapted ones (Wang, Thakur, Reimers & Gurevych, 2022).
Fine-Tuning Changes Everything
Fine-tuning on domain-specific query-document pairs fixes the gap. Using an architecture like E5 as a base and training on actual domain data produces results that are not incremental. They are a step change (Wang, Thakur, Reimers & Gurevych, 2022).
The real wins show up in the long tail. Queries that previously returned irrelevant results or nothing at all suddenly work. The model learns domain-specific relationships that no synonym dictionary or prompt engineering can capture. A medical search system learns that "MI" and "myocardial infarction" and "heart attack" should produce the same results. A parts catalog learns that "5/16-18 hex bolt" is not similar to "5/16-24 hex bolt" despite their surface resemblance.
The training data does not need to be enormous. Starting from as few as 8 examples, an LLM query generator can create training data enabling retrievers to outperform ColBERT v2 by more than 1.2 nDCG@10 (Dai et al., 2023). A few thousand well-curated query-document pairs make a measurable difference. Tens of thousands make it transformative.
The process matters, though. Naive fine-tuning on noisy data can make things worse. False negatives among hard negatives degrade training, and "denoised hard negatives" (cross-encoder filtering of candidate negatives) significantly outperform naive approaches (Qu et al., 2021). The model needs to learn what is similar-but-wrong, not just what is right.
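To make the denoising step concrete, here is a minimal sketch of cross-encoder filtering of candidate negatives. The `score_fn` parameter and the threshold value are assumptions for illustration; in practice `score_fn` would wrap a strong cross-encoder's relevance score, and candidates the cross-encoder judges relevant are discarded rather than trained on as negatives.

```python
from typing import Callable, List, Tuple

def denoise_hard_negatives(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    relevance_threshold: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Split candidate negatives into kept hard negatives and suspected
    false negatives. Candidates that the scorer considers relevant are
    unsafe to use as negatives: training against them teaches the model
    to push away documents that are actually correct answers."""
    kept, suspected_false = [], []
    for doc in candidates:
        if score_fn(query, doc) >= relevance_threshold:
            suspected_false.append(doc)  # likely relevant: drop from negatives
        else:
            kept.append(doc)
    return kept, suspected_false
```

The design choice worth noting: suspected false negatives are returned rather than silently dropped, because they are often good candidates for promotion to additional positives after human review.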
The typical fine-tuning objective is contrastive learning. Given a query $q$, a relevant document $d^+$, and a set of irrelevant documents $\{d^-_1, \ldots, d^-_n\}$, the model learns to produce embeddings where the similarity between $q$ and $d^+$ is high and the similarity between $q$ and each $d^-_i$ is low. The InfoNCE loss formalizes this:

$$\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(q, d^+)/\tau)}{\exp(\mathrm{sim}(q, d^+)/\tau) + \sum_{i=1}^{n} \exp(\mathrm{sim}(q, d^-_i)/\tau)}$$

Where $\mathrm{sim}(\cdot, \cdot)$ is typically cosine similarity and $\tau$ is a temperature parameter that controls the sharpness of the distribution. The quality of negatives matters enormously. Random negatives (documents that are obviously unrelated) provide little learning signal because the model can trivially distinguish them. Hard negatives (documents that are similar but not relevant) force the model to learn the fine-grained boundaries of relevance in the target domain.
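A minimal numpy sketch of this loss for a single (query, positive, negatives) triple; the temperature default is illustrative, not prescriptive:

```python
import numpy as np

def info_nce_loss(q, d_pos, d_negs, tau=0.05):
    """InfoNCE loss for one training example: negative log of the
    softmax probability assigned to the positive document, with
    cosine similarity as sim(.,.) and temperature tau."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Positive goes at index 0; negatives follow.
    logits = np.array([cos(q, d_pos) / tau] + [cos(q, d) / tau for d in d_negs])
    # Numerically stable log-sum-exp over all candidates.
    log_z = np.log(np.sum(np.exp(logits - logits.max()))) + logits.max()
    return float(log_z - logits[0])
```

A random negative that is nearly orthogonal to the query contributes almost nothing to the loss, while a hard negative that sits close to the positive dominates it, which is the quantitative form of the point about negative quality above.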
The Embedding Landscape Moves Fast
The embedding model selected 12 months ago is probably no longer the best option.
E5 was an era-defining default. It introduced "query:" and "passage:" prefixes for asymmetric retrieval (short queries matched against long passages) and was the first model to outperform BM25 zero-shot on BEIR without labeled data (Wang, Yang et al., 2022). It fine-tuned predictably and offered a practical balance of size, speed, and quality.
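The prefix convention is simple enough to show directly. This sketch only prepares the input strings; the actual encoding with an E5 checkpoint is assumed to happen downstream:

```python
def e5_inputs(queries, passages):
    """Apply E5's asymmetric prefixes. Queries and passages get
    different prefixes so the model can treat the short-query side
    and the long-passage side of retrieval differently."""
    return (
        [f"query: {q}" for q in queries],
        [f"passage: {p}" for p in passages],
    )
```

Forgetting these prefixes at inference time (after training with them) is a classic source of silently degraded E5-family retrieval quality.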
The field has moved past it.
EmbeddingGemma (300M parameters) achieves state-of-the-art scores for models under 500M parameters on MTEB Multilingual, English, and Code leaderboards. It outperforms multilingual-e5-large (560M) despite being nearly half the size. The key innovations include encoder-decoder initialization from Gemma 3, geometric embedding distillation from a larger teacher model, and spread-out regularization that makes embeddings robust even under int4/int8 quantization (Vera et al., 2025).
At the larger end, Qwen3-Embedding-8B leads the multilingual MTEB at around 70.58, NV-Embed-v2 hits 72.31 on the legacy English benchmark, and Gemini Embedding 001 leads the refreshed English MTEB at 68.32 (MTEB Leaderboard, 2026). These are not marginal improvements over the E5 generation. They are step changes.
The efficiency story has also shifted. EmbeddingGemma runs in under 200MB of RAM with quantization and produces embeddings in sub-15ms on edge hardware. Matryoshka representation learning allows truncation from 768 to 128 dimensions with minimal quality loss (Vera et al., 2025). That kind of flexibility was not available when E5 was the default.
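Matryoshka truncation is mechanically trivial, which is part of its appeal. A sketch, assuming embeddings from a Matryoshka-trained model (for other models, the leading dimensions carry no special status and this truncation would hurt badly):

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dim: int = 128) -> np.ndarray:
    """Keep only the first `dim` dimensions and L2-renormalize so
    cosine similarity remains well-behaved. Matryoshka training packs
    most of the information into the leading dimensions, which is why
    this crude slice loses little quality."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)
```

Going from 768 to 128 dimensions cuts vector storage and ANN index size by 6x, which often matters more than a point of nDCG at serving time.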
What has not changed: no single model dominates across all domains (Thakur et al., 2021). The model that tops MTEB overall may underperform on a specific domain. The only leaderboard that matters is the one built from actual queries in the target domain.
The practical model selection process: take the top 3 to 5 candidates from the current MTEB leaderboard (filtering for models that fit your latency and memory budget), benchmark each on a representative sample of production queries with human relevance judgments, and pick the one that performs best on your data. MTEB scores are useful for shortlisting. They are not useful for final selection. A model that scores 72 on MTEB but 0.65 on your domain-specific evaluation set is worse than a model scoring 68 on MTEB but 0.78 on your data.
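The benchmarking step reduces to computing nDCG@10 per candidate model over the same judged queries. A self-contained sketch of that metric (the query and document names in the usage are hypothetical):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """nDCG@k for one query. `judgments` maps doc_id -> graded relevance."""
    gains = [judgments.get(d, 0) for d in ranked_doc_ids]
    ideal = sorted(judgments.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def mean_ndcg(model_rankings, all_judgments, k=10):
    """Average nDCG@k across the evaluation queries: one number per
    candidate model, comparable across models on the same judgments."""
    scores = [ndcg_at_k(model_rankings[q], all_judgments[q], k) for q in all_judgments]
    return sum(scores) / len(scores)
```

Run each shortlisted model's retrieval over the 200-300 judged queries, compute `mean_ndcg` per model, and the selection decision falls out of a single table.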
Training Data Quality Beats Model Architecture
The hardest part of fine-tuning embeddings is not the model. It is constructing good training data. The difference between a mediocre fine-tuned model and a great one is almost never the architecture, the learning rate, or the number of epochs. It is the training pairs.
Click Logs
Rich in signal, full of noise. Users click irrelevant results out of curiosity or misclicks. Position bias inflates top results regardless of actual relevance: users disproportionately click top-ranked results regardless of whether those results are the best matches (Joachims et al., 2005). The cascade model best explains this bias in early ranks, where users evaluate results sequentially from top to bottom (Craswell et al., 2008).
Using raw click logs as training data trains the model to replicate the current system's failures, including its ranking biases. The debiasing step is essential: filter by dwell time (users who clicked and stayed are more likely to have found what they wanted), use session-level signals (did the user reformulate their query after clicking, suggesting dissatisfaction?), and apply inverse propensity weighting to correct for position bias. A clicked result at position 8 is a stronger relevance signal than a clicked result at position 1, because position 1 gets clicked regardless of quality.
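A minimal sketch of those two debiasing steps together. The propensity model here (examination probability proportional to 1/position) is a simplifying assumption for illustration; production systems estimate propensities from intervention data rather than assuming a functional form:

```python
def debias_clicks(click_log, min_dwell_seconds=10.0, gamma=1.0):
    """Turn raw click events into weighted (query, doc_id, weight)
    training pairs.
    - Dwell-time filter: drop clicks where the user bounced quickly,
      treating them as misclicks or dissatisfaction.
    - Inverse propensity weighting: upweight clicks at low positions,
      since top positions attract clicks regardless of relevance.
      Examination propensity is modeled as 1 / position**gamma.
    Each event is a dict with 'query', 'doc_id', 'position' (1-based),
    and 'dwell' in seconds."""
    pairs = []
    for e in click_log:
        if e["dwell"] < min_dwell_seconds:
            continue  # likely a misclick or bounce: no relevance signal
        propensity = 1.0 / (e["position"] ** gamma)
        weight = 1.0 / propensity  # IPW: rarely-examined positions count more
        pairs.append((e["query"], e["doc_id"], weight))
    return pairs
```

With `gamma=1.0`, a dwelled click at position 8 carries 8x the weight of one at position 1, matching the intuition above that deep clicks are the stronger signal.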
Human Labels
High quality, low scale. Having expert annotators label thousands of query-document pairs is expensive and slow. Most teams cannot sustain it as a primary data source. Even a small set of human labels is valuable, but the highest-leverage use is for evaluation, not training. A curated evaluation set of 200 to 300 queries provides the compass; spending those labels as bulk training signal wastes them.
Synthetic Pairs from LLMs
Scalable but biased toward generic language. An LLM generating queries for a document will produce "reasonable" queries, not the abbreviated, jargon-heavy queries real users actually type. The distribution is wrong. Bad generation quality on specialized corpora is the key failure mode of purely synthetic approaches (Wang, Thakur, Reimers & Gurevych, 2022).
Combining All Three
The approach that works is strategic combination. A mixture of synthetic and labeled data achieves new state-of-the-art results on MTEB and BEIR, beating models with 40 times more parameters (Wang, Yang et al., 2024). The recipe:
- Click logs as bulk training signal, but filtered aggressively. Session-level signals (reformulations, return-to-results) are more reliable than raw clicks.
- Human labels for evaluation and hard-case calibration.
- Synthetic pairs to fill coverage gaps in underrepresented query types.
Hard Negatives Are Non-Negotiable
The model does not just need to learn what is relevant. It needs to learn what looks relevant but is not. "Nike Air Max 90" and "Nike Air Max 97" are similar. But for a user searching for the 90, the 97 is wrong. Hard negatives teach this distinction.
Uninformative negatives (random documents that are obviously irrelevant) cause diminishing gradients and slow convergence. Global hard negatives selected via approximate nearest neighbor (ANN) index dramatically improve training quality and retrieval performance (Xiong et al., 2021).
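The mining step itself is straightforward: embed the corpus, find the query's nearest neighbors, and keep the near-misses that are not labeled positive. Brute-force search stands in here for the ANN index (e.g. a FAISS index) that a production pipeline would use over a large corpus:

```python
import numpy as np

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, k=5):
    """Return indices of the k documents closest to the query (by
    cosine similarity) that are NOT labeled positive. These near-miss
    documents are the hard negatives: similar enough to be confusable,
    wrong enough to carry training signal."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)  # most similar first
    hard = [int(i) for i in order if int(i) not in positive_ids]
    return hard[:k]
```

In a real pipeline these mined candidates would then pass through the cross-encoder denoising step described earlier, since the nearest non-positive neighbors are exactly where unlabeled true positives hide.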
Data quality beats model complexity. Every time. The Wang, Yang et al. (2024) result above is the proof: a simple contrastive training procedure with high-quality data outperforming models 40x its size. The best architecture with bad training data loses to a decent architecture with great training data.
Conclusion
Generic embeddings are a starting point, not a destination. They get a demo working. They do not get a production system working. The path from demo to production runs through domain adaptation: fine-tuning on domain-specific query-document pairs, constructing high-quality training data from multiple sources, and investing in hard negatives that teach the model the boundaries of relevance in a specific domain.
The embedding landscape will continue to move fast. Models will continue to improve. But the fundamental principle holds regardless of which model is currently leading MTEB: an embedding model is only as good as its training pairs, and domain-specific data is the competitive asset that no leaderboard score can substitute for.
References
- Craswell, N., Zoeter, O., Taylor, M., & Ramsey, B. (2008). "An Experimental Comparison of Click Position-bias Models." WSDM 2008, pp. 87-94. https://dl.acm.org/doi/10.1145/1341531.1341545
- Dai, Z., Zhao, V. Y., Ma, J., Luan, Y., Ni, J., Lu, J., Bakalov, A., Guu, K., Hall, K. B., & Chang, M. (2023). "Promptagator: Few-shot Dense Retrieval From 8 Examples." ICLR 2023. https://arxiv.org/abs/2209.11755
- Joachims, T., Granka, L., Pan, B., Hembrooke, H., & Gay, G. (2005). "Accurately Interpreting Clickthrough Data as Implicit Feedback." SIGIR 2005, pp. 154-161. https://dl.acm.org/doi/10.1145/1076034.1076063
- MTEB Leaderboard (2026). https://huggingface.co/spaces/mteb/leaderboard
- Qu, Y., et al. (2021). "RocketQA: An Optimized Training Approach to Dense Passage Retrieval." NAACL 2021, pp. 5835-5847. https://aclanthology.org/2021.naacl-main.466/
- Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." NeurIPS 2021. https://arxiv.org/abs/2104.08663
- Vera, L., Dua, R., Zhang, Y., Salz, D., Mullins, R., et al. (2025). "EmbeddingGemma: Powerful and Lightweight Text Representations." arXiv:2509.20354. https://arxiv.org/abs/2509.20354
- Wang, K., Thakur, N., Reimers, N., & Gurevych, I. (2022). "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval." NAACL 2022. https://aclanthology.org/2022.naacl-main.168/
- Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., & Wei, F. (2022). "Text Embeddings by Weakly-Supervised Contrastive Pre-training." arXiv:2212.03533. https://arxiv.org/abs/2212.03533
- Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., & Wei, F. (2024). "Improving Text Embeddings with Large Language Models." ACL 2024, pp. 11897-11916. https://aclanthology.org/2024.acl-long.642/
- Xiong, L., Xiong, C., Li, Y., Liu, T., Bennett, P. N., Ahmed, J., & Overwijk, A. (2021). "Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval" (ANCE). ICLR 2021. https://arxiv.org/abs/2007.00808
Generic embeddings not cutting it? Let's talk custom models.
Book a Discovery Call