Embeddings · fine-tuning · domain-specific · sentence-transformers

Generic Embeddings Failed Us

When we switched from OpenAI embeddings to custom fine-tuned models, retrieval quality jumped 35%.

January 5, 2026 · 12 min read

Last year, a client came to us with a legal document retrieval system. They'd built it with OpenAI's text-embedding-ada-002—the standard choice—and it worked... poorly. Relevant documents ranked in positions 15-20 instead of the top 5. Users were frustrated. The AI assistant built on top hallucinated because it couldn't find the right context.

The Generic Embedding Problem

OpenAI's embeddings are trained on a massive corpus of internet text. They understand general language incredibly well. But legal documents aren't general language:

  • Domain terminology: "consideration" means something very different in contract law
  • Citation patterns: "See Smith v. Jones, 123 F.3d 456" is meaningful structure
  • Document relationships: Amendments reference original contracts in specific ways

The embedding model had never seen legal documents during training. It was doing its best, but its best wasn't good enough.

Quantifying the Gap

We ran a proper evaluation using 500 queries with human-labeled relevant documents:

| Metric | OpenAI ada-002 | Gap to Ideal |
|-----------|----------------|--------------|
| NDCG@10 | 0.62 | -0.38 |
| Recall@20 | 0.71 | -0.29 |
| MRR | 0.45 | -0.55 |

These numbers told a clear story: ranking quality in the top 10 was nearly 40% below ideal (NDCG@10 of 0.62), and almost 30% of relevant documents weren't being surfaced even in the top 20 results.
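For reference, here's a minimal sketch of how metrics like these can be computed with binary relevance labels; `retrieve` and `labels` are hypothetical stand-ins for your retrieval function and the human judgments:

import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    # DCG over the top-k results, with binary gains
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant_ids)
    # Ideal DCG: all relevant documents ranked first
    idcg = sum(1.0 / math.log2(i + 2)
               for i in range(min(len(relevant_ids), k)))
    return dcg / idcg if idcg else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1/rank of the first relevant result, 0 if none is found
    for i, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / i
    return 0.0

# Hypothetical evaluation loop over the 500 labeled queries:
# mean_ndcg = sum(ndcg_at_k(retrieve(q), labels[q]) for q in queries) / len(queries)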

The Fine-Tuning Approach

We fine-tuned a sentence-transformer model on the client's data using contrastive learning:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses, InputExample

# Start with a strong base model
model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Create (query, positive, hard negative) triples from click data and human labels
train_examples = [
    InputExample(texts=[query, positive_doc, negative_doc])
    for query, positive_doc, negative_doc in training_data
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Fine-tune with multiple negatives ranking loss: every other positive
# in the batch acts as an additional negative for each query
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
)

The key ingredients:

  1. Hard negatives: Documents that seem relevant but aren't (see the mining sketch after this list)
  2. Domain-specific queries: Real searches from users
  3. Positive pairs: Documents users actually found useful
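That first ingredient deserves a closer look. One common mining strategy (a sketch of the general technique, not necessarily exactly what we shipped) is to use the base model itself: retrieve the top-ranked documents for each query and keep the ones humans did not label as relevant. Here, `corpus_texts`, `corpus_ids`, and `positive_ids` are hypothetical stand-ins for the client's data:

from sentence_transformers import SentenceTransformer, util

base = SentenceTransformer('BAAI/bge-base-en-v1.5')
corpus_emb = base.encode(corpus_texts, convert_to_tensor=True,
                         normalize_embeddings=True)

def mine_hard_negatives(query, positive_ids, top_k=20):
    # Documents the model ranks highly but that aren't labeled relevant
    # are exactly the "seems relevant but isn't" cases we want
    q_emb = base.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
    return [corpus_ids[h['corpus_id']] for h in hits
            if corpus_ids[h['corpus_id']] not in positive_ids]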

The Math Behind Contrastive Learning

Contrastive learning optimizes the embedding space so that similar items are close and dissimilar items are far apart. The loss for a query $q$ with positive document $d^+$ and negative documents $\{d^-_i\}$ is the familiar InfoNCE objective:

$$\mathcal{L} = -\log \frac{e^{\text{sim}(q, d^+) / \tau}}{e^{\text{sim}(q, d^+) / \tau} + \sum_{i} e^{\text{sim}(q, d^-_i) / \tau}}$$

where $\tau$ is a temperature parameter controlling the sharpness of the distribution.
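In code, with in-batch negatives, the whole thing collapses to a cross-entropy over a similarity matrix. A minimal PyTorch sketch (not the exact sentence-transformers implementation):

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, doc_emb, tau=0.05):
    # Row i of the similarity matrix scores query i against every document;
    # document i is its positive, all other documents are in-batch negatives
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    sim = query_emb @ doc_emb.T / tau
    targets = torch.arange(sim.size(0), device=sim.device)
    # Cross-entropy with diagonal targets is exactly the loss above
    return F.cross_entropy(sim, targets)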

Results

After fine-tuning on 50,000 query-document pairs:

| Metric | OpenAI ada-002 | Fine-tuned | Improvement |
|-----------|----------------|------------|-------------|
| NDCG@10 | 0.62 | 0.84 | +35% |
| Recall@20 | 0.71 | 0.92 | +30% |
| MRR | 0.45 | 0.73 | +62% |

The improvement wasn't marginal—it was transformative. Users found what they needed. The AI assistant stopped hallucinating.

When to Fine-Tune

Fine-tuning isn't always necessary. Use generic embeddings when:

  • Your domain is general-purpose
  • You have limited training data
  • Speed-to-market matters more than quality

Fine-tune when:

  • Domain terminology differs from general language
  • You have query-document pairs (clicks, ratings, labels)
  • Retrieval quality directly impacts business metrics

The Investment

Fine-tuning isn't free:

  • Data preparation: Cleaning and formatting training pairs
  • Compute: GPU hours for training
  • Evaluation: Rigorous testing on held-out data
  • Deployment: Serving infrastructure for custom models (sketched below)
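On that last point: serving a fine-tuned sentence-transformer is mostly a matter of loading the saved weights and encoding at query time. A minimal sketch, assuming a hypothetical local save path and document list:

from sentence_transformers import SentenceTransformer

# Hypothetical path the model was saved to after fine-tuning
model = SentenceTransformer('./models/legal-embeddings-v1')

# Index time: embed the corpus once
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# Query time: embed the query and score by dot product
# (equivalent to cosine similarity on normalized vectors)
query_embedding = model.encode('notice period for termination',
                               normalize_embeddings=True)
scores = doc_embeddings @ query_embedding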

But for specialized domains, the ROI is clear. A 35% improvement in retrieval directly translates to better user experience and reduced hallucination in downstream AI applications.

Generic embeddings are a great starting point. But for serious retrieval systems, custom models are often the difference between "works" and "works well."


Laszlo Csontos

Senior ML engineer specializing in search and retrieval systems. Building intelligent search solutions and consulting via TensorOpt.

Generic embeddings not cutting it? Let's talk custom models.

Book a Discovery Call