Why Generic Embeddings Stall in Production, and What Domain Adaptation Fixes

Off-the-shelf embedding models are the quickest path to a working semantic search demo and the slowest path to production-quality retrieval. A general-purpose model trained on broad web data understands general language well and fails quietly on the domain-specific terminology, jargon, and long-tail queries that real users actually type. The gap between demo and production is where most search projects stall, and it does not announce itself. It shows up three months in, as a flat metric nobody can explain.

This piece is about why that happens, why fine-tuning on domain data produces a step change rather than an increment, and why the lever that decides the outcome is the training data, not the model architecture.

The demo-to-production gap

The trajectory plays out the same way across teams. Day one: integrate an embedding API (OpenAI, Cohere, Voyage) or an open-source model from Hugging Face, build a quick prototype, and the demo looks great. Month three: "why isn't semantic search moving our metrics?"

The model does not understand the domain's vocabulary. Out of the box, a dense retriever trained on a general benchmark like MS MARCO degrades sharply when it meets a specialized domain, and this degradation under domain shift is a well-documented property of dense retrievers rather than an edge case (BEIR benchmark, Thakur et al., 2021). Industry jargon maps to the wrong region of the embedding space, acronyms are meaningless to the model, and product-specific language reads as noise. The degradation is not a defect to wait out as models improve. It is the specific problem domain adaptation solves, and the subject of the rest of this article.

This is not an OpenAI-specific issue. It applies to any general-purpose embedding model, because all of them are trained on broad web data. They understand general language. They do not understand the vocabulary of a particular medical practice, a particular manufacturing process, or a particular product catalog. Every domain carries terms that general models mishandle: medical abbreviations that encode precise meaning, legal terms where similar-sounding phrases carry distinct consequences, e-commerce attributes where "12 gauge" means one thing for wire and another for shotguns, and the long tail of part numbers, model years, and internal codes.

The failure is hard to catch because it does not appear in demos. Demo queries use clean, common language ("find similar products," "search for relevant documents") and those work. Production queries do not look like that. Users type abbreviations, internal codes, and domain shorthand, and that is exactly where a zero-shot general model breaks down.

Domain adaptation closes the gap

Adapting the model to its domain closes the gap, and the improvement is not incremental. Generative pseudo-labeling, which adapts a dense retriever using queries generated for the target corpus and relevance scored by a cross-encoder, has lifted retrieval quality by up to 9.3 nDCG@10 points over an out-of-the-box model trained only on a general corpus (Wang, Thakur, Reimers & Gurevych, 2022). A change of that size is the difference between a feature that ships and one that gets quietly shelved.

The real wins show up in the long tail. Queries that previously returned irrelevant results, or nothing at all, start working, because the model learns domain-specific relationships that no synonym dictionary or prompt template can capture. A medical search system learns that "MI," "myocardial infarction," and "heart attack" should return the same documents. A parts catalog learns that "5/16-18 hex bolt" is not interchangeable with "5/16-24 hex bolt" despite their surface resemblance.

The training data does not need to be enormous to start. With as few as 8 in-domain examples used to prompt a task-specific query generator, a dual-encoder retriever has been trained to outperform far more expensive models like ColBERT v2 by more than 1.2 nDCG@10 points on average across 11 retrieval sets (Dai et al., 2023). The key qualifier is in-domain: those 8 examples ground the generator in how the domain's users actually phrase things, and the generated queries are filtered for consistency. That is a different mechanism from generating queries blind, which is the failure mode discussed below. A few thousand well-curated pairs make a measurable difference; tens of thousands make it transformative.

The process is not automatic. Naive fine-tuning on noisy data can make retrieval worse. False negatives hiding among the hard negatives degrade training, and denoised hard negatives, filtered with a cross-encoder, significantly outperform naive selection (Qu et al., 2021). The model has to learn what is similar-but-wrong, not only what is right.

The standard fine-tuning objective is contrastive learning. Given a query $q$, a relevant document $d^+$, and a set of irrelevant documents $\{d^-_1, \ldots, d^-_n\}$, the model learns embeddings where the similarity between $q$ and $d^+$ is high and the similarity between $q$ and each $d^-_i$ is low. The InfoNCE loss formalizes this:

$\mathcal{L} = -\log \frac{e^{\text{sim}(q, d^+) / \tau}}{e^{\text{sim}(q, d^+) / \tau} + \sum_{i=1}^{n} e^{\text{sim}(q, d^-_i) / \tau}}$

Here $\text{sim}$ is typically cosine similarity and $\tau$ is a temperature that controls the sharpness of the distribution. The quality of the negatives $d^-_i$ matters enormously. Random negatives (documents that are obviously unrelated) provide little learning signal, because the model can distinguish them trivially. Hard negatives (documents that look relevant but are not) force the model to learn the fine-grained boundaries of relevance in the target domain.
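To make the role of negatives concrete, here is a minimal sketch of the loss above for a single query, in plain Python. The toy two-dimensional vectors and the temperature of 0.05 are illustrative assumptions, not values from any cited paper; real training would batch this inside a deep-learning framework.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, tau=0.05):
    """InfoNCE loss for one query: -log softmax of the positive's scaled similarity."""
    logits = [cosine(query, positive) / tau]
    logits += [cosine(query, neg) / tau for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Toy 2-d embeddings (illustrative values only).
q = [1.0, 0.0]
d_pos = [0.9, 0.1]
random_negative = [-1.0, 0.2]   # obviously unrelated: points away from the query
hard_negative = [0.8, 0.3]      # looks relevant: nearly as close as the positive

loss_random = info_nce(q, d_pos, [random_negative])
loss_hard = info_nce(q, d_pos, [hard_negative])
```

With these toy vectors the random negative leaves the loss at effectively zero, so there is nothing for the model to learn from it, while the hard negative produces a clearly nonzero loss and therefore a gradient.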

Training data quality beats model architecture

The hardest part of fine-tuning embeddings is not the model. It is constructing good training data. The difference between a mediocre fine-tuned model and a great one is almost never the architecture, the learning rate, or the number of epochs. It is the training pairs. A simple contrastive recipe over a carefully curated and filtered pair dataset was enough for the E5 model to beat embedding models with 40 times more parameters on the MTEB benchmark (Wang, Yang et al., 2022). The size of the model lost to the quality of the data.

Three sources supply training pairs, and each has a distinct profile.

Click logs

Rich in signal, full of noise. Users click irrelevant results out of curiosity or by mistake, and position bias inflates top results regardless of actual relevance: users disproportionately click top-ranked results whether or not those results are the best matches (Joachims et al., 2005). The cascade model best explains this bias in the early ranks, where users evaluate results sequentially from the top down (Craswell et al., 2008).

Using raw click logs as training data teaches the model to replicate the current system's failures, including its ranking biases. Debiasing is essential: filter by dwell time (a click followed by a long stay is a stronger signal than a bounce), use session-level signals (a query reformulation right after a click suggests the click was a miss), and apply inverse propensity weighting, which weights each click by the inverse of the estimated probability that its position was examined, to correct for position. A clicked result at position 8 is a stronger relevance signal than one at position 1, because position 1 attracts clicks largely regardless of quality.
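The three debiasing steps can be sketched as a single weighting function. Everything specific here is an assumption for illustration: the `Click` fields, the 30-second dwell cutoff, the weight cap, and above all the examination propensities, which in practice have to be estimated from a system's own traffic rather than copied from a table.

```python
from dataclasses import dataclass

# Hypothetical examination propensities by rank: the estimated probability
# that a user even looks at a result in that position.
EXAMINATION_PROPENSITY = {1: 0.90, 2: 0.65, 3: 0.50, 4: 0.40, 5: 0.32,
                          6: 0.26, 7: 0.22, 8: 0.18, 9: 0.15, 10: 0.12}
MIN_DWELL_SECONDS = 30.0   # assumed cutoff separating a read from a bounce
MAX_WEIGHT = 10.0          # clip so rare low-rank clicks cannot dominate training

@dataclass
class Click:
    query: str
    doc_id: str
    position: int
    dwell_seconds: float
    reformulated_after: bool  # user rewrote the query right after this click

def training_weight(click):
    """Return an inverse-propensity weight, or 0.0 if the click is filtered out."""
    if click.dwell_seconds < MIN_DWELL_SECONDS or click.reformulated_after:
        return 0.0
    propensity = EXAMINATION_PROPENSITY.get(click.position)
    if propensity is None:
        return 0.0  # position outside the table: no propensity estimate
    return min(1.0 / propensity, MAX_WEIGHT)

clicks = [
    Click("5/16-18 hex bolt", "doc-a", position=1, dwell_seconds=95.0, reformulated_after=False),
    Click("5/16-18 hex bolt", "doc-b", position=8, dwell_seconds=120.0, reformulated_after=False),
    Click("5/16-18 hex bolt", "doc-c", position=2, dwell_seconds=4.0, reformulated_after=False),
    Click("5/16-18 hex bolt", "doc-d", position=3, dwell_seconds=60.0, reformulated_after=True),
]
weights = {c.doc_id: training_weight(c) for c in clicks}
```

Under these assumed propensities the position-8 click carries several times the weight of the position-1 click, and the bounce and the reformulated click drop out entirely.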

Human labels

High quality, low scale. Expert annotators labeling thousands of query-document pairs is expensive and slow, and most teams cannot sustain it as a primary data source. Even a small set of human labels is valuable, but its highest-leverage use is evaluation, not bulk training. A curated evaluation set of 200 to 300 queries is the compass that shows whether the rest of this work is paying off.

Synthetic pairs from LLMs

Scalable, but biased toward generic language when generated without grounding. An LLM asked to generate queries for a document, with no in-domain examples to anchor it, produces "reasonable" queries rather than the abbreviated, jargon-heavy strings real users type. The distribution is wrong, and poor generation quality on specialized corpora is the central failure mode of unguided synthetic approaches (Wang, Thakur, Reimers & Gurevych, 2022). This is the opposite end of the few-shot, in-domain-prompted generation described earlier: a handful of real examples plus consistency filtering keeps the generated queries on-distribution; generating blind does not.
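One simple form of consistency filtering is a round-trip check: a generated query is kept only if running it against the corpus brings back the document it was generated from. The sketch below assumes a toy token-overlap scorer as a stand-in for a real retriever, purely to stay runnable; the function names and the `top_k` default are hypothetical.

```python
def token_overlap_score(query, document):
    """Toy relevance scorer: fraction of query tokens found in the document.
    A stand-in for a real retriever, used only to keep the sketch runnable."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & doc_tokens) / len(query_tokens)

def consistency_filter(pairs, corpus, score=token_overlap_score, top_k=1):
    """Keep (query, doc_id) pairs whose source document ranks in the top_k
    results when the generated query is run against the whole corpus."""
    kept = []
    for query, source_id in pairs:
        ranked = sorted(corpus, key=lambda doc_id: score(query, corpus[doc_id]),
                        reverse=True)
        if source_id in ranked[:top_k]:
            kept.append((query, source_id))
    return kept

corpus = {
    "bolt-18": "hex bolt 5/16-18 coarse thread zinc plated",
    "bolt-24": "hex bolt 5/16-24 fine thread zinc plated",
    "washer": "flat washer 5/16 zinc plated",
}
generated = [
    ("5/16-18 hex bolt coarse", "bolt-18"),  # specific: retrieves its own source
    ("hex bolt zinc plated", "washer"),      # off-target: other documents match better
]
kept = consistency_filter(generated, corpus)
```

The second pair is discarded because the query it pairs with the washer is answered better by both bolt documents, which is exactly the kind of generic, off-distribution query that blind generation tends to produce.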

Combining all three

The approach that works is strategic combination rather than reliance on any single source. Fine-tuning on a mixture of synthetic and labeled data has set new state-of-the-art results on the MTEB and BEIR benchmarks (Wang, Yang et al., 2024). The working recipe:

Click logs as bulk training signal, filtered aggressively. Session-level signals (reformulations, return-to-results) are more reliable than raw clicks.
Human labels reserved for evaluation and hard-case calibration.
Synthetic pairs, generated with in-domain grounding, to fill coverage gaps in underrepresented query types.

Figure: a left-to-right flow diagram of the training-data recipe. Three source boxes, click logs (high volume, noisy), human labels (high quality, low scale), and grounded synthetic pairs (scalable, on-distribution), feed into a contrastive fine-tuning step that uses the InfoNCE objective with cross-encoder-mined hard negatives, producing a domain-adapted embedding model.

Hard negatives are non-negotiable

The model does not only need to learn what is relevant. It needs to learn what looks relevant but is not. "Nike Air Max 90" and "Nike Air Max 97" are close in embedding space, but for a shopper who wants the 90, the 97 is the wrong product. Hard negatives teach that distinction.

Uninformative negatives (random documents that are obviously irrelevant) cause diminishing gradients and slow convergence. Global hard negatives selected through an approximate nearest-neighbor index improve both training quality and retrieval performance substantially (Xiong et al., 2021). The best architecture with weak training pairs loses to a modest architecture with strong ones, every time.
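A minimal sketch of hard-negative mining with denoising, under stated assumptions: a brute-force similarity search stands in for the approximate nearest-neighbor index, the cross-encoder scores are supplied as a precomputed dictionary rather than produced by a real model, and the 0.8 false-negative threshold is an arbitrary illustrative value.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, cross_scores,
                        num_negatives=2, false_negative_threshold=0.8):
    """Pick the documents nearest to the query that are not labeled positive,
    skipping any the cross-encoder scores as probably relevant."""
    candidates = [doc_id for doc_id in doc_vecs if doc_id not in positive_ids]
    # Brute-force nearest neighbors; an ANN index would replace this at scale.
    candidates.sort(key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]), reverse=True)
    negatives = []
    for doc_id in candidates:
        # A candidate with no cross-encoder score is treated as score 0.0 (kept).
        if cross_scores.get(doc_id, 0.0) >= false_negative_threshold:
            continue  # likely an unlabeled positive: unsafe to train against
        negatives.append(doc_id)
        if len(negatives) == num_negatives:
            break
    return negatives

query_vec = [1.0, 0.0]  # toy embedding for a query about the Air Max 90
doc_vecs = {
    "air-max-90": [0.98, 0.05],          # labeled positive
    "air-max-90-restock": [0.97, 0.08],  # same product, never labeled
    "air-max-97": [0.95, 0.20],          # looks relevant, is not
    "air-max-95": [0.90, 0.35],
    "garden-hose": [-0.20, 0.90],        # obviously unrelated
}
cross_scores = {"air-max-90-restock": 0.93, "air-max-97": 0.35,
                "air-max-95": 0.30, "garden-hose": 0.01}

hard_negatives = mine_hard_negatives(query_vec, doc_vecs, {"air-max-90"}, cross_scores)
```

The nearest unlabeled document is the restocked listing of the same shoe, and the cross-encoder filter is what keeps it from being trained on as a negative; the 97 and 95 are what remain.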

The target keeps moving

One more property holds regardless of which model currently leads a public benchmark: no single model dominates across all domains (BEIR benchmark, Thakur et al., 2021). The model that tops an aggregate leaderboard can underperform on a specific domain, so last year's default is rarely this year's best choice. The only leaderboard that decides anything is the one built from actual queries in the target domain.

That makes model selection a measurement problem rather than a reading-the-leaderboard problem (the evaluation foundation that measurement rests on, from NDCG to offline-versus-online testing, is covered separately). Shortlist the top three to five candidates that fit the latency and memory budget, benchmark each on a representative sample of production queries with human relevance judgments, and keep the one that performs best on the domain's own data. Public benchmark scores are useful for narrowing the shortlist and useless for the final pick: a model scoring high on a general benchmark but low on a domain-specific evaluation set is worse than the reverse.
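The comparison step can be sketched as a small nDCG@10 harness over the team's own judged queries. The model names, queries, rankings, and relevance grades below are invented for illustration, and the DCG here uses linear gain; some implementations use an exponential gain of 2^rel - 1 instead, which yields different absolute numbers.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """nDCG@k for one query. `judgments` maps doc_id -> graded relevance;
    documents without a judgment count as 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(judgments.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def mean_ndcg(run, qrels, k=10):
    """Average nDCG@k over every judged query; a query with no results scores 0."""
    return sum(ndcg_at_k(run.get(q, []), qrels[q], k) for q in qrels) / len(qrels)

# Hypothetical human judgments and candidate rankings.
qrels = {"MI treatment": {"d1": 2, "d2": 1}, "12 gauge wire": {"d7": 2}}
runs = {
    "general-model": {"MI treatment": ["d9", "d2", "d1"], "12 gauge wire": ["d8", "d7"]},
    "adapted-model": {"MI treatment": ["d1", "d2", "d9"], "12 gauge wire": ["d7", "d8"]},
}
scores = {name: mean_ndcg(run, qrels) for name, run in runs.items()}
best_model = max(scores, key=scores.get)
```

A two-query example only shows the mechanics. A real comparison would run over the full evaluation set, and a difference in mean nDCG would need a significance test before it decides anything.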

What this means before a fundraise or a pilot

Domain adaptation fixes the model. Proving the model works to someone who did not build it is a separate problem with its own requirements (independence, statistical rigor, and a documented methodology a third party can reproduce), and it is the one that tends to decide a fundraise or an enterprise pilot.

Related reading

What investors look for when they ask how you evaluate retrieval: the independence, significance, and reproducibility a fundraise or pilot conversation actually demands.
The 4-12 week window for a pre-funding retrieval audit: when to produce that evidence so a finding can still be remediated.

The sample report shows what that proof looks like: the full methodology run on a public benchmark (a financial-domain question-answering pipeline, FiQA), with graded relevance judgments, bootstrap confidence intervals, and paired significance tests against a documented baseline, in the structure a real engagement produces. Download the sample report to see how an independent measurement reads to a skeptical evaluator. The methodology behind it is documented end to end in Designing Hybrid Search Systems, so the report can be reproduced against a team's own pipeline rather than taken on trust as a black box.