Generic Embeddings Failed Us
Off-the-shelf embeddings get demos working. Domain-adapted models get production working. The gap is up to 9.3 nDCG@10 points.
Abstract
Off-the-shelf embedding models are the quickest path to a working semantic search demo and the slowest path to production-quality retrieval. General-purpose models trained on broad web data understand general language well but fail silently on the domain-specific terminology, jargon, and long-tail queries that real users actually type. The gap between demo and production is where most search projects stall. This article covers why generic embeddings fail, how fine-tuning on domain data produces a step change in quality, what the current embedding landscape looks like, and why training data quality matters more than model architecture.
The Demo-to-Production Gap
The trajectory plays out repeatedly. Day one: integrate OpenAI embeddings (or Cohere, Voyage, or any open-source model from HuggingFace). Build a quick prototype. Demo looks great. Stakeholders are impressed.
Month three: "Why isn't semantic search improving our metrics?"
The embedding model does not understand domain terminology. Dense retrievers trained on MS MARCO severely underperform BM25 on specialized domains like BioASQ, TREC-COVID, SciFact, and FiQA (Thakur et al., 2021). Industry jargon maps to wrong regions of the embedding space. Acronyms are meaningless to the model. Product-specific language is noise.
This is not an OpenAI-specific problem. It applies to any general-purpose embedding model. These models are trained on broad web data. They understand general language. They do not understand the vocabulary of a specific medical practice, a specific manufacturing process, or a specific product catalog.
Every domain has vocabulary that general models mishandle: medical terminology where abbreviations carry precise meaning, legal jargon where similar-sounding terms have distinct legal consequences, e-commerce product attributes where "12 gauge" means entirely different things in different categories, manufacturing specifications, financial instruments. The long tail of domain-specific terms is where general embeddings silently fail.
The failure is insidious because it does not show up in demos. Demo queries use common language: "Find similar products." "Search for relevant documents." These work fine. Production queries are different. Users type abbreviations, internal codes, domain shorthand. Dense retriever performance severely degrades under a domain shift, with gaps reaching up to 9.3 nDCG@10 points between zero-shot general models and domain-adapted ones (Wang, Thakur, Reimers & Gurevych, 2022).
Fine-Tuning Changes Everything
Fine-tuning on domain-specific query-document pairs fixes the gap. Using an architecture like E5 as a base and training on actual domain data produces results that are not incremental. They are a step change (Wang, Thakur, Reimers & Gurevych, 2022).
The real wins show up in the long tail. Queries that previously returned irrelevant results or nothing at all suddenly work. The model learns domain-specific relationships that no synonym dictionary or prompt engineering can capture. A medical search system learns that "MI" and "myocardial infarction" and "heart attack" should produce the same results. A parts catalog learns that "5/16-18 hex bolt" is not similar to "5/16-24 hex bolt" despite their surface resemblance.
The training data does not need to be enormous. Starting from as few as 8 examples, an LLM query generator can create training data enabling retrievers to outperform ColBERT v2 by more than 1.2 nDCG@10 (Dai et al., 2023). A few thousand well-curated query-document pairs make a measurable difference. Tens of thousands make it transformative.
The process matters, though. Naive fine-tuning on noisy data can make things worse. False negatives among hard negatives degrade training, and "denoised hard negatives" (cross-encoder filtering of candidate negatives) significantly outperform naive approaches (Qu et al., 2021). The model needs to learn what is similar-but-wrong, not just what is right.
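To make the denoising step concrete, here is a minimal sketch of cross-encoder filtering of candidate negatives. The `score_fn` parameter and the threshold value are assumptions for illustration; in practice `score_fn` would wrap a strong cross-encoder's relevance score, and candidates the cross-encoder judges relevant are discarded rather than trained on as negatives.

```python
from typing import Callable, List, Tuple

def denoise_hard_negatives(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    relevance_threshold: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Split candidate negatives into kept hard negatives and suspected
    false negatives. Candidates that the scorer considers relevant are
    unsafe to use as negatives: training against them teaches the model
    to push away documents that are actually correct answers."""
    kept, suspected_false = [], []
    for doc in candidates:
        if score_fn(query, doc) >= relevance_threshold:
            suspected_false.append(doc)  # likely relevant: drop from negatives
        else:
            kept.append(doc)
    return kept, suspected_false
```

The design choice worth noting: suspected false negatives are returned rather than silently dropped, because they are often good candidates for promotion to additional positives after human review.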
The typical fine-tuning objective is contrastive learning. Given a query $q$, a relevant document $d^+$, and a set of irrelevant documents $\{d^-_1, \ldots, d^-_n\}$, the model learns to produce embeddings where the similarity between $q$ and $d^+$ is high and the similarity between $q$ and each $d^-_i$ is low. The InfoNCE loss formalizes this:

$$\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(q, d^+)/\tau)}{\exp(\mathrm{sim}(q, d^+)/\tau) + \sum_{i=1}^{n} \exp(\mathrm{sim}(q, d^-_i)/\tau)}$$

Where $\mathrm{sim}(\cdot, \cdot)$ is typically cosine similarity and $\tau$ is a temperature parameter that controls the sharpness of the distribution. The quality of negatives matters enormously. Random negatives (documents that are obviously unrelated) provide little learning signal because the model can trivially distinguish them. Hard negatives (documents that are similar but not relevant) force the model to learn the fine-grained boundaries of relevance in the target domain.
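A minimal numpy sketch of this loss for a single (query, positive, negatives) triple; the temperature default is illustrative, not prescriptive:

```python
import numpy as np

def info_nce_loss(q, d_pos, d_negs, tau=0.05):
    """InfoNCE loss for one training example: negative log of the
    softmax probability assigned to the positive document, with
    cosine similarity as sim(.,.) and temperature tau."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Positive goes at index 0; negatives follow.
    logits = np.array([cos(q, d_pos) / tau] + [cos(q, d) / tau for d in d_negs])
    # Numerically stable log-sum-exp over all candidates.
    log_z = np.log(np.sum(np.exp(logits - logits.max()))) + logits.max()
    return float(log_z - logits[0])
```

A random negative that is nearly orthogonal to the query contributes almost nothing to the loss, while a hard negative that sits close to the positive dominates it, which is the quantitative form of the point about negative quality above.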
The Embedding Landscape Moves Fast
The embedding model selected 12 months ago is probably no longer the best option.
E5 was an era-defining default. It introduced "query:" and "passage:" prefixes for asymmetric retrieval (short queries matched against long passages) and was the first model to outperform BM25 zero-shot on BEIR without labeled data (Wang, Yang et al., 2022). It fine-tuned predictably and offered a practical balance of size, speed, and quality.
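The prefix convention is simple enough to show directly. This sketch only prepares the input strings; the actual encoding with an E5 checkpoint is assumed to happen downstream:

```python
def e5_inputs(queries, passages):
    """Apply E5's asymmetric prefixes. Queries and passages get
    different prefixes so the model can treat the short-query side
    and the long-passage side of retrieval differently."""
    return (
        [f"query: {q}" for q in queries],
        [f"passage: {p}" for p in passages],
    )
```

Forgetting these prefixes at inference time (after training with them) is a classic source of silently degraded E5-family retrieval quality.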
The field has moved past it.
EmbeddingGemma (300M parameters) achieves state-of-the-art scores for models under 500M parameters on MTEB Multilingual, English, and Code leaderboards. It outperforms multilingual-e5-large (560M) despite being nearly half the size. The key innovations include encoder-decoder initialization from Gemma 3, geometric embedding distillation from a larger teacher model, and spread-out regularization that makes embeddings robust even under int4/int8 quantization (Vera et al., 2025).
At the larger end, Qwen3-Embedding-8B leads the multilingual MTEB at around 70.58, NV-Embed-v2 hits 72.31 on the legacy English benchmark, and Gemini Embedding 001 leads the refreshed English MTEB at 68.32 (MTEB Leaderboard, 2026). These are not marginal improvements over the E5 generation. They are step changes.
The efficiency story has also shifted. EmbeddingGemma runs in under 200MB of RAM with quantization and produces embeddings in sub-15ms on edge hardware. Matryoshka representation learning allows truncation from 768 to 128 dimensions with minimal quality loss (Vera et al., 2025). That kind of flexibility was not available when E5 was the default.
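Matryoshka truncation is mechanically trivial, which is part of its appeal. A sketch, assuming embeddings from a Matryoshka-trained model (for other models, the leading dimensions carry no special status and this truncation would hurt badly):

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dim: int = 128) -> np.ndarray:
    """Keep only the first `dim` dimensions and L2-renormalize so
    cosine similarity remains well-behaved. Matryoshka training packs
    most of the information into the leading dimensions, which is why
    this crude slice loses little quality."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)
```

Going from 768 to 128 dimensions cuts vector storage and ANN index size by 6x, which often matters more than a point of nDCG at serving time.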
What has not changed: no single model dominates across all domains (Thakur et al., 2021). The model that tops MTEB overall may underperform on a specific domain. The only leaderboard that matters is the one built from actual queries in the target domain.
The practical model selection process: take the top 3 to 5 candidates from the current MTEB leaderboard (filtering for models that fit your latency and memory budget), benchmark each on a representative sample of production queries with human relevance judgments, and pick the one that performs best on your data. MTEB scores are useful for shortlisting. They are not useful for final selection. A model that scores 72 on MTEB but 0.65 on your domain-specific evaluation set is worse than a model scoring 68 on MTEB but 0.78 on your data.
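The benchmarking step reduces to computing nDCG@10 per candidate model over the same judged queries. A self-contained sketch of that metric (the query and document names in the usage are hypothetical):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """nDCG@k for one query. `judgments` maps doc_id -> graded relevance."""
    gains = [judgments.get(d, 0) for d in ranked_doc_ids]
    ideal = sorted(judgments.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def mean_ndcg(model_rankings, all_judgments, k=10):
    """Average nDCG@k across the evaluation queries: one number per
    candidate model, comparable across models on the same judgments."""
    scores = [ndcg_at_k(model_rankings[q], all_judgments[q], k) for q in all_judgments]
    return sum(scores) / len(scores)
```

Run each shortlisted model's retrieval over the 200-300 judged queries, compute `mean_ndcg` per model, and the selection decision falls out of a single table.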
Training Data Quality Beats Model Architecture
The hardest part of fine-tuning embeddings is not the model. It is constructing good training data. The difference between a mediocre fine-tuned model and a great one is almost never the architecture, the learning rate, or the number of epochs. It is the training pairs.
Click Logs
Rich in signal, full of noise. Users click irrelevant results out of curiosity or misclicks. Position bias inflates top results regardless of actual relevance: users disproportionately click top-ranked results regardless of whether those results are the best matches (Joachims et al., 2005). The cascade model best explains this bias in early ranks, where users evaluate results sequentially from top to bottom (Craswell et al., 2008).
Using raw click logs as training data trains the model to replicate the current system's failures, including its ranking biases. The debiasing step is essential: filter by dwell time (users who clicked and stayed are more likely to have found what they wanted), use session-level signals (did the user reformulate their query after clicking, suggesting dissatisfaction?), and apply inverse propensity weighting to correct for position bias. A clicked result at position 8 is a stronger relevance signal than a clicked result at position 1, because position 1 gets clicked regardless of quality.
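A minimal sketch of those two debiasing steps together. The propensity model here (examination probability proportional to 1/position) is a simplifying assumption for illustration; production systems estimate propensities from intervention data rather than assuming a functional form:

```python
def debias_clicks(click_log, min_dwell_seconds=10.0, gamma=1.0):
    """Turn raw click events into weighted (query, doc_id, weight)
    training pairs.
    - Dwell-time filter: drop clicks where the user bounced quickly,
      treating them as misclicks or dissatisfaction.
    - Inverse propensity weighting: upweight clicks at low positions,
      since top positions attract clicks regardless of relevance.
      Examination propensity is modeled as 1 / position**gamma.
    Each event is a dict with 'query', 'doc_id', 'position' (1-based),
    and 'dwell' in seconds."""
    pairs = []
    for e in click_log:
        if e["dwell"] < min_dwell_seconds:
            continue  # likely a misclick or bounce: no relevance signal
        propensity = 1.0 / (e["position"] ** gamma)
        weight = 1.0 / propensity  # IPW: rarely-examined positions count more
        pairs.append((e["query"], e["doc_id"], weight))
    return pairs
```

With `gamma=1.0`, a dwelled click at position 8 carries 8x the weight of one at position 1, matching the intuition above that deep clicks are the stronger signal.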
Human Labels
High quality, low scale. Having expert annotators label thousands of query-document pairs is expensive and slow. Most teams cannot sustain it as a primary data source. Even a small set of human labels is valuable, but the highest-leverage use is for evaluation, not training. A curated evaluation set of 200 to 300 queries provides the compass; spending those labels as bulk training signal wastes them.
Synthetic Pairs from LLMs
Scalable but biased toward generic language. An LLM generating queries for a document will produce "reasonable" queries, not the abbreviated, jargon-heavy queries real users actually type. The distribution is wrong. Bad generation quality on specialized corpora is the key failure mode of purely synthetic approaches (Wang, Thakur, Reimers & Gurevych, 2022).
Combining All Three
The approach that works is strategic combination. A mixture of synthetic and labeled data achieves new state-of-the-art results on MTEB and BEIR, beating models with 40 times more parameters (Wang, Yang et al., 2024). The recipe:
- Click logs as bulk training signal, but filtered aggressively. Session-level signals (reformulations, return-to-results) are more reliable than raw clicks.
- Human labels for evaluation and hard-case calibration.
- Synthetic pairs to fill coverage gaps in underrepresented query types.
Hard Negatives Are Non-Negotiable
The model does not just need to learn what is relevant. It needs to learn what looks relevant but is not. "Nike Air Max 90" and "Nike Air Max 97" are similar. But for a user searching for the 90, the 97 is wrong. Hard negatives teach this distinction.
Uninformative negatives (random documents that are obviously irrelevant) cause diminishing gradients and slow convergence. Global hard negatives selected via approximate nearest neighbor (ANN) index dramatically improve training quality and retrieval performance (Xiong et al., 2021).
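The mining step itself is straightforward: embed the corpus, find the query's nearest neighbors, and keep the near-misses that are not labeled positive. Brute-force search stands in here for the ANN index (e.g. a FAISS index) that a production pipeline would use over a large corpus:

```python
import numpy as np

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, k=5):
    """Return indices of the k documents closest to the query (by
    cosine similarity) that are NOT labeled positive. These near-miss
    documents are the hard negatives: similar enough to be confusable,
    wrong enough to carry training signal."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)  # most similar first
    hard = [int(i) for i in order if int(i) not in positive_ids]
    return hard[:k]
```

In a real pipeline these mined candidates would then pass through the cross-encoder denoising step described earlier, since the nearest non-positive neighbors are exactly where unlabeled true positives hide.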
Data quality beats model complexity. Every time. The Wang, Yang et al. (2024) result above is the proof: a simple contrastive training procedure with high-quality data outperforming models 40x its size. The best architecture with bad training data loses to a decent architecture with great training data.
Conclusion
Generic embeddings are a starting point, not a destination. They get a demo working. They do not get a production system working. The path from demo to production runs through domain adaptation: fine-tuning on domain-specific query-document pairs, constructing high-quality training data from multiple sources, and investing in hard negatives that teach the model the boundaries of relevance in a specific domain.
The embedding landscape will continue to move fast. Models will continue to improve. But the fundamental principle holds regardless of which model is currently leading MTEB: an embedding model is only as good as its training pairs, and domain-specific data is the competitive asset that no leaderboard score can substitute for.
References
- Craswell, N., Zoeter, O., Taylor, M., & Ramsey, B. (2008). "An Experimental Comparison of Click Position-bias Models." WSDM 2008, pp. 87-94. https://dl.acm.org/doi/10.1145/1341531.1341545
- Dai, Z., Zhao, V. Y., Ma, J., Luan, Y., Ni, J., Lu, J., Bakalov, A., Guu, K., Hall, K. B., & Chang, M. (2023). "Promptagator: Few-shot Dense Retrieval From 8 Examples." ICLR 2023. https://arxiv.org/abs/2209.11755
- Joachims, T., Granka, L., Pan, B., Hembrooke, H., & Gay, G. (2005). "Accurately Interpreting Clickthrough Data as Implicit Feedback." SIGIR 2005, pp. 154-161. https://dl.acm.org/doi/10.1145/1076034.1076063
- MTEB Leaderboard (2026). https://huggingface.co/spaces/mteb/leaderboard
- Qu, Y., et al. (2021). "RocketQA: An Optimized Training Approach to Dense Passage Retrieval." NAACL 2021, pp. 5835-5847. https://aclanthology.org/2021.naacl-main.466/
- Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." NeurIPS 2021. https://arxiv.org/abs/2104.08663
- Vera, L., Dua, R., Zhang, Y., Salz, D., Mullins, R., et al. (2025). "EmbeddingGemma: Powerful and Lightweight Text Representations." arXiv:2509.20354. https://arxiv.org/abs/2509.20354
- Wang, K., Thakur, N., Reimers, N., & Gurevych, I. (2022). "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval." NAACL 2022. https://aclanthology.org/2022.naacl-main.168/
- Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., & Wei, F. (2022). "Text Embeddings by Weakly-Supervised Contrastive Pre-training." arXiv:2212.03533. https://arxiv.org/abs/2212.03533
- Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., & Wei, F. (2024). "Improving Text Embeddings with Large Language Models." ACL 2024, pp. 11897-11916. https://aclanthology.org/2024.acl-long.642/
- Xiong, L., Xiong, C., Li, Y., Liu, T., Bennett, P. N., Ahmed, J., & Overwijk, A. (2021). "Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval" (ANCE). ICLR 2021. https://arxiv.org/abs/2007.00808
Generic embeddings not cutting it? Let's talk custom models.
Book a Discovery Call