Why E-commerce Search Fails and How Catalog Data Quality Fixes It

A familiar sequence plays out at mid-market brands. Site search converts poorly. The internal read is that the search engine is the weak link, so the team evaluates vendors, signs a new one, runs the migration, and waits for conversion to climb. Often it barely moves. The new engine produces a slightly better version of the same failures, and the brand is left with a larger bill and the same problem.

The reason that sequence disappoints so often is that the diagnosis was wrong. Most search-relevance gaps in mid-market e-commerce do not originate in the retrieval algorithm. They originate in the catalog: missing attributes, inconsistent taxonomy, and vocabulary gaps between the words shoppers type and the words the catalog stores. A new vendor inherits all of it. A better algorithm operating on impoverished data is still operating on impoverished data.

This matters because search is not a side feature. Shoppers who use site search make up roughly a quarter of visitors but drive close to half of revenue, and they convert at more than twice the rate of non-searchers (Constructor, 2025). When search falls short, 72% of shoppers will leave, and 36% go straight to a competitor (Coveo, 2025). A search problem is a revenue problem, which is exactly why the temptation to throw a vendor at it is so strong.

The migration carries the problem forward

The reason a vendor swap so rarely delivers the expected lift is structural. The migration moves the catalog as it is. The new engine indexes the same incomplete attribute coverage, the same inconsistent unit conventions, and the same missing synonyms. Whatever was unfindable before the migration is unfindable after it, because the words and attributes that would make it findable were never in the data.

The migration is also the slowest and most expensive intervention available, and it is the one most exposed to failure: it touches the entire product feed, the index, the front end, and the merchandising rules at once. A brand that treats a search problem as a procurement decision is signing up for that risk, when the actual fix usually lives in the catalog and never requires a new engine at all.

What a catalog audit surfaces

Three categories of problem recur, and all three masquerade as search failures.

Missing or incomplete attributes. Search can only filter, facet, and rank on attributes that exist in the data, and they frequently do not. Across roughly 5,000 laptops collected from a major e-commerce platform, only about 16% had all eight key technical specification attributes populated, with most product descriptions incomplete on at least one dimension (Kunal et al., 2024). Roughly four out of five of those products are invisible to a shopper who filters across the full spec set, not because the products lack the specs but because the catalog never recorded the complete set.
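A coverage figure like the one above can be reproduced on any product feed. The following is a minimal sketch, assuming the feed has already been loaded as a list of dictionaries; the attribute names and the function names are hypothetical placeholders, not fields or methods from the cited study.

```python
from typing import Dict, Iterable, Mapping, Sequence

# Hypothetical attribute names; substitute the spec set that matters for the category.
KEY_ATTRIBUTES = ("brand", "cpu", "ram_gb")


def is_populated(value: object) -> bool:
    """Treat None and blank strings as missing; any other value counts as present."""
    if value is None:
        return False
    if isinstance(value, str) and not value.strip():
        return False
    return True


def full_coverage_rate(
    products: Iterable[Mapping[str, object]],
    key_attributes: Sequence[str] = KEY_ATTRIBUTES,
) -> float:
    """Share of products that have every key attribute populated."""
    items = list(products)
    if not items:
        return 0.0
    complete = sum(
        1
        for product in items
        if all(is_populated(product.get(attr)) for attr in key_attributes)
    )
    return complete / len(items)


def per_attribute_coverage(
    products: Iterable[Mapping[str, object]],
    key_attributes: Sequence[str] = KEY_ATTRIBUTES,
) -> Dict[str, float]:
    """Share of products with each individual attribute populated."""
    items = list(products)
    if not items:
        return {attr: 0.0 for attr in key_attributes}
    return {
        attr: sum(1 for p in items if is_populated(p.get(attr))) / len(items)
        for attr in key_attributes
    }
```

The per-attribute view is usually the more actionable of the two, since it shows which single field is dragging the full-coverage rate down.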

Synonym gaps. The canonical example is the one most teams recognize immediately: a search for "blow dryer" returns nothing on a site whose catalog says "hair dryer," and most shoppers never try the alternate term, assuming zero results means the store does not carry the product (Baymard Institute, 2014). On the top 50 US e-commerce sites, 70% required shoppers to search using the exact jargon the site itself used. The same pattern shows up across categories: "multifunction printer" against "all-in-one printer," "tee shirt" against "t-shirt." The gap has narrowed since, largely because modern search vendors ship default synonym tables, but search UX remains broadly weak: 56% of more than 170 benchmarked sites still rate mediocre or worse (Baymard Institute, 2026). Brand-specific and category-specific vocabulary remains the part no default table covers.

Inconsistent taxonomy and formatting. Two related problems sit in this category. The same attribute appears as "4 oz," "4 ounces," and "4oz" across a single catalog, often because the data arrived from multiple suppliers with no normalization step, which degrades both filtering and matching. Products are also filed under inconsistent categories, so scoped searches and category filters miss items that are present but classified elsewhere. Inconsistency in basic product attributes is a structural driver of low catalog data quality, compounded by the lack of standardized data across suppliers (Niemir and Mrugalska, 2022). The underlying failure modes are well documented: lexical matching is fragile to synonyms and hypernyms, to morphological variants such as "woman" versus "women," and to minor spelling differences (Nigam et al., 2019). Each of those fragilities is a catalog-vocabulary problem before it is an algorithm problem.
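The unit-formatting half of this problem is the kind of thing a small normalization step at ingestion can absorb. Here is a sketch, assuming measurements arrive as free-text strings made of a number followed by a unit; the alias table and the function name are illustrative assumptions and nowhere near exhaustive.

```python
import re
from typing import Optional

# Illustrative alias table (an assumption): every spelling maps to one canonical unit.
UNIT_ALIASES = {
    "oz": "oz",
    "ounce": "oz",
    "ounces": "oz",
    "lb": "lb",
    "lbs": "lb",
    "pound": "lb",
    "pounds": "lb",
}

# A number, optional whitespace, a unit word, and an optional trailing period.
_MEASURE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\.?\s*$")


def normalize_measure(raw: str) -> Optional[str]:
    """Return a canonical 'number unit' string, or None if the value is not recognized."""
    match = _MEASURE.match(raw)
    if match is None:
        return None
    number, unit = match.group(1), match.group(2).lower()
    canonical = UNIT_ALIASES.get(unit)
    if canonical is None:
        return None
    # Drop trailing zeros after a decimal point so "4.0" and "4" agree.
    if "." in number:
        number = number.rstrip("0").rstrip(".")
    return f"{number} {canonical}"


# The three spellings from the text collapse to a single facet value.
variants = ["4 oz", "4 ounces", "4oz"]
normalized = {normalize_measure(v) for v in variants}
```

Returning None for unrecognized values, rather than passing them through, is a deliberate choice in this sketch: it makes the unparsed remainder easy to count and review instead of leaving it silently in the index.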

Why it reads as a search problem

All three categories surface at the shopper as the same symptom: a query returns nothing, or returns the wrong things. From the shopper's seat, the search box failed. From the analytics dashboard, the search engine looks like the culprit. The blame lands on the layer that displayed the failure, not the layer that caused it.

Zero-result rates make the surface area concrete. Practitioners commonly treat a null-results rate above 3 to 5% as suboptimal and aim for below 2% (Algolia). Most stores run well above that target, and the bulk of the gap is vocabulary and coverage, not ranking math. Reading the zero-result log structurally, decomposed by cause rather than as a single rate, is what tells a brand which layer to fix.
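The zero-result rate itself is simple to compute from a search log. A minimal sketch follows, assuming each log row carries the raw query and the number of results returned; the `SearchEvent` shape and the function names are hypothetical, and the 2% constant is the practitioner target quoted above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple

TARGET_RATE = 0.02  # the "below 2%" practitioner target quoted in the text


@dataclass(frozen=True)
class SearchEvent:
    """One logged search; field names are assumptions about the log format."""
    query: str
    result_count: int


def normalize_query(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings group together."""
    return " ".join(query.lower().split())


def zero_result_rate(events: Iterable[SearchEvent]) -> float:
    """Share of searches that returned nothing."""
    items = list(events)
    if not items:
        return 0.0
    return sum(1 for e in items if e.result_count == 0) / len(items)


def top_zero_result_queries(
    events: Iterable[SearchEvent], n: int = 10
) -> List[Tuple[str, int]]:
    """Most frequent normalized queries among searches that returned nothing."""
    counts = Counter(
        normalize_query(e.query) for e in events if e.result_count == 0
    )
    return counts.most_common(n)
```

The ranked list matters more than the headline rate: a handful of high-volume queries often account for a large share of the zero-result traffic, and those are the ones to read by hand.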

The harder version of the same misattribution is the query that does return results, passes the engine's internal relevance test, and still does not convert. For 42% of shoppers, search results are technically relevant to the query yet are not the products they actually want to buy (Constructor, 2025). The engine matched "shirts" to shirts and did its job by the classical definition of relevance. The results were simply not the shirts the shopper had in mind. That is not a retrieval defect; the defect sits one or two layers below the engine that gets blamed for it, in the richness of the catalog and the merchandising rules that order results.

The diagnostic that attributes the gap correctly

The way out is to stop treating "search is bad" as a single problem. It is at least two problems, they live in different layers, and they have different fixes. The mistake that wastes months is fixing the wrong one.

Two query lists localize the failure, and they point at different layers. A diagnostic that reports a single aggregate relevance score cannot tell them apart; it produces one number and attributes the whole gap to "search."

| Symptom | What the analytics show | Where the failure lives | The fix |
|---|---|---|---|
| Top zero-result queries | Shoppers search, nothing comes back | Catalog layer: missing attributes, vocabulary the catalog does not recognize | Vocabulary and coverage work: synonyms, attribute enrichment |
| Top high-CTR, low-conversion queries | Results returned, shoppers clicked, then did not buy | Ranking and merchandising layer | Merchandising rules and ranking, not an engine swap |
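The two rows of the table can be expressed as a rule over per-query statistics. The sketch below assumes search analytics have been aggregated to one record per query; every threshold is an arbitrary illustrative default rather than a recommended value, and the field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStats:
    """Per-query aggregates; field names are assumptions about the analytics export."""
    query: str
    searches: int
    zero_result_searches: int
    clicks: int
    orders: int


def localize(
    stats: QueryStats,
    min_searches: int = 50,    # below this volume, the rates are too noisy to read
    zero_share: float = 0.5,   # share of searches returning nothing
    high_ctr: float = 0.4,     # clicks per search
    low_cvr: float = 0.01,     # orders per search
) -> str:
    """Point a query at the layer the table associates with its symptom."""
    if stats.searches < min_searches:
        return "insufficient data"
    if stats.zero_result_searches / stats.searches >= zero_share:
        # Nothing came back: vocabulary or attribute coverage in the catalog.
        return "catalog"
    ctr = stats.clicks / stats.searches
    cvr = stats.orders / stats.searches
    if ctr >= high_ctr and cvr <= low_cvr:
        # Results looked plausible enough to click but did not sell.
        return "ranking/merchandising"
    return "no flag"
```

Running this over the top few hundred queries by volume yields the two lists the table describes, already split by layer, which is the separation a single aggregate relevance score cannot provide.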

The split is well established in product-search research. Matching, the task of retrieving the relevant products at all, and ranking, the task of ordering them well, are distinct subtasks with distinct failure modes (Nigam et al., 2019). The large public benchmark from the Amazon KDD Cup labels every query-product pair as Exact, Substitute, Complement, or Irrelevant (Reddy et al., 2022): an "Irrelevant" result is a coverage and catalog signal, while misordered Substitutes and Complements are a ranking signal. Vendor query pipelines encode the same logic operationally, mapping each failure type to its own intervention: synonyms for vocabulary gaps, spell correction for input errors, query relaxation for over-constrained retrieval (Bloomreach documentation). Measuring data quality alongside relevance is what lets a diagnostic say which layer is actually broken, instead of handing back a single score and a vendor to blame.

The cheapest fix is usually not a new vendor

The practical consequence is that the highest-leverage improvement is often the least expensive one, and it is not a migration. Adding synonyms is merchandising work measured in days, not the quarters a replatform consumes, and it sidesteps the migration risk entirely.

A single synonym entry can recover a high-intent query that was returning nothing. When AirPods launched, shoppers searching that term on a site that catalogued the category as "earbuds" got zero results until a synonym was added, after which the same query found the product (Klevu case study, archived). Brands accumulate thousands of these over time; one retailer maintained more than 2,000 synonyms by hand for variant spellings and phrasings before search volume forced automation (Klevu case study). None of that is an algorithm change. It is catalog vocabulary, and it routinely moves more revenue than the engine swap a brand was about to fund.
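The mechanics of a synonym entry are small enough to show in a few lines. This is a toy sketch, not any vendor's implementation: the substring matcher stands in for a real lexical engine, and the synonym table, catalog titles, and function names are made up for illustration.

```python
from typing import Dict, List, Set

# Hypothetical one-way synonym table: shopper vocabulary -> catalog vocabulary.
SYNONYMS: Dict[str, Set[str]] = {
    "airpods": {"earbuds"},
    "blow dryer": {"hair dryer"},
}

CATALOG: List[str] = [
    "Wireless Earbuds with Charging Case",
    "Ionic Hair Dryer 1800W",
]


def expand_query(query: str, synonyms: Dict[str, Set[str]]) -> Set[str]:
    """Return the normalized query plus any catalog terms it maps to."""
    normalized = " ".join(query.lower().split())
    return {normalized} | synonyms.get(normalized, set())


def search(query: str, catalog: List[str], synonyms: Dict[str, Set[str]]) -> List[str]:
    """Toy lexical match: a title matches if it contains any expanded term."""
    terms = expand_query(query, synonyms)
    return [title for title in catalog if any(t in title.lower() for t in terms)]


# Without the entry the query finds nothing; with it, the same engine finds the product.
before = search("AirPods", CATALOG, synonyms={})
after = search("AirPods", CATALOG, synonyms=SYNONYMS)
```

The matcher is identical in both calls. Only the vocabulary table changed, which is the point the paragraph above makes about where the fix lives.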

This is not an argument that search engines never matter. At very large scale, neural retrieval does outperform lexical matching on the same catalog (Nigam et al., 2019), and some vendor switches do lift conversion. The published case studies that show those lifts, though, almost always bundle the new engine with catalog enrichment and merchandising tooling the brand was not using before. The enrichment is doing much of the work that gets credited to the algorithm. For a mid-market brand without a dedicated ML team, the catalog layer is almost always both the cheaper lever and the larger one.

The point is narrower and more useful than "engines are overrated": the problem cannot be assigned to search or to the catalog until both are measured, separately, on the brand's own queries. Speed is no substitute for that measurement; a fast search system says nothing about whether it works until relevance itself is measured. Switching vendors before then is a bet placed without reading the cards. Sometimes the algorithm is the issue. More often it is not, and the brand finds out only after the migration fails to deliver.

See the separation on a real catalog

The sample diagnostic report does exactly this separation, applied to a public dataset so the format is fully visible before any engagement. It runs against WANDS, the Wayfair product-search relevance dataset of 42,994 products and 233,448 human relevance judgments across 480 queries (Chen et al., 2022), and it includes a catalog-quality section that shows where coverage failures end and ranking failures begin. It is the difference between a single relevance score and a report that names the layer to fix first.

Download the sample report to see the diagnostic applied to a real catalog, including the catalog-quality breakdown and the query-stratified view that separates the vocabulary problem from the ranking problem.