1. The AI shopping channel is real, and it's growing fast
Over the last 22 days we captured 1,047,024 distinct product recommendations from six AI agents, running our standardized buyer-intent query set across five verticals (apparel, beauty, home, food, electronics). That's an average of 47,592 captures per day, and the daily volume is growing.
Each capture is a record of what an agent returned when our benchmark suite issued a buyer-style query. It is not a sample of real shopper traffic (which is private to each agent). Two takeaways from the benchmark:
- The answers are deterministic enough to measure. Issue the same buyer-style query 30 days apart and you get largely overlapping product lists, meaning catalog-level signal is what moves the answers, not query phrasing noise.
- The answers are concentrated. We'll show below that the top 10 brands receive a disproportionate share of mentions, which is bad news for the long tail and great news for whoever's optimizing.
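One way to put a number on "largely overlapping" is a set-overlap score between two runs of the same query. The sketch below uses Jaccard similarity; this is an illustrative choice, not necessarily the metric used in the benchmark, and the product identifiers are hypothetical.

```python
def jaccard_overlap(first_run, second_run):
    """Share of distinct products that appear in both runs (0.0 to 1.0)."""
    a, b = set(first_run), set(second_run)
    if not a and not b:
        return 1.0  # two empty answers are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical product identifiers, for illustration only.
run_day_0 = ["jacket-a", "jacket-b", "jacket-c", "jacket-d"]
run_day_30 = ["jacket-b", "jacket-c", "jacket-d", "jacket-e"]

# Three shared products out of five distinct ones.
print(f"{jaccard_overlap(run_day_0, run_day_30):.2f}")  # prints 0.60
```

A score near 1.0 means the two runs returned almost the same products; a score near 0.0 means the answer changed almost entirely.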
3. Top 10 brands AI agents recommend most
From the 200 distinct brands we observed in the last 90 days, the top 10 received the following mention counts:
| # | Brand | Mentions | Agents |
|---|---|---|---|
| 1 | Amazon (amazon.com) | 64,426 | Claude, DeepSeek, Gemini, Mistral, ChatGPT |
| 2 | Patagonia (patagonia.com) | 26,846 | Claude, DeepSeek, Gemini, Mistral, ChatGPT |
| 3 | Target (target.com) | 25,664 | Claude, DeepSeek, Gemini, Mistral, ChatGPT |
| 4 | Uniqlo (uniqlo.com) | 21,267 | Claude, DeepSeek, Gemini, Mistral, ChatGPT |
| 5 | Nike (nike.com) | 20,871 | Claude, DeepSeek, Gemini, Mistral, ChatGPT |
| 6 | Walmart (walmart.com) | 17,097 | Claude, DeepSeek, Gemini, Mistral, ChatGPT |
| 7 | Adidas (adidas.com) | 16,264 | Claude, DeepSeek, Gemini, Mistral, ChatGPT |
| 8 | Columbia (columbia.com) | 16,108 | Claude, DeepSeek, Gemini, Mistral, ChatGPT |
| 9 | Thenorthface (thenorthface.com) | 14,446 | Claude, DeepSeek, Gemini, Mistral, ChatGPT |
| 10 | Rei (rei.com) | 14,435 | Claude, DeepSeek, Gemini, Mistral, ChatGPT |
The complete top 100 leaderboard is live and updates hourly. A few observations from the long tail (positions 50-200, omitted from the table above to keep it readable):
- Mention counts drop off sharply after the top 20. Position 50 typically gets <10% of the mentions that position 1 does. Long-tail visibility is a real opportunity for catalogs that optimize properly.
- Mid-tier brands (positions 30-100) are mostly cited by 2-3 agents, not all 6. Cross-agent visibility is rare and high-signal.
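To make the drop-off concrete, the sketch below expresses each top-10 mention count as a fraction of the leader's count. The counts are the ones reported in the table above; the helper name `relative_to_leader` is ours, for illustration only.

```python
# Mention counts as reported in the top-10 table.
top10_mentions = {
    "amazon.com": 64426,
    "patagonia.com": 26846,
    "target.com": 25664,
    "uniqlo.com": 21267,
    "nike.com": 20871,
    "walmart.com": 17097,
    "adidas.com": 16264,
    "columbia.com": 16108,
    "thenorthface.com": 14446,
    "rei.com": 14435,
}


def relative_to_leader(mentions):
    """Return each brand's mention count divided by the highest count."""
    leader = max(mentions.values())
    return {domain: count / leader for domain, count in mentions.items()}


for domain, ratio in relative_to_leader(top10_mentions).items():
    print(f"{domain:<20} {ratio:6.1%}")
```

On these numbers, position 2 already sits at roughly 42% of position 1, and position 10 at roughly 22%.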
4. Catalog quality vs. mention rank
We ran the public AI Catalog Score audit on the top 200 most-mentioned brands. 49 stores returned valid catalog data. Average score: 59/100.
The top 10 by audit score:
| # | Brand | AI Catalog Score | Products |
|---|---|---|---|
| 1 | Burtsbeesbaby | 76/100 | 250 |
| 2 | Mattandnat | 75/100 | 250 |
| 3 | Rothys | 73/100 | 250 |
| 4 | Gymshark | 70/100 | 250 |
| 5 | Decathlon | 69/100 | 250 |
| 6 | Colehaan | 68/100 | 250 |
| 7 | Outdoorresearch | 67/100 | 250 |
| 8 | Packagefreeshop | 67/100 | 116 |
| 9 | Boody | 66/100 | 250 |
| 10 | Toms | 66/100 | 250 |
The full audit-score leaderboard is at /leaderboard/catalog-score. Worth noting: the catalogs with the highest mention counts are not always the same as the catalogs with the highest audit scores. Discoverability and catalog quality are correlated but not identical.
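"Correlated but not identical" can be checked with a rank correlation between mention counts and audit scores. The sketch below computes Spearman's rho in plain Python. The five data points are hypothetical placeholders, not values from the dataset, and we are not claiming this is the statistic the benchmark itself reports.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical (mentions, audit score) pairs for five brands.
mentions = [900, 700, 500, 300, 100]
scores = [60, 72, 55, 66, 50]
print(round(spearman(mentions, scores), 2))  # prints 0.5
```

A rho close to 1 would mean the two leaderboards order brands almost identically; a value well below 1, as in this toy example, matches the "correlated but not identical" reading.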
5. Top queries in our benchmark suite
The 10 most-frequent queries our standardized benchmark suite issued in this window (queries are pre-defined, not sourced from real shopper search logs):
| # | Query | Captures |
|---|---|---|
| 1 | compare Allbirds Tree Runners vs Nike Pegasus | 2,502 |
| 2 | sustainable merino wool sweater for sensitive skin | 2,448 |
| 3 | best winter jacket under $200 for beginners | 2,448 |
| 4 | formal tie gift for grandma | 2,424 |
| 5 | sustainable bamboo fiber socks for small spaces | 2,412 |
| 6 | eco-friendly vegan leather boots | 2,286 |
| 7 | compare Sony WH-1000XM5 vs Bose QuietComfort Ultra | 2,286 |
| 8 | compare Kindle Paperwhite vs Kobo Clara | 2,286 |
| 9 | compare Anker vs Belkin USB-C charger | 2,286 |
| 10 | compact foldable treadmill for small apartment | 2,286 |
The pattern in our suite: specific queries elicit more confident answers than broad ones. Queries like "waterproof running jacket under $200" and "vegan skincare with niacinamide" return concrete brand-and-product lists; broad queries like "running gear" return generic category guidance. We constructed our suite to test the constraint-rich end of the distribution intentionally. That's where AI agent retrieval is most discriminating, and where catalog quality differences surface most clearly. If your catalog can't answer factual constraints, you don't get cited.
6. The structural takeaway
Three qualitative patterns hold across the dataset, regardless of which agent or vertical we slice. None of these are causal claims; we don't run controlled merchant experiments. They're descriptions of what the captures look like.
- Structure beats prose. Brands cited most often in the captures dataset overwhelmingly publish structured metafield data on the platforms where they're recommended. The reverse is not observed: catalogs that hide attributes in marketing prose rarely surface at the top.
- Specificity correlates with citation. Top-ranked captures consistently surface products described with factual markers (units, ingredients, materials, certifications) rather than marketing superlatives. We haven't run a controlled comparison, but the pattern is visible at a glance.
- The distribution is winner-take-most. Rank 50 in our brand list receives ~5% of the mentions that rank 1 does. The long tail past rank 100 drops further still.
Methodology
Each day we run a ~5,000 query batch through six AI agents. The batch combines two sources: a 700-query anchor set of hand-curated queries kept identical across runs (so the same query's response can be tracked over time), and a probabilistically generated set that fills the rest.
The probabilistic generator samples each query from explicit distributions:
- Length: 30% short (1-3 tokens, e.g. "running shoes"), 45% medium (4-8 tokens, e.g. "running shoes for marathon training"), 20% long (9-15 tokens, includes 2+ constraints), 5% verbose (16+ tokens, conversational).
- Phrasing register: 55% search-style, 30% question-style ("what's the best..."), 15% conversational ("I'm looking for...").
- Constraint mix: price ceiling, use case, demographic, factual attribute, brand relation. Pareto-distributed count, with at most one constraint per type per query.
- Vertical share: Pareto across ten verticals (apparel and electronics ~18-20% each, beauty 17%, home 12%, gifts 10%, then a long tail through fitness, outdoor, pets, food, baby). Seasonally boosted (gifts in Q4, fitness in Q1, outdoor in summer).
These parameters are explicit and reviewable in the open methodology repo. They are best-effort approximations of shopper-LLM behavior, calibrated from public observation rather than fitted to real shopper traffic (which is not publicly available). They will be wrong in some verticals. The right response when a reader pushes back is to debate the parameters, not to defend the output.
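The sampling distributions listed above can be sketched as follows. This is a minimal illustration, not the code from the methodology repo: the function names, the Pareto shape parameter `alpha`, and the choice to shift the constraint count so it starts at zero are all assumptions. Vertical share and seasonal boosts are left out because the text gives only partial percentages for them.

```python
import random

# Weights taken from the length and register bullets above.
LENGTH_WEIGHTS = {"short": 0.30, "medium": 0.45, "long": 0.20, "verbose": 0.05}
REGISTER_WEIGHTS = {"search": 0.55, "question": 0.30, "conversational": 0.15}
CONSTRAINT_TYPES = [
    "price ceiling", "use case", "demographic", "factual attribute", "brand relation",
]


def sample_weighted(weights, rng):
    """Draw one key from a {name: probability} mapping."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def sample_constraint_types(rng, alpha=1.5):
    """Pick a Pareto-distributed number of distinct constraint types.

    Assumption: alpha=1.5 and the shift to a minimum of zero are illustrative.
    Sampling without replacement enforces at most one constraint per type.
    """
    count = min(int(rng.paretovariate(alpha)) - 1, len(CONSTRAINT_TYPES))
    return rng.sample(CONSTRAINT_TYPES, count)


def sample_query_spec(rng):
    """Return the abstract shape of one generated query (not its text)."""
    return {
        "length": sample_weighted(LENGTH_WEIGHTS, rng),
        "register": sample_weighted(REGISTER_WEIGHTS, rng),
        "constraints": sample_constraint_types(rng),
    }


rng = random.Random(42)  # fixed seed so a run can be reproduced
# Roughly 5,000 queries per batch minus the 700-query anchor set.
specs = [sample_query_spec(rng) for _ in range(5000 - 700)]
short_share = sum(s["length"] == "short" for s in specs) / len(specs)
print(f"short queries: {short_share:.1%}")
```

Keeping the weights in plain dictionaries makes them easy to review and to argue about, which is the point of publishing the parameters.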
After collection, we extract product-and-brand recommendations from each agent's response. The parser is intentionally tolerant: different agents return slightly different shapes. We dedupe at the merchant-domain level per capture, then aggregate. Top brands are ranked over a 90-day window, which matches typical AI agent retraining cadence. Aggregated counts are exposed via /api/public/insights and refreshed hourly.
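The dedupe-then-aggregate step might look like the sketch below. It assumes each capture has already been parsed into a list of recommended product URLs; the real parser output shape is not described here, and the URLs and the `www.` stripping rule are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def merchant_domain(url):
    """Reduce a product URL to its merchant domain, dropping a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_mentions(captures):
    """Count each merchant domain at most once per capture, then sum across captures."""
    mentions = Counter()
    for recommended_urls in captures:
        domains = {merchant_domain(u) for u in recommended_urls}
        domains.discard("")  # skip entries with no parseable host
        mentions.update(domains)
    return mentions


# Hypothetical captures: each inner list is one agent response to one query.
captures = [
    [
        "https://www.rei.com/product/1",
        "https://rei.com/product/2",
        "https://www.patagonia.com/p/3",
    ],
    ["https://www.patagonia.com/p/4"],
]
print(count_mentions(captures).most_common())
# prints [('patagonia.com', 2), ('rei.com', 1)]
```

Deduping per capture means a response that lists three products from one merchant still counts as a single mention for that merchant.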
What the dataset is and is not. This is a benchmark. We do not observe real shopper traffic; actual shopping interactions with the agents are private to each provider. The signal the dataset surfaces is "given a buyer-style query, which catalogs do AI agents cite?", useful for benchmarking visibility and tracking changes over time. The signal it does not surface is "what queries real shoppers type to AI agents and at what volume" since that data exists only inside each agent's servers.
Limitations:
- The query suite is generated from a model of shopper-LLM behavior, not sampled from real search logs. The model's parameters are best-effort approximations and may diverge from actual shopper distributions, especially in long-tail verticals.
- Each query in the daily batch is run once per agent. Head queries in the real world receive many more shopper impressions than tail queries; our captures dataset treats them with equal weight.
- Capture set is currently English-language only. Multi-language is on the roadmap.
- "Mentions" do not equal "purchases". We measure AI agent visibility, not downstream conversion.
- Catalog audits cover public products.json data only; signals like metafields and SEO meta are visible only once the app is installed (covered in the full rubric).
Methodology open at commerce-agentic/agentic-catalog-scanner. Raw dataset README at commerce-agentic/ai-visibility-metrics.
Audit your catalog in 60 seconds
Free public scan of any Shopify store. See where you'd rank.