State of AI Commerce on Shopify, Q2 2026 Report

1. The AI shopping channel is real, and it's growing fast

Over the last 22 days we captured 1,047,024 distinct product recommendations from six AI agents, running our standardized buyer-intent query set across five verticals (apparel, beauty, home, food, electronics). That works out to an average of 47,592 captures per day, and the daily count is still growing.

Each capture is a record of what an agent returned when our benchmark suite issued a buyer-style query. It is not a sample of real shopper traffic (which is private to each agent). Two takeaways from the benchmark:

The answers are deterministic enough to measure. Issue the same buyer-style query 30 days apart and you get largely overlapping product lists, meaning catalog-level signal is what moves the answers, not query phrasing noise.
The answers are concentrated. We'll show below that the top 10 brands receive a disproportionate share of mentions, which is bad news for the long tail and great news for whoever's optimizing.
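
One way to put a number on "largely overlapping" is the Jaccard overlap between the product sets returned for the same query on two dates. The sketch below is illustrative only: the function name and the sample product identifiers are hypothetical, and the report does not say which overlap metric was used.

```python
def jaccard_overlap(run_a, run_b):
    """Share of distinct products that appear in both runs (0.0 to 1.0)."""
    a, b = set(run_a), set(run_b)
    if not a and not b:
        return 1.0  # two empty answers are trivially identical
    return len(a & b) / len(a | b)


# Hypothetical product lists for one query, captured on two dates.
first_run = ["jacket-a", "jacket-b", "jacket-c", "jacket-d"]
later_run = ["jacket-a", "jacket-b", "jacket-c", "jacket-e"]

print(jaccard_overlap(first_run, later_run))  # 3 shared / 5 distinct = 0.6
```

A score near 1.0 means the agent returned almost the same products both times; a score near 0.0 means the list turned over almost completely.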

2. Agent share: who recommends how much

Not all AI agents capture share equally. Across the 1,047,024 captures in this window:

Agent	Captures
Gemini	383,553
ChatGPT	176,927
Claude	167,890
DeepSeek	160,210
Mistral	158,444

Gemini leads with 383,553 captures (37% share). At the other end, Mistral trails at 158,444 (15%). The gap matters because different agents prioritize different signals. What wins on ChatGPT may underperform on Claude, and vice versa.
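
The share figures follow directly from the capture counts. As a quick check, the snippet below recomputes them from the five agents listed in this section, whose counts sum to the full window total. The variable names are illustrative.

```python
# Capture counts as reported in this section.
captures = {
    "Gemini": 383_553,
    "ChatGPT": 176_927,
    "Claude": 167_890,
    "DeepSeek": 160_210,
    "Mistral": 158_444,
}

total = sum(captures.values())  # 1,047,024, the reported window total

# Each agent's share is its count divided by the total.
shares = {agent: count / total for agent, count in captures.items()}
for agent, share in shares.items():
    print(f"{agent:<9} {share:.1%}")
```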

3. Top 10 brands AI agents recommend most

From the 200 distinct brands we observed in the last 90 days, the top 10 received the following mention counts:

#	Brand	Mentions	Agents
1	Amazon amazon.com	64,426	Claude, DeepSeek, Gemini, Mistral, ChatGPT
2	Patagonia patagonia.com	26,846	Claude, DeepSeek, Gemini, Mistral, ChatGPT
3	Target target.com	25,664	Claude, DeepSeek, Gemini, Mistral, ChatGPT
4	Uniqlo uniqlo.com	21,267	Claude, DeepSeek, Gemini, Mistral, ChatGPT
5	Nike nike.com	20,871	Claude, DeepSeek, Gemini, Mistral, ChatGPT
6	Walmart walmart.com	17,097	Claude, DeepSeek, Gemini, Mistral, ChatGPT
7	Adidas adidas.com	16,264	Claude, DeepSeek, Gemini, Mistral, ChatGPT
8	Columbia columbia.com	16,108	Claude, DeepSeek, Gemini, Mistral, ChatGPT
9	The North Face thenorthface.com	14,446	Claude, DeepSeek, Gemini, Mistral, ChatGPT
10	REI rei.com	14,435	Claude, DeepSeek, Gemini, Mistral, ChatGPT

The complete top 100 leaderboard is live and updates hourly. A few observations from the long tail (positions 50-200, omitted from the table above to keep it readable):

Mention counts drop off sharply after the top 20. Position 50 typically gets <10% of the mentions that position 1 does. Long-tail visibility is a real opportunity for catalogs that optimize properly.
Mid-tier brands (positions 30-100) are mostly cited by 2-3 agents, not all 6. Cross-agent visibility is rare and high-signal.

4. Catalog quality vs. mention rank

We ran the public AI Catalog Score audit on the top 200 most-mentioned brands. 49 stores returned valid catalog data. Average score: 59/100.

The top 10 by audit score:

#	Brand	AI Catalog Score	Products
1	Burt's Bees Baby	76/100	250
2	Matt & Nat	75/100	250
3	Rothy's	73/100	250
4	Gymshark	70/100	250
5	Decathlon	69/100	250
6	Cole Haan	68/100	250
7	Outdoor Research	67/100	250
8	Package Free Shop	67/100	116
9	Boody	66/100	250
10	TOMS	66/100	250

The full audit-score leaderboard is at /leaderboard/catalog-score. Worth noting: the catalogs with the highest mention counts are not always the same as the catalogs with the highest audit scores. Discoverability and catalog quality are correlated but not identical.

5. Top queries in our benchmark suite

The 10 most-frequent queries our standardized benchmark suite issued in this window (queries are pre-defined, not sourced from real shopper search logs):

#	Query	Captures
1	`compare Allbirds Tree Runners vs Nike Pegasus`	2,502
2	`sustainable merino wool sweater for sensitive skin`	2,448
3	`best winter jacket under $200 for beginners`	2,448
4	`formal tie gift for grandma`	2,424
5	`sustainable bamboo fiber socks for small spaces`	2,412
6	`eco-friendly vegan leather boots`	2,286
7	`compare Sony WH-1000XM5 vs Bose QuietComfort Ultra`	2,286
8	`compare Kindle Paperwhite vs Kobo Clara`	2,286
9	`compare Anker vs Belkin USB-C charger`	2,286
10	`compact foldable treadmill for small apartment`	2,286

The pattern in our suite: specific queries elicit more confident answers than broad ones. Queries like "waterproof running jacket under $200" and "vegan skincare with niacinamide" return concrete brand-and-product lists; broad queries like "running gear" return generic category guidance. We constructed our suite to test the constraint-rich end of the distribution intentionally. That's where AI agent retrieval is most discriminating, and where catalog quality differences surface most clearly. If your catalog can't answer factual constraints, you don't get cited.

6. The structural takeaway

Three qualitative patterns hold across the dataset, regardless of which agent or vertical we slice. None of these are causal claims; we don't run controlled merchant experiments. They're descriptions of what the captures look like.

Structure beats prose. Brands cited most often in the captures dataset overwhelmingly publish structured metafield data on the platforms where they're recommended. The reverse is not observed: catalogs that hide attributes in marketing prose rarely surface at the top.
Specificity correlates with citation. Top-ranked captures consistently surface products described with factual markers (units, ingredients, materials, certifications) rather than marketing superlatives. We haven't run a controlled comparison, but the pattern is visible at a glance.
The distribution is winner-take-most. Rank 50 in our brand list receives ~5% of the mentions that rank 1 does. The long tail past rank 100 drops further still.
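
The drop-off is already visible inside the published top 10. The sketch below expresses each rank's mention count as a fraction of rank 1, using the counts from the top-10 table. Rank-50 figures are not included in this report, so the check stops at rank 10.

```python
# Mention counts for ranks 1-10, copied from the top-10 brand table.
top10 = [64_426, 26_846, 25_664, 21_267, 20_871,
         17_097, 16_264, 16_108, 14_446, 14_435]

rank1 = top10[0]
for rank, mentions in enumerate(top10, start=1):
    # Ratio of this rank's mentions to the top-ranked brand's mentions.
    print(f"rank {rank:>2}: {mentions / rank1:.0%} of rank 1")
```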

If you read one paragraph of this report: the single highest-leverage thing you can do for AI catalog visibility is set vertical-relevant metafields. The gap between "no AI-relevant metafields" and "3 vertical-relevant metafields" is the largest single jump in the rubric. We documented this in detail in the 8 signals article.

Methodology

Each day we run a ~5,000 query batch through six AI agents. The batch combines two sources: a 700-query anchor set of hand-curated queries kept identical across runs (so the same query's response can be tracked over time), and a probabilistically generated set that fills the rest.

The probabilistic generator samples each query from explicit distributions:

Length: 30% short (1-3 tokens, e.g. "running shoes"), 45% medium (4-8 tokens, e.g. "running shoes for marathon training"), 20% long (9-15 tokens, includes 2+ constraints), 5% verbose (16+ tokens, conversational).
Phrasing register: 55% search-style, 30% question-style ("what's the best..."), 15% conversational ("I'm looking for...").
Constraint mix: price ceiling, use case, demographic, factual attribute, brand relation. Pareto-distributed count, with at most one constraint per type per query.
Vertical share: Pareto across ten verticals (apparel and electronics ~18-20% each, beauty 17%, home 12%, gifts 10%, then a long tail through fitness, outdoor, pets, food, baby). Seasonally boosted (gifts in Q4, fitness in Q1, outdoor in summer).
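
To make the sampling step concrete, here is a minimal sketch of how the length and register of a query could be drawn from the weights listed above. It is an assumption-laden illustration, not the generator itself: the function and variable names are hypothetical, and the constraint mix and vertical share are left out because their exact weights are not fully specified here.

```python
import random

# Weights copied from the length and phrasing-register distributions above.
LENGTH_BUCKETS = {"short": 0.30, "medium": 0.45, "long": 0.20, "verbose": 0.05}
REGISTERS = {"search": 0.55, "question": 0.30, "conversational": 0.15}


def sample_bucket(weights, rng):
    """Draw one key from a {name: weight} mapping."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def sample_query_spec(rng):
    """Return a hypothetical spec that a query writer would then fill in."""
    return {
        "length": sample_bucket(LENGTH_BUCKETS, rng),
        "register": sample_bucket(REGISTERS, rng),
    }


rng = random.Random(42)  # fixed seed so a batch can be reproduced
specs = [sample_query_spec(rng) for _ in range(10_000)]
short_share = sum(s["length"] == "short" for s in specs) / len(specs)
print(f"short queries: {short_share:.1%}")  # should land near 30%
```

Seeding the random number generator is a design choice for the sketch: it lets anyone regenerate the same batch when debating the parameters.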

These parameters are explicit and reviewable in the open methodology repo. They are best-effort approximations of shopper-LLM behavior, calibrated from public observation rather than fitted to real shopper traffic (which is not publicly available). They will be wrong in some verticals. The right response when a reader pushes back is to debate the parameters, not to defend the output.

After collection, we extract product-and-brand recommendations from each agent's response. The parser is intentionally tolerant: different agents return slightly different shapes. We dedupe at the merchant-domain level per capture, then aggregate. Top brands are ranked over a 90-day window, which matches typical AI agent retraining cadence. Aggregated counts are exposed via /api/public/insights and refreshed hourly.
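
The dedupe-then-aggregate step can be sketched as follows. This assumes a capture holds the product URLs extracted from one agent response; the function names and sample URLs are hypothetical, and the actual parser may normalize domains differently.

```python
from collections import Counter
from urllib.parse import urlparse


def merchant_domain(url):
    """Normalize a product URL to a bare merchant domain."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def count_mentions(captures):
    """Count each merchant domain at most once per capture."""
    mentions = Counter()
    for urls in captures:
        # The set collapses repeated domains within a single capture.
        mentions.update({merchant_domain(u) for u in urls})
    return mentions


# Hypothetical captures: each inner list is the product URLs from one response.
captures = [
    ["https://www.patagonia.com/p/1", "https://patagonia.com/p/2", "https://rei.com/p/9"],
    ["https://www.rei.com/p/3"],
]
print(count_mentions(captures))
```

Here patagonia.com is counted once even though the first response cites it twice, while rei.com is counted twice because it appears in two separate captures.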

What the dataset is and is not. This is a benchmark. We do not observe real shopper traffic; actual shopping interactions with the agents are private to each provider. The signal the dataset surfaces is "given a buyer-style query, which catalogs do AI agents cite?", useful for benchmarking visibility and tracking changes over time. The signal it does not surface is "what queries real shoppers type to AI agents and at what volume" since that data exists only inside each agent's servers.

Limitations:

The query suite is generated from a model of shopper-LLM behavior, not sampled from real search logs. The model's parameters are best-effort approximations and may diverge from actual shopper distributions, especially in long-tail verticals.
Each query in the daily batch is run once per agent. Head queries in the real world receive many more shopper impressions than tail queries; our captures dataset treats them with equal weight.
Capture set is currently English-language only. Multi-language is on the roadmap.
"Mentions" do not equal "purchases". We measure AI agent visibility, not downstream conversion.
Catalog audits are over public products.json data; signals like metafields and SEO meta are install-only (covered in the full rubric).

Methodology open at commerce-agentic/agentic-catalog-scanner. Raw dataset README at commerce-agentic/ai-visibility-metrics.
