
How to Measure AI Search Visibility: Why Per-Engine Tracking Exposes Share of Voice as a Broken Metric

Aggregate AI share of voice treats six different citation engines as one. New data shows 77% of brands are cited by only one engine and cross-platform overlap sits at 11%. Per-engine measurement is the minimum viable approach — here is the evidence and the framework.

Published AuthorityTech
Index Data
Topics: Measurement, Share of Voice, Citation Behavior, Per-Engine Tracking

Aggregate AI share of voice is a broken metric. An empirical test across five AI engines found that 77% of brands were cited by only one engine, and cross-platform citation data shows just 11% domain overlap between ChatGPT and Perplexity. An 8,400-prompt study across four engines found cross-engine agreement on the top-cited brand reached only 34% for head-term queries. Treating these engines as one measurement surface produces numbers that are directionally misleading. Per-engine tracking — measuring citation authority within each retrieval stack separately — is the minimum viable measurement for AI search visibility in 2026.

Last updated: June 23, 2026

The Divergence Problem: Six Engines, Six Citation Realities

The core assumption behind aggregate share of voice is that AI engines behave like a single search channel with minor variation. The data refutes this at every level.

The Visionary Marketing study ran 8,400 prompts across ChatGPT, Claude, Perplexity, and Gemini in February 2026. Brand citation rates ranged from 58.4% on Claude to 84.2% on Perplexity, a relative gap of about 44% between the least and most citation-dense engines:

| Engine | Brand citation rate | Citations per answer |
|---|---|---|
| Perplexity | 84.2% | ~21.9 |
| ChatGPT | 71.4% | ~7.9 |
| Gemini | 62.8% | ~17.0 |
| Claude | 58.4% | Prose-integrated |

Source: Visionary Marketing, AI Search Visibility Statistics 2026

Cross-engine agreement on the top-cited brand hit only 34% for head-term queries. For comparison queries ("best X for Y"), agreement dropped to 21.4%. A brand dominating one engine's citations can be invisible on another.
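One way to arrive at an agreement figure like this is to record each engine's top-cited brand per prompt and count the prompts on which every engine names the same brand. The sketch below assumes that definition, since the study's exact method is not stated here. The prompts, the brand names, and the function name `top_brand_agreement` are all hypothetical.

```python
def top_brand_agreement(top_brand_by_prompt):
    """Share of prompts on which every engine names the same top-cited brand.

    top_brand_by_prompt maps prompt -> {engine: top-cited brand}.
    Assumed definition; other definitions (e.g. pairwise agreement) are possible.
    """
    if not top_brand_by_prompt:
        return 0.0
    agreed = sum(
        1
        for per_engine in top_brand_by_prompt.values()
        if len(set(per_engine.values())) == 1
    )
    return agreed / len(top_brand_by_prompt)


# Hypothetical observations for three engines and four prompts.
sample = {
    "best crm": {"chatgpt": "BrandA", "perplexity": "BrandA", "gemini": "BrandA"},
    "crm for startups": {"chatgpt": "BrandA", "perplexity": "BrandB", "gemini": "BrandC"},
    "crm pricing": {"chatgpt": "BrandB", "perplexity": "BrandB", "gemini": "BrandA"},
    "top crm tools": {"chatgpt": "BrandC", "perplexity": "BrandA", "gemini": "BrandA"},
}

agreement = top_brand_agreement(sample)
print(f"Top-brand agreement: {agreement:.0%}")
```

Only the first prompt has a unanimous top brand in this sample, so agreement is one prompt in four.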

The VerityScore empirical test confirmed this pattern at the brand level. Using a single beauty-category prompt across five engines, 77% of cited brands appeared in only one engine's response. Only one brand — Clémence & Vivien — was cited unanimously by all five.
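The single-engine share can be reproduced from per-engine brand lists with a simple count. This is a minimal sketch assuming each engine's response has already been reduced to a set of cited brands. The engine labels, brand names, and function names are hypothetical, and the numbers are not the VerityScore data.

```python
from collections import Counter


def engine_counts(citations_by_engine):
    """Map each brand to the number of engines that cited it."""
    return Counter(
        brand
        for brands in citations_by_engine.values()
        for brand in set(brands)
    )


def single_engine_share(citations_by_engine):
    """Share of cited brands that appear in exactly one engine's response."""
    counts = engine_counts(citations_by_engine)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n == 1) / len(counts)


def unanimous_brands(citations_by_engine):
    """Brands cited by every engine in the sample."""
    counts = engine_counts(citations_by_engine)
    return {b for b, n in counts.items() if n == len(citations_by_engine)}


# Hypothetical responses to one prompt from three engines.
citations = {
    "engine_1": {"Alpha", "Beta", "Gamma"},
    "engine_2": {"Alpha", "Delta"},
    "engine_3": {"Alpha", "Beta", "Epsilon"},
}

print(single_engine_share(citations))
print(unanimous_brands(citations))
```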

This is not noise. Each engine uses a different retrieval stack, different reranking logic, and different source preferences. Research on citation divergence shows that these architectural differences produce systematically different citation patterns, not random variation. Perplexity emphasizes reviews and expertise sources with heavy Reddit weighting. ChatGPT draws 48.7% of citations from third-party directories. Gemini favors brand-owned content at 52.2% — the opposite pattern.

Three Ways Aggregate Share of Voice Fails

1. It Masks Engine-Level Concentration

A brand with 25% aggregate share of voice (SoV) could hold 80% share on Perplexity and 0% everywhere else. An aggregate dashboard would call this "moderate visibility." An operator making budget decisions from that number could misallocate spend toward engines where the brand has no traction or, worse, neglect the single engine driving real traffic.
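The arithmetic behind this scenario is easy to check. The sketch below assumes the aggregate is pooled over all sampled responses, so it depends on how many responses were collected per engine. The sample sizes are hypothetical and chosen so the pooled figure lands on 25%.

```python
def share_of_voice(mentions, responses):
    """Fraction of sampled responses that mention the brand."""
    return mentions / responses if responses else 0.0


# Hypothetical sample: (responses mentioning the brand, responses collected).
samples = {
    "perplexity": (100, 125),
    "chatgpt": (0, 55),
    "gemini": (0, 55),
    "claude": (0, 55),
    "google_ai_mode": (0, 55),
    "google_ai_overviews": (0, 55),
}

per_engine = {
    engine: share_of_voice(mentions, responses)
    for engine, (mentions, responses) in samples.items()
}
aggregate = share_of_voice(
    sum(mentions for mentions, _ in samples.values()),
    sum(responses for _, responses in samples.values()),
)

for engine, sov in per_engine.items():
    print(f"{engine:>20}: {sov:.0%}")
print(f"{'aggregate':>20}: {aggregate:.0%}")
```

The pooled number is driven entirely by one engine and by the sampling mix, which is why it describes none of the six engines individually.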

The Machine Relations Index tracks this through engine breadth scoring. In the current MRI measurement window, a source cited by all six measured engines (Google AI Mode, Claude, Perplexity, Gemini, Google AI Overviews, ChatGPT) scores 40/40 on engine breadth. A source cited by only one engine scores 6.7/40, regardless of how many total citations it earns. The breadth component makes it visible whether authority is distributed or concentrated, a signal that aggregate SoV obscures by construction.
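The two data points given here (six of six engines scores 40, one of six scores 6.7) are consistent with a score that scales linearly with the number of citing engines. The sketch below assumes that linear form. It is an inference from those two points, not the published MRI formula, and the function and constant names are hypothetical.

```python
ENGINES = (
    "google_ai_mode",
    "claude",
    "perplexity",
    "gemini",
    "google_ai_overviews",
    "chatgpt",
)
MAX_BREADTH = 40.0


def engine_breadth_score(citing_engines):
    """Assumed linear breadth score: MAX_BREADTH * (engines citing / engines measured)."""
    covered = set(citing_engines) & set(ENGINES)
    return MAX_BREADTH * len(covered) / len(ENGINES)


print(round(engine_breadth_score(ENGINES), 1))          # all six engines
print(round(engine_breadth_score({"perplexity"}), 1))   # one engine only
```

Total citation volume never enters the calculation, so a source with thousands of citations from a single engine still scores the one-engine minimum.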

2. It Treats All Mentions as Equal

Attrifast's analysis demonstrates the revenue distortion precisely: a definitional citation in a "what is X" answer — where the user got their answer and never clicked — counts identically to a comparison citation in a "best X for Y" answer that drove a high-intent click and a sale.

Their worked example: Brand A holds 40% classic SoV generating $80,000 monthly. Brand B holds 10% SoV generating $160,000. Revenue share of voice inverts the competitive ranking entirely. Teams can grow classic SoV by 40% while watching AI-attributed revenue stay flat because new mentions land in low-intent, informational contexts.
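The inversion in the worked example can be verified directly. The sketch below uses only the two brands and figures quoted above. Computing revenue share over just these two brands, and the "revenue per SoV point" ratio, are illustrative assumptions rather than Attrifast's stated formula.

```python
# Figures from the worked example: classic SoV in percentage points, monthly revenue in USD.
brands = {
    "Brand A": {"sov_pct": 40, "monthly_revenue": 80_000},
    "Brand B": {"sov_pct": 10, "monthly_revenue": 160_000},
}

total_revenue = sum(b["monthly_revenue"] for b in brands.values())

# Assumed definition: each brand's share of the combined AI-attributed revenue.
revenue_share = {
    name: b["monthly_revenue"] / total_revenue for name, b in brands.items()
}
# Illustrative ratio: revenue earned per percentage point of classic SoV.
revenue_per_sov_point = {
    name: b["monthly_revenue"] / b["sov_pct"] for name, b in brands.items()
}

rank_by_sov = sorted(brands, key=lambda n: brands[n]["sov_pct"], reverse=True)
rank_by_revenue = sorted(brands, key=lambda n: revenue_share[n], reverse=True)

print("Ranked by classic SoV:  ", rank_by_sov)
print("Ranked by revenue share:", rank_by_revenue)
```

Brand B earns eight times as much per point of classic SoV as Brand A, which is what flips the ranking.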

The MRI methodology addresses this through query diversity scoring rather than revenue attribution. A source cited across 40 distinct queries scores higher than one cited 100 times on a single query, because query diversity is a measurable proxy for the range of user intents a source serves.

3. It Cannot Detect Temporal Drift

DigitalApplied's measurement framework documents 40–60% monthly shifts in cited domain sets within active categories. Ahrefs measured AI Overview citations from Google's top-10 results declining from 76% to 38% between July 2025 and March 2026.
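A drift figure of this kind can be tracked by comparing the set of cited domains in consecutive periods. The sketch below assumes drift means the share of last period's cited domains that no longer appear. The cited framework may define it differently, and the domains shown are placeholders.

```python
def dropped_share(previous, current):
    """Share of the previous period's cited domains that are absent from the current period."""
    previous, current = set(previous), set(current)
    if not previous:
        return 0.0
    return len(previous - current) / len(previous)


# Hypothetical cited-domain sets for one category in two consecutive months.
last_month = {"a.example", "b.example", "c.example", "d.example", "e.example"}
this_month = {"a.example", "b.example", "c.example", "f.example", "g.example"}

print(f"Dropped since last month: {dropped_share(last_month, this_month):.0%}")
```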

A quarterly aggregate SoV snapshot captures a number that may have been true for only one of those three months. The MRI tracks temporal consistency as a scored component — measuring how many days within a 30-day window a source was cited. A source cited on 26 out of 30 days has structurally different authority than one cited on 5 days, even if both have the same total citation count.
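The days-cited measure can be derived from dated citation events by counting distinct days inside the window. The sketch below assumes a 30-day window and a linear mapping onto the component's maximum of 10 points. The linear mapping and the function names are assumptions, not the published MRI formula.

```python
from datetime import date, timedelta


def days_cited(citation_dates, window_end, window_days=30):
    """Number of distinct days within the window on which the source was cited."""
    window_start = window_end - timedelta(days=window_days - 1)
    return len({d for d in citation_dates if window_start <= d <= window_end})


def temporal_consistency_score(n_days, window_days=30, max_score=10.0):
    """Assumed linear scaling of days cited onto the component's maximum score."""
    return max_score * n_days / window_days


window_end = date(2026, 6, 22)  # arbitrary example date

# Two hypothetical sources with the same total of 26 citations.
steady = [window_end - timedelta(days=i) for i in range(26)]      # one per day
bursty = [window_end - timedelta(days=i % 5) for i in range(26)]  # packed into 5 days

for label, events in (("steady", steady), ("bursty", bursty)):
    n = days_cited(events, window_end)
    score = temporal_consistency_score(n)
    print(f"{label}: {len(events)} citations on {n} days, score {score:.1f}")
```

Both sources have identical citation totals, so only the distinct-day count separates them.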

Per-Engine Measurement: The MRI Framework

The Machine Relations Index was built around per-engine measurement because the divergence data made aggregate tracking structurally unreliable. The methodology (MRI Score v1.1, 6-engine) decomposes citation authority into five scored components:

| Component | Max score | What it measures |
|---|---|---|
| Engine breadth | 40 | How many of 6 engines cite the source |
| Query diversity | 20 | Range of distinct queries that trigger citations |
| Vertical spread | 15 | Number of industry verticals represented |
| Position quality | 10 | Average citation position (higher = cited earlier) |
| Temporal consistency | 10 | Days cited within the measurement window |

Source: MRI methodology, Machine Relations Index

The consensus score is a weighted composite. The weighted authority score adjusts for the source's peer group size and citation volume. Together, these produce a measurement that can only be high if the source is cited across multiple engines, across diverse queries, across verticals, in strong positions, consistently over time.

The current MRI measurement window covers 7,020 domains and 25,241 source events across six engines: Google AI Mode, Claude, Perplexity, Gemini, Google AI Overviews, and ChatGPT. This per-engine granularity exposes patterns that aggregate measurement cannot:

  • Distribution shifts: The share of an individual source's citations that comes from Google AI Mode is declining cycle over cycle, not because sources lose authority, but because citation volume is diversifying across engines. An aggregate tracker would miss this structural redistribution.
  • Engine-specific preferences: Market databases like Crunchbase earn their highest citation share from Google AI Mode (30.5%) and Claude (25.5%), while ChatGPT contributes only 2.8%. A strategy optimized for "AI search" without engine segmentation would overinvest in ChatGPT's retrieval characteristics and underserve the engines delivering most citations.
  • Position quality divergence: A source can rank in the top 3 citation positions on Perplexity while sitting at position 8+ on Gemini. Per-engine position tracking reveals this; aggregate position averaging conceals it.
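The position-quality point in the list above can be illustrated with a small calculation. The sketch below assumes citation positions are recorded per engine as 1-based ranks. The position values are hypothetical.

```python
def mean_position(positions):
    """Average 1-based citation position (lower means cited earlier)."""
    return sum(positions) / len(positions)


# Hypothetical citation positions for one source on two engines.
positions_by_engine = {
    "perplexity": [1, 2, 3, 2],
    "gemini": [8, 9, 10, 9],
}

per_engine = {
    engine: mean_position(positions)
    for engine, positions in positions_by_engine.items()
}
pooled = mean_position(
    [p for positions in positions_by_engine.values() for p in positions]
)

print("Per engine:", per_engine)
print("Pooled average:", pooled)
```

The pooled average sits in the middle of the list, a position the source never actually occupies on either engine.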

What Operators Should Measure Instead

The measurement gap is stark: only 14% of marketers currently track AI citations, while AI search visits grew 42.8% year over year from Q1 2025 to Q1 2026 (15.6B to 27.4B visits). The measurement infrastructure lags the channel growth by an order of magnitude.

A viable measurement stack for AI search visibility requires:

Per-engine citation tracking. Monitor each engine separately. The minimum viable engine set in 2026 is Google AI Mode, ChatGPT (search mode), Perplexity, Gemini, Claude, and Google AI Overviews. Each has different retrieval architecture, different source preferences, and different update cadences.

Query-level granularity. Aggregate brand mention counts cannot distinguish between a definitional citation that drives no action and a comparison citation that drives purchase intent. Track which queries trigger citations and classify them by intent — informational, navigational, comparison, transactional.
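Intent classification can start with something as simple as keyword rules before moving to anything more sophisticated. The sketch below is a deliberately crude heuristic. The keyword lists, the rule order, and the function name `classify_intent` are all assumptions for illustration and would need tuning per category.

```python
import re
from collections import Counter

# Hypothetical keyword rules, checked in order; anything unmatched is informational.
INTENT_PATTERNS = [
    ("comparison", r"\b(best|vs|versus|compare|alternatives?)\b"),
    ("transactional", r"\b(buy|pricing|price|discount|coupon)\b"),
    ("navigational", r"\b(login|sign in|official site|website)\b"),
]


def classify_intent(query):
    """Return the first matching intent label for a query string."""
    q = query.lower()
    for intent, pattern in INTENT_PATTERNS:
        if re.search(pattern, q):
            return intent
    return "informational"


# Hypothetical queries on which a brand was cited.
cited_queries = [
    "what is a crm",
    "best crm for startups",
    "acme crm pricing",
    "acme crm login",
    "how does a crm work",
]

citations_by_intent = Counter(classify_intent(q) for q in cited_queries)
print(citations_by_intent)
```

Tallying citations by intent label shows whether growth in mentions is landing on comparison and transactional queries or only on informational ones.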

Temporal measurement cadence. Monthly snapshots miss 40–60% drift in citation sets. Weekly measurement is the minimum for detecting real shifts versus noise. The VerityScore methodology recommends minimum 3 runs per prompt to account for intra-engine variance of up to 15%.
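Repeating each prompt set several times makes run-to-run variance measurable rather than invisible. The sketch below enforces a three-run minimum and reports the spread between the highest and lowest per-run citation rate. Measuring variance as that spread is an assumption, and the run data is hypothetical.

```python
from statistics import mean

MIN_RUNS = 3  # minimum runs per prompt set, following the recommendation above


def run_citation_rates(runs):
    """Citation rate per run; each run is a non-empty list of booleans (cited per prompt)."""
    if len(runs) < MIN_RUNS:
        raise ValueError(f"need at least {MIN_RUNS} runs, got {len(runs)}")
    return [sum(run) / len(run) for run in runs]


def summarize(runs):
    """Mean citation rate across runs and the max-min spread between runs."""
    rates = run_citation_rates(runs)
    return {"mean_rate": mean(rates), "spread": max(rates) - min(rates)}


# Hypothetical: three runs of the same 20 prompts on one engine.
runs = [
    [True] * 12 + [False] * 8,
    [True] * 14 + [False] * 6,
    [True] * 11 + [False] * 9,
]

summary = summarize(runs)
print(f"mean rate {summary['mean_rate']:.1%}, spread {summary['spread']:.1%}")
```

A week-over-week change smaller than the observed spread is hard to distinguish from noise.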

Source role classification. Not all citations carry equal authority. The MRI classifies sources by role — market database, analyst research, news media, review platform, brand-owned — because citation patterns cluster by source role. Understanding which role your domain occupies determines which engines are structurally likely to cite it.

Position tracking, not just mention tracking. An 8,400-prompt study found that only 12% of AI citations go to brand-owned domains. But brands receiving consistent mentions saw 23.4% lift in branded search volume over 30 days. Position within the citation list matters: earlier citations carry disproportionate influence on user behavior and on the branded-search effect that drives actual traffic.

Machine Relations Implication

The measurement failure is not technical — the data exists. The failure is conceptual: treating six architecturally distinct retrieval systems as one channel and counting mentions without decomposing them by engine, query, position, intent, and temporal stability.

Share of citation — the framework Machine Relations uses to replace share of voice in AI search contexts — exists precisely because the underlying measurement reality demands per-engine, per-query, position-aware tracking. A single number cannot represent a brand's citation authority across engines that agree on the top brand only 34% of the time.

The evidence base for this is no longer theoretical. Independent studies from VerityScore, Visionary Marketing, DigitalApplied, and Attrifast converge on the same structural finding: aggregate AI share of voice collapses meaningful signal into a directionally misleading composite. The MRI methodology and these independent frameworks all arrive at per-engine decomposition as the minimum viable measurement architecture.

Operators still reporting a single "AI share of voice" number to leadership are reporting a metric that describes no engine accurately.

FAQ

What is AI share of voice?

AI share of voice measures the percentage of AI-generated responses that mention or cite a brand across a set of category-relevant prompts. The metric borrows from traditional media share of voice but applies it to AI answer engines like ChatGPT, Perplexity, Gemini, Claude, and Google AI Mode.

Why is aggregate AI share of voice unreliable?

Because AI engines cite dramatically different sources for the same queries. Empirical data shows 77% of brands are cited by only one engine, cross-platform overlap between ChatGPT and Perplexity is 11%, and cross-engine agreement on the top brand reaches only 34% for head-term queries. Aggregating across these engines produces a number that describes none of them accurately.

How should brands measure AI search visibility instead?

Per-engine citation tracking with query-level granularity, temporal cadence (weekly minimum), source role classification, and position tracking. The Machine Relations Index methodology decomposes citation authority into five components (engine breadth, query diversity, vertical spread, position quality, and temporal consistency) to produce a measurement that requires multi-engine, multi-dimension authority.

What is the difference between share of voice and share of citation?

Share of voice counts brand mentions. Share of citation measures citation authority, accounting for which engines cite a source, in what position, across how many distinct queries, in how many verticals, and with what temporal consistency. Share of citation is a per-engine, position-aware framework designed for the structural reality that AI engines do not agree on what to cite.

How often should AI citation data be measured?

Weekly at minimum. Research shows 40–60% monthly shifts in cited domain sets within active categories, with up to 15% variance between identical runs on the same engine. Quarterly or monthly snapshots miss the majority of citation dynamics.

Additional source context

This research was produced by AuthorityTech — the first agency to practice Machine Relations. Machine Relations was coined by Jaxon Parrott.
