AI citation measurement methodologies compared: why different indices rank the same publishers differently

Multiple AI citation indices now rank publishers and domains by how often answer engines cite them — and they produce sharply different rankings for the same properties. A controlled test of seven tracking tools on the same domain over 15 days found an 8.2x gap between the lowest and highest citation count. The divergence is not rounding error. It is a structural consequence of how each index defines a citation, which engines it samples, and how it handles deduplication.

This matters because brands, agencies, and media buyers are beginning to allocate earned media budgets based on these numbers. If the numbers disagree by nearly an order of magnitude, the allocation decisions built on them are unreliable — unless the buyer understands exactly what each methodology measures and what it omits.

What the major AI citation indices actually measure #

At least five distinct measurement systems now publish AI citation data. Each defines the object of measurement differently.

Baden Bower AI Visibility Index (launched June 2026) tracked 12,040 citations across six engines — ChatGPT, Perplexity, Claude, Gemini, Google AI Overviews, and Microsoft Copilot. The methodology runs 20 buyer-intent questions, each repeated 10 times per engine (1,200 total observations). It produces an AIO Citation Score and a citation share percentage. Forbes leads at 92 points and 13.8% share; Business Insider follows at 10.5%.

Foglift AI Search Citation Benchmark (Q2 2026) used 75 brand-neutral buyer-intent prompts across 25 verticals, generating 375 responses across five engines (ChatGPT, Claude, Gemini, Google AI Overview, Perplexity). The study found 2,697 cited URLs from 1,119 distinct domains. Its cross-engine Jaccard similarity score was 0.18: for the same queries, the domains that two engines both cite make up less than one-fifth of the domains that either engine cites.
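Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below shows the calculation on hypothetical cited-domain sets (not Foglift data). It assumes the cross-engine figure is the mean over all engine pairs, which is one plausible aggregation and not necessarily the one Foglift used.

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise_jaccard(cited: dict[str, set[str]]) -> float:
    """Average Jaccard similarity over every pair of engines (assumed aggregation)."""
    pairs = list(combinations(cited.values(), 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical cited-domain sets, for illustration only.
cited_domains = {
    "engine_a": {"forbes.com", "healthline.com", "youtube.com", "reddit.com"},
    "engine_b": {"forbes.com", "healthline.com", "wikipedia.org", "nih.gov"},
    "engine_c": {"healthline.com", "techcrunch.com", "g2.com", "gartner.com"},
}

score = mean_pairwise_jaccard(cited_domains)
print(f"mean pairwise Jaccard: {score:.2f}")
```

Even in this toy case, where every pair of engines shares at least one domain, the score stays low because the union grows faster than the intersection.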

Machine Relations Index (MRI) tracks 7,020 domains across 25,241 source events on six engines — ChatGPT, Perplexity, Gemini, Claude, Google AI Mode, and Google AI Overviews. MRI uses a composite consensus score built from five weighted components: engine breadth, query diversity, vertical spread, position quality, and temporal consistency. The methodology version (v1.1) produces both a raw consensus score and a weighted authority score, with peer-context percentile ranking against the full tracked domain set.
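Peer-context percentile ranking can be illustrated with a short sketch. It assumes the percentile is the share of tracked domains whose score falls strictly below the domain in question. MRI's exact definition is not given here, and the scores are hypothetical.

```python
from bisect import bisect_left

def percentile_rank(score: float, peer_scores: list[float]) -> float:
    """Percent of peer scores strictly below `score` (assumed definition)."""
    ordered = sorted(peer_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)

# Hypothetical consensus scores for a small peer set.
peers = [10.0, 20.0, 30.0, 40.0, 50.0]
rank = percentile_rank(40.0, peers)
print(f"percentile rank: {rank:.0f}")
```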

EPR AI Labs Citation Share Index (Phase 0, published June 2026) uses a five-factor formula: citation frequency (40%), cross-engine breadth (20%), query-type breadth (20%), extractability (15%), and crawl access (5%). Phase 0 published the framework only — quantitative citation-share data is deferred to Phase 1 in Q3 2026.
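The published weights can be read as a weighted sum. The following is a minimal sketch, assuming each factor is first normalized to the range 0 to 1 and that the factors combine linearly. Phase 0 states only the weights, so the function name, the normalization, and the example values are hypothetical.

```python
# Weights as published in the Phase 0 framework.
EPR_WEIGHTS = {
    "citation_frequency": 0.40,
    "cross_engine_breadth": 0.20,
    "query_type_breadth": 0.20,
    "extractability": 0.15,
    "crawl_access": 0.05,
}

def citation_share_index(factors: dict[str, float]) -> float:
    """Linear weighted sum of factors normalized to [0, 1] (assumed form)."""
    if set(factors) != set(EPR_WEIGHTS):
        raise ValueError("expected exactly the five published factors")
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1]")
    return sum(EPR_WEIGHTS[name] * factors[name] for name in EPR_WEIGHTS)

# Hypothetical normalized factor values for one domain.
example = {
    "citation_frequency": 0.5,
    "cross_engine_breadth": 1.0,
    "query_type_breadth": 0.5,
    "extractability": 0.8,
    "crawl_access": 1.0,
}
index = citation_share_index(example)
print(f"index: {index:.2f}")
```

With a 40% weight, citation frequency moves the result twice as much as either breadth factor, so a domain cited often on one engine can outscore a domain cited rarely on all of them.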

BrightEdge AI Search Insights publishes weekly citation volatility reports tracking how AI search citations shift over time, focusing on temporal instability rather than static rankings.

Why the same domain gets a different score from every index #

The kenimoto.dev comparison identified four structural causes for the divergence across seven tracking tools:

Definition variance is the largest factor #

Tools do not agree on what counts as a citation. Profound ($499/mo) counts only clickable source links. Peec AI (€89/mo) counts any brand mention — linked or unlinked. Otterly AI ($29/mo) counts cited URLs with daily deduplication. Bluefish AI calculates a share-of-voice rank. Semrush's AI Toolkit counts domain URLs in structured answer fields only. A DIY Python script tracking raw answer text found 54 citations where Otterly found 38 and Peec AI found 312 — for the same domain, same queries, same time period.

Tool	Citations found (same domain, 15 days)	Multiple of lowest count (Otterly AI = 1.0x)
Otterly AI	38	1.0x
Self-built Python	54	1.4x
Semrush AI Toolkit	71	1.9x
Bluefish AI	89	2.3x
Profound	147	3.9x
Scrunch	203	5.3x
Peec AI	312	8.2x
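The multiples in the table follow directly from the counts: each tool's count divided by the lowest count, rounded to one decimal place. A short check, using only the figures reported above:

```python
# Citation counts for the same domain over 15 days, as reported in the table.
counts = {
    "Otterly AI": 38,
    "Self-built Python": 54,
    "Semrush AI Toolkit": 71,
    "Bluefish AI": 89,
    "Profound": 147,
    "Scrunch": 203,
    "Peec AI": 312,
}

baseline = min(counts.values())  # lowest count serves as the 1.0x baseline
multiples = {tool: round(n / baseline, 1) for tool, n in counts.items()}

for tool, multiple in multiples.items():
    print(f"{tool}: {multiple}x")
```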

Engine coverage varies #

Peec AI samples five engines (ChatGPT, Claude, Gemini, Perplexity, Copilot). Others omit one or more, dropping Claude, Gemini, or Copilot from the sample. Because each engine has distinct citation preferences (Foglift found that Perplexity cites YouTube in 41.3% of responses while ChatGPT and Claude cite it in none), omitting an engine changes both the total count and the relative ranking of every tracked domain.

Sampling frequency and deduplication create drift #

Otterly deduplicates within 24-hour windows. Peec AI counts each separate mention across its measurement period. A weekly tracker that deduplicates by URL will report a fraction of what a daily tracker counting text mentions reports. Neither is wrong — they measure different things.
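To see how counting rules alone change the total, consider one hypothetical observation log counted three ways. The rules below are simplified stand-ins for illustration, not the vendors' actual implementations.

```python
from datetime import date

# (observation date, cited URL): hypothetical log, for illustration only.
observations = [
    (date(2026, 6, 1), "https://example.com/a"),
    (date(2026, 6, 1), "https://example.com/a"),
    (date(2026, 6, 1), "https://example.com/b"),
    (date(2026, 6, 2), "https://example.com/a"),
    (date(2026, 6, 3), "https://example.com/a"),
    (date(2026, 6, 3), "https://example.com/a"),
]

def count_every_mention(obs):
    """Every observation counts, with no deduplication."""
    return len(obs)

def count_daily_dedup(obs):
    """Each URL counts at most once per calendar day."""
    return len({(day, url) for day, url in obs})

def count_weekly_dedup(obs):
    """Each URL counts at most once per ISO week."""
    return len({(day.isocalendar()[:2], url) for day, url in obs})

every = count_every_mention(observations)
daily = count_daily_dedup(observations)
weekly = count_weekly_dedup(observations)
print(every, daily, weekly)
```

The same six observations yield three different totals, which is why a reported count needs its deduplication rule stated alongside it.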

Query construction determines what gets found #

Baden Bower uses 20 buyer-intent questions repeated 10 times. Foglift uses 75 brand-neutral prompts across 25 verticals. MRI tracks real user queries across 10+ verticals with 32+ unique queries per domain. The more verticals and query types an index samples, the higher the citation count ceiling — and the more likely it captures niche domains that dominate specific verticals but disappear in narrow samples.

Where the indices agree — and what that consensus reveals #

Despite the 8.2x citation count divergence, the indices converge on several structural findings:

Forbes dominates across measurement systems. Baden Bower ranks Forbes first (92 score, 13.8% citation share). Foglift finds Forbes leads in Gemini (21.3%) and Google AI Overview (20%). MRI data shows Forbes at Elite tier across 6 engines. The cross-index agreement on Forbes as the most-cited publication is one of the strongest consensus signals in AI citation measurement.

Cross-engine overlap is low. Foglift's Jaccard similarity of 0.18 is consistent with the premise behind MRI's engine-breadth component: engines largely do not cite the same sources. Of the 81 unique domains that appeared in at least one engine's top 25, only healthline.com appeared across all five engines. An index that tracks fewer than four engines is therefore likely to miss much of this divergence.

Recency matters, but not uniformly. Baden Bower found that coverage published within 12 months was cited 2.9 times more often than older content. MRI's temporal consistency component (which measures days cited over a 30-day window) captures this as a stability signal rather than a pure recency filter.
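A days-cited metric over a rolling window can be sketched as follows. This assumes the component is the number of days with at least one citation divided by the window length. MRI's actual normalization is not stated here, and the function name and dates are hypothetical.

```python
from datetime import date, timedelta

def temporal_consistency(cited_days: set[date], window_end: date,
                         window_days: int = 30) -> float:
    """Fraction of days in the window with at least one citation (assumed form)."""
    window = {window_end - timedelta(days=i) for i in range(window_days)}
    return len(cited_days & window) / window_days

end = date(2026, 6, 23)
# Hypothetical: cited every third day inside the window, plus one older day.
cited = {end - timedelta(days=i) for i in range(0, 30, 3)}
cited.add(date(2026, 5, 1))  # falls outside the 30-day window and is ignored

consistency = temporal_consistency(cited, end)
print(f"temporal consistency: {consistency:.2f}")
```

A domain cited heavily on two days and a domain cited lightly on twenty days can have the same citation count but very different consistency values, which is the distinction between a stability signal and a recency filter.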

Editorial coverage outperforms distribution. Baden Bower's data shows authored editorial content cited 2.3 times more often than press releases. This aligns with MRI's finding that source role classification — whether a domain functions as original research, analyst commentary, or syndication — affects citation authority independently of domain authority.

How to choose an AI citation measurement approach #

The right methodology depends on what business decision the measurement informs.

Decision	Measurement need	Best-fit approach
Media placement ROI	Which publications drive AI citations for buyer queries?	Baden Bower (publication-level, buyer-intent prompts) or Foglift (vertical-segmented)
Brand visibility tracking	Is our domain being cited more or less over time?	Otterly or custom Python (low cost, consistent definition, trend-over-time)
Competitive intelligence	Where does our domain rank vs. peers across all engines?	MRI (multi-engine consensus, weighted authority, peer percentile)
Content strategy	Which content types and structures get cited?	Foglift (structural type classification) + MRI (vertical spread, position quality)
Publisher authority assessment	Which domains are structurally advantaged in AI retrieval?	MRI (composite score across engine breadth, query diversity, vertical spread, temporal consistency)

No single index answers every question. A media buyer evaluating Forbes vs. TechCrunch needs citation share per publication (Baden Bower). A B2B SaaS company tracking its own AI visibility needs engine-specific trend data (MRI or Otterly). A content strategist deciding what to publish next needs structural type analysis (Foglift).

The Machine Relations framework for evaluating measurement quality #

Measurement methodologies can be evaluated on four dimensions that determine whether their output is actionable:

Engine coverage completeness. An index that omits Google AI Mode or Claude misses citation surfaces that handle substantial query volume. Six-engine coverage (MRI, Baden Bower) captures more of the actual retrieval surface than five-engine designs.
Query diversity and vertical spread. Twenty buyer-intent questions (Baden Bower) sample a narrower intent space than 75 brand-neutral prompts across 25 verticals (Foglift) or continuous multi-vertical tracking across 32+ queries per domain (MRI). Narrow query sets amplify the weight of individual query biases.
Definition transparency. The 8.2x gap exists because tools define "citation" differently and most do not disclose their definition precisely. Any index that reports a citation count without specifying whether it counts linked mentions, text mentions, or share-of-voice positions is reporting a number without a unit.
Temporal design. A single-snapshot benchmark (Foglift Q2) captures one point in time and says nothing about direction. A daily-deduped tracker (Otterly) shows trend direction. A rolling 30-day consistency metric (MRI temporal consistency component) captures stability vs. volatility. The right temporal design depends on whether the decision is about a moment or a trajectory.

FAQ #

Why do AI citation tracking tools disagree so much? #

The primary cause is definition variance — tools define "citation" differently. Some count only clickable links, others count any text mention. A comparative test found an 8.2x gap between tools for the same domain. Engine coverage, sampling frequency, and deduplication rules compound the divergence.

Which AI citation index is most accurate? #

No single index is universally accurate because accuracy depends on the definition of citation being used. Indices that track more engines, more query types, and more verticals capture a broader picture. Indices with transparent definitions and consistent methodology allow trend comparison over time. Choose based on what business question the data needs to answer.

How does the Machine Relations Index differ from other AI citation trackers? #

MRI tracks 7,020 domains across 25,241 source events on six engines and produces a composite score weighted across five components: engine breadth, query diversity, vertical spread, position quality, and temporal consistency. Unlike single-metric tools, MRI provides peer-context percentile ranking and confidence grades (A/B/C) to signal measurement reliability.

How many engines should an AI citation index track? #

At minimum four. Foglift's cross-engine Jaccard similarity of 0.18 means engines share less than one-fifth of cited domains for the same queries. Omitting even one major engine — particularly Google AI Mode, which now handles a material share of search queries — creates systematic blind spots in any ranking.

Last updated: June 23, 2026

Additional source context #

Crystal: Characterizing Relative Impact of Scholarly Publications — arxiv.org
Public Benchmarks for Citation Accuracy in AI-Authored Papers — clawRxiv, 2026
Structural Correlates of AI Citation: An Observational Analysis of 22 On-Page Metrics — SIGI, 2026
Best AI Citation Tools for Researchers in 2026 — aitrendblend.com, 2026
I Plugged the Same Site Into 7 AI-Citation Trackers. They Reported 7 Different Numbers. — DEV Community
Unbiased evaluation of ranking metrics reveals consistent performance in science and technology citation data — arxiv.org
CausalCite: A Causal Formulation of Paper Citations — ETH Zurich

AI citation measurement methodologies compared — why different indices rank the same publishers differently