
AI Citation: What Determines Which Sources Answer Engines Extract

AI citation is a two-stage pipeline — selection then absorption — and each engine runs it differently. Analysis of 127,198 citations across five engines reveals what actually determines which sources appear in AI-generated answers.

Published AuthorityTech
Index Data
Topics: MRI, Citation Behavior, Source Authority, AI Search

AI citation is not a single event. It is a two-stage pipeline — selection (the engine retrieves and lists your page as a source) and absorption (the engine actually pulls your language, evidence, or structure into the generated answer). A 2026 measurement framework analyzing 21,143 citations across ChatGPT, Gemini, and Perplexity confirmed these stages operate independently: many cited pages never influence the response at all. Understanding this distinction is the foundation of any serious Machine Relations strategy.

The citation pipeline: how AI engines actually select sources #

Every major answer engine follows a similar retrieval-to-citation pipeline, though each implements it differently:

  1. Query decomposition. The engine rewrites the user's question into multiple sub-queries. Google's own patents describe fanning a single question into related queries, gathering documents that answer the full set.

  2. Candidate retrieval. The engine pulls a pool of candidate pages — typically far more than will appear in the final answer. Research from Fahlout found that engines filter out roughly 95% of retrieved content before generating an answer.

  3. Re-ranking by trust signals. Retrieved pages are ranked by domain trust, author signals, semantic alignment with the query, and content freshness. Critically, traditional SEO metrics correlate poorly with citation outcomes: traffic explains about 5% of the variance in citation behavior (r² = 0.05), and backlink counts about 4% (r² = 0.038).

  4. Citation selection. Only about 15% of retrieved pages earn a visible citation. The engine selects which sources to display alongside the answer.

  5. Content absorption. The engine extracts actual language, data, and structure from selected sources to compose the answer. This is where the citation absorption gap emerges — a page can appear in the source list while contributing nothing to the answer text.
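The first four stages above can be sketched as a toy funnel. This is a minimal illustration, not any engine's actual implementation: the `Page` fields, the scoring rule (trust times relevance), the sub-query rewrite rules, and the example URLs are all assumptions made for the sketch. Only the roughly 15% citation cut-off comes from the text. Stage 5 (absorption) is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    trust: float      # hypothetical trust score in [0, 1]
    relevance: float  # hypothetical query-match score in [0, 1]


def decompose(question: str) -> list[str]:
    # Stage 1: fan one question out into sub-queries (toy rewrite rules).
    return [question, f"{question} definition", f"{question} examples"]


def retrieve(sub_queries: list[str], index: list[Page]) -> list[Page]:
    # Stage 2: a real engine searches an index per sub-query;
    # this toy version treats every indexed page as a candidate.
    return list(index)


def rerank(candidates: list[Page]) -> list[Page]:
    # Stage 3: order candidates by a combined trust-and-relevance score.
    return sorted(candidates, key=lambda p: p.trust * p.relevance, reverse=True)


def select_citations(ranked: list[Page], keep_fraction: float = 0.15) -> list[Page]:
    # Stage 4: keep roughly the top 15% as visible citations (at least one).
    return ranked[: max(1, round(len(ranked) * keep_fraction))]


# Twenty hypothetical pages that differ only in trust.
index = [Page(f"https://example.com/{i}", trust=i / 20, relevance=0.5) for i in range(1, 21)]
cited = select_citations(rerank(retrieve(decompose("ai citation"), index)))
print([p.url for p in cited])
```

With 20 candidates and a 15% cut-off, three pages survive to the visible source list. The rest were retrieved but never shown.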

What actually drives citation selection #

The factors that determine whether a page gets cited bear little resemblance to traditional search ranking signals. Based on measured data across multiple studies:

Extractability beats authority #

Entity richness increases citations by 267%, while cosine similarity (query-passage match) is 7.3× more predictive than domain authority. Pages structured with clean definitions, self-contained statistics, comparison tables, and procedural steps give retrieval systems exactly what they need to extract.
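To make "cosine similarity (query-passage match)" concrete, here is a minimal sketch using word counts as vectors. Production engines almost certainly use learned embeddings rather than bag-of-words counts, so treat the tokenizer and the example passages as assumptions. The formula itself (dot product divided by the product of vector lengths) is the standard one.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    # Lowercase word counts stand in for an embedding vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    va, vb = tokenize(a), tokenize(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # Empty text has no direction, so define its similarity as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "what is citation absorption"
on_topic = "Citation absorption is when an engine uses a page's language in its answer."
filler = "Our team has decades of experience delivering value to clients."

print(cosine_similarity(query, on_topic))
print(cosine_similarity(query, filler))
```

The self-contained definition scores well above the filler sentence, which shares no terms with the query at all.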

Content freshness matters — unevenly #

ChatGPT citations are on average 25.7% fresher than Google's organic results. Perplexity shows the strongest freshness bias, with content updated two hours prior cited 38% more frequently than month-old equivalents. Older evergreen content still earns citations when it carries unique evidence, but recency is a meaningful signal for competitive queries.

Structure determines absorption #

The five factors that drive citation absorption work multiplicatively — the weakest one limits overall absorption potential:

| Absorption factor | What it means | Measured effect |
| --- | --- | --- |
| Extractable evidence | Definitions, numbers, comparisons, steps | +17.3% citation-rate lift |
| Structural alignment | Clear document organization with hierarchical headings | Pages under 1,000 words retain 61% of content in answers |
| Semantic density | Precisely on-topic passages without filler | Query-passage cosine similarity is 7.3× more predictive than domain authority |
| Self-containment | Passages readable without surrounding context | Pages over 3,000 words retain only 13% of content |
| Corroboration surface | Claims verified across independent sources | Cross-source verified claims cited at higher rates |
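One way to read "work multiplicatively" is as a product of per-factor scores. The sketch below assumes each factor is scored between 0 and 1; the specific scores are invented for illustration and are not measurements from any study. Under that assumption the product can never exceed the weakest factor, which is why the weakest one caps the result.

```python
import math

# Hypothetical per-factor scores in [0, 1] for one page.
factors = {
    "extractable_evidence": 0.9,
    "structural_alignment": 0.8,
    "semantic_density": 0.9,
    "self_containment": 0.2,  # the weak link
    "corroboration_surface": 0.8,
}


def absorption_potential(scores: dict[str, float]) -> float:
    # Multiplicative model: every factor scales the whole result.
    return math.prod(scores.values())


baseline = absorption_potential(factors)
fix_weakest = absorption_potential({**factors, "self_containment": 0.4})
polish_strongest = absorption_potential({**factors, "extractable_evidence": 1.0})

print(f"baseline:         {baseline:.4f}")
print(f"fix weakest:      {fix_weakest:.4f}")
print(f"polish strongest: {polish_strongest:.4f}")
```

In this toy model, raising the weakest factor from 0.2 to 0.4 doubles the overall score, while perfecting an already strong factor adds only about 11%.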

How the five major engines differ #

Each answer engine runs the citation pipeline with different parameters, priorities, and source preferences. An analysis of 127,198 citations across five engines revealed stark divergence:

Citations per answer #

| Engine | Avg. sources cited |
| --- | --- |
| Gemini | 11.0 |
| Perplexity | 8.6 |
| Google AI Mode | 7.8 |
| Claude | 6.8 |
| ChatGPT | 3.7 |

Citation overlap is remarkably low #

Only 2.7% of domains (309 out of 11,647) were cited by all five engines. Nearly 70% of cited domains appeared in only one engine's results. This means optimizing for "AI search" as a monolith misses the structural reality: each engine has a distinct retrieval index, re-ranking model, and source preference profile.
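The overlap figures can be reproduced from per-engine domain sets with plain set arithmetic. The sketch below uses made-up engines and domains to show the calculation, then recomputes the 2.7% headline from the two counts given above (309 and 11,647). The function name and the toy data are assumptions for illustration.

```python
from collections import Counter


def overlap_stats(cited: dict[str, set[str]]) -> tuple[float, float]:
    """Return (share of domains cited by every engine, share cited by exactly one)."""
    all_domains = set().union(*cited.values())
    in_every_engine = set.intersection(*cited.values())
    engines_per_domain = Counter(d for domains in cited.values() for d in domains)
    single_engine = sum(1 for n in engines_per_domain.values() if n == 1)
    return len(in_every_engine) / len(all_domains), single_engine / len(all_domains)


# Toy data with hypothetical engines and domains.
toy = {
    "engine_a": {"a.com", "b.com", "c.com"},
    "engine_b": {"a.com", "d.com"},
    "engine_c": {"a.com", "b.com", "e.com"},
}
shared, single = overlap_stats(toy)
print(f"cited by all: {shared:.0%}, cited by one: {single:.0%}")

# The headline figure, recomputed from the article's own counts.
headline = 309 / 11_647
print(f"{headline:.1%}")
```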

Engine-specific source preferences #

| Source type | Google AI Mode | Perplexity | Gemini | ChatGPT | Claude |
| --- | --- | --- | --- | --- | --- |
| YouTube | 11.2% | 8.8% | 2.2% | 1.6% | 0.02% |
| Reddit | 4.0% | | | | 0.01% |
| Wikipedia | | | | | |

Vendor/product sites: 90.6% overall

Google AI Mode pulls heavily from YouTube (11.2% of citations) and Reddit (4.0%). Claude almost never cites either. A page that earns citations from Perplexity may be invisible to ChatGPT — and vice versa.

Citation rate variance #

The gap between the most and least citation-generous engines is reported as 615×. The four rates listed here span a narrower range, about 46× from Grok to ChatGPT:

  • Grok: 27.01% citation rate
  • Perplexity: 13.05%
  • Google AI Mode: 9.09%
  • ChatGPT: 0.59%
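As a quick arithmetic check on the four rates listed above, the ratio of each engine's rate to the lowest one can be computed directly. This sketch uses only those four numbers. It does not attempt to reproduce the 615× figure, which cannot be derived from these four rates alone.

```python
# Citation rates (percent) exactly as listed above.
citation_rate = {
    "Grok": 27.01,
    "Perplexity": 13.05,
    "Google AI Mode": 9.09,
    "ChatGPT": 0.59,
}

lowest = min(citation_rate.values())

# Express every engine's rate as a multiple of the lowest listed rate.
multiples = {engine: rate / lowest for engine, rate in citation_rate.items()}

for engine, multiple in multiples.items():
    print(f"{engine}: {multiple:.1f}x the lowest listed rate")
```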

Machine Relations Index: source authority varies by category #

The Machine Relations Index (MRI) tracks citation behavior for 6,079 domains and 17,847 source events across Perplexity, ChatGPT, Gemini, Claude, Google AI Mode, and Google AI Overviews (MRI Score v1.1, 6-engine methodology). The data suggest that source type shapes citation share independently of content quality.

| Source role | Domains | Citations (30d) | Citation share |
| --- | --- | --- | --- |
| Vendor-owned sources | 584 | 3,236 | 18.1% |
| Editorial publications | 549 | 2,649 | 14.8% |
| Market databases | 308 | 1,627 | 9.1% |
| Analyst/consulting research | 237 | 1,048 | 5.9% |
| Academic/government | 175 | 748 | 4.2% |
| Community/social platforms | 20 | 730 | 4.1% |
| Wire/press-release distribution | 9 | 109 | 0.6% |
| Other observed sources | 4,193 | 7,687 | 43.1% |

Community and social platforms show the highest citation density per domain — 20 platforms capture 730 citations (36.5 citations per domain), while the 4,193 uncategorized sources average just 1.8 citations each. This concentration reflects how AI engines weight established platforms as corroboration surfaces.
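The per-domain density figures follow directly from the table: divide each role's 30-day citations by its domain count. The sketch below recomputes density for every row, with shares taken against the reported total of 17,847 source events. The variable names are illustrative; the numbers are copied from the table.

```python
# (domains, citations over 30 days) per source role, copied from the MRI table.
mri_rows = {
    "Vendor-owned sources": (584, 3236),
    "Editorial publications": (549, 2649),
    "Market databases": (308, 1627),
    "Analyst/consulting research": (237, 1048),
    "Academic/government": (175, 748),
    "Community/social platforms": (20, 730),
    "Wire/press-release distribution": (9, 109),
    "Other observed sources": (4193, 7687),
}
TOTAL_SOURCE_EVENTS = 17_847  # total reported in the text

density = {role: cites / domains for role, (domains, cites) in mri_rows.items()}
share = {role: cites / TOTAL_SOURCE_EVENTS for role, (_, cites) in mri_rows.items()}

# Rank roles by citations per domain, densest first.
for role in sorted(density, key=density.get, reverse=True):
    print(f"{role}: {density[role]:.1f} citations/domain, {share[role]:.1%} share")
```

Sorting by density rather than raw share puts community/social platforms first, even though vendor-owned sources hold the largest categorized share.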

Case study: Crunchbase.com citation authority #

Crunchbase.com illustrates how source architecture compounds citation authority. In the MRI, Crunchbase ranks 2nd among 308 market databases (99.7th percentile) with an Elite-tier consensus score of 79.1.

| Metric | Value |
| --- | --- |
| Total citations (30d) | 85 |
| Engines citing | 6 of 6 |
| Distinct queries triggering citations | 29 |
| Industry verticals | 9 |
| Average citation position | 3.8 |
| Days cited (of 22 measured) | 22 |

The per-engine breakdown reveals the divergence pattern discussed above:

| Engine | Citations |
| --- | --- |
| Claude | 21 |
| Gemini | 17 |
| Google AI Overviews | 16 |
| Perplexity | 15 |
| Google AI Mode | 13 |
| ChatGPT | 3 |

Crunchbase earns 7× as many Claude citations as ChatGPT citations (21 vs. 3), which shows that a single source can have very different visibility across engines depending on how each engine's retrieval model weights structured data, domain trust, and content type.
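The per-engine counts can also be read as each engine's share of Crunchbase's 85 citations. The sketch below uses only the six counts from the breakdown above; the variable names are illustrative.

```python
# Crunchbase citations per engine over the 30-day window, from the breakdown above.
crunchbase = {
    "Claude": 21,
    "Gemini": 17,
    "Google AI Overviews": 16,
    "Perplexity": 15,
    "Google AI Mode": 13,
    "ChatGPT": 3,
}

total = sum(crunchbase.values())  # matches the 85 total citations reported
engine_share = {engine: n / total for engine, n in crunchbase.items()}
claude_to_chatgpt = crunchbase["Claude"] / crunchbase["ChatGPT"]

for engine, s in engine_share.items():
    print(f"{engine}: {s:.1%} of Crunchbase citations")
print(f"Claude vs. ChatGPT: {claude_to_chatgpt:.0f}x")
```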

The citation concentration problem #

Source diversity in AI answers is lower than most operators assume. The SurfacedBy study found that the top 10 domains captured 20.6% of all citations, while the top 100 captured 42%. Meanwhile, 43% of cited domains were cited exactly once.
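Concentration figures like "top 10 domains captured 20.6%" come from sorting per-domain citation counts and summing the head of the list. The sketch below shows that calculation on invented counts. The data is hypothetical and far smaller than the SurfacedBy dataset, so its outputs will not match the study's percentages.

```python
def top_k_share(counts: list[int], k: int) -> float:
    # Share of all citations captured by the k most-cited domains.
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


def cited_once_share(counts: list[int]) -> float:
    # Share of domains that were cited exactly once.
    return sum(1 for c in counts if c == 1) / len(counts)


# Hypothetical citation counts for twelve domains (100 citations in total).
counts = [50, 20, 10, 5, 5, 3, 2, 1, 1, 1, 1, 1]

print(f"top 2 domains: {top_k_share(counts, 2):.0%} of citations")
print(f"cited exactly once: {cited_once_share(counts):.0%} of domains")
```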

This creates a winner-take-most dynamic. Once an engine's retrieval model associates a domain with a topic cluster, that domain captures a disproportionate share of future citations on related queries. Breaking into this cycle requires building entity chains — connected webs of evidence across multiple authoritative surfaces — rather than optimizing individual pages in isolation.

Why Machine Relations treats citation as infrastructure #

Machine Relations defines citation architecture as the structural engineering that makes a brand's evidence retrievable, extractable, and absorbable by AI engines. This is distinct from content marketing or SEO because the optimization target is the retrieval pipeline itself, not the reader or the crawler.

The measured evidence supports this framing:

  • Selection is necessary but not sufficient. A page can appear in the source list without contributing to the answer. The arXiv framework found that citation count alone inadequately measures effectiveness; answer-level absorption requires separate evaluation.
  • Each engine needs a separate strategy. Only 2.7% of cited domains appear in all five engines, so visibility earned in one engine rarely carries over to the others.
  • Authority signals are earned, not purchased. Entity richness, cross-source corroboration, and passage-level extractability are structural properties of how content is built — not attributes that can be applied after the fact.

FAQ #

How many sources do AI engines typically cite per answer? #

It varies by engine. Gemini cites an average of 11.0 sources per answer, while ChatGPT cites only 3.7. Individual answers range from 2 to 10+ sources, depending on query complexity.

Do backlinks predict AI citations? #

Minimally. Backlink counts correlate with AI citations at r² = 0.038, which means they explain under 4% of the variance and have essentially no predictive power. Entity richness (267% citation increase) and query-passage semantic alignment (7.3× more predictive than domain authority) matter far more.

Is there overlap between what different AI engines cite? #

Very little. Only 2.7% of cited domains appear across all five major engines (ChatGPT, Claude, Gemini, Perplexity, Google AI Mode). Nearly 70% of domains are cited by only one engine.

What is the difference between citation selection and citation absorption? #

Citation selection is when an engine retrieves a page and lists it as a source. Citation absorption is when the engine actually uses that page's language, evidence, or structure in the generated answer. A page can be selected without being absorbed — appearing in the source list while the answer draws its actual content from other sources.
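One rough way to tell a selected page from an absorbed one is to check how much of the answer's wording also appears in the cited page. The sketch below uses shared word trigrams as that proxy. This is an assumption chosen for illustration, not the metric used in the arXiv framework, and the answer and source texts are invented.

```python
import re


def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}


def absorption_score(answer: str, source: str, n: int = 3) -> float:
    # Fraction of the answer's n-grams that also occur in the source.
    answer_grams = ngrams(answer, n)
    if not answer_grams:
        return 0.0
    return len(answer_grams & ngrams(source, n)) / len(answer_grams)


answer = "Entity richness increases citations by 267 percent according to recent studies."

# Both pages are "selected" (listed as sources); only one is absorbed.
source_a = "Our analysis shows entity richness increases citations by 267 percent."
source_b = "Backlinks remain a popular metric among marketers."

score_a = absorption_score(answer, source_a)
score_b = absorption_score(answer, source_b)
print(f"source_a: {score_a:.2f}, source_b: {score_b:.2f}")
```

Here both pages would appear in the source list, but only the first contributes wording to the answer. A trigram check misses paraphrase, so a low score is weaker evidence than a high one.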

Does content freshness affect AI citations? #

Yes, but unevenly across engines. ChatGPT citations are 25.7% fresher than Google's organic results on average. Perplexity shows the strongest freshness preference, citing recently updated content 38% more often than month-old equivalents.


Last updated: July 2, 2026. Analysis based on published research including SurfacedBy's 127,198-citation study (March–June 2026), the arXiv citation selection/absorption framework (602 controlled prompts, 21,143 citations), and Fahlout's AI citation research.

Additional source context #

This research was produced by AuthorityTech — the first agency to practice Machine Relations. Machine Relations was coined by Jaxon Parrott.
