AI Citation: What Determines Which Sources Answer Engines Extract

AI citation is not a single event. It is a two-stage pipeline — selection (the engine retrieves and lists your page as a source) and absorption (the engine actually pulls your language, evidence, or structure into the generated answer). A 2026 measurement framework analyzing 21,143 citations across ChatGPT, Gemini, and Perplexity confirmed these stages operate independently: many cited pages never influence the response at all. Understanding this distinction is the foundation of any serious Machine Relations strategy.

The citation pipeline: how AI engines actually select sources #

Every major answer engine follows a similar retrieval-to-citation pipeline, though each implements it differently:

1. Query decomposition. The engine rewrites the user's question into multiple sub-queries. Google's own patents describe fanning a single question into related queries, gathering documents that answer the full set.
2. Candidate retrieval. The engine pulls a pool of candidate pages, typically far more than will appear in the final answer. Research from Fahlout found that engines filter out roughly 95% of retrieved content before generating an answer.
3. Re-ranking by trust signals. Retrieved pages are ranked by domain trust, author signals, semantic alignment with the query, and content freshness. Critically, traditional SEO metrics correlate poorly with citation outcomes: traffic explains citation behavior at r² = 0.05, and backlink counts at r² = 0.038.
4. Citation selection. Only about 15% of retrieved pages earn a visible citation. The engine selects which sources to display alongside the answer.
5. Content absorption. The engine extracts actual language, data, and structure from selected sources to compose the answer. This is where the citation absorption gap emerges: a page can appear in the source list while contributing nothing to the answer text.
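
The stages above can be sketched as a toy pipeline. This is an illustration only, not any engine's implementation: the `Page` type, the function names, the one-variant query fan-out, and the token-overlap scoring are all assumptions standing in for real retrieval and ranking models. What it shows is that the list of cited pages and the list of absorbed pages are computed separately and can differ.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a crude stand-in for real text analysis."""
    return set(text.lower().split())


def decompose(query: str) -> list[str]:
    # Hypothetical fan-out: the original query plus one narrower variant.
    return [query, f"{query} definition"]


def retrieve(sub_queries: list[str], corpus: list[Page]) -> list[Page]:
    # Candidate pool: every page sharing at least one token with any sub-query.
    wanted = set().union(*(tokens(q) for q in sub_queries))
    return [p for p in corpus if tokens(p.text) & wanted]


def rerank(query: str, pages: list[Page]) -> list[Page]:
    # Stand-in for trust and relevance signals: token overlap with the query.
    q = tokens(query)
    return sorted(pages, key=lambda p: len(tokens(p.text) & q), reverse=True)


def first_sentence(page: Page) -> str:
    return page.text.split(".")[0]


def compose(cited: list[Page]) -> str:
    # Toy generation step: quote only the top-ranked source.
    return first_sentence(cited[0]) + "."


corpus = [
    Page("a.example", "Citation absorption is the reuse of source language in an answer. It differs from selection."),
    Page("b.example", "Citation selection lists a page as a source. Ranking comes first."),
    Page("c.example", "Recipes for sourdough bread."),
]

query = "citation absorption"
candidates = retrieve(decompose(query), corpus)
cited = rerank(query, candidates)[:2]  # selection: the top two pages are listed
answer = compose(cited)
absorbed = [p.url for p in cited if first_sentence(p) in answer]

print([p.url for p in cited])  # pages listed as sources
print(absorbed)                # pages whose text actually reached the answer
```

In this toy run both `a.example` and `b.example` are listed as sources, but only `a.example` contributes text to the answer.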

What actually drives citation selection #

The factors that determine whether a page gets cited bear little resemblance to traditional search ranking signals. Based on measured data across multiple studies:

Extractability beats authority #

Entity richness increases citations by 267%, while cosine similarity (query-passage match) is 7.3× more predictive than domain authority. Pages structured with clean definitions, self-contained statistics, comparison tables, and procedural steps give retrieval systems exactly what they need to extract.
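
To make the query-passage match concrete, here is a minimal sketch of cosine similarity over bag-of-words vectors. Production systems would typically compare dense embeddings instead; the word-count vectors, the `cosine` helper, and the sample passages are assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)  # Counter returns 0 for missing words
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


query = "what is citation absorption"
passages = {
    "dense": "citation absorption is when an answer reuses a source's language",
    "filler": "our team has many years of experience helping brands grow",
}

# Rank passages by how closely they match the query.
ranked = sorted(passages, key=lambda name: cosine(query, passages[name]), reverse=True)
print(ranked)
```

The on-topic definition shares vocabulary with the query and ranks first, while the filler passage shares none and scores zero.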

Content freshness matters — unevenly #

ChatGPT citations are on average 25.7% fresher than Google's organic results. Perplexity shows the strongest freshness bias, with content updated two hours prior cited 38% more frequently than month-old equivalents. Older evergreen content still earns citations when it carries unique evidence, but recency is a meaningful signal for competitive queries.

Structure determines absorption #

The five factors that drive citation absorption work multiplicatively — the weakest one limits overall absorption potential:

Absorption factor	What it means	Measured effect
Extractable evidence	Definitions, numbers, comparisons, steps	+17.3% citation-rate lift
Structural alignment	Clear document organization with hierarchical headings	Pages under 1,000 words retain 61% of content in answers
Semantic density	Precisely on-topic passages without filler	Query-passage cosine similarity is 7.3× more predictive than domain authority
Self-containment	Passages readable without surrounding context	Pages over 3,000 words retain only 13% of content
Corroboration surface	Claims verified across independent sources	Cross-source verified claims cited at higher rates
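
One simple way to read "multiplicative" is as a plain product of per-factor scores between 0 and 1. The source gives no formula, so the functional form, the `absorption_potential` name, and the scores below are all illustrative assumptions. The sketch shows why a single weak factor caps the result.

```python
import math


def absorption_potential(factors: dict[str, float]) -> float:
    """Product of factor scores, each assumed to lie in [0, 1]."""
    for name, score in factors.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {score}")
    return math.prod(factors.values())


# Hypothetical scores for two pages.
balanced = {
    "extractable_evidence": 0.9,
    "structural_alignment": 0.9,
    "semantic_density": 0.9,
    "self_containment": 0.9,
    "corroboration_surface": 0.9,
}
one_weak = {**balanced, "self_containment": 0.2}

print(absorption_potential(balanced))
print(absorption_potential(one_weak))
```

Because every score is at most 1, the product can never exceed the smallest factor, so improving four strong factors does little while the fifth stays low.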

How the five major engines differ #

Each answer engine runs the citation pipeline with different parameters, priorities, and source preferences. An analysis of 127,198 citations across five engines revealed stark divergence:

Citations per answer #

Engine	Avg. sources cited
Gemini	11.0
Perplexity	8.6
Google AI Mode	7.8
Claude	6.8
ChatGPT	3.7

Citation overlap is remarkably low #

Only 2.7% of domains (309 out of 11,647) were cited by all five engines. Nearly 70% of cited domains appeared in only one engine's results. This means optimizing for "AI search" as a monolith misses the structural reality: each engine has a distinct retrieval index, re-ranking model, and source preference profile.
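
The overlap figures can be reproduced from per-engine domain sets. The sketch below assumes a hypothetical mapping from engine name to the set of domains it cited; the engine names and domains are placeholders, not study data. The final line checks the arithmetic behind the 2.7% figure quoted above.

```python
from collections import Counter


def overlap_profile(cited: dict[str, set[str]]) -> dict[str, float]:
    """Share of domains cited by every engine and by exactly one engine."""
    counts = Counter(d for domains in cited.values() for d in domains)
    total = len(counts)
    if total == 0:
        return {"all_engines": 0.0, "single_engine": 0.0}
    n_engines = len(cited)
    return {
        "all_engines": sum(1 for c in counts.values() if c == n_engines) / total,
        "single_engine": sum(1 for c in counts.values() if c == 1) / total,
    }


# Placeholder data: three engines, five distinct domains.
cited = {
    "engine_a": {"x.example", "y.example", "z.example"},
    "engine_b": {"x.example", "w.example"},
    "engine_c": {"x.example", "y.example", "v.example"},
}
profile = overlap_profile(cited)
print(profile)

# Arithmetic check on the figure in the text: 309 of 11,647 domains.
print(round(309 / 11_647 * 100, 1))
```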

Engine-specific source preferences #

Source type	Google AI Mode	Perplexity	Gemini	ChatGPT	Claude
YouTube	11.2%	8.8%	2.2%	1.6%	0.02%
Reddit	4.0%	—	—	—	0.01%
Wikipedia	—	—	—	—	—
Vendor/product sites	90.6% overall	—	—	—	—

Google AI Mode pulls heavily from YouTube (11.2% of citations) and Reddit (4.0%). Claude almost never cites either. A page that earns citations from Perplexity may be invisible to ChatGPT — and vice versa.

Citation rate variance #

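The gap between the most and least citation-generous engines is reported as 615×. The four rates listed here span roughly 46× (27.01% versus 0.59%), so the 615× figure presumably rests on an engine whose rate is not shown: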

Grok: 27.01% citation rate
Perplexity: 13.05%
Google AI Mode: 9.09%
ChatGPT: 0.59%

Machine Relations Index: source authority varies by category #

The Machine Relations Index (MRI) tracks citation behavior for 6,079 domains and 17,847 source events across Perplexity, ChatGPT, Gemini, Claude, Google AI Mode, and Google AI Overviews (MRI Score v1.1, 6-engine methodology). The data suggests that source type shapes citation share in its own right, separately from the quality of any individual page.

Source role	Domains	Citations (30d)	Citation share
Vendor-owned sources	584	3,236	18.1%
Editorial publications	549	2,649	14.8%
Market databases	308	1,627	9.1%
Analyst/consulting research	237	1,048	5.9%
Academic/government	175	748	4.2%
Community/social platforms	20	730	4.1%
Wire/press-release distribution	9	109	0.6%
Other observed sources	4,193	7,687	43.1%

Community and social platforms show the highest citation density per domain — 20 platforms capture 730 citations (36.5 citations per domain), while the 4,193 uncategorized sources average just 1.8 citations each. This concentration reflects how AI engines weight established platforms as corroboration surfaces.
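
The per-domain densities and citation shares follow directly from the table. The sketch below recomputes them, using the 17,847 source events stated in the text as the denominator for shares; the variable names are illustrative.

```python
# (domains, citations over 30 days) per source role, copied from the table.
rows = {
    "Vendor-owned sources": (584, 3236),
    "Editorial publications": (549, 2649),
    "Market databases": (308, 1627),
    "Analyst/consulting research": (237, 1048),
    "Academic/government": (175, 748),
    "Community/social platforms": (20, 730),
    "Wire/press-release distribution": (9, 109),
    "Other observed sources": (4193, 7687),
}
TOTAL_SOURCE_EVENTS = 17_847  # total stated in the text

density = {role: cites / domains for role, (domains, cites) in rows.items()}
share = {role: 100 * cites / TOTAL_SOURCE_EVENTS for role, (_, cites) in rows.items()}

# Print roles from highest to lowest citations per domain.
for role in sorted(density, key=density.get, reverse=True):
    print(f"{role}: {density[role]:.1f} per domain, {share[role]:.1f}% share")
```

Sorting by density rather than by share puts community platforms first and the uncategorized long tail last, which is the contrast the paragraph above draws.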

Case study: Crunchbase.com citation authority #

Crunchbase.com illustrates how source architecture compounds citation authority. In the MRI, Crunchbase ranks 2nd among 308 market databases (99.7th percentile) with an Elite-tier consensus score of 79.1.

Metric	Value
Total citations (30d)	85
Engines citing	6 of 6
Distinct queries triggering citations	29
Industry verticals	9
Average citation position	3.8
Days cited (of 22 measured)	22

The per-engine breakdown reveals the divergence pattern discussed above:

Engine	Citations
Claude	21
Gemini	17
Google AI Overviews	16
Perplexity	15
Google AI Mode	13
ChatGPT	3

Crunchbase earns seven times as many Claude citations as ChatGPT citations (21 versus 3), showing that a single source can have fundamentally different visibility across engines depending on how each engine's retrieval model weights structured data, domain trust, and content type.

The citation concentration problem #

Source diversity in AI answers is lower than most operators assume. The SurfacedBy study found that the top 10 domains captured 20.6% of all citations, while the top 100 captured 42%. Meanwhile, 43% of cited domains were cited exactly once.
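
Concentration metrics of this kind are straightforward to compute from a table of citations per domain. The sketch below uses made-up counts and hypothetical helper names (`top_k_share`, `cited_once_share`); it is not the SurfacedBy methodology, only the same two ratios applied to toy data.

```python
def top_k_share(counts: dict[str, int], k: int) -> float:
    """Fraction of all citations captured by the k most-cited domains."""
    ordered = sorted(counts.values(), reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


def cited_once_share(counts: dict[str, int]) -> float:
    """Fraction of cited domains that were cited exactly once."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Made-up citation counts for nine domains (100 citations in total).
counts = {
    "a.example": 50, "b.example": 25, "c.example": 15, "d.example": 5,
    "e.example": 1, "f.example": 1, "g.example": 1, "h.example": 1, "i.example": 1,
}

print(top_k_share(counts, k=2))
print(cited_once_share(counts))
```

In the toy data two domains hold three quarters of the citations while more than half of the domains appear only once, the same head-heavy, long-tail shape the study describes.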

This creates a winner-take-most dynamic. Once an engine's retrieval model associates a domain with a topic cluster, that domain captures a disproportionate share of future citations on related queries. Breaking into this cycle requires building entity chains — connected webs of evidence across multiple authoritative surfaces — rather than optimizing individual pages in isolation.

Why Machine Relations treats citation as infrastructure #

Machine Relations defines citation architecture as the structural engineering that makes a brand's evidence retrievable, extractable, and absorbable by AI engines. This is distinct from content marketing or SEO because the optimization target is the retrieval pipeline itself, not the reader or the crawler.

The measured evidence supports this framing:

Selection is necessary but not sufficient. A page can appear in the source list without contributing to the answer. The arXiv framework found that citation count alone inadequately measures effectiveness; answer-level absorption requires separate evaluation.
Each engine needs a separate strategy. Only 2.7% of cited domains appear on all five engines and nearly 70% appear on just one, so visibility on a single engine says little about visibility on the rest.
Authority signals are earned, not purchased. Entity richness, cross-source corroboration, and passage-level extractability are structural properties of how content is built — not attributes that can be applied after the fact.
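
Evaluating absorption separately from selection needs some measure of how much of an answer traces back to a given source. The sketch below uses word n-gram overlap as a rough proxy. This is a swapped-in technique chosen for simplicity, not the method of the arXiv framework, and the `absorption_score` name and the sample texts are assumptions.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of lowercased word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def absorption_score(source: str, answer: str, n: int = 3) -> float:
    """Fraction of the answer's word n-grams that also occur in the source."""
    answer_grams = ngrams(answer, n)
    if not answer_grams:
        return 0.0
    return len(answer_grams & ngrams(source, n)) / len(answer_grams)


answer = "the engine reuses language from the source page in its answer"
source_a = "sometimes the engine reuses language from the source page"
source_b = "this page was listed as a source but never quoted"

score_a = absorption_score(source_a, answer)
score_b = absorption_score(source_b, answer)
print(score_a, score_b)
```

Both sources could sit in the same citation list, yet only `source_a` shares any phrasing with the answer. Verbatim overlap misses paraphrase, so a score of zero is weak evidence of non-absorption rather than proof.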

FAQ #

How many sources do AI engines typically cite per answer? #

It varies by engine. Gemini cites an average of 11.0 sources per answer, while ChatGPT cites only 3.7. Across engines, an individual answer carries anywhere from 2 to more than 10 sources, depending on query complexity.

Do backlinks help with AI citations? #

Minimally. Backlink counts correlate with AI citations at r² = 0.038 — essentially no predictive power. Entity richness (267% citation increase) and query-passage semantic alignment (7.3× more predictive than domain authority) matter far more.

Is there overlap between what different AI engines cite? #

Very little. Only 2.7% of cited domains appear across all five major engines (ChatGPT, Claude, Gemini, Perplexity, Google AI Mode). Nearly 70% of domains are cited by only one engine.

What is the difference between citation selection and citation absorption? #

Citation selection is when an engine retrieves a page and lists it as a source. Citation absorption is when the engine actually uses that page's language, evidence, or structure in the generated answer. A page can be selected without being absorbed — appearing in the source list while the answer draws its actual content from other sources.

Does content freshness affect AI citations? #

Yes, but unevenly across engines. ChatGPT citations are 25.7% fresher than Google's organic results on average. Perplexity shows the strongest freshness preference, citing recently updated content 38% more often than month-old equivalents.

Last updated: July 2, 2026. Analysis based on published research including SurfacedBy's 127,198-citation study (March–June 2026), the arXiv citation selection/absorption framework (602 controlled prompts, 21,143 citations), and Fahlout's AI citation research.

