LLMs under-cite numbers and names

A February 2026 citation-preference study found that LLMs over-cite sentences carrying Wikipedia-style citation cues while under-citing numeric and named-entity claims, showing that machine citation systems still miss the evidence humans care about most.

Published March 10, 2026 · By AuthorityTech
machine-relations · ai-search · citations · source-quality · attribution

A February 2026 study from RIKEN AIP and the University of Tokyo found that current LLMs still misread what humans think deserves a citation. Across a benchmark built from 5,192 Wikipedia sentences, models were up to 27.4% more likely than humans to add citations to sentences already marked "citation needed," while underselecting numeric sentences by as much as 22.6% and sentences containing personal names by as much as 20.1% (Ando and Harada, 2026).
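To make that gap concrete, here is a minimal sketch of how per-category citation-selection gaps might be computed from a benchmark like this one. The data format is an assumption, not the paper's: each sentence carries a human judgment and a model judgment on whether it needs a citation, plus category flags, and every field name below is hypothetical.

```python
# Sketch: model-minus-human citation rates per sentence category.
# A positive gap means the model over-cites that category; a negative
# gap means it under-selects it, as with numeric sentences above.

from collections import defaultdict

def citation_rate_gaps(sentences):
    """Return model-minus-human citation rates per sentence category."""
    counts = defaultdict(lambda: {"n": 0, "human": 0, "model": 0})
    for s in sentences:
        for cat in s["categories"]:          # e.g. {"numeric", "person_name"}
            c = counts[cat]
            c["n"] += 1
            c["human"] += s["human_cites"]   # 1 if humans marked it citable
            c["model"] += s["model_cites"]   # 1 if the model cited it
    return {
        cat: (c["model"] - c["human"]) / c["n"]
        for cat, c in counts.items()
    }

demo = [
    {"categories": {"numeric"}, "human_cites": 1, "model_cites": 0},
    {"categories": {"citation_needed"}, "human_cites": 0, "model_cites": 1},
]
print(citation_rate_gaps(demo))  # {'numeric': -1.0, 'citation_needed': 1.0}
```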

That is not a cosmetic failure. It means machine citation behavior still responds more reliably to citation-shaped training signals than to the kinds of proof points people actually scrutinize first: numbers, names, dates, and attributable facts (Ando and Harada, 2026).

The live web evidence points in the same direction. SourceBench, a February 2026 benchmark from UC San Diego and GenseeAI, evaluated 3,996 cited sources across 12 systems and found wide differences in source quality from system to system. GPT-5 posted the highest weighted score at 89.081, but the more important finding was architectural: a non-reasoning model paired with stronger search outperformed a reasoning model paired with weaker search, which means source quality is being set by the retrieval stack at least as much as by the model generating the answer (Jin et al., 2026).

A separate large-scale study comparing six LLM search engines with Google and Bing reinforces that point. Across 55,936 queries, LLM search engines returned 4.3 URLs on average versus 10.3 for traditional search. They also surfaced a materially different source universe, with 37% of cited domains absent from traditional search outputs, yet they did not outperform traditional search on credibility, political neutrality, or safety metrics (Zhang et al., 2025). AI search is broadening the citation graph without consistently improving the evidence inside it (Zhang et al., 2025; Jin et al., 2026).
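The 37% figure reduces to a set comparison over cited domains. A minimal sketch of that comparison, assuming per-query URL lists as input; the aggregation across 55,936 queries is omitted, and the example URLs are invented:

```python
# Sketch: share of domains cited by an LLM search engine that never
# appear in traditional search results for the same query.

from urllib.parse import urlparse

def novel_domain_share(llm_urls, traditional_urls):
    """Fraction of LLM-cited domains absent from traditional results."""
    llm_domains = {urlparse(u).netloc for u in llm_urls}
    trad_domains = {urlparse(u).netloc for u in traditional_urls}
    if not llm_domains:
        return 0.0
    return len(llm_domains - trad_domains) / len(llm_domains)

share = novel_domain_share(
    ["https://blog.example.org/post", "https://news.example.com/a"],
    ["https://news.example.com/b", "https://wiki.example.net/c"],
)
print(f"{share:.0%}")  # 50% of the LLM-cited domains are new here
```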

User-visible attribution is weaker than the retrieval footprint behind it. In roughly 14,000 real LMArena logs, Gemini generated 34% of responses without explicitly fetching online content and provided no clickable citation source in 92% of answers. Perplexity Sonar visited about 10 relevant pages per query but cited only three to four. Citation efficiency ranged from 0.19 to 0.45 extra citations per additional relevant page visited across models on identical queries, which led the authors to a simple conclusion: retrieval design, not technical limits, shapes ecosystem impact (Strauss et al., 2025).
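Citation efficiency here is a marginal quantity: extra citations emitted per additional relevant page visited. Reading it as a least-squares slope over per-query (pages visited, citations) pairs is our assumption about the estimator, not necessarily the authors' exact method, but the sketch below shows the idea:

```python
# Sketch: citation efficiency as the OLS slope of citations emitted
# against relevant pages visited, computed per model over many queries.

def citation_efficiency(pages_visited, citations):
    """Ordinary least-squares slope of citations vs. pages visited."""
    n = len(pages_visited)
    mean_x = sum(pages_visited) / n
    mean_y = sum(citations) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(pages_visited, citations))
    var = sum((x - mean_x) ** 2 for x in pages_visited)
    return cov / var

# Roughly ten pages visited, three to four cited, echoing the
# Perplexity Sonar numbers above: the slope stays well below 1.
pages = [8, 10, 12, 9, 11]
cites = [3, 4, 4, 3, 4]
print(round(citation_efficiency(pages, cites), 2))  # 0.3
```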

The mechanism underneath this is now clearer. LLMs do not choose citations from a neutral evidence field. They inherit priors from training data and existing citation graphs. In April 2025, researchers at Vrije Universiteit Brussel analyzed 274,951 references generated by GPT-4o for 10,000 papers and found a strong Matthew effect in model recommendations. Roughly 90% of the model's references to real, existing papers fell inside the top 10% most-cited papers for their field and year, and more than 60% landed in the top 1% (Algaba et al., 2025). When a system already prefers what is heavily cited, and also underweights numeric and named-entity claims, the result is predictable: authority compounds around what already looks citeable.
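The Matthew-effect measurement reduces to a concentration check: what share of model-recommended references sit in the top decile, or top percentile, of citation counts for their field and year. A toy sketch, with invented data and our own quantile bucketing standing in for the paper's analysis:

```python
# Sketch: concentration of model recommendations in the most-cited
# slice of a field-year citation distribution.

def top_share(recommended_counts, field_year_counts, quantile):
    """Share of recommendations at or above the given citation quantile."""
    ranked = sorted(field_year_counts)
    cutoff = ranked[int(quantile * (len(ranked) - 1))]
    hits = sum(1 for c in recommended_counts if c >= cutoff)
    return hits / len(recommended_counts)

field = list(range(1, 101))               # toy field: citation counts 1..100
recs = [95, 99, 88, 97, 100, 91, 73, 96]  # what the model "recommends"
print(top_share(recs, field, 0.90))       # 0.75 land inside the top 10%
```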

Recent hallucination audits show the same instability from another angle. One March 2026 audit verified 69,557 citations generated by 10 commercial LLMs against CrossRef, OpenAlex, and Semantic Scholar and found hallucination rates ranging from 11.4% to 56.8% depending on model, domain, and prompt framing. The paper also found that when more than three LLMs independently cited the same work, accuracy rose to 95.6%, which is a useful validation signal but also an admission that single-model citation output is still unreliable by default (Naser, 2026). Another February 2026 study using CiteVerifier benchmarked 13 models across 375,440 generated citations and found hallucination rates from 14.23% to 94.93%. The same paper detected invalid or fabricated citations in 604 published papers, equal to 1.07% of the 56,381-paper corpus it examined, with an 80.9% increase in 2025 alone (Xu et al., 2026).
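That consensus signal is easy to operationalize. A minimal sketch, assuming each model's output is reduced to a set of normalized citation keys such as DOIs; the four-model threshold mirrors the paper's "more than three LLMs" finding, while the key normalization and data shapes are our assumptions:

```python
# Sketch: keep only citations that several models produce independently,
# since cross-model agreement correlates strongly with real references.

from collections import Counter

def consensus_citations(citations_per_model, min_models=4):
    """Keep citation keys produced independently by >= min_models models."""
    seen = Counter()
    for model_citations in citations_per_model:
        for key in set(model_citations):   # count each model at most once
            seen[key] += 1
    return {key for key, n in seen.items() if n >= min_models}

runs = [
    {"10.1000/real-paper", "10.1000/made-up-a"},
    {"10.1000/real-paper", "10.1000/made-up-b"},
    {"10.1000/real-paper"},
    {"10.1000/real-paper", "10.1000/made-up-a"},
]
print(consensus_citations(runs))  # only the work all four models agree on
```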

The good news is that this is an architectural problem, not a mystery. In Nature, the OpenScholar team reported that GPT-4o hallucinated citations 78% to 90% of the time when asked to cite recent literature, while their retrieval-grounded system used a datastore of 45 million open-access papers and 236 million passage embeddings to reach citation accuracy on par with human experts (Asai et al., 2026). A separate November 2025 paper from the University of Science and Technology Beijing and Xiaomi showed that improving how models represent citation markers and source context raised citation quality by 5.8% and response correctness by 17.4% over the prior state of the art (Yu et al., 2025).
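The retrieval-grounded pattern is simple in outline: answer only from retrieved passages, and attach the passage actually used as the citation. A deliberately tiny sketch of that loop, with a bag-of-words stand-in for real passage embeddings and an invented two-passage datastore; nothing here is OpenScholar's actual implementation:

```python
# Sketch: retrieval-grounded answering. Retrieve the best-matching
# passage, answer from it, and cite its source; refuse when nothing
# in the datastore matches, rather than inventing a reference.

import re
from collections import Counter
from math import sqrt

def embed(text):
    """Bag-of-words stand-in for a real passage embedding."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical datastore: (passage, source) pairs, pre-embedded.
STORE = [
    ("Acme cut checkout latency by 31 percent in Q3 2025.", "acme-case-study"),
    ("Acme was founded in 2011 in Austin.", "acme-about-page"),
]
INDEX = [(embed(p), p, s) for p, s in STORE]

def grounded_answer(query):
    """Return the best-matching passage with its source as the citation."""
    q = embed(query)
    score, passage, source = max((cosine(q, e), p, s) for e, p, s in INDEX)
    if score == 0:
        return "No grounded answer available."
    return f"{passage} [source: {source}]"

print(grounded_answer("how much did Acme cut checkout latency"))
```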

The implication for Machine Relations is specific. Brands care most about claims with names, numbers, dates, and direct attribution. Those are the exact claims AI systems still mishandle unless the retrieval and citation layer is built to surface them cleanly (Ando and Harada, 2026; Strauss et al., 2025). A homepage paragraph that says a company is trusted, innovative, or widely used is easy for a model to paraphrase and easy to ignore. A sourced claim that names the customer, states the metric, dates the result, and appears in multiple credible documents is harder to miss and easier to lift into an answer (Jin et al., 2026; Zhang et al., 2025).

This is why evidence architecture matters more than content volume. The winning unit in AI search is not a generic page. It is a clearly bounded claim with explicit entities, explicit numbers, explicit provenance, and enough third-party reinforcement to enter the citation graph that models already trust (Algaba et al., 2025; Asai et al., 2026). Brands that want machine visibility need to publish proof in a form machines can validate and reuse, not just prose humans can skim.
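Here is what that bounded-claim unit could look like as published data. The schema below is illustrative only, not a standard; every field name, value, and URL is our invention:

```python
# Sketch: a claim published as structured data a machine can validate,
# with explicit entities, an explicit number with units, a date, and
# provenance a retrieval system can fetch and verify against.

from dataclasses import dataclass, asdict
import json

@dataclass
class BoundedClaim:
    subject: str        # named entity making or owning the claim
    metric: str         # what was measured
    value: float        # the explicit number
    unit: str           # units, so the number is checkable
    date: str           # ISO date anchoring the result
    customer: str       # named entity the result applies to
    sources: list       # URLs a model can retrieve and verify against

claim = BoundedClaim(
    subject="ExampleCo",
    metric="checkout latency reduction",
    value=31.0,
    unit="percent",
    date="2025-09-30",
    customer="Acme Retail",
    sources=["https://example.com/case-studies/acme"],
)
print(json.dumps(asdict(claim), indent=2))
```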

That is the deeper signal inside this new research. Citation systems are improving, but they still do not naturally elevate the evidence humans care about most. The brands that adapt first will build pages, documents, and earned references that expose numbers and names cleanly enough for machines to carry forward. Everyone else will keep publishing material that gets read, compressed, and left uncited. More research at machinerelations.ai.

This research was produced by AuthorityTech — the first AI-native Machine Relations agency. Machine Relations is a category coined by Jaxon Parrott.
