From hallucination to verification
AI has made fabricated citations commonplace. Used correctly, AI can make them rare. I review the problem and offer some "skills" as part of the solution.
Earlier this month, a widely shared thread on X pointed out that a prominent political scientist had a half-dozen fabricated citations in a recent article on South Korean politics.1
The reaction was about what you would expect: disappointment, mostly, and, in some quarters, shock. To be sure, no instructor would accept this in a student paper, and fabricated references in a published article must be addressed, whether through a correction or a retraction. Given the number of references that appear to have been fabricated, I suspect that most people following the case expect the latter.
Still, the speed and intensity of the response are worth pausing over. A case like this deserves more care than it has received so far.
First, we do not actually know how those citations got there. It is unlikely but entirely possible that the fault lies with the journal rather than the author. Many of us have concluded that this particular outlet does not meet the peer-review standards it advertises, and here it was turning an article around quickly to address fast-moving events in Korea. Under that kind of pressure, it is not far-fetched that an editor ran the manuscript through an AI tool to tidy or reformat the references and that the tool quietly rewrote them into works that do not exist. I do not know that this happened, and neither does anyone else, though by this point I would have expected such information to have come to light.
Calling out errors of this kind is entirely fair. But a public accusation of malpractice directed at an individual had better be well founded, because once it is made, it cannot easily be undone. Reputational damage travels quickly and lands hard; any later correction, if one comes at all, rarely receives comparable attention.
This moment also calls for a dose of humility. The ordinary, almost boring parts of academic work, such as locating a source and formatting a reference, have been disrupted. The tools responsible are exceptionally good at making users feel that they have completed the work correctly when, in fact, they have not. Even otherwise careful people can be duped by their personal robot.
Much of the anger on display, I suspect, reflects a broader anxiety about what AI is doing to the academy. Before making an example of someone, particularly someone with a record of excellent work, we should be far more certain about what actually happened. We should also place the case within the wider structural conditions that made it possible.
To be clear, I do not know how this happened. The most plausible explanation, in my mind, is not calculated fraud but misplaced confidence in a technology that routinely produces convincing falsehoods. That does not excuse the error. It does, however, change what we should learn from it. The failure may be less an individual lapse than a structural one: a culture that has adopted powerful tools without establishing equally powerful norms of verification. The good news is that this is a problem we can fix, and I have some solutions to try out. But first, we need to understand how we got here, and why the machine is so good at deceiving us.
Old habits die hard
Type a scholar’s name and a couple of keywords into Google Scholar, and you get a list of citations. Use a few of them without reading past the abstract because you already know roughly what that person argues, and you have done one of the more mundane tasks of writing. AI is just exposing an inconvenient truth. We do not (closely) read every source we cite. Some citations are indeed used as evidence, and some are presented as a kind of performance. They signal that we know the literature or that we have cited whoever a reviewer expected or explicitly asked us to. A chatbot like ChatGPT or Claude, in these early days, feels like the same exercise. You ask for the literature on a subject or particular claim, and it gives you names you recognize along with titles that look right. The problem is that some of those titles do not exist.
None of this is new. Take one prominent example. A decade ago, before anyone could blame a model, a prize-winning history of North Korea was found to have cited nonexistent or irrelevant sources in at least sixty-one places. The author returned his award and left his chair.2 The much less catastrophic version of the problem is ubiquitous. It includes use of the wrong DOI, a page range off by two, or a citation to a real paper for a claim it does not actually make. Hallucinations in all but name. What AI adds is ease of use and an appearance of accuracy, which contributes to the major structural issue we are all dealing with now.
Why the machine makes things up
The difference between Google Scholar and an AI chatbot is that the search engine (usually) returns sources that exist. A large language model returns things that are probable. It is neither a database nor a search engine. It generates the next most likely token given everything before it, so when you ask for a citation, it produces a string that looks like one. The string has an author, a year, a title, a journal, and a volume because that shape is overwhelmingly probable wherever citations appear. Nothing in that process checks the reference against a master list. The model is indifferent to whether the work exists, a machine version of Harry Frankfurt’s “bullshit,” speech produced without concern for whether it is true. A recent paper from OpenAI argues that the problem is more deeply embedded in these systems than we like to admit. Our training and evaluation procedures reward a confident guess over an admission of uncertainty, so models learn to answer confidently rather than abstain or admit ignorance.3
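To make the point concrete, here is a toy sketch in Python. It is not how a language model works internally; it is a deliberately crude stand-in in which each slot of a citation is filled with something likely for that slot. The names `SURNAMES`, `TITLE_WORDS`, `JOURNALS`, `KNOWN_WORKS`, and `generate_citation` are all invented for illustration, and the journal names are fictional on purpose.

```python
import random

# Hypothetical frequency tables: a crude stand-in for what a model absorbs
# from seeing many real citations. None of these entries is a real work.
SURNAMES = ["Kim", "Lee", "Park", "Smith"]
TITLE_WORDS = ["Democracy", "Parties", "Polarization", "Institutions", "Protest"]
JOURNALS = ["Journal of Plausible Politics", "Review of Likely Studies"]

# A stand-in for an index of works that actually exist (empty on purpose).
KNOWN_WORKS = set()


def generate_citation(rng: random.Random) -> str:
    """Assemble a citation-shaped string, one slot at a time.

    Each slot is filled because the value is a likely filler for that slot.
    Nothing in this function consults KNOWN_WORKS or any other index.
    """
    author = rng.choice(SURNAMES)
    year = rng.randint(2005, 2024)
    title = " and ".join(rng.sample(TITLE_WORDS, 2))
    journal = rng.choice(JOURNALS)
    volume = rng.randint(1, 60)
    return f"{author} ({year}). {title}. {journal}, {volume}."


if __name__ == "__main__":
    citation = generate_citation(random.Random(0))
    print(citation)
    # Checking existence is a separate step that the generator never performs.
    print("found in index:", citation in KNOWN_WORKS)
```

The output always has the right shape, and it is never found in the index, because producing the shape and checking the index are two unrelated operations.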
How often does this happen? In one controlled study, GPT-3.5 fabricated 55 percent of the citations it generated for literature reviews, and a newer model (GPT-4), while considerably better, still fabricated 18 percent.4 An audit of two and a half million biomedical papers found that the rate of fabricated references had risen more than twelvefold since 2023, to about 57 per 10,000 papers by early 2026. The steepest increase accompanied the arrival of new AI writing tools in mid-2024. Most of those references had already passed peer review; catching them would have required reviewers to chase down every source, which is a lot to ask.5 The problem reaches well beyond the academy. In 2023, a federal judge fined two New York lawyers five thousand dollars after they filed a brief built on six judicial opinions ChatGPT had invented. One of the lawyers had asked the chatbot to confirm that one of the opinions was real.
While we do not know how the fabricated citations in the case that opened this post were produced, the likeliest account is the simplest one: a single prompt to a general-purpose chatbot, with no tools or documents to check against. That is the most fragile way to use this new technology, and some degree of hallucination in the responses is close to guaranteed. Systems designed to retrieve and read real sources before answering do far better, if still imperfectly. When Stanford researchers tested the retrieval-based legal tools that vendors marketed as “hallucination-free,” the tools still returned false information in something like one answer in six. But this is at least a sign of things moving in the right direction.
At the very least, we need verification systems that can find hallucinations and other errors.
Responsible all the same
We are accountable for what we publish. There is no better safeguard than reading our sources carefully, checking every citation, and refusing to include anything we cannot verify. Fabricated sources are unacceptable whether they come from the poor use of an LLM, a tired research assistant, or our own carelessness in formatting.
For students, that standard has to be explicit, and at my own institution it is. While I am generally unimpressed with how universities — my own included — cast AI as primarily a tool for cheating, Leiden University’s Faculty of Humanities correctly defines fraud as any act or omission that prevents a correct assessment of a student’s knowledge. Its examples include the unauthorized use of AI software and the invention of research data.6 Its guidance on generative AI is more direct: “GenAI generates language, not information … If asked to produce references, many LLMs will simply invent non-existent citations.” A fabricated citation misrepresents the evidentiary basis of the work, which is exactly what assessment is meant to evaluate. Under a strict reading, it is indeed fraud, and the heaviest penalty on the books, exclusion from all examinations for a year, should be deterrent enough.
A field-wide response is going to be harder than a local one, and the distance between what these policies seek to resolve and what actually happens seems quite wide. The instinct here seems to be to police the problem out of existence. Rooted in a commitment to supporting good science, it is understandable. In mid-May, arXiv announced a one-year ban for authors caught submitting work with hallucinated references, with any later submission required to clear peer review elsewhere first.7 The motivation is right, but the mechanism is not. I think the verdict that the policy is “welcome but unenforceable” is the correct one.
The early evidence agrees. When one peer-review service ran a reference checker over two hundred papers posted to arXiv after the ban, more than one in four still carried a hallucinated citation. Despite threats of punishment, people are obviously using AI for research. The systems are too good, and too useful, for prohibition to hold. People will use them to write and to think, whatever the rules say. In all likelihood, messages meant to deter, to the extent that they are heard at all, are just driving underreporting of AI use because of social desirability bias.
We need a different approach here. Teach people to use these tools well, in the open, and turn the same technology around on the problem it has exacerbated. Build verification into the places that matter, such as a journal’s submission desk, and help writers — senior authors and students alike — such that auditing a reference list, or checking whether a source actually supports the claim attached to it, becomes routine. Done properly, these tasks could be performed more systematically and reliably than they were before ChatGPT. This shifts the emphasis from telling people what not to do toward giving them something better to do.
Skills / chatbots
The first thing we need to do is stop treating chatbots as the default interface for serious research and writing. Instead, build (or use) skills for use with agentic AI.
A skill, in the agentic AI sense, is a structured set of instructions that constrains how an AI system does a task. It fixes the steps the model takes and the tools it may use. It also specifies how the model checks its own output. A one-shot prompt invites the model to be creative. A skill takes that freedom away. It can direct a model to dispatch agents and to look up sources in real indexes and distinguish what it actually confirmed from what it merely guessed. The fabrication problem is fundamentally a model answering from probability instead of from evidence, and a well-built skill helps close off that option.
I keep a small library of these for my own research, the Open Science Skills. Three of them are built for the problems we are entertaining here.
The first is a reference checker (citation-check). This most directly addresses the problem under consideration here. It inventories every in-text citation and every entry in the bibliography of a paper, then runs the checks on them. It confirms that each cited work exists, using programmatic indexes like Crossref and OpenAlex rather than the model’s memory. It confirms that a DOI resolves to the cited paper rather than merely to some paper. It cross-checks a suspicious title against the named author’s actual body of work. And it will not call anything fabricated until it has failed to find it by title and by author. What comes back sorts the serious problems from the cosmetic ones and flags the uncertain cases for a human. It is not perfect, but it works.
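The comparison step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code inside citation-check: the lookup against Crossref or OpenAlex is left out, and the function only compares a cited reference with whatever record an index returned for it. The dictionary keys (`title`, `family`, `year`), the labels, and the 0.9 threshold are all hypothetical choices.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so cosmetic differences do not count."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def title_similarity(a: str, b: str) -> float:
    """Similarity between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify(cited: dict, record: Optional[dict], threshold: float = 0.9) -> str:
    """Compare a cited reference with the record an index returned for it.

    `record` is None when the lookup found nothing. Both dicts use the
    hypothetical keys 'title', 'family' (first author's surname), and 'year'.
    """
    if record is None:
        # Not found by this route. That is not yet proof of fabrication;
        # a search by title and by author should follow before saying so.
        return "unverified"
    if title_similarity(cited["title"], record["title"]) < threshold:
        # For example, the DOI resolves, but to some other paper.
        return "mismatch"
    if normalize(cited["family"]) != normalize(record["family"]):
        return "mismatch"
    if cited["year"] != record["year"]:
        # Right work, wrong detail: cosmetic rather than serious.
        return "minor"
    return "confirmed"


if __name__ == "__main__":
    cited = {
        "title": "Fabrication and Errors in the Bibliographic Citations "
                 "Generated by ChatGPT",
        "family": "Walters",
        "year": 2023,
    }
    print(classify(cited, dict(cited)))             # identical record
    print(classify(cited, dict(cited, year=2024)))  # same work, wrong year
    print(classify(cited, None))                    # nothing returned
```

Returning "unverified" rather than "fabricated" when nothing comes back mirrors the rule that a work should not be called fabricated until searches by title and by author have both failed. The normalization here is ASCII-only, so names with diacritics would need more careful handling in practice.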
The second addresses a different, arguably much deeper problem. It is one that is much less likely to go viral on X. A citation can be real and perfectly formatted and still not support the claim it is associated with. The source-claim checker (fact-check) reads the cited source and asks whether it supports the specific sentence. It flags overclaiming and direction errors. Notably, it needs the sources to read. As I’ve written the skill, you cannot just direct it to the internet or ask it to “get creative”. This takes us to the third skill.
The knowledge-base scaffold (research-repo) builds a project around its source library. The original formats (e.g., .pdf) are kept in one folder while another holds the machine-readable conversions the checker reads (markdown, preferably). Each is keyed to a bibliography entry for a document you have actually filed, so there is a direct connection between a source and its bibliographic metadata. The claim checker refuses to run against an empty or half-built library. To confirm you have represented a source correctly, you have to actually have the source or, where the source is not available, a file that faithfully summarizes it.
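The "refuses to run" behaviour is easy to picture as a guard that executes before any checking starts. The sketch below is a minimal illustration, not the research-repo implementation. It assumes, hypothetically, that each bibliography key has a markdown file named `<key>.md` in the conversions folder; the function names are invented for the example.

```python
from pathlib import Path


def missing_sources(bib_keys, converted_dir):
    """Return the bibliography keys that have no usable converted file.

    Assumes (hypothetically) one file per key, named '<key>.md'.
    An empty file counts as missing, since there is nothing to read.
    """
    converted = Path(converted_dir)
    missing = []
    for key in bib_keys:
        path = converted / f"{key}.md"
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(key)
    return missing


def require_complete_library(bib_keys, converted_dir):
    """Refuse to proceed when the library is empty or half-built."""
    if not bib_keys:
        raise RuntimeError("bibliography is empty; nothing to check against")
    missing = missing_sources(bib_keys, converted_dir)
    if missing:
        raise RuntimeError("no readable source filed for: " + ", ".join(missing))
```

Failing loudly here is the design choice that matters. A checker that quietly skipped unfiled sources would report a clean result for claims it never examined.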
Try it yourself
None of this is as elaborate as it might sound, and the best way to figure this stuff out is to just do it yourself. I have put up a small, self-contained walk-through on a synthetic manuscript seeded with deliberate errors, so you can watch the checks fire away. It is the first demo in a teaching hub (working prototype) I am building for students and colleagues at Leiden University, with the setup guide and slides alongside it.
We are in an odd period. The technology is too useful to ignore, yet unreliable enough to embarrass or even discredit us. At the same time, we have not used it long enough to develop sound habits around it. That is a dangerous combination.
The answer is not a moratorium or a rejection of the technology. Nor should it be a hunt for the next person to make an example of on social media. We need humility about how easily these tools can fool us, some grace for those who have misused them, and stricter standards for how we use them. These positions are not contradictory. We can hold all three at once.
The article appeared in Korea Observer 57, no. 1 (Spring 2026): 1–21, https://doi.org/10.29152/KOIKS.2026.57.1.1, a 21-page review of South Korean politics after the December 2024 martial law crisis. It was submitted on February 10 and accepted on February 21, and the journal advertises itself as a peer-reviewed quarterly. The widely shared thread on June 13 flagged six citations as fabricated. Two of the named authors, asked by news outlet NK News, confirmed they had never written the works attributed to them.
Charles K. Armstrong, Tyranny of the Weak: North Korea and the World, 1950–1992 (Ithaca, NY: Cornell University Press, 2013). The citation problems were first documented in detail by the historian Balázs Szalontai. A Columbia University investigation found research misconduct, concluding that Armstrong had “cited nonexistent or irrelevant sources in at least 61 instances,” alongside plagiarism; Szalontai’s own count ran higher still. In 2017 he returned the American Historical Association’s John K. Fairbank Prize, writing, “Due to the numerous citation errors in my book, I have decided to return the prize out of respect for the AHA,” and he retired from Columbia in 2020.
Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang, “Why Language Models Hallucinate,” arXiv:2509.04664 (2025), p. 1: “language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty … language models are optimized to be good test-takers, and guessing when uncertain improves test performance.”
William H. Walters and Esther Isabelle Wilder, “Fabrication and Errors in the Bibliographic Citations Generated by ChatGPT,” Scientific Reports 13 (2023): 14045, https://doi.org/10.1038/s41598-023-41032-5, p. 5. Across 636 citations, “55% of the GPT-3.5 citations but just 18% of the GPT-4 citations are fabricated. Likewise, 43% of the real GPT-3.5 citations but just 24% of the real GPT-4 citations include substantive citation errors.”
Maxim Topaz, Nir Roguin, Pallavi Gupta, Zhihong Zhang, and Laura-Maria Peltonen, “Fabricated Citations: An Audit across 2·5 Million Biomedical Papers,” The Lancet 407, no. 10541 (2026): 1779–1781, https://doi.org/10.1016/S0140-6736(26)00603-3. The audit scanned the PubMed Central open-access subset, about 2.5 million papers and 125.6 million references, of which 97.1 million carried an identifier and could be verified. It found the rate of fabricated references rising more than twelvefold, from roughly four per 10,000 papers in 2023 to 51.3 per 10,000 in the fourth quarter of 2025 and 56.9 in early 2026, with the sharp inflection in mid-2024. The share of papers with at least one fabricated reference rose from one in 2,828 in 2023 to one in 277 in the first seven weeks of 2026. The authors cite earlier work estimating that 30 to 69 percent of LLM-generated references in biomedicine are fabricated.
Leiden University, Faculty of Humanities, Course and Examination Regulations, Art. 7: “Fraud is defined as any activity or omission carried out by a student aimed at completely or partially hindering a correct assessment of the student’s knowledge, understanding and skills, including at least the following: … c. unauthorized use of AI software; d. inventing research data …” The enumerated list does not use the words “fabricated references” verbatim, so mapping a hallucinated citation to a specific clause is an interpretation, but coverage under the general definition is clear.
Reported across Nature, TechCrunch, Times Higher Education, Inside Higher Ed, and others in mid-May 2026, and announced by an arXiv moderator and computer-science section chair rather than, at first, as codified policy text on arXiv’s official pages. The stated trigger is “incontrovertible evidence” of unchecked LLM output, with hallucinated references the main example. This is distinct from arXiv’s earlier (October 2025) requirement that review and survey articles in its CS category clear peer review before posting, which is a submission rule, not an author ban.




