AI Chatbots Citing Retracted Research Threaten Scientific Reliability

Retracted research appearing in AI responses

Recent studies show that some AI chatbots and research tools rely on material from retracted scientific papers when answering user queries. Rather than fabricating sources, these systems can surface real papers that have been formally withdrawn from the scientific record, which risks misleading users who do not verify the original sources.

What the studies found

Researchers tested several AI systems using questions based on a set of retracted papers in medical imaging and other fields. In one study, OpenAI’s ChatGPT running on the GPT-4o model referenced retracted papers in multiple replies and only occasionally warned users about their retraction status. Another group tested ChatGPT running on the smaller GPT-4o mini model against 217 retracted or low-quality articles and found that the chatbot did not mention retractions or express concerns in any of its responses.

Independent tests of tools marketed for research, such as Elicit, Ai2 ScholarQA, Perplexity, and Consensus, produced similar findings. These tools cited many retracted papers without signaling that the material had been withdrawn, though some services cited retracted items less often after adding retraction data sources.

Why this matters

People use chatbots for medical advice, diagnostic suggestions, literature summaries, and quick research overviews. If an AI system relies on retracted studies, it can amplify discredited results and create false confidence in flawed findings. The risk is especially acute for nonexperts who may not click through to review the original paper or understand the meaning of a retraction notice.

Retracted papers can remain accessible on preprint servers, institutional repositories, and other websites, leaving copies scattered across the web. Models trained on datasets that are not kept up to date may continue to surface those papers even after retractions are issued.

How companies are responding

Some providers have started integrating retraction metadata into their pipelines. Consensus, for example, now aggregates retraction data from publishers, data aggregators, web crawls, and Retraction Watch, which curates retraction notices by hand. After updating its sources, Consensus reduced the number of times it cited retracted papers in tests.
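
Details of these pipelines are not public, so the following Python sketch is only an illustration of the core step: merging retraction flags from several feeds into a single lookup keyed by DOI. The feed structure, the field names, and the merge_retraction_feeds function are assumptions made for illustration, not any vendor's actual code.

```python
# Illustrative sketch: merging retraction flags from several feeds into one
# lookup table keyed by DOI. Feed structure and field names are assumptions.

def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL prefixes so records from
    different feeds can be matched against each other."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def merge_retraction_feeds(*feeds: list) -> dict:
    """Combine records from publisher notices, aggregators, web crawls, and
    hand-curated databases into a single DOI-keyed index. A paper is treated
    as retracted if any feed flags it."""
    index = {}
    for feed in feeds:
        for record in feed:
            doi = normalize_doi(record["doi"])
            entry = index.setdefault(doi, {"retracted": False, "sources": []})
            entry["retracted"] = entry["retracted"] or record.get("retracted", False)
            entry["sources"].append(record.get("source", "unknown"))
    return index

# Two hypothetical feeds that record the same retraction differently.
publisher_feed = [{"doi": "10.1234/example.1", "retracted": True, "source": "publisher"}]
curated_feed = [{"doi": "https://doi.org/10.1234/EXAMPLE.1", "retracted": True, "source": "curated"}]
print(merge_retraction_feeds(publisher_feed, curated_feed)["10.1234/example.1"])
# {'retracted': True, 'sources': ['publisher', 'curated']}
```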

Other tools report partial solutions: removing flagged items from indexes, aggregating retraction sources, or warning users about possible inaccuracies. But complete coverage is difficult because retraction notices are published in varied formats and often require manual curation to ensure accuracy.
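
What those partial fixes might look like in code can be sketched briefly; the apply_retraction_policy function and the result format below are hypothetical and not drawn from any of the tools named above.

```python
# Hypothetical sketch of two partial fixes described above: dropping flagged
# papers from results entirely, or keeping them with a visible warning.

def apply_retraction_policy(results: list, retracted_dois: set, policy: str = "warn") -> list:
    """Filter or annotate search results against a set of retracted DOIs.
    'drop' removes flagged papers; 'warn' keeps them and adds a notice."""
    cleaned = []
    for paper in results:
        if paper["doi"] in retracted_dois:
            if policy == "drop":
                continue
            paper = {**paper, "warning": "This paper has been retracted."}
        cleaned.append(paper)
    return cleaned

results = [{"doi": "10.1234/example.1", "title": "Imaging study"}]
print(apply_retraction_policy(results, {"10.1234/example.1"}))
# [{'doi': '10.1234/example.1', 'title': 'Imaging study',
#   'warning': 'This paper has been retracted.'}]
```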

Limitations of retraction databases

Retraction Watch and other databases are valuable, but building a truly comprehensive and perfectly current retraction database is resource intensive. Publishers label corrections and retractions inconsistently, using tags such as "correction," "erratum," "expression of concern," or "retracted," and the reasons for those labels vary widely.
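
One common engineering response to that inconsistency is to map publisher tags onto a small controlled vocabulary before acting on them. The groupings in the sketch below are assumptions for illustration, not a standard taxonomy.

```python
# Illustrative label normalization: collapsing the varied tags publishers use
# into a few categories. The mapping below is an assumption, not a standard.

LABEL_MAP = {
    "retracted": "retraction",
    "retraction": "retraction",
    "withdrawn": "retraction",
    "expression of concern": "concern",
    "correction": "correction",
    "erratum": "correction",
    "corrigendum": "correction",
}

def normalize_label(raw_label: str) -> str:
    """Reduce a free-text publisher tag to one of a few known categories,
    falling back to 'unknown' for tags not seen before."""
    return LABEL_MAP.get(raw_label.strip().lower(), "unknown")

for tag in ("Retracted", "Erratum", "Expression of Concern", "Editorial Note"):
    print(tag, "->", normalize_label(tag))
```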

Because of these inconsistencies and the decentralized nature of scientific publishing, automatic detection of retracted content remains imperfect. Models whose training cutoff predates a retraction will not know the paper has been withdrawn, and many academic search engines do not check results against retraction lists in real time.
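
A response-time check is straightforward to sketch, assuming access to a reasonably current list of retracted DOIs, such as a local export of a curated database. The file name and column name in the sketch below are assumptions about that export, not a documented format.

```python
# Minimal sketch of a response-time check: before citations are shown to a
# user, look each DOI up in a locally stored list of retracted DOIs.
import csv

def load_retracted_dois(path: str) -> set:
    """Read a CSV with one retracted DOI per row (the column name 'doi' is an
    assumption about the export format) into a lookup set."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["doi"].strip().lower() for row in csv.DictReader(f)}

def flag_citations(cited_dois: list, retracted: set) -> list:
    """Pair each cited DOI with a flag showing whether it is on the list.
    Because the check runs at answer time, it does not depend on the model's
    training cutoff."""
    return [(doi, doi.strip().lower() in retracted) for doi in cited_dois]

# retracted = load_retracted_dois("retractions.csv")  # hypothetical local export
# print(flag_citations(["10.1234/example.1"], retracted))
```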

Recommendations for users and institutions

Experts suggest making more contextual material available to AI models, including peer reviews, PubPeer critiques, and linked retraction notices, so that systems can better assess a paper's status and quality. Publishers that post retraction notices openly and link them to the original article make it easier for AI systems and users to detect withdrawn content.

Until detection improves, users should verify sources by visiting original papers, checking publisher sites for retraction notices, and consulting curated retraction databases. Scientists, institutions, and funders working on research-oriented AI models should prioritize integration of retraction metadata to protect downstream users and preserve trust in research tools.