When AI and Wikipedia Entrapped Endangered Languages in a 'Garbage In, Garbage Out' Loop
A mirage of native content
When Kenneth Wehr took stewardship of the Greenlandic Wikipedia four years ago, he found a site that looked healthy on the surface: some 1,500 articles and tens of thousands of words. But most of that content had been written by non-speakers and many pages were clearly generated or translated by machines. Wehr deleted large swaths of the site to try to save what little linguistic integrity remained.
How automated translation distorted small-language Wikipedias
As machine translation tools became easier to use, volunteers and newcomers began populating smaller Wikipedias with automatically produced text. For many under-resourced languages, Wikipedia is one of the largest sources of online text. That makes it both valuable and vulnerable: poor machine translations flood the web with garbage text in that language, and AI systems trained on that text reproduce and amplify the errors.
Researchers and volunteer editors have found alarmingly high shares of uncorrected machine translations in some editions: volunteers working on several African-language Wikipedias estimate 40–60% of articles are unedited machine output, and audits of Inuktitut suggest more than two-thirds of multi-sentence pages contain translated portions.
The feedback loop that threatens linguistic data
Modern translation models learn from huge bodies of online text. If Wikipedia pages for a small language are largely flawed machine translations, models ingest those flaws and produce worse translations in return. The result is a vicious cycle: AI creates poor entries, people use those tools to make more entries, and AI trains on the growing pile of errors. Kevin Scannell, who builds software for endangered languages, emphasizes that for under-resourced languages these models often start with nothing but raw scraped text—no grammar books, no dictionaries—so the quality of their output depends entirely on the quality of that input.
Human communities strained by bad automation
Small-language Wikipedias often lack the active, knowledgeable communities needed to spot and fix errors. Contributors who use Google Translate or ChatGPT may have good intentions, thinking they are seeding content that native speakers will improve. Frequently, nobody shows up to correct the mistakes. Volunteers such as Abdulkadir Abdulkadir (Fulfulde) and Lucy Iwuala (Igbo) describe spending hours cleaning up pages that are incomprehensible or dangerously misleading—for example, mistranslations that could harm farmers seeking agricultural advice.
Automation vs community governance
Wikipedia has tools like Content Translation that facilitate automatic translation, but these depend on external machine-translation engines and often produce substandard results. The Wikimedia Foundation leaves many decisions about tool use and moderation to individual language communities. That works when a community is active, as in the success story of Inari Saami, but fails where communities are thin or absent. The English-language Wikipedia has restricted Content Translation after finding that many automatically created articles did not meet its quality standards.
Success where speakers organize
There are positive examples. Inari Saami, once nearly extinct, now has several hundred speakers and a thriving Wikipedia with thousands of carefully edited articles. Local activists turned Wikipedia into a curated repository and teaching tool, integrating it into schools and using it to coin modern vocabulary. This model shows that Wikipedia can aid language revitalization when native speakers lead and protect quality.
What happened to Greenlandic and what it signals
Wehr ultimately petitioned Wikipedia's Language Committee to close the Greenlandic edition because so much of its AI-produced content was nonsense that misrepresented the language. The committee agreed to move the remaining entries to the Incubator. But by then, errors had already leaked into the machine-translation ecosystem: common tools still cannot reliably perform basic Greenlandic tasks like counting to ten.
Why this matters beyond Wikipedia
The consequences go beyond encyclopedia pages. AI systems trained on contaminated datasets can produce phrasebooks, learning materials, and automated tools that mislead learners and communities. Linguists describe some AI-generated language books sold online as nonsense, and worry that younger learners will adopt incorrect forms that undermine revitalization efforts.
Paths forward
Stopping the downward spiral requires human-led curation: active speaker communities, careful moderation of automated contributions, and investment in high-quality, human-generated resources. Where communities organize and insist on quality, Wikipedia can be an asset; where they cannot, automated tools risk accelerating language erosion. The problem is not simply technological—it is social, institutional, and urgent.