Massive AI Dataset Exposed: Hundreds of Millions of Personal Documents and Faces Included
A new study reveals that millions of personal documents and identifiable faces are present in one of the largest AI training datasets, highlighting significant privacy risks in AI development.
Discovery of Personal Data in AI Training Sets
Recent research has revealed that DataComp CommonPool, one of the largest open-source AI training datasets, contains millions of images featuring personal data such as passports, credit cards, birth certificates, and identifiable faces. The research team audited only 0.1% of the dataset; extrapolating from that sample, they estimate that hundreds of millions of such personal images exist across the full dataset.
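To put the audit's scale in perspective, a 0.1% sample of a 12.8-billion-item dataset is still enormous, and every finding in it implies roughly a thousandfold more across the whole collection. The sketch below is illustrative arithmetic based on the figures reported in the article, not the study's actual methodology:

```python
# Illustrative arithmetic only: the figures come from the article's
# reported numbers, not from the study's own estimation method.
DATASET_SIZE = 12_800_000_000  # 12.8 billion image-text pairs
AUDIT_FRACTION = 0.001         # the 0.1% sample the researchers examined

sample_size = int(DATASET_SIZE * AUDIT_FRACTION)
scale_factor = DATASET_SIZE // sample_size  # how much each finding "weighs"

print(f"Audited sample: {sample_size:,} image-text pairs")
print(f"Each item found implies roughly x{scale_factor:,} dataset-wide")
```

Even this crude linear scaling makes clear why "thousands" of identity documents in the sample translates into estimates in the hundreds of millions for the full dataset.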
Scope of Sensitive Information Found
The researchers uncovered thousands of verified identity documents, including credit cards, driver’s licenses, passports, and birth certificates. They also found more than 800 job-application documents, such as résumés and cover letters, validated against LinkedIn and other online sources. Sensitive details disclosed in these résumés include disability status, background-check results, the birthdates and birthplaces of dependents, race, contact information, government IDs, home addresses, and references.
Dataset Background and Usage
DataComp CommonPool, released in 2023 with 12.8 billion image-text pairs, is the largest publicly available dataset for training generative text-to-image AI models. Created as a successor to the LAION-5B dataset, it was built from web data scraped by Common Crawl between 2014 and 2022. Its license permits both academic and commercial use.
Privacy Challenges and Limitations
Despite the dataset curators' privacy measures, such as automatic face blurring, the researchers found that over 800 faces had escaped detection in their sample alone, and they estimate that 102 million faces were missed across the entire dataset. Moreover, no filters were applied to detect strings of personally identifiable information such as email addresses or social security numbers. Face blurring is also optional and can be disabled by downstream users, and metadata and captions often contain additional personal information.
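The PII-string filters the researchers found missing are conceptually straightforward, which underscores the gap. A minimal sketch of what scanning captions and metadata might look like is below; the patterns and function name are illustrative assumptions, not the study's or the curators' actual tooling, and real PII detection requires far more robust rules:

```python
import re

# Illustrative patterns only -- production PII detection needs much
# broader coverage and still misses context-dependent identifiers.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_pii(text: str) -> dict[str, list[str]]:
    """Return every pattern match found in a caption or metadata string."""
    return {name: pat.findall(text)
            for name, pat in PII_PATTERNS.items()
            if pat.findall(text)}

print(find_pii("Contact jane.doe@example.com, SSN 123-45-6789"))
```

Even simple pattern matching like this, applied to 12.8 billion captions, would flag an enormous volume of personal data; the study's point is that no such filtering step was applied at all.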
Attempts at Mitigation
Hugging Face, the platform that distributes CommonPool, offers a tool that lets individuals search for and request removal of their data, though this requires people to know their data is included in the first place. Even then, removal from the dataset does not protect privacy if AI models already trained on it retain the information.
Legal and Ethical Considerations
The research highlights the difficulty of regulating data scraped from the web. Laws such as Europe's GDPR and California's CCPA have limitations and do not cover all cases, particularly datasets assembled for academic use. Publicly available data is often assumed to be free for the taking, but this research challenges that assumption by revealing how sensitive much of that data actually is.
The Need for Rethinking Data Collection Practices
The study urges the machine learning community to reconsider indiscriminate web scraping practices. The presence of large volumes of personal information in datasets raises significant ethical and legal questions about consent, privacy, and data protection in AI development.
Quotes from Experts
William Agnew of Carnegie Mellon University states, “Anything you put online can and probably has been scraped.” Rachel Hong, the study’s lead author, warns that many AI models are trained on these datasets, perpetuating privacy risks. Ben Winters of the Consumer Federation of America calls this the “original sin” of AI systems built on public data, emphasizing its extractive and hazardous nature.
Conclusion
This research serves as a wake-up call to the AI and machine learning fields to address privacy concerns proactively, improve data filtering technologies, and develop policies that better protect individuals whose data is used without explicit consent.