Massive AI Dataset Exposed: Hundreds of Millions of Personal Documents and Faces Included
A new study reveals that millions of personal documents and identifiable faces are present in one of the largest AI training datasets, highlighting significant privacy risks in AI development.
Discovery of Personal Data in AI Training Sets
Recent research has revealed that DataComp CommonPool, one of the largest open-source AI training datasets, contains millions of images featuring personal data such as passports, credit cards, birth certificates, and identifiable faces. The research team audited only 0.1% of the dataset; extrapolating from that sample, they estimate that hundreds of millions of such personal images exist across the full dataset.
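To put the audit's scale in perspective, a 0.1% sample of a 12.8-billion-item dataset is still enormous, and every finding in it implies roughly a thousandfold more across the whole collection. The sketch below is illustrative arithmetic based on the figures reported in the article, not the study's actual methodology:

```python
# Illustrative arithmetic only: the figures come from the article's
# reported numbers, not from the study's own estimation method.
DATASET_SIZE = 12_800_000_000  # 12.8 billion image-text pairs
AUDIT_FRACTION = 0.001         # the 0.1% sample the researchers examined

sample_size = int(DATASET_SIZE * AUDIT_FRACTION)
scale_factor = DATASET_SIZE // sample_size  # how much each finding "weighs"

print(f"Audited sample: {sample_size:,} image-text pairs")
print(f"Each item found implies roughly x{scale_factor:,} dataset-wide")
```

Even this crude linear scaling makes clear why "thousands" of identity documents in the sample translates into estimates in the hundreds of millions for the full dataset.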
Scope of Sensitive Information Found
The researchers uncovered thousands of verified identity documents, including credit cards, driver’s licenses, passports, and birth certificates. They also found more than 800 job-application documents, such as résumés and cover letters, validated against LinkedIn and other online sources. Sensitive details disclosed in these résumés include disability status, background-check results, the birthdates and birthplaces of dependents, race, contact information, government IDs, home addresses, and references.
Dataset Background and Usage
DataComp CommonPool, released in 2023 with 12.8 billion image-text pairs, is the largest publicly available dataset for training generative text-to-image AI models. Created as a successor to the LAION-5B dataset, it was built from web data scraped by Common Crawl between 2014 and 2022. Its license permits both academic and commercial use.
Privacy Challenges and Limitations
Despite the dataset curators' privacy measures, such as automatic face blurring, the researchers found that over 800 faces had escaped detection in their sample alone, and they estimate that 102 million faces were missed across the entire dataset. Moreover, no filters were applied to detect strings of personally identifiable information such as email addresses or social security numbers. Face blurring is also optional and can be disabled by downstream users, and metadata and captions often contain additional personal information.
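The PII-string filters the researchers found missing are conceptually straightforward, which underscores the gap. A minimal sketch of what scanning captions and metadata might look like is below; the patterns and function name are illustrative assumptions, not the study's or the curators' actual tooling, and real PII detection requires far more robust rules:

```python
import re

# Illustrative patterns only -- production PII detection needs much
# broader coverage and still misses context-dependent identifiers.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_pii(text: str) -> dict[str, list[str]]:
    """Return every pattern match found in a caption or metadata string."""
    return {name: pat.findall(text)
            for name, pat in PII_PATTERNS.items()
            if pat.findall(text)}

print(find_pii("Contact jane.doe@example.com, SSN 123-45-6789"))
```

Even simple pattern matching like this, applied to 12.8 billion captions, would flag an enormous volume of personal data; the study's point is that no such filtering step was applied at all.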
Attempts at Mitigation
Hugging Face, the platform that distributes CommonPool, offers a tool that lets individuals search for and request removal of their data, though this requires people to know their data is included in the first place. Even then, removal from the dataset does not protect privacy if AI models already trained on it retain the information.
Legal and Ethical Considerations
The research highlights the difficulty of regulating data scraped from the web. Laws such as Europe's GDPR and California's CCPA have limitations and do not cover all cases, particularly datasets assembled for academic use. Publicly available data is often assumed to be free for the taking, but this research challenges that assumption by revealing how sensitive much of that data actually is.
The Need for Rethinking Data Collection Practices
The study urges the machine learning community to reconsider indiscriminate web scraping practices. The presence of large volumes of personal information in datasets raises significant ethical and legal questions about consent, privacy, and data protection in AI development.
Quotes from Experts
William Agnew of Carnegie Mellon University states, “Anything you put online can and probably has been scraped.” Rachel Hong, the study’s lead author, warns that many AI models are trained on these datasets, perpetuating privacy risks. Ben Winters of the Consumer Federation of America calls this the “original sin” of AI systems built on public data, emphasizing its extractive and hazardous nature.
Conclusion
This research serves as a wake-up call to the AI and machine learning fields to address privacy concerns proactively, improve data filtering technologies, and develop policies that better protect individuals whose data is used without explicit consent.