Ultra-FineWeb: A Trillion-Token Dataset Revolutionizing LLM Accuracy Across Languages
Researchers from ModelBest, Tsinghua University, and Soochow University released Ultra-FineWeb, a trillion-token multilingual dataset that significantly improves large language model accuracy through an efficient, model-driven data filtering pipeline.
Advancements in Data Filtering for LLM Pretraining
The quality of pretraining data is crucial to the success of large language models (LLMs). Traditional heuristic filtering methods, such as rule-based noise removal and deduplication, have given way to model-driven filtering, which uses neural classifiers to identify high-quality samples. These approaches, however, often suffer from inefficient validation mechanisms and a dependence on manually curated seed datasets, which introduces subjectivity.
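To make the distinction concrete, here is a minimal sketch (not taken from any of the datasets discussed) contrasting a rule-based pass with a classifier-driven pass; the word-count cutoff, hashing scheme, and `score_fn` interface are illustrative assumptions.

```python
# Illustrative contrast between the two filtering styles: a rule-based pass
# that drops short and duplicate documents, and a model-driven pass that keeps
# documents a quality scorer rates highly.
import hashlib

def heuristic_keep(doc: str, seen_hashes: set, min_words: int = 50) -> bool:
    """Rule-based filter: drop very short documents and exact duplicates."""
    if len(doc.split()) < min_words:
        return False
    digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

def model_keep(doc: str, score_fn, threshold: float = 0.5) -> bool:
    """Model-driven filter: `score_fn` is any classifier returning a quality score in [0, 1]."""
    return score_fn(doc) >= threshold
```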
Scaling Up with Model-Driven Filtering
Earlier datasets such as C4 and the Pile laid the groundwork, and recent datasets like RefinedWeb, Dolma, and DCLM have scaled up to trillions of tokens. Model-driven filtering has become popular for these large corpora because it can refine massive datasets and improve LLM performance across various benchmarks.
Innovations in Validation and Filtering
Despite this progress, current validation methods remain costly and inefficient, and standards for selecting seed data are unclear. Newer datasets like FineWeb-edu and Ultra-FineWeb improve model performance by employing multiple classifiers to cross-verify data quality, and they outperform their predecessors on benchmarks such as MMLU, ARC, and C-Eval, strengthening both English and Chinese language understanding. Some studies propose using LLMs themselves for multi-dimensional data evaluation, either through prompts or via token-level perplexity scoring, to reduce computational cost while improving data quality.
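As a rough illustration of the perplexity-scoring idea, the sketch below computes mean per-token perplexity with Hugging Face transformers; "gpt2" is a placeholder scoring model, not one used by any of the datasets above, and the threshold in the usage comment is an assumption.

```python
# Hedged sketch of perplexity-based quality scoring: lower perplexity is often
# taken as a proxy for more fluent, higher-quality text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder scorer
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Mean per-token perplexity of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

# Example usage: keep documents below a (corpus-dependent) perplexity threshold.
# keep = perplexity(doc) < 80.0
```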
Introducing Ultra-FineWeb Dataset
Researchers from ModelBest Inc., Tsinghua University, and Soochow University developed an efficient data filtering pipeline to optimize LLM training. Their approach includes a verification strategy that uses a nearly-trained LLM to evaluate new data by monitoring performance gains during final training steps, significantly reducing computational overhead. Additionally, a lightweight fastText-based classifier accelerates filtering with high accuracy.
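A simplified sketch of that verification idea follows: resume a nearly-trained checkpoint for its remaining steps on candidate data and compare benchmark scores against a run that continues on the default mixture. The `continue_training` and `evaluate` callables stand in for whatever training and evaluation framework is in use; they are assumptions for illustration, not APIs from the paper.

```python
# Low-cost data verification sketch: measure the benchmark gain a candidate
# data pool contributes during the final steps of an almost-finished run.
def candidate_gain(checkpoint, candidate_data, baseline_data,
                   continue_training, evaluate, remaining_steps=1000):
    """Return the benchmark-score delta attributable to the candidate data.

    `continue_training(checkpoint, data, steps)` resumes training and returns a
    model; `evaluate(model)` returns an aggregate benchmark score. Both are
    caller-supplied wrappers around the real framework (hypothetical here).
    """
    baseline_score = evaluate(continue_training(checkpoint, baseline_data, remaining_steps))
    candidate_score = evaluate(continue_training(checkpoint, candidate_data, remaining_steps))
    return candidate_score - baseline_score
```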
The pipeline was applied to FineWeb and Chinese FineWeb datasets, resulting in Ultra-FineWeb, which contains one trillion English tokens and 120 billion Chinese tokens. Models trained on Ultra-FineWeb demonstrated significant improvements in performance, confirming the effectiveness of the pipeline in enhancing data quality and training efficiency.
Efficient Filtering Pipeline Details
The pipeline starts with a cost-effective verification strategy that selects reliable seed samples from a pool of candidates. These seeds, combining positive samples drawn from LLM annotations, curated datasets, textbooks, and synthesized content with negative samples from diverse sources, are used to train a classifier. Rather than over-training the classifier, the emphasis is placed on high-quality seed selection. The scalable fastText-based classifier delivers competitive filtering performance at a much lower inference cost than LLM-based methods, and preprocessing steps keep the input data balanced and clean.
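For a sense of how lightweight such a classifier is, here is a minimal sketch of training and applying a fastText quality filter; the file path, label names, hyperparameters, and keep threshold are illustrative assumptions rather than the paper's settings.

```python
# Train a fastText classifier on labeled seed samples, then use it to keep
# documents scored as high quality.
import fasttext

# Each line of seeds.txt: "__label__hq <text>" for positive seeds,
# "__label__lq <text>" for negative seeds (hypothetical file and labels).
model = fasttext.train_supervised(
    input="seeds.txt",
    lr=0.1,
    epoch=5,
    wordNgrams=2,
    dim=100,
)

def keep(doc: str, threshold: float = 0.5) -> bool:
    """Keep a document if the classifier assigns the high-quality label with enough confidence."""
    labels, probs = model.predict(doc.replace("\n", " "))  # fastText expects single-line input
    return labels[0] == "__label__hq" and probs[0] >= threshold
```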
Model Training and Evaluation
The models were trained with Megatron-LM using the MiniCPM-1.2B architecture on 100 billion tokens. Evaluations with Lighteval across English and Chinese benchmarks showed that models trained on Ultra-FineWeb consistently outperformed those trained on FineWeb and FineWeb-edu, both individually and in mixed-language settings. Ultra-FineWeb-en achieved the highest average English benchmark scores, while Ultra-FineWeb-zh improved performance on Chinese tasks. Ablation studies confirmed balanced token lengths and the benefits of the efficient filtering strategy, underscoring the dataset's quality and its impact on model performance.
Summary
Ultra-FineWeb represents a high-quality multilingual dataset with approximately one trillion English tokens and 120 billion Chinese tokens. Built upon FineWeb datasets, it leverages an innovative and efficient data filtering pipeline featuring a fastText classifier and a low-cost verification strategy. This approach improves filtering accuracy, reduces manual seed dependency, and maintains robust performance with minimal computational demands. Experimental results demonstrate consistent performance gains across benchmarks, offering valuable insights for future LLM data quality optimization.
For more details, check out the original paper and dataset releases.