
ByteDance Launches QuaDMix: Revolutionizing LLM Pretraining with Unified Quality and Diversity Optimization

ByteDance unveils QuaDMix, a unified framework that enhances large language model pretraining by jointly optimizing data quality and diversity, leading to significant performance gains.

Addressing the Challenge of Data Quality and Diversity in LLM Pretraining

The efficiency and generalization of large language models (LLMs) heavily depend on the quality and diversity of their training data. Conventional data curation methods often treat quality and diversity as separate objectives, first filtering for quality and then balancing domains. This approach, however, neglects the interplay between the two: high-quality datasets often carry domain biases, while datasets optimized for diversity may compromise on quality. Under a fixed training budget, optimizing both dimensions simultaneously is crucial for maximizing model performance, yet doing so remains challenging.

Introducing QuaDMix: A Unified Framework

ByteDance has developed QuaDMix, a novel data selection framework designed to balance quality and diversity concurrently during LLM pretraining. QuaDMix evaluates each data sample based on multiple quality metrics and domain labels, assigning sampling probabilities through a parameterized function. To efficiently optimize these parameters without exhaustive training, QuaDMix uses proxy models combined with LightGBM regression to predict downstream performance.

How QuaDMix Works

The framework operates in three main stages:

  1. Feature Extraction: Each document is annotated with domain labels and multiple quality scores.
  2. Quality Aggregation: Scores are normalized and combined using domain-specific parameters to produce an aggregated quality score.
  3. Quality-Diversity Aware Sampling: Sampling is guided by a sigmoid-based function that prioritizes higher-quality samples while maintaining domain balance via parameterized controls.
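The article describes stages 2 and 3 only at a high level, but the idea can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical rendering: per-criterion quality scores are merged with domain-specific weights, and a sigmoid over the aggregated score sets the sampling probability. The function names, the parameter set (`threshold`, `sharpness`), and all numeric values are illustrative assumptions, not QuaDMix's actual parameterization.

```python
import numpy as np

def aggregate_quality(scores: np.ndarray, weights: np.ndarray) -> float:
    """Combine normalized quality scores into one aggregated score.

    scores  -- per-criterion quality scores for one document, already
               normalized to a comparable range (e.g. [0, 1]).
    weights -- domain-specific mixing weights (hypothetical; QuaDMix
               parameterizes the aggregation per domain).
    """
    return float(np.dot(weights, scores))

def sampling_probability(quality: float, threshold: float, sharpness: float) -> float:
    """Sigmoid-shaped sampling curve: documents above `threshold` are kept
    with high probability, lower-quality ones are progressively down-weighted.
    `threshold` and `sharpness` stand in for the per-domain controls."""
    return float(1.0 / (1.0 + np.exp(-sharpness * (quality - threshold))))

# Toy usage: one document with three quality scores in a single domain.
doc_scores = np.array([0.8, 0.6, 0.9])      # e.g. educational value, fluency, cleanliness
domain_weights = np.array([0.5, 0.2, 0.3])  # domain-specific aggregation weights
q = aggregate_quality(doc_scores, domain_weights)
p = sampling_probability(q, threshold=0.7, sharpness=10.0)
print(f"aggregated quality={q:.2f}, sampling probability={p:.2f}")
```

Because the aggregation weights, threshold, and sharpness are all tunable per domain, the same machinery can trade document quality against domain coverage; that parameter space is what the optimization step described next has to search.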

Optimization is achieved by training thousands of proxy models under different parameter settings. A regression model trained on these experiments predicts performance outcomes, enabling the identification of optimal sampling strategies. This structured exploration of the parameter space aligns data selection closely with downstream task objectives.
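As a rough illustration of that optimization loop, the sketch below fits a LightGBM regressor on (parameter vector, proxy-model score) pairs and then uses it to rank a large pool of candidate parameter settings instead of training a model for each one. The data here is synthetic, and the dimensionality and model settings are assumptions for illustration, not values from the paper.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Each row is one proxy experiment: a vector of sampling parameters
# (domain weights, thresholds, sharpness values, ...); the target is the
# proxy model's downstream score. Values here are synthetic placeholders.
n_experiments, n_params = 2000, 16
param_vectors = rng.uniform(0.0, 1.0, size=(n_experiments, n_params))
proxy_scores = (param_vectors @ rng.uniform(-1.0, 1.0, size=n_params)
                + rng.normal(0.0, 0.05, n_experiments))

# Fit a gradient-boosted regressor mapping parameters -> predicted performance.
model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(param_vectors, proxy_scores)

# Score a much larger pool of candidate parameter settings and keep the best
# one according to the regressor, avoiding a full training run per candidate.
candidates = rng.uniform(0.0, 1.0, size=(100_000, n_params))
best = candidates[np.argmax(model.predict(candidates))]
print("predicted-best sampling parameters:", np.round(best, 3))
```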

Benefits of QuaDMix

  • Unified optimization of data quality and domain diversity.
  • Adaptability to specific task requirements via proxy evaluation targets.
  • Efficient computation by avoiding full-scale retraining.
  • Consistent improvements in downstream benchmarks without additional compute costs.

Experimental Validation

Using the RefinedWeb dataset, QuaDMix was evaluated with models of 530 million parameters trained from scratch. Compared to baselines like Random Selection, Fineweb-edu, AskLLM, DCLM, DSIR, and RegMix, QuaDMix achieved an average performance improvement of 7.2%, reaching an average score of 39.5% across nine diverse benchmarks.

Key findings include:

  • Joint optimization outperforms methods focusing solely on quality or diversity.
  • Proxy model performance strongly correlates with large-scale model results.
  • Task-specific data mixtures enhance performance further.
  • Combining multiple quality criteria reduces biases and improves model robustness.
  • Increasing token diversity beyond a certain point yields limited benefits, emphasizing quality curation.

QuaDMix represents a significant advancement in data selection strategies for LLM pretraining, offering a scalable and effective approach to simultaneously optimize data quality and diversity.
