XGBoost 3.0 Brings Terabyte-Scale Machine Learning to a Single NVIDIA Grace Hopper Superchip

XGBoost 3.0 now supports training GBDT models on terabyte-scale datasets using a single NVIDIA Grace Hopper Superchip, delivering substantial speed and cost advantages for enterprises.

Terabyte-Scale Training with XGBoost 3.0

XGBoost 3.0 marks a major step in scalable machine learning: gradient-boosted decision tree (GBDT) models can now be trained on datasets ranging from gigabytes up to 1 terabyte (TB) on a single NVIDIA GH200 Grace Hopper Superchip. This significantly simplifies processing massive datasets for critical applications such as fraud detection, credit risk analysis, and algorithmic trading.

Overcoming GPU Memory Limitations

A core innovation enabling this leap is the External-Memory Quantile DMatrix feature in XGBoost 3.0. Historically, GPU training was restricted by the GPU's onboard memory capacity, forcing reliance on complex multi-node systems or limiting dataset size. Leveraging the Grace Hopper Superchip's coherent memory architecture and 900 GB/s NVLink-C2C bandwidth, XGBoost 3.0 streams pre-binned and compressed data directly from host RAM to the GPU. This approach bypasses the GPU memory bottleneck and eliminates the need for massive-RAM servers or large GPU clusters.
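In code, the pattern looks roughly like the sketch below. DataIter and ExtMemQuantileDMatrix are the public XGBoost 3.0 interfaces for this mode; the synthetic NumPy batches produced by make_batches are illustrative stand-ins for real partitions loaded from disk or host RAM, and holding them in a Python list is only for brevity.

```python
# Minimal external-memory training sketch. The synthetic batches are
# placeholders; in practice each batch would be a shard loaded from
# disk or host RAM.
import numpy as np
import xgboost as xgb

def make_batches(n_batches=4, rows=100_000, cols=32, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        X = rng.standard_normal((rows, cols), dtype=np.float32)
        y = rng.standard_normal(rows, dtype=np.float32)
        yield X, y

class BatchIter(xgb.DataIter):
    """Hands XGBoost one batch per call; batches are pulled on demand."""

    def __init__(self):
        self._batches = list(make_batches())  # kept in memory only for this sketch
        self._i = 0
        super().__init__(cache_prefix="cache")  # prefix for the external-memory cache

    def next(self, input_data):
        if self._i == len(self._batches):
            return False  # signals the end of one pass over the data
        X, y = self._batches[self._i]
        input_data(data=X, label=y)  # hand the current batch to XGBoost
        self._i += 1
        return True

    def reset(self):
        self._i = 0  # rewind for the next pass over the batches

# Each batch is quantile-binned and compressed as it streams through;
# only the binned representation is retained, not the raw float32 data.
Xy = xgb.ExtMemQuantileDMatrix(BatchIter(), max_bin=256)
booster = xgb.train({"device": "cuda", "tree_method": "hist"}, Xy,
                    num_boost_round=50)
```

On a GH200-class system, the binned cache stays in host RAM and is fetched over NVLink-C2C during training, which is what keeps the GPU's own memory from becoming the limit.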

Real-World Benefits: Speed, Cost Efficiency, and Ease of Use

Organizations like the Royal Bank of Canada (RBC) have experienced up to 16 times faster model training and a 94% reduction in total cost of ownership by adopting GPU-accelerated XGBoost pipelines. This efficiency gain is especially valuable for workflows involving continuous model tuning and dynamic data volumes, enabling faster feature optimization and scalable growth.

How the External Memory Approach Works

  • External-Memory Quantile DMatrix: pre-bins features into quantile buckets, compresses the data in host RAM, and streams it to the GPU on demand, preserving model accuracy while minimizing GPU memory use.
  • Single-Chip Scalability: The GH200 Superchip combines 80 GB of HBM3 GPU memory with 480 GB of LPDDR5X system RAM to handle terabyte-scale datasets that previously required multi-GPU clusters (see the back-of-envelope sketch after this list).
  • Seamless Integration: For RAPIDS users, enabling this method requires minimal code changes, making adoption straightforward.
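A back-of-envelope calculation makes the single-chip claim concrete. The 256-bucket (one byte per value) encoding below matches the max_bin=256 setting used in the earlier sketch; labels and cache metadata overhead are deliberately ignored, so treat the result as an estimate.

```python
# Why ~1 TB of raw float32 features can sit in the GH200's 480 GB of
# host RAM once quantile-binned: each 4-byte float collapses to a
# 1-byte bin index when using 256 quantile buckets.
raw_bytes = 1_000_000_000_000      # 1 TB of raw float32 feature values
values = raw_bytes // 4            # float32 = 4 bytes per value
binned_bytes = values * 1          # 256 buckets fit in a uint8 per value
print(f"binned: {binned_bytes / 1e9:.0f} GB vs 480 GB LPDDR5X host RAM")
# -> binned: 250 GB vs 480 GB LPDDR5X host RAM
```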

Technical Recommendations

  • Use grow_policy='depthwise' for optimal tree construction performance with external memory (see the configuration sketch after this list).
  • Run on CUDA 12.8 or newer with an HMM-enabled driver to fully utilize Grace Hopper features.
  • Scalability is driven mostly by row count rather than dataset shape; both wide and tall datasets perform well on the GPU.
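Put together, a training configuration following these recommendations might look like the sketch below. build_info() is the standard XGBoost call for inspecting the installed build; parameter values other than device, tree_method, and grow_policy are illustrative.

```python
# Configuration sketch reflecting the recommendations above.
import xgboost as xgb

# Sanity-check that the installed wheel was built with CUDA support.
print(xgb.build_info().get("USE_CUDA"))

params = {
    "device": "cuda",            # run training on the GPU
    "tree_method": "hist",       # histogram method; required for external memory
    "grow_policy": "depthwise",  # level-by-level growth batches page reads well
    "max_bin": 256,              # quantile buckets per feature
    "max_depth": 8,              # illustrative
}
```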

Additional Features in XGBoost 3.0

  • Experimental distributed external memory support across GPU clusters.
  • Lower memory footprint and faster initialization, especially for mostly dense datasets.
  • Support for categorical features, quantile regression, and SHAP explainability in external-memory mode (sketched below).
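As a hedged illustration of the last point, the sketch below combines a categorical feature, the quantile-regression objective, and SHAP-style contributions in external-memory mode. The two-batch pandas frames and the quantile_alpha value are synthetic placeholders, and SHAP output via pred_contribs is assumed to behave here as it does for in-memory DMatrix objects.

```python
# Quantile regression with a categorical feature and SHAP output in
# external-memory mode. The data is synthetic; batch sizes are small
# only to keep the sketch self-contained.
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)

def batch(n=50_000):
    X = pd.DataFrame({
        "x0": rng.standard_normal(n).astype(np.float32),
        "seg": pd.Categorical(rng.choice(["a", "b", "c"], n)),  # categorical
    })
    y = rng.standard_normal(n).astype(np.float32)
    return X, y

class TwoBatchIter(xgb.DataIter):
    def __init__(self):
        self._data = [batch(), batch()]
        self._i = 0
        super().__init__(cache_prefix="cache")

    def next(self, input_data):
        if self._i == len(self._data):
            return False
        X, y = self._data[self._i]
        input_data(data=X, label=y)
        self._i += 1
        return True

    def reset(self):
        self._i = 0

Xy = xgb.ExtMemQuantileDMatrix(TwoBatchIter(), enable_categorical=True)
booster = xgb.train(
    {"objective": "reg:quantileerror", "quantile_alpha": 0.5,  # median regression
     "device": "cuda", "tree_method": "hist"},
    Xy, num_boost_round=20,
)
shap_vals = booster.predict(Xy, pred_contribs=True)  # per-feature contributions
```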

Industry Impact

By enabling terabyte-scale GBDT training on a single chip, NVIDIA democratizes access to large-scale machine learning for enterprises and financial institutions alike. This development accelerates iteration cycles, reduces costs, and simplifies IT infrastructure complexity.

For more details, visit the technical details page and the project's GitHub repository for tutorials, code, and notebooks.
