Huawei Unveils Pangu Ultra MoE: A 718B-Parameter Sparse LLM Optimized for Ascend NPUs
Huawei has introduced Pangu Ultra MoE, a 718-billion-parameter sparse language model optimized for Ascend NPUs, combining simulation-driven architecture design with system-level optimizations to achieve high training efficiency and strong performance.
Efficient Sparse Language Models with Mixture of Experts
Sparse large language models (LLMs) leveraging the Mixture of Experts (MoE) framework have become popular due to their ability to scale efficiently by activating only a subset of parameters for each token. This selective activation helps maintain high model capacity while reducing computation per token. However, as these models grow to trillions of parameters, training them efficiently demands advanced algorithms and tight hardware-software integration.
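To make the selective-activation idea concrete, the sketch below shows a generic top-k routed MoE layer in PyTorch: a router scores every token, only the k highest-scoring experts run for that token, and their outputs are combined using the renormalized routing weights. The layer sizes and the framework are illustrative assumptions, not details of Pangu Ultra MoE.

```python
# Minimal sketch of top-k expert routing in a Mixture-of-Experts layer.
# Sizes and structure are illustrative, not Pangu Ultra MoE's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)          # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                       # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)              # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)          # keep only top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize kept weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True) # tokens routed to expert e
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)  # torch.Size([16, 512]); only 2 of 8 experts run per token
```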
Challenges in Training Sparse LLMs on Specialized Hardware
One of the main issues in training sparse LLMs on non-standard AI accelerators such as Ascend NPUs is inefficient hardware utilization. Because only some parameters are active for each token, workloads become uneven across devices, causing synchronization delays and idle compute. Memory usage is also imbalanced, since some experts receive more tokens than their capacity allows. These inefficiencies compound at scale across thousands of AI chips, where communication and memory bottlenecks limit throughput and restrict the practical deployment of sparse models on Ascend NPUs.
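The load-imbalance problem can be illustrated with a few lines of synthetic routing: when the token-to-expert distribution is skewed, the busiest expert receives far more work than the average one, and every synchronized device must wait for that straggler. All numbers below are made up for illustration; none are measurements from Ascend hardware.

```python
# Illustrative sketch of why MoE workloads become uneven: token-to-expert
# routing is data-dependent, so per-expert (and per-device) loads differ widely.
import numpy as np

rng = np.random.default_rng(0)
num_experts, num_tokens, top_k = 16, 8192, 2

# Skewed routing distribution: a few "hot" experts receive most tokens.
logits = rng.normal(size=num_experts)
probs = np.exp(logits) / np.exp(logits).sum()
assignments = rng.choice(num_experts, size=(num_tokens, top_k), p=probs)

loads = np.bincount(assignments.ravel(), minlength=num_experts)
ideal = num_tokens * top_k / num_experts
print("per-expert load:", loads)
print(f"max/mean imbalance: {loads.max() / ideal:.2f}x")  # >1 means stragglers stall sync
```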
Existing Approaches and Their Limitations
Various strategies have been explored to address these problems, such as auxiliary losses that balance token distribution and drop-and-pad methods that prevent expert overload. However, these techniques can degrade model quality or introduce new inefficiencies. Heuristic expert placement and conventional communication patterns like All-to-All dispatch often fail to scale or sustain high throughput. Standard memory-saving techniques, such as recomputation, are typically coarse-grained and increase runtime without proportionate memory savings.
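As a reference point for the kind of auxiliary loss mentioned above, here is a generic Switch-Transformer-style load-balancing term: it multiplies the hard dispatch fraction per expert by the mean routing probability per expert, so it is minimized when routing is uniform. This is a textbook formulation, not Pangu Ultra MoE's exact loss.

```python
# Sketch of a generic load-balancing auxiliary loss for MoE routing.
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits: torch.Tensor, top1_idx: torch.Tensor) -> torch.Tensor:
    """router_logits: (tokens, experts); top1_idx: (tokens,) chosen expert per token."""
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)
    # f_e: fraction of tokens dispatched to each expert (hard assignment)
    f = F.one_hot(top1_idx, num_experts).float().mean(dim=0)
    # p_e: mean routing probability assigned to each expert (soft assignment)
    p = probs.mean(dim=0)
    # Minimized when both distributions are uniform across experts.
    return num_experts * torch.sum(f * p)

logits = torch.randn(1024, 8)
aux = load_balance_loss(logits, logits.argmax(dim=-1))
print(aux)  # ~1.0 when routing is balanced, larger when a few experts dominate
```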
Huawei's Pangu Ultra MoE: A Tailored Solution
The Huawei Cloud Pangu team developed Pangu Ultra MoE, a 718 billion parameter sparse LLM specifically optimized for Ascend NPUs. Their method starts with a simulation-driven configuration process that assesses thousands of architecture variants based on real hardware metrics. This approach enables informed hyperparameter tuning and saves computational resources before training begins.
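A simulation-driven search of this kind can be sketched as enumerating candidate configurations and ranking them with a cost model before committing any training compute. In the toy version below, estimate_step_time is a hypothetical stand-in and the candidate grid values are illustrative; Huawei's simulator is calibrated with real Ascend hardware metrics that this summary does not include.

```python
# Toy sketch of simulation-driven configuration search: enumerate candidate
# architectures and rank them with a (placeholder) hardware cost model.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Candidate:
    layers: int
    hidden: int
    experts: int

def estimate_step_time(c: Candidate) -> float:
    """Stand-in cost model: a compute term plus a communication penalty for experts."""
    compute = c.layers * c.hidden ** 2 * 1e-9
    comm = c.experts * c.hidden * 2e-6        # rough all-to-all dispatch cost
    return compute + comm

candidates = [Candidate(l, h, e)
              for l, h, e in product((58, 61, 64), (7168, 7680, 8192), (128, 256))]
best = min(candidates, key=estimate_step_time)
print(best, f"estimated step time: {estimate_step_time(best):.3f}s")
```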
Advanced Parallelism and Communication Techniques
The simulation explores combinations of layer counts, hidden sizes, and expert counts under a five-dimensional parallelism scheme: Pipeline, Tensor, Expert, Data, and Context Parallelism. The final model uses 256 experts, a hidden size of 7680, and 61 transformer layers. To boost performance, an Adaptive Pipe Overlap mechanism masks communication latency, and hierarchical All-to-All communication reduces inter-node data transfer. Fine-grained recomputation targets only the key-value vectors in attention modules, while tensor swapping dynamically offloads activations to host memory.
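Of these techniques, fine-grained recomputation is the easiest to illustrate generically: instead of checkpointing a whole transformer layer, only the key/value projections are dropped after the forward pass and recomputed during backward. The PyTorch sketch below shows that general idea under that assumption; it is not Huawei's Ascend implementation.

```python
# Sketch of fine-grained recomputation limited to the key/value projections of an
# attention block: only the KV tensors are recomputed in the backward pass,
# rather than checkpointing the whole layer. Generic PyTorch, not Ascend code.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class AttentionWithKVRecompute(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_proj = nn.Linear(d_model, 2 * d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def _kv(self, x):
        return self.kv_proj(x).chunk(2, dim=-1)   # recomputed during backward

    def forward(self, x):
        q = self.q_proj(x)                         # activations kept as usual
        k, v = checkpoint(self._kv, x, use_reentrant=False)  # KV activations dropped
        out, _ = self.attn(q, k, v)
        return out

x = torch.randn(2, 128, 512, requires_grad=True)
AttentionWithKVRecompute()(x).sum().backward()    # KV projections rerun here, saving memory
```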
Performance and Benchmark Results
Pangu Ultra MoE achieved a Model Flops Utilization (MFU) of 30.0% and processed 1.46 million tokens per second on 6,000 Ascend NPUs, outperforming the baseline MFU of 18.9% and 0.61 million tokens per second on 4,000 NPUs. Dynamic expert placement improved load balance and boosted MFU by 10%. The model scored competitively on benchmarks: 81.3% on AIME2024, 97.4% on MATH500, 94.8% on CLUEWSC, and 91.5% on MMLU. In healthcare, it surpassed DeepSeek R1 with 87.1% on MedQA and 80.8% on MedMCQA, demonstrating its effectiveness on domain-specific tasks.
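For readers who want to sanity-check the throughput figures, MFU is simply achieved FLOP/s divided by peak FLOP/s, so the reported 30% MFU and 1.46 million tokens per second jointly imply a training-FLOPs-per-token budget. In the arithmetic sketch below, the per-NPU peak is an assumed placeholder; only the throughput, cluster size, and MFU come from the article.

```python
# Back-of-the-envelope view of the reported numbers. peak_flops_per_npu is an
# ASSUMED placeholder; tokens/s, cluster size, and MFU are the quoted figures.
tokens_per_sec = 1.46e6
num_npus = 6_000
mfu = 0.30
peak_flops_per_npu = 3.0e14      # placeholder peak FLOP/s for one NPU

achieved_flops_per_sec = mfu * num_npus * peak_flops_per_npu
flops_per_token = achieved_flops_per_sec / tokens_per_sec
print(f"aggregate achieved: {achieved_flops_per_sec:.2e} FLOP/s")
print(f"training FLOPs per token: {flops_per_token:.2e}")
# With the usual rule of thumb of ~6 FLOPs per active parameter per token
# (forward + backward), this corresponds to roughly flops_per_token / 6
# activated parameters per token -- the result is only as good as the placeholder peak.
```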
Implications for Scalable AI Training
This research highlights how systematic architecture search, optimized communication, and memory management can unlock the potential of massive sparse models on specialized hardware. Huawei’s Pangu Ultra MoE sets a precedent for future system-aware AI designs that harmonize algorithm and hardware capabilities to achieve scalable, efficient training.