
PARSCALE: Revolutionizing Language Model Scaling with Parallel Computation

PARSCALE introduces a parallel computation approach to scale language models efficiently, reducing memory use and latency while improving performance across various tasks.

Challenges in Scaling Language Models

Language models have advanced largely by growing their parameter counts or the compute spent at inference, both of which demand substantial memory and resources. Traditional approaches such as dense parameter scaling and Mixture-of-Experts add trainable parameters and inflate the memory footprint, while inference-time scaling extends output length or reasoning steps, adding latency and slowing deployment. Both strategies are difficult to apply in low-resource environments such as mobile or embedded devices.

Introduction to PARSCALE

Researchers from Zhejiang University and Alibaba Group introduced PARSCALE (Parallel Scaling), a novel approach that enhances model performance by increasing parallel computations rather than model size or output length. PARSCALE applies multiple learnable transformations to the input, allowing the model to execute several forward passes simultaneously. The outputs are then dynamically aggregated, maintaining the original parameter count while boosting computational diversity.
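To make the idea concrete, the minimal PyTorch sketch below illustrates the pattern: the same input is replicated into P streams, each stream is modified by its own learnable transformation, all streams pass through one shared backbone in a single batched forward pass, and the P outputs are merged with input-dependent weights. The class and variable names are hypothetical, and the additive per-stream shift stands in for the learnable transformations described in the paper; this is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ParallelScaledModel(nn.Module):
    """Sketch of parallel scaling: P learnable input transformations share one
    backbone, and the P outputs are merged with dynamic, input-dependent weights."""

    def __init__(self, backbone: nn.Module, d_model: int, num_streams: int = 4):
        super().__init__()
        self.backbone = backbone              # shared (possibly frozen) model
        self.num_streams = num_streams
        # One learnable transformation per stream (here: a simple additive shift).
        self.stream_shifts = nn.Parameter(torch.zeros(num_streams, d_model))
        # Small MLP that scores each stream's output for dynamic aggregation.
        self.aggregator = nn.Sequential(
            nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Replicate the input into P parallel streams.
        b, s, d = x.shape
        streams = x.unsqueeze(1) + self.stream_shifts.view(1, self.num_streams, 1, d)
        streams = streams.reshape(b * self.num_streams, s, d)
        outs = self.backbone(streams)                            # one batched forward pass
        outs = outs.reshape(b, self.num_streams, s, d)
        weights = torch.softmax(self.aggregator(outs), dim=1)    # (b, P, s, 1)
        return (weights * outs).sum(dim=1)                       # dynamic weighted sum

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
model = ParallelScaledModel(backbone, d_model=64, num_streams=4)
print(model(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```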

Technical Details of PARSCALE

PARSCALE appends several unique, learnable prefixes to the input, creating multiple parallel input streams processed simultaneously. Outputs from these streams are combined via a dynamic weighted sum computed by a multilayer perceptron. This introduces only about 0.2% additional parameters per stream, a minimal increase compared to traditional scaling methods. Prefix tuning enables each stream to use distinct key-value caches, promoting efficient memory reuse. The method benefits from GPU-friendly parallelization, keeping latency low despite extra computations. Importantly, PARSCALE does not require changes to the core model architecture and can be applied to frozen pretrained models by training only the new prefixes and aggregation parameters.
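To get a feel for the parameter overhead, the back-of-envelope calculation below counts the learnable key and value vectors that prefix tuning adds per layer and prefix position. The hidden size, layer count, and prefix length are assumptions chosen purely for illustration rather than the paper's reported configuration, but they show how a per-stream overhead in the ballpark of the roughly 0.2% figure can arise.

```python
# Illustrative back-of-envelope check (all dimensions below are assumptions for a
# roughly 1.6B-parameter decoder, not the paper's exact configuration).

d_model    = 2048    # hidden size (assumed)
num_layers = 28      # transformer layers (assumed)
prefix_len = 28      # learnable prefix tokens per stream (assumed)
backbone   = 1.6e9   # total backbone parameters

# Prefix tuning stores a learnable key and a learnable value vector
# for every layer and every prefix position.
prefix_params_per_stream = prefix_len * num_layers * 2 * d_model

print(f"extra parameters per stream: {prefix_params_per_stream:,}")
print(f"fraction of backbone: {prefix_params_per_stream / backbone:.3%}")
# extra parameters per stream: 3,211,264
# fraction of backbone: 0.201%
```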

Experimental Results

Extensive experiments on models with 0.5B to 4.4B parameters, with the number of parallel streams (P) ranging from 1 to 8, showed that models with P=8 matched the performance of much larger models while requiring substantially less memory and latency. For example, to reach equivalent performance, a 1.6B-parameter model with PARSCALE incurred roughly 22 times less additional memory and 6 times less additional latency than parameter scaling. On downstream tasks such as GSM8K and MMLU, PARSCALE improved results by up to 34% and 23%, respectively. Coding capabilities also saw significant gains: a 1.6B model with P=8 performed comparably to a 4.4B-parameter model. Furthermore, PARSCALE remained effective during post-training and parameter-efficient fine-tuning while keeping the core parameters frozen.

Impact and Future Directions

PARSCALE offers a fresh perspective on scaling language models by focusing on efficient reuse of computation rather than inflating model size or inference length. This approach tackles memory and time inefficiencies, maintaining or improving performance while enabling scalable deployment in resource-constrained environments. It represents a promising direction for future research and practical applications of advanced language models.
