UNC Researchers Unveil TACQ: Maintaining LLM Accuracy at 2-Bit Precision Through Task-Aware Quantization
Researchers at UNC Chapel Hill introduced TACQ, a task-aware quantization method that preserves critical weight circuits, allowing large language models to maintain high accuracy even at ultra-low 2-bit precision compression.
Challenges of Large Language Models (LLMs) in Deployment
Large Language Models (LLMs) demonstrate remarkable capabilities but are limited by high computational and memory demands. These limitations particularly affect scenarios requiring local deployment for privacy, such as handling sensitive patient data, or environments with restricted computing resources like real-time customer service systems and edge devices.
Post-Training Quantization and Its Limitations
Post-training quantization (PTQ) offers a promising avenue to compress pre-trained models, reducing memory use by 2-4 times. However, PTQ struggles below 4-bit precision, as performance drops sharply at 2- or 3-bit levels. Most existing PTQ methods calibrate on small mini-batches of general pre-training data to adjust for the activation changes caused by quantization.
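As a rough illustration of where those memory savings come from, the back-of-envelope sketch below estimates the weight footprint of a model at different bit-widths. The 8B parameter count is an assumed example, and quantization metadata (scales, zero-points) as well as activation and KV-cache memory are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed to store the weights alone, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

params = 8e9  # assumed ~8B-parameter model, for illustration only
for bits in (16, 8, 4, 3, 2):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(params, bits):6.2f} GB")

# Going from 16-bit to 8-bit halves the weight footprint and 16-bit to 4-bit
# quarters it, which is where the commonly cited 2-4x savings come from;
# real deployments pay extra for scales, zero-points, and activations.
```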
Existing Quantization Techniques
LLM compression methods include:
- Uniform Quantization: Compresses 16-bit floating-point weights by mapping each channel (row) independently to integers using that channel's minimum and maximum values (see the sketch after this list).
- GPTQ-based Quantization: Focuses on minimizing reconstruction loss through layer-wise reconstruction.
- Mixed-Precision Quantization: Assigns varying bit-widths depending on weight importance, preserving critical high-sensitivity weights at higher precision.
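For intuition, here is a minimal PyTorch sketch of the channel-wise min-max scheme described in the first bullet. The function name `uniform_quantize` and the specific choices (asymmetric rounding, no weight grouping, immediate dequantization) are illustrative assumptions, not any particular library's implementation.

```python
import torch

def uniform_quantize(weight: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Channel-wise (per-row) min-max uniform quantization, then dequantization."""
    qmax = 2 ** bits - 1
    w_min = weight.min(dim=1, keepdim=True).values      # per-row minimum
    w_max = weight.max(dim=1, keepdim=True).values      # per-row maximum
    scale = (w_max - w_min).clamp(min=1e-8) / qmax      # quantization step per row
    q = torch.clamp(torch.round((weight - w_min) / scale), 0, qmax)
    return q * scale + w_min                            # reconstructed (dequantized) weights

# Example: quantize a random weight matrix to 2 bits and inspect the error.
w = torch.randn(1024, 1024)
w_q = uniform_quantize(w, bits=2)
print("mean absolute reconstruction error:", (w - w_q).abs().mean().item())
```

GPTQ-style and mixed-precision methods build on this same primitive but choose rounding and bit-widths to minimize downstream error rather than applying a single uniform grid.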
Introduction to TACQ: Task-Circuit Quantization
Researchers at UNC Chapel Hill introduced TACQ, a novel mixed-precision post-training quantization method. TACQ is inspired by automated circuit discovery and conditions the quantization on specific weight circuits linked to downstream task performance. By comparing unquantized weights with uniformly quantized ones, TACQ estimates expected changes and uses gradient information to predict task impact, enabling preservation of task-critical weights.
TACQ consistently outperforms existing baselines while using less calibration data and weight budgets, especially excelling at 2-bit and 3-bit precision levels.
How TACQ Works: Saliency Metric Components
TACQ defines a saliency metric identifying critical weights that must be preserved:
- Quantization-aware Localization (QAL): Estimates how task performance would change given the expected shift each weight undergoes during quantization.
- Magnitude-sharpened Gradient (MSG): A generalized metric of absolute weight importance adapted from input attribution, which stabilizes TACQ and corrects biases introduced by QAL.
This combined saliency metric can be evaluated efficiently in a single backward pass, after which the top-scoring fraction of weights is kept at 16-bit precision while the rest are quantized to the low-bit budget, as sketched below.
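To make this concrete, below is a rough, illustrative sketch of how such a saliency score and mixed-precision mask could be computed. The way the QAL and MSG terms are combined, the `keep_frac` value, and the helper names are assumptions for illustration (the exact formulation is in the paper); `uniform_quantize` refers to the sketch earlier in this article, and `grad` stands for the gradient from a single backward pass over a small task-specific calibration batch.

```python
import torch

def tacq_style_mask(weight: torch.Tensor,
                    grad: torch.Tensor,
                    weight_quantized: torch.Tensor,
                    keep_frac: float = 0.003):
    """Illustrative TACQ-style saliency score and 16-bit keep mask (not the official code)."""
    # QAL: gradient times the expected weight change caused by quantization.
    qal = (grad * (weight_quantized - weight)).abs()
    # MSG: magnitude-sharpened gradient, |w| * |dL/dw|.
    msg = weight.abs() * grad.abs()
    saliency = qal + msg  # combined score; this combination is an assumption here

    # Keep the top-scoring fraction of weights at 16-bit precision.
    k = max(1, int(keep_frac * weight.numel()))
    threshold = saliency.flatten().topk(k).values.min()
    keep_mask = saliency >= threshold

    # Salient weights stay at full precision; everything else uses the low-bit value.
    mixed = torch.where(keep_mask, weight, weight_quantized)
    return mixed, keep_mask

# Toy usage with synthetic tensors standing in for a layer's weights and gradient.
w = torch.randn(256, 256)
g = torch.randn(256, 256) * 1e-3           # in practice: one backward pass on task data
w_q = uniform_quantize(w, bits=2)          # channel-wise baseline from the earlier sketch
mixed_w, mask = tacq_style_mask(w, g, w_q, keep_frac=0.003)
print(f"kept at 16-bit: {mask.float().mean().item():.4%} of weights")
```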
Performance Benefits of TACQ
At 2-bit precision, TACQ achieves significant improvements over SliM-LLM: +16.0% accuracy on GSM8k, +14.1% on MMLU, and +21.9% on Spider. Other methods such as GPTQ, SqueezeLLM, and SPQR drop to near-random performance at this compression level.
At 3-bit precision, TACQ preserves approximately 91%, 96%, and 89% of unquantized accuracy on GSM8k, MMLU, and Spider respectively, outperforming SliM-LLM by 1-2%.
TACQ is the only method that retains non-negligible performance on generation tasks at these bit-widths, most notably the Spider text-to-SQL task, which requires producing long sequences of correct output tokens.
Significance and Applications
TACQ represents a major advancement in task-aware PTQ, enabling LLMs to maintain high accuracy at ultra-low bit widths where previous methods fail. By selectively preserving a small fraction of salient weights, it aligns with automatic circuit discovery research highlighting sparse weight circuits' influence on tasks.
This approach is particularly suited for generation and program-prediction tasks, as well as agent-based systems that generate multiple executable outputs, combining efficiency with strong performance.
For more details, see the original paper and GitHub repository.