SmallThinker: Breakthrough Efficient LLMs Designed for Local Devices
SmallThinker introduces a family of efficient large language models specifically designed for local device deployment, offering high performance with minimal memory and compute requirements. These models set new standards in on-device AI capabilities across multiple benchmarks and hardware constraints.
Rethinking Language Models for Local Deployment
Traditional large language models (LLMs) are primarily designed for cloud data centers, making them unsuitable for deployment on local devices such as laptops, smartphones, or embedded systems. SmallThinker challenges this norm by being architected from the ground up to work efficiently within local hardware constraints.
Innovative Architecture for Efficiency
SmallThinker utilizes a fine-grained Mixture-of-Experts (MoE) architecture, activating only a subset of experts per token, which drastically reduces memory and computational demands. The two main variants are:
- SmallThinker-4B-A0.6B: 4 billion parameters total, 600 million active per token.
- SmallThinker-21B-A3B: 21 billion parameters total, 3 billion active per token.
This approach maintains high capacity while keeping resource usage low.
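The total-versus-active split follows directly from top-k routing: each token is sent to only a handful of experts, so only that slice of the parameters is touched. Below is a minimal, illustrative sketch of the idea in Python (invented layer sizes and expert counts, not SmallThinker's actual configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Illustrative fine-grained MoE layer: only the top-k experts run per token."""

    def __init__(self, d_model=256, d_ff=512, n_experts=32, k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                 # loop form for clarity, not speed
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

layer = ToyMoELayer()
_ = layer(torch.randn(8, 256))

expert_params = sum(p.numel() for p in layer.experts.parameters())
active_per_token = expert_params * layer.k // len(layer.experts)
print(f"expert params: {expert_params:,}  active per token: {active_per_token:,}")
```

Scaled up, the same routing logic is what lets a 21B-parameter model touch only about 3B parameters per generated token.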
Additional architectural innovations include:
- ReGLU-Based Feed-Forward Sparsity: within an activated expert, more than 60% of neurons stay inactive on average, further cutting computation.
- NoPE-RoPE Hybrid Attention: alternates global no-positional-embedding (NoPE) layers with local RoPE sliding-window layers, supporting long contexts (up to 32K tokens for the 4B model and 16K for the 21B model) while keeping the KV cache small.
- Pre-Attention Router and Intelligent Offloading: the router selects experts before the attention block runs, so their parameters can be prefetched from SSD/flash storage while attention computes; hot experts are cached in RAM, hiding I/O latency and improving throughput (sketched below).
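The last bullet is essentially a predict-then-prefetch pipeline: because routing happens before attention, expert weights can be pulled from flash while attention is still computing, and frequently used ("hot") experts stay resident in RAM. A rough sketch of the caching side, with invented names and a plain LRU policy standing in for whatever heuristic the authors actually use:

```python
from collections import OrderedDict

class ExpertCache:
    """Toy hot-expert cache: recently used expert weights stay in RAM; a miss
    falls back to a (simulated) flash read. Plain LRU eviction is a stand-in
    for SmallThinker's real policy."""

    def __init__(self, capacity, load_from_flash):
        self.capacity = capacity
        self.load_from_flash = load_from_flash   # callable: expert_id -> weights
        self.cache = OrderedDict()
        self.flash_reads = 0

    def get(self, expert_id):
        if expert_id in self.cache:              # hit: serve from RAM
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        self.flash_reads += 1                    # miss: slow SSD/flash path
        weights = self.load_from_flash(expert_id)
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)       # evict the coldest expert
        return weights

# Skewed routing (a few hot experts dominate) makes a small RAM cache effective.
cache = ExpertCache(capacity=8, load_from_flash=lambda e: f"weights[{e}]")
routing_trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 4] * 100
for e in routing_trace:
    cache.get(e)
print(f"{cache.flash_reads} flash reads for {len(routing_trace)} expert activations")
```

In a real runtime the flash reads would be issued asynchronously so they overlap with the attention computation, which is what hides the I/O latency.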
Training Methodology
SmallThinker models were trained from scratch on extensive datasets:
- 4B model trained on 2.5 trillion tokens.
- 21B model trained on 7.2 trillion tokens.
Training data includes curated open-source content, synthetic math and coding data, and supervised instruction-following corpora, focusing on STEM, mathematical reasoning, and coding capabilities.
Benchmark Performance
Academic Tasks: SmallThinker-21B-A3B matches or outperforms similarly sized models on math, coding, and general-knowledge benchmarks such as MMLU, MATH-500, GPQA-Diamond, and HumanEval.
Real Hardware Performance: the 4B variant runs with as little as 1 GiB of RAM and the 21B variant with only 8 GiB, while sustaining fast inference (on the order of 20 tokens/sec on ordinary CPUs) and outperforming comparable models under the same tight memory constraints.
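A back-of-envelope calculation shows why offloading is essential rather than optional at these budgets (4-bit weights are an assumption here; the deployed precision may differ):

```python
GIB = 2**30

def weight_gib(params, bits_per_weight=4):
    """Weight-only footprint; ignores KV cache, activations, and runtime overhead."""
    return params * bits_per_weight / 8 / GIB

print(f"21B total weights   : ~{weight_gib(21e9):.1f} GiB  (does not fit an 8 GiB budget)")
print(f"3B active per token : ~{weight_gib(3e9):.1f} GiB  (the working set that must be fast)")
```

Only the much smaller active working set has to sit in fast memory; the remaining cold experts can live on flash and be prefetched as described above.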
Sparsity and Expert Specialization
Activation data shows that most experts are used sparsely, with a small set specialized for particular domains or languages, enabling efficient caching. Neuron-level sparsity remains high throughout layers, contributing to reduced compute requirements.
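This kind of neuron-level sparsity is straightforward to measure: for a ReLU/ReGLU-gated feed-forward block, count the gate activations that are exactly zero. The toy measurement below uses a randomly initialized block purely to show the mechanics; the 60%+ figures reported above come from the trained SmallThinker models, not from this sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff, tokens = 256, 1024, 512

gate = nn.Linear(d_model, d_ff)   # ReGLU gate branch: ReLU here zeroes whole neurons
up   = nn.Linear(d_model, d_ff)   # value branch
down = nn.Linear(d_ff, d_model)

x = torch.randn(tokens, d_model)
g = torch.relu(gate(x))
hidden = g * up(x)                # ReGLU(x) = ReLU(x W_gate) * (x W_up)
y = down(hidden)

sparsity = (g == 0).float().mean().item()
print(f"inactive FFN neurons: {sparsity:.1%}")   # ~50% at random init; higher in trained models
```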
Limitations and Future Directions
Current limitations include a smaller training corpus than leading cloud models, the absence of reinforcement learning from human feedback (RLHF), and a primary focus on English, Chinese, and STEM-oriented content. Future plans involve expanding the datasets and incorporating RLHF to improve model alignment and safety.
Availability
SmallThinker-4B-A0.6B-Instruct and SmallThinker-21B-A3B-Instruct are open-source and available for researchers and developers, demonstrating a new paradigm where model design is tailored for local deployment rather than cloud scale.
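For local experimentation, the released checkpoints should load through the standard Hugging Face transformers API; the repository id below is an assumption based on the model names and should be checked against the official release.

```python
# Hypothetical repository id -- verify against the official SmallThinker release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PowerInfer/SmallThinker-4B-A0.6B-Instruct"  # assumed id, not confirmed

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Explain mixture-of-experts in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```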
For further information, the research paper, tutorials, and community resources are accessible online.