K2 Think: A 32B Open-Source Reasoning System That Rivals Much Larger Models

What K2 Think is

K2 Think is a 32-billion-parameter open reasoning system released by researchers at MBZUAI’s Institute of Foundation Models and G42. The project bundles a fully open release of weights, data, and code with an inference scaffold that combines long chain-of-thought supervised fine-tuning, reinforcement learning with verifiable rewards, agentic planning, test-time scaling, and hardware-aware optimizations. The team emphasizes parameter efficiency: a compact backbone plus post-training recipes and runtime techniques produce frontier-level math performance and competitive code and science results.

Core design and the six pillars

The system is built by post-training an open-weight Qwen2.5-32B base and adding a lightweight test-time compute scaffold. The core recipe combines six pillars that together raise pass@1 on competition-grade benchmarks while keeping response length and latency manageable: long chain-of-thought supervised fine-tuning, reinforcement learning with verifiable rewards, agentic planning before solving, test-time scaling via best-of-N sampling with verifiers, speculative decoding, and inference on wafer-scale hardware.

Each pillar targets a different aspect of accuracy, efficiency, or deployability and is tuned to complement the small-but-fast design philosophy.

Long CoT SFT

Phase-1 supervised fine-tuning uses curated long chain-of-thought traces and instruction/response pairs spanning math, code, science, instruction following, and general chat (dataset referenced as AM-Thinking-v1-Distilled). This trains the model to externalize intermediate reasoning steps and adopt a consistent structured output format. Rapid gains in pass@1 appear early in training (around half an epoch), with checkpoints showing strong stabilization on major math splits before applying RL.
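As a rough illustration of what "structured output" means here, the sketch below packs one long-CoT example into a chat-style training record with the reasoning trace separated from the final answer. The tag names and field layout are illustrative assumptions, not the released AM-Thinking-v1-Distilled schema.

```python
# Minimal sketch of shaping one long-CoT SFT example into a structured
# chat-style training record. Tag names and fields are assumptions for
# illustration, not the actual dataset format.

def build_sft_example(question: str, reasoning_trace: str, final_answer: str) -> dict:
    """Pack a prompt and a reasoning-then-answer target for supervised fine-tuning."""
    target = (
        "<think>\n" + reasoning_trace.strip() + "\n</think>\n"
        "<answer>\n" + final_answer.strip() + "\n</answer>"
    )
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": target},  # loss is computed on this turn
        ]
    }

example = build_sft_example(
    question="What is the sum of the first 100 positive integers?",
    reasoning_trace="Pair 1 with 100, 2 with 99, ... giving 50 pairs summing to 101, so 50 * 101 = 5050.",
    final_answer="5050",
)
```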

Reinforcement learning with verifiable rewards (RLVR)

K2 Think applies RLVR using the verl library and a GRPO-style policy-gradient algorithm on the Guru dataset, which contains roughly 92k prompts across six domains: Math, Code, Science, Logic, Simulation, and Tabular. The authors note an important trade-off: starting RL from a strong SFT checkpoint yields modest absolute gains and can plateau, whereas applying RL from the base model produces larger relative improvements. Another ablation indicates that reducing the maximum sequence length during multi-stage RL (for example, shifting from 32k to 16k) can harm learned reasoning and fail to recover SFT baseline performance.
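To make the GRPO-style objective concrete, the sketch below computes group-relative advantages from verifiable rewards: each prompt gets a group of sampled completions, a verifier assigns a binary reward, and advantages are the rewards normalized within that group. This mirrors the general GRPO formulation rather than the exact verl configuration used by the authors; the toy string-match verifier is an assumption.

```python
import statistics

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Toy verifier: reward 1.0 if the reference answer appears in the completion, else 0.0."""
    return 1.0 if reference_answer in completion else 0.0

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled completions (GRPO-style)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled solutions for one math prompt, two of which verify as correct.
completions = ["... so the answer is 5050", "... answer 4950", "... 5050", "... 5000"]
rewards = [verifiable_reward(c, "5050") for c in completions]
advantages = group_relative_advantages(rewards)  # positive for correct samples, negative otherwise
print(rewards, advantages)
```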

Agentic planning and test-time scaling

At inference, K2 Think first elicits a compact plan, then samples multiple candidate solutions (best-of-N, typically N=3) and applies verifiers to select the most likely correct answer. This combined scaffold improves final-answer quality while producing shorter outputs than the post-training checkpoint alone. Across benchmarks the approach reduces average token counts (up to about 11.7% on some splits) and keeps final responses comparable to much larger open models, which reduces both latency and cost.
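A minimal sketch of that scaffold is shown below: elicit a short plan, sample N candidate solutions conditioned on it, and let a verifier pick the winner. The `generate` and `verify_score` callables are hypothetical stand-ins for the actual model and verifier calls, which the source does not specify at this level of detail.

```python
# Hedged sketch of plan-before-you-think + best-of-N selection with a verifier.
# `generate` and `verify_score` are hypothetical stand-ins, not the real API.

def solve_with_scaffold(problem: str, generate, verify_score, n: int = 3) -> str:
    # Step 1: elicit a compact plan before full reasoning.
    plan = generate(f"Outline a brief plan for solving:\n{problem}", max_tokens=256)

    # Step 2: best-of-N sampling conditioned on the plan.
    prompt = f"Problem:\n{problem}\n\nPlan:\n{plan}\n\nSolve step by step."
    candidates = [generate(prompt, temperature=1.0) for _ in range(n)]

    # Step 3: the verifier selects the candidate most likely to be correct.
    return max(candidates, key=lambda c: verify_score(problem, c))
```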

Speculative decoding and wafer-scale inference

K2 Think targets inference on the Cerebras Wafer-Scale Engine with speculative decoding to reach practical throughput for production use. The team reports per-request throughput of roughly 2,000 tokens per second on the wafer-scale engine, making the test-time scaffold viable for research and product loops and reinforcing the system's small-but-fast approach.
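In the abstract, speculative decoding works by having a small draft model propose a few tokens that the large target model then verifies in a single pass. The sketch below shows that loop under stated assumptions; `draft_propose` and `target_accepts` are hypothetical helpers, and the Cerebras deployment details are not described at this level in the source.

```python
# Hedged sketch of a speculative-decoding loop: draft k tokens cheaply,
# verify them with the target model, keep the accepted prefix.
# draft_propose / target_accepts are hypothetical callables.

def speculative_decode(prompt_tokens, draft_propose, target_accepts,
                       k: int = 4, max_new_tokens: int = 128):
    """Generate up to max_new_tokens by verifying k-token drafts per step."""
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        draft = draft_propose(tokens, k)          # k cheap tokens from the small draft model
        accepted = target_accepts(tokens, draft)  # verified prefix plus one token from the target
        tokens.extend(accepted)                   # at least one token is accepted per step
    return tokens
```

Because several tokens can be accepted per target-model pass, the large model is invoked far fewer times per generated token, which is where the throughput gain comes from.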

Evaluation protocol

Benchmarks include competition-level math (AIME'24, AIME'25, HMMT'25, Omni-MATH-HARD), code (LiveCodeBench v5 and SciCode), and science knowledge/reasoning (GPQA-Diamond, HLE). The reported standardized setup uses a maximum generation length of 64k tokens, temperature 1.0, top-p 0.95, and a designated stop marker, and each score is averaged over 16 independent pass@1 evaluations to reduce run-to-run variance.
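The sketch below shows one way to implement that averaging: sample one answer per problem per run with the stated decoding settings, repeat 16 times, and report the mean pass@1. The `model_answer` and `is_correct` callables are hypothetical stand-ins for the actual harness.

```python
# Minimal sketch of the reported evaluation protocol (assumed harness interfaces).
SAMPLING = {"max_tokens": 64_000, "temperature": 1.0, "top_p": 0.95}
NUM_RUNS = 16

def avg_pass_at_1(problems, model_answer, is_correct) -> float:
    """Average pass@1 over NUM_RUNS independent sampling runs."""
    per_run = []
    for _ in range(NUM_RUNS):
        correct = sum(is_correct(p, model_answer(p, **SAMPLING)) for p in problems)
        per_run.append(correct / len(problems))
    return sum(per_run) / NUM_RUNS
```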

Results at a glance

Math micro-average across evaluated splits reaches 67.99, placing K2 Think at the top of the open-weight cohort and competitive with much larger models. Individual scores include AIME'24 90.83, AIME'25 81.24, HMMT'25 73.75, and Omni-MATH-HARD 60.73. Code performance on LiveCodeBench v5 is 63.97, exceeding similarly sized peers and some larger open models. On SciCode the system posts 39.2 on subproblems and 12.0 on main problems. Science benchmarks show GPQA-Diamond 71.08 and HLE 9.95, demonstrating broader competence beyond math.

Other key figures: the backbone is Qwen2.5-32B, RL data comes from the Guru dataset (~92k prompts), the inference scaffold combines plan-before-you-think and best-of-N with verifiers, and the target throughput on Cerebras WSE with speculative decoding is approximately 2k tokens/sec. Safety macro score is reported at 0.75 with breakdowns across refusal, conversational robustness, cybersecurity, and jailbreak metrics.

Open release and implications

K2 Think is released openly with weights, training data, deployment code, and test-time optimization tools. The project demonstrates that an integrative approach—post-training, test-time compute scaffolds, and hardware-aware inference—can close a significant portion of the gap to larger proprietary reasoning systems while remaining tractable to fine-tune and serve. Links to the technical report and project pages are provided by the authors for those who want to explore the paper, model card on Hugging Face, GitHub, and tutorials.