Kimi K2 Thinking: Moonshot AI's 1T-Parameter 'Thinking' Agent That Runs 200–300 Tool Calls
Moonshot AI has published Kimi K2 Thinking, a 1T-parameter Mixture-of-Experts thinking agent with a 256K context window and native INT4 quantization, capable of performing hundreds of sequential tool calls on long-horizon tasks.
What Kimi K2 Thinking is
Moonshot AI has published Kimi K2 Thinking, an open-weights thinking agent built on the Kimi K2 Mixture of Experts (MoE) design. The model is engineered to interleave chain-of-thought reasoning with dynamic tool calls during inference, enabling it to read, reason, call a tool, and continue reasoning for hundreds of sequential steps without human intervention. The project page is available at https://moonshotai.github.io/Kimi-K2/thinking.html.
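The read/reason/call-a-tool cycle described above can be sketched as a simple agent loop. Everything below is a hypothetical stand-in, not Moonshot's actual API: the scripted `fake_model`, the `calculator` tool, and the step format are illustrative only.

```python
# Minimal sketch of the reason-then-act loop a thinking agent runs:
# the model emits thoughts and tool calls; tool results are appended to the
# history and fed back until the model produces a final answer.

def calculator(expr: str) -> str:
    """Toy tool: evaluate a simple arithmetic expression."""
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def fake_model(history):
    """Scripted stand-in for the LLM: thought, then a tool call, then an answer."""
    step = len(history)
    if step == 0:
        return {"type": "thought", "text": "I should compute 17 * 23."}
    if step == 1:
        return {"type": "tool_call", "name": "calculator", "args": "17 * 23"}
    return {"type": "answer", "text": history[-1]["result"]}

def run_agent(model, tools, max_steps=300):
    """Interleave reasoning and tool calls until an answer or the step cap."""
    history = []
    for _ in range(max_steps):
        action = model(history)
        if action["type"] == "answer":
            return action["text"], history
        if action["type"] == "tool_call":
            result = tools[action["name"]](action["args"])
            history.append({**action, "result": result})
        else:
            history.append(action)
    return None, history  # step cap reached without an answer

answer, trace = run_agent(fake_model, TOOLS)
print(answer)  # "391"
```

The `max_steps=300` cap mirrors the upper end of the 200–300 sequential tool calls Moonshot reports; in the real system each iteration would be a full model forward pass rather than a scripted function.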
Architecture and key specs
K2 Thinking inherits the Kimi K2 MoE architecture and is described as a 1 trillion parameter system with 32 billion activated parameters per token. Its architecture includes 61 layers (including one dense layer), 384 experts with 8 experts selected per token plus one shared expert, 64 attention heads, an attention hidden dimension of 7168, and an MoE hidden dimension of 2048 per expert. The model uses a vocabulary of 160K tokens and a 256K-token context window. Attention is implemented as Multi-head Latent Attention (MLA) and the activation function is SwiGLU.
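The routing numbers in the spec (384 routed experts, 8 selected per token, plus one always-on shared expert) can be illustrated with a small top-k routing sketch. The Gaussian router logits and the softmax normalization here are illustrative assumptions, not the actual K2 routing implementation.

```python
# Sketch of top-k MoE routing per the published spec: each token activates
# the top-8 of 384 routed experts (softmax-weighted) plus one shared expert.
import math
import random

NUM_EXPERTS, TOP_K = 384, 8

def route_token(logits, top_k=TOP_K):
    """Return (expert_id, weight) pairs for the top-k experts, softmax-normalized."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    m = max(logits[i] for i in top)                      # subtract max for stability
    exp = [math.exp(logits[i] - m) for i in top]
    total = sum(exp)
    return [(i, w / total) for i, w in zip(top, exp)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # stand-in router output
routing = route_token(logits)
active = [i for i, _ in routing] + ["shared"]            # shared expert always fires
assert len(routing) == 8
assert abs(sum(w for _, w in routing) - 1.0) < 1e-9
```

Only the 8 selected experts (plus the shared expert) run their feed-forward blocks for a given token, which is how a 1T-parameter model activates only about 32B parameters per token.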
Test-time scaling and long-horizon reasoning
A central design goal for K2 Thinking is test-time scaling: the model is trained to expand reasoning depth and tool-call length on harder problems instead of relying on a fixed, short chain of thought. Moonshot reports that K2 Thinking maintains coherent behavior across roughly 200 to 300 sequential tool calls, and that under these large token budgets it sets new results on Humanity's Last Exam and BrowseComp.
Reported benchmark protocols include token budgets such as 96K thinking tokens for several tasks (HLE, AIME25, HMMT25, GPQA), 128K for IMO AnswerBench and other hard tasks, and 32K completion tokens for Longform Writing. Step caps vary by evaluation: for HLE the max step limit is 120 with a 48K reasoning budget per step, while some agentic search tasks allow up to 300 steps with a 24K reasoning budget per step.
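The budgets above can be collected into one structure for reference. The field names and the budget-check helper are my own; the numbers come from the reported protocols, and tasks without a reported cap are treated as unlimited.

```python
# Reported per-task evaluation budgets for K2 Thinking, as one lookup table.
# Missing fields mean no cap was reported for that task.
BUDGETS = {
    "HLE":              {"thinking_tokens": 96_000, "max_steps": 120,
                         "per_step_budget": 48_000},
    "AIME25":           {"thinking_tokens": 96_000},
    "HMMT25":           {"thinking_tokens": 96_000},
    "GPQA":             {"thinking_tokens": 96_000},
    "IMO-AnswerBench":  {"thinking_tokens": 128_000},
    "Longform Writing": {"completion_tokens": 32_000},
    "agentic_search":   {"max_steps": 300, "per_step_budget": 24_000},
}

def within_budget(task, used_tokens, step):
    """Check whether a run is still inside the reported caps for a task."""
    b = BUDGETS[task]
    if used_tokens > b.get("thinking_tokens", float("inf")):
        return False
    return step <= b.get("max_steps", float("inf"))

print(within_budget("HLE", 90_000, 100))  # True
```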
Benchmarks: reasoning, agentic search, and coding
K2 Thinking posts competitive-to-leading scores across many benchmarks. Examples reported by Moonshot include:
- Humanity's Last Exam (no tools): 23.9; with tools: 44.9; heavy setting: 51.0.
- AIME25 with Python: 99.1; HMMT25 with Python: 95.1.
- IMO AnswerBench: 78.6; GPQA: 84.5.
In agentic search and browsing tasks with tools, reported scores include 60.2 on BrowseComp, 62.3 on BrowseComp-ZH, 56.3 on Seal-0, 47.4 on FinSearchComp-T3, and 87.0 on Frames. On general knowledge and other benchmarks the model reports 84.6 on MMLU-Pro, 94.4 on MMLU-Redux, 73.8 on Longform Writing, and 58.0 on HealthBench.
For coding, reported results include 71.3 on SWE-bench Verified with tools, 61.1 on SWE-bench Multilingual with tools, 41.9 on Multi-SWE-bench with tools, 44.8 on SciCode, 83.1 on LiveCodeBench v6, 48.7 on OJ-Bench (C++), and 47.1 on Terminal-Bench with simulated tools.
Moonshot also describes a Heavy Mode that runs eight trajectories in parallel and aggregates results to improve accuracy in some reasoning evaluations.
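The Heavy Mode idea of running trajectories in parallel and aggregating can be sketched as below. Moonshot does not publish its aggregation rule, so a simple majority vote over final answers is shown here as one plausible scheme; `toy_solve` is a hypothetical solver used only to demonstrate the vote.

```python
# Sketch of Heavy-Mode-style aggregation: run n independent trajectories
# concurrently and return the most common final answer (majority vote).
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def heavy_mode(solve, prompt, n_trajectories=8):
    """Run n trajectories in parallel and majority-vote their final answers."""
    with ThreadPoolExecutor(max_workers=n_trajectories) as pool:
        answers = list(pool.map(lambda i: solve(prompt, seed=i),
                                range(n_trajectories)))
    return Counter(answers).most_common(1)[0][0]

# Toy solver: usually right, occasionally wrong, to show the vote smoothing
# out individual-trajectory errors (seeds 0 and 4 answer incorrectly).
def toy_solve(prompt, seed):
    return "42" if seed % 4 else "41"

print(heavy_mode(toy_solve, "question"))  # "42" (6 of 8 votes)
```

In the real system each trajectory would be a full reasoning-plus-tool-use rollout, so the aggregation trades roughly 8x compute for higher accuracy on hard reasoning evaluations.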
Native INT4 quantization and deployment
K2 Thinking is released as a native INT4 model. Moonshot applies Quantization-Aware Training during post-training stages and uses INT4 weight-only quantization on the MoE components. This enables INT4 inference with roughly a 2x generation-speed improvement in low-latency mode while preserving benchmark performance; all reported benchmark scores were measured at INT4 precision.
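The core of INT4 weight-only quantization can be shown with a round-trip on a single weight row. The symmetric per-row scaling into the signed range [-8, 7] below is a generic scheme for illustration; Moonshot's exact group sizes and calibration recipe are not published here.

```python
# Sketch of symmetric INT4 weight-only quantization: scale a weight row so
# its largest magnitude maps to 7, round to integers in [-8, 7], and
# dequantize at compute time by multiplying the scale back in.

def quantize_int4(row):
    """Quantize one weight row to INT4 with a single symmetric scale."""
    scale = max(abs(w) for w in row) / 7 or 1.0   # avoid zero scale on all-zero rows
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT4 codes."""
    return [v * scale for v in q]

weights = [0.7, -0.35, 0.07, -0.01]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
assert all(-8 <= v <= 7 for v in q)
# Round-trip error is bounded by half the quantization step.
assert max(abs(a - b) for a, b in zip(weights, restored)) <= scale / 2
```

Because only the integer codes and one scale per row are stored, the MoE weights shrink to roughly a quarter of their FP16 size, which is what makes the reported low-latency speedups possible.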
Checkpoints are saved in the compressed-tensors format and can be unpacked to higher-precision formats such as FP8 or BF16 via the official compressed-tensors tools. Recommended inference engines include vLLM, SGLang, and KTransformers. K2 Thinking is already available in chat mode at kimi.com and via the Moonshot platform API; a dedicated agentic mode is planned to expose the full tool-orchestration behavior.
What this means for open-source reasoning agents
Kimi K2 Thinking demonstrates that open-weights reasoning agents with very large context windows and long-horizon tool use are becoming practical. The combination of a trillion-parameter MoE, 256K context length, native INT4 training, and explicit test-time scaling points toward production-capable agentic systems that can sustain hundreds of sequential tool calls without human intervention.