Zhipu GLM-4.6 Launches: 200K Context, Token Savings and Open Weights

What’s new in GLM-4.6

Zhipu AI’s GLM-4.6 is a significant update in the GLM family that focuses on agentic workflows, long-context reasoning, and applied coding tasks. The release raises the input context window to 200K tokens and supports up to 128K output tokens, while also delivering lower token consumption on practical multi-turn coding evaluations. The model ships with open weights for local deployment and is available via Z.ai and OpenRouter.

Context window and output limits

GLM-4.6 supports a 200K input context window and a 128K maximum output token limit. That extended context enables longer documents, extended agent trajectories, and multi-file coding interactions that previously required workarounds or external memory.
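As a concrete illustration, here is a minimal sketch of a long-context request through an OpenAI-compatible chat endpoint. The base URL, model id, and environment variable are assumptions for illustration; check Z.ai's API documentation for the exact values.

```python
import os
from openai import OpenAI  # pip install openai

# Assumed OpenAI-compatible base URL and model id -- confirm in Z.ai's API docs.
client = OpenAI(
    base_url="https://api.z.ai/api/paas/v4",
    api_key=os.environ["ZAI_API_KEY"],
)

# A long input (e.g., a multi-file code dump) fits in the 200K-token window.
with open("repo_dump.txt") as f:
    long_document = f.read()

response = client.chat.completions.create(
    model="glm-4.6",
    messages=[
        {"role": "system", "content": "You are a code-review assistant."},
        {"role": "user", "content": f"Review this codebase:\n\n{long_document}"},
    ],
    max_tokens=64_000,  # anywhere up to the 128K output cap
)
print(response.choices[0].message.content)
```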

Real-world coding and token efficiency

On an extended CC-Bench benchmark that uses multi-turn tasks evaluated by humans in isolated Docker environments, GLM-4.6 reaches near parity with Claude Sonnet 4, recording a 48.6% win rate. Zhipu reports that GLM-4.6 uses roughly 15% fewer tokens than GLM-4.5 to complete those tasks. Task prompts and agent trajectories from these evaluations are published for inspection, offering transparency into applied performance.

Zhipu also notes that while GLM-4.6 shows clear gains over GLM-4.5 across a range of public benchmarks, it still trails Sonnet 4.5 on some coding metrics. That caveat is useful when choosing a model for specific coding workloads.

Benchmarks and positioning

The team summarizes clear improvements over the previous GLM-4.5 across eight public benchmarks and reports parity with Claude Sonnet 4 on several of them. The results suggest GLM-4.6 is an incremental but material step, improving practical throughput and token efficiency rather than delivering a radical leap in raw capability.

Availability and ecosystem integration

GLM-4.6 is available through the Z.ai API and via OpenRouter. It integrates with several popular coding agents including Claude Code, Cline, Roo Code, and Kilo Code. Existing users of the GLM Coding Plan can upgrade to GLM-4.6 simply by switching the model name, as in the sketch below.
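Because OpenRouter exposes a standard OpenAI-compatible endpoint, upgrading is typically a one-line model-string change. The sketch below assumes the model id z-ai/glm-4.6; verify the exact id on OpenRouter's model page.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Upgrading from GLM-4.5 means swapping the model string, nothing else.
MODEL = "z-ai/glm-4.6"  # assumed OpenRouter id

reply = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize the GLM-4.6 release notes."}],
)
print(reply.choices[0].message.content)
```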

Open weights and licensing

The Hugging Face model card lists the model under an MIT license and presents a 357B-parameter MoE configuration with BF16/F32 tensors. Note that an MoE model's total parameter count is not the same as the number of parameters active per token, and the model card does not state an active-parameter figure for GLM-4.6.
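A quick back-of-envelope calculation shows why the total-parameter figure still matters for deployment even though only a fraction of experts fire per token. The numbers below are illustrative arithmetic, not published specs.

```python
# Rough BF16 memory footprint for the full 357B-parameter checkpoint.
total_params = 357e9
bytes_per_param_bf16 = 2

weights_gb = total_params * bytes_per_param_bf16 / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~714 GB before KV cache/activations

# MoE routing activates far fewer parameters per token, lowering compute
# per token -- but every expert must stay resident in memory, so VRAM
# requirements track the total count, not the active count.
```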

Weights are hosted on Hugging Face and ModelScope, enabling researchers and developers to download them for local experimentation and deployment.
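A minimal download sketch with huggingface_hub, assuming the repository id zai-org/GLM-4.6 (confirm the canonical id on the model card):

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Assumed repo id -- confirm on the Hugging Face model card.
local_dir = snapshot_download(
    repo_id="zai-org/GLM-4.6",
    local_dir="./glm-4.6",
)
print(f"Weights downloaded to {local_dir}")
```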

Local inference and tooling

Local serving options are documented with vLLM and SGLang, and community quantizations are emerging for workstation-class hardware. The release encourages on-prem and research usage by providing weights and integration examples on public repositories and model hubs.
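For local serving, a minimal vLLM sketch might look like the following. The repo id, parallelism degree, and context length here are assumptions; the full BF16 checkpoint needs a multi-GPU node, so workstation-class setups will want the community quantizations mentioned above.

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Assumed Hugging Face repo id; tensor_parallel_size depends on your node.
llm = LLM(
    model="zai-org/GLM-4.6",
    tensor_parallel_size=8,   # full BF16 weights require many GPUs
    max_model_len=200_000,    # up to the 200K context window, memory permitting
)

params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate(
    ["Explain mixture-of-experts routing in two sentences."], params
)
print(outputs[0].outputs[0].text)
```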

FAQ

  1. What are the context and output token limits?

GLM-4.6 supports a 200K input context and a 128K maximum output token limit.

  2. Are open weights available and under what license?

Yes. The Hugging Face model card lists an MIT license and a 357B-parameter MoE configuration with BF16/F32 tensors.

  3. How does GLM-4.6 compare to GLM-4.5 and Claude Sonnet 4 on applied tasks?

On the extended CC-Bench, GLM-4.6 reportedly uses about 15% fewer tokens than GLM-4.5 and reaches near parity with Claude Sonnet 4, with a 48.6% win rate.

  4. Can I run GLM-4.6 locally?

Yes. Zhipu provides weights on Hugging Face and ModelScope and documents local inference with vLLM and SGLang. Community quantizations are appearing for workstation-class hardware.

For more technical details and downloads, see https://z.ai/blog/glm-4.6 and the Hugging Face model card.