Zhipu AI Unveils GLM-4.7-Flash for Local Coding
Discover GLM-4.7-Flash, Zhipu AI's innovative model for efficient local coding tasks.
Zhipu AI's Latest Release
GLM-4.7-Flash is a new entrant in the GLM 4.7 family, designed for developers who want robust coding and reasoning capabilities in a model practical for local deployment. Zhipu AI (Z.ai) characterizes GLM-4.7-Flash as a 30B-A3B MoE model and positions it as the strongest option in the 30B class, tuned for a balance of performance and efficiency.
Model Class and Position in GLM 4.7
GLM-4.7-Flash is a text-generation model with 31B total parameters, published in BF16 and F32 tensor types. It carries the architecture tag glm4_moe_lite and supports both English and Chinese, making it well suited to conversational applications. Within the family it sits alongside larger variants such as GLM-4.7 and GLM-4.7-FP8, offering a lightweight alternative for coding and reasoning tasks.
Architecture and Context Length
Utilizing a Mixture of Experts architecture allows GLM-4.7-Flash to store more parameters than it activates per token. This design fosters specialization while keeping effective computation per token akin to that of smaller dense models.
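To make the routing idea concrete, here is a minimal toy sketch of top-k expert routing in PyTorch. The layer sizes, number of experts, and top-k value are illustrative assumptions and do not reflect GLM-4.7-Flash's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy MoE feed-forward layer: all experts are stored in memory,
    but only top_k of them run for each token."""
    def __init__(self, d_model=512, d_ff=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)       # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)   # pick top_k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoELayer()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

The point of the sketch is only that the parameter count grows with the number of experts while per-token compute is bounded by top_k, which is the trade-off the 30B-A3B design exploits.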
GLM-4.7-Flash supports a 128k-token context length and posts strong results on coding benchmarks relative to its peers. The long context window is well suited to large codebases and lengthy technical documents, reducing the aggressive chunking that smaller-context models often require.
Because the model exposes a standard causal language modeling interface and ships with a chat template, integrating it into existing LLM stacks requires minimal adjustments.
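As a minimal sketch of that integration path, the snippet below loads the model with Hugging Face Transformers and generates a reply through its chat template. The repository ID zai-org/GLM-4.7-Flash and the trust_remote_code flag are assumptions and should be checked against the actual model card.

```python
# Minimal sketch: chat-template inference with Hugging Face Transformers.
# The repo ID "zai-org/GLM-4.7-Flash" is an assumption; check the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-4.7-Flash"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # BF16 weights load in bfloat16 on supported GPUs
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```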
Competitive Benchmark Performance
In head-to-head comparisons, Z.ai evaluates GLM-4.7-Flash against Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B. Results show GLM-4.7-Flash either leads or competes strongly across a range of benchmarks in math, reasoning, and coding.
Evaluation Parameters and Thinking Mode
The default configuration for GLM-4.7-Flash uses a temperature of 1.0, top-p of 0.95, and a maximum output of 131,072 tokens. These settings amount to a comparatively open sampling setup with a generous generation budget.
For specific benchmarks, tighter configurations are used: temperature 0.7 with max new tokens of 16384 for Terminal Bench and SWE-bench Verified, and temperature 0 with the same 16384-token cap for τ²-Bench, which gives greater stability for multi-step interactions.
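As a hedged sketch, the reported settings can be expressed as vLLM SamplingParams objects; the values mirror the configurations above, while the preset names are purely illustrative.

```python
# Sketch: the evaluation settings above expressed as vLLM SamplingParams.
# The values mirror the reported configurations; preset names are illustrative.
from vllm import SamplingParams

default_params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=131072)

# Terminal Bench and SWE-bench Verified
swe_params = SamplingParams(temperature=0.7, max_tokens=16384)

# τ²-Bench: greedy decoding for more stable multi-step interactions
tau2_params = SamplingParams(temperature=0.0, max_tokens=16384)
```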
Z.ai also recommends enabling the Preserved Thinking mode during multi-turn agent tasks so that internal reasoning is retained across interactions, which is vital for complex function-call sequences.
Alignment with Developer Workflows
GLM-4.7-Flash combines several features that matter for coding-centric applications:
- A 30B-A3B MoE architecture with 31B parameters and a 128k token context length.
- Strong benchmark results on AIME 25, GPQA, SWE-bench Verified, τ²-Bench, and BrowseComp.
- Documented evaluation settings and Preserved Thinking mode for multi-turn agent tasks.
- First-class support for vLLM, SGLang, and Transformers-based inference (a minimal client sketch follows this list).
- A growing library of fine-tunes and quantizations available in the Hugging Face ecosystem.
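For readers who want to try the serving path mentioned above, here is a minimal sketch of querying a locally hosted OpenAI-compatible endpoint such as one exposed by vLLM or SGLang. The model ID, port, and prompt are assumptions to adapt to your own setup.

```python
# Sketch: querying a local OpenAI-compatible endpoint (for example one started
# with "vllm serve zai-org/GLM-4.7-Flash"); model ID and port are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",
    messages=[{"role": "user", "content": "Summarize what a Mixture of Experts layer does."}],
    temperature=1.0,
    top_p=0.95,
    max_tokens=4096,
)
print(response.choices[0].message.content)
```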
Explore and Engage
For further details, check out the model weights. Follow us on Twitter, join our Machine Learning SubReddit, and subscribe to our Newsletter for more updates.