IBM Granite 4.0: Hybrid Mamba-2/Transformer Models Slash Memory Use, Keep Performance

Granite 4.0 marks IBM’s move away from monolithic Transformer stacks toward a hybrid Mamba-2/Transformer architecture designed to dramatically cut serving memory while preserving model quality.

Hybrid architecture and design

Granite 4.0 interleaves Mamba-2 state-space layers with a small fraction of conventional self-attention blocks, at a ratio of roughly nine Mamba-2 blocks for every attention block. The hybrid approach is intended to keep the expressivity of attention where it matters while using Mamba-2 layers, whose recurrent state stays fixed in size regardless of sequence length, to handle long-range context more memory-efficiently. IBM reports memory reductions of more than 70% for long-context and multi-session inference compared with conventional Transformer-only LLMs, which can translate into significantly lower GPU cost at fixed throughput and latency targets.
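
As a rough illustration of that layout, the sketch below builds a layer schedule with roughly nine Mamba-2-style blocks per attention block. The function name, layer count, and block labels are placeholders for this article, not IBM's implementation.

```python
# Illustrative 9:1 Mamba-2 / attention layer schedule (placeholder, not IBM's code).
def build_hybrid_schedule(n_layers: int, mamba_per_attention: int = 9) -> list:
    """Return a list of layer types, interleaving sparse attention among Mamba-2 blocks."""
    schedule = []
    for i in range(n_layers):
        # Every (mamba_per_attention + 1)-th layer is full self-attention;
        # the rest are Mamba-2 state-space layers with a fixed-size recurrent state.
        if (i + 1) % (mamba_per_attention + 1) == 0:
            schedule.append("attention")
        else:
            schedule.append("mamba2")
    return schedule

if __name__ == "__main__":
    layers = build_hybrid_schedule(40)
    print("attention layers:", layers.count("attention"), "of", len(layers))
```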

Released variants and sizes

IBM ships both Base and Instruct variants across four initial models: Granite-4.0-H-Small (a hybrid mixture-of-experts model with 32B total and roughly 9B active parameters), Granite-4.0-H-Tiny (a hybrid MoE with 7B total and about 1B active parameters), Granite-4.0-H-Micro (a 3B dense hybrid), and Granite-4.0-Micro (a 3B dense conventional Transformer for runtimes that do not yet support the hybrid architecture).

All models are released under the Apache-2.0 license and are cryptographically signed; according to IBM, Granite is the first open model family covered by an accredited ISO/IEC 42001:2023 AI management system certification. Reasoning-optimized "Thinking" variants are planned for later in 2025.

Training, context length, and weight formats

Granite 4.0 was trained on samples of up to 512k tokens and evaluated on contexts of up to 128k tokens. Public checkpoints on Hugging Face are provided in BF16, and IBM also publishes quantized and GGUF conversions for easier local use. FP8 execution is available as an option on supported hardware, but the weights themselves are not released in FP8.
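
As a minimal loading sketch, the snippet below pulls a BF16 checkpoint with Hugging Face transformers and runs a short generation. The model id is an assumption based on IBM's Hugging Face naming, and the hybrid architecture may require a recent transformers release; adjust both to the checkpoint you actually evaluate.

```python
# Minimal sketch: load a Granite 4.0 checkpoint in BF16 and generate a short reply.
# The model id below is an assumption; check the Hugging Face hub for the exact name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-micro"  # assumed id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # public checkpoints are published in BF16
    device_map="auto",
)

prompt = "Summarize the benefits of hybrid state-space/attention models in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```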

Performance signals and benchmarks

IBM highlights instruction-following and tool-calling evaluations, such as IFEval and the Berkeley Function Calling Leaderboard (BFCLv3), as signals of enterprise-relevant behavior.

More details and score summaries are available on IBM’s announcement page: https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models

Availability and ecosystem support

Granite 4.0 is available via IBM watsonx.ai and distributed across multiple platforms, including Docker Hub, Hugging Face, LM Studio, NVIDIA NIM, Ollama, Replicate, Dell Pro AI Studio, Dell Enterprise Hub, Kaggle, and others. IBM notes ongoing enablement work for vLLM, llama.cpp, NexaML, and MLX to support hybrid serving patterns.
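
For teams that want to try a checkpoint in an existing serving stack, the sketch below shows offline batch inference with vLLM's Python API. The model id is assumed from IBM's Hugging Face naming, and the hybrid architecture may require a recent vLLM release.

```python
# Sketch of offline batch inference with vLLM (model id assumed, not verified here).
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-4.0-h-micro")  # assumed Hugging Face id
params = SamplingParams(temperature=0.2, max_tokens=128)

prompts = [
    "List three enterprise use cases for long-context language models.",
    "Explain what a state-space layer stores between tokens.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```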

Checkpoint and deployment options make it practical to evaluate and run these models locally or in enterprise stacks. BF16 checkpoints and GGUF conversions simplify local pipelines, while signed artifacts and ISO/IEC 42001 coverage address provenance and compliance needs that are often important for production deployments.
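
For a purely local pipeline built on the GGUF conversions, a llama.cpp binding such as llama-cpp-python can load a quantized file directly; the file name below is a placeholder for whichever GGUF conversion you download.

```python
# Sketch of local inference on a quantized GGUF file via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="granite-4.0-h-micro-Q4_K_M.gguf",  # placeholder filename
    n_ctx=8192,  # context window to allocate
)

result = llm("Q: What license covers Granite 4.0?\nA:", max_tokens=64)
print(result["choices"][0]["text"])
```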

Practical implications

The combination of hybrid Mamba-2/Transformer layers and, in the larger variants, mixture-of-experts routing that activates only a fraction of parameters per token appears to be a practical path to lower total cost of ownership. Memory reductions of more than 70% for long-context workloads, along with strong instruction-following and tool-use behavior, can allow smaller GPU fleets to hit the same throughput targets. For enterprises, the auditable nature of the releases and the broad distribution channels reduce friction for adoption and experimentation.
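
To make the memory claim concrete, the back-of-the-envelope calculation below compares the per-sequence KV cache of an all-attention stack against a hybrid that keeps one attention layer in ten. The model shape is illustrative, not Granite's actual configuration, and the small fixed-size Mamba-2 state is ignored.

```python
# Back-of-the-envelope KV-cache comparison (illustrative shape, not Granite's config).
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values: 2 tensors per attention layer, each [seq_len, n_kv_heads, head_dim].
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 128_000                          # one long-context session
layers, kv_heads, head_dim = 40, 8, 128    # placeholder model shape

dense = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)         # every layer is attention
hybrid = kv_cache_bytes(layers // 10, kv_heads, head_dim, seq_len)  # ~1 in 10 layers is attention

print(f"all-attention KV cache: {dense / 1e9:.1f} GB")
print(f"hybrid KV cache       : {hybrid / 1e9:.1f} GB")
print(f"reduction             : {1 - hybrid / dense:.0%}")
```

The intuition is that attention layers' KV cache grows linearly with context length and with the number of concurrent sessions, while Mamba-2 layers carry only a constant-size state, which is where the reported savings for long-context and multi-session serving come from.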

For full technical details, model cards and downloads, see the Hugging Face model page and IBM’s technical announcement. Additional resources and community links are available through IBM’s GitHub and social channels.