IBM Granite 4.0: Hybrid Mamba-2/Transformer Models Slash Memory Use, Keep Performance
Granite 4.0 marks IBM’s move away from monolithic Transformer stacks toward a hybrid Mamba-2/Transformer architecture designed to dramatically cut serving memory while preserving model quality.
Hybrid architecture and design
Granite 4.0 interleaves conventional self-attention blocks with Mamba-2 state-space layers at roughly a 9:1 ratio, i.e. about nine Mamba-2 layers for every attention block. The hybrid approach is intended to keep the expressivity of attention where it matters while using Mamba-2 layers to handle long-range state more memory-efficiently. IBM reports memory reductions greater than 70% for long-context and multi-session inference compared with conventional Transformer-only LLMs, which can translate into significantly lower GPU cost at fixed throughput and latency targets.
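To make the interleaving concrete, here is a minimal sketch of a 9:1 layer schedule. Only the ratio comes from IBM's announcement; the exact block ordering inside Granite 4.0 is an assumption for illustration.

```python
# Minimal sketch of a 9:1 hybrid layer schedule (illustrative only).
# IBM states the overall Mamba-2:attention ratio; the exact block
# ordering inside Granite 4.0 is an assumption here.

def hybrid_schedule(n_blocks: int, mamba_per_attention: int = 9) -> list[str]:
    """Return a layer-type schedule with one attention block per
    (mamba_per_attention + 1) layers."""
    group = ["mamba2"] * mamba_per_attention + ["attention"]
    repeats = -(-n_blocks // len(group))  # ceiling division
    return (group * repeats)[:n_blocks]

if __name__ == "__main__":
    sched = hybrid_schedule(40)
    print(sched.count("attention"), "of", len(sched), "blocks carry a KV cache")
    # -> 4 of 40 blocks carry a KV cache; the rest are Mamba-2 layers
```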
Released variants and sizes
IBM ships both Base and Instruct variants across four initial models:
- Granite-4.0-H-Small: 32B total parameters, ~9B active (hybrid MoE).
- Granite-4.0-H-Tiny: 7B total parameters, ~1B active (hybrid MoE).
- Granite-4.0-H-Micro: 3B parameters (hybrid dense).
- Granite-4.0-Micro: 3B parameters (dense Transformer for stacks that do not yet support hybrid architectures).
All models are released under the Apache 2.0 license and are cryptographically signed. According to IBM, Granite is the first open model family covered by an accredited ISO/IEC 42001:2023 AI management system certification. Reasoning-optimized 'Thinking' variants are planned for later in 2025.
Training, context length, and weight formats
Granite 4.0 was trained on data samples of up to 512K tokens and evaluated on contexts of up to 128K tokens. Public checkpoints on Hugging Face are provided in BF16; IBM also publishes quantized and GGUF conversions for easier local use. FP8 is available as an execution option on supported hardware, but it is not the format in which the weights are released.
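As a hedged sketch of local evaluation, the snippet below loads a BF16 checkpoint with Hugging Face transformers. The repo id follows IBM's published naming but should be verified on the Hub, and a transformers build with Granite 4.0 hybrid support is required.

```python
# Hedged sketch: loading a Granite 4.0 checkpoint in BF16 with Hugging Face
# transformers. The repo id is assumed from IBM's naming scheme; verify it
# on the Hub. Needs a transformers release with Granite 4.0 hybrid support.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-micro"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoints are published in BF16
    device_map="auto",
)

inputs = tokenizer(
    "Summarize the Granite 4.0 release in one sentence.",
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```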
Performance signals and benchmarks
IBM highlights instruction following and tool-use evaluations for enterprise-relevant behavior:
- IFEval (HELM): Granite-4.0-H-Small ranks ahead of most open-weights models on instruction following, trailing only much larger models such as Llama 4 Maverick.
- BFCLv3 (function calling): H-Small delivers competitive function-calling performance at a lower price point than larger open and closed models (a hedged usage sketch follows this list).
- MTRAG (multi-turn RAG): Granite improves reliability on complex retrieval-augmented generation workflows.
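As referenced above, the following is a hedged sketch of BFCL-style function calling with a Granite 4.0 Instruct model, using transformers' tool-aware chat templating. The tool, its schema, and the repo id are illustrative assumptions, not taken from IBM's materials.

```python
# Hedged sketch of function calling with a Granite 4.0 Instruct model via
# transformers' tool-aware chat templates. Tool and repo id are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-micro"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    ...  # placeholder: only the signature/docstring feed the tool schema

messages = [{"role": "user", "content": "What's the weather in Austin?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],  # schema is derived from the signature and docstring
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(prompt, max_new_tokens=128)
# The model is expected to emit a structured tool call for get_weather.
print(tokenizer.decode(out[0][prompt.shape[-1]:], skip_special_tokens=True))
```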
More details and score summaries are available on IBM’s announcement page: https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models
Availability and ecosystem support
Granite 4.0 is available via IBM watsonx.ai and distributed across multiple platforms, including Docker Hub, Hugging Face, LM Studio, NVIDIA NIM, Ollama, Replicate, Dell Pro AI Studio/Enterprise Hub, Kaggle, and others. IBM notes ongoing enablement work for vLLM, llama.cpp, NexaML, and MLX to support hybrid serving patterns.
Checkpoint and deployment options make it practical to evaluate and run these models locally or in enterprise stacks. BF16 checkpoints and GGUF conversions simplify local pipelines, while signed artifacts and ISO/IEC 42001 coverage address provenance and compliance needs that are often important for production deployments.
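For the local GGUF path, a minimal sketch with llama-cpp-python follows; the filename and quantization level are placeholders, and a llama.cpp build that supports Granite's hybrid layers is assumed.

```python
# Hedged sketch: running a GGUF conversion locally with llama-cpp-python.
# The filename is a placeholder; use the actual file from IBM's GGUF repos
# on Hugging Face. Assumes a llama.cpp build with Granite 4.0 support.
from llama_cpp import Llama

llm = Llama(
    model_path="granite-4.0-h-micro-Q4_K_M.gguf",  # placeholder filename
    n_ctx=8192,        # context window for this session
    n_gpu_layers=-1,   # offload all layers to GPU if available
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three uses of Mamba-2 layers."}],
    max_tokens=128,
)
print(resp["choices"][0]["message"]["content"])
```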
Practical implications
The combination of hybrid Mamba-2/Transformer layers and MoE variants that activate only a fraction of their parameters appears to be a practical path to lower total cost of ownership. If IBM's reported memory reductions of more than 70% for long-context workloads hold up in practice, the same throughput can be served from smaller GPU fleets, and the strong instruction-following and tool-use results reinforce the case. For enterprises, the auditable nature of the releases and broad distribution channels reduce friction for adoption and experimentation.
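A back-of-envelope calculation shows where savings of this magnitude can come from: in a pure Transformer the KV cache grows linearly with context length in every layer, while in a 9:1 hybrid only one layer in ten carries a KV cache and the Mamba-2 layers keep a fixed-size state. All dimensions below are illustrative, not Granite's actual configuration.

```python
# Back-of-envelope KV-cache comparison: pure Transformer vs 9:1 hybrid.
# All dimensions are illustrative placeholders, not Granite's actual config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_el=2):
    # 2 tensors (K and V) per attention layer; BF16 = 2 bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el

LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
SSM_STATE_BYTES = 50 * 2**20  # assume ~50 MiB of fixed Mamba-2 state total

for seq_len in (8_192, 131_072):
    dense = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len)
    hybrid = (kv_cache_bytes(LAYERS // 10, KV_HEADS, HEAD_DIM, seq_len)
              + SSM_STATE_BYTES)
    print(f"{seq_len:>7} tokens: dense {dense/2**30:5.2f} GiB, "
          f"hybrid {hybrid/2**30:5.2f} GiB "
          f"({100 * (1 - hybrid / dense):.0f}% smaller)")
```

Because the Mamba-2 state stays fixed in size, the relative savings in this toy model grow with context length and with the number of concurrent sessions, which is consistent with where IBM reports the largest reductions.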
For full technical details, model cards and downloads, see the Hugging Face model page and IBM’s technical announcement. Additional resources and community links are available through IBM’s GitHub and social channels.