LFM2-8B-A1B: Liquid AI’s Sparse MoE That Runs a 1.5B Active Path on Phones
What LFM2-8B-A1B aims to do
Liquid AI released LFM2-8B-A1B, a small-scale Mixture-of-Experts (MoE) model engineered for on-device execution under tight memory, latency, and energy constraints. The model carries 8.3 billion total parameters but activates only about 1.5 billion per token through sparse expert routing. That design targets phones, laptops, and embedded systems rather than cloud batch serving, trading dense per-token compute for a sparse, high-capacity representation.
Architecture and routing
LFM2-8B-A1B builds on the LFM2 fast backbone and injects sparse-MoE feed-forward blocks to grow capacity without substantially increasing active compute. The backbone includes 18 gated short-convolution blocks and 6 grouped-query attention (GQA) blocks. All layers except the first two incorporate an MoE block; the first two layers remain dense to improve stability.
Each MoE block contains 32 experts. Routing selects the top-4 experts per token using a normalized-sigmoid gating function with an adaptive routing bias to balance expert load and stabilize training. The model supports a context length of 32,768 tokens, a vocabulary size of 65,536, and reports a pre-training budget around 12 trillion tokens.
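To make the routing step concrete, the sketch below implements top-4 selection over 32 experts with sigmoid gating, a per-expert bias added before selection, and renormalization of the chosen gates. The shapes, the `route` helper, and the exact bias and normalization handling are illustrative assumptions, not Liquid AI’s implementation.

```python
# Minimal sketch of sparse top-4 routing with sigmoid gating.
# Shapes, bias handling, and normalization are illustrative assumptions.
import torch

NUM_EXPERTS, TOP_K = 32, 4

def route(hidden, router_weight, routing_bias):
    """hidden: [tokens, d_model]; router_weight: [num_experts, d_model];
    routing_bias: [num_experts], an assumed additive load-balancing term."""
    logits = hidden @ router_weight.t()                 # [tokens, num_experts]
    scores = torch.sigmoid(logits + routing_bias)       # gate values in (0, 1)
    topk_scores, topk_idx = scores.topk(TOP_K, dim=-1)  # pick 4 experts per token
    # Normalize the selected gates so each token's expert weights sum to 1.
    weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_idx, weights

# Toy usage with hypothetical sizes: 8 tokens, model width 64.
tokens = torch.randn(8, 64)
w_router = torch.randn(NUM_EXPERTS, 64)
bias = torch.zeros(NUM_EXPERTS)
idx, gate = route(tokens, w_router, bias)
print(idx.shape, gate.shape)  # torch.Size([8, 4]) torch.Size([8, 4])
```

Only the four selected expert MLPs run for a given token; the gate weights scale their outputs before they are summed back into the residual stream.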
This sparse approach keeps per-token FLOPs and cache growth bounded by the active path (attention plus four expert MLPs), while the total parameter budget enables broader specialization across domains such as multilingual knowledge, math, and code—areas where very small dense models often underperform.
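As a quick back-of-envelope check on that claim, 4-of-32 routing activates one eighth of the expert parameters per token; the widths in the snippet below are placeholders, not the model’s published dimensions, so the absolute counts are only indicative.

```python
# Back-of-envelope: 4-of-32 routing touches 1/8 of the expert parameters per
# token. Widths are placeholders, not LFM2-8B-A1B's published sizes; the dense
# backbone (conv/attention/embeddings) adds further parameters on top.
D_MODEL = 2048          # hypothetical model width
D_EXPERT = 1792         # hypothetical expert intermediate width
NUM_EXPERTS, TOP_K = 32, 4
MOE_LAYERS = 22         # 18 conv + 6 attention blocks, minus the 2 dense layers

params_per_expert = 3 * D_MODEL * D_EXPERT             # gate, up, down projections
total_expert = MOE_LAYERS * NUM_EXPERTS * params_per_expert
active_expert = MOE_LAYERS * TOP_K * params_per_expert

print(f"total expert params : {total_expert / 1e9:.2f} B")   # 7.75 B
print(f"active expert params: {active_expert / 1e9:.2f} B")  # 0.97 B
print(f"active fraction     : {TOP_K / NUM_EXPERTS:.3f}")    # 0.125
```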
Performance signals
Liquid AI reports CPU benchmarks in which LFM2-8B-A1B, running on an internal XNNPACK-based stack with a custom CPU MoE kernel, decodes noticeably faster than Qwen3-1.7B. Public plots focus on int4 quantization with int8 dynamic activations, evaluated on an AMD Ryzen AI 9 HX370 and a Samsung Galaxy S24 Ultra. The team positions LFM2-8B-A1B’s quality as comparable to 3–4B dense models while keeping active compute near 1.5B parameters per token.
The model card includes results across 16 benchmarks, covering knowledge (MMLU/MMLU-Pro/GPQA), instruction following (IFEval/IFBench/Multi-IF), math (GSM8K/GSMPlus/MATH500/MATH-Lvl-5), and multilingual tests (MGSM/MMMLU). Scores show competitive instruction-following and math performance within the small-model band and improved knowledge capacity versus LFM2-2.6B, consistent with the larger total parameter budget.
Deployment and tooling
LFM2-8B-A1B is available with Transformers/vLLM support for GPU inference and GGUF builds for llama.cpp. The official GGUF repo lists common quantized builds from Q4_0 (~4.7 GB) up to F16 (~16.7 GB) for local runs. Running the model in llama.cpp requires a recent build with lfm2moe support (b6709+); otherwise, users may see “unknown model architecture” errors.
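For the Transformers/GPU path, the standard causal-LM loading pattern should apply; the repository id, dtype choice, and chat-template call below are assumptions to verify against the model card, and a transformers release with LFM2-MoE support is required.

```python
# Minimal sketch of GPU inference via Transformers. The repo id and generation
# settings are assumptions; check the model card for exact requirements.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-8B-A1B"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~17 GB of weights at 16-bit precision
    device_map="auto",
)

messages = [{"role": "user", "content": "Why does sparse MoE help on-device inference?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For CPU-only local runs, the GGUF builds above with llama.cpp (b6709 or newer) are the lighter-weight path.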
Liquid AI’s CPU validation used Q4_0 weights with int8 dynamic activations on the AMD Ryzen AI 9 HX370 and Samsung Galaxy S24 Ultra, where LFM2-8B-A1B showed higher decode throughput than Qwen3-1.7B, a model in a similar active-parameter class. ExecuTorch is referenced as an option for mobile and embedded CPU deployment.
The model is released under the LFM Open License v1.0 (lfm1.0) and is published with standard weights and GGUF files for local, low-latency usage.
Practical implications
LFM2-8B-A1B demonstrates that sparse MoE architectures can be practical below the traditional server-scale regime. By keeping token compute near 1.5B while expanding total capacity to 8.3B parameters, the model offers a path to on-device assistants and copilots that balance quality, latency, and private local inference. With standard tooling paths (llama.cpp, vLLM, ExecuTorch) and permissive licensing, it is a concrete option for developers targeting high-end consumer and edge hardware.