LFM2-VL-3B: Liquid AI's 3B Vision-Language Model Built for Edge Devices
What LFM2-VL-3B brings
Liquid AI has released LFM2-VL-3B, a 3-billion-parameter vision-language model focused on image-text-to-text tasks. It extends the LFM2-VL family beyond the 450M and 1.6B variants, targeting higher accuracy while retaining the efficient runtime characteristics of the LFM2 architecture. The model is available on LEAP and Hugging Face under the LFM Open License v1.0.
Input interface and prompt style
LFM2-VL-3B accepts interleaved image and text inputs and produces text outputs. It uses a ChatML-like template in which the processor inserts an <image> sentinel at each position where an image appears, so prompts do not need to place the token by hand.
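As a rough sketch of what that looks like with the Hugging Face processor (the repo id and image URL here are illustrative; check the model page for the exact snippet):

```python
from transformers import AutoProcessor

# Illustrative repo id; see the Hugging Face page for the exact name.
processor = AutoProcessor.from_pretrained("LiquidAI/LFM2-VL-3B")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

# apply_chat_template renders the ChatML-like prompt and places the
# <image> sentinel where the image appears in the conversation.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
```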
Architecture overview
The stack pairs a language tower with a shape-aware vision tower and a projector. The language tower is LFM2-2.6B, a hybrid backbone combining convolution and attention. The vision tower is SigLIP2 NaFlex at 400M parameters, designed to preserve native aspect ratios and avoid distortion. The connector is a 2-layer MLP with pixel unshuffle that compresses image tokens before fusing them into the language embedding space. This design lets users cap vision token budgets without retraining the model.
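The connector idea can be sketched in a few lines of PyTorch. The dimensions below are illustrative assumptions (a SigLIP2-style 1152-wide patch embedding, a hypothetical 2048-wide language embedding, a 2×2 unshuffle factor), not the released weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelUnshuffleProjector(nn.Module):
    """Sketch of a 2-layer MLP connector with pixel unshuffle.

    Pixel unshuffle folds each 2x2 block of patch embeddings into the
    channel dimension, cutting the token count by 4x before projection.
    Dimensions are assumptions for illustration only.
    """

    def __init__(self, vision_dim: int = 1152, text_dim: int = 2048, factor: int = 2):
        super().__init__()
        self.factor = factor
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * factor * factor, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patches: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # patches: (batch, h * w, vision_dim) from the vision tower
        b, n, c = patches.shape
        grid = patches.transpose(1, 2).reshape(b, c, h, w)
        grid = F.pixel_unshuffle(grid, self.factor)   # (b, c*4, h/2, w/2)
        tokens = grid.flatten(2).transpose(1, 2)      # (b, n/4, c*4)
        return self.mlp(tokens)                       # into the language space

# A 16x24 patch grid (e.g., a 256x384 image with 16-px patches)
# becomes 8x12 = 96 language-space tokens.
proj = PixelUnshuffleProjector()
out = proj(torch.randn(1, 16 * 24, 1152), h=16, w=24)
print(out.shape)  # torch.Size([1, 96, 2048])
```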
The encoder processes native resolutions up to 512×512. Larger inputs are split into non-overlapping 512×512 patches, and a thumbnail pathway provides global context during tiling. The model card documents the token mapping with concrete examples: a 256×384 image maps to 96 tokens, while a 1000×3000 image maps to 1,020 tokens. It also exposes user controls for minimum and maximum image tokens plus a tiling switch, so developers can tune the speed-quality trade-off at inference time.
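The 96-token figure is consistent with a simple back-of-the-envelope model: if the encoder uses 16-pixel patches and the 2×2 pixel unshuffle folds four patches into one token, each token covers a 32×32 pixel area. A sketch of that arithmetic, under those assumptions (it ignores tiling and the thumbnail pathway, so it does not apply to large inputs like the 1000×3000 example):

```python
def estimate_image_tokens(width: int, height: int,
                          patch: int = 16, unshuffle: int = 2) -> int:
    """Rough token estimate for an image small enough to skip tiling.

    Assumes one vision token per (patch * unshuffle)^2 pixel area; exact
    counts for large, tiled inputs come from the LFM2-VL processor.
    """
    stride = patch * unshuffle  # 32 pixels per token side under these assumptions
    return (width // stride) * (height // stride)

print(estimate_image_tokens(256, 384))  # 96, matching the documented example
```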
Inference recommendations
The Hugging Face model card lists recommended generation parameters. Text decoding uses temperature 0.1, min-p 0.15, and a repetition penalty of 1.05. Vision settings suggest a minimum of 64 image tokens, a maximum of 256, and image splitting enabled. The processor applies the chat template and the image sentinel automatically. Example usage in the model card relies on AutoModelForImageTextToText and AutoProcessor with bfloat16 precision.
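Putting those recommendations together, a minimal generation sketch might look like the following (the repo id, image URL, and max_new_tokens budget are assumptions for illustration; the decoding settings are the ones listed above):

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "LiquidAI/LFM2-VL-3B"  # illustrative repo id
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/photo.jpg"},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Recommended decoding settings from the model card.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,  # assumed budget for this example
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```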
Training and data strategy
Liquid AI describes a staged training approach. The team runs joint mid-training that adjusts the text-to-image data ratio over time, followed by supervised fine-tuning focused on image understanding. Training data combines large-scale open datasets with in-house synthetic vision data to improve task coverage.
Reported benchmarks
The research team reports competitive results among lightweight open VLMs. Key reported scores include MM-IFEval 51.83, RealWorldQA 71.37, MMBench (dev, en) 79.81, and POPE 89.01. Language capability remains close to the LFM2-2.6B backbone, at roughly 30 percent on GPQA and 63 percent on MMLU, which matters when perception tasks include knowledge queries. The team also highlights expanded multilingual visual understanding across English, Japanese, French, Spanish, German, Italian, Portuguese, Arabic, Chinese, and Korean.
Why edge users should care
The architecture keeps compute and memory within small device budgets. Image tokens are compressible and user-constrained, so throughput is predictable. The 400M SigLIP2 NaFlex encoder preserves aspect ratios, which helps fine-grained perception. The pixel-unshuffle projector cuts the number of image tokens handed to the language tower, which improves tokens per second. Liquid AI also published a GGUF build for on-device runtimes. These properties make LFM2-VL-3B useful for robotics, mobile, and industrial applications that need local processing and strict data boundaries.
Key takeaways
- Compact multimodal stack combining the LFM2-2.6B language tower with a 400M SigLIP2 NaFlex vision encoder and a 2-layer MLP projector with pixel unshuffle.
- Native-resolution handling up to 512×512 with non-overlapping tiling and a thumbnail pathway for global context; documented token mappings such as 256×384 → 96 tokens and 1000×3000 → 1,020 tokens.
- ChatML-like prompting with an <image> sentinel, a large default text context, recommended decoding settings, and processor controls for image splitting and token budgets.
- Competitive open-VLM performance for its size; open weights and a GGUF build ease on-device integration.
For more technical details and the model card, see the Liquid AI blog and the Hugging Face model page.