LFM2-VL: Liquid AI's Ultra-Fast, Open-Weight Vision-Language Models for On-Device Use

Liquid AI unveils LFM2-VL, two open-weight vision-language models optimized for fast, low-latency on-device inference, offering 450M and 1.6B variants and easy integration via Hugging Face.

What LFM2-VL is and why it matters

Liquid AI has released LFM2-VL, a family of vision-language foundation models built for low-latency, device-aware deployment. The family includes two compact variants, LFM2-VL-450M and LFM2-VL-1.6B, designed to bring multimodal AI capabilities to smartphones, laptops, wearables, and embedded systems without sacrificing responsiveness or benchmark performance.

Speed, efficiency, and target devices

Both LFM2-VL variants focus on inference speed and resource efficiency. Liquid AI reports up to 2× faster GPU inference compared with many existing vision-language models while delivering competitive results on image description, visual question answering, and multimodal reasoning. The 450M-parameter model targets highly resource-constrained environments, and the 1.6B-parameter model provides more capability while remaining suitable for single-GPU or high-end mobile deployment.

Technical innovations

  • Modular architecture: Each LFM2-VL model pairs a language backbone (LFM2-1.2B or LFM2-350M) with a SigLIP2 NaFlex vision encoder (400M or 86M parameters) and a multimodal projector. The pipeline uses a 'pixel unshuffle' technique to dynamically reduce image token counts and accelerate processing.
  • Native resolution handling: Images are processed at native resolution up to 512×512 without upscaling. Larger images are split into non-overlapping 512×512 patches to preserve detail and aspect ratio; the tiling and token-count arithmetic is sketched after this list. The 1.6B model additionally encodes a downscaled thumbnail of the full image to provide global context.
  • Flexible inference controls: Developers can tune the speed-quality tradeoff at runtime by adjusting maximum image tokens and patch counts, enabling real-time adaptation to device constraints and application needs.
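To make the tiling and token-reduction arithmetic concrete, here is a minimal sketch in PyTorch. The 512×512 tile size comes from the description above; the patch size and pixel-unshuffle factor are illustrative placeholders, not the model's actual internals.

```python
import math

import torch
import torch.nn.functional as F

TILE = 512       # non-overlapping tile size for large images (from the description above)
PATCH = 16       # assumed ViT patch size; the real SigLIP2 NaFlex value may differ
DOWNSCALE = 2    # assumed pixel-unshuffle factor; placeholder


def num_tiles(width: int, height: int, tile: int = TILE) -> int:
    """How many non-overlapping tile x tile crops cover the image."""
    return math.ceil(width / tile) * math.ceil(height / tile)


def tokens_per_tile(tile: int = TILE, patch: int = PATCH, downscale: int = DOWNSCALE) -> int:
    """Vision tokens per tile after patchification and pixel unshuffle.

    Pixel unshuffle folds each downscale x downscale spatial block into channels,
    so the token grid shrinks by downscale**2.
    """
    grid = tile // patch                 # 512 / 16 = 32 -> a 32x32 patch grid
    return (grid // downscale) ** 2      # (32 / 2)**2 = 256 tokens


# Pixel unshuffle on a feature map: (B, C, H, W) -> (B, C*r*r, H/r, W/r).
# Fewer spatial positions means fewer image tokens handed to the language model.
feats = torch.randn(1, 64, 32, 32)             # toy vision-encoder feature grid
reduced = F.pixel_unshuffle(feats, DOWNSCALE)  # -> torch.Size([1, 256, 16, 16])

print(num_tiles(1024, 768))   # 2 x 2 = 4 tiles for a 1024x768 image
print(tokens_per_tile())      # 256 tokens per tile under these assumptions
print(reduced.shape)
```

The maximum-image-token and patch-count controls mentioned in the last bullet effectively cap these figures at inference time; the exact processor parameters are documented on the model cards.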

Training and datasets

LFM2-VL builds on the pre-trained LFM2 backbone, which is then jointly mid-trained with the vision encoder to fuse vision and language capabilities, progressively rebalancing the ratio of text to image data. A final fine-tuning stage focused on image understanding used roughly 100 billion multimodal tokens, producing models calibrated for common vision-language tasks.

Benchmark performance and availability

LFM2-VL achieves competitive scores on benchmarks such as RealWorldQA, MM-IFEval, and OCRBench, often rivaling larger models such as InternVL3 and SmolVLM2 while using less memory and processing inputs much faster. Both model sizes are open-weight and downloadable from Hugging Face under an Apache 2.0-based license that permits free research use and commercial use by smaller companies; larger enterprises should contact Liquid AI for a commercial license. Integration with Hugging Face Transformers and support for quantization enable additional efficiency gains on edge hardware.
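As a rough illustration of the Transformers integration, the sketch below loads the smaller checkpoint through the generic image-text-to-text interface and runs a single captioning request. The repo ID LiquidAI/LFM2-VL-450M, the chat-template usage, and the local image path are assumptions based on the collection naming and standard Transformers conventions; the model cards give the recommended usage.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "LiquidAI/LFM2-VL-450M"  # assumed repo ID; confirm in the Hugging Face collection

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # small enough for a single GPU or a recent laptop
    device_map="auto",
)

image = Image.open("photo.jpg")   # placeholder local image
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# Build the multimodal prompt with the processor's chat template and generate.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```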

Use cases and integration

LFM2-VL targets developers and organizations building on-device multimodal applications. Typical uses include real-time image captioning, visual search, interactive multimodal chatbots, smart cameras, robotics, IoT devices, and mobile assistants. The models can be run with common inference stacks such as llama.cpp and quantized to suit different hardware. Liquid AI also supports further customization and multi-platform edge deployment via its LEAP platform.
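As one hedged example of the quantization path on GPU-class hardware, the snippet below applies a generic 4-bit bitsandbytes configuration when loading the larger model. This is a standard Transformers pattern rather than an LFM2-VL-specific recipe, and CPU or mobile deployments via llama.cpp would instead rely on GGUF quantizations; the repo ID is again an assumption.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

MODEL_ID = "LiquidAI/LFM2-VL-1.6B"  # assumed repo ID; confirm on Hugging Face

# Generic 4-bit NF4 quantization with bf16 compute (requires the bitsandbytes package).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

print(f"4-bit footprint: ~{model.get_memory_footprint() / 1e9:.2f} GB")
```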

How to get started

Both LFM2-VL models are available now in Liquid AI's Hugging Face collection. The project page and documentation include technical details, example inference code, and links to GitHub tutorials and notebooks. For ongoing updates and community engagement, Liquid AI points to its social channels and newsletter.
