
Apple's FastVLM Revolutionizes Vision Language Models with Superior Speed and Accuracy

Apple researchers unveil FastVLM, a Vision Language Model that drastically improves processing speed and accuracy for high-resolution images using a novel hybrid encoder architecture.

Challenges in High-Resolution Vision Language Models

Vision Language Models (VLMs) integrate textual and visual data processing, but handling high-resolution images poses significant challenges. Pretrained vision encoders often underperform on high-resolution inputs because of inefficient pretraining and increased computational demands. High-resolution images also produce more visual tokens, which drives up time-to-first-token (TTFT), the sum of the vision encoder's latency and the language model's prefilling time.
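
To make the latency picture concrete, here is a minimal back-of-the-envelope sketch of how TTFT decomposes into vision encoding plus prefilling, and why both components grow with resolution. The patch size and per-token costs below are purely illustrative assumptions, not measured numbers from the paper.

```python
# Illustrative sketch: TTFT = vision-encoder latency + LLM prefill latency,
# and both terms scale with the number of visual tokens the encoder emits.

def visual_token_count(image_size: int, patch_size: int = 14) -> int:
    """ViT-style tokenization: one token per non-overlapping patch (assumed patch size)."""
    return (image_size // patch_size) ** 2

def estimate_ttft(image_size: int,
                  encoder_ms_per_token: float = 0.02,
                  prefill_ms_per_token: float = 0.05) -> float:
    """Hypothetical per-token costs, just to show how TTFT grows with resolution."""
    tokens = visual_token_count(image_size)
    return tokens * encoder_ms_per_token + tokens * prefill_ms_per_token

# Quadrupling the pixel count roughly quadruples the visual tokens,
# and with them both components of TTFT.
for size in (336, 672, 1344):
    print(size, visual_token_count(size), f"{estimate_ttft(size):.1f} ms")
```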

Existing Approaches and Limitations

Models such as Frozen, Florence, LLaVA, mPLUG-Owl, MiniGPT-4, and Cambrian-1 merge image and text understanding through architectures such as cross-attention and autoregressive decoding. Vision transformers pretrained with the CLIP objective remain popular, with variants such as SigLIP, EVA-CLIP, InternViT, and DFNCLIP used for encoding. Dynamic token-pruning methods and hierarchical backbones such as ConvNeXT and FastViT attempt to reduce token counts and improve efficiency. More recently, ConvLLaVA introduced a pure convolutional vision encoder for VLMs.

Introducing FastVLM: Optimized Resolution, Latency, and Accuracy

Apple researchers developed FastVLM to balance image resolution, latency, and accuracy. FastVLM uses FastViTHD, a hybrid vision encoder that outputs fewer tokens and cuts encoding time for high-resolution images. By simply scaling the input image, FastVLM reaches a strong resolution-latency-accuracy trade-off: it improves TTFT by 3.2× in the LLaVA-1.5 setup, and it outperforms LLaVA-OneVision at its maximum resolution with the same 0.5B language model, delivering 85× faster TTFT with a vision encoder that is 3.4× smaller.
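
At a high level, this follows the familiar LLaVA-style layout of vision encoder, projector, and language model. The sketch below shows where an encoder that emits fewer visual tokens pays off: fewer embeddings are prepended to the prompt, so prefilling is shorter. The module names, dimensions, and the two-layer MLP connector are assumptions for illustration, not Apple's released code.

```python
import torch
import torch.nn as nn

class ToyFastVLM(nn.Module):
    """Schematic LLaVA-style pipeline: encoder -> projector -> language model."""
    def __init__(self, vision_encoder, vision_dim: int, llm_dim: int, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a FastViTHD-like backbone
        self.projector = nn.Sequential(               # two-layer MLP connector (assumed)
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vis_tokens = self.vision_encoder(image)       # (B, N_vis, vision_dim); small N_vis is the goal
        vis_embeds = self.projector(vis_tokens)       # map into the LLM embedding space
        prefix = torch.cat([vis_embeds, text_embeds], dim=1)
        return self.llm(prefix)                       # prefill over visual + text tokens

# Example with stand-in components: a dummy "encoder" producing 256 tokens
# and an identity "LLM", just to show the tensor flow.
if __name__ == "__main__":
    enc = lambda img: torch.randn(img.shape[0], 256, 768)   # pretend: 256 visual tokens
    model = ToyFastVLM(enc, vision_dim=768, llm_dim=1024, llm=nn.Identity())
    out = model(torch.randn(1, 3, 1024, 1024), torch.randn(1, 32, 1024))
    print(out.shape)                                  # (1, 256 + 32, 1024)
```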

Architecture and Training Details

FastVLM models are trained on a single node with eight NVIDIA H100-80GB GPUs; stage-1 training is fast, taking about 30 minutes with a Qwen2-7B decoder. FastViTHD extends FastViT with an additional downsampling stage, so self-attention operates on tensors downsampled by a factor of 32, generating 4× fewer visual tokens and reducing encoding latency. The architecture comprises five stages: the first three use RepMixer blocks for efficient convolutional processing, and the last two use multi-headed self-attention blocks, balancing computational efficiency and image understanding.
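
That stage layout can be sketched roughly as follows. This is an assumption-laden schematic of the idea, not the released FastViTHD implementation: the RepMixer stand-in, channel widths, and strides are illustrative. The point it captures is that convolutional mixing handles the large early feature maps, while self-attention only sees a grid downsampled 32× or more, which keeps the visual token count small.

```python
import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    """Stand-in for a RepMixer block: depthwise token mixing plus a pointwise channel MLP."""
    def __init__(self, dim: int):
        super().__init__()
        self.token_mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.channel_mix = nn.Sequential(
            nn.Conv2d(dim, 4 * dim, kernel_size=1), nn.GELU(), nn.Conv2d(4 * dim, dim, kernel_size=1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.token_mix(x)        # cheap convolutional spatial mixing
        return x + self.channel_mix(x)   # channel mixing

class AttnBlock(nn.Module):
    """Multi-headed self-attention over an already heavily downsampled spatial grid."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)               # (B, H*W, C)
        q = self.norm(seq)
        seq = seq + self.attn(q, q, q, need_weights=False)[0]
        return seq.transpose(1, 2).reshape(b, c, h, w)

class ToyFastViTHD(nn.Module):
    """Five stages: three convolutional, two attention, with downsampling between stages."""
    def __init__(self, dims=(64, 128, 256, 512, 768)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)     # input /4
        self.blocks = nn.ModuleList(
            [ConvMixerBlock(d) for d in dims[:3]] + [AttnBlock(d) for d in dims[3:]]
        )
        self.downs = nn.ModuleList(
            [nn.Conv2d(dims[i], dims[i + 1], kernel_size=3, stride=2, padding=1) for i in range(4)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)                                  # /4
        for i, block in enumerate(self.blocks):           # stages run at /4, /8, /16, /32, /64
            x = block(x)
            if i < len(self.downs):
                x = self.downs[i](x)
        return x.flatten(2).transpose(1, 2)               # (B, N_vis, C) visual tokens

# A 1024x1024 input ends up as a 16x16 grid, i.e. 256 visual tokens.
tokens = ToyFastViTHD()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)                                       # torch.Size([1, 256, 768])
```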

Performance and Benchmark Comparisons

Compared to ConvLLaVA with the same language model and training data, FastVLM scores 8.4% higher on TextVQA and 12.5% higher on DocVQA while running 22% faster, and at higher resolutions it processes images twice as fast as ConvLLaVA across benchmarks. With intermediate pretraining on 15 million samples for resolution scaling, it matches or exceeds MM1 performance while generating 5× fewer visual tokens. FastVLM also outperforms Cambrian-1 while being 7.9× faster, and with scaled instruction tuning it achieves better results using 2.3× fewer visual tokens.

Benchmarking and Efficiency

FastVLM demonstrates state-of-the-art trade-offs between resolution, latency, and accuracy on M1 MacBook Pro hardware. The hybrid FastViTHD backbone, pretrained on reinforced image-text data, reduces visual token output with minimal accuracy loss. The result is competitive performance across multiple VLM benchmarks alongside significant efficiency gains in TTFT and vision backbone parameters.

For further details, check the original paper. Credit goes to the Apple researchers behind this project.
