Hugging Face Unveils SmolVLA: Efficient and Affordable Vision-Language-Action Model for Robotics
Hugging Face launches SmolVLA, an efficient and affordable vision-language-action model enabling real-time robotic control on low-cost hardware with open-source resources.
Addressing the Challenges in Robotic Control
Recent advancements in large-scale vision-language-action (VLA) models have propelled robotic control forward, but real-world application remains limited by the substantial hardware and data requirements. Most existing VLA models utilize transformer-based architectures with billions of parameters, incurring high memory and computational costs. This restricts experimentation to well-resourced labs and cloud environments, leaving out those working with lower-cost hardware. Furthermore, much of the progress in VLA research is proprietary or non-reproducible, hindering open research. Data heterogeneity across robotic platforms, such as variations in morphology, sensors, and control methods, further complicates generalization and cross-platform learning.
Introducing SmolVLA: A Lightweight, Open-Source VLA Model
Hugging Face introduces SmolVLA, a compact vision-language-action model designed for affordability and efficient deployment. Unlike traditional VLAs, SmolVLA is trained exclusively on community-collected datasets and optimized to operate on single-GPU or CPU systems. Its architecture combines a trimmed pretrained vision-language model (SmolVLM-2) with a transformer-based action expert, enabling effective low-level control based on natural language instructions and RGB camera inputs.
A key innovation is SmolVLA's asynchronous inference stack, which separates action prediction from execution. This design reduces latency, making it suitable for real-time control even on resource-limited hardware. The model is released under an open license, complete with code, training data, and deployment tools.
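The paper and released code describe the full asynchronous stack; the snippet below is only a minimal sketch of the core idea — overlapping chunk prediction and execution with a background thread and a one-slot queue. `predict_chunk` and `execute_action` are hypothetical stand-ins for the model call and the robot interface, not functions from the released code.

```python
import queue
import threading
import time

def predict_chunk(observation):
    """Hypothetical stand-in for a SmolVLA forward pass that returns
    a chunk (list) of low-level actions."""
    time.sleep(0.3)                          # pretend inference latency
    return [f"action_{i}" for i in range(10)]

def execute_action(action):
    """Hypothetical stand-in for sending one action to the robot."""
    time.sleep(0.05)                         # pretend control-step duration

def async_control_loop(get_observation, num_chunks=5):
    chunks = queue.Queue(maxsize=1)          # buffer at most one chunk ahead

    def predictor():
        for _ in range(num_chunks):
            # Predict the *next* chunk while the main loop is still
            # executing the current one.
            chunks.put(predict_chunk(get_observation()))

    threading.Thread(target=predictor, daemon=True).start()

    for _ in range(num_chunks):
        for action in chunks.get():          # blocks only if prediction lags
            execute_action(action)

if __name__ == "__main__":
    async_control_loop(get_observation=lambda: {"rgb": None, "state": None})
```

Because execution of the current chunk hides the latency of predicting the next one, the robot is not left idle between chunks, which is what makes real-time control feasible on slower hardware.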
Architectural Design and Efficiency Considerations
SmolVLA consists of two main components:
- Perception Module (SmolVLM-2): A compact vision-language encoder pretrained to process sequences of RGB images, sensorimotor states, and language instructions. Efficiency is achieved by downsampling visual tokens and using only the lower half of the transformer layers, as the earlier layers provide more transferable features.
- Action Expert: A lightweight transformer trained with flow matching that predicts sequences of continuous control actions. It alternates between self-attention and cross-attention layers to maintain action coherence and condition on perception inputs, with causal masking enforcing temporal consistency (a minimal sketch of the flow-matching objective follows this list).
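To make the flow-matching idea concrete, here is a minimal, self-contained sketch of the training objective for such an action expert. It is not the released SmolVLA code: the class name, layer counts, and tensor dimensions are illustrative, and the causal masking described above is omitted for brevity.

```python
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Toy action head: predicts a velocity field over an action chunk,
    conditioned on (already-computed) perception tokens."""
    def __init__(self, action_dim=6, chunk_len=50, cond_dim=512, d_model=256):
        super().__init__()
        self.in_proj = nn.Linear(action_dim + 1, d_model)   # +1 for the time t
        self.cond_proj = nn.Linear(cond_dim, d_model)       # align modalities
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=2)
        self.out_proj = nn.Linear(d_model, action_dim)

    def forward(self, noisy_actions, t, cond_tokens):
        # noisy_actions: (B, chunk_len, action_dim), t: (B, 1, 1)
        x = torch.cat([noisy_actions, t.expand(-1, noisy_actions.size(1), 1)], dim=-1)
        h = self.blocks(self.in_proj(x), self.cond_proj(cond_tokens))
        return self.out_proj(h)                              # predicted velocity

def flow_matching_loss(model, actions, cond_tokens):
    """Rectified-flow style objective: regress the velocity (x1 - x0) at a
    random point on the straight path from noise x0 to the action chunk x1."""
    x1 = actions
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1, 1)
    xt = (1 - t) * x0 + t * x1
    v_pred = model(xt, t, cond_tokens)
    return ((v_pred - (x1 - x0)) ** 2).mean()

# Example with random tensors standing in for a real batch.
model = ActionExpert()
actions = torch.randn(8, 50, 6)          # (batch, chunk_len, action_dim)
cond = torch.randn(8, 32, 512)           # perception tokens from the VLM
loss = flow_matching_loss(model, actions, cond)
loss.backward()
```

At inference time, flow matching integrates the learned velocity field from Gaussian noise toward an action chunk over a small number of steps, which is what lets a lightweight head produce coherent continuous actions.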
To minimize computational load, linear projections align token dimensions across modalities. Actions are predicted in chunks rather than single steps, reducing inference frequency. Training uses bfloat16 precision and PyTorch's JIT compilation to optimize runtime performance.
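The snippet below is a hedged sketch of these efficiency measures rather than the released implementation: a stand-in 12-layer encoder is pruned to its lower half, a linear layer projects its features to an assumed action-expert width, and inference runs under bfloat16 autocast with torch.compile standing in for the JIT compilation mentioned above. All dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Stand-in "pretrained" encoder: 12 identical transformer layers.
d_model = 384
full_encoder = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=6, batch_first=True)
    for _ in range(12)
)

# 1) Layer pruning: keep only the lower half of the layers, which the
#    paper reports carry the most transferable features.
pruned_layers = full_encoder[: len(full_encoder) // 2]

# 2) Linear projection to align the encoder width with the action expert.
expert_dim = 256
to_expert = nn.Linear(d_model, expert_dim)

def encode(tokens):
    h = tokens
    for layer in pruned_layers:
        h = layer(h)
    return to_expert(h)

# 3) bfloat16 autocast plus torch.compile for cheaper inference.
encode_fast = torch.compile(encode)
tokens = torch.randn(1, 64, d_model)     # e.g. downsampled visual tokens
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    features = encode_fast(tokens)
```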
Performance in Simulation and Real-World Tasks
SmolVLA was evaluated on simulation benchmarks (LIBERO and Meta-World) and on real-world robotic tasks using the low-cost SO100 and SO101 platforms. Trained from scratch on approximately 23,000 episodes drawn from 481 community datasets with automatically generated task labels, it performed on par with much larger models in both settings.
In the LIBERO benchmark, SmolVLA (0.45B parameters) reached an average success rate of 87.3%, rivaling or exceeding larger models like π₀ (3.3B parameters). On Meta-World, it outperformed diffusion policies and smaller VLAs across various task difficulties. These outcomes are especially notable given SmolVLA’s smaller training footprint and lack of robotics-specific pretraining.
In real-world tests, SmolVLA achieved a 78.3% success rate on pick-place, stacking, and sorting tasks, outperforming both ACT (trained from scratch) and π₀ (finetuned). It also demonstrated strong generalization across different robotic embodiments, maintaining performance on SO101 despite training only on SO100 data.
Advantages of Asynchronous Inference
The asynchronous inference stack enhances control efficiency by overlapping prediction and execution phases. Compared with traditional synchronous inference, this approach reduces average task completion time by about 30% and doubles the number of actions completed within fixed time frames. This is particularly advantageous for edge deployments where inference latency can hamper real-time control.
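The ~30% figure is the paper's measurement; the toy calculation below only illustrates where the saving comes from, using assumed (not reported) latencies for chunk prediction and chunk execution.

```python
# Illustrative numbers only; real latencies depend on the hardware and task.
t_infer = 0.30      # seconds to predict one action chunk
t_exec = 0.50       # seconds for the robot to execute one chunk
n_chunks = 10

# Synchronous: the robot idles while each next chunk is predicted.
sync_time = n_chunks * (t_infer + t_exec)                 # 8.0 s

# Asynchronous: prediction of chunk k+1 overlaps execution of chunk k,
# so each steady-state cycle costs roughly the slower of the two phases.
async_time = t_infer + n_chunks * max(t_infer, t_exec)    # 5.3 s (~34% faster)

print(f"sync:  {sync_time:.1f} s")
print(f"async: {async_time:.1f} s")
```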
Impact and Future Directions
SmolVLA proves that compact, reproducible, and open-source VLA models can enable effective robotic control on affordable hardware. Architectural strategies such as layer pruning, chunked action prediction, and asynchronous execution allow the model to retain performance while significantly lowering computational demands.
The open training and deployment resources, along with real-world validation, provide a solid foundation for continued research into efficient and accessible robot learning. Future work aims to expand cross-embodiment datasets, scale model capacity without increasing latency, and explore joint training on diverse multimodal corpora beyond robotics data.
For more information, check the paper and model on Hugging Face. All credit belongs to the researchers behind this project.