
Hugging Face Launches nanoVLM: Train a Vision-Language Model in Just 750 Lines of PyTorch Code

Hugging Face has released nanoVLM, a compact PyTorch library that enables training a vision-language model from scratch in just 750 lines of code, combining efficiency, transparency, and strong performance.

Introducing nanoVLM: A Minimalist Vision-Language Framework

Hugging Face has unveiled nanoVLM, a compact and educational PyTorch library designed to train vision-language models (VLMs) from scratch using only 750 lines of code. Inspired by projects like nanoGPT by Andrej Karpathy, nanoVLM emphasizes readability and modularity without sacrificing practical usefulness.

Core Architecture and Components

At its foundation, nanoVLM integrates three main components: a visual encoder, a lightweight language decoder, and a modality projection layer that connects the two. The visual encoder is based on the SigLIP-B/16 transformer architecture, known for effective image feature extraction. This encoder converts images into embeddings suitable for language processing.

The language side employs SmolLM2, a causal decoder-only transformer optimized for simplicity and efficiency. Despite its compact design, it can generate coherent captions that are contextually relevant to the input images.

The projection layer aligns image embeddings into the input space of the language model, facilitating seamless interaction between vision and language modalities. The entire architecture is intentionally transparent and modular, making it ideal for learning and rapid experimentation.
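To make the three-part design concrete, here is a minimal, self-contained PyTorch sketch of how such a pipeline fits together. The class names (TinyVLM, ModalityProjection, DummyViT) and all dimensions are illustrative placeholders, not nanoVLM's actual code; the real library wires SigLIP-B/16 and SmolLM2 into this same encoder-projection-decoder pattern.

```python
import torch
import torch.nn as nn


class ModalityProjection(nn.Module):
    """Maps vision-encoder embeddings into the language model's input space."""

    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.proj(image_embeds)


class DummyViT(nn.Module):
    """Stand-in for a SigLIP-style vision transformer: patchify and embed."""

    def __init__(self, patch: int = 16, dim: int = 768):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        x = self.embed(pixel_values)          # (batch, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)   # (batch, num_patches, dim)


class TinyVLM(nn.Module):
    """Illustrative encoder -> projection -> decoder pipeline."""

    def __init__(self, vision_encoder: nn.Module, language_decoder: nn.Module,
                 vision_dim: int, lm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder      # nanoVLM uses SigLIP-B/16 here
        self.projection = ModalityProjection(vision_dim, lm_dim)
        self.language_decoder = language_decoder  # nanoVLM uses SmolLM2 here

    def forward(self, pixel_values: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        image_embeds = self.vision_encoder(pixel_values)  # encode the image
        image_tokens = self.projection(image_embeds)      # align to LM space
        # Prepend projected image tokens to the text sequence and decode.
        return self.language_decoder(torch.cat([image_tokens, text_embeds], dim=1))


# Shape check with random tensors; all dimensions here are placeholders.
model = TinyVLM(DummyViT(dim=768), nn.Identity(), vision_dim=768, lm_dim=512)
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 206, 512]): 196 image tokens + 10 text tokens
```

The key point the sketch illustrates is that the projection layer is the only glue between the two pretrained backbones, which is what keeps the full training loop small enough to fit in roughly 750 lines.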

Performance Highlights

nanoVLM achieves competitive results despite its simplicity. Trained on 1.7 million image-text pairs from the open-source the_cauldron dataset, it attains 35.3% accuracy on the MMStar benchmark, putting it on par with larger models such as SmolVLM-256M while using fewer parameters and less training compute.

The released pre-trained nanoVLM-222M model contains 222 million parameters, striking a balance between scale and efficiency. This demonstrates that thoughtful design can yield strong baseline results in vision-language tasks without massive resource demands.

Designed for Education and Extension

Unlike many complex production frameworks, nanoVLM prioritizes transparency and minimal abstraction. Every component is clearly structured, enabling developers to easily follow data flow and model logic. This makes it an excellent resource for educational purposes, reproducibility efforts, and workshops.

Its modular design also allows users to swap in more powerful vision encoders, language decoders, or projection mechanisms, making nanoVLM a solid foundation for exploring advanced research topics such as cross-modal retrieval, zero-shot captioning, or multimodal instruction-following agents.
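Because the three parts are ordinary nn.Modules, trying a different backbone is mostly a matter of resizing the projection to the new hidden sizes. The snippet below is a hedged sketch of that idea, reusing the ModalityProjection class from the example above; the checkpoint ids are real Hugging Face Hub models, but the wiring is illustrative and not nanoVLM's own configuration format.

```python
from transformers import AutoConfig

# Read the hidden sizes of a swapped-in encoder/decoder pair and size the
# projection layer to match. Both checkpoint ids exist on the Hub, but this
# is a sketch of the idea, not nanoVLM's config code.
vision_cfg = AutoConfig.from_pretrained("google/siglip-base-patch16-224").vision_config
lm_cfg = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-360M")

projection = ModalityProjection(vision_cfg.hidden_size, lm_cfg.hidden_size)
```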

Open Source and Community Integration

In alignment with Hugging Face's commitment to openness, the nanoVLM codebase and pre-trained models are freely available on GitHub and the Hugging Face Hub. This integration facilitates use with popular tools like Transformers, Datasets, and Inference Endpoints, simplifying deployment and fine-tuning.
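Loading the released checkpoint is intended to take only a few lines. The snippet below follows the usage described in the project's announcement and is an assumption about the current repository layout rather than a stable API; the import path and the repo id should be checked against the nanoVLM README before use.

```python
# Assumed usage, following the nanoVLM announcement: run inside a clone of
# github.com/huggingface/nanoVLM. The import path and the checkpoint id
# "lusxvr/nanoVLM-222M" may change as the project evolves.
from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M")
print(sum(p.numel() for p in model.parameters()))  # roughly 222M parameters
```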

With strong community support, nanoVLM is poised to evolve through contributions from educators, researchers, and developers, fostering innovation in vision-language modeling.
