NVIDIA Unveils Describe Anything 3B: Advanced Multimodal Model for Precise Image and Video Captioning
NVIDIA introduces Describe Anything 3B, a multimodal large language model that excels in detailed, region-specific captioning for images and videos, outperforming existing models on multiple benchmarks.
Challenges in Region-Specific Captioning
Describing precise areas within images or videos has long been a challenge for vision-language models. General-purpose models generate broad captions but struggle with detailed, localized descriptions, and the difficulty grows with video data because of temporal dynamics and occlusions. Key issues include the loss of fine-grained visual detail during feature extraction, the scarcity of annotated datasets for regional descriptions, and evaluation methods that can penalize accurate captions because of incomplete references.
Introducing Describe Anything 3B
NVIDIA presents Describe Anything 3B (DAM-3B), a multimodal large language model designed specifically for fine-grained, localized captioning of images and videos. The system, including DAM-3B-Video, can process inputs that specify regions via points, bounding boxes, scribbles, or masks, generating contextually rich descriptive text. It supports both static images and dynamic videos, and the models are publicly accessible on Hugging Face.
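The exact loading and inference entry points are documented on the Hugging Face model cards; the sketch below is only a hypothetical illustration of region-prompted captioning. The AutoModel loading path and the describe method are assumptions, not the published API; only the image and mask preparation uses standard PIL/NumPy calls.

```python
# Hypothetical usage sketch -- the real entry points are documented on the
# nvidia/DAM-3B model card; names marked "assumed" are illustrative only.
import numpy as np
from PIL import Image
from transformers import AutoModel  # assumed: the checkpoint ships remote code

image = Image.open("street_scene.jpg").convert("RGB")

# Region prompt as a binary mask derived from a bounding box (x0, y0, x1, y1);
# points, scribbles, or masks can serve as region prompts as well.
x0, y0, x1, y1 = 120, 80, 360, 300
mask = np.zeros((image.height, image.width), dtype=np.uint8)
mask[y0:y1, x0:x1] = 255
mask = Image.fromarray(mask)

# Assumed loading path: the repository provides custom modeling code.
model = AutoModel.from_pretrained("nvidia/DAM-3B", trust_remote_code=True)

# `describe` is a placeholder for whatever region-captioning method the
# released code actually exposes.
caption = model.describe(image=image, mask=mask,
                         prompt="Describe the highlighted region in detail.")
print(caption)
```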
Architecture and Innovations
DAM-3B features two main innovations: a focal prompt and a localized vision backbone with gated cross-attention. The focal prompt combines the full image with a high-resolution crop of the target region, preserving both detailed regional information and overall context. The localized vision backbone embeds image and mask inputs and uses cross-attention to merge global and focal features before passing them to the large language model. This design maintains computational efficiency by avoiding token length inflation.
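A toy PyTorch sketch of the fusion idea, not NVIDIA's implementation: global-image tokens attend to focal-crop tokens through a cross-attention layer whose output is scaled by a learnable gate initialized at zero, so training starts from the plain global features and the token count fed to the language model does not grow. All module names and sizes here are illustrative.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Toy sketch: fuse global-image tokens with focal-crop tokens.

    The gate starts at zero (tanh(0) = 0), so the fused features begin
    close to the global ones and gradually admit focal detail. Dimensions
    are illustrative, not DAM-3B's actual configuration.
    """
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # learnable scalar gate
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_tokens: torch.Tensor, focal_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from the full-image tokens; keys/values come from the
        # high-resolution crop of the target region (the focal prompt).
        attended, _ = self.cross_attn(self.norm(global_tokens),
                                      focal_tokens, focal_tokens)
        # Gated residual: output token count equals the global token count,
        # so the LLM's input length is unchanged.
        return global_tokens + torch.tanh(self.gate) * attended


# Example shapes: 256 global tokens, 256 focal-crop tokens, width 1024.
fusion = GatedCrossAttentionFusion()
fused = fusion(torch.randn(1, 256, 1024), torch.randn(1, 256, 1024))
print(fused.shape)  # torch.Size([1, 256, 1024])
```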
DAM-3B-Video extends this architecture to handle temporal sequences, encoding frame-wise region masks and integrating them over time. This enables accurate region-specific descriptions in videos despite occlusions and motion.
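Conceptually, the video variant encodes each frame together with its region mask and lets the language model attend over the resulting per-frame tokens. The standalone sketch below illustrates that frame-wise pattern; the encoder, its dimensions, and the mask handling are illustrative, not the released architecture.

```python
import torch
import torch.nn as nn

class FramewiseRegionEncoder(nn.Module):
    """Toy sketch of the video extension: embed each frame jointly with its
    binary region mask, then concatenate the per-frame tokens along time so a
    language model can track the region through motion and occlusion."""
    def __init__(self, dim: int = 1024, patch: int = 16):
        super().__init__()
        # RGB patches plus a 1-channel mask are embedded together.
        self.embed = nn.Conv2d(3 + 1, dim, kernel_size=patch, stride=patch)

    def forward(self, frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W); masks: (T, 1, H, W) with 1 inside the region.
        tokens = self.embed(torch.cat([frames, masks], dim=1))   # (T, dim, h, w)
        tokens = tokens.flatten(2).transpose(1, 2)               # (T, h*w, dim)
        return tokens.reshape(1, -1, tokens.shape[-1])           # (1, T*h*w, dim)


encoder = FramewiseRegionEncoder()
frames = torch.randn(8, 3, 224, 224)
masks = (torch.rand(8, 1, 224, 224) > 0.5).float()
print(encoder(frames, masks).shape)  # torch.Size([1, 1568, 1024])
```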
Training Data and Evaluation
To address data scarcity, NVIDIA developed the DLC-SDP pipeline, a semi-supervised approach combining segmentation datasets with unlabeled web-scale images to create a training set of 1.5 million localized examples. A self-training method refines region descriptions to ensure high-quality captions.
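The precise pipeline details are in the paper; the pseudocode-style sketch below only illustrates the general shape of such a semi-supervised loop, expanding segmentation labels into detailed region captions and then self-training on unlabeled images. Every helper name is a placeholder.

```python
# Illustrative sketch in the spirit of DLC-SDP; all helpers are placeholders,
# not released code.

def build_localized_caption_dataset(segmentation_data, unlabeled_images, model):
    dataset = []

    # Stage 1: expand short segmentation labels into detailed region captions,
    # using the ground-truth masks as region prompts.
    for image, mask, class_label in segmentation_data:
        caption = model.describe(image, mask,
                                 prompt=f"Describe this {class_label} in detail.")
        dataset.append((image, mask, caption))

    # Stage 2: self-training on unlabeled web-scale images -- propose regions,
    # caption them, and keep only high-confidence outputs.
    for image in unlabeled_images:
        for mask in propose_regions(image):            # placeholder region proposer
            caption, confidence = model.describe_with_score(image, mask)
            if confidence > 0.9:                       # placeholder quality filter
                dataset.append((image, mask, caption))

    return dataset
```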
For evaluation, NVIDIA introduced DLC-Bench, focusing on attribute-level correctness rather than exact matches with reference captions. DAM-3B outperforms competitors like GPT-4o and VideoRefer across seven benchmarks, excelling in keyword-level, phrase-level, and multi-sentence localized captioning tasks. On DLC-Bench, it achieves an average accuracy of 67.3%, leading in both detail and precision.
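One way to read "attribute-level correctness": each region comes with attributes a good caption should mention and attributes it must not hallucinate, and accuracy is the fraction of such checks passed. The minimal sketch below illustrates that scoring idea with naive substring matching; it is not DLC-Bench's actual protocol, and the attribute lists and threshold are made up for the example.

```python
# Illustrative attribute-level scorer, not DLC-Bench's actual protocol.
# A caption is rewarded for covering expected attributes and penalized for
# mentioning attributes that are not present in the region.

def attribute_accuracy(caption: str, positives: list[str], negatives: list[str]) -> float:
    caption_lower = caption.lower()
    checks = []
    checks += [attr.lower() in caption_lower for attr in positives]      # should appear
    checks += [attr.lower() not in caption_lower for attr in negatives]  # must not appear
    return sum(checks) / len(checks)


caption = "A red vintage car with chrome bumpers parked by the curb."
print(attribute_accuracy(caption,
                         positives=["red", "chrome bumpers"],
                         negatives=["motorcycle", "blue"]))  # 1.0
```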
Broad Implications
Describe Anything 3B overcomes previous limitations in region-specific captioning by integrating a context-aware architecture with scalable, high-quality data generation. Its capabilities extend across applications such as accessibility tools, robotics, and video content analysis. NVIDIA’s release offers a robust benchmark and a clear technical roadmap for future multimodal AI advancements.
Check out the Paper, Model on Hugging Face, and Project Page.