NVIDIA Open-Sources ViPE: Scalable 3D Video Annotation Engine for Spatial AI
Turning Ordinary Video into 3D Data
ViPE is a new open-source engine from NVIDIA that converts raw, in-the-wild video into the core 3D elements required for Spatial AI: camera intrinsics, precise camera motion (pose), and dense metric depth maps. It was designed to overcome long-standing tradeoffs between accuracy, robustness, and scalability that have limited 3D computer vision for years.
Why extracting 3D from 2D video is so hard
Most recorded footage is 2D, yet real-world perception systems need 3D geometry. Classical geometric pipelines like SLAM and SfM deliver high precision in controlled settings but are brittle when scenes include motion, low texture, or unknown cameras. End-to-end deep learning approaches are robust but become computationally intractable as video length grows. The result has been a stalemate: the field needs massive, high-quality 3D annotations, but available tools are either too fragile or too slow to scale.
ViPE’s hybrid design
ViPE breaks this deadlock by combining the mathematical rigor of classical optimization with the learned robustness of modern neural networks. At its core, ViPE uses an efficient keyframe-based bundle adjustment pipeline, augmented with learned components and priors that make the system robust to real-world video artifacts.
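To make the classical half of this design concrete, the sketch below shows a toy keyframe bundle adjustment: jointly refining camera poses and 3D landmarks by minimizing pixel reprojection error under a robust loss. This is a minimal illustration, not ViPE's implementation; the axis-angle pose parameterization, the single shared focal length `f`, and all function names are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, n_cams, n_pts, cam_idx, pt_idx, obs_xy, f):
    """Pixel reprojection residuals for all (keyframe, landmark) observations."""
    rvecs = params[:n_cams * 3].reshape(n_cams, 3)            # axis-angle rotations
    tvecs = params[n_cams * 3:n_cams * 6].reshape(n_cams, 3)  # translations
    pts = params[n_cams * 6:].reshape(n_pts, 3)               # 3D landmarks

    # World -> camera transform per observation, then pinhole projection.
    p_cam = Rotation.from_rotvec(rvecs[cam_idx]).apply(pts[pt_idx]) + tvecs[cam_idx]
    proj = f * p_cam[:, :2] / p_cam[:, 2:3]
    return (proj - obs_xy).ravel()

def bundle_adjust(rvecs, tvecs, pts, cam_idx, pt_idx, obs_xy, f):
    """Jointly refine keyframe poses and landmarks against 2D observations."""
    x0 = np.hstack([rvecs.ravel(), tvecs.ravel(), pts.ravel()])
    sol = least_squares(
        residuals, x0, loss="huber",  # robust loss downweights bad correspondences
        args=(len(rvecs), len(pts), cam_idx, pt_idx, obs_xy, f),
    )
    n = len(rvecs)
    return (sol.x[:n * 3].reshape(-1, 3),       # refined rotations
            sol.x[n * 3:n * 6].reshape(-1, 3),  # refined translations
            sol.x[n * 6:].reshape(-1, 3))       # refined landmarks
```

In ViPE, the correspondences feeding such an optimization come from the learned components described next, rather than from hand-crafted features alone.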
Key innovations
- Dense optical flow from learned networks to obtain robust correspondences across frames, even under challenging motion and occlusion.
- High-resolution sparse feature tracks to preserve classical geometric precision where it matters for localization.
- Metric depth priors from state-of-the-art monocular depth models to recover results in true real-world scale.
- Advanced segmentation with tools like GroundingDINO and Segment Anything to detect and mask moving objects, ensuring camera pose estimation is driven by the static environment (see the masking sketch after this list).
- Support for multiple camera models, including standard, wide-angle/fisheye, and 360-degree panoramic video, with automatic intrinsics optimization.
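As an illustration of the masking step referenced above, the sketch below drops flow correspondences that land on pixels a segmentation model has flagged as dynamic, so only the static scene constrains the pose solve. The function name and array conventions are assumptions for the example, not ViPE's API.

```python
import numpy as np

def filter_static_matches(pts0, pts1, dynamic_mask):
    """Keep only correspondences anchored on static scene content.

    pts0, pts1   : (N, 2) matched pixel coordinates (x, y) in two frames
    dynamic_mask : (H, W) bool map, True where a detector + segmenter
                   (e.g. GroundingDINO + Segment Anything) found a moving object
    """
    x = pts0[:, 0].round().astype(int)
    y = pts0[:, 1].round().astype(int)
    h, w = dynamic_mask.shape
    keep = np.zeros(len(pts0), dtype=bool)
    inb = (x >= 0) & (x < w) & (y >= 0) & (y < h)   # guard against off-image matches
    keep[inb] = ~dynamic_mask[y[inb], x[inb]]       # reject matches on dynamic pixels
    return pts0[keep], pts1[keep]
```

Filtering at the correspondence level keeps pedestrians, vehicles, and other movers from dragging the estimated camera trajectory.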
Speed, fidelity, and versatility
ViPE runs at roughly 3 to 5 frames per second on a single GPU, making it significantly faster than many competing methods while preserving high geometric accuracy. A post-processing alignment step fuses high-detail depth maps with geometrically consistent outputs from the core pipeline, producing depth that is both high-fidelity and temporally stable.
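One way to picture that alignment step: per frame, fit a scale and shift that map the detailed but scale-ambiguous monocular depth onto the metrically consistent depth from the geometric pipeline. The closed-form least-squares sketch below is a deliberate simplification (real pipelines often fit in inverse-depth space and weight pixels by confidence), not ViPE's actual fusion.

```python
import numpy as np

def align_depth(mono_depth, metric_depth, valid):
    """Fit s, t so that s * mono_depth + t ~= metric_depth on valid pixels."""
    d = mono_depth[valid].ravel()
    m = metric_depth[valid].ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)      # design matrix [depth, 1]
    (s, t), *_ = np.linalg.lstsq(A, m, rcond=None)  # closed-form least squares
    return s * mono_depth + t                       # detailed map, now metric
```

Applied per frame, this transfers the metric scale and temporal stability of the optimized depth onto the high-detail monocular prediction.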
Proven performance and datasets
Benchmarks show ViPE outperforms uncalibrated pose baselines by notable margins: about 18% improvement on the TUM dataset and about 50% on KITTI. Crucially, ViPE recovers consistent metric scale where other methods often fail.
Beyond the engine itself, NVIDIA used ViPE to generate massive annotated datasets to accelerate Spatial AI research:
- Dynpose-100K++: nearly 100,000 internet videos totaling about 15.7 million frames with high-quality poses and dense geometry.
- Wild-SDG-1M: about 1 million AI-generated videos totaling roughly 78 million frames.
- Web360: a set of annotated panoramic video clips.
These datasets total roughly 96 million annotated frames and are intended to fuel training of next-generation 3D foundation models and world-generation systems like NVIDIA Gen3C and Cosmos.
Where to get ViPE
NVIDIA has open-sourced ViPE and published the code, research page, and datasets. Links from the original release include the project page and GitHub repository, plus dataset releases on Hugging Face. Researchers and practitioners can use ViPE both as a tool and as a large-scale annotation pipeline to create diverse geometric training data for robotics, AR/VR, and autonomous systems.