MapAnything: A Single Transformer for Metric 3D Reconstruction from Images

Why a universal 3D reconstruction model?

Image-based 3D reconstruction has long relied on multi-stage pipelines: feature extraction, pose estimation, bundle adjustment, and multi-view stereo or monocular depth networks. These modular approaches deliver good results, but they demand task-specific tuning, expensive optimization and heavy post-processing. MapAnything aims to replace that fragmented stack with a single feed-forward transformer that directly regresses factored, metric 3D scene geometry from images and optional sensor inputs.

Core ideas and representation

MapAnything uses a factored scene representation that separates the geometric outputs into interpretable components: per-view ray directions (camera calibration), up-to-scale depth along those rays, camera poses relative to a reference view, and a single global metric scale factor. This explicit factorization reduces redundancy and lets the same model handle many tasks — monocular depth, multi-view stereo, structure-from-motion (SfM), two-view reconstruction and depth completion — without bespoke heads.
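
To make the factorization concrete, the following is a minimal sketch, with assumed tensor shapes and names rather than the released code, of how the four factors recombine into metric world-space point maps.

```python
import torch

def compose_metric_pointmaps(ray_dirs, depth, cam_to_world, metric_scale):
    """Combine the factored predictions into metric 3D points in the world frame.

    ray_dirs:     (V, H, W, 3) unit ray directions per view, in camera coordinates
    depth:        (V, H, W, 1) up-to-scale depth along each ray
    cam_to_world: (V, 4, 4)    camera poses relative to the reference view
    metric_scale: scalar       single global metric scale factor
    """
    # Up-to-scale points in each camera's frame: depth taken along the ray direction.
    pts_cam = ray_dirs * depth                                        # (V, H, W, 3)

    # Rigidly transform every view's points into the reference/world frame.
    R = cam_to_world[:, :3, :3]                                       # (V, 3, 3)
    t = cam_to_world[:, :3, 3]                                        # (V, 3)
    pts_world = torch.einsum("vij,vhwj->vhwi", R, pts_cam) + t[:, None, None, :]

    # One scalar lifts the whole up-to-scale scene to metric units.
    return metric_scale * pts_world
```

Because the metric scale enters only as a final scalar multiplication, per-view geometry can be predicted up to scale and then lifted to metric units by the single global factor.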

The architecture is a multi-view alternating-attention transformer. Input images are encoded with DINOv2 ViT-L features, while optional geometric inputs such as rays, depth and poses are embedded into the same latent space via shallow CNNs or MLPs. A learnable scale token enables metric normalization across views, giving the model the ability to output metric-consistent reconstructions without iterative bundle adjustment.
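
A schematic sketch of the alternating-attention pattern is shown below; the layer widths, head counts, normalization placement and scale-token handling are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One transformer block: self-attention within each view, then across all views.

    Assumed token layout: (B, V, N, D) = batch, views, tokens per view, feature width.
    """
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        B, V, N, D = x.shape

        # Intra-view attention: each view's tokens attend only within that view.
        h = x.reshape(B * V, N, D)
        q = self.norm1(h)
        h = h + self.intra(q, q, q, need_weights=False)[0]

        # Inter-view attention: tokens of all views (e.g. image patches plus any
        # learnable scale token) attend to each other jointly.
        g = h.reshape(B, V * N, D)
        q = self.norm2(g)
        g = g + self.inter(q, q, q, need_weights=False)[0]

        # Position-wise feed-forward.
        g = g + self.mlp(self.norm3(g))
        return g.reshape(B, V, N, D)
```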

Flexible inputs and large-scale inference

MapAnything accepts up to 2,000 images in a single pass and can consume auxiliary information when available: camera intrinsics, poses and depth priors. During training, probabilistic input dropout randomly omits geometric inputs so the model learns to operate under heterogeneous configurations. Covisibility-based sampling ensures input views overlap meaningfully, enabling reconstructions that aggregate information across 100+ views.
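
Covisibility-based sampling can be illustrated with a simple greedy scheme over a pairwise covisibility matrix; the function below and its scoring are hypothetical, and the actual sampling strategy may differ.

```python
import numpy as np

def sample_covisible_views(covis, num_views, seed=0):
    """Greedily pick a well-connected set of views from a covisibility matrix.

    covis: (N, N) array where covis[i, j] scores how much views i and j observe
           the same surface (e.g. fraction of covisible pixels).
    """
    rng = np.random.default_rng(seed)
    n = covis.shape[0]
    selected = [int(rng.integers(n))]                  # start from a random seed view
    while len(selected) < num_views:
        remaining = [i for i in range(n) if i not in selected]
        # Score each candidate by its best overlap with the already selected set.
        scores = np.array([covis[i, selected].max() for i in remaining])
        # Sample proportionally to overlap so batches stay connected but diverse.
        probs = scores / scores.sum() if scores.sum() > 0 else None
        selected.append(int(rng.choice(remaining, p=probs)))
    return selected
```

During training, a sampler along these lines would be applied per scene before assembling a multi-view batch, so that the selected views actually share observable geometry.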

Training at scale and released variants

The team trained MapAnything across 13 diverse indoor, outdoor and synthetic datasets, including BlendedMVS, Mapillary Planet-Scale Depth, ScanNet++ and TartanAirV2. Two model variants are released: an Apache 2.0 licensed model trained on six datasets and a CC BY-NC model trained on all thirteen datasets for stronger performance. Training used 64 H200 GPUs with mixed precision, gradient checkpointing and curriculum schedules that scaled inputs from 4 to 24 views.

Key training techniques include probabilistic geometric input dropout, covisibility-based view sampling and factored losses applied in log-space to optimize depth, scale and pose with scale-invariant and robust regression objectives.
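
As a rough sketch of these ingredients, the code below shows per-view geometric input dropout and a scale-invariant, robust log-space regression term; the dropout probabilities, key names and exact loss form are assumptions, not the paper's objectives.

```python
import torch

def drop_geometric_inputs(view, p_ray=0.5, p_depth=0.5, p_pose=0.5):
    """Randomly omit optional geometric inputs for one view during training, so the
    model learns to reconstruct from images alone or from any partial combination."""
    out = dict(view)                                   # view: dict of per-view tensors
    if torch.rand(()) < p_ray:
        out.pop("ray_dirs", None)
    if torch.rand(()) < p_depth:
        out.pop("depth", None)
    if torch.rand(()) < p_pose:
        out.pop("pose", None)
    return out

def scale_invariant_log_loss(pred, target, eps=1e-6):
    """Robust regression in log space: the residual is invariant to a global scale
    because log(s * d) = log(s) + log(d)."""
    diff = torch.log(pred.clamp(min=eps)) - torch.log(target.clamp(min=eps))
    diff = diff - diff.mean()                          # remove the global log-scale offset
    return diff.abs().mean()                           # robust L1 on the remaining residual
```

Subtracting the mean log residual makes the term insensitive to a global scale mismatch, which fits the factored design in which a single scalar carries the metric scale.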

Benchmarks and performance highlights

Across multiple benchmarks, MapAnything attains state-of-the-art results in dense multi-view reconstruction, two-view reconstruction, single-view camera calibration and metric depth estimation.

Overall, the unified training and factored outputs yield up to a 2× improvement over the previous state of the art on many tasks, while drastically reducing the need for task-specific pipelines and expensive post-processing.

Key contributions and open-source release

The work highlights four main contributions: a unified feed-forward model handling 12+ tasks; a factored scene representation that separates rays, depth, pose and metric scale; state-of-the-art performance across diverse benchmarks; and an open-source release with data processing, training code, benchmarks and pretrained weights under Apache 2.0. The project page and paper provide full details and reproducible code: https://map-anything.github.io/assets/MapAnything.pdf

MapAnything demonstrates that a single transformer backbone, trained across heterogeneous data and inputs, can replace specialized reconstruction pipelines and serve as a general-purpose 3D reconstruction foundation.