Tencent Unveils PrimitiveAnything: Transforming 3D Shape Reconstruction with Auto-Regressive Primitive Generation
Tencent introduces PrimitiveAnything, a novel AI framework that reconstructs 3D shapes by sequentially generating geometric primitives, enhancing semantic understanding and generalization across diverse objects.
Understanding Shape Primitive Abstraction
Shape primitive abstraction, which decomposes complex 3D forms into simple geometric units, is central to human visual perception and has long mattered in computer vision and graphics. Existing 3D generation methods built on representations such as meshes, point clouds, and neural fields produce high-fidelity geometry but often lack the semantic structure and interpretability needed for applications like robotics and scene understanding.
Challenges in Traditional Methods
Traditional approaches fall into two camps: optimization-based techniques that fit geometric primitives directly but tend to over-segment shapes in ways that clash with their semantics, and learning-based methods trained on small, category-specific datasets that struggle to generalize. Early methods used simple shapes such as cuboids and cylinders, later giving way to more expressive primitives like superquadrics. Yet aligning primitive abstraction with human cognition while maintaining broad applicability remains challenging.
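To make "more expressive" concrete: a superquadric is defined by the standard inside-outside function below, where two shape exponents smoothly interpolate between box-like and rounded forms (a textbook formulation, not specific to this paper).

```latex
% Superquadric inside-outside function: F = 1 on the surface, F < 1 inside.
% a_x, a_y, a_z are per-axis scales; \epsilon_1, \epsilon_2 control roundness.
F(x, y, z) = \left( \left(\tfrac{x}{a_x}\right)^{2/\epsilon_2}
  + \left(\tfrac{y}{a_y}\right)^{2/\epsilon_2} \right)^{\epsilon_2/\epsilon_1}
  + \left(\tfrac{z}{a_z}\right)^{2/\epsilon_1}
```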
Introducing PrimitiveAnything
Inspired by advances in 3D content generation that pair large datasets with auto-regressive transformers, Tencent researchers developed PrimitiveAnything. The framework recasts shape abstraction as a generative task: a decoder-only transformer, conditioned on features of the input shape, emits a variable-length sequence of primitives, assembling them one by one in a way that mirrors human reasoning.
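A minimal PyTorch sketch of this kind of decoder-only, shape-conditioned generation loop follows; every class, size, and parameter name here is an illustrative assumption rather than the authors' code, and positional embeddings are omitted for brevity.

```python
# Sketch of a decoder-only transformer emitting primitive tokens conditioned
# on shape features. Names and hyperparameters are assumed for illustration.
import torch
import torch.nn as nn

class PrimitiveDecoder(nn.Module):
    def __init__(self, vocab_size=1024, d_model=512, n_heads=8, n_layers=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, shape_feats):
        # tokens: (B, T) ids of primitive attributes emitted so far
        # shape_feats: (B, S, d_model) condition encoded from the input shape
        x = self.tok_emb(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.decoder(x, memory=shape_feats, tgt_mask=causal)
        return self.head(h)  # (B, T, vocab_size) next-token logits

@torch.no_grad()
def generate(model, shape_feats, bos_id=0, eos_id=1, max_len=512):
    """Greedy decoding: append tokens until end-of-sequence (simplified:
    stops only when every sequence in the batch has just emitted EOS)."""
    tokens = torch.full((shape_feats.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        next_tok = model(tokens, shape_feats)[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == eos_id).all():
            break
    return tokens
```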
Technical Innovations
PrimitiveAnything uses a unified, ambiguity-free parameterization that covers multiple primitive types, supporting high geometric accuracy and efficient learning. Each primitive's type, position, rotation, and scale is encoded as discrete tokens, which the transformer predicts autoregressively; a cascaded decoder models the dependencies between attributes, yielding coherent primitive assemblies. Training combines cross-entropy losses on the discrete tokens with a Chamfer Distance reconstruction loss, using Gumbel-Softmax to keep the sampling step differentiable. Generation continues until an end-of-sequence token appears, enabling flexible, human-like decomposition.
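For concreteness, a discrete parameterization of this sort could look like the sketch below; the bin count, value ranges, Euler-angle rotation convention, and attribute ordering are assumptions for illustration, not the paper's exact scheme.

```python
# Illustrative tokenization of one primitive into discrete ids.
# N_BINS, the value ranges, and the rotation convention are assumed here.
import numpy as np

N_BINS = 256  # discretization resolution per continuous attribute (assumed)

def quantize(x, lo, hi, n_bins=N_BINS):
    """Map a continuous value in [lo, hi] to a discrete token id."""
    x = float(np.clip(x, lo, hi))
    return round((x - lo) / (hi - lo) * (n_bins - 1))

def primitive_to_tokens(prim):
    """Serialize one primitive as [type, position*3, rotation*3, scale*3]."""
    tokens = [prim["type"]]  # e.g. 0 = cuboid, 1 = cylinder, ...
    tokens += [quantize(v, -1.0, 1.0) for v in prim["position"]]
    tokens += [quantize(v, -np.pi, np.pi) for v in prim["rotation"]]
    tokens += [quantize(v, 0.0, 1.0) for v in prim["scale"]]
    return tokens

# Example: an axis-aligned cuboid near the origin.
print(primitive_to_tokens({
    "type": 0,
    "position": (0.0, 0.1, -0.2),
    "rotation": (0.0, 0.0, 0.0),
    "scale": (0.5, 0.5, 0.5),
}))
```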
Dataset and Evaluation
The researchers also built HumanPrim, a dataset of 120,000 3D samples annotated with primitive assemblies. Evaluations using Chamfer Distance, Earth Mover's Distance, Hausdorff Distance, Voxel-IoU, and segmentation metrics show that PrimitiveAnything outperforms existing optimization-based and learning-based methods, and ablation studies confirm the contribution of each component. The framework also supports 3D content generation from text or images, offering user-friendly editing, high-quality modeling, and over 95% storage savings.
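Among these metrics, Chamfer Distance is the primary reconstruction measure; a brute-force reference implementation under one common convention (mean of squared nearest-neighbor distances, summed over both directions) is sketched below.

```python
# Reference Chamfer Distance between two point clouds (O(N*M) brute force).
# Conventions vary (squared vs. unsquared distances); this uses squared.
import numpy as np

def chamfer_distance(a, b):
    """a: (N, 3) and b: (M, 3) points sampled from two surfaces."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return (d.min(axis=1) ** 2).mean() + (d.min(axis=0) ** 2).mean()

# Example: distance between two random point sets.
rng = np.random.default_rng(0)
print(chamfer_distance(rng.random((1024, 3)), rng.random((1024, 3))))
```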
Potential Applications
PrimitiveAnything's efficient and modular design suits interactive 3D applications such as gaming, where performance and ease of manipulation are critical. Its ability to generalize across object categories and align with human abstraction patterns makes it a promising tool for robotics, scene understanding, and creative content generation.
For more details, see the Paper, Demo, and GitHub page linked by the researchers.