
PEVA: Revolutionizing Egocentric Video Prediction with Full-Body Motion

PEVA is a new AI model that predicts egocentric video frames conditioned on detailed full-body motion, improving visual anticipation in embodied systems.

Understanding Body Movement and Visual Perception

Human visual perception from an egocentric viewpoint is vital for building intelligent systems that can understand and interact with their surroundings. Movements ranging from walking to arm gestures directly influence what is seen from a first-person perspective. Grasping this connection enables machines and robots to anticipate visual input in a human-like way, especially in dynamic real-world environments.

Challenges in Linking Motion to Visual Changes

Teaching systems how physical actions affect perception is complex. Movements such as turning or bending alter visibility in subtle, sometimes delayed ways. Capturing these effects accurately requires more than predicting future video frames; it demands linking each physical movement to its visual consequences. Without this capability, embodied agents cannot plan or interact effectively in changing settings.

Limitations of Existing Models

Previous models that predict video from human motion often relied on low-dimensional inputs such as velocity or head direction, ignoring the complexity of whole-body movement. These approaches miss the detailed coordination required to mimic human actions accurately. Body motion was often treated as an output to be inferred from video rather than as a conditioning signal that drives prediction, limiting the models' usefulness for practical planning.

Introducing PEVA: Whole-Body Conditioned Video Prediction

To overcome these gaps, researchers from UC Berkeley, Meta FAIR, and NYU developed PEVA, a framework that predicts future egocentric video frames conditioned on full-body motion data from 3D pose trajectories. PEVA grounds the connection between action and perception using a conditional diffusion transformer trained on Nymeria, a large dataset of synchronized egocentric videos and full-body motion capture.
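To make the autoregressive, action-conditioned setup concrete, the sketch below shows how a conditional diffusion model could roll out future frames in latent space, one body action at a time. The `denoise_step` interface, the four-frame context window, and the tensor shapes are illustrative assumptions, not the released PEVA code.

```python
# Minimal sketch of autoregressive latent rollout conditioned on body actions.
# `model.denoise_step`, the context length, and shapes are assumptions for
# illustration; they do not reflect the actual PEVA implementation.
import torch

@torch.no_grad()
def rollout(model, context_latents: torch.Tensor, actions: torch.Tensor,
            n_denoise_steps: int = 50) -> torch.Tensor:
    """Predict one latent frame per action, feeding each prediction back as context.

    context_latents: (T, C, h, w) latents of observed frames
    actions:         (N, 48) one full-body action vector per future step
    """
    latents = [z for z in context_latents]
    for action in actions:
        z = torch.randn_like(latents[-1])            # start each frame from noise
        for step in reversed(range(n_denoise_steps)):
            # denoise conditioned on the most recent frames and the body action
            z = model.denoise_step(z, step,
                                   context=torch.stack(latents[-4:]),
                                   action=action)
        latents.append(z)
    return torch.stack(latents[len(context_latents):])  # predicted future latents
```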

Structured Action Representation and Model Design

PEVA represents each action as a 48-dimensional vector comprising 3D root translation and the 3D rotations of 15 upper-body joints, normalized in a pelvis-centered coordinate frame. This representation captures continuous, nuanced motion. An autoregressive diffusion model encodes video frames into latent states and predicts future frames from previous states and body actions. Random time-skips during training let the model learn both immediate and delayed visual effects.
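The 48 dimensions follow directly from this layout: 3 for root translation plus 15 joints × 3 rotation parameters each. A minimal sketch of assembling such a vector, assuming Euler-angle deltas and illustrative function names:

```python
# Sketch of the 48-D action layout described above: 3-D root translation plus
# 3-D rotations for 15 upper-body joints in a pelvis-centered frame.
# The rotation parameterization (Euler-angle deltas) is an assumption here.
import numpy as np

NUM_UPPER_BODY_JOINTS = 15

def build_action(root_translation: np.ndarray,   # (3,) pelvis translation delta
                 joint_rotations: np.ndarray      # (15, 3) per-joint rotation deltas
                 ) -> np.ndarray:
    """Concatenate root motion and joint rotations into one 48-D action vector."""
    assert root_translation.shape == (3,)
    assert joint_rotations.shape == (NUM_UPPER_BODY_JOINTS, 3)
    return np.concatenate([root_translation, joint_rotations.reshape(-1)])  # (48,)
```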

Performance and Results

PEVA delivered strong results for both short-term (2-second) and long-term (up to 16-second) video prediction. It achieved lower LPIPS scores and higher DreamSim consistency than baselines, indicating better visual quality and semantic accuracy. The model also responded to decomposed, atomic actions such as arm movements and body rotations, enabling fine-grained control. Long rollouts maintained sequence coherence while simulating delayed outcomes, confirming the benefit of full-body conditioning.
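For reference, LPIPS (a learned perceptual distance where lower means the prediction looks closer to the ground truth) can be computed with the public `lpips` Python package. The snippet below is a rough per-frame evaluation sketch, not PEVA's actual evaluation script.

```python
# Rough sketch of per-frame perceptual evaluation with the public `lpips`
# package (pip install lpips); PEVA's exact evaluation pipeline may differ.
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')  # AlexNet backbone, the common default

def frame_lpips(pred: torch.Tensor, target: torch.Tensor) -> float:
    """pred/target: (3, H, W) RGB tensors scaled to [-1, 1]; lower is better."""
    return loss_fn(pred.unsqueeze(0), target.unsqueeze(0)).item()
```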

Advancing Physically Grounded Embodied AI

This work marks a significant step forward in egocentric video prediction by physically grounding models in full-body motion. Using structured pose data and diffusion-based learning, PEVA enables embodied AI to foresee future visual scenes with greater accuracy and realism, paving the way for smarter, more responsive systems.
