
Meta AI Unveils Open-Sourced PE-AV for Multimodal Retrieval

Explore how Meta’s PE-AV encodes audio, video, and text into a unified embedding space.

Overview of PE-AV

Meta researchers have introduced Perception Encoder Audiovisual (PE-AV), an encoder designed for joint audio and video understanding. The model is trained with large-scale contrastive learning on approximately 100M audio-video pairs, combined with textual captions, so that all modalities land in a single aligned embedding space.
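
To make the shared-space idea concrete, here is a minimal sketch (not Meta’s code) in which stubbed encoders map a batch of audio clips, video clips, and captions to L2-normalized vectors of the same width; contrastive training would push the similarity of matching clip/caption pairs toward the diagonal of each matrix. The embedding width and the stub encoder are assumptions.

```python
import torch
import torch.nn.functional as F

d = 512  # hypothetical embedding width, not the released model's

def encode_stub(batch_size: int) -> torch.Tensor:
    """Stand-in for a real modality encoder: returns L2-normalized embeddings."""
    return F.normalize(torch.randn(batch_size, d), dim=-1)

audio_emb = encode_stub(4)   # would come from the audio tower
video_emb = encode_stub(4)   # would come from the video tower
text_emb = encode_stub(4)    # would come from the text encoder

# Cosine-similarity matrices between modalities; contrastive training pushes
# the diagonal (matching clip/caption pairs) up and off-diagonal entries down.
sim_audio_text = audio_emb @ text_emb.T
sim_video_text = video_emb @ text_emb.T
print(sim_audio_text.shape, sim_video_text.shape)  # torch.Size([4, 4]) twice
```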

From Perception Encoder to PE-AV

The Perception Encoder (PE) serves as the foundational vision stack in Meta’s Perception Models project. It comprises a family of encoders for images, video, and audio that reach state-of-the-art results across multiple benchmarks with a unified contrastive pretraining recipe. Notably, PE has surpassed SigLIP2 on image tasks and InternVideo2 on video tasks. PE-AV extends this stack with tighter audio-video-text alignment, enabling improved cross-modal understanding.

Architecture: Separate Towers and Fusion

The PE-AV architecture is built around several specialized encoders:

  • The video path employs the existing PE frame encoder on RGB frames, which is then supplemented with a temporal video encoder.
  • The audio path uses a DAC VAE codec to convert raw waveforms into audio tokens at a fixed frame rate, roughly one embedding every 40 milliseconds.

These components feed an audio-video fusion encoder that learns a shared representation across both modalities. A text encoder then projects captions and queries into these aligned spaces, enabling retrieval across modalities.
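
The sketch below illustrates this tower-plus-fusion layout at the shape level only; the module names, layer counts, and dimensions are illustrative assumptions rather than Meta’s released architecture, and the pretrained frame encoder and audio codec are replaced by inputs that are assumed to be already embedded.

```python
import torch
import torch.nn as nn

class AVFusionSketch(nn.Module):
    """Shape-level stand-in for the tower-plus-fusion design (not Meta's code)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Temporal encoder over per-frame video features from the frame encoder.
        self.temporal_video = nn.TransformerEncoder(layer(), n_layers)
        # Encoder over codec-derived audio tokens (~one every 40 ms).
        self.audio_encoder = nn.TransformerEncoder(layer(), n_layers)
        # Fusion encoder attends over the concatenated audio and video tokens.
        self.fusion = nn.TransformerEncoder(layer(), n_layers)

    def forward(self, video_frames: torch.Tensor, audio_tokens: torch.Tensor):
        v = self.temporal_video(video_frames)           # (B, T_video, d)
        a = self.audio_encoder(audio_tokens)            # (B, T_audio, d)
        fused = self.fusion(torch.cat([a, v], dim=1))   # (B, T_audio + T_video, d)
        # Mean-pool each stream to clip-level embeddings for contrastive training.
        return v.mean(dim=1), a.mean(dim=1), fused.mean(dim=1)

model = AVFusionSketch()
v_emb, a_emb, av_emb = model(torch.randn(2, 16, 512), torch.randn(2, 50, 512))
print(v_emb.shape, a_emb.shape, av_emb.shape)  # each torch.Size([2, 512])
```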

Data Engine: Synthetic Audiovisual Captions at Scale

Meta’s research team built a two-stage audiovisual data engine that synthesizes high-quality captions for unlabeled clips. The first stage runs multiple weak audio caption models alongside video captioners and feeds their outputs into a large language model (LLM), which generates three types of captions per clip. This pipeline provides the synthetic supervision used to train an initial PE-AV model. In the second stage, that PE-AV model is paired with a Perception Language Model decoder to re-caption the clips, significantly improving audiovisual correspondence.
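
As a rough outline of the control flow (nothing more), the following sketch stubs out the captioners, the LLM merge step, and the second-stage re-captioning; every function name and signature here is a hypothetical placeholder, not an actual Meta API.

```python
from typing import Callable, Dict, List

def weak_audio_captions(clip) -> List[str]:
    """Placeholder for several weak audio captioning models."""
    return ["placeholder audio caption"]

def weak_video_caption(clip) -> str:
    """Placeholder for a video captioning model."""
    return "placeholder video caption"

def llm_merge(audio_caps: List[str], video_cap: str) -> Dict[str, str]:
    """Placeholder LLM call that merges weak captions into three caption types."""
    return {
        "audio": audio_caps[0],
        "video": video_cap,
        "audio_video": f"{audio_caps[0]} while {video_cap}",
    }

def stage_one(clips) -> List[Dict[str, str]]:
    """Stage 1: synthetic supervision used to train the initial PE-AV model."""
    return [llm_merge(weak_audio_captions(c), weak_video_caption(c)) for c in clips]

def stage_two(clips, peav_encoder: Callable, plm_decoder: Callable) -> List[Dict[str, str]]:
    """Stage 2: re-caption clips with the trained encoder plus a PLM decoder."""
    return [plm_decoder(peav_encoder(c)) for c in clips]
```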

Contrastive Objectives Across Modality Pairs

PE-AV uses a sigmoid-based contrastive loss that operates over audio, video, text, and their fused representations. Pretraining combines eight contrastive loss pairs covering these modality combinations, so a single model can handle classification, retrieval, and audiovisual correspondence tasks.
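
Below is a minimal sketch of what a sigmoid (SigLIP-style) contrastive term over one modality pair looks like, assuming L2-normalized clip-level embeddings; the temperature and bias values are illustrative and the function is not Meta’s implementation. In PE-AV, a term of this kind would be applied to each of the modality pairings described above.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(x: torch.Tensor, y: torch.Tensor,
                             temperature: float = 10.0, bias: float = -10.0) -> torch.Tensor:
    """x, y: L2-normalized embeddings of matched pairs, shape (B, d)."""
    logits = x @ y.T * temperature + bias      # (B, B) pairwise match logits
    labels = 2 * torch.eye(x.size(0)) - 1      # +1 on the diagonal, -1 elsewhere
    # Each (row, column) pair is an independent binary "is this a true match?" decision.
    return -F.logsigmoid(labels * logits).mean()

audio = F.normalize(torch.randn(8, 512), dim=-1)  # e.g. pooled audio embeddings
text = F.normalize(torch.randn(8, 512), dim=-1)   # e.g. caption embeddings
print(sigmoid_contrastive_loss(audio, text))
```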

Performance Across Audio, Speech, Music, and Video

PE-AV reports strong zero-shot retrieval and classification results across several benchmarks; a short sketch of how zero-shot evaluation works in a shared embedding space follows the list. Highlights include:

  • AudioCaps: text-to-audio retrieval improved from 35.4 R@1 to 45.8 R@1.
  • VGGSound: classification accuracy improved from 36.0% to 47.1%.
  • Speech retrieval on VCTK tasks: accuracy reached 85.6%.
  • ActivityNet: text-to-video retrieval surged from 60.4 R@1 to 66.5 R@1.
  • Kinetics 400: zero-shot video classification increased from 76.9% to 78.9%.
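
To ground what “zero-shot” means for numbers like VGGSound and Kinetics above, the sketch below classifies a clip by embedding class-name prompts with a text encoder and picking the closest one under cosine similarity. The encoders are stubbed with random vectors rather than the released PE-AV checkpoints, and the class list is made up.

```python
import torch
import torch.nn.functional as F

d = 512
class_names = ["dog barking", "playing violin", "rain falling"]  # made-up labels

# Stand-ins for the real text and audio/video encoders (random, for illustration).
prompt_emb = F.normalize(torch.randn(len(class_names), d), dim=-1)  # one row per class prompt
clip_emb = F.normalize(torch.randn(1, d), dim=-1)                   # pooled clip embedding

# Zero-shot classification: rank class prompts by cosine similarity to the clip.
scores = (clip_emb @ prompt_emb.T).squeeze(0)
prediction = class_names[scores.argmax().item()]
print(prediction, scores.tolist())

# Zero-shot retrieval works the same way with caption embeddings in place of
# class prompts: rank candidate clips by similarity to the query text.
```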

Conclusion

Overall, PE-AV brings audio, video, and text into an aligned embedding space through separate encoder towers, an audio-video fusion encoder, and a large-scale synthetic caption pipeline. The release marks a meaningful step forward for multimodal retrieval and multimedia understanding.
