
Salesforce AI's FOFPred: Advancing Robot Control with AI

FOFPred integrates language-driven predictions for enhanced robot control and video generation, revolutionizing motion forecasting.

FOFPred: A Breakthrough in Optical Flow Prediction

The Salesforce AI research team introduces FOFPred, a framework that integrates a large vision-language model with a diffusion transformer for robust future optical flow prediction. Given one or more images alongside a natural language instruction, such as 'moving the bottle from right to left', FOFPred predicts four future optical flow frames that detail pixel motion over time.

Understanding Future Optical Flow

Optical flow denotes the per-pixel displacement between frames; FOFPred focuses specifically on predicting future optical flow. This provides a compact representation of motion, well suited for robot control policies and as an input to video diffusion models. By emphasizing motion rather than static appearance, FOFPred simplifies the output distribution: it does not need to model textures, which streamlines motion planning.
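To make the representation concrete, here is a minimal NumPy sketch of what a dense optical flow field is: an H x W x 2 array of per-pixel (dx, dy) displacements that, added to a pixel grid, gives each pixel's predicted next position. The shapes and values are illustrative, not taken from the paper.

```python
import numpy as np

# A dense optical flow field stores one (dx, dy) displacement per pixel.
H, W = 4, 4
flow = np.zeros((H, W, 2), dtype=np.float32)
flow[..., 0] = -1.0  # every pixel moves one pixel to the left (dx = -1)

# Add the flow to a pixel coordinate grid to get predicted next positions.
ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
next_xs = xs + flow[..., 0]
next_ys = ys + flow[..., 1]

print(next_xs[0, 0])  # pixel (0, 0) moves to x = -1.0
```

A predictor like FOFPred outputs four such H x W x 2 fields, one per future frame, rather than four RGB images.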

Unifying Vision Language Models and Diffusion Transformers

Utilizing a unified architecture, FOFPred incorporates:

  • Qwen2.5-VL for vision-language encoding.
  • Flux.1 VAE for latent image encoding.
  • DiT, an OmniGen-style diffusion transformer that generates latent future flow sequences.

Only the DiT and the MLP projectors are trained; Qwen2.5-VL and the Flux.1 VAE remain frozen to preserve their image-editing pretraining and multimodal reasoning capabilities. Temporal modeling is handled through spatio-temporal positional encoding, which yields efficient attention without adding parameters.
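The freeze-and-train split described above follows a common PyTorch pattern. The sketch below uses tiny placeholder modules (nn.Linear stand-ins, not the real Qwen2.5-VL, Flux.1 VAE, or DiT) to show how gradients are restricted to the DiT and projectors; it is an assumption-laden illustration, not the paper's training code.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the real components.
vlm = nn.Linear(16, 32)        # stand-in for the Qwen2.5-VL encoder
vae = nn.Linear(8, 32)         # stand-in for the Flux.1 VAE
projector = nn.Linear(32, 32)  # trained MLP projector
dit = nn.Linear(32, 32)        # trained diffusion transformer

# Freeze the pretrained backbones so they keep their pretrained weights.
for module in (vlm, vae):
    for p in module.parameters():
        p.requires_grad = False

# Only the trainable parameters are handed to the optimizer.
trainable = [p for m in (projector, dit) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Keeping the backbones frozen means only a small fraction of the total parameters receive gradients, which keeps fine-tuning cheap while reusing the pretrained representations.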

Training on Noisy Video Data

The training corpus comprises 500,000 video-caption pairs drawn from the Something-Something V2 and EgoDex datasets. FOFPred is trained end to end with a flow matching objective and uses classifier-free guidance for robustness. Reliable training targets are built from relative optical flow:

  1. Dense optical flow is computed.
  2. Camera motion is estimated and subtracted.
  3. Only segments with notable motion are retained for training.
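The three steps above can be sketched in a few lines of NumPy. The camera-motion estimate here (the per-channel median displacement) and the motion threshold are hypothetical simplifications chosen for illustration; the paper's actual estimator and filtering criteria may differ.

```python
import numpy as np

def relative_flow(flow, motion_thresh=0.5, keep_frac=0.1):
    """Build a relative-flow training target from a dense H x W x 2 field.
    Steps: estimate camera motion as the median displacement, subtract it,
    then keep the sample only if enough residual motion remains."""
    camera_motion = np.median(flow.reshape(-1, 2), axis=0)   # step 2
    residual = flow - camera_motion
    magnitude = np.linalg.norm(residual, axis=-1)
    keep = (magnitude > motion_thresh).mean() > keep_frac    # step 3
    return residual, keep

# Toy example: a global camera pan of (2, 2) plus one moving object.
flow = np.full((8, 8, 2), 2.0, dtype=np.float32)  # step 1: dense flow
flow[2:5, 2:5] += 3.0                             # object motion on top

residual, keep = relative_flow(flow)
```

After subtraction the static background has near-zero residual flow, so the target isolates the object's motion from the camera's.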

Advancements in Robot Manipulation

FOFPred's first application is robot control, where it is fine-tuned on captioned robot videos to predict future flow from multiple camera views. Coupled with a diffusion policy network, FOFPred achieves strong results on standard benchmarks:

  • CALVIN ABCD: Average task chain length of 4.48.
  • RoboTwin 2.0: Average success rate of 68.6%.

Enhancing Text-to-Video Generation

In text-to-video tasks, FOFPred improves motion control when integrated with the Go-with-the-Flow video diffusion model, yielding measurable gains over baseline methods on metrics such as SSIM and PSNR.
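For reference, PSNR (one of the metrics cited above) can be computed directly from the mean squared error between a reference frame and a generated frame; the sketch below is a standard definition, with toy data, not an evaluation from the paper.

```python
import numpy as np

def psnr(ref, pred, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two frames (higher is
    better). max_val is the maximum possible pixel value."""
    mse = np.mean((ref.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.full((4, 4), 128, dtype=np.uint8)
pred = ref.copy()
pred[0, 0] += 16  # a single corrupted pixel

score = psnr(ref, pred)
```

SSIM additionally compares local luminance, contrast, and structure rather than raw pixel error, which is why the two metrics are usually reported together.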

Key Takeaways

  • FOFPred predicts future optical flow, providing a compact motion representation for downstream applications.
  • Its unified VLM-diffusion architecture enables efficient training, with only the DiT and MLP projectors updated, backed by large-scale video datasets.
  • In robot manipulation, FOFPred demonstrates state-of-the-art success across benchmarks, affirming its practical utility in robotics and motion synthesis tasks.