Apple Researchers Launch StreamBridge to Enable Real-Time Streaming for Video-LLMs
Apple researchers introduced StreamBridge, a framework that upgrades offline Video-LLMs for real-time streaming video understanding with enhanced multi-turn and proactive response capabilities.
Challenges in Streaming Video-LLMs
Traditional Video-LLMs process entire pre-recorded videos in one go, which limits their utility in applications requiring real-time perception such as robotics and autonomous driving. These scenarios demand causal, continuous understanding and prompt responses. Key challenges include maintaining multi-turn real-time understanding by processing recent video segments along with historical context, and enabling proactive response generation where the model anticipates and reacts to visual content without explicit prompts.
Advances in Streaming Video Understanding
Recent efforts such as VideoLLM-online, Flash-VStream, MMDuet, and ViSpeak have introduced specialized training objectives, memory architectures, and dedicated components to handle sequential inputs and proactive responses. Benchmark suites such as StreamingBench, StreamBench, SVBench, OmniMMI, and OVO-Bench help evaluate these streaming capabilities.
Introducing StreamBridge Framework
Apple and Fudan University researchers developed StreamBridge, a framework converting offline Video-LLMs into streaming-capable models. StreamBridge tackles multi-turn real-time understanding through a memory buffer combined with a round-decayed compression strategy that supports extensive context. It also includes a decoupled lightweight activation model to enable proactive responses, seamlessly integrating with existing Video-LLMs.
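The round-decayed compression idea can be illustrated with a minimal sketch: a buffer that keeps tokens from every dialogue round but progressively subsamples older rounds until the total fits a fixed context budget. All names, the decay schedule, and the use of uniform subsampling (in place of learned feature pooling) are assumptions for illustration, not details from the paper.

```python
class RoundDecayedMemory:
    """Illustrative sketch of a memory buffer with round-decayed
    compression: older dialogue rounds are kept at progressively
    lower resolution so total context stays within a budget.
    Names and the decay schedule are hypothetical."""

    def __init__(self, budget, decay=0.5):
        self.budget = budget  # max tokens kept across all rounds
        self.decay = decay    # fraction of tokens retained per compression pass
        self.rounds = []      # list of token lists, oldest first

    def add_round(self, tokens):
        self.rounds.append(list(tokens))
        self._compress()

    def _compress(self):
        # Shrink the oldest shrinkable round until within budget.
        while self._total() > self.budget:
            for r in self.rounds:
                if len(r) > 1:
                    keep = max(1, int(len(r) * self.decay))
                    step = len(r) / keep
                    # Uniform subsampling stands in for feature pooling.
                    r[:] = [r[int(i * step)] for i in range(keep)]
                    break
            else:
                break  # nothing left to shrink

    def _total(self):
        return sum(len(r) for r in self.rounds)

    def context(self):
        # Concatenated compressed history, oldest first.
        return [t for r in self.rounds for t in r]
```

In this toy version, recent rounds survive at full resolution while earlier rounds decay, which mirrors the intuition of preserving fresh context for multi-turn real-time understanding; the separate activation model would sit alongside such a buffer, deciding when a proactive response is warranted.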
Alongside, the Stream-IT dataset was introduced featuring mixed video-text sequences with a variety of instruction formats, designed specifically for streaming video understanding.
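A mixed video-text sequence of the kind Stream-IT describes might look like the following toy sample; the field names and values are invented for illustration and are not taken from the dataset.

```python
# Hypothetical interleaved video-text training sample in the spirit
# of Stream-IT; every field name here is an assumption.
sample = [
    {"type": "video", "source": "clip_000.mp4", "span": [0.0, 4.0]},
    {"type": "user",  "text": "What is the person holding?"},
    {"type": "model", "text": "A red coffee mug."},
    {"type": "video", "source": "clip_001.mp4", "span": [4.0, 9.5]},
    # A model turn with no preceding user prompt mimics the
    # proactive-response setting.
    {"type": "model", "text": "They have now put the mug down."},
]
```

Interleaving video segments between dialogue turns, rather than front-loading a single clip, is what lets fine-tuned models learn to ground each response in the most recent stream content.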
Evaluation and Performance
StreamBridge was evaluated with popular offline Video-LLMs including LLaVA-OV-7B, Qwen2-VL-7B, and Oryx-1.5-7B. The Stream-IT dataset was augmented with around 600K samples from datasets like LLaVA-178K, VCG-Plus, and ShareGPT4Video to preserve general video understanding.
Evaluation on multi-turn real-time tasks using OVO-Bench and StreamingBench showed clear improvements: Qwen2-VL scores rose significantly after Stream-IT fine-tuning, and Oryx-1.5 also posted notable gains. LLaVA-OV initially saw slight performance drops, but fine-tuning recovered and improved its results. Qwen2-VL ultimately outperformed proprietary models such as GPT-4o and Gemini 1.5 Pro, underscoring StreamBridge's effectiveness.
Impact and Future Directions
StreamBridge offers a generalizable solution to transform static offline Video-LLMs into dynamic, responsive models suitable for continuously evolving visual environments. This advancement is crucial for real-time applications in robotics and autonomous driving where timely and proactive video understanding is vital.