UniversalRAG: Dynamic Multimodal Retrieval for Smarter AI Responses

UniversalRAG introduces a dynamic routing framework that efficiently handles multimodal queries by selecting the most relevant modality and granularity for retrieval, outperforming existing RAG systems.

Enhancing LLM Accuracy with Multimodal Retrieval

Retrieval-Augmented Generation (RAG) has significantly improved the factual accuracy of large language models by grounding their outputs in relevant external data. However, most traditional RAG systems focus solely on text-based corpora, limiting their usability when queries require diverse modalities such as images or videos.

Challenges of Existing Multimodal RAG Systems

Recent attempts to extend RAG to multiple modalities often operate within a single modality-specific corpus. This approach restricts their ability to address queries needing multimodal reasoning and typically retrieves from all modalities indiscriminately, reducing efficiency and adaptability.

The Need for Adaptive Modality and Granularity Selection

To overcome these limitations, adaptive RAG systems must determine the most relevant modality and the appropriate retrieval granularity based on query context. Strategies include routing queries based on complexity and model confidence, as well as indexing corpora at finer levels like paragraphs or specific video clips to improve retrieval relevance.
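
The sketch below is not from the paper; the corpus format and the paragraph-splitting rule are illustrative assumptions. It shows what indexing a single text corpus at both document and paragraph granularity can look like, so that a router can later pick whichever level fits the query.

```python
# Hypothetical sketch: index one text corpus at two granularity levels so the
# retriever can choose coarse (document) or fine (paragraph) units per query.
def build_granular_index(documents):
    """Return document-level (coarse) and paragraph-level (fine) entries."""
    doc_index, para_index = [], []
    for doc_id, text in documents.items():
        doc_index.append({"id": doc_id, "text": text})  # coarse unit
        paragraphs = [p for p in text.split("\n\n") if p.strip()]
        for i, para in enumerate(paragraphs):
            para_index.append({"id": f"{doc_id}#p{i}", "text": para})  # fine unit
    return doc_index, para_index

docs = {"doc1": "First paragraph about RAG.\n\nSecond paragraph with details."}
coarse, fine = build_granular_index(docs)
print(len(coarse), len(fine))  # -> 1 document entry, 2 paragraph entries
```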

Introducing UniversalRAG: A Next-Gen Framework

Researchers from KAIST and DeepAuto.ai have developed UniversalRAG, a novel RAG framework that dynamically routes queries across modality-specific sources (text, image, video) and multiple granularity levels. Unlike traditional methods that embed all modalities into a shared space, UniversalRAG employs a modality-aware routing mechanism to select the most relevant corpus dynamically.

Each modality is organized into granularity-specific corpora, such as paragraphs or video clips, enhancing retrieval precision. Validated on eight multimodal benchmarks, UniversalRAG consistently outperforms both unified and modality-specific baselines, showcasing its versatility.

How UniversalRAG Works

UniversalRAG separates knowledge into text, image, and video corpora with fine- and coarse-grained levels. A routing module determines the optimal modality and granularity for each query, selecting from options like paragraphs, full documents, video clips, or entire videos. This router can be a training-free LLM-based classifier or a trained model using heuristic labels.
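
As a rough illustration of the training-free option, the snippet below prompts an LLM to pick one of six routes. The prompt wording, the fallback behavior, and the `call_llm` helper are assumptions made for this sketch, not the authors' implementation.

```python
# Minimal sketch of a training-free, prompt-based router. `call_llm` is a
# placeholder for any chat-style LLM call that returns a short string.
ROUTES = ["none", "paragraph", "document", "image", "clip", "video"]

ROUTER_PROMPT = """Decide which knowledge source best answers the query.
Options: none (no retrieval), paragraph, document, image, clip, video.
Query: {query}
Answer with exactly one option."""

def route_query(query, call_llm):
    """Ask the LLM for a modality/granularity route; fall back to 'paragraph'."""
    answer = call_llm(ROUTER_PROMPT.format(query=query)).strip().lower()
    return answer if answer in ROUTES else "paragraph"

# Example with a stubbed LLM call:
print(route_query("Show the moment the winning goal is scored",
                  call_llm=lambda prompt: "clip"))  # -> clip
```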

An LVLM (large vision-language model) then utilizes the selected content to generate accurate responses.
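
Putting the two steps together, a route-then-retrieve-then-generate loop might look like the following. The retriever objects and the `lvlm_generate` callable are hypothetical stand-ins for real dense retrievers and an LVLM, not UniversalRAG's actual code.

```python
# Hedged end-to-end sketch of the route -> retrieve -> generate flow.
def answer(query, retrievers, route_fn, lvlm_generate):
    """Route the query, retrieve from the chosen corpus, then generate."""
    route = route_fn(query)                      # e.g. "paragraph", "clip", "none"
    if route == "none":
        context = []                             # rely on the model's own knowledge
    else:
        context = retrievers[route].search(query, top_k=3)
    return lvlm_generate(query=query, context=context)

# Usage with stub components:
class StubRetriever:
    def __init__(self, items): self.items = items
    def search(self, query, top_k=3): return self.items[:top_k]

retrievers = {"paragraph": StubRetriever(["RAG grounds LLM outputs in retrieved text."])}
print(answer("What does RAG do?", retrievers,
             route_fn=lambda q: "paragraph",
             lvlm_generate=lambda query, context: f"Answer using: {context}"))
```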

Comprehensive Benchmarking and Performance

The system was evaluated across six retrieval scenarios: no retrieval, paragraph, document, image, clip, and video. Diverse datasets such as MMLU, SQuAD, Natural Questions, HotpotQA, WebQA, LVBench, and VideoRAG were used to cover various modalities and retrieval granularities.

Advancing Multimodal Reasoning with Flexible Retrieval

UniversalRAG’s dynamic routing addresses the modality gaps and rigid retrieval structures found in existing RAG methods. This flexibility, combined with fine-grained retrieval and support for both trained and training-free routing mechanisms, enables robust and efficient multimodal reasoning, pushing forward the capabilities of retrieval-augmented generation.
