MetaEmbed: Serve-Time Tuning for Multimodal Retrieval with Learnable Meta Tokens

MetaEmbed introduces a single, practical control surface for multimodal retrieval: choose at serve time how many compact, learnable “Meta Tokens” to use on queries and candidates, trading off accuracy against latency and index size.

Core idea

MetaEmbed avoids the extremes of collapsing items into one vector (CLIP-style) or exploding into hundreds of patch/token vectors (ColBERT-style). During training, the model appends a fixed, learnable set of Meta Tokens to each input. At inference, the final hidden states of those Meta Tokens are reused as a small set of multi-vector embeddings. That gives an adjustable granularity for cross-modal matching without retraining.
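
To make the mechanism concrete, here is a minimal sketch (not the authors' code) of appending learnable Meta Tokens to a sequence and reusing their final hidden states as a multi-vector embedding. The toy Transformer backbone, dimensions, and token count are illustrative stand-ins for the multimodal LLM used in the paper.

```python
# Minimal sketch: learnable Meta Tokens appended to the input; their final
# hidden states become the multi-vector embedding. All sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MetaTokenEncoder(nn.Module):
    def __init__(self, dim: int = 256, num_meta_tokens: int = 16):
        super().__init__()
        # Fixed set of learnable Meta Tokens, shared across all inputs.
        self.meta_tokens = nn.Parameter(torch.randn(num_meta_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        """token_embeds: (batch, seq_len, dim) — already-embedded text/image tokens."""
        b = token_embeds.size(0)
        meta = self.meta_tokens.unsqueeze(0).expand(b, -1, -1)
        # Append Meta Tokens to the end of the input sequence.
        x = torch.cat([token_embeds, meta], dim=1)
        h = self.backbone(x)
        # Keep only the final hidden states of the Meta Tokens as the
        # multi-vector embedding; L2-normalize for MaxSim scoring.
        m = h[:, -meta.size(1):, :]
        return F.normalize(m, dim=-1)


encoder = MetaTokenEncoder()
embeds = encoder(torch.randn(2, 50, 256))  # -> (2, 16, 256)
```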

Matryoshka Multi-Vector Retrieval (MMR)

To make per-budget selection meaningful, Meta Tokens are organized in prefix-nested groups (a Matryoshka structure) so that each prefix is independently discriminative. Training with Matryoshka Multi-Vector Retrieval (MMR) means the model learns representations that work at multiple granularities. At inference you pick a retrieval budget, a pair (r_q, r_c), specifying how many query-side and candidate-side Meta Tokens to use (for example (1, 1), (2, 4), (4, 8), (8, 16), or (16, 64)).
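
A minimal sketch of how MMR training might look, assuming an in-batch contrastive objective averaged over budgets; the loss form, temperature, and shapes are assumptions, not the paper's exact recipe. The key point is that every budget scores with the first r_q query-side and r_c candidate-side Meta Tokens, which produces the prefix nesting.

```python
# Assumed MMR training loss: in-batch InfoNCE computed at several (r_q, r_c)
# prefix budgets and averaged, so each Meta Token prefix learns to be
# independently discriminative.
import torch
import torch.nn.functional as F

BUDGETS = [(1, 1), (2, 4), (4, 8), (8, 16), (16, 64)]  # (r_q, r_c) pairs


def maxsim(q: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction.
    q: (B, r_q, d) queries; c: (B, r_c, d) candidates; returns (B, B) scores."""
    sim = torch.einsum("qnd,cmd->qcnm", q, c)   # all pairwise token similarities
    return sim.max(dim=-1).values.sum(dim=-1)   # max over cand tokens, sum over query tokens


def mmr_loss(q_meta: torch.Tensor, c_meta: torch.Tensor, tau: float = 0.05):
    """q_meta: (B, 16, d), c_meta: (B, 64, d), L2-normalized Meta Token states."""
    labels = torch.arange(q_meta.size(0))
    losses = []
    for r_q, r_c in BUDGETS:
        # Prefix-nested selection: budget (r_q, r_c) uses the FIRST r_q / r_c
        # Meta Tokens, giving the Matryoshka nesting.
        scores = maxsim(q_meta[:, :r_q], c_meta[:, :r_c]) / tau
        losses.append(F.cross_entropy(scores, labels))
    return torch.stack(losses).mean()


q = F.normalize(torch.randn(8, 16, 128), dim=-1)
c = F.normalize(torch.randn(8, 64, 128), dim=-1)
print(mmr_loss(q, c))
```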

Scoring and late interaction

MetaEmbed uses a ColBERT-like MaxSim late-interaction over L2-normalized Meta Token embeddings. This preserves fine-grained cross-modal detail while keeping the vector set compact and enabling budgeted MaxSim scoring.
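
The budgeted scoring itself is a few lines. This sketch assumes candidate Meta Tokens are pre-encoded into a dense index tensor; the layout, shapes, and budget are illustrative, not the paper's implementation.

```python
# Budgeted MaxSim scoring at serve time over L2-normalized Meta Token
# embeddings. Index layout is an assumption for illustration.
import torch


def score_candidates(q: torch.Tensor, index: torch.Tensor,
                     r_q: int, r_c: int) -> torch.Tensor:
    """q: (16, d) query Meta Tokens; index: (N, 64, d) candidate Meta Tokens.
    Returns (N,) MaxSim scores at budget (r_q, r_c)."""
    qb = q[:r_q]                    # first r_q query tokens (Matryoshka prefix)
    cb = index[:, :r_c]             # first r_c candidate tokens per item
    sim = torch.einsum("nd,cmd->cnm", qb, cb)   # (N, r_q, r_c) dot products
    # For each query token, take its best-matching candidate token, then sum.
    return sim.max(dim=-1).values.sum(dim=-1)


q = torch.nn.functional.normalize(torch.randn(16, 128), dim=-1)
index = torch.nn.functional.normalize(torch.randn(1000, 64, 128), dim=-1)
top10 = score_candidates(q, index, r_q=4, r_c=8).topk(10)
```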

Benchmarks

MetaEmbed was evaluated on MMEB (Massive Multimodal Embedding Benchmark) and ViDoRe v2 (Visual Document Retrieval), both designed to stress retrieval across modalities and realistic document queries.

On MMEB with Qwen2.5-VL backbones, MetaEmbed reports overall scores at the largest budget, (16, 64): 69.1 at 3B, 76.6 at 7B, and 78.7 at 32B. Results grow monotonically with budget, and the gains widen with model scale. On ViDoRe v2 the method improves average nDCG@5 over both a single-vector baseline and a naive fixed-length multi-vector baseline trained identically; the advantage increases at higher budgets.

Ablations show that MMR is critical for test-time scaling. Disabling MMR (the NoMMR ablation) collapses performance at low budgets; with MMR enabled, MetaEmbed matches or exceeds single-vector baselines across budgets and model sizes.

Efficiency and memory profile

The paper reports scoring cost and index memory on an A100 with 100k candidates per query and a scoring batch size of 1,000. Both grow as the budget increases from (1, 1) to (16, 64), since each additional candidate-side Meta Token adds vectors to store and compare.

Crucially, query encoding dominates end-to-end latency. Encoding an image query with 1,024 tokens costs about 42.72 TFLOPs and 788 ms, several orders of magnitude more than scoring a moderate candidate set. Operators should therefore prioritize encoder throughput and decide index placement (GPU vs. CPU) based on memory pressure, offloading the index when needed.
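
For capacity planning, a back-of-envelope sizing helper makes the placement tradeoff tangible. The 100k-candidate count comes from the setup above; the embedding dimension and fp16 storage here are assumptions for illustration, not the paper's reported figures.

```python
# Rough index-size estimate per candidate-side budget r_c.
# Assumptions: dim=1024 and fp16 (2 bytes/float) are illustrative defaults.
def index_bytes(num_candidates: int, r_c: int, dim: int = 1024,
                bytes_per_float: int = 2) -> int:
    """Bytes needed to store r_c Meta Token vectors per candidate."""
    return num_candidates * r_c * dim * bytes_per_float


for r_c in (1, 4, 8, 16, 64):
    gb = index_bytes(100_000, r_c) / 1e9
    print(f"r_c={r_c:>2}: {gb:6.2f} GB for 100k candidates")
```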

How MetaEmbed compares

Against CLIP-style single-vector retrieval, MetaEmbed keeps a compact index and fast scoring while recovering fine-grained cross-modal detail through late interaction. Against ColBERT-style retrieval over hundreds of patch/token vectors, it cuts index size and scoring cost to a handful of Meta Tokens per candidate. The distinguishing feature is the serve-time dial: a single trained model covers the whole accuracy/cost spectrum, whereas each baseline fixes one operating point.

Practical takeaways

The takeaways are direct: use small budgets for high-throughput first-stage retrieval and large budgets for precise re-ranking, from the same model and index; train with MMR, since low-budget quality collapses without it; and optimize the query encoder first, because it dominates end-to-end latency. MetaEmbed thus offers a clear, actionable recipe for retrieval stacks that must reconcile fast recall with precise re-ranking across image–text and document scenarios. The paper and accompanying resources include detailed benchmarks and implementation notes for teams that want to adopt the approach.