MetaEmbed: Serve-Time Tuning for Multimodal Retrieval with Learnable Meta Tokens
MetaEmbed introduces a single, practical control surface for multimodal retrieval: choose how many compact, learnable “Meta Tokens” to use on queries and candidates at serve time to trade off accuracy against latency and index size.
Core idea
MetaEmbed avoids the extremes of collapsing items into one vector (CLIP-style) or exploding into hundreds of patch/token vectors (ColBERT-style). During training, the model appends a fixed, learnable set of Meta Tokens to each input. At inference, the final hidden states of those Meta Tokens are reused as a small set of multi-vector embeddings. That gives an adjustable granularity for cross-modal matching without retraining.
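A minimal sketch of that mechanism, assuming a PyTorch-style backbone that maps token embeddings to hidden states; the class name, shapes, and initialization here are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaTokenEncoder(nn.Module):
    """Illustrative sketch: append learnable Meta Tokens to the input sequence
    and read their final hidden states out as a compact multi-vector embedding."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, num_meta_tokens: int = 64):
        super().__init__()
        self.backbone = backbone  # assumed to map (batch, seq, dim) -> (batch, seq, dim)
        # One learnable embedding per Meta Token, shared across all inputs.
        self.meta_tokens = nn.Parameter(torch.randn(num_meta_tokens, hidden_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden_dim) token/patch embeddings.
        batch = input_embeds.size(0)
        meta = self.meta_tokens.unsqueeze(0).expand(batch, -1, -1)
        # Append the Meta Tokens after the ordinary input tokens.
        hidden = self.backbone(torch.cat([input_embeds, meta], dim=1))
        # Keep only the Meta Token positions; L2-normalize for MaxSim scoring.
        meta_hidden = hidden[:, -self.meta_tokens.size(0):, :]
        return F.normalize(meta_hidden, dim=-1)
```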
Matryoshka Multi-Vector Retrieval (MMR)
To make per-budget selection meaningful, Meta Tokens are organized into prefix-nested groups (a Matryoshka structure) so each prefix is independently discriminative. Training with Matryoshka Multi-Vector Retrieval (MMR) means the model learns representations that work at multiple granularities. At inference you pick a retrieval budget, a tuple (r_q, r_c), specifying how many query-side and candidate-side Meta Tokens to use (examples: (1,1), (2,4), (4,8), (8,16), (16,64)).
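A sketch of what an MMR-style objective could look like, assuming an in-batch InfoNCE loss over MaxSim scores at every nested budget; the temperature, batch layout, and budget list are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

# Example nested budgets (r_q, r_c) from the text. The Meta Tokens are ordered so
# that the first r_q / r_c vectors form the embedding used at the smaller budgets.
BUDGETS = [(1, 1), (2, 4), (4, 8), (8, 16), (16, 64)]

def mmr_loss(q_meta: torch.Tensor, c_meta: torch.Tensor, temperature: float = 0.02) -> torch.Tensor:
    """Average an in-batch contrastive loss over every nested budget so each
    prefix is discriminative on its own (an assumed InfoNCE-style objective).
    q_meta: (B, 16, d), c_meta: (B, 64, d), L2-normalized; positives aligned by index."""
    labels = torch.arange(q_meta.size(0), device=q_meta.device)
    losses = []
    for r_q, r_c in BUDGETS:
        # MaxSim score between every query prefix and every candidate prefix in the batch.
        sim = torch.einsum("qrd,csd->qcrs", q_meta[:, :r_q], c_meta[:, :r_c])
        scores = sim.max(dim=-1).values.sum(dim=-1) / temperature
        losses.append(F.cross_entropy(scores, labels))
    return torch.stack(losses).mean()
```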
Scoring and late interaction
MetaEmbed uses a ColBERT-like MaxSim late-interaction over L2-normalized Meta Token embeddings. This preserves fine-grained cross-modal detail while keeping the vector set compact and enabling budgeted MaxSim scoring.
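A compact sketch of budgeted MaxSim scoring under these assumptions (L2-normalized Meta Token embeddings, a single query scored against a pre-encoded candidate bank); the tensor shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def maxsim_scores(query_meta: torch.Tensor, cand_meta: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: for each query vector take the best-matching
    candidate vector, then sum over query vectors.
    query_meta: (r_q, d); cand_meta: (num_candidates, r_c, d). Returns (num_candidates,) scores."""
    q = F.normalize(query_meta, dim=-1)
    c = F.normalize(cand_meta, dim=-1)
    sim = torch.einsum("rd,nsd->nrs", q, c)     # (num_candidates, r_q, r_c)
    return sim.max(dim=-1).values.sum(dim=-1)   # max over candidate tokens, sum over query tokens

# Usage at budget (r_q, r_c) = (8, 16), slicing Matryoshka prefixes of hypothetical
# tensors: scores = maxsim_scores(query_tokens[:8], candidate_bank[:, :16])
```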
Benchmarks
MetaEmbed was evaluated on MMEB (Massive Multimodal Embedding Benchmark) and ViDoRe v2 (Visual Document Retrieval), both designed to stress retrieval across modalities and realistic document queries.
On MMEB with Qwen2.5-VL backbones, MetaEmbed reports overall scores at the largest budget (16,64): 3B = 69.1, 7B = 76.6, 32B = 78.7. Results grow monotonically with budget, and the gains widen with model scale. On ViDoRe v2 the method improves average nDCG@5 versus a single-vector baseline and a naive fixed-length multi-vector baseline trained identically; the advantage increases at higher budgets.
Ablations show that MMR is critical for test-time scaling. Disabling MMR (NoMMR) collapses performance at low budgets; with MMR enabled MetaEmbed matches or exceeds single-vector baselines across budgets and sizes.
Efficiency and memory profile
The paper reports scoring cost and index memory on an A100 with 100k candidates per query and a scoring batch size of 1,000. As the budget grows from (1,1) to (16,64):
- Scoring FLOPs increase from 0.71 GFLOPs to 733.89 GFLOPs.
- Scoring latency grows from 1.67 ms to 6.25 ms.
- bfloat16 index memory grows from 0.68 GiB to 42.72 GiB.
Crucially, query encoding dominates end-to-end latency. Encoding an image query with 1,024 tokens costs about 42.72 TFLOPs and 788 ms, several orders of magnitude more than scoring a moderate candidate set. Operators should therefore prioritize encoder throughput and balance index placement (GPU vs. CPU), offloading indexes when needed.
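As a rough sanity check on the index figures above, a back-of-the-envelope estimate, assuming 3584-dimensional embeddings (the Qwen2.5-VL-7B hidden size, an assumption not stated here) stored as 2-byte bfloat16 values:

```python
def index_bytes(num_candidates: int, r_c: int, dim: int, bytes_per_value: int = 2) -> int:
    """Index memory grows linearly with the candidate-side budget r_c."""
    return num_candidates * r_c * dim * bytes_per_value

# Assumed embedding dimension of 3584; bfloat16 entries are 2 bytes each.
for r_c in (1, 64):
    print(f"r_c={r_c:>2}: {index_bytes(100_000, r_c, 3584) / 2**30:.2f} GiB")
# Prints roughly 0.67 GiB and 42.72 GiB, in line with the reported index sizes.
```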
How MetaEmbed compares
- Single-vector (CLIP-style): minimal index and fast dot-product scoring, but limited instruction sensitivity and compositional detail. MetaEmbed improves precision with a small contextual multi-vector set while still encoding queries and candidates independently.
- Naive multi-vector (ColBERT-style): rich token-level detail but prohibitive index size and compute when images appear on both sides. MetaEmbed reduces vector counts by orders of magnitude with a few Meta Tokens and enables budgeted MaxSim.
Practical takeaways
- One model, many budgets: train once, then choose (r_q, r_c) at serve time to trade recall against cost. Low budgets suit initial recall; higher budgets can be used for re-ranking (see the sketch after this list).
- Optimize the encoder: image tokenization and VLM throughput are the main latency drivers; scoring is comparatively lightweight for typical candidate set sizes.
- Plan memory: index size grows linearly with budget, so decide on sharding and placement (GPU vs CPU) based on expected budgets and traffic.
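A hypothetical two-stage pipeline built on these takeaways: a cheap low-budget pass over the full index for recall, then a high-budget MaxSim re-rank of the shortlist. Budgets, shortlist size, and function names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def maxsim(q: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Budgeted MaxSim: q is (r_q, d), c is (N, r_c, d); returns (N,) scores."""
    sim = torch.einsum("rd,nsd->nrs", F.normalize(q, dim=-1), F.normalize(c, dim=-1))
    return sim.max(dim=-1).values.sum(dim=-1)

def two_stage_search(query_meta: torch.Tensor, index_meta: torch.Tensor,
                     recall_budget=(1, 1), rerank_budget=(16, 64),
                     shortlist: int = 1000, top_k: int = 10) -> torch.Tensor:
    """Hypothetical pipeline: low-budget recall over the full index,
    then a high-budget re-rank over the shortlist.
    query_meta: (16, d); index_meta: (N, 64, d). Returns top-k candidate ids."""
    rq1, rc1 = recall_budget
    coarse = maxsim(query_meta[:rq1], index_meta[:, :rc1])
    cand_ids = coarse.topk(min(shortlist, index_meta.size(0))).indices

    rq2, rc2 = rerank_budget
    fine = maxsim(query_meta[:rq2], index_meta[cand_ids, :rc2])
    return cand_ids[fine.topk(min(top_k, cand_ids.numel())).indices]
```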
MetaEmbed provides a clear, actionable recipe for retrieval stacks that must reconcile fast recall with precise re-ranking across image–text and document scenarios. The paper and accompanying resources include detailed benchmarks and implementation notes for teams that want to adopt the approach.