xRouter: RL-Powered Router That Cuts LLM Offloading Costs by Up to 80%
Salesforce AI's xRouter uses reinforcement learning and a cost-aware reward to route queries among 20+ LLMs, approaching top-model accuracy while cutting offloading costs dramatically
What xRouter is
xRouter is a tool-calling orchestration system from Salesforce AI Research that uses reinforcement learning to decide when to answer a query locally and when to call external large language models. The router backbone is Qwen2.5-7B-Instruct, trained with tool-calling capabilities to invoke downstream models, craft prompts for them, and either synthesize or select a final answer.
Architecture and routing backbone
The router acts over a heterogeneous pool of more than 20 LLM tools spanning premium, standard, budget and specialized tiers. The full catalog includes models such as GPT-5, GPT-4.1, GPT-5-Mini, GPT-5-Nano, o3, Kimi K2, DeepSeek-R1, Qwen3 variants and multiple GPT-OSS models. An offloading pool of 12 models is used for experiments and includes GPT-5, GPT-4o and several compact and open-source variants.
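To make the setup concrete, here is a minimal sketch of how such a tiered tool catalog could be represented. The model names mirror those mentioned above, but the tier assignments, per-call prices and descriptions are placeholders, not figures from the paper.

```python
from dataclasses import dataclass

@dataclass
class ModelTool:
    """One entry in the router's tool catalog (fields are illustrative)."""
    name: str              # identifier the router uses in tool calls
    tier: str              # "premium" | "standard" | "budget" | "specialized"
    price_per_call: float  # placeholder per-call price, not real pricing
    description: str       # short capability summary shown to the router

# Illustrative slice of a 20+-model catalog; prices here are made up.
CATALOG = [
    ModelTool("gpt-5",       "premium",     1.00, "strongest general reasoning"),
    ModelTool("gpt-5-mini",  "standard",    0.20, "cheaper general-purpose model"),
    ModelTool("deepseek-r1", "specialized", 0.30, "long-form reasoning"),
    ModelTool("qwen3-8b",    "budget",      0.02, "compact open-source model"),
]
```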
xRouter exposes an OpenAI-compatible API and implements routing decisions as function-style tool calls. Training uses DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) within the verl reinforcement learning framework.
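A minimal sketch of how a downstream-model call might be exposed to the router as an OpenAI-style tool. The tool name call_model and its parameters are hypothetical; the paper's exact schema may differ.

```python
# Hypothetical tool schema in the OpenAI function-calling format.
# xRouter's actual tool names and parameters may differ.
CALL_MODEL_TOOL = {
    "type": "function",
    "function": {
        "name": "call_model",
        "description": "Offload the query to a downstream LLM and return its reply.",
        "parameters": {
            "type": "object",
            "properties": {
                "model": {"type": "string", "description": "Model id from the catalog."},
                "prompt": {"type": "string", "description": "Prompt crafted by the router."},
            },
            "required": ["model", "prompt"],
        },
    },
}
```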
Cost-aware reward and success gating
Routing is framed as an RL problem with a cost-aware objective. Each episode's reward combines a binary success signal and a cost penalty. Concretely, reward = quality − λ × normalized_cost, where λ is a cost penalty coefficient. If the final answer is incorrect, the trajectory receives zero reward regardless of cost. This success-gated, cost-shaped objective forces the router to prioritize correctness first, then reduce cost among correct strategies.
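As a rough illustration of this objective, a reward sketch might look as follows. The normalization by a maximum cost and the unit quality score for a correct answer are assumptions; only the success gating and the λ penalty follow the description above.

```python
def route_reward(correct: bool, episode_cost: float, max_cost: float, lam: float) -> float:
    """Success-gated, cost-shaped reward: reward = quality - lam * normalized_cost."""
    if not correct:
        return 0.0                            # wrong answers earn nothing, whatever they cost
    normalized_cost = episode_cost / max_cost  # assumed normalization scheme
    return 1.0 - lam * normalized_cost         # quality taken as 1.0 for a correct answer
```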
The team trains three variants by varying the cost penalty, producing xRouter-7B-1, xRouter-7B-2 and xRouter-7B-3, which trade off accuracy and cost differently.
Training data and robustness techniques
Training uses Reasoning360, a dataset of math, code and general reasoning tasks with difficulty bands estimated by a strong reference model (Qwen3-32B). Samples are stratified into easy, medium and hard, and the dataset is augmented with simple chit-chat, retrieval and factual prompts so the router learns when it can answer directly. Each sample includes model descriptions and per-call prices; the catalog and costs are periodically perturbed to avoid overfitting to a static price table.
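The perturbation idea could be sketched roughly as below. The jitter range and the catalog shuffling are assumptions; only the principle of randomizing prices and catalog entries comes from the paper.

```python
import random

def perturb_catalog(catalog: list[dict], jitter: float = 0.3, seed: int | None = None) -> list[dict]:
    """Rescale per-call prices and reshuffle the catalog shown to the router."""
    rng = random.Random(seed)
    perturbed = []
    for entry in catalog:
        scale = 1.0 + rng.uniform(-jitter, jitter)  # assumed +/-30% price noise
        perturbed.append({**entry, "price_per_call": entry["price_per_call"] * scale})
    rng.shuffle(perturbed)                          # avoid positional shortcuts
    return perturbed
```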
Failed trajectories, such as an expensive model returning a wrong answer or an unnecessary offload, still incur their full cost but yield zero reward. This produces a clear learning signal: correctness gates the reward, and cost shapes the policy among successful trajectories.
Inference behavior and execution modes
At inference, the router supports three modes: (1) answer directly from the backbone, (2) call one or more downstream models and synthesize a response, or (3) call downstream models and use a select_response tool to pick one reply as final. These flows are executed through an OpenAI-style function-call interface, with model calls handled via LiteLLM and SGLang.
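A hedged sketch of how one offload call could be dispatched through LiteLLM, assuming the hypothetical call_model schema above; the synthesis loop and select_response handling are omitted.

```python
import json
import litellm  # unified completion() interface across providers

def execute_tool_call(tool_call) -> str:
    """Run one router-issued offload call and return the downstream model's reply."""
    args = json.loads(tool_call.function.arguments)  # {"model": ..., "prompt": ...}
    reply = litellm.completion(
        model=args["model"],                          # e.g. "openai/gpt-5-mini"
        messages=[{"role": "user", "content": args["prompt"]}],
    )
    return reply.choices[0].message.content
```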
Empirically, trained xRouter models mix direct and synthesized responses. Many off-the-shelf models prompted to act as routers (GPT-4o, GPT-4.1, GPT-5 and various Qwen models) tend to respond directly even when instructed to offload, which partly explains xRouter's efficiency gains.
Quantitative results and cost utility
Across a range of benchmarks, including Minerva, MATH-500, OlympiadBench, AIME-24, AMC-23, Codeforces, CodeContests and HumanEval+, xRouter-7B variants consistently improve accuracy over the same untrained base model used as a router. For example, xRouter-7B-2 reaches near-GPT-5 accuracy on OlympiadBench at roughly one eighth of GPT-5's evaluation cost.
System-level comparisons on LiveCodeBench v5, GPQA-Diamond, AIME-25, MT-Bench and others show xRouter-7B-3 achieving the top average accuracy on some suites at moderate cost. On tasks like GPQA, xRouter variants reach about 80–90% of GPT-5's accuracy while consuming under one fifth of the cost. The authors report cost reductions of 60–80% across different evaluation setups, and the model card cites up to 60% cost reduction at comparable quality in other settings.
The team defines cost utility as accuracy divided by cost. Small, low-cost open-source models can achieve high cost utility but lower absolute accuracy. xRouter sits between those extremes, sacrificing some cost utility to deliver stronger absolute performance, which aligns with production priorities.
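For reference, the metric itself is trivial to compute; the numbers in the comment below are invented purely to illustrate the trade-off, not results from the paper.

```python
def cost_utility(accuracy: float, cost_usd: float) -> float:
    """Cost utility as defined above: accuracy per unit of evaluation cost."""
    return accuracy / cost_usd

# Invented example: a cheap open model at 40% accuracy for $0.02 has utility 20.0;
# a router at 70% accuracy for $0.10 has utility 7.0 but much higher absolute accuracy.
```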
Practical implications
xRouter demonstrates that a midsized, RL-trained router can approach the accuracy of top-tier models while significantly reducing offloading costs. The success-gated, cost-shaped reward and training on difficulty-stratified Reasoning360 teach the router when to answer itself and when to delegate, making it a practical solution for orchestrating heterogeneous LLM fleets in cost-sensitive production environments.
For further details see the paper and model weights linked by the authors.