AgentFlow: Planner-Only RL and Flow-GRPO for Modular, Tool-Using Agents
What is AgentFlow?
AgentFlow is a trainable framework that structures tool-using AI agents into four modular components: Planner, Executor, Verifier, and Generator. An explicit, evolving memory keeps track of states, tool calls, and verification signals. The Planner proposes sub-goals and chooses tools and contexts each turn, the Executor invokes the selected tool, the Verifier decides whether to continue, and the Generator emits the final answer when the process terminates. Only the Planner is trained in the loop; the other modules can be fixed engines.
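To make the turn structure concrete, here is a minimal sketch of how the four modules and the evolving memory could interact in a single control loop. The interfaces (plan, execute, verify, generate, Memory, verdict.done) are assumptions for illustration, not the project's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Evolving memory of sub-goals, tool calls, results, and verification signals."""
    records: list = field(default_factory=list)

    def add(self, subgoal, tool, result, verdict):
        self.records.append(
            {"subgoal": subgoal, "tool": tool, "result": result, "verdict": verdict}
        )

def run_agent(query, planner, executor, verifier, generator, max_turns=10):
    memory = Memory()
    for _ in range(max_turns):
        # Planner: propose a sub-goal and choose a tool plus the context it needs.
        subgoal, tool, context = planner.plan(query, memory)
        # Executor: invoke the selected tool.
        result = executor.execute(tool, subgoal, context)
        # Verifier: decide whether to stop or take another turn.
        verdict = verifier.verify(query, memory, result)
        memory.add(subgoal, tool, result, verdict)
        if verdict.done:
            break
    # Generator: emit the final answer from the accumulated memory.
    return generator.generate(query, memory)
```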
Architecture and memory
The framework formalizes multi-turn, tool-integrated reasoning as a Markov Decision Process (MDP). Memory is structured to record trajectory elements while constraining context growth, which makes it easier to inspect and verify the agent's decisions, intermediate steps, and tool calls across long interactions.
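One way to picture the bounded-memory design is the sketch below, which renders each turn's record as a compact, fixed-schema line and truncates long tool outputs before they re-enter the Planner's context. The field names and the truncation limit are assumptions for illustration, not the paper's exact format.

```python
MAX_RESULT_CHARS = 500  # assumed cap on retained raw tool output

def render_context(query, records):
    """Render memory as an auditable, size-bounded context for the Planner."""
    lines = [f"Task: {query}"]
    for i, r in enumerate(records, start=1):
        result = str(r["result"])
        if len(result) > MAX_RESULT_CHARS:
            # Keep the structured fields intact; truncate only the raw output.
            result = result[:MAX_RESULT_CHARS] + " ...[truncated]"
        lines.append(
            f"Turn {i}: sub-goal={r['subgoal']!r} | tool={r['tool']} | "
            f"verdict={r['verdict']} | result={result}"
        )
    return "\n".join(lines)
```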
Flow-GRPO: training the Planner in the loop
Flow-GRPO, short for Flow-based Group Refined Policy Optimization, converts long-horizon, sparse-reward problems into tractable single-turn updates.
- Final-outcome reward broadcast: a trajectory-level correctness signal, produced by an LLM-as-judge, is assigned to every turn. This aligns local planning actions with the global outcome.
- Token-level clipped objective: importance-weighted ratios are computed at the token level, using a PPO-style clipping mechanism and a KL penalty to a reference policy to avoid catastrophic drift.
- Group-normalized advantages: variance reduction is achieved by normalizing advantages across groups of on-policy rollouts, stabilizing updates.
Only the Planner receives these on-policy updates, while the Executor, Verifier, and Generator can remain as fixed, high-quality engines.
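The following is a minimal sketch of a Flow-GRPO-style loss for the Planner, assuming per-token log-probs are available from the current, rollout-time, and frozen reference policies, and approximating the per-token KL by a simple log-prob difference. Hyperparameter names such as clip_eps and kl_coef are illustrative; the paper's exact objective may differ.

```python
import torch

def flow_grpo_loss(logp_new, logp_old, logp_ref, rewards, mask,
                   clip_eps=0.2, kl_coef=0.01):
    """Planner-only Flow-GRPO-style loss over a group of G rollouts.

    logp_new, logp_old, logp_ref: [G, T] per-token log-probs of the Planner's
        actions under the current, rollout-time, and frozen reference policies.
    rewards: [G] trajectory-level correctness scores (e.g. 0/1 from the judge).
    mask: [G, T] 1 for Planner action tokens, 0 for padding.
    """
    # Group-normalized advantage: compare each rollout against its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Final-outcome reward broadcast: the same advantage is assigned to every
    # turn and every token of the corresponding rollout.
    adv = adv[:, None].expand_as(logp_new)

    # Token-level importance ratios with PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_term = torch.min(ratio * adv, clipped * adv)

    # Simple per-token KL estimate against the frozen reference policy.
    kl = logp_new - logp_ref

    per_token_loss = -(policy_term - kl_coef * kl)
    return (per_token_loss * mask).sum() / mask.sum()
```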
Benchmarks and results
The team evaluated AgentFlow across ten benchmarks spanning four task families: knowledge-intensive search (Bamboogle, 2Wiki, HotpotQA, Musique), agentic reasoning (GAIA textual split), math (AIME-24, AMC-23, Game of 24), and science (GPQA, MedQA). Using a 7B backbone tuned with Flow-GRPO, the reported average improvements over strong baselines are:
- Search: +14.9%
- Agentic: +14.0%
- Math: +14.5%
- Science: +4.1%
The authors report that their 7B system surpasses GPT-4o on the evaluated suite. Training effects include improved planning quality, fewer tool-calling errors (up to 28.4% reduction on GAIA), and benefits from larger turn budgets and model scale. Ablation studies show online Flow-GRPO yields a +17.2% gain versus a frozen-planner baseline, while offline supervised fine-tuning of the planner degrades performance by −19.0% on their composite metric.
Implementation and licensing
A public implementation accompanies the paper, with a modular toolkit exposing components such as base_generator, python_coder, google_search, wikipedia_search, and web_search. The repository includes quick-start scripts for inference, training, and benchmarking, and is released under an MIT license. The technical paper is available at https://arxiv.org/pdf/2510.05592.
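As an illustration of the modular-toolkit idea (and explicitly not the repository's actual API), a small tool registry like the one below lets an Executor dispatch whichever tool name the Planner selects; only the tool names themselves come from the release, and the bodies are placeholders.

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Register a tool under the name the Planner can select."""
    def wrap(fn: Callable[[str], str]):
        TOOLS[name] = fn
        return fn
    return wrap

@register("python_coder")
def python_coder(task: str) -> str:
    # Placeholder body; a real tool would generate and sandbox-execute code.
    return f"[python_coder output for: {task}]"

@register("web_search")
def web_search(task: str) -> str:
    # Placeholder body; a real tool would query a search backend.
    return f"[web_search results for: {task}]"

def execute(tool_name: str, task: str) -> str:
    """Executor-style dispatch from a Planner-selected tool name to an implementation."""
    return TOOLS[tool_name](task)
```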
Why it matters
AgentFlow provides a clear, auditable structure for tool-using agents and a practical on-policy training method that bridges long-horizon rewards and token-level updates. By focusing training on the Planner and relying on modular tools and verification, the approach improves reliability of tool use and overall task performance while keeping other components stable and inspectable.