
DeepAgent: One-Stream AI That Thinks, Finds Tools, and Acts

DeepAgent merges thinking, tool search, tool calls, and memory compression into a single reasoning stream, enabling dynamic tool discovery across tens of thousands of APIs and improved long-horizon performance.

Why current agent loops fall short

Most agent frameworks follow a predefined Reason-Act-Observe loop and rely on a fixed set of tools injected into the prompt. That approach can work for small tasks, but it breaks down when the toolset is large, tasks are long, or the agent needs to change strategy mid-reasoning. Such agents cannot discover new tools at runtime and struggle with context overflow during long interactions.

Unified reasoning and on-demand tool discovery

DeepAgent, proposed by researchers at Renmin University of China and Xiaohongshu, keeps the entire agent loop inside a single reasoning stream. The model can emit four action types directly in text: internal thought, tool search, tool call, and memory fold. When the agent decides to search, it queries a dense index of tool descriptions drawn from large registries (for example, over 16,000 RapidAPI tools and nearly 3,900 ToolHop tools). The system returns only the top-ranked tools to the model in-context, so the agent discovers tools dynamically instead of relying on a front-loaded tool list. This design aligns agents with realistic environments where available tools change over time.

For more details see the paper: https://arxiv.org/pdf/2510.21618
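To make the single-stream design concrete, here is a minimal Python sketch of such a runtime loop. It assumes hypothetical action tags (`<think>`, `<tool_search>`, `<tool_call>`, `<memory_fold>`) and caller-supplied callables for generation, embedding, tool execution, and folding; DeepAgent's actual special tokens, retrieval stack, and prompting are described in the paper and differ in detail.

```python
import re
import numpy as np

# Hypothetical action tags; DeepAgent's exact special tokens may differ.
ACTIONS = re.compile(r"<(think|tool_search|tool_call|memory_fold)>(.*?)</\1>", re.DOTALL)

def retrieve_tools(query, tool_index, embed_fn, top_k=5):
    """Dense retrieval: embed the search query and rank registry entries by cosine similarity."""
    q = embed_fn(query)
    def score(tool):
        e = tool["embedding"]
        return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
    return sorted(tool_index, key=score, reverse=True)[:top_k]

def run_agent(task, generate_fn, tool_index, embed_fn, call_tool_fn, fold_fn):
    """One continuous reasoning stream: the model emits thoughts, tool searches,
    tool calls, and memory folds as text, and the runtime reacts to each action."""
    context = f"Task: {task}\n"
    while True:
        step = generate_fn(context)          # next chunk of the reasoning stream
        context += step
        for kind, body in ACTIONS.findall(step):
            if kind == "tool_search":
                hits = retrieve_tools(body, tool_index, embed_fn)
                context += "\nRetrieved tools:\n" + "\n".join(
                    f"- {t['name']}: {t['description']}" for t in hits)
            elif kind == "tool_call":
                context += f"\nObservation: {call_tool_fn(body)}"
            elif kind == "memory_fold":
                context = fold_fn(context)   # compress history into structured memories
        if "<final_answer>" in step:
            return step
```

The key design point this sketch captures is that tool descriptions never live in the prompt up front: only the few retrieved entries for the current query are appended in-context, which is what lets the registry scale to tens of thousands of APIs.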

Autonomous memory folding for long-horizon tasks

Long sequences of tool calls, web results, and code outputs can overflow model context windows. DeepAgent addresses this with an autonomous memory folding step. When the model emits a fold token, an auxiliary LLM compresses the full interaction history into three compact memories:

  • Episodic Memory: records task events and significant outcomes.
  • Working Memory: holds the current subgoal and recent issues.
  • Tool Memory: stores tool names, arguments, and outcomes.

Those structured memories are fed back into the agent as compact, information-rich text so the agent can continue reasoning without losing crucial context.
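As a rough illustration of the folding step, the sketch below asks an auxiliary summarizer to produce the three memories and then rebuilds a short context for the main agent. The prompt wording, the amount of recent text kept verbatim, and the `summarize_fn` callable are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical folding prompt; the paper's auxiliary-LLM schema is more detailed.
FOLD_PROMPT = """Summarize the interaction history below into three sections:
Episodic Memory (key events and significant outcomes), Working Memory (current
subgoal and recent issues), and Tool Memory (tools used, arguments, and results).

History:
{history}
"""

def fold_memory(context, summarize_fn, keep_chars=2000):
    """Compress everything except the most recent chunk of the context into the
    three structured memories, then rebuild a short context for the main agent."""
    history, recent = context[:-keep_chars], context[-keep_chars:]
    memories = summarize_fn(FOLD_PROMPT.format(history=history))  # auxiliary LLM call
    return "Folded memories:\n" + memories + "\n\nRecent context:\n" + recent
```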

ToolPO: reinforcement learning focused on tool calls

Supervised traces alone do not reliably teach robust tool use, because correct tool calls are just a few tokens inside long generations. The team introduces Tool Policy Optimization (ToolPO) to fix this. ToolPO runs rollouts on LLM-simulated APIs, which makes training stable and inexpensive. It attributes reward to the exact tool call tokens — a method called token-level tool call advantage attribution — and optimizes with a clipped PPO-style objective. This trains the agent not only to call tools correctly but also to decide when to search for tools and when to fold memory.
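The token-level attribution can be illustrated with a masked, clipped PPO surrogate, sketched below in PyTorch. This shows only the core idea of crediting advantage to tool-call tokens; how ToolPO computes advantages and combines them with the overall task reward follows the paper and is not reproduced here.

```python
import torch

def toolpo_loss(logp_new, logp_old, advantages, tool_call_mask, clip_eps=0.2):
    """Clipped PPO-style surrogate where advantage is attributed only to the tokens
    that form tool calls (token-level tool-call advantage attribution).
    All tensors have shape (batch, seq_len); tool_call_mask is 1 on tool-call tokens."""
    ratio = torch.exp(logp_new - logp_old)                   # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token = -torch.min(unclipped, clipped)               # standard PPO pessimistic bound
    masked = per_token * tool_call_mask                      # credit only tool-call tokens
    return masked.sum() / tool_call_mask.sum().clamp(min=1)
```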

Benchmarks and empirical results

The authors evaluate DeepAgent on five tool-use benchmarks (ToolBench, API Bank, TMDB, Spotify, ToolHop) and four downstream tasks (ALFWorld, WebShop, GAIA, HLE). In the labeled-tool setting, where the correct tools are provided, a DeepAgent 32B RL model with a QwQ 32B backbone reports competitive scores: 69.0 on ToolBench, 75.3 on API Bank, 89.0 on TMDB, 75.4 on Spotify, and 51.3 on ToolHop. These results are the strongest 32B-level outcomes across all five datasets, showing a uniform performance advantage over workflow baselines.

In the more realistic open-set retrieval setting, DeepAgent first must find tools and then call them. There DeepAgent 32B RL reaches 64.0 on ToolBench and 40.6 on ToolHop, outperforming workflow baselines that score 55.0 and 36.2 respectively. The researchers also show that autonomous tool retrieval improves workflow agents as well, but DeepAgent gains more, indicating the architecture and training are well matched to large toolsets.

On the downstream environments, with a 32B reasoning backbone, DeepAgent reports a 91.8% success rate on ALFWorld, 34.4% success and a 56.3 score on WebShop, 53.3 on GAIA, and a higher score than workflow agents on HLE. These longer, noisier tasks likely benefit from the combined effect of memory folding and ToolPO.

Practical implications

DeepAgent demonstrates a practical path toward agents that do not rely on fixed tool prompts. By unifying continuous internal reasoning, dense tool retrieval across very large registries, structured tool calling, and memory folding, the architecture enables more robust long-horizon behavior. The use of LLM-simulated APIs during ToolPO training is an engineering trade-off that addresses latency and instability issues encountered by prior tool agents. Overall, the results suggest end-to-end agents with memory and RL are emerging as a strong and practical pattern for working with large, changing tool ecosystems.

Resources: https://arxiv.org/pdf/2510.21618
