Agent Lightning: Train Any AI Agent with RL from Real Execution Traces
Microsoft has open sourced Agent Lightning, a framework that converts agent execution traces into RL transitions, enabling LLM policy training with minimal integration effort and support for standard RL trainers.
What Agent Lightning is and why it matters
Microsoft AI released Agent Lightning, an open-source framework that turns real agent runs into reinforcement-learning-ready transitions without rewriting your agent stack. It separates training from execution, standardizes how traces are recorded, and introduces LightningRL, a hierarchical method that converts complex multi-step agent behavior into transitions that single-turn RL trainers can optimize.
Modeling agents as decision processes
Agent Lightning formalizes an agent run as a partially observable Markov decision process: observations correspond to the current input to the policy LLM, actions map to model calls, and rewards can be terminal or intermediate. From each run, the system extracts only the calls made by the policy model, together with their inputs, outputs, and rewards. By trimming framework noise, it produces clean, focused transitions for downstream training.
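To make the abstraction concrete, here is a minimal sketch, assuming a hypothetical span record with model, prompt, response, and reward fields, of how one policy-LLM call maps onto a transition. The Transition class and extract_policy_transitions helper are illustrative, not Agent Lightning's actual schema.

```python
# A minimal sketch, not Agent Lightning's actual schema: one policy-LLM call
# becomes a transition whose observation is the prompt, whose action is the
# generated text, and whose reward may be terminal or intermediate.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transition:
    observation: str          # full prompt sent to the policy LLM at this step
    action: str               # text the policy LLM generated
    reward: Optional[float]   # intermediate or terminal reward, if any
    done: bool                # True for the final call of the episode


def extract_policy_transitions(spans, policy_model):
    """Keep only calls made by the policy model, dropping framework noise.

    `spans` is assumed to be a list of records with `model`, `prompt`,
    `response`, and `reward` fields (a hypothetical shape for illustration).
    """
    calls = [s for s in spans if s.model == policy_model]
    return [
        Transition(
            observation=s.prompt,
            action=s.response,
            reward=s.reward,
            done=(i == len(calls) - 1),
        )
        for i, s in enumerate(calls)
    ]
```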
LightningRL and credit assignment
LightningRL tackles credit assignment across multi-step episodes. It converts long agent trajectories into per-call transitions while preserving the information needed to assign credit to earlier decisions. The resulting transitions are compatible with single-turn RL objectives, so teams can reuse standard RL trainers and algorithms such as PPO or GRPO. The paper notes compatibility with trainers like VeRL that implement these interfaces.
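The paper's exact hierarchical rule is not reproduced here; as a rough sketch under the assumption that credit can be approximated by a discounted episode return, the conversion from a multi-step episode into single-turn training examples might look like this, reusing the Transition records from the previous sketch:

```python
# Illustrative stand-in for credit assignment: broadcast a discounted return to
# every call so each (prompt, response) pair gets a scalar target that a
# single-turn trainer such as PPO or GRPO can consume. LightningRL's actual
# rule is defined in the paper; gamma=1.0 simply sums rewards along the episode.
def assign_credit(transitions, gamma=1.0):
    """Return (observation, action, return) triples, ordered as in the episode."""
    ret = 0.0
    examples = []
    for t in reversed(transitions):
        ret = (t.reward or 0.0) + gamma * ret
        examples.append((t.observation, t.action, ret))
    examples.reverse()
    return examples
```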
Training Agent Disaggregation and system architecture
The design separates the runtime and training tiers. A Lightning Client runs alongside existing agents in production, capturing traces of prompts, tool calls, and reward signals. A Lightning Server receives the streamed traces, performs training on GPU resources, and serves updated models through an OpenAI-like API. This keeps tools, browsers, shells, and other dependencies close to production while training stays centralized and scalable.
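On the runtime side, the split can be pictured as the agent talking to an ordinary OpenAI-compatible endpoint that the training server keeps backed by the latest checkpoint. The URL, API key, and model name below are placeholders, not Agent Lightning defaults:

```python
# Sketch of an agent-side call against the serving tier. Only the standard
# OpenAI Python client is used; the endpoint and model name are hypothetical.
from openai import OpenAI

client = OpenAI(
    base_url="http://lightning-server.internal/v1",  # placeholder serving endpoint
    api_key="unused-for-internal-endpoint",
)

response = client.chat.completions.create(
    model="policy-llm",  # placeholder name for the policy model being trained
    messages=[{"role": "user", "content": "Write a SQL query that lists all orders from 2024."}],
)
print(response.choices[0].message.content)
```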
Tracing, telemetry, and unified data interface
The runtime supports two tracing paths. The default path uses OpenTelemetry spans so you can pipe agent telemetry through standard collectors. A lightweight embedded tracer is offered for teams that prefer not to deploy OpenTelemetry. Both tracing paths converge on the same store for training.
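For the OpenTelemetry path, a model call recorded as a span might look roughly like the snippet below; the span and attribute names are illustrative rather than Agent Lightning's schema, and a production setup would export to a collector instead of the console:

```python
# Minimal OpenTelemetry example: one policy-LLM call recorded as a span whose
# attributes carry the prompt, response, and reward. Attribute names are
# illustrative, not a fixed Agent Lightning convention.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))  # swap for an OTLP exporter in practice
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-runtime")

with tracer.start_as_current_span("policy_llm_call") as span:
    span.set_attribute("llm.prompt", "Find all orders placed in 2024")
    span.set_attribute("llm.response", "SELECT * FROM orders WHERE year = 2024;")
    span.set_attribute("reward", 0.8)
```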
Each model call and each tool call is recorded as a span with inputs, outputs, and metadata. The algorithm layer adapts these spans into ordered triplets of prompt, response, and reward. That selective extraction lets teams optimize a single agent inside a larger workflow, or multiple agents at once, without touching orchestration code. The same traces can also be reused for prompt optimization or supervised fine-tuning.
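A selective adapter along these lines, with hypothetical span fields and agent names, shows how only the agents you want to train contribute triplets while the rest of the workflow is left alone:

```python
# Illustrative adapter: turn recorded spans into ordered (prompt, response, reward)
# triplets for a chosen subset of agents, e.g. train the writer and rewriter while
# leaving a checker fixed. Field names (start_time, kind, agent_name) are assumptions.
def triplets_for_agents(spans, trainable_agents=("writer", "rewriter")):
    ordered = sorted(spans, key=lambda s: s.start_time)
    return [
        (s.prompt, s.response, s.reward)
        for s in ordered
        if s.kind == "llm_call" and s.agent_name in trainable_agents
    ]
```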
Experiments and datasets
The paper evaluates three tasks with Llama 3.2 3B Instruct as the policy model. For text-to-SQL, the team used the Spider benchmark with a multi-agent pipeline implemented in LangChain, composed of writer, rewriter, and checker agents. The writer and rewriter were optimized while the checker remained fixed, and rewards improved steadily during both training and testing.
For retrieval-augmented generation, the setup used the MuSiQue benchmark and a Wikipedia-scale index of about 21 million documents. The retriever used BGE embeddings with cosine similarity, and the agent was built with the OpenAI Agents SDK. The reward was a weighted sum of a format score and F1 correctness, and the reward curves showed stable gains in training and evaluation.
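The shape of that reward is easy to picture; the sketch below uses token-level F1 and illustrative 0.1/0.9 weights, while the paper defines its own format check and weighting:

```python
# Hedged reconstruction of the RAG reward: weighted sum of a format score and
# token-level F1 against the gold answer. Weights and the format check are
# illustrative, not the paper's exact definition.
from collections import Counter


def token_f1(prediction, gold):
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def rag_reward(answer, gold, well_formatted, w_format=0.1, w_f1=0.9):
    return w_format * float(well_formatted) + w_f1 * token_f1(answer, gold)
```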
For math question answering with tool use, the agent was built with AutoGen and called a calculator tool on the Calc-X dataset. Training improved the agent's ability to invoke the tool correctly and to integrate tool outputs into final answers.
Key takeaways for teams
Agent Lightning enables near-zero-code integration with existing agent frameworks such as LangChain, the OpenAI Agents SDK, AutoGen, and CrewAI. LightningRL converts multi-step trajectories into transitions and applies credit assignment so that standard single-turn RL methods can be used. The Automatic Intermediate Rewarding mechanism converts runtime signals such as tool return status into denser intermediate rewards, addressing sparse-reward issues in long workflows. Traces recorded via OpenTelemetry or the embedded tracer stream to the training server, which serves updated models through an OpenAI-compatible endpoint, enabling scalable rollouts without moving tool dependencies.
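As a loose illustration of the idea behind Automatic Intermediate Rewarding, not the framework's actual rule, a runtime signal such as tool return status could be mapped onto a small per-step reward like this:

```python
# Hypothetical mapping from tool return status to a dense intermediate reward,
# so long workflows are not limited to a single terminal signal. Status values
# and magnitudes are assumptions for illustration.
def intermediate_reward(tool_status):
    if tool_status == "ok":
        return 0.1    # small bonus for a successful tool call
    if tool_status == "error":
        return -0.1   # small penalty for a failed call
    return 0.0        # neutral events contribute nothing
```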