Memory-R1: Reinforcement Learning That Teaches LLMs What to Remember

Why LLMs Struggle with Memory

Large language models excel at many tasks but are inherently stateless: each query is processed without persistent knowledge carried forward. Fixed context windows limit long-horizon reasoning and multi-session continuity. Common workarounds like retrieval-augmented generation (RAG) retrieve past conversation snippets and append them to the prompt, but without intelligent filtering this often buries LLMs in noisy or irrelevant context.

What Memory-R1 Does

Memory-R1 is a framework that trains LLM agents to actively manage external memory with reinforcement learning. Instead of relying on hand-crafted heuristics for what to store or remove, Memory-R1 uses outcome-based rewards so agents learn policies that optimize final question answering performance. The approach generalizes across backbones and tasks and requires surprisingly little labeled data.

Two RL-Fine-Tuned Agents

Memory-R1 uses two cooperating components: a Memory Manager, which decides how to edit the external memory store, and an Answer Agent, which selects relevant memories and reasons over them to produce the final response.

Both agents are trained with reinforcement learning methods such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO). Crucially, the reward signal is simply the correctness of the final answer, so the system learns memory operations indirectly through outcome-based feedback rather than needing explicit, expensive annotations of memory edits.
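
To make the outcome-based training signal concrete, here is a minimal sketch of an exact-match reward and GRPO-style group-relative advantages. The function names and the epsilon term are illustrative assumptions, not the paper's code; the only grounded parts are that reward comes from final-answer correctness and that GRPO normalizes rewards within a sampling group.

```python
# Sketch only: exact-match reward and GRPO-style group-relative advantages.
# Names are illustrative; the paper specifies only the outcome-based reward idea.

def exact_match_reward(predicted: str, gold: str) -> float:
    """Reward is 1.0 only if the final answer matches the gold answer."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each rollout's reward by its sampling group's mean and std,
    as in GRPO, so no separate value network is needed."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: four sampled rollouts for one question, two answered correctly.
rewards = [exact_match_reward(a, "two dogs")
           for a in ["two dogs", "one dog", "two dogs", "a cat"]]
print(group_relative_advantages(rewards))  # correct rollouts get positive advantage
```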

Memory Manager: Learning to Edit Knowledge

The Memory Manager learns when to add new facts, when to update existing ones, when to delete outdated or contradictory entries, and when to leave memory unchanged. Training ties the manager’s actions to downstream answer quality: if an edit leads the Answer Agent to a better response, the manager receives positive reward. That encourages consolidation instead of fragmentation of user information. For example, when a user first mentions adopting Buddy and later mentions adopting Scout, a trained memory manager will merge both facts so the memory records that the user adopted two dogs.
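 
Below is a hedged sketch of the Memory Manager's action space. The four operation names mirror the behaviors described above (add, update, delete, no-op), but the `MemoryBank` class, its fields, and the `apply` method are assumptions for illustration, not the framework's actual data structures.

```python
# Illustrative memory store with the four edit operations described above.
# Data structures and names are assumptions, not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    entries: dict[int, str] = field(default_factory=dict)
    _next_id: int = 0

    def apply(self, op: str, entry_id: int | None = None, text: str | None = None):
        if op == "ADD":
            self.entries[self._next_id] = text
            self._next_id += 1
        elif op == "UPDATE":
            self.entries[entry_id] = text      # consolidate, e.g. merge related facts
        elif op == "DELETE":
            self.entries.pop(entry_id, None)   # drop outdated or contradictory entry
        elif op == "NOOP":
            pass                               # leave memory unchanged

# The Buddy/Scout example: consolidation instead of fragmentation.
bank = MemoryBank()
bank.apply("ADD", text="User adopted a dog named Buddy.")
bank.apply("UPDATE", entry_id=0, text="User adopted two dogs, Buddy and Scout.")
```

During training, whichever sequence of these edits leads the Answer Agent to a correct final answer earns the Memory Manager positive reward; no per-edit labels are required.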

Answer Agent: Selective Reasoning

The Answer Agent avoids dumping dozens of retrieved items into the prompt. Instead it filters retrieved candidates to a compact, relevant subset and then reasons over that distilled context. Trained with the same outcome-driven reward (exact match to gold answers), it learns to suppress noisy entries and focus on context that improves factual accuracy and reasoning.
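 
The following sketch shows the "filter, then answer" flow under stated assumptions: the prompts and the `call_llm` helper are placeholders standing in for the policy model, not the paper's actual prompts or API.

```python
# Hedged sketch of the Answer Agent's memory distillation step.
# call_llm and the prompt wording are placeholders, not the paper's code.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the policy model (e.g. LLaMA or Qwen)."""
    raise NotImplementedError

def answer_with_distillation(question: str, retrieved: list[str]) -> str:
    # Step 1: keep only the retrieved memories relevant to the question.
    filter_prompt = (
        f"Question: {question}\n"
        "Keep only the memory entries needed to answer, one per line:\n"
        + "\n".join(f"- {m}" for m in retrieved)
    )
    distilled = call_llm(filter_prompt)

    # Step 2: reason over the distilled context; the exact-match correctness
    # of this final answer is the RL reward during training.
    answer_prompt = f"Memories:\n{distilled}\n\nQuestion: {question}\nAnswer:"
    return call_llm(answer_prompt)
```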

Data Efficiency and Benchmarks

Memory-R1 is data-efficient: strong results were obtained with only 152 question-answer pairs for training. The LOCOMO benchmark, which features long multi-turn dialogues and diverse QA types including temporal and multi-hop reasoning, serves as a realistic testbed for long-horizon memory management.

Experimental Results

Memory-R1 was evaluated on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct against prior baselines. Metrics included F1, BLEU-1, and an LLM-as-a-judge metric for factual accuracy. Memory-R1 with GRPO produced the best performance, with large relative gains over the previous best baseline across metrics and question types, and gains held across model architectures.

Why It Matters

By framing memory management and memory distillation as reinforcement learning problems, Memory-R1 enables LLM agents to learn what to store, update, or discard, to filter retrieved memories down to what actually matters, and to improve answer quality without dense supervision of individual memory operations.

Memory-R1 represents a step toward agentic, memory-aware AI that can remember, learn, and reason over long-term interactions, delivering more coherent and useful experiences for users.