Unifying Long and Short Term Memory in LLM Agents
Discover how Agentic Memory optimizes memory management in LLM agents.
Designing Autonomous Memory Management in LLMs
How do you design an LLM agent that decides for itself what to store in long term memory, what to keep in short term context, and what to discard without hand-tuned heuristics or extra controllers? Can a single policy learn to manage both memory types through the same action space as text generation?
Researchers from Alibaba Group and Wuhan University introduce Agentic Memory, or AgeMem, a framework that enables large language model agents to learn how to manage both long term and short term memory as part of a single policy. Instead of relying on hand-written rules or external controllers, the agent autonomously decides when to store, retrieve, summarize, and forget, utilizing integrated memory tools within the model's action space.
Current Challenges in LLM Memory Management
Why Current LLM Agents Struggle with Memory
Most agent frameworks treat memory as two loosely coupled systems. Long term memory retains user profiles, task information, and past interactions across sessions, while short term memory encompasses the current context window, which contains active dialogue and retrieved documents.
Existing systems construct these two components in isolation. Long term memory is managed via external stores like vector databases, utilizing simple add and retrieve triggers, whereas short term memory is handled using retrieval-augmented generation or summarization schedules. This separation presents several challenges:
- Long and short term memory are optimized independently, lacking end-to-end interaction training.
- Hand-written heuristics decide when to write to memory and when to summarize. These rules can be fragile and may miss rare but significant events.
- Adding external controllers or expert models raises both costs and complexity.
AgeMem eliminates the external controller, integrating memory operations directly into the agent's policy.
Memory Tools in Agent Action Space
Memory as Tools
In AgeMem, memory operations are exposed as tools in the agent's action space. At every step, the model can emit either normal text tokens or a tool call, with six tools defined:
For long term memory:
- ADD: stores new memory items along with content and metadata.
- UPDATE: modifies existing memory entries.
- DELETE: removes obsolete or low-value items.
For short term memory:
- RETRIEVE: performs a semantic search over long term memory and injects relevant items into the current context.
- SUMMARY: condenses dialogue spans into shorter summaries.
- FILTER: removes context segments that do not aid future reasoning.
The interaction protocol adopts a structured format. Each step initiates with a <think> block where the model privately weighs its options, followed by either a <tool_call> block with a JSON list of tool actions or an <answer> block containing the user-facing response. Memory operations are thus first-class decisions, rather than byproducts.
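To make the protocol concrete, the following sketch shows what a single step with memory tool calls might look like and how the JSON list of actions could be parsed out of it. Only the tool names and the <think>, <tool_call>, and <answer> tags come from the paper; the JSON field names, the example content, and the helper function are illustrative assumptions.

```python
# Illustrative sketch of an AgeMem-style step; field names are assumptions.
import json

step_output = """
<think>
The user mentioned their thesis deadline is in March. That fact will matter in
later sessions, so store it; a long retrieved passage in the current context is
no longer needed.
</think>
<tool_call>
[
  {"tool": "ADD", "content": "User's thesis deadline is in March",
   "metadata": {"topic": "deadline", "session": 3}},
  {"tool": "FILTER", "target": "retrieved_passage_2"}
]
</tool_call>
"""

def parse_tool_calls(model_output: str):
    """Extract the JSON list of tool actions from a <tool_call> block, if any."""
    start = model_output.find("<tool_call>")
    end = model_output.find("</tool_call>")
    if start == -1 or end == -1:
        return []  # the step ended with an <answer> block instead
    return json.loads(model_output[start + len("<tool_call>"):end])

print(parse_tool_calls(step_output))
```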
Reinforcement Learning for Unified Memory Management
Three-Stage Reinforcement Learning
AgeMem is trained with reinforcement learning that couples long and short term memory behaviors. The state at time t includes the current conversational context, the long term memory store, and the task specification. At each step, the policy decides whether to emit text tokens or a tool call. The training trajectory for each sample unfolds across three stages:
- Stage 1: Long Term Memory Construction. The agent engages in a casual setting, gathering information that will be relevant later, using ADD, UPDATE, and DELETE to establish and maintain long term memory, with the short term context naturally expanding during this phase.
- Stage 2: Short Term Memory Management Under Distractors. The short term context resets while long term memory remains intact. The agent faces distracting content that is related but unnecessary, managing short term memory through SUMMARY and FILTER to retain valuable insights and clear out noise.
- Stage 3: Integrated Reasoning. With the final query, the agent retrieves from long term memory using RETRIEVE, governs the short term context, and formulates the answer.
The critical aspect is that long term memory persists throughout all stages, whereas short term memory clears between Stage 1 and Stage 2, compelling the model to depend on retrieval rather than residual context, thereby exposing real-world, long-horizon dependencies.
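The staging logic can be summarized in a short sketch. The agent and task interfaces below are hypothetical placeholders; only the structure (a persistent long term store, a short term context that resets before Stage 2, and a final RETRIEVE-driven answer in Stage 3) reflects the training setup described above.

```python
# Sketch of the three-stage rollout; `agent` and `task` are hypothetical interfaces.
def rollout(agent, task):
    long_term = []   # long term memory store, persists across all three stages
    context = []     # short term context (the active prompt window)

    # Stage 1: casual interaction; the agent may call ADD / UPDATE / DELETE
    for turn in task.stage1_dialogue:
        context.append(turn)
        agent.step(context, long_term)

    # Stage 2: the short term context resets, long_term is kept; distractors
    # arrive and the agent manages the window with SUMMARY / FILTER
    context = []
    for turn in task.stage2_distractors:
        context.append(turn)
        agent.step(context, long_term)

    # Stage 3: final query; the agent must RETRIEVE from long_term to answer
    context.append(task.final_query)
    return agent.step(context, long_term)  # ends with an <answer> block
```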
Reward Design and Step-Wise GRPO
AgeMem employs a step-wise variant of Group Relative Policy Optimization (GRPO). For each task, the system samples multiple trajectories that form a group. A terminal reward is calculated for each trajectory and then normalized within the group to create an advantage signal. This advantage is propagated to every step of the trajectory, enabling intermediate tool decisions to be trained from final outcomes.
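A minimal sketch of the group-relative advantage computation is shown below. It assumes the common GRPO normalization by the group mean and standard deviation; the paper's exact step-wise weighting may differ.

```python
import numpy as np

def grpo_advantages(terminal_rewards):
    """Group-relative advantages: normalize terminal rewards against the
    group statistics, then apply the same value to every step of the
    corresponding trajectory (a sketch, not the paper's exact formula)."""
    r = np.asarray(terminal_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four sampled trajectories for one task form a group.
advantages = grpo_advantages([0.9, 0.4, 0.7, 0.2])
print(advantages)  # positive for above-average trajectories, negative otherwise
```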
Total reward comprises three main components:
- A task reward scoring answer quality (0-1 scale) using an LLM judge.
- A context reward evaluating short term memory operations including compression, early summarization, and retention of query-relevant content.
- A memory reward assessing long term memory quality, factoring in the fraction of high-quality stored items and the relevance of retrieved items to the query.
Uniform weights for each component ensure equal contribution to the learning signal, with penalty terms applied for exceeding the maximum dialogue length or context overflow.
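As a rough sketch, the combination might look like the following; the uniform weighting follows the description above, while the penalty magnitude and the exact penalty form are assumptions.

```python
def total_reward(task_r, context_r, memory_r,
                 context_overflow=False, too_many_turns=False,
                 penalty=0.2):
    """Uniformly weighted combination of the three reward components with
    simple subtractive penalties (penalty value is illustrative, not from
    the paper)."""
    reward = (task_r + context_r + memory_r) / 3.0
    if context_overflow:
        reward -= penalty
    if too_many_turns:
        reward -= penalty
    return reward

# Example: good answer, decent context management, good memory, but overflowed.
print(total_reward(0.8, 0.6, 0.7, context_overflow=True))
```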
Experimental Setup and Key Findings
Experimental Setup and Main Results
The research team fine-tuned AgeMem on the HotpotQA training split and evaluated it across five benchmarks:
- ALFWorld for text-based embodied tasks.
- SciWorld for science-themed environments.
- BabyAI for instruction-following tasks.
- PDDL tasks for planning activities.
- HotpotQA for multi-hop question answering.
Metrics include success rates for ALFWorld, SciWorld, and BabyAI, progress rates for PDDL tasks, and an LLM judge score for HotpotQA. They also establish a Memory Quality metric using an LLM evaluator to contrast stored memories against HotpotQA supporting facts.
Baselines include LangMem, A-Mem, Mem0, Mem0g, and a no-memory agent. Backbone models are Qwen2.5-7B-Instruct and Qwen3-4B-Instruct.
On Qwen2.5-7B-Instruct, AgeMem achieves an average score of 41.96 across the five benchmarks, surpassing Mem0, the best baseline, which reaches 37.14. With Qwen3-4B-Instruct, AgeMem attains 54.31, compared to 45.74 for the top baseline, A-Mem.
Memory quality also shows significant improvement: AgeMem scores 0.533 on HotpotQA with Qwen2.5-7B and 0.605 with Qwen3-4B, outperforming all baselines.
The short term memory tools reduce prompt length while preserving performance, using about 3-5% fewer tokens per prompt than RAG-style baselines.
Ablation studies confirm that each component is essential. Merely adding long term memory tools to a no-memory baseline yields considerable gains. Reinforcement learning applied to these tools further enhances scores. The complete system, with both long and short term tools alongside RL, shows up to 21.7% improvement over the no-memory baseline in SciWorld.
Implications for LLM Agent Design
AgeMem offers a design pattern for future agentic systems. Memory should be integrated within the learned policy, rather than treated as two disparate subsystems. By converting storage, retrieval, summarization, and filtering into explicit tools and training them in conjunction with language generation, agents learn when to remember, when to forget, and how to efficiently manage context over extended periods.
Key Takeaways
- AgeMem transforms memory operations into explicit tools, allowing the same policy that generates text to also dictate when to ADD, UPDATE, DELETE, RETRIEVE, SUMMARY, and FILTER memory.
- Joint training of long and short term memory occurs through a three-stage RL process, where long term memory persists across stages while the short term context resets to encourage retrieval-based reasoning.
- The reward function combines task accuracy, context management quality, and long term memory quality with uniform weights, alongside penalties for context overflow and excessive dialogue length.
- Across benchmarks such as ALFWorld, SciWorld, BabyAI, PDDL tasks, and HotpotQA, AgeMem consistently outperforms memory baselines like LangMem, A-Mem, and Mem0 in both average scores and memory quality metrics.
- Short term memory tools yield around 3-5% reduction in prompt lengths compared to RAG-style baselines, while maintaining or enhancing performance, indicating the efficacy of learned summarization and filtering over handcrafted context management rules.