M3-Agent Brings Human-Like Long-Term Memory to Multimodal AI
M3-Agent is a new multimodal framework that creates entity-centric episodic and semantic long-term memories from continuous video and audio, enabling improved multi-turn reasoning and stateful behavior over time.
Why long-term memory matters for multimodal agents
Multimodal agents that operate in the real world must do more than parse isolated inputs. To act sensibly over days or weeks, they need to observe continuous audiovisual streams, compress and store experiences, and draw on that accumulated knowledge when making decisions. Simply storing raw observations episodically is not enough for lasting consistency. Instead, agents should form richer, entity-centric and semantic memories that resemble how humans internalize facts, identities, and relationships.
The M3-Agent approach
M3-Agent is a new framework designed to give multimodal agents both episodic and semantic long-term memory. It ingests real-time video and audio, processes them clip by clip, and writes memory entries into an external long-term memory database structured as a memory graph. Each node in this graph holds a unique ID, modality information, raw content, embeddings, and metadata. By organizing memory around entities and multimodal signals, M3-Agent builds a coherent, evolving model of the environment instead of a jumble of isolated observations.
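To make the node structure concrete, here is a minimal Python sketch of a memory-graph entry. The names (MemoryNode, add_node, the nodes and edges dicts) are hypothetical, and the field set simply mirrors the description above; the actual schema in the M3-Agent code may differ.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class MemoryNode:
    """One entry in the memory graph (illustrative schema, not the official one)."""
    node_id: str            # unique identifier
    modality: str           # e.g. "video", "audio", or "text"
    content: str            # raw content or a textual description of the clip
    embedding: np.ndarray   # vector used for similarity search at retrieval time
    metadata: dict = field(default_factory=dict)  # timestamps, entity IDs, etc.


# Nodes plus undirected edges linking entries about the same entity,
# so episodic observations can accumulate into semantic knowledge.
nodes: dict[str, MemoryNode] = {}
edges: dict[str, set[str]] = {}


def add_node(node: MemoryNode, neighbors: list[str]) -> None:
    """Insert a node and connect it to related entries."""
    nodes[node.node_id] = node
    edges.setdefault(node.node_id, set()).update(neighbors)
    for other in neighbors:
        edges.setdefault(other, set()).add(node.node_id)
```

Organizing edges around shared entities is what lets the agent answer questions about a person or object by following links, rather than scanning every stored clip.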
Memorization and control pipelines
The framework runs two parallel processes: memorization and control. During memorization, the agent generates episodic memory entries for raw inputs and extracts higher-level semantic memory such as identities, relationships, and abstract facts. For control, the agent performs multi-turn reasoning: it searches the memory graph across up to H rounds to fetch relevant items and uses them to plan or answer queries. Reinforcement learning is used to optimize the full pipeline, with separate models trained for the memorization and control roles to maximize end-to-end performance.
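A rough sketch of that control loop is below, assuming two placeholder callables: retrieve for memory search and reason for the policy model. Neither name, nor the choice of H = 5, comes from the paper, and the real system uses RL-trained models rather than these stubs.

```python
from typing import Callable

H = 5  # maximum number of retrieval/reasoning rounds (an assumed budget)


def control_loop(query: str,
                 retrieve: Callable[[str, int], list[str]],
                 reason: Callable[[str, list[str]], tuple[str, bool]]) -> str:
    """Multi-turn reasoning over long-term memory (illustrative only).

    retrieve(q, k) returns the k memory entries most relevant to q;
    reason(query, evidence) returns (answer_or_refined_query, done).
    """
    evidence: list[str] = []
    current = query
    for _ in range(H):
        evidence.extend(retrieve(current, 5))    # search the memory graph
        current, done = reason(query, evidence)  # plan or answer from evidence
        if done:                                 # the model judged it can answer
            break
    return current
```

The key design choice this illustrates is iterative retrieval: each round's reasoning can refine the next search query, rather than retrieving once and answering.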
Evaluation with M3-Bench and results
To measure long-video understanding and long-term consistency, the team developed M3-Bench. M3-Agent and a set of baselines were evaluated on M3-Bench-robot, M3-Bench-web, and VideoMME-long. Key results include:
- On M3-Bench-robot, M3-Agent improves accuracy by 6.3% over the strongest baseline, MA-LMM.
- On M3-Bench-web and VideoMME-long, it outperforms Gemini-GPT4o-Hybrid by 7.7% and 5.3%, respectively.
- On human understanding and cross-modal reasoning metrics, M3-Agent beats MA-LMM by 4.2% and 8.5% on M3-Bench-robot, and shows gains of 15.5% and 6.7% on M3-Bench-web compared to Gemini-GPT4o-Hybrid.
These improvements highlight M3-Agent's ability to track people and other entities consistently over long streams, reason about the humans it observes, and integrate multimodal information more effectively than previous methods.
How M3-Agent differs from past attempts
Past approaches often appended raw trajectories, summaries, latent embeddings, or structured representations directly to memory. Video-focused methods either extend the context window or compress visual tokens, but both strategies scale poorly to long streams. Memory-based methods that store visual features scale better but can lose long-term coherence. Language-based descriptions, such as those generated by the Socratic Models framework, are scalable but have difficulty tracking evolving entities and events. M3-Agent tackles these issues by combining multimodal representation, entity-centric organization, and separate memorization and control models.
Limitations and future directions
Case studies in the paper identify remaining gaps: improving attention mechanisms for semantic memory, making visual memory systems more efficient, and refining memory update strategies for evolving entities. These directions aim to further narrow the gap between current multimodal agents and truly human-like understanding and continuity.
Resources
The authors provide a paper and a GitHub page with tutorials, code, and notebooks for those interested in reproducing or extending M3-Agent. The framework and the M3-Bench benchmarks pave the way for more consistent, memory-aware multimodal agents in real-world applications.