OpenAI Unveils GPT-5.1-Codex-Max: Autonomous Long‑Horizon Coding with Compaction
OpenAI launches GPT-5.1-Codex-Max, a Codex model built for long-running, agentic coding with native compaction across multiple context windows and improved token efficiency.
What GPT-5.1-Codex-Max is
OpenAI has released GPT-5.1-Codex-Max, a specialized agentic coding model designed for long-running software engineering tasks that can span millions of tokens and multi-hour sessions. The model is available now across Codex surfaces, including the CLI, IDE extension, cloud integrations and code review tools, with API access planned soon.
Training focus and intended use
GPT-5.1-Codex-Max is built on an updated foundational reasoning model trained on agentic tasks across software engineering, math, research and other domains. On top of that base, the model received further training on real-world software engineering workloads such as pull request creation, code review, frontend implementation and developer Q&A. OpenAI positions this model for frontier coding evaluations and agentic workflows rather than general conversational use. It is recommended for Codex or Codex-like environments, not as a direct replacement for general-purpose GPT-5.1 chat models.
Native Windows and CLI behaviour
This is the first Codex model explicitly trained to operate in Windows environments. Training included tasks that improve collaboration inside the Codex CLI and sandbox, such as safer command execution and better file handling behaviour. Those improvements aim to make the model a more reliable partner when running commands and manipulating project files.
Compaction and long running sessions
A core capability of GPT-5.1-Codex-Max is compaction. While the model still operates within a fixed context window, it is natively trained to prune and compress its interaction history so it can continue work across multiple context windows. When a Codex session approaches the context limit, the model automatically compacts the session state into a fresh context window that preserves essential task information, then continues executing. That cycle repeats until the task is complete.
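The compaction cycle described above can be illustrated with a small, self-contained simulation. Everything here is an assumption for illustration: the function names, the token budget and the compaction strategy are invented stand-ins, not OpenAI's actual implementation.

```python
# Illustrative simulation of a compaction loop: when the interaction
# history nears the context limit, compress it into a summary that
# preserves essential task state and continue in a fresh window.
# All names and numbers are hypothetical.

CONTEXT_LIMIT = 1000   # assumed token budget for one context window
COMPACT_AT = 0.9       # compact when 90% of the window is used

def tokens(history):
    """Crude token count: whitespace-separated words."""
    return sum(len(entry.split()) for entry in history)

def compact(history):
    """Stand-in for native compaction: keep the original task,
    a summary placeholder, and the most recent step."""
    return [history[0], f"[compacted summary of {len(history)} steps]", history[-1]]

def run_task(steps):
    """Run a sequence of agent steps, compacting whenever the
    history approaches the context limit."""
    history = ["task: refactor module"]
    windows_used = 1
    for step in steps:
        history.append(step)
        if tokens(history) > COMPACT_AT * CONTEXT_LIMIT:
            history = compact(history)
            windows_used += 1  # continue in a fresh context window
    return history, windows_used

# 200 steps of ~10 tokens each overflow a 1000-token window many times over
steps = [f"step {i}: " + "edit " * 8 for i in range(200)]
final_history, windows = run_task(steps)
print(windows > 1)  # the task spanned multiple context windows
```

The key design point this sketch captures is that compaction is lossy by construction: the loop trades full history for a compressed representation of task state, which is why OpenAI trains the model natively to decide what to preserve rather than relying on a generic summarizer.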
OpenAI reports internal runs where GPT-5.1-Codex-Max worked autonomously on a single task for more than 24 hours, iterating on implementations, fixing failing tests and ultimately producing successful results.
Reasoning effort, speed and token efficiency
GPT-5.1-Codex-Max inherits the reasoning effort control introduced with GPT-5.1, tuned for coding agents. Reasoning effort determines how many thinking tokens the model uses before committing to an answer. On SWE-bench Verified, the model at medium reasoning effort achieves higher accuracy than GPT-5.1-Codex at the same effort while using about 30% fewer thinking tokens. For non-latency-sensitive tasks, OpenAI introduced an extra-high mode, written xhigh, that allows longer internal reasoning to reach better answers. Medium remains the recommended default for most workloads.
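Since the API is not yet public, the guidance above can only be sketched as a hypothetical selection policy; the function and parameter names below are assumptions, not a documented interface.

```python
# Hypothetical helper mirroring the guidance in the text:
# medium is the default, and xhigh is reserved for
# non-latency-sensitive, frontier-difficulty tasks.

EFFORT_LEVELS = ("low", "medium", "high", "xhigh")

def pick_effort(latency_sensitive: bool, frontier_difficulty: bool) -> str:
    if latency_sensitive:
        return "medium"   # recommended default; avoids long thinking delays
    return "xhigh" if frontier_difficulty else "medium"

print(pick_effort(latency_sensitive=False, frontier_difficulty=True))  # xhigh
```

The tradeoff being encoded is latency versus answer quality: more thinking tokens improve hard-task accuracy but delay the first useful output, which is why the extra-high mode only makes sense when no one is waiting interactively.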
Benchmarks with compaction enabled show measurable gains. With GPT-5.1-Codex at high effort and GPT-5.1-Codex-Max at xhigh, OpenAI reports scores on 500 SWE-bench Verified issues of 73.7% and 77.9% respectively. On SWE-Lancer IC SWE the reported results are 66.3% versus 79.9%, and on Terminal-Bench 2.0 the numbers are 52.8% versus 58.1%. Terminal-Bench 2.0 runs inside the Codex CLI harness and all reported evaluations used compaction.
In qualitative tests, GPT-5.1-Codex-Max produced high quality frontend designs with comparable functionality and visual quality to GPT-5.1-Codex while consuming fewer tokens overall thanks to more efficient reasoning traces.
Availability and use cases
GPT-5.1-Codex-Max is available in the Codex CLI, IDE extensions, cloud integrations and code review surfaces today, with API access coming later. It targets workflows that require sustained, autonomous progress: long-lived feature development, multi-file refactors, extended test repair and other multi-hour engineering tasks where preserving and compressing state across context windows matters.
What this means for engineering teams
The model signals a shift toward operationalizing long-horizon, agentic coding in developer tools rather than focusing on short, single-shot edits. Compaction, explicit reasoning effort controls and performance wins on frontier coding benchmarks make GPT-5.1-Codex-Max a practical test case for integrating extended autonomous runs into real engineering pipelines. The Preparedness Framework and the Codex sandbox will likely play important roles as teams adopt these capabilities in production workflows.