Anthropic’s Claude Sonnet 4.5 Sets New Benchmark for Coding and Long‑Horizon Agents

What Sonnet 4.5 introduces

Anthropic has released Claude Sonnet 4.5, an update focused on real-world software engineering, extended agentic workflows, and reliable computer use. The release pairs model improvements with concrete product features: Claude Code checkpoints, a native VS Code extension, new API memory and context tools, and an Agent SDK that exposes the scaffolding Anthropic uses internally. Pricing is unchanged from Sonnet 4: $3 per million input tokens and $15 per million output tokens.
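To make the pricing concrete, here is a minimal sketch of the arithmetic using the two list prices quoted above. The function name and the example token counts are illustrative assumptions, and the sketch ignores any discounts or other pricing features that may apply in practice.

```python
# List prices quoted in the article, in USD per million tokens.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated list-price cost in USD for one workload."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical agent session: 2M tokens read, 250K tokens written.
    print(f"${estimate_cost_usd(2_000_000, 250_000):.2f}")  # prints $9.75
```

Because output tokens cost five times as much as input tokens, long agent runs that write a lot of code or commentary are dominated by the output side of this sum.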

Benchmarks and performance highlights

Anthropic reports new highs for Sonnet 4.5 on several practical evaluations. On the full 500-problem SWE-bench Verified dataset, it reports 77.2% accuracy using a simple two-tool scaffold (bash plus file editing), averaged over 10 runs with no additional test-time compute and a 200K-token “thinking” budget. A 1M-token context configuration reaches 78.2%, and a higher-compute setup that uses parallel sampling with rejection reaches 82.0%.
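Two ideas in the paragraph above, averaging a score over repeated runs and selecting among parallel samples after rejecting bad ones, can be sketched generically. This is not Anthropic's evaluation harness: the function names, the `passes_checks` and `score` callables, and the toy inputs in the usage note are all hypothetical.

```python
from statistics import mean
from typing import Callable, Optional, Sequence

TOTAL_PROBLEMS = 500  # size of SWE-bench Verified, per the article


def mean_accuracy(resolved_per_run: Sequence[int], total: int = TOTAL_PROBLEMS) -> float:
    """Average the per-run resolve rate and return it as a percentage."""
    if not resolved_per_run:
        raise ValueError("need at least one run")
    return mean(100.0 * resolved / total for resolved in resolved_per_run)


def pick_candidate(
    candidates: Sequence[str],
    passes_checks: Callable[[str], bool],
    score: Callable[[str], float],
) -> Optional[str]:
    """Reject candidates that fail the check, then return the top-scoring survivor.

    Returns None when every candidate is rejected.
    """
    survivors = [c for c in candidates if passes_checks(c)]
    return max(survivors, key=score) if survivors else None
```

With toy inputs, `mean_accuracy([7, 8, 9], total=10)` averages three runs on a ten-problem set, and `pick_candidate(patches, passes_checks, score)` would return the best patch that survives the filter. In a real harness the check might be a regression test run and the score a ranking model, but those details are assumptions here.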

For computer-use tasks, Sonnet 4.5 posts a notable jump on OSWorld-Verified: 61.4%, up from Sonnet 4’s 42.2%, a gain of 19.2 percentage points. Anthropic attributes this to stronger tool control and improved UI manipulation for browser and desktop workflows. Anthropic also reports that the model sustained focus for more than 30 continuous hours on multi-step coding tasks, which it presents as evidence of practical long-horizon autonomy.

Reasoning and math evaluations also show “substantial gains,” according to the release notes. Anthropic is releasing the model under its AI Safety Level 3 (ASL-3) protections and reports improved defenses against prompt-injection attacks.

Agent capabilities and SDK

A core theme of Sonnet 4.5 is addressing brittle parts of agentic systems: extended planning, memory management, and dependable tool orchestration. The Claude Agent SDK exposes production patterns Anthropic uses internally, including memory management for long-running sessions, permissioning, and sub-agent coordination. This goes beyond a simple LLM endpoint by providing scaffolding that helps keep multi-hour jobs coherent and reversible.

Claude Code has been updated with checkpoints, a refreshed terminal, and VS Code integration, allowing teams to reproduce the same scaffolding for multi-step coding and RPA-style tasks. In Anthropic’s demos, the OSWorld-Verified gains show up as better navigation, spreadsheet manipulation, and web flow completion. For enterprise automation, that suggests fewer human interventions during execution.
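The permissioning and checkpoint ideas in this section can be illustrated with a small, generic sketch. It does not use the Claude Agent SDK or reflect how Claude Code implements checkpoints. The `Workspace` class, its method names, and the `write_file` tool name are hypothetical, chosen only to show an allow-list gate in front of a tool call and a snapshot that makes a change reversible.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Workspace:
    """Toy agent workspace: an in-memory file map plus a list of snapshots."""

    files: Dict[str, str] = field(default_factory=dict)
    allowed_tools: Set[str] = field(default_factory=set)
    _checkpoints: List[Dict[str, str]] = field(default_factory=list, repr=False)

    def _require(self, tool: str) -> None:
        """Raise PermissionError unless the tool is on the allow-list."""
        if tool not in self.allowed_tools:
            raise PermissionError(f"tool {tool!r} is not permitted")

    def write_file(self, path: str, content: str) -> None:
        """Hypothetical tool: overwrite a file if writing is permitted."""
        self._require("write_file")
        self.files[path] = content

    def checkpoint(self) -> int:
        """Save a copy of the current files and return the checkpoint id."""
        self._checkpoints.append(dict(self.files))
        return len(self._checkpoints) - 1

    def rollback(self, checkpoint_id: int) -> None:
        """Restore the files saved at checkpoint_id and drop later snapshots."""
        self.files = dict(self._checkpoints[checkpoint_id])
        del self._checkpoints[checkpoint_id + 1:]


if __name__ == "__main__":
    ws = Workspace(allowed_tools={"write_file"})
    ws.write_file("app.py", "v1")
    cp = ws.checkpoint()           # snapshot before a risky edit
    ws.write_file("app.py", "v2")
    ws.rollback(cp)                # undo the edit
    print(ws.files["app.py"])      # prints v1
```

Copying the whole file map on every checkpoint keeps the sketch short. A real system working on large repositories would more plausibly store diffs or lean on version control, which is a design choice this example leaves out.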

Where Sonnet 4.5 is available

Implications for teams and enterprises

With reported gains on SWE-bench Verified and OSWorld-Verified, plus pragmatic product changes and SDK tooling, Sonnet 4.5 is positioned for long-running, tool-heavy agent workloads rather than short prompt demos. Independent replication will be important to validate the “best for coding” claim, but the release clearly targets autonomy, production scaffolding, and improved computer control, areas that map directly to current enterprise pain points.