Anthropic’s Claude Sonnet 4.5 Sets New Benchmark for Coding and Long‑Horizon Agents
What Sonnet 4.5 introduces
Anthropic has released Claude Sonnet 4.5, an update focused on real-world software engineering, extended agentic workflows, and reliable computer use. The release pairs model improvements with concrete product features: Claude Code checkpoints, a native VS Code extension, new API memory and context tools, and an Agent SDK that reproduces Anthropic’s internal scaffolding. Pricing stays the same as Sonnet 4 at $3 per million input tokens and $15 per million output tokens.
Benchmarks and performance highlights
Sonnet 4.5 sets new reported records on several practical evaluations. On the 500-problem SWE-bench Verified dataset, Anthropic reports 77.2% accuracy using a simple two-tool scaffold (bash + file edit), averaged over 10 runs with no test-time compute and a 200K token “thinking” budget. Increasing context to 1M tokens yields 78.2%, while higher compute with parallel sampling and rejection reaches 82.0%.
For computer-use tasks, Sonnet 4.5 posts a notable jump on OSWorld-Verified: 61.4%, up from Sonnet 4’s 42.2%. Anthropic attributes this to stronger tool control and improved UI manipulation for browser and desktop workflows. The team also observed sustained focus exceeding 30 continuous hours on multi-step coding tasks, demonstrating practical long-horizon autonomy.
Reasoning and math evaluations also show “substantial gains,” according to the release notes, and Anthropic reports an elevated safety posture (ASL-3) with improved protections against prompt-injection attacks.
Agent capabilities and SDK
A core theme of Sonnet 4.5 is addressing brittle parts of agentic systems: extended planning, memory management, and dependable tool orchestration. The Claude Agent SDK exposes production patterns Anthropic uses internally, including memory management for long-running sessions, permissioning, and sub-agent coordination. This goes beyond a simple LLM endpoint by providing scaffolding that helps keep multi-hour jobs coherent and reversible.
Claude Code has been updated with checkpoints, a refreshed terminal, and VS Code integration, allowing teams to reproduce the same scaffolding for multi-step coding and RPA-style tasks. Measured improvements on OSWorld-Verified correlate with better navigation, spreadsheet manipulation, and web flow completion in Anthropic’s demos, which suggests fewer human interventions during execution for enterprise automation scenarios.
Where Sonnet 4.5 is available
- Anthropic API and apps: model ID claude-sonnet-4-5, parity pricing with Sonnet 4. File creation and code execution are available in Claude apps for paid tiers.
- AWS Bedrock: available via Bedrock with AgentCore integration and emphasis on long-horizon sessions, memory/context features, and operational controls like observability and session isolation.
- Google Cloud Vertex AI: generally available with multi-agent orchestration support through ADK/Agent Engine, provisioned throughput, 1M-token analysis jobs, and prompt caching.
- GitHub Copilot: public preview in Copilot Chat (VS Code, web, mobile) and Copilot CLI; organizations can enable via policy and use BYO keys in VS Code.
Implications for teams and enterprises
With documented wins on SWE-bench Verified and OSWorld-Verified plus pragmatic product changes and SDK tooling, Sonnet 4.5 is positioned for long-running, tool-heavy agent workloads rather than short prompt demos. Independent replication will be important to validate the “best for coding” claim, but the release clearly targets autonomy, production scaffolding, and improved computer control—areas that map directly to current enterprise pain points.