Agent Wars: How Google, OpenAI, and Anthropic Are Building Autonomous AI Workers
OpenAI, Google, and Anthropic are racing to productize agentic AI across perception, tool calling, orchestration, and governance. Each vendor takes a distinct path: OpenAI favors a programmable, developer-first substrate; Google emphasizes enterprise governance and cross-suite integration; Anthropic prioritizes human-in-the-loop ergonomics and rapid internal app building.
OpenAI’s programmable substrate
OpenAI bundles three core pieces: a Computer-Using Agent (CUA) for GUI control, the Responses API as a unified integration surface, and AgentKit to standardize the agent lifecycle. CUA combines vision with RL-trained policies to perform on-screen actions such as mouse and keyboard interactions, aiming to generalize across desktop and web tasks. The Responses API collapses chat, tool use, state, and multimodality into a single endpoint that hosts tools and persists reasoning across turns. AgentKit supplies visual design, connectors, evaluation hooks, and embeddable UIs to reduce orchestration sprawl.
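A minimal sketch of what that single-endpoint shape looks like in the official Python SDK; the `lookup_invoices` tool and its schema are hypothetical, and any Responses-capable model id will do:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One endpoint carries the prompt, the tool definitions, and state;
# there is no separate chat-vs-tools API surface to reconcile.
response = client.responses.create(
    model="gpt-4.1",
    input="Summarize the open invoices for customer ACME.",
    tools=[{
        "type": "function",
        "name": "lookup_invoices",          # hypothetical tool
        "description": "Return open invoices for a customer.",
        "parameters": {
            "type": "object",
            "properties": {"customer": {"type": "string"}},
            "required": ["customer"],
        },
    }],
)

# Tool invocations come back as typed output items rather than a
# nested message structure.
for item in response.output:
    if item.type == "function_call":
        print(item.name, item.arguments)
```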
Risk and operational considerations for OpenAI
Early third-party evaluations show brittle behavior in GUI automation: flaky DOM targets, window-focus loss, and fragile recovery when layouts change. Teams should instrument retries, stabilize selectors, and gate high-risk actions behind human review. Use execution-based evaluation such as OSWorld to validate GUI tasks, and pair CUA experiments with a robust runner.
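As a sketch of that mitigation pattern (generic Python, not any vendor's API): a retry wrapper that backs off on flaky failures and gates a hypothetical set of high-risk action names behind a human prompt:

```python
import time

HIGH_RISK = {"submit_payment", "delete_record"}  # hypothetical action names

def run_gui_action(action_name, perform, max_retries=3, backoff_s=2.0):
    """Run one agent-proposed GUI action with retries and a human gate.

    `perform` is a zero-arg callable supplied by your runner; the
    interface is illustrative, not a specific vendor's API.
    """
    if action_name in HIGH_RISK:
        if input(f"Approve '{action_name}'? [y/N] ").strip().lower() != "y":
            raise PermissionError(f"{action_name} rejected by reviewer")
    for attempt in range(1, max_retries + 1):
        try:
            return perform()
        except Exception:
            # Typical flaky failures: selector drift, focus loss,
            # late-loading DOM targets.
            if attempt == max_retries:
                raise
            time.sleep(backoff_s * attempt)  # linear backoff, then retry
```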
Google’s governed enterprise stack
Google positions Gemini 2.0 and Project Astra as the perception and low-latency runtime layer, with Vertex AI Agent Builder as the GCP-native control plane for orchestration. Gemini Enterprise aims to be a governed front door that provides discovery, central policy, and visibility across agents, with cross-suite context spanning Google Workspace and Microsoft 365 and connectors for business apps like Salesforce and SAP.
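The code-first path into Agent Builder runs through Google's Agent Development Kit (ADK). A minimal sketch of its quickstart shape, assuming the `google-adk` Python package; the agent and tool names here are hypothetical:

```python
from google.adk.agents import Agent

def lookup_order(order_id: str) -> dict:
    """Hypothetical tool: return order status from an internal system."""
    return {"order_id": order_id, "status": "shipped"}

# ADK agents are declarative config plus plain-Python tools; Agent
# Builder / Vertex AI supplies the managed runtime, policy, and
# observability around them.
root_agent = Agent(
    name="order_assistant",
    model="gemini-2.0-flash",
    instruction="Answer order-status questions using the provided tool.",
    tools=[lookup_order],
)
```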
Application surface and enterprise fit
Google extends agentic control into end-user workflows via app features and projects such as Agent Mode and Project Mariner. These consumer and prosumer surfaces serve both as proving grounds for UI safety patterns and as data sources for guardrails. If your priority is centralized policy, fleet-level visibility, and integration with existing enterprise suites, Google offers the most prescriptive path today.
Anthropic’s human-in-the-loop path
Anthropic blends Computer Use capabilities with Artifacts, an inline canvas that has evolved into an app-hosting and sharing surface. Computer Use drives cursor and keyboard interactions, shipped under a conservative rollout with explicitly documented error profiles. Artifacts lets teams rapidly build and publish interactive mini-apps backed by Claude, enabling quick prototyping and a clear billing model for published usage.
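A sketch of how Computer Use is requested through the Anthropic Python SDK's beta surface; tool-version strings and model ids shift between releases, so treat the specific values as assumptions to verify against current docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",       # first computer-use model
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",          # tool version tracks the model
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{"role": "user",
               "content": "Open Settings and enable dark mode."}],
    betas=["computer-use-2024-10-22"],
)

# The model returns actions (screenshot, click, type, ...) as tool_use
# blocks; your runner executes them and replies with tool_result blocks.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```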
Positioning and operational stance
Anthropic favors a cautious, policy-first expansion: rapid co-pilot experiences and human validation rather than blind autonomy. This is a good fit for teams that want fast iteration with explicit checkpoints and lower operational complexity.
Benchmarks that matter
- Function and tool calling: BFCL V4 measures multi-turn planning, tool routing, fidelity, and hallucination metrics. Use it to evaluate tool orchestration quality.
- GUI and computer use: OSWorld provides execution-based tests across hundreds of desktop tasks and is a practical minimum bar for GUI agents.
- Conversational tool agents: τ-Bench and τ²-Bench simulate domain rules and dual-control scenarios to validate policy adherence and multi-trial reliability (see the pass^k sketch after this list).
- Software engineering assistants: use SWE-Bench Verified or SWE-Bench Pro for end-to-end engineering tasks rather than lightweight unit-style checks.
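Multi-trial reliability is typically reported as pass^k: the probability that k independent runs of the same task all succeed. A small helper for the standard unbiased estimator, given c successes over n trials per task:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass^k: the probability that k independent
    trials of one task ALL succeed, given c successes in n total trials."""
    if not 0 <= c <= n or k > n:
        raise ValueError("need 0 <= c <= n and k <= n")
    return comb(c, k) / comb(n, k)

# A task solved in 6 of 8 trials scores 0.75 at k=1 but only ~0.21 at
# k=4: single-shot accuracy hides multi-trial unreliability.
print(pass_hat_k(8, 6, 1), pass_hat_k(8, 6, 4))
```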
Comparative takeaways
- OpenAI: a programmable substrate with a single API surface (Responses), lifecycle tooling (AgentKit), and a universal GUI controller (CUA). Best when teams want tight control and can operate runners and evaluation pipelines.
- Google: a governed enterprise plane with Vertex AI Agent Builder for orchestration and Gemini Enterprise for fleet policy and visibility. Best for centralized IT management and cross-suite integrations.
- Anthropic: a human-in-the-loop approach with Computer Use and Artifacts for rapid internal apps. Best when teams want quick prototypes with built-in human checkpoints.
Deployment guidance for technical teams
- Lock the runner before the model: keep execution harnesses, selectors, and OS-level setups constant while iterating on models and prompts (a pinning sketch follows this list).
- Decide where governance lives: choose Google for prescriptive fleet governance, OpenAI for a programmable substrate you manage, or Anthropic for product-level policy and human validation.
- Design for GUI failure and recovery: implement retries, page-state checks, and gates on irreversible actions to mitigate selector drift and focus loss.
- Optimize for your iteration style: Anthropic for rapid prototyping, OpenAI for programmable pipelines and hosted tools, Google for IT-managed, large-scale rollouts.
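For the first point, a sketch of what "locking the runner" can mean in practice: freeze the harness parameters in one immutable config so evaluation deltas are attributable to the model or prompt. Field names here are illustrative, not any harness's real API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunnerConfig:
    """Everything held constant while models and prompts iterate.

    Field names are illustrative; mirror whatever your harness exposes.
    """
    os_image: str = "ubuntu-22.04-desktop-v3"    # fixed VM snapshot
    screen_size: tuple = (1280, 800)
    selector_strategy: str = "aria-label-first"  # stabler than CSS paths
    max_retries: int = 3
    harness_version: str = "0.9.2"

BASELINE = RunnerConfig()
# Change the model or prompt freely against BASELINE; change the runner
# only in a separate, explicitly labeled experiment.
```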
Bottom line by vendor
- OpenAI: attractive for developer-first teams that will own evaluation and operations; validate with OSWorld and τ-Bench.
- Google: most prescriptive for enterprise governance and cross-suite context; validate with BFCL and OSWorld before scaling.
- Anthropic: pragmatic human-in-the-loop path for fast internal tools and safe rollouts; use τ-Bench and OSWorld to measure policy adherence and GUI reliability.
Editorial note
The 2025 agentic AI market is defined by three philosophies: programmable substrate, governed enterprise, and human-supervised app building. Technical superiority alone won’t decide winners; the platform that reduces deployment friction and aligns with enterprise operational realities will likely capture the largest share of adoption.