OpenAI Unveils GPT-Realtime: Unified Speech-to-Speech with SIP Calling and MCP Support

GPT-Realtime offers a single audio pipeline

OpenAI’s GPT-Realtime and the updated Realtime API move beyond the traditional chain of speech-to-text, language processing, and text-to-speech. The model ingests audio and processes it in a unified pipeline, which reduces latency and helps preserve vocal nuances that often get lost when audio is converted to text and back.

Measured performance gains

Benchmarks show meaningful improvements, though not a complete breakthrough. On the Big Bench Audio evaluation for reasoning, GPT-Realtime scores 82.8% accuracy versus 65.6% for OpenAI’s December 2024 model. Instruction following improves as well: MultiChallenge audio rises to 30.5% from 20.6%. Function calling on ComplexFuncBench moves to 66.5% from 49.7%.

These are significant jumps, but the absolute numbers underline remaining limits. An instruction-following score of 30.5% means the model still fails most of that benchmark's test cases, so many complex directions are still mishandled.
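To put the quoted scores in perspective, the short Python sketch below computes the absolute gain, the relative gain, and the share of each benchmark still missed. The scores are the ones reported above; the function and variable names are illustrative only.

```python
# Scores in percent, as quoted above: (December 2024 model, GPT-Realtime).
SCORES = {
    "Big Bench Audio": (65.6, 82.8),
    "MultiChallenge audio": (20.6, 30.5),
    "ComplexFuncBench": (49.7, 66.5),
}


def gains(previous: float, current: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = current - previous
    relative = absolute / previous * 100
    return round(absolute, 1), round(relative, 1)


def miss_rate(current: float) -> float:
    """Share of the benchmark (in percent) that is still not passed."""
    return round(100 - current, 1)


if __name__ == "__main__":
    for name, (prev, cur) in SCORES.items():
        points, pct = gains(prev, cur)
        print(f"{name}: +{points} points, +{pct}% relative, "
              f"{miss_rate(cur)}% still missed")
```

The relative gain is largest on MultiChallenge audio, which is also the benchmark with the lowest absolute score, so both readings of the numbers hold at once.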

Enterprise-focused features

OpenAI added several capabilities aimed at production deployments. SIP integration lets voice agents connect directly to phone networks and PBX systems, enabling traditional telephony workflows. Support for Model Context Protocol (MCP) servers makes it easier to connect external tools and services. Image input grounds conversations in visual context, so users can reference screenshots or photos.

A key operational feature is asynchronous function calling. The model can continue a conversation or speech output while waiting for long-running backend tasks such as database queries or API calls, addressing a common hurdle in real-world business applications.
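The idea can be illustrated with plain Python asyncio. This is a conceptual sketch only, not the Realtime API: slow_lookup and handle_turn are hypothetical names, and the sleep stands in for a slow database query or API call.

```python
import asyncio


async def slow_lookup(order_id: str) -> str:
    """Hypothetical backend call; the sleep simulates a slow query."""
    await asyncio.sleep(0.2)
    return f"order {order_id} ships tomorrow"


async def handle_turn(order_id: str) -> list[str]:
    """Keep producing speech while the tool call runs in the background."""
    spoken: list[str] = []
    # Start the long-running call without blocking the conversation.
    task = asyncio.create_task(slow_lookup(order_id))
    while not task.done():
        # Filler speech goes out while the result is still pending.
        spoken.append("still checking...")
        await asyncio.sleep(0.05)
    # Once the backend answers, the result is folded into the reply.
    spoken.append(task.result())
    return spoken


if __name__ == "__main__":
    print(asyncio.run(handle_turn("A123")))
```

A blocking design would instead await slow_lookup directly and stay silent until it returns, which is the gap that asynchronous function calling is meant to close.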

Pricing and market positioning

OpenAI set pricing at 32 USD per million audio input tokens and 64 USD per million audio output tokens, about a 20% reduction from the previous model. The move appears designed to pressure competitors; reports indicate Google and others are vying with lower-cost offerings for similar realtime voice functionality.
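As a rough illustration of what these rates mean in practice, the sketch below estimates the audio cost of a session from token counts. The two prices come from the paragraph above; the function name and the example token counts are assumptions, and the article does not say how many tokens a minute of audio consumes.

```python
# Prices quoted above, in USD per million audio tokens.
INPUT_USD_PER_MILLION = 32
OUTPUT_USD_PER_MILLION = 64


def audio_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate audio cost in USD for the given token counts."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical session: 250k input tokens and 125k output tokens.
    print(audio_cost_usd(250_000, 125_000))
```

Because output tokens cost twice as much as input tokens, a deployment where the agent speaks more than the caller will see output dominate the bill.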

OpenAI cites adoption statistics that point to heavy enterprise interest, with its products in use across many companies. Still, voice AI specialists warn that direct API access is rarely the full solution for enterprise rollouts, which often require additional integration, customization, and robustness work.

Ongoing technical challenges

Persistent issues remain in adverse acoustic conditions and varied accents. Background noise, domain-specific vocabulary, and long-term conversational context still degrade accuracy. Independent tests indicate that even top speech systems see significant performance drops in noisy environments or with diverse speaker accents.

Latency improvements are real but not universal. Sub-500 ms responses remain hard to achieve when agents must perform complex logic or call external services. Asynchronous function calling eases the problem in some use cases but does not remove the tradeoff between model complexity and responsiveness.

Practical implications

GPT-Realtime is a clear step forward: a more integrated audio architecture, enterprise-oriented features, and competitive pricing make production deployments more viable for contact centers, education tools, and assistants. However, existing accuracy and robustness limitations mean that truly seamless, natural voice AI for complex, noisy, or highly specialized tasks remains an area for further progress.

For technical details, refer to OpenAI’s announcement and linked resources.