In crowded voice AI market, OpenAI bets on instruction-following and expressive speech to win enterprise adoption

OpenAI has introduced a new voice model, GPT-Realtime, aimed at strengthening its position in the competitive enterprise voice AI sector. The model is designed to follow complex prompts and to produce more natural-sounding speech. As demand for realistic AI voices grows, particularly in applications such as customer service and real-time translation, OpenAI faces competition from companies like ElevenLabs, which specializes in voice technology.

GPT-Realtime will be integrated into OpenAI’s Realtime API, which is now generally available. Alongside the model, OpenAI introduced two new voices, Cedar and Marin, and updated its existing voices for better compatibility. In a recent livestream, the company said it worked with partners building voice applications to align GPT-Realtime with real-world needs, such as academic tutoring and customer support.

The model operates on a speech-to-speech framework: it takes spoken questions and answers them aloud. For instance, it can act as a human-sounding voice assistant that answers questions from customers returning a product. OpenAI has already partnered with companies such as T-Mobile and Zillow to demonstrate the model in customer interaction scenarios. GPT-Realtime is marketed as the company’s most advanced voice model, capable of switching languages mid-sentence and executing complex instructions.

However, OpenAI’s new offering competes with existing voice models like ElevenLabs’ Conversational AI 2.0 and the emulation capabilities of Hume’s EVI 3 model. The landscape is further complicated by other AI providers, including Mistral and Google, which are also enhancing their real-time voice capabilities.

OpenAI claims that GPT-Realtime has improved understanding of native audio, achieving an accuracy score of 82.8% in benchmarks, although it did not provide comparative data against competitor models. The new Realtime API also supports image inputs and can connect applications to telephony systems, broadening its potential use in contact centers.

In addition to the model updates, OpenAI has reduced the pricing for GPT-Realtime by 20%. The new rates are $32 per million audio input tokens and $64 per million audio output tokens.
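
To make the pricing concrete, here is a minimal Python sketch of the arithmetic. The per-million rates and the 20% figure come from the article; the function names and the token counts in the usage example are hypothetical, and the sketch covers audio tokens only, since no other rates are quoted here.

```python
# Audio prices in US dollars per one million tokens, as quoted in the article.
AUDIO_INPUT_USD_PER_MILLION = 32.0
AUDIO_OUTPUT_USD_PER_MILLION = 64.0
PRICE_CUT = 0.20  # the 20% reduction reported in the article


def audio_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the audio cost of a workload (hypothetical helper)."""
    total = (
        input_tokens * AUDIO_INPUT_USD_PER_MILLION
        + output_tokens * AUDIO_OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


def price_before_cut(current_price: float, cut: float = PRICE_CUT) -> float:
    """Back out the earlier price implied by a fractional cut (hypothetical helper)."""
    return current_price / (1.0 - cut)


if __name__ == "__main__":
    # Hypothetical workload: 500,000 audio input tokens and 250,000 output tokens.
    print(f"Estimated audio cost: ${audio_cost_usd(500_000, 250_000):.2f}")
    print(f"Implied earlier input price: ${price_before_cut(AUDIO_INPUT_USD_PER_MILLION):.2f}")
    print(f"Implied earlier output price: ${price_before_cut(AUDIO_OUTPUT_USD_PER_MILLION):.2f}")
```

Under these figures, a 20% cut implies earlier rates of about $40 and $80 per million audio tokens. Because output audio costs twice as much per token as input audio, how much the assistant speaks matters more to the bill than how much the caller does.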

Source: https://venturebeat.com/ai/in-crowded-voice-ai-market-openai-bets-on-instruction-following-and-expressive-speech-to-win-enterprise-adoption/
