gpt-realtime-whisper
Direct from Azure models are a select portfolio curated for their market-differentiated capabilities:
- Secure and managed by Microsoft: Purchase and manage models directly through Azure with a single license, consistent support, and no third-party dependencies, backed by Azure's enterprise-grade infrastructure.
- Streamlined operations: Benefit from unified billing, governance, and seamless PTU portability across models hosted on Azure - all part of Microsoft Foundry.
- Future-ready flexibility: Access the latest models as they become available, and easily test, deploy, or switch between them within Microsoft Foundry, reducing integration effort.
- Cost control and optimization: Scale on demand with pay-as-you-go flexibility or reserve PTUs for predictable performance and savings.
Learn more about Direct from Azure models.
About this model
gpt-realtime-whisper is a low‑latency, streaming speech‑to‑text model designed for real‑time transcription of live audio. It continuously processes incoming audio streams and converts spoken language into text, making it well suited for conversational AI, voice assistants, and live captioning scenarios. The model is optimized for robustness across diverse accents, speaking styles, and acoustic conditions, so it can transcribe reliably in dynamic, real‑world environments while keeping latency low enough for interactive applications.
Supported regions: Canada Central, France Central, and India South. More coming soon.
Key model capabilities
Key Features:
- Real-time speech-to-text transcription: Continuously converts streaming audio into text during live interactions.
- Low-latency streaming performance: Processes incoming audio incrementally to support interactive and near real-time applications.
- Optimized for live audio input: Designed to handle continuous microphone or call audio streams rather than batch uploads.
- Robust speech recognition: Transcribes spoken language across varied speaking styles and environments.
- Supports conversational pipelines: Enables downstream use cases such as voice agents, live captioning, and transcription-driven workflows.
- Text-only output (no audio generation): Produces structured transcription output without generating synthesized speech.
- Integrates with realtime model stack: Can be paired with translation or conversational models to build full end-to-end voice experiences.
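The incremental, stream-oriented input described above can be illustrated with a minimal client-side sketch. It only shows how live audio might be split into small frames and wrapped as messages; it does not call any service. The event shape (`input_audio_buffer.append` with base64 audio) is modeled on the OpenAI Realtime API's client events, and whether a gpt-realtime-whisper deployment accepts exactly this shape is an assumption. The sample rate, 16-bit mono PCM format, 100 ms frame size, and the helper names `chunk_pcm` and `to_append_event` are likewise hypothetical choices for illustration.

```python
import base64
import json

# Assumptions for illustration only: 24 kHz, 16-bit mono PCM, 100 ms frames.
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2
CHUNK_MS = 100


def chunk_pcm(pcm: bytes, chunk_ms: int = CHUNK_MS):
    """Yield consecutive fixed-duration slices of raw PCM audio.

    The last slice may be shorter than the others.
    """
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def to_append_event(chunk: bytes) -> str:
    """Wrap one audio slice as a JSON message with base64-encoded audio.

    The event type follows the OpenAI Realtime API naming; treating it as
    valid for this model is an assumption.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })


if __name__ == "__main__":
    # One second of silence stands in for microphone or call audio.
    one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    events = [to_append_event(c) for c in chunk_pcm(one_second)]
    print(f"{len(events)} messages ready to send")
```

Sending small frames as they are captured, rather than uploading a finished recording, is what lets a streaming transcriber return text while the speaker is still talking. In a real client, each message would be written to an open streaming connection as soon as it is produced.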