Azure-Speech-Speech-to-text
Version: 1
Azure Speech
Azure Speech is a comprehensive suite of AI-powered speech capabilities that includes speech to text, text to speech, speech translation, and voice live AI. It enables developers to build intelligent voice-enabled applications with high accuracy, multilingual support, and customizable voice experiences.
Key capabilities
About this model
Speech to text offers various options to transcribe audio data into text.
Real-time speech to text: Instant transcription with intermediate results for streaming audio inputs.
Fast transcription: Fastest synchronous file-based processing for situations with predictable latency.
Batch transcription: Efficient processing for large volumes of prerecorded audio files.
LLM speech (preview): Transcribe and translate audio files using LLM-enhanced speech models, with improved quality and support for prompt tuning.
Custom speech: Fine-tune models with enhanced accuracy for specific domains and use cases.
Key model capabilities
- Real-time streaming, batch, or fast transcription of audio data
- LLM-powered audio file transcription and translation (preview)
- Multilingual audio processing
- Diarization
- Language identification
- Word timing
- Fine tuning
Use cases
See Responsible AI for additional considerations for responsible use.
Key use cases
| Use case | Scenario | Solution |
|---|---|---|
| Live meeting transcriptions and captions | A virtual event platform needs to provide real-time captions for webinars. | Integrate real-time speech to text using the Speech SDK to transcribe spoken content into captions displayed live during the event. |
| Customer service enhancement | A call center wants to assist agents by providing real-time transcriptions of customer calls. | Use real-time speech to text via the Speech CLI to transcribe calls, enabling agents to better understand and respond to customer queries. |
| Video subtitling | A video-hosting platform wants to quickly generate a set of subtitles for a video. | Use fast transcription to quickly get a set of subtitles for the entire video. |
| Educational tools | An e-learning platform aims to provide transcriptions for video lectures. | Apply batch transcription through the speech to text REST API to process prerecorded lecture videos, generating text transcripts for students. |
| Healthcare documentation | A healthcare provider needs to document patient consultations. | Use real-time speech to text for dictation, allowing healthcare professionals to speak their notes and have them transcribed instantly. Use a custom model to enhance recognition of specific medical terms. |
| Media and entertainment | A media company wants to create subtitles for a large archive of videos. | Use batch transcription to process the video files in bulk, generating accurate subtitles for each video. |
| Market research | A market research firm needs to analyze customer feedback from audio recordings. | Employ batch transcription to convert audio feedback into text, enabling easier analysis and insights extraction. |
Out of scope use cases
Conversation transcription with speaker recognition: The Speech service is not designed to provide diarization with speaker recognition, and it cannot be used to identify individuals. In other words, speakers will be presented as Guest1, Guest2, Guest3, and so on, in the transcription. These labels are assigned randomly for each conversation transcription and may not be used to identify individual speakers in the conversation. To prevent any potential for misuse of the Speech service for identification purposes, you are responsible for ensuring that you use the service, including the diarization feature, only for supported uses, and that you have a proper legal basis and any required consents in place for all uses of the service.
Pricing
Pay-As-You-Go & Commitment Tiers. See the Azure Speech pricing page for details.
Technical specs
Speech to text offers the following core features:
Real-time speech to text: Instant transcription with intermediate results for streaming audio inputs.
Fast transcription: Fastest synchronous file-based processing for situations with predictable latency.
Batch transcription: Efficient processing for large volumes of prerecorded audio files.
LLM speech (preview): Transcribe and translate audio files using LLM-enhanced speech models, with improved quality and support for prompt tuning.
Custom speech: Fine-tune models with enhanced accuracy for specific domains and use cases.
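As a rough illustration of the synchronous, file-based flow that fast transcription provides, the sketch below builds (without sending) a single multipart HTTP request using only the Python standard library. The endpoint path, the `api-version` value, the form field names (`audio`, `definition`), and the `Ocp-Apim-Subscription-Key` header are assumptions recalled from the public REST reference rather than details stated in this card, and the function name is hypothetical. Verify all of them against the current documentation before relying on this.

```python
import json
import urllib.request
import uuid

# ASSUMPTION: API version string; check the current REST reference.
API_VERSION = "2024-11-15"


def build_fast_transcription_request(region, key, audio_bytes, filename, locales):
    """Build (but do not send) a multipart request for fast transcription.

    Hypothetical helper: the URL shape and field names are assumptions.
    """
    boundary = uuid.uuid4().hex
    # The transcription options travel as a JSON form field.
    definition = json.dumps({"locales": locales})

    # First part: the raw audio file.
    audio_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")

    # Second part: the JSON definition, followed by the closing boundary.
    definition_part = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="definition"\r\n'
        "Content-Type: application/json\r\n\r\n"
        f"{definition}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")

    url = (
        f"https://{region}.api.cognitive.microsoft.com"
        f"/speechtotext/transcriptions:transcribe?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=audio_header + audio_bytes + definition_part,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request and block until the transcription JSON comes back, which is what makes this mode synchronous. Batch transcription, by contrast, is submitted as a job and polled for results.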
Training cut-off date
This information is not available.
Input formats
Real-time speech to text: 8-kHz/16-kHz mono audio; PCM, ALAW, MULAW, G722
Fast transcription and batch transcription: WAV, MP3, OPUS/OGG, FLAC, WMA, AAC, ALAW in WAV container, MULAW in WAV container, AMR, WebM, SPEEX
Supported language
Speech to text supports over 140 locales.
Supported Azure regions
See the list of supported Azure regions.
Sample JSON response
Please refer to the sample JSON for real-time transcription, fast transcription, or batch transcription according to your usage.
Model architecture
This information is not available.
Long context
Fast transcription: 2 hrs/300 MB per audio file
Batch transcription: 4 hrs/1 GB per audio file
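The input formats and per-file limits listed above can be checked locally before choosing a transcription mode. The following is a minimal sketch for PCM WAV files only, using the Python standard library. The names `WavInfo`, `read_wav_info`, and `eligible_modes` are hypothetical. Treating the limits as inclusive, interpreting MB/GB as binary units, and treating the listed real-time sample rates as strict requirements are all assumptions made for illustration.

```python
import os
import wave
from dataclasses import dataclass

# Limits taken from the figures above. ASSUMPTION: bounds are inclusive and
# MB/GB are binary units (1 MB = 1024 * 1024 bytes).
FAST_MAX_SECONDS = 2 * 60 * 60
FAST_MAX_BYTES = 300 * 1024 * 1024
BATCH_MAX_SECONDS = 4 * 60 * 60
BATCH_MAX_BYTES = 1024 * 1024 * 1024

# Sample rates listed for real-time speech to text (mono audio).
REALTIME_SAMPLE_RATES = {8000, 16000}


@dataclass
class WavInfo:
    sample_rate: int
    channels: int
    duration_seconds: float
    size_bytes: int


def read_wav_info(path: str) -> WavInfo:
    """Read basic properties of a PCM WAV file.

    The standard-library wave module raises wave.Error for formats it
    cannot parse, so callers should handle that case separately.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return WavInfo(
            sample_rate=rate,
            channels=wav.getnchannels(),
            duration_seconds=wav.getnframes() / rate,
            size_bytes=os.path.getsize(path),
        )


def eligible_modes(info: WavInfo) -> list[str]:
    """Return the transcription modes whose listed constraints the file meets."""
    modes = []
    if info.channels == 1 and info.sample_rate in REALTIME_SAMPLE_RATES:
        modes.append("real-time")
    if info.duration_seconds <= FAST_MAX_SECONDS and info.size_bytes <= FAST_MAX_BYTES:
        modes.append("fast")
    if info.duration_seconds <= BATCH_MAX_SECONDS and info.size_bytes <= BATCH_MAX_BYTES:
        modes.append("batch")
    return modes
```

For example, `eligible_modes(read_wav_info("meeting.wav"))` (with a hypothetical file name) returns the subset of `"real-time"`, `"fast"`, and `"batch"` that the file qualifies for under these assumptions. A three-hour stereo recording at 44.1 kHz would qualify only for batch transcription.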
Optimizing model performance
Phrase list: Lightweight runtime customization that improves recognition quality. Common example phrases are names, geographical locations, homonyms, and words or acronyms unique to your industry or organization. Learn more about phrase lists.
Custom speech: Fine-tune models with enhanced accuracy for specific domains and use cases by adding your own data. Learn more about custom speech.
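A phrase list is attached to a recognizer at runtime. The sketch below shows one way this might look with the Speech SDK for Python (the `azure-cognitiveservices-speech` package), assuming a key, region, and WAV file are supplied by the caller. The helper names `clean_phrases` and `recognize_with_phrases` are hypothetical, and the deduplication step is an illustrative choice rather than something the service requires.

```python
def clean_phrases(phrases):
    """Strip whitespace and drop empty or case-insensitive duplicate phrases.

    Order of first appearance is preserved.
    """
    seen = set()
    cleaned = []
    for phrase in phrases:
        text = phrase.strip()
        key = text.casefold()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


def recognize_with_phrases(key, region, wav_path, phrases, language="en-US"):
    """Recognize a single utterance from a WAV file with a phrase list applied.

    ASSUMPTION: the Speech SDK is installed; it is imported lazily so that
    clean_phrases can be used without it.
    """
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_recognition_language = language
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    # Attach the phrases to this recognizer only; nothing is trained or stored.
    phrase_list = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
    for phrase in clean_phrases(phrases):
        phrase_list.addPhrase(phrase)

    # Blocks until one utterance has been recognized.
    return recognizer.recognize_once().text
```

Because the phrases apply only to the recognizer they are attached to, this suits a small, changing vocabulary such as attendee names for one meeting. A large, stable domain vocabulary is the case custom speech is meant for.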
Additional assets
This information is not available.
Distribution
You can deploy Azure AI Speech features in the cloud or on-premises. With containers, you can bring the service closer to your data for compliance, security, or other operational reasons.
Speech service deployment in sovereign clouds is available for some government entities and their partners. For example, the Azure Government cloud is available to US government entities and their partners. Microsoft Azure operated by 21Vianet cloud is available to organizations with a business presence in China. For more information, see sovereign clouds.
The Speech CLI is a command-line tool for using Speech service without having to write any code. Most features in the Speech SDK are available in the Speech CLI, and some advanced features and customizations are simplified in the Speech CLI.
The Speech SDK exposes many of the Speech service capabilities you can use to develop speech-enabled applications. The Speech SDK is available in many programming languages and across all platforms.
In some cases, you can't or shouldn't use the Speech SDK. In those cases, you can use REST APIs to access the Speech service. For example, use REST APIs for batch transcription and fast transcription.
More information
Learn more in the full Azure AI Speech Service documentation.
Responsible AI considerations
Safety techniques
Refer to the guidance for integration and responsible use with speech to text.
Safety evaluations
This information is not available.
Known limitations
Speech to text recognizes what's spoken in an audio input, and then generates transcription outputs. This requires proper setup for the expected languages used in the audio input and spoken styles. Non-optimal settings might lead to lower accuracy. Refer to Technical limitations, operational factors, and ranges for more details.
Acceptable use
Acceptable use policy
The speech to text API offers convenient options for developing voice-enabled applications, but it is very important to consider the context in which you will integrate the API. You must ensure that you comply with all laws and regulations that apply to your application. This includes understanding your obligations under privacy and communication laws, including national and regional privacy, eavesdropping, and wiretap laws that apply to your jurisdiction. Collect and process only audio that is within the reasonable expectations of your users. This includes ensuring that you have all necessary and appropriate consents from users for you to collect, process, and store their audio data. Refer to Technical limitations, operational factors, and ranges for more details.
Terms of Service
Terms of Service Link
Azure Speech - Speech to text is provided under Microsoft’s proprietary licensing terms. Access to the model is subscription-based and governed by Microsoft’s product licensing policies.
- License Type: Proprietary
- Access Model: Subscription-based via Azure services
- Terms of Service: https://microsoft.com/licensing/terms/
Model Specifications
Last Updated: December 2025
Input Type: Audio
Output Type: Text
Provider: Microsoft