Azure-Speech-Speech-to-text

Version: 1

Microsoft • Last updated December 2025

Transcribes streaming or recorded audio into readable text across 140+ languages and dialects. Accuracy can be further optimized with custom models for your specialized use cases.

Azure Speech

Azure Speech is a comprehensive suite of AI-powered speech capabilities that includes speech to text, text to speech, speech translation, and voice live AI. It enables developers to build intelligent voice-enabled applications with high accuracy, multilingual support, and customizable voice experiences.

Key capabilities

About this model

Speech to text offers various options to transcribe audio data into text.

Real-time speech to text: Instant transcription with intermediate results for streaming audio inputs.

Fast transcription: Fastest synchronous file-based processing for situations with predictable latency.

Batch transcription: Efficient processing for large volumes of prerecorded audio files.

LLM speech (preview): Transcribe and translate audio files using LLM-enhanced speech models, with improved quality and support for prompt tuning.

Custom speech: Fine-tune models with enhanced accuracy for specific domains and use cases.

Key model capabilities

Real-time streaming, batch, or fast transcription of audio data
LLM-powered audio file transcription and translation (preview)
Multilingual audio processing
Diarization
Language identification
Word timing
Fine-tuning

Use cases

See Responsible AI for additional considerations for responsible use.

Key use cases

Use case	Scenario	Solution
Live meeting transcriptions and captions	A virtual event platform needs to provide real-time captions for webinars.	Integrate real-time speech to text using the Speech SDK to transcribe spoken content into captions displayed live during the event.
Customer service enhancement	A call center wants to assist agents by providing real-time transcriptions of customer calls.	Use real-time speech to text via the Speech CLI to transcribe calls, enabling agents to better understand and respond to customer queries.
Video subtitling	A video-hosting platform wants to quickly generate a set of subtitles for a video.	Use fast transcription to quickly get a set of subtitles for the entire video.
Educational tools	An e-learning platform aims to provide transcriptions for video lectures.	Apply batch transcription through the speech to text REST API to process prerecorded lecture videos, generating text transcripts for students.
Healthcare documentation	A healthcare provider needs to document patient consultations.	Use real-time speech to text for dictation, allowing healthcare professionals to speak their notes and have them transcribed instantly. Use a custom model to enhance recognition of specific medical terms.
Media and entertainment	A media company wants to create subtitles for a large archive of videos.	Use batch transcription to process the video files in bulk, generating accurate subtitles for each video.
Market research	A market research firm needs to analyze customer feedback from audio recordings.	Employ batch transcription to convert audio feedback into text, enabling easier analysis and insights extraction.

Out of scope use cases

Conversation transcription with speaker recognition: The Speech service is not designed to provide diarization with speaker recognition, and it cannot be used to identify individuals. In other words, speakers are presented as Guest1, Guest2, Guest3, and so on, in the transcription. These labels are randomly assigned for each conversation transcription and may not be used to identify individual speakers in the conversation. To prevent any potential for misuse of the Speech service for identification purposes, you are responsible for ensuring that you use the service, including the diarization feature, only for supported uses, and that you have a proper legal basis and any required consents in place for all uses of the service.
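Because the Guest labels are assigned per conversation, a label such as Guest1 in one transcription says nothing about Guest1 in another. The following is a minimal sketch of one way to keep that distinction when post-processing diarized output. The function name `group_by_speaker`, the conversation IDs, and the `(speaker_label, text)` tuple shape are hypothetical and are not the service's response schema.

```python
from collections import defaultdict


def group_by_speaker(conversation_id, phrases):
    """Group phrase texts by speaker label within a single conversation.

    `phrases` is an iterable of (speaker_label, text) tuples. This shape is
    an assumption made for illustration, not the service's response format.
    Keys include the conversation ID so that a label from one conversation is
    never merged with the same label from another conversation.
    """
    grouped = defaultdict(list)
    for speaker_label, text in phrases:
        grouped[(conversation_id, speaker_label)].append(text)
    return dict(grouped)


first_call = group_by_speaker(
    "call-001",
    [("Guest1", "Hello."), ("Guest2", "Hi."), ("Guest1", "How are you?")],
)
second_call = group_by_speaker("call-002", [("Guest1", "Good morning.")])
```

Here `first_call` and `second_call` share no keys, even though both contain a Guest1 label, which reflects that the labels are not identities.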

Pricing

Pay-As-You-Go & Commitment Tiers. See pricing details here.

Technical specs

Speech to text offers the following core features:

Real-time speech to text: Instant transcription with intermediate results for streaming audio inputs.

Fast transcription: Fastest synchronous file-based processing for situations with predictable latency.

Batch transcription: Efficient processing for large volumes of prerecorded audio files.

LLM speech (preview): Transcribe and translate audio files using LLM-enhanced speech models, with improved quality and support for prompt tuning.

Custom speech: Fine-tune models with enhanced accuracy for specific domains and use cases.

Training cut-off date

This information is not available.

Input formats

Real-time speech to text: 8-kHz/16-kHz mono audio, PCM, ALAW, MULAW, G722

Fast transcription and batch transcription: WAV, MP3, OPUS/OGG, FLAC, WMA, AAC, ALAW in WAV container, MULAW in WAV container, AMR, WebM, SPEEX
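As an illustration of the real-time requirements, the following sketch checks the header of a PCM WAV file for mono audio at 8 kHz or 16 kHz before you stream it. It uses only the Python standard library. The helper names `check_realtime_wav` and `make_silent_wav` are hypothetical, and the use of a WAV container for PCM input is an assumption; the sketch does not cover ALAW, MULAW, or G722 input.

```python
import io
import wave

# Sample rates listed for real-time speech to text.
SUPPORTED_SAMPLE_RATES_HZ = (8000, 16000)


def check_realtime_wav(source):
    """Return a list of problems found in a PCM WAV header.

    `source` is a file path or a binary file-like object. An empty list
    means the header reports mono audio at 8 kHz or 16 kHz.
    """
    try:
        with wave.open(source, "rb") as wav:
            channels = wav.getnchannels()
            sample_rate = wav.getframerate()
    except (wave.Error, EOFError) as exc:
        # The standard-library reader only handles uncompressed PCM WAV data.
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if sample_rate not in SUPPORTED_SAMPLE_RATES_HZ:
        problems.append(f"expected 8000 or 16000 Hz, found {sample_rate} Hz")
    return problems


def make_silent_wav(channels, sample_rate, frames=160):
    """Build a small in-memory 16-bit PCM WAV, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * channels * frames)
    buffer.seek(0)
    return buffer


ok_problems = check_realtime_wav(make_silent_wav(1, 16000))
bad_problems = check_realtime_wav(make_silent_wav(2, 44100))
```

In this sketch `ok_problems` is empty, while `bad_problems` reports both the channel count and the sample rate. A header check like this only catches format mismatches early; it does not replace the validation that the service itself performs.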

Supported languages

Speech to text supports over 140 locales.

Supported Azure regions

List of supported Azure regions.

Sample JSON response

Please refer to the sample JSON for real-time transcription, fast transcription, or batch transcription according to your usage.

Model architecture

This information is not available.

Long context

Fast transcription: 2 hours/300 MB per audio file
Batch transcription: 4 hours/1 GB per audio file
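The per-file limits can be turned into a simple pre-flight check that decides which file-based modes a recording qualifies for. The sketch below is illustrative only: the function name `eligible_modes` is hypothetical, the limits are treated as inclusive maximums, and MB and GB are interpreted as decimal units (10^6 and 10^9 bytes), which is the more conservative reading. Neither of those interpretations is stated in this document.

```python
# Limits per audio file as listed for each mode.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 GB = 10**9 bytes).
FAST_MAX_SECONDS = 2 * 60 * 60
FAST_MAX_BYTES = 300 * 10**6
BATCH_MAX_SECONDS = 4 * 60 * 60
BATCH_MAX_BYTES = 10**9


def eligible_modes(duration_seconds, size_bytes):
    """Return the file-based transcription modes whose limits the file meets.

    Assumption: both limits are inclusive, and a file must satisfy the
    duration limit and the size limit of a mode to qualify for it.
    """
    modes = []
    if duration_seconds <= FAST_MAX_SECONDS and size_bytes <= FAST_MAX_BYTES:
        modes.append("fast")
    if duration_seconds <= BATCH_MAX_SECONDS and size_bytes <= BATCH_MAX_BYTES:
        modes.append("batch")
    return modes


one_hour_small = eligible_modes(1 * 60 * 60, 50 * 10**6)
three_hours_medium = eligible_modes(3 * 60 * 60, 500 * 10**6)
five_hours = eligible_modes(5 * 60 * 60, 100 * 10**6)
```

A one-hour, 50 MB file qualifies for both modes, a three-hour, 500 MB file only for batch transcription, and a five-hour file for neither, in which case the audio would need to be split before submission.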

Optimizing model performance

Phrase list: Lightweight runtime customization that improves recognition quality. Common example phrases are names, geographical locations, homonyms, and words or acronyms unique to your industry or organization. Learn more about phrase lists.

Custom speech: Fine-tune models with your own data to enhance accuracy for specific domains and use cases. Learn more about custom speech.

Additional assets

This information is not available.

Distribution

You can deploy Azure AI Speech features in the cloud or on-premises. With containers, you can bring the service closer to your data for compliance, security, or other operational reasons.

Speech service deployment in sovereign clouds is available for some government entities and their partners. For example, the Azure Government cloud is available to US government entities and their partners. Microsoft Azure operated by 21Vianet cloud is available to organizations with a business presence in China. For more information, see sovereign clouds.

The Speech CLI is a command-line tool for using Speech service without having to write any code. Most features in the Speech SDK are available in the Speech CLI, and some advanced features and customizations are simplified in the Speech CLI.

The Speech SDK exposes many of the Speech service capabilities you can use to develop speech-enabled applications. The Speech SDK is available in many programming languages and across all platforms.

In some cases, you can't or shouldn't use the Speech SDK. In those cases, you can use REST APIs to access the Speech service. For example, use REST APIs for batch transcription and fast transcription.

More information

Learn more in the full Azure AI Speech Service documentation.

Responsible AI considerations

Safety techniques

Refer to the guidance for integration and responsible use with speech to text.

Safety evaluations

This information is not available.

Known limitations

Speech to text recognizes what's spoken in an audio input, and then generates transcription outputs. This requires proper setup for the expected languages used in the audio input and spoken styles. Non-optimal settings might lead to lower accuracy. Refer to Technical limitations, operational factors, and ranges for more details.

Acceptable use

Acceptable use policy

The speech to text API offers convenient options for developing voice-enabled applications, but it is very important to consider the context in which you will integrate the API. You must ensure that you comply with all laws and regulations that apply to your application. This includes understanding your obligations under privacy and communication laws, including national and regional privacy, eavesdropping, and wiretap laws that apply to your jurisdiction. Collect and process only audio that is within the reasonable expectations of your users. This includes ensuring that you have all necessary and appropriate consents from users for you to collect, process, and store their audio data. Refer to Technical limitations, operational factors, and ranges for more details.

Terms of Service

Terms of Service Link

Azure Speech - Speech to text is provided under Microsoft’s proprietary licensing terms. Access to the model is subscription-based and governed by Microsoft’s product licensing policies.

License Type: Proprietary
Access Model: Subscription-based via Azure services
Terms of Service: https://microsoft.com/licensing/terms/

Model Specifications

Last Updated: December 2025

Input Type: Audio

Output Type: Text

Provider: Microsoft

Quick Start