Azure-Speech-Text-to-speech

Text-to-speech enables your applications, tools, or devices to convert text into natural synthesized speech. It leverages advanced out-of-the-box [prebuilt neural voices](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?t

Microsoft

Version: 1

Azure Speech

Azure Speech is a comprehensive suite of AI-powered speech capabilities that includes speech to text, text to speech, speech translation, and voice live AI. It enables developers to build intelligent voice-enabled applications with high accuracy, multilingual support, and customizable voice experiences.

Key capabilities

About this model

Azure text to speech (TTS) is a neural speech synthesis model designed to convert written text into highly natural speech. It excels in delivering expressive and context-aware audio output using prebuilt neural voices or custom voice models tailored to specific brands or applications.

This model is particularly valuable for developers building applications that require lifelike voice interaction, such as virtual assistants, accessibility tools, customer service bots, and content narration. With SSML-based speech tuning, multilingual support, and batch synthesis for long-form audio, Azure TTS offers flexibility, scalability, and high-quality voice generation across diverse use cases.

Key model capabilities

Azure Text-to-Speech offers several core capabilities that make it a powerful tool for developers building voice-enabled applications:

Neural Voice Synthesis
Delivers highly natural and expressive speech using advanced deep learning models. Neural voices replicate human intonation, rhythm, and emotion, enhancing user engagement across conversational interfaces.
Custom Neural Voice
Enables creation of unique, brand-specific voices through voice talent recordings and model training. This allows organizations to deliver consistent and personalized audio experiences across platforms.
SSML-Based Speech Tuning
Supports Speech Synthesis Markup Language (SSML) for fine-grained control over speech output, including pitch, rate, volume, pronunciation, and pauses.
Multilingual and Regional Voice Support
Offers more than 150 languages and variants with multiple voice options per locale, making it well suited to global applications and inclusive user experiences.

Each capability is designed to help developers create high-quality, natural, and scalable voice interactions for a wide range of use cases, from accessibility tools and virtual assistants to media narration and customer service automation.
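
As an illustration of the SSML-based speech tuning described above, here is a minimal sketch that builds an SSML document using only the Python standard library. The helper name build_ssml and its parameters are hypothetical, and the voice name en-US-JennyNeural is assumed to be one of the prebuilt neural voices; check the language support page for the voices available to you. The speak, voice, prosody, and break elements come from the W3C SSML specification.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US",
               rate="medium", pitch="medium", volume="medium", pause_ms=0):
    """Return an SSML string that wraps `text` in prosody controls.

    `build_ssml` is a hypothetical helper for illustration, and the default
    voice name is an assumption, not something this page specifies.
    """
    # Optional pause inserted before the spoken text.
    pause = f'<break time="{int(pause_ms)}ms"/>' if pause_ms > 0 else ""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'{pause}'
        f'<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)} '
        f'volume={quoteattr(volume)}>'
        # Escape &, <, and > so the input text cannot break the markup.
        f'{escape(text)}'
        '</prosody></voice></speak>'
    )


ssml = build_ssml("Terms & conditions apply.", rate="slow", pitch="+5%", pause_ms=300)
print(ssml)
```

The resulting string would then be sent to the service for synthesis, for example through the Speech SDK's SSML synthesis call. That step needs a subscription key and region, so it is left out of this sketch.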

Quick facts

Model provider: Microsoft

Type: Text to speech, Audio generation

Lifecycle: Generally available (GA)

Input type: text

Output type: audio

