MAI-Voice-1
Version: 2025-12-18
MAI-Voice-1 is a text-to-speech (TTS) model that generates high-fidelity, highly natural, and expressive speech. It captures human-like intonation, rhythm, and emotional nuance, enabling more engaging and lifelike conversational experiences. It strictly follows the provided transcript and supports per-turn emotion control.
About this model
There are two ways to set the voice for your project.
• Curated voice library: Licensed voices designed to work straight out of the box.
• Voice prompting: Provide an audio clip a few seconds long with your request, and the model matches it instantly.
Key capabilities
• Natural voice synthesis.
• High-fidelity, high-clarity voice output.
• Licensed voices designed to work straight out of the box.
• Voice prompting: Instantly generate natural speech in any consented voice, without additional training/fine-tuning.
• Long-form content generation while maintaining speaker consistency.
Key model capabilities
- High-fidelity natural voice synthesis - Produces voice with the intonation, rhythm, and emotional range of a real speaker.
- State-of-the-art voice prompting - Provide an audio clip of a few seconds (up to 120 seconds) and the model clones the voice instantly. No fine-tuning is required, so you can easily onboard a consented voice of your choice. Access requires Microsoft approval, and guardrails are in place to prevent misuse.
- Fine-grained control - Shape delivery at the turn or sentence level by controlling the emotion and tone of the output.
- Long-form content - Built for extended content, including audiobooks, lectures, podcasts, training materials, and long-form narration.
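The turn-level emotion control described above is driven through SSML input. The following Python sketch shows one way such a document could be assembled. It assumes that the `mstts:express-as` element from Azure Speech SSML is the mechanism MAI-Voice-1 uses for emotion; the voice name `en-US-ExampleVoice` and the style values are hypothetical placeholders, not confirmed identifiers for this model.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespaces used by Azure Speech SSML documents.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def build_ssml(turns, voice="en-US-ExampleVoice", lang="en-US"):
    """Build an SSML document from (text, style) turns.

    `style` may be None for a turn with no explicit emotion.
    The default voice name is a hypothetical placeholder.
    """
    parts = []
    for text, style in turns:
        body = escape(text)  # escape &, <, > so the transcript stays valid XML
        if style:
            # Assumption: emotion is expressed with mstts:express-as.
            body = (
                f"<mstts:express-as style={quoteattr(style)}>"
                f"{body}</mstts:express-as>"
            )
        parts.append(f"<voice name={quoteattr(voice)}>{body}</voice>")
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang={quoteattr(lang)}>'
        + "".join(parts)
        + "</speak>"
    )


# Example: two turns, the first with an (assumed) "cheerful" style.
ssml = build_ssml([
    ("Welcome back! Ready for chapter two?", "cheerful"),
    ("Let's begin.", None),
])
```

Each turn is wrapped in its own `voice` element so that a style applies to that turn only, which mirrors the per-turn control the model card describes.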
Use cases
Text to speech offers a variety of features catering to a wide range of intended uses across industries and domains. All text to speech features are subject to the terms and conditions applicable to customers’ Azure subscription, including the Azure Acceptable Use Policy and the Code of conduct for Azure AI Speech text to speech.
Key use cases
- Media: Entertainment - Give characters a voice. Generate expressive, lifelike audio for games, films, podcasts, audiobooks, and immersive AR/VR experiences.
- Virtual Assistants and Chatbots - Make your assistant sound like it belongs in your product. Power conversational agents across apps, vehicles, appliances, and customer service with a branded voice.
- Accessibility Features - Build products that more people can use. Add audio narration for visually impaired users and voice support for individuals with speech impairments.
- Educational and Interactive Learning - Build character and brand voices for online courses, interactive lessons, simulations, and guided tours.
- Media: Marketing and Advertising - Develop a consistent, recognizable voice across product launches, campaigns, and ads.
- Self-authored Content - Voice talent can bring blogs, books, social media content, and personal stories to life using a custom voice built from their own.
- Interactive Voice Response (IVR) Systems - Build dynamic, natural and expressive voices for call centers and automated phone interactions.
- Public Service and Informational Announcements - Deliver clear and engaging voice messages for public venues, traffic updates, weather alerts, event information, and schedules.
Out of scope use cases
Any use of the service that is inconsistent with the Code of Conduct is out of scope and is restricted.
Pricing
Among HD voices, MAI-Voice-1 is priced at $22.00 per 1 million characters.
Technical specs
MAI-Voice-1 is a text-to-speech (TTS) model that generates high-fidelity, highly natural, and expressive speech. It captures human-like intonation, rhythm, and emotional nuance, enabling more engaging and lifelike conversational experiences. It strictly follows the provided transcript and supports per-turn emotion control.
Training cut-off date
This information is not available.
Input formats
Plain text or Speech Synthesis Markup Language (SSML), which supports emotion control.
Supported language
English (expansion to 10+ languages is planned).
Supported Azure regions
Currently available in Central US, Japan West, and Sweden Central. More regions are planned.
Sample JSON response
| Endpoint | Request Type | Response Format |
|---|---|---|
| POST /cognitiveservices/v1 | SSML + headers | Binary audio file (MP3 / WAV / Opus / etc.) |
| Speech SDK SpeakTextAsync | Text or SSML | SDK stream + result metadata |
| Batch synthesis API | Long-form SSML/Text | Asynchronous job → downloadable audio file |
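As a rough illustration of the first table row, the following Python sketch builds (but does not send) a POST request to /cognitiveservices/v1. The host pattern, header names, and output format string follow the general Azure AI Speech REST API and are assumed here to apply to MAI-Voice-1 unchanged; the voice name inside the sample SSML is a hypothetical placeholder.

```python
import urllib.request


def build_tts_request(ssml, region, key,
                      output_format="audio-24khz-48kbitrate-mono-mp3"):
    """Return a urllib Request for the synthesis endpoint.

    Assumption: MAI-Voice-1 is served from the standard regional
    Azure AI Speech text to speech host.
    """
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,          # Speech resource key
        "Content-Type": "application/ssml+xml",    # request body is SSML
        "X-Microsoft-OutputFormat": output_format,  # requested audio encoding
        "User-Agent": "mai-voice-1-sample",
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


# Hypothetical voice name; replace with a voice your resource can access.
sample_ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xml:lang="en-US"><voice name="en-US-ExampleVoice">Hello.</voice></speak>'
)
request = build_tts_request(sample_ssml, "swedencentral", "YOUR_KEY")

# To send the request and save the binary audio response:
# with urllib.request.urlopen(request) as response:
#     with open("output.mp3", "wb") as audio_file:
#         audio_file.write(response.read())
```

The response body is the binary audio itself, so it is written straight to a file rather than parsed.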
Model architecture
This information is not available.
Long context
This information is not available.
Optimizing model performance
Coming soon.
Additional assets
This information is not available.
Distribution
MAI-Voice-1 is available through the following methods to support a wide range of integration scenarios:
- Speech SDK - Integrate TTS capabilities directly into applications using Azure’s Speech SDK, available for platforms including .NET, Python, Java, JavaScript, and C++.
- REST API - Access TTS functionality via a public, subscription-based API for flexible integration into web services, mobile apps, and backend systems.
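For the Speech SDK route, a minimal Python sketch might look like the following. It uses the `azure-cognitiveservices-speech` package and assumes MAI-Voice-1 is selected like any other Azure voice, through the voice name on the speech configuration; the voice name passed in is up to the caller and no specific MAI-Voice-1 identifier is implied. The region short names are the Azure identifiers for the three regions listed in this document.

```python
# Azure short names for Central US, Japan West, and Sweden Central.
SUPPORTED_REGIONS = {"centralus", "japanwest", "swedencentral"}


def synthesize_to_file(text, key, region, voice, out_path):
    """Synthesize `text` to an audio file with the Speech SDK.

    Assumption: the model is selected by setting the voice name,
    as with other Azure text to speech voices.
    """
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"Region {region!r} is not in {sorted(SUPPORTED_REGIONS)}")

    # Imported lazily so the region check works without the SDK installed.
    # Install with: pip install azure-cognitiveservices-speech
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription=key, region=region)
    config.speech_synthesis_voice_name = voice
    audio = speechsdk.audio.AudioOutputConfig(filename=out_path)
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=config, audio_config=audio
    )

    # speak_text_async is the Python counterpart of SpeakTextAsync.
    result = synthesizer.speak_text_async(text).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Synthesis did not complete: {result.reason}")
    return out_path
```

Checking the region before creating the synthesizer surfaces a configuration mistake early instead of as a service error.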
More information
Learn more in the full Azure AI Speech Service documentation.
Responsible AI considerations
Safety techniques
This information is not available.
Safety evaluations
This information is not available.
Known limitations
Azure Text-to-Speech is designed with responsible AI principles in mind, but developers should be aware of the following limitations and risks:
- Linguistic Limitations - Only English is supported today; support for 10+ languages is planned.
- Context and Emotion - The model may struggle to accurately convey nuanced emotions or context-specific intonation, especially in complex or sensitive scenarios. SSML emotion tags and voice prompting can help mitigate this.
- Fairness and Representation - Voice datasets may reflect biases in gender, accent, age, or regional representation. Developers should evaluate voice selection carefully to ensure inclusive and equitable user experiences.
- Misuse Risks - Synthetic voices can be misused for impersonation, misinformation, or deceptive content. Developers should implement safeguards, such as watermarking, consent management, and usage monitoring.
- Reliability in High-Stakes Use Cases - For applications involving healthcare, legal, or emergency communication, MAI-Voice-1 should be used with caution and supplemented by human oversight.
Acceptable use
Acceptable use policy
Approved use cases for Azure text to speech include:
- Educational or Interactive Learning - For reading or speaking educational materials, online learning, interactive lesson plans, simulation learning, or guided museum tours.
- Media: Marketing or Entertainment - For product introductions, business promotion, advertisements, or speaking entertainment content such as video games, movies, TV, recorded music, podcasts, audiobooks, or AR/VR experiences.
- Accessibility Features - For audio description systems, narration, and communication support for individuals with speech impairments.
- Interactive Voice Response (IVR) Systems - For call center operations, telephony systems, and automated phone interactions.
- Public Service and Informational Announcements - For communicating public service information in venues or broadcasts (e.g., traffic, weather, events). Not intended for journalistic or news content.
- Translation and Localization - For translating conversations or audio media across different languages.
- Virtual Assistant or Chatbot - For smart assistants, web-based virtual agents, appliances, vehicles, toys, IoT device control, and customer service scenarios.
Terms of Service
MAI-Voice-1 is provided under Microsoft’s proprietary licensing terms. Access to the model is subscription-based and governed by Microsoft’s product licensing policies.
- License Type: Proprietary
- Access Model: Subscription-based via Azure services
- Terms of Service: https://microsoft.com/licensing/terms/
Model Specifications
Last Updated: May 2026
Input Type: Text
Output Type: Audio
Provider: Microsoft