MAI-Voice-1
Version: 2025-12-18
Microsoft
Last updated: May 2026
MAI-Voice-1 is a text-to-speech (TTS) model that generates high-quality single-speaker speech and, soon, multi-speaker speech for public preview. It produces audio that strictly follows the input transcript and supports per-turn emotion control as well as

MAI-Voice-1 is a text-to-speech (TTS) model that generates high-fidelity, highly natural, and expressive speech. It captures human-like intonation, rhythm, and emotional nuance, enabling more engaging and lifelike conversational experiences. It strictly follows the provided transcript and supports per-turn emotion control.

About this model

There are two ways to set the voice for your project.
• Curated voice library: Licensed voices designed to work straight out of the box.
• Voice prompting: Provide an audio clip a few seconds long with your request and the model matches the voice instantly.

Key capabilities

• Natural voice synthesis.
• High-fidelity, high-clarity voice output.
• Licensed voices designed to work straight out of the box.
• Voice prompting: Instantly generate natural speech in any consented voice, without additional training/fine-tuning.
• Long-form content generation that maintains speaker consistency.

Key model capabilities

  1. High-fidelity natural voice synthesis
    Produces voice with the intonation, rhythm, and emotional range of a real speaker.
  2. State-of-the-art voice prompting
    Provide a short audio clip (a few seconds, up to 120 seconds) and the model clones the voice instantly. No fine-tuning is required, so you can easily onboard a consented voice of your choice. Access requires Microsoft approval, and guardrails are in place to prevent misuse.
  3. Fine-grained control
    Shape delivery at the turn or sentence level by controlling the emotion and tone of the output.
  4. Long-form content
    Built for extended content covering audiobooks, lectures, podcasts, training materials, and long-form narration.
Together, these capabilities give developers the building blocks to ship voice at scale across accessibility, virtual assistants, media narration, and customer service.
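
The 120-second ceiling on voice-prompt clips can be checked locally before a request is sent. The following is a minimal sketch, assuming the clip is an uncompressed WAV file; the function names and the constant are illustrative, and this card does not state which audio formats the service accepts for prompting.

```python
import wave

# Upper bound on prompt clip length stated in this model card.
MAX_PROMPT_SECONDS = 120.0


def prompt_clip_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def is_valid_prompt_clip(path: str) -> bool:
    """True if the clip is non-empty and no longer than the stated limit."""
    seconds = prompt_clip_seconds(path)
    return 0.0 < seconds <= MAX_PROMPT_SECONDS
```

A check like this only guards against oversized uploads; it says nothing about consent or audio quality, which remain subject to the approval process described above.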

Use cases

Text to speech offers a variety of features catering to a wide range of intended uses across industries and domains. All text to speech features are subject to the terms and conditions applicable to customers’ Azure subscription, including the Azure Acceptable Use Policy and the Code of conduct for Azure AI Speech text to speech.

Key use cases

  • Media: Entertainment - Give characters a voice. Generate expressive, lifelike audio for games, films, podcasts, audiobooks, and immersive AR/VR experiences.
  • Virtual Assistants and Chatbots - Make your assistant sound like it belongs in your product. Power conversational agents across apps, vehicles, appliances, and customer service with a branded voice.
  • Accessibility Features - Build products that more people can use. Add audio narration for visually impaired users and voice support for individuals with speech impairments.
  • Educational and Interactive Learning - Build character and brand voices for online courses, interactive lessons, simulations, and guided tours.
  • Media: Marketing and Advertising - Develop a consistent, recognizable voice across product launches, campaigns, and ads.
  • Self-authored Content - Voice talent can bring blogs, books, social media content, and personal stories to life using a custom voice built from their own.
  • Interactive Voice Response (IVR) Systems - Build dynamic, natural and expressive voices for call centers and automated phone interactions.
  • Public Service and Informational Announcements - Deliver clear and engaging voice messages for public venues, traffic updates, weather alerts, event information, and schedules.

Out of scope use cases

Any use of the service that is inconsistent with the Code of Conduct is out of scope and restricted.

Pricing

Among HD voices, MAI-Voice-1 is available at a competitive rate of $22.00 per 1 million characters.
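
For budgeting, the quoted rate translates directly into a per-character estimate. The sketch below is illustrative only: the function name is hypothetical, and it assumes billing is proportional to a simple character count. How the service counts SSML markup, whitespace, or minimum charges is not stated in this card.

```python
from decimal import Decimal

# USD per 1,000,000 characters, as quoted in this card.
PRICE_PER_MILLION_CHARS = Decimal("22.00")


def estimate_cost_usd(char_count: int) -> Decimal:
    """Estimate synthesis cost in USD for a given number of characters."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return Decimal(char_count) * PRICE_PER_MILLION_CHARS / Decimal(1_000_000)
```

Under these assumptions, a 90,000-character audiobook chapter would come to `estimate_cost_usd(90_000)`, which is $1.98.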

Technical specs

Training cut-off date

This information is not available.

Input formats

Plain text or Speech Synthesis Markup Language (SSML), which supports emotion control.
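
As an illustration of per-turn emotion control, the sketch below assembles an SSML document with one emotion wrapper per turn. It assumes MAI-Voice-1 accepts the `mstts:express-as` element used by other Azure AI Speech voices; this card does not list the exact emotion tags or style names. The `build_ssml` helper and any voice name passed to it are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(turns, voice_name, lang="en-US"):
    """Build an SSML document from (style, text) turns.

    A style of None leaves that turn without an emotion wrapper.
    Assumption: emotion is expressed with mstts:express-as.
    """
    parts = []
    for style, text in turns:
        body = escape(text)  # escape &, <, > so the transcript stays valid XML
        if style:
            body = (
                f"<mstts:express-as style={quoteattr(style)}>"
                f"{body}</mstts:express-as>"
            )
        parts.append(body)
    joined = " ".join(parts)
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_name)}>{joined}</voice>"
        "</speak>"
    )
```

Escaping the transcript before wrapping it matters because the model follows the input text strictly, so stray markup characters should reach it as literal text rather than break the document.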

Supported language

English (soon expanding to 10+ languages).

Supported Azure regions

Currently available in Central US, Japan West, and Sweden Central, with more regions to follow.

Sample JSON response

Endpoint                      Request type           Response format
POST /cognitiveservices/v1    SSML + headers         Binary audio file (MP3 / WAV / Opus / etc.)
Speech SDK SpeakTextAsync     Text or SSML           SDK stream + result metadata
Batch synthesis API           Long-form SSML/Text    Asynchronous job → downloadable audio file
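
The REST row of the table can be exercised with a plain HTTP client. The following is a minimal sketch using only the Python standard library. It assumes MAI-Voice-1 is served from the regional Azure AI Speech host (`https://<region>.tts.speech.microsoft.com`) and authenticated with a Speech resource key, neither of which is confirmed in this card. The region identifiers, default output format, and User-Agent value are illustrative choices.

```python
import urllib.request

# Assumed identifiers for Central US, Japan West, and Sweden Central.
SUPPORTED_REGIONS = {"centralus", "japanwest", "swedencentral"}


def build_tts_request(region, subscription_key, ssml,
                      output_format="audio-24khz-48kbitrate-mono-mp3"):
    """Prepare (but do not send) a POST /cognitiveservices/v1 request."""
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"region {region!r} is not in the assumed supported set")
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "mai-voice-1-sample",  # hypothetical client name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


def synthesize(request) -> bytes:
    """Send the request and return the binary audio body."""
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Because the response body is raw audio rather than JSON, the bytes returned by `synthesize` can be written straight to a file whose extension matches the requested output format.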

Model architecture

This information is not available.

Long context

This information is not available.

Optimizing model performance

Coming Soon...

Additional assets

This information is not available.

Distribution

MAI-Voice-1 is available through the following methods to support a wide range of integration scenarios:
  • Speech SDK
    Integrate TTS capabilities directly into applications using Azure’s Speech SDK, available for platforms including .NET, Python, Java, JavaScript, and C++.
  • REST API
    Access TTS functionality via a public, subscription-based API for flexible integration into web services, mobile apps, and backend systems.

More information

Learn more in the full Azure AI Speech Service documentation.

Responsible AI considerations

Safety techniques

This information is not available.

Safety evaluations

This information is not available.

Known limitations

Azure Text-to-Speech is designed with responsible AI principles in mind, but developers should be aware of the following limitations and risks:
  • Linguistic Limitations
    While we support only English today, we are scaling to support 10+ languages soon.
  • Context and Emotion
    The model may struggle to accurately convey nuanced emotions or context-specific intonation, especially in complex or sensitive scenarios. SSML emotion tags and voice prompting can help mitigate this.
  • Fairness and Representation
    Voice datasets may reflect biases in gender, accent, age, or regional representation. Developers should evaluate voice selection carefully to ensure inclusive and equitable user experiences.
  • Misuse Risks
    Synthetic voices can be misused for impersonation, misinformation, or deceptive content. Developers should implement safeguards, such as watermarking, consent management, and usage monitoring.
  • Reliability in High-Stakes Use Cases
    For applications involving healthcare, legal, or emergency communication, MAI-Voice-1 should be used with caution and supplemented by human oversight.
Refer to Technical limitations, operational factors, and ranges for more details.

Acceptable use

Acceptable use policy

Approved use cases for Azure text to speech include:
  • Educational or Interactive Learning
    For reading or speaking educational materials, online learning, interactive lesson plans, simulation learning, or guided museum tours.
  • Media: Marketing or Entertainment
    For product introductions, business promotion, advertisements, or speaking entertainment content such as video games, movies, TV, recorded music, podcasts, audiobooks, or AR/VR experiences.
  • Accessibility Features
    For audio description systems, narration, and communication support for individuals with speech impairments.
  • Interactive Voice Response (IVR) Systems
    For call center operations, telephony systems, and automated phone interactions.
  • Public Service and Informational Announcements
    For communicating public service information in venues or broadcasts (e.g., traffic, weather, events). Not intended for journalistic or news content.
  • Translation and Localization
    For translating conversations or audio media across different languages.
  • Virtual Assistant or Chatbot
    For smart assistants, web-based virtual agents, appliances, vehicles, toys, IoT device control, and customer service scenarios.
Refer to use cases for more details.

Terms of Service

Terms of Service Link

MAI-Voice-1 is provided under Microsoft’s proprietary licensing terms. Access to the model is subscription-based and governed by Microsoft’s product licensing policies.
Model Specifications
Last Updated: May 2026
Input Type: Text
Output Type: Audio
Provider: Microsoft