MAI-Voice-2

MAI-Voice-2 is a prompted text-to-speech (TTS) model that generates high-fidelity, natural, and expressive speech across 10+ languages. It captures human-like intonation, rhythm, and emotional nuance for engaging conversational experiences.

Microsoft

Version: 2026-06-02

MAI-Voice-2 is the next iteration in our speech synthesis family.

About this model

There are two ways to set the voice for your project:

Curated voice library: Licensed voices designed to work straight out of the box.
Voice prompting: Provide a short audio clip (10-120 seconds) and the model matches it instantly.
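The 10-120 second range for voice prompts can be checked on the client before a clip is submitted. The following is a minimal sketch, assuming the clip is an uncompressed PCM WAV file; the function names and constants are hypothetical and are not part of any MAI-Voice-2 API.

```python
import wave

# Bounds taken from the text above: voice prompts should run 10 to 120 seconds.
MIN_PROMPT_SECONDS = 10.0
MAX_PROMPT_SECONDS = 120.0


def clip_duration_seconds(source) -> float:
    """Return the duration of a PCM WAV clip in seconds.

    `source` may be a file path or a binary file object, as accepted
    by the standard-library `wave.open`.
    """
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_prompt_length(seconds: float) -> bool:
    """True when a clip length falls inside the documented prompt range."""
    return MIN_PROMPT_SECONDS <= seconds <= MAX_PROMPT_SECONDS
```

A caller would typically run `is_valid_prompt_length(clip_duration_seconds("prompt.wav"))` and ask for a new recording when the check fails. Other audio formats would need a different decoder.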

Key capabilities

Natural voice synthesis.
High-fidelity, high-clarity voice output.
Multilingual support across 10+ languages.
Voice prompting with improved pacing, delivery, and naturalness. Instantly generate natural speech in any consented voice, without additional training or fine-tuning.
Long-form content generation with improved consistency.
Multi-speaker support.

Key model capabilities

High-fidelity natural voice synthesis: Produces speech with realistic intonation, rhythm, and emotional range.
State-of-the-art voice prompting: Generate speech from short audio prompts (10-120 seconds). Prompt quality significantly impacts output, with best results from natural, conversational delivery and moderate energy levels.
Fine-grained control: Supports turn-level control over tone, delivery, and emotion.
Long-form content generation: Supports extended narration (e.g., audiobooks, podcasts) via chunking with context carryover.
Multilingual speech synthesis: Supports English, Spanish, French, German, Italian, Portuguese, Hindi, Japanese, and Chinese variants.
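The page does not describe how chunking with context carryover is implemented. As an illustration of the general idea only, the sketch below splits long text at sentence boundaries and pairs each chunk with the tail of the previous one. The function names, the 400-character limit, and the one-sentence carryover are all assumptions, and no MAI-Voice-2 API is called.

```python
import re
from typing import Iterator, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_with_carryover(
    text: str, max_chars: int = 400, carry_sentences: int = 1
) -> Iterator[Tuple[str, str]]:
    """Yield (context, chunk) pairs for long-form text.

    `chunk` is the new text to synthesize. `context` holds the last
    `carry_sentences` sentences of the previous chunk (empty for the first).
    """
    current: List[str] = []
    context: List[str] = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        # Close the current chunk once adding the sentence would exceed the limit.
        if current and len(candidate) > max_chars:
            yield " ".join(context), " ".join(current)
            context = current[-carry_sentences:] if carry_sentences > 0 else []
            current = []
        current.append(sentence)
    if current:
        yield " ".join(context), " ".join(current)
```

Splitting on sentence boundaries keeps each chunk a natural unit of speech. A single sentence longer than `max_chars` is emitted as its own chunk rather than cut mid-sentence.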

Use cases

Key use cases

Media and Entertainment - Generate expressive voices for games, films, podcasts, and immersive experiences.
Virtual Assistants and Chatbots - Power conversational agents across apps and devices with natural voices.
Accessibility Features - Provide narration for visually impaired users and assistive voice technologies.
Educational Experiences - Build interactive learning content with expressive narration.
Marketing and Advertising - Deliver consistent voice experiences across campaigns.
Self-authored Content - Turn written content into spoken audio using custom voice characteristics.
IVR Systems - Enable natural, expressive call center interactions.
Public Announcements - Deliver clear, engaging voice output for public information systems.

Out of scope use cases

This model prioritizes naturalness and expressivity over latency, so it is not intended for ultra-low-latency scenarios.

Use of the service in any way that is inconsistent with the Code of Conduct is restricted.

Quick facts

Model provider: Microsoft
Type: Text to speech, Audio generation
Lifecycle: Preview
Input type: Text
Output type: Audio
