Microsoft AI

MAI-Voice-2

Microsoft
Version: 2026-06-02

The next iteration in our speech synthesis family, MAI-Voice-2 is a prompted text-to-speech (TTS) model that generates high-fidelity, natural, and expressive speech across 10+ languages. It captures human-like intonation, rhythm, and emotional nuance, enabling engaging and lifelike conversational experiences.

About this model

There are two ways to set the voice for your project:

  • Curated voice library: Licensed voices designed to work straight out of the box.
  • Voice prompting: Provide a short audio clip (10-120 seconds) and the model matches it instantly.
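A prompt clip outside the 10-120 second range stated above can be caught before it is submitted. The following is a minimal sketch, assuming the clip is an uncompressed PCM WAV file readable by Python's standard `wave` module. The function and constant names are hypothetical and are not part of any MAI-Voice-2 API.

```python
import wave

# Bounds taken from the text above: prompt clips should be 10-120 seconds long.
MIN_PROMPT_SECONDS = 10.0
MAX_PROMPT_SECONDS = 120.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_prompt_duration(seconds: float) -> bool:
    """Check a clip length against the documented 10-120 second range."""
    return MIN_PROMPT_SECONDS <= seconds <= MAX_PROMPT_SECONDS
```

Calling `is_valid_prompt_duration(wav_duration_seconds("prompt.wav"))` on a local file (the file name is a placeholder) checks the clip length only. It says nothing about prompt quality, such as delivery or energy level.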

Key capabilities

  • Natural voice synthesis.
  • High-fidelity, high-clarity voice output.
  • Multilingual support across 10+ languages.
  • Voice prompting with improved pacing, delivery, and naturalness. Instantly generate natural speech in any consented voice, without additional training or fine-tuning.
  • Long-form content generation with improved consistency.
  • Multi-speaker support.

Key model capabilities

  1. High-fidelity natural voice synthesis
    Produces speech with realistic intonation, rhythm, and emotional range.

  2. State-of-the-art voice prompting
    Generate speech from short audio prompts (10-120 seconds). Prompt quality significantly impacts output, with best results from natural, conversational delivery and moderate energy levels.

  3. Fine-grained control
    Supports turn-level control over tone, delivery, and emotion.

  4. Long-form content generation
    Supports extended narration (e.g., audiobooks, podcasts) via chunking with context carryover.

  5. Multilingual speech synthesis
    Supports English, Spanish, French, German, Italian, Portuguese, Hindi, Japanese, and Chinese variants.
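The long-form item above mentions chunking with context carryover but does not specify how it is done. The following is a minimal client-side sketch of one possible approach: split text at sentence boundaries into size-limited chunks, and attach the tail of the previous chunk to each one as context. The `Chunk` type, the function names, and the character limits are all assumptions made for illustration, not the model's actual mechanism or API.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str     # text to synthesize in this request
    context: str  # tail of the previous chunk, carried over for continuity


def _tail(text: str, n: int) -> str:
    """Return the last n characters of text (empty when n <= 0)."""
    return text[-n:] if n > 0 else ""


def chunk_text(text: str, max_chars: int = 400, context_chars: int = 120) -> List[Chunk]:
    """Group sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[Chunk] = []
    current = ""   # chunk being built
    previous = ""  # last completed chunk, used as the context source
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(Chunk(current, _tail(previous, context_chars)))
            previous = current
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current, _tail(previous, context_chars)))
    return chunks
```

In this sketch the first chunk always has an empty context. How carried-over context is actually passed to the model, if it is exposed to callers at all, is not described in this document.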


Key use cases

  • Media and Entertainment - Generate expressive voices for games, films, podcasts, and immersive experiences.
  • Virtual Assistants and Chatbots - Power conversational agents across apps and devices with natural voices.
  • Accessibility Features - Provide narration for visually impaired users and assistive voice technologies.
  • Educational Experiences - Build interactive learning content with expressive narration.
  • Marketing and Advertising - Deliver consistent voice experiences across campaigns.
  • Self-authored Content - Turn written content into spoken audio using custom voice characteristics.
  • IVR Systems - Enable natural, expressive call center interactions.
  • Public Announcements - Deliver clear, engaging voice output for public information systems.

Out of scope use cases

This model prioritizes naturalness and expressivity over latency, so scenarios that require ultra-low latency are out of scope.

Use of the service in any way that is inconsistent with the Code of Conduct is restricted.


Quick facts

Model provider: Microsoft
Type: Text to speech, Audio generation
Lifecycle: Preview
Input type: text
Output type: audio