MAI-Voice-2
The next iteration in our speech synthesis family, MAI-Voice-2 is a prompted text-to-speech (TTS) model that generates high-fidelity, natural, and expressive speech across 10+ languages. It captures human-like intonation, rhythm, and emotional nuance, enabling engaging and lifelike conversational experiences.
About this model
There are two ways to set the voice for your project:
- Curated voice library: Licensed voices designed to work straight out of the box.
- Voice prompting: Provide a short audio clip (10-120 seconds) and the model matches it instantly.
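Because voice prompting expects a clip of 10-120 seconds, it can help to check a clip's length before submitting it. The following is a minimal sketch using only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. The function names are hypothetical and not part of any MAI-Voice-2 API, and treating the bounds as inclusive is an assumption.

```python
import wave

# Bounds taken from the stated 10-120 second prompt range.
# Treating both ends as inclusive is an assumption.
MIN_PROMPT_SECONDS = 10.0
MAX_PROMPT_SECONDS = 120.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_prompt_duration(seconds: float) -> bool:
    """Check a clip length against the stated prompt range."""
    return MIN_PROMPT_SECONDS <= seconds <= MAX_PROMPT_SECONDS
```

A caller might run `is_valid_prompt_duration(wav_duration_seconds("prompt.wav"))` and trim or re-record the clip when it returns `False`.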
Key capabilities
- Natural voice synthesis.
- High-fidelity, high-clarity voice output.
- Multilingual support across 10+ languages.
- Voice prompting with improved pacing, delivery, and naturalness. Instantly generate natural speech in any consented voice, without additional training/fine-tuning.
- Long-form content generation with improved consistency.
- Multi-speaker support.
Key model capabilities
- High-fidelity natural voice synthesis: Produces speech with realistic intonation, rhythm, and emotional range.
- State-of-the-art voice prompting: Generate speech from short audio prompts (10-120 seconds). Prompt quality significantly impacts output, with best results from natural, conversational delivery and moderate energy levels.
- Fine-grained control: Supports turn-level control over tone, delivery, and emotion.
- Long-form content generation: Supports extended narration (e.g., audiobooks, podcasts) via chunking with context carryover.
- Multilingual speech synthesis: Supports English, Spanish, French, German, Italian, Portuguese, Hindi, Japanese, and Chinese variants.
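The long-form item above mentions chunking with context carryover. One way to picture the idea on the text side is sketched below, assuming that "context" means the trailing sentence(s) of the previous chunk. How MAI-Voice-2 actually carries context between chunks is not described here, so the function name, the character limit, and the sentence-splitting rule are all illustrative assumptions.

```python
import re
from typing import List, Tuple


def chunk_with_carryover(
    text: str, max_chars: int = 400, carry_sentences: int = 1
) -> List[Tuple[str, str]]:
    """Split text into sentence-aligned chunks paired with trailing context.

    Returns (context, chunk) pairs, where context holds the last
    `carry_sentences` sentences of the previous chunk ("" for the first).
    """
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Greedily pack sentences into groups of at most max_chars characters.
    # A single sentence longer than max_chars becomes its own group.
    groups: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            groups.append(current)
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        groups.append(current)

    # Pair each chunk with the tail of the chunk before it.
    pairs: List[Tuple[str, str]] = []
    previous: List[str] = []
    for group in groups:
        context = " ".join(previous[-carry_sentences:]) if carry_sentences > 0 else ""
        pairs.append((context, " ".join(group)))
        previous = group
    return pairs
```

Each pair could then be sent as one synthesis request, with the context portion used only to keep delivery consistent across chunk boundaries. Splitting on sentence boundaries rather than at a fixed character offset avoids cutting a chunk mid-sentence.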
Key use cases
- Media and Entertainment - Generate expressive voices for games, films, podcasts, and immersive experiences.
- Virtual Assistants and Chatbots - Power conversational agents across apps and devices with natural voices.
- Accessibility Features - Provide narration for visually impaired users and assistive voice technologies.
- Educational Experiences - Build interactive learning content with expressive narration.
- Marketing and Advertising - Deliver consistent voice experiences across campaigns.
- Self-authored Content - Turn written content into spoken audio using custom voice characteristics.
- IVR Systems - Enable natural, expressive call center interactions.
- Public Announcements - Deliver clear, engaging voice output for public information systems.
Out of scope use cases
This model prioritizes naturalness and expressivity over latency, so ultra-low-latency scenarios are out of scope.
Usage that is inconsistent with the Code of Conduct is restricted.