MAI-Voice-1
MAI-Voice-1 is a text-to-speech (TTS) model that generates high-fidelity, natural, and expressive single-speaker speech; multi-speaker speech is planned for public preview soon. It captures human-like intonation, rhythm, and emotional nuance, enabling more engaging and lifelike conversational experiences. It strictly follows the provided transcript and supports per-turn emotion control.
About this model
There are two ways to set the voice for your project:
• Curated voice library: Licensed voices designed to work straight out of the box.
• Voice prompting: Provide an audio clip a few seconds long with your request, and the model matches it instantly.
Key capabilities
• Natural voice synthesis.
• High-fidelity, high-clarity voice output.
• Licensed voices designed to work straight out of the box.
• Voice prompting: Instantly generate natural speech in any consented voice, without additional training/fine-tuning.
• Long-form content generation that maintains speaker consistency.
Key model capabilities
- High-fidelity natural voice synthesis - Produces voice with the intonation, rhythm, and emotional range of a real speaker.
- State-of-the-art voice prompting - Provide an audio clip a few seconds long (up to 120 seconds) and the model clones the voice instantly. No fine-tuning is required, so you can easily onboard a consented voice of your choice. Access requires Microsoft approval, and guardrails are in place to prevent misuse.
- Fine-grained control - Shape delivery at the turn or sentence level by controlling the emotion and tone of the output.
- Long-form content - Built for extended content, including audiobooks, lectures, podcasts, training materials, and long-form narration.
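Because voice prompting accepts clips of up to 120 seconds, it can help to check a clip's length locally before sending it. The following is a minimal sketch, assuming the prompt is an uncompressed PCM WAV file; this page does not state which audio formats the service accepts, and the function names and the `MAX_PROMPT_SECONDS` constant are hypothetical helpers, not part of any Microsoft SDK.

```python
import wave

# Upper bound for voice-prompt clips as stated on this page (120 seconds).
MAX_PROMPT_SECONDS = 120.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
    if rate <= 0:
        raise ValueError("WAV file reports a non-positive sample rate")
    return frames / rate


def is_valid_voice_prompt(path: str, max_seconds: float = MAX_PROMPT_SECONDS) -> bool:
    """Hypothetical pre-flight check: the clip is non-empty and within the length cap."""
    duration = wav_duration_seconds(path)
    return 0.0 < duration <= max_seconds
```

A caller might run `is_valid_voice_prompt("my_voice_sample.wav")` before uploading and ask the user for a shorter recording when it returns `False`. The check covers length only; consent and approval requirements still apply.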
Text to speech offers a variety of features catering to a wide range of intended uses across industries and domains. All text to speech features are subject to the terms and conditions applicable to customers’ Azure subscription, including the Azure Acceptable Use Policy and the Code of conduct for Azure AI Speech text to speech.
Key use cases
- Media: Entertainment - Give characters a voice. Generate expressive, lifelike audio for games, films, podcasts, audiobooks, and immersive AR/VR experiences.
- Virtual Assistants and Chatbots - Make your assistant sound like it belongs in your product. Power conversational agents across apps, vehicles, appliances, and customer service with a branded voice.
- Accessibility Features - Build products that more people can use. Add audio narration for visually impaired users and voice support for individuals with speech impairments.
- Educational and Interactive Learning - Build character and brand voices for online courses, interactive lessons, simulations, and guided tours.
- Media: Marketing and Advertising - Develop a consistent, recognizable voice across product launches, campaigns, and ads.
- Self-authored Content - Voice talent can bring blogs, books, social media content, and personal stories to life using a custom voice built from their own.
- Interactive Voice Response (IVR) Systems - Build dynamic, natural and expressive voices for call centers and automated phone interactions.
- Public Service and Informational Announcements - Deliver clear and engaging voice messages for public venues, traffic updates, weather alerts, event information, and schedules.
Out of scope use cases
Any use of the service that is inconsistent with the Code of Conduct is restricted.
Quick facts
Model provider: Microsoft
Type: Text to speech, Audio generation
Lifecycle: Generally available (GA)
Input type: text
Output type: audio
Pricing: View pricing