Azure-Speech-Text-to-speech-Avatar
Version: 1
Azure Speech
Azure Speech is a comprehensive suite of AI-powered speech capabilities that includes speech to text, text to speech, speech translation, and voice live AI. It enables developers to build intelligent voice-enabled applications with high accuracy, multilingual support, and customizable voice experiences.
Key capabilities
About this model
Using advanced neural network models and the VASA-1 model, text to speech avatar lets users create low-latency, real-time live chat avatars and produce lifelike, high-quality synthetic talking avatar videos for various applications.
Key model capabilities
- Converts text into a digital video of a human speaking with natural-sounding voices powered by Azure AI text to speech.
- Provides a collection of standard avatars.
- Azure AI text to speech generates the voice of the avatar. For more information, see Avatar voice and language.
- Synthesizes text to speech avatar video asynchronously with the batch synthesis API or in real time.
- Allows users to create avatar video content in the text to speech avatar playground in AI Foundry.
- Enables real-time interactive avatars through Voice live in AI Foundry.
Use cases
Text to speech avatar offers a variety of features catering to a wide range of intended uses across industries and domains. All text to speech features, including text to speech avatar, are subject to the terms and conditions applicable to customers’ Azure subscription, including the Azure Acceptable Use Policy and the Code of conduct for Azure AI Speech text to speech.
Key use cases
- Virtual Assistant or Chatbot: To create virtual assistants, virtual companions, virtual sales assistants, or for customer service applications.
- Content generation for enterprise contexts: For use to communicate product information, marketing materials, business promotional content, and internal business communications. Examples include character avatars or digital twins of a business leader to promote a brand.
- Educational or interactive learning: To create a fictional brand or character avatar for presenting educational materials, online learning, interactive lesson plans, simulation learning, or guided museum tours.
- Media and entertainment: To present updates, share knowledge, create interactive media, or make talking head videos for entertainment scenarios such as videos, gaming, and augmented or virtual reality.
Out of scope use cases
Using the service in any way that is inconsistent with the Code of Conduct is not permitted.
Pricing
Pricing for Azure text to speech avatar is based on several factors, including avatar type (standard avatar vs. custom avatar), mode (real-time vs. batch), avatar resolution (whether it is a 4K avatar), and additional activities such as model training or hosting.
Technical specs
Azure Speech's text to speech avatar functionality is part of the Azure AI Speech service. It uses deep neural networks or VASA-1, an advanced AI model developed by Microsoft Research, to generate highly natural talking avatars. The avatar supports real-time and batch synthesis, and includes custom avatar capabilities for creating personalized talking avatars.
Training cut-off date
This information is not available.
Input formats
Plain text or Speech Synthesis Markup Language (SSML), which supports detailed speech control (for example, pitch, rate, pronunciation, and pauses) and avatar gesture control.
Supported language
The language support for text to speech avatar is the same as the language support for text to speech. Azure text to speech supports 150+ languages and variants, with multiple voice options per locale.
Full voice list
Supported Azure regions
Supported text to speech avatar regions
Sample JSON response
| Operation | Method | REST API call |
|---|---|---|
| Create batch synthesis | PUT | avatar/batchsyntheses/{SynthesisId}?api-version=2024-08-01 |
| Get batch synthesis | GET | avatar/batchsyntheses/{SynthesisId}?api-version=2024-08-01 |
| List batch synthesis | GET | avatar/batchsyntheses/?api-version=2024-08-01 |
| Delete batch synthesis | DELETE | avatar/batchsyntheses/{SynthesisId}?api-version=2024-08-01 |
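
The four operations in the table share one path pattern, so a small helper can build the method and URL for each of them. The following is a minimal sketch, not an official client: the `build_request` function, the `OPERATIONS` mapping, and the `https://speech.example.com` base URL are hypothetical names introduced for illustration. Only the paths, HTTP methods, and `api-version` value come from the table. Authentication headers and the request body are out of scope here and should be taken from the Azure AI Speech documentation.

```python
from urllib.parse import quote

# API version as listed in the table above.
API_VERSION = "2024-08-01"

# Hypothetical mapping: operation name -> (HTTP method, path needs a synthesis ID).
OPERATIONS = {
    "create": ("PUT", True),
    "get": ("GET", True),
    "list": ("GET", False),
    "delete": ("DELETE", True),
}


def build_request(base_url, operation, synthesis_id=None):
    """Return (method, url) for one batch synthesis operation.

    base_url is an assumption: replace it with the endpoint of your
    own Speech resource.
    """
    method, needs_id = OPERATIONS[operation]
    if needs_id and not synthesis_id:
        raise ValueError(f"operation '{operation}' requires a synthesis ID")
    path = "avatar/batchsyntheses/"
    if needs_id:
        # Percent-encode the ID so it is safe to place in a URL path.
        path += quote(synthesis_id, safe="")
    return method, f"{base_url.rstrip('/')}/{path}?api-version={API_VERSION}"


if __name__ == "__main__":
    # Placeholder host, used only to show the resulting URL shapes.
    base = "https://speech.example.com"
    for op in ("create", "get", "list", "delete"):
        sid = None if op == "list" else "my-avatar-job-1"
        print(*build_request(base, op, sid))
```

The returned pair can be passed to any HTTP client. Keeping the URL construction separate from the network call makes it easy to check the paths against the table without contacting the service.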
Model architecture
This information is not available.
Long context
This information is not available.
Optimizing model performance
To improve system performance for prebuilt text to speech avatars, we recommend that customers experiment with avatar selection and voice selection.
When building a custom text to speech avatar, preparing high-quality training data can help improve the quality of the custom avatar model. Refer to Best practices for improving system performance.
Additional assets
This information is not available.
Distribution
Text to speech avatar is available through multiple distribution methods to support a wide range of integration scenarios:
- Voice live: Turn on the avatar in the Voice live playground or in a Voice live enabled agent to offer users a more engaging and personalized voice agent experience.
- Text to speech avatar: Create a lifelike avatar video with a natural-sounding text to speech voice in the Text to speech avatar playground.
- REST API: Integrate TTS avatar capability directly into applications, web services, and backend systems through public, subscription-based APIs.
More information
Learn more in the full Azure AI Speech Service documentation.
Responsible AI considerations
Safety techniques
Azure text to speech avatar adds a content credentials manifest to each avatar video following industry standards from the Coalition for Content Provenance and Authenticity (C2PA), giving you information about the video's origin.
Safety evaluations
This information is not available.
Known limitations
Azure text to speech avatar is designed with responsible AI principles in mind, but developers should be aware of the following limitations and risks:
- Body movements: Text to speech avatars are designed to be used in speaking scenarios. In general, the avatar is in a front-facing posture, sitting or standing.
- Lip sync accuracy: In some cases, text to speech avatars may produce lip sync images that do not perfectly match the audio content or that lack the desired naturalness. This can be caused by the quality of the training data and how that data is processed.
- Faithfulness to the original: The text to speech avatars generated from human photos may lose some features of the original's facial expressions, since the facial movements are constructed by the VASA-1 model rather than derived from the original’s own data.
Acceptable use
Acceptable use policy
- Virtual Assistant or Chatbot: To create virtual assistants, virtual companions, virtual sales assistants, or for customer service applications.
- Content generation for enterprise contexts: For use to communicate product information, marketing materials, business promotional content, and internal business communications. Examples include character avatars or digital twins of a business leader to promote a brand.
- Educational or interactive learning: To create a fictional brand or character avatar for presenting educational materials, online learning, interactive lesson plans, simulation learning, or guided museum tours.
- Media and entertainment: To present updates, share knowledge, create interactive media, or make talking head videos for entertainment scenarios such as videos, gaming, and augmented or virtual reality.
Terms of Service
Terms of Service Link
Azure Speech - Text to speech Avatar is provided under Microsoft’s proprietary licensing terms. Access to the model is subscription-based and governed by Microsoft’s product licensing policies.
- License Type: Proprietary
- Access Model: Subscription-based via Azure services
- Terms of Service: https://microsoft.com/licensing/terms/
Model Specifications
Last Updated: December 2025
Input Type: Text
Output Type: Video
Provider: Microsoft