Azure-Speech-Text-to-speech-Avatar
Version: 1
Microsoft
Last updated December 2025
Text to speech avatar converts text into a digital video of a human (either a standard avatar or a custom text to speech avatar) speaking with a natural-sounding voice. The text to speech avatar video can be synthesized asynchronously or in real time. Deve

Azure Speech

Azure Speech is a comprehensive suite of AI-powered speech capabilities that includes speech to text, text to speech, speech translation, and voice live AI. It enables developers to build intelligent voice-enabled applications with high accuracy, multilingual support, and customizable voice experiences.

Key capabilities

About this model

Text to speech avatar combines advanced neural network models with the VASA-1 model. With it, users can create low-latency, real-time live chat avatars and produce lifelike, high-quality synthetic talking avatar videos for a range of applications.

Key model capabilities

  1. Converts text into a digital video of a human speaking with natural-sounding voices powered by Azure AI text to speech.
  2. Provides a collection of standard avatars.
  3. Uses Azure AI text to speech to generate the avatar's voice. For more information, see Avatar voice and language.
  4. Synthesizes text to speech avatar video asynchronously with the batch synthesis API or in real time.
  5. Lets users create avatar video content in the text to speech avatar playground in AI Foundry.
  6. Enables real-time interactive avatars through Voice live in AI Foundry.
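The asynchronous path in item 4 typically follows a submit-then-poll pattern: create a batch synthesis job, then check it until it reaches a terminal state. The following is a minimal sketch of the polling half only. It assumes the job is returned as a JSON object with a `status` field whose terminal values are `Succeeded` and `Failed`; those names, and the helper `wait_for_batch_synthesis`, are illustrative assumptions that should be checked against the batch synthesis API reference.

```python
import time

# Assumption: terminal status names; verify against the batch synthesis API reference.
TERMINAL_STATUSES = {"Succeeded", "Failed"}


def wait_for_batch_synthesis(fetch_job, poll_seconds=5.0, timeout_seconds=600.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll a batch synthesis job until it finishes or the timeout expires.

    fetch_job is any zero-argument callable that returns the job as a dict,
    for example a wrapper around the "Get batch synthesis" REST call.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_job()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError("batch synthesis did not finish before the timeout")
        sleep(poll_seconds)
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP client, so the same helper works with a real request or with a stub.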

Use cases

Text to speech avatar offers a variety of features for a wide range of intended uses across industries and domains. All text to speech features, including text to speech avatar, are subject to the terms and conditions applicable to customers’ Azure subscription, including the Azure Acceptable Use Policy and the Code of conduct for Azure AI Speech text to speech.

Key use cases

  • Virtual Assistant or Chatbot: To create virtual assistants, virtual companions, virtual sales assistants, or for customer service applications.
  • Content generation for enterprise contexts: For use to communicate product information, marketing materials, business promotional content, and internal business communications. Examples include character avatars or digital twins of a business leader to promote a brand.
  • Educational or interactive learning: To create a fictional brand or character avatar for presenting educational materials, online learning, interactive lesson plans, simulation learning, or guided museum tours.
  • Media and entertainment: To present updates, share knowledge, create interactive media, or make talking head videos for entertainment scenarios such as videos, gaming, and augmented or virtual reality.
For more details, see the approved use cases.

Out of scope use cases

Use of the service in any way that is inconsistent with the Code of Conduct is not permitted.

Pricing

Pricing for Azure text to speech avatar is based on several factors, including avatar type (standard avatar vs. custom avatar), mode (real-time vs. batch), avatar resolution (whether it is a 4K avatar), and additional activities such as model training or hosting.

Technical specs

Azure Speech's text to speech avatar functionality is part of the Azure AI Speech Service. It uses deep neural networks or VASA-1, an advanced AI model developed by Microsoft Research, to generate highly natural talking avatars. The avatar supports real-time and batch synthesis, and includes custom avatar capabilities for creating personalized talking avatars.

Training cut-off date

This information is not available.

Input formats

Plain text or Speech Synthesis Markup Language (SSML), which supports detailed speech control (e.g. pitch, rate, pronunciation, pauses) and avatar gesture control.
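As a rough illustration of the SSML input described above, the sketch below wraps plain text in standard SSML `prosody` and `break` elements to control rate, pitch, and a trailing pause. The helper name `build_ssml` and the voice name in the usage example are placeholders chosen for illustration, not values taken from this page. Avatar gesture control is not shown because its markup is not described here.

```python
from xml.sax.saxutils import escape, quoteattr

# W3C SSML namespace used on the root <speak> element.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice, lang="en-US", rate="medium", pitch="medium", pause_ms=0):
    """Wrap plain text in SSML with prosody (rate, pitch) and an optional trailing pause.

    voice is a placeholder for any supported text to speech voice name.
    """
    pause = f'<break time="{int(pause_ms)}ms"/>' if pause_ms > 0 else ""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}{pause}"  # escape() protects &, <, > in user text
        "</prosody></voice></speak>"
    )


# Usage: the voice name below is an assumed example, not a recommendation.
ssml = build_ssml("Welcome back.", "en-US-AvaMultilingualNeural", rate="slow", pause_ms=500)
```

Escaping the text before embedding it matters because raw user input containing `&` or `<` would otherwise produce invalid SSML.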

Supported language

The language support for text to speech avatar is the same as the language support for text to speech. Azure text to speech supports 150+ languages and variants, with multiple voice options per locale. See the full voice list.

Supported Azure regions

Supported text to speech avatar regions

Sample JSON response

Operation               Method  REST API call
Create batch synthesis  PUT     avatar/batchsyntheses/{SynthesisId}?api-version=2024-08-01
Get batch synthesis     GET     avatar/batchsyntheses/{SynthesisId}?api-version=2024-08-01
List batch synthesis    GET     avatar/batchsyntheses/?api-version=2024-08-01
Delete batch synthesis  DELETE  avatar/batchsyntheses/{SynthesisId}?api-version=2024-08-01
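
The operations in the table above can be sketched as request objects built with Python's standard library. This is a hedged illustration, not an official client: the host pattern `https://{region}.api.cognitive.microsoft.com`, the `Ocp-Apim-Subscription-Key` header, and the helper name `batch_request` are assumptions not stated on this page, and the request body for the create call is left to the caller because its schema is not given here. Nothing is sent over the network.

```python
import json
from urllib.parse import quote
from urllib.request import Request

API_VERSION = "2024-08-01"  # from the table above
METHODS = {"create": "PUT", "get": "GET", "list": "GET", "delete": "DELETE"}


def batch_request(operation, region, key, synthesis_id=None, body=None):
    """Build (but do not send) a request for one batch synthesis operation."""
    method = METHODS[operation]
    if operation == "list":
        path = "avatar/batchsyntheses/"
    else:
        if not synthesis_id:
            raise ValueError(f"{operation} requires a synthesis_id")
        path = f"avatar/batchsyntheses/{quote(synthesis_id, safe='')}"
    # Assumption: regional host pattern; confirm the endpoint for your resource.
    url = f"https://{region}.api.cognitive.microsoft.com/{path}?api-version={API_VERSION}"
    # Assumption: key-based authentication header.
    headers = {"Ocp-Apim-Subscription-Key": key}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


# Usage: "westus2" and "my-job-1" are placeholder values.
req = batch_request("get", "westus2", "<your-key>", synthesis_id="my-job-1")
```

To actually issue a request, the returned object could be passed to `urllib.request.urlopen`, with error handling suited to the application.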

Model architecture

This information is not available.

Long context

This information is not available.

Optimizing model performance

To improve system performance for prebuilt text to speech avatars, we recommend that customers experiment with avatar selection and voice selection.
When building a custom text to speech avatar, preparing high-quality training data can help improve the quality of the custom avatar model.
For more details, refer to Best practices for improving system performance.

Additional assets

This information is not available.

Distribution

Text to speech avatar is available through multiple distribution methods to support a wide range of integration scenarios:
  • Voice live: Turn on the avatar in the Voice live playground or in a Voice live enabled agent to offer users a more engaging and personalized voice agent experience.
  • Text to speech avatar: Create a lifelike avatar video with a natural-sounding text to speech voice in the Text to speech avatar playground.
  • REST API: Integrate TTS avatar capability directly into applications, web services, and backend systems through public, subscription-based APIs.

More information

Learn more in the full Azure AI Speech Service documentation.

Responsible AI considerations

Safety techniques

Azure text to speech avatar adds a content credentials manifest to each avatar video following industry standards from the Coalition for Content Provenance and Authenticity (C2PA), giving you information about the video's origin.

Safety evaluations

This information is not available.

Known limitations

Azure text to speech avatar is designed with responsible AI principles in mind, but developers should be aware of the following limitations and risks:
  • Body movements: Text to speech avatars are designed to be used in speaking scenarios. In general, the avatar is in a front-facing posture, sitting or standing.
  • Lip sync accuracy: In some cases, text to speech avatars may produce lip sync images that do not perfectly match the audio content or that lack the desired naturalness. This can be caused by training data quality and the processing of the data.
  • Faithfulness to the original: Text to speech avatars generated from human photos may lose some features of the original's facial expressions, because the facial movements are constructed by the VASA-1 model rather than derived from the original’s own data.
Refer to Limitations for more details.

Acceptable use

Acceptable use policy

  • Virtual Assistant or Chatbot: To create virtual assistants, virtual companions, virtual sales assistants, or for customer service applications.
  • Content generation for enterprise contexts: For use to communicate product information, marketing materials, business promotional content, and internal business communications. Examples include character avatars or digital twins of a business leader to promote a brand.
  • Educational or interactive learning: To create a fictional brand or character avatar for presenting educational materials, online learning, interactive lesson plans, simulation learning, or guided museum tours.
  • Media and entertainment: To present updates, share knowledge, create interactive media, or make talking head videos for entertainment scenarios such as videos, gaming, and augmented or virtual reality.
For more details, see the approved use cases.

Terms of Service

Terms of Service Link

Azure Speech - Text to speech Avatar is provided under Microsoft’s proprietary licensing terms. Access to the model is subscription-based and governed by Microsoft’s product licensing policies.
Model Specifications
Last Updated: December 2025
Input Type: Text
Output Type: Video
Provider: Microsoft