Inworld TTS 1.5 Max
Version: 1
Theai-Inc · Last updated May 2026
Flagship text-to-speech model with highest quality and expressiveness for demanding applications.
Audio · Conversation · Low latency
Inworld TTS 1.5 Max is the flagship text-to-speech model in the Inworld TTS 1.5 series, delivering the highest quality and expressiveness for demanding voice applications. It is a state-of-the-art speech synthesis system designed for production-grade, realtime voice-enabled applications, producing the most natural and expressive speech in the lineup with sub-250ms time-to-first-audio. This makes it ideal for premium interactive dialogue, entertainment, and applications where voice quality is paramount.

Inworld TTS 1.5 Max is an autoregressive transformer-based speech language model (SpeechLM) with 8B parameters, paired with a high-resolution neural audio codec. The architecture is a two-stage generative system: the SpeechLM generates discrete audio tokens from text, and the neural codec decoder converts these tokens into high-fidelity 48kHz audio.

The model supports 15+ languages, customizable speaking speed (0.5x-1.5x), expressiveness adjustments, custom pronunciation, and character/word/phoneme/viseme timestamps for applications such as live captioning and lip-sync.

Intended Use

Primary Use Cases

Inworld TTS 1.5 Max sets a new standard for developers building voice-enabled applications at scale. Common use cases include:
  1. Customer Service. Fluid, multi-turn dialogue for contact centers, live support, sales/GTM, and more.
  2. Regulated Industries. Deployments in air-gapped or private environments for government and enterprise sectors requiring strict data control (e.g., healthcare, financial services).
  3. Education / Professional Development. Realtime synthesis for training, interviewing, and language learning.
  4. Companion Apps. Engaging experiences such as fitness coaching, guided meditation, personalized shopping.
  5. Interactive Media & Entertainment. Character/IP voices with emotional range and contextual awareness for gaming, advertising, storytelling, and branded experiences.
  6. Realtime Translation & Dubbing. Voice-enabled experiences with timestamp support.

Model Capabilities

  • Multilingual Support. Supports 15+ languages, including English, Hindi, Arabic, Hebrew, Chinese, and major European languages. Additional languages upon request.
  • Customization. Supports talking speed adjustments (0.5x to 1.5x), expressiveness adjustments, custom pronunciation, and emphasis markers (audio markups).
  • Timestamps. Powers applications such as live captioning, lip-sync, etc. with character, word, phoneme, and viseme timestamps.
  • Custom Voices. Available upon request. TTS 1.5 includes a rich library of studio-quality voices optimized for various use cases.
  • Data Control. On-premise deployment ensures no data (text or audio) leaves the customer's environment.

Out-of-Scope Use Cases

Inworld TTS 1.5 Max is designed for text-to-speech synthesis and is not intended for speech recognition, speaker identification, or general-purpose language understanding tasks. The model should not be used to generate speech that impersonates real individuals without their consent or to produce misleading audio content.

Responsible AI Considerations

Inworld TTS services provide text-to-speech capabilities, generating speech directly from content provided by customers. All uses of Inworld are subject to the acceptable use policy and terms of service.
  • Trust & Safety. The model generates audio only from text explicitly provided by the caller. It does not generate autonomous or unsolicited content.
  • Compliance. Suitable for environments subject to HIPAA, SOC 2, GDPR, or other regulatory requirements. The latest information is available at: https://inworld.ai/security
  • Voice Consent. Custom voice cloning is only available for voices with documented consent from the speaker.

Training Data

The training corpus for Inworld TTS 1.5 comprises a large-scale, multilingual audio-text dataset curated from a combination of publicly available sources and licensed third-party providers. The SpeechLM is trained in multiple stages: it is first pre-trained on millions of raw speech and text samples, then fine-tuned for zero-shot speech synthesis on hundreds of thousands of hours of high-quality text/audio pairs, and finally RL-aligned to synthesize natural-sounding audio according to human preferences. The audio codec is trained on the same data.

Minimum System Requirements

  • Hardware: NVIDIA H100 GPU
  • RAM: 64GB+ system memory
  • CPU: 8+ cores
  • OS: Ubuntu 22.04 LTS
  • Software: Docker, NVIDIA Container Toolkit, CUDA 13.0+

API Reference

After deploying the model, you receive a scoring endpoint URL and API key from Azure AI Foundry. All API paths below are relative to this endpoint URL.

List Available Voices

GET <scoring-endpoint>/tts/v1/voices
Returns a JSON array of available voice IDs and their metadata.
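As a sketch, the listing can be fetched with Python's standard library. The exact metadata schema is not documented here, so the `voice_id` field name in the helper below is an assumption carried over from the synthesis request body:

```python
import json
import urllib.request


def list_voices(endpoint: str, api_key: str) -> list:
    """GET the available voices from the deployed scoring endpoint."""
    req = urllib.request.Request(
        f"{endpoint}/tts/v1/voices",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def voice_ids(voices: list) -> list:
    # Assumption: each metadata entry carries a "voice_id" key,
    # matching the field used in the synthesis request below.
    return [v["voice_id"] for v in voices]
```

A typical flow would be `voice_ids(list_voices(ENDPOINT, API_KEY))` to pick an ID for the synthesis request.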

Synthesize Speech

POST <scoring-endpoint>/tts/v1/voice
Content-Type: application/json
Request body:
  • text (string, required): Text to synthesize
  • voice_id (string, required): Voice ID (use /tts/v1/voices to list available voices)
  • model_id (string, required): inworld-tts-1.5-max
  • audio_config.audio_encoding (string, optional): LINEAR16 (default), MP3, or OGG_OPUS
  • audio_config.sample_rate_hertz (integer, optional): 48000 (default), 24000, or 16000
  • audio_config.speaking_rate (float, optional): Speaking speed, 0.5 to 1.5 (default 1.0)
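The documented ranges can be enforced client-side before sending a request. The sketch below uses the field names and values from the table; the validation helper itself is ours, not part of the API:

```python
def make_audio_config(audio_encoding="LINEAR16", sample_rate_hertz=48000, speaking_rate=1.0):
    """Build an audio_config dict, validating against the documented ranges."""
    if audio_encoding not in ("LINEAR16", "MP3", "OGG_OPUS"):
        raise ValueError(f"unsupported audio_encoding: {audio_encoding}")
    if sample_rate_hertz not in (48000, 24000, 16000):
        raise ValueError(f"unsupported sample_rate_hertz: {sample_rate_hertz}")
    if not 0.5 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0.5 and 1.5")
    return {
        "audio_encoding": audio_encoding,
        "sample_rate_hertz": sample_rate_hertz,
        "speaking_rate": speaking_rate,
    }
```

Rejecting out-of-range values locally avoids a round trip to the endpoint for requests that would fail anyway.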
Example request:
ENDPOINT="<your-scoring-endpoint-url>"
API_KEY="<your-api-key>"

curl -X POST "${ENDPOINT}/tts/v1/voice" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{
    "text": "Hello, this is a test of the Inworld text to speech system.",
    "voice_id": "Craig",
    "model_id": "inworld-tts-1.5-max",
    "audio_config": {"audio_encoding": "LINEAR16", "sample_rate_hertz": 48000}
  }'
Response:
{
  "audioContent": "<base64-encoded audio>",
  "timepoints": [...]
}
The audioContent field contains base64-encoded audio in the requested format. Decode it to get the raw audio bytes.
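A minimal decoding sketch in Python, assuming LINEAR16 output is raw 16-bit PCM; the mono channel count is an assumption, since the response does not document it:

```python
import base64
import io
import wave


def decode_linear16(response: dict, sample_rate: int = 48000) -> bytes:
    """Decode the base64 audioContent field and wrap the PCM in a WAV container."""
    pcm = base64.b64decode(response["audioContent"])
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)             # assumption: mono output
        wav.setsampwidth(2)             # LINEAR16 = 16-bit samples
        wav.setframerate(sample_rate)   # match audio_config.sample_rate_hertz
        wav.writeframes(pcm)
    return buf.getvalue()
```

Writing the returned bytes to a `.wav` file gives a playable result; for MP3 or OGG_OPUS encodings the decoded bytes are already a complete file and need no container.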

Streaming

POST <scoring-endpoint>/tts/v1/voice:stream
Content-Type: application/json
Same request body as /tts/v1/voice. Returns NDJSON (newline-delimited JSON) with audio chunks streamed as they are generated, enabling sub-250ms time-to-first-audio. For full API documentation, see the Inworld TTS API Reference.
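Each NDJSON line can be parsed as soon as it arrives. The sketch below assumes each chunk object carries a base64 `audioContent` field mirroring the non-streaming response shape:

```python
import base64
import json


def iter_audio_chunks(lines):
    """Yield decoded audio bytes from an NDJSON stream, one chunk per line.

    `lines` is any iterable of JSON-encoded lines (e.g. an HTTP response
    iterated line by line). Assumption: each chunk object carries a base64
    "audioContent" field, as in the non-streaming response.
    """
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        obj = json.loads(line)
        if "audioContent" in obj:
            yield base64.b64decode(obj["audioContent"])
```

Feeding the decoded chunks straight into an audio player as they arrive is what realizes the sub-250ms time-to-first-audio in practice.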
Benchmarks

Inworld TTS 1.5 ranks at the top of blind human-preference evaluations such as the Artificial Analysis and Hugging Face TTS Arenas, and performs well on stability and latency benchmarks.
  • Latency: <250ms time-to-first-audio chunk (P90)
  • Output: 48kHz audio sample rate
  • Languages: 15+ supported languages
Additional benchmarks available upon request.
Model Specifications
  • License: Custom
  • Training Data: February 2025
  • Last Updated: May 2026
  • Input Type: Text
  • Output Type: Audio
  • Provider: Theai-Inc
  • Languages: 15 languages