Inworld TTS 1.5 Max
Version: 1
Inworld TTS 1.5 Max is the flagship text-to-speech model in the Inworld TTS 1.5 series, delivering the highest quality and expressiveness for demanding voice applications. It is a state-of-the-art speech synthesis system designed for production-grade, realtime voice-enabled applications. TTS 1.5 Max produces the most natural and expressive speech in the lineup with sub-250ms time-to-first-audio, making it ideal for premium interactive dialogue, entertainment, and applications where voice quality is paramount.
Inworld TTS 1.5 Max is an autoregressive transformer-based speech language model (SpeechLM) with 8B parameters that uses a high-resolution neural audio codec. The architecture is a two-stage generative system: the SpeechLM generates discrete audio tokens from text, and the neural codec decoder converts these tokens into high-fidelity 48kHz audio. The model supports 15+ languages, customizable speaking speed (0.5x-1.5x), expressiveness adjustments, custom pronunciation, and character/word/phoneme/viseme timestamps for applications such as live captioning and lip-sync.
Intended Use
Primary Use Cases
Inworld TTS 1.5 Max sets a new standard for developers building voice-enabled applications at scale. Common use cases include:
- Customer Service. Fluid, multi-turn dialogue for contact centers, live support, sales/GTM, and more.
- Regulated Industries. Deployments in air-gapped or private environments for government and enterprise sectors requiring strict data control (e.g., healthcare, financial services).
- Education / Professional Development. Realtime synthesis for training, interviewing, and language learning.
- Companion Apps. Engaging experiences such as fitness coaching, guided meditation, personalized shopping.
- Interactive Media & Entertainment. Character/IP voices with emotional range and contextual awareness for gaming, advertising, storytelling, and branded experiences.
- Realtime Translation & Dubbing. Voice-enabled experiences with timestamp support.
Model Capabilities
- Multilingual Support. Supports 15+ languages, including English, Hindi, Arabic, Hebrew, Chinese, and major European languages. Additional languages are available upon request.
- Customization. Supports talking speed adjustments (0.5x to 1.5x), expressiveness adjustments, custom pronunciation, and emphasis markers (audio markups).
- Timestamps. Powers applications such as live captioning, lip-sync, etc. with character, word, phoneme, and viseme timestamps.
- Custom Voices. Available upon request. TTS 1.5 includes a rich library of studio-quality voices optimized for various use cases.
- Data Control. On-premise deployment ensures no data (text or audio) leaves the customer's environment.
Out-of-Scope Use Cases
Inworld TTS 1.5 Max is designed for text-to-speech synthesis and is not intended for speech recognition, speaker identification, or general-purpose language understanding tasks. The model should not be used to generate speech that impersonates real individuals without their consent or to produce misleading audio content.
Responsible AI Considerations
Inworld TTS services provide text-to-speech capabilities, generating speech directly from content provided by customers. All uses of Inworld are subject to the acceptable use policy and terms of service.
- Trust & Safety. The model generates audio only from text explicitly provided by the caller. It does not generate autonomous or unsolicited content.
- Compliance. Suitable for HIPAA, SOC2, GDPR, or other regulatory-sensitive environments. The latest information is available at: https://inworld.ai/security
- Voice Consent. Custom voice cloning is only available for voices with documented consent from the speaker.
Training Data
The training corpus for Inworld TTS 1.5 comprises a large-scale, multilingual audio-text dataset curated from a combination of publicly available sources and licensed third-party providers. The SpeechLM is trained in multiple stages. It is first pre-trained on millions of raw speech/text data samples. The model is then fine-tuned to perform zero-shot speech synthesis on hundreds of thousands of hours of high-quality text/audio pairs. In the final stage, it is RL-aligned to synthesize natural-sounding audio according to human preferences. The audio codec is trained on the same data.
Minimum System Requirements
- Hardware: NVIDIA H100 GPU
- RAM: 64GB+ system memory
- CPU: 8+ cores
- OS: Ubuntu 22.04 LTS
- Software: Docker, NVIDIA Container Toolkit, CUDA 13.0+
API Reference
After deploying the model, you receive a scoring endpoint URL and API key from Azure AI Foundry. All API paths below are relative to this endpoint URL.
List Available Voices
GET <scoring-endpoint>/tts/v1/voices
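A minimal Python sketch of this call follows. It assumes the voices endpoint accepts the same `Authorization: Bearer` header as the curl example in the Synthesize Speech section. The response schema is not described here, so the sketch returns the decoded JSON unchanged. The function names `build_voices_request` and `list_voices` are illustrative, not part of any Inworld SDK.

```python
import json
import urllib.request


def build_voices_request(endpoint: str, api_key: str) -> urllib.request.Request:
    """Build the GET request for <scoring-endpoint>/tts/v1/voices."""
    url = endpoint.rstrip("/") + "/tts/v1/voices"
    # Assumption: same Bearer scheme as the curl synthesis example.
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def list_voices(endpoint: str, api_key: str):
    """Send the request and return the decoded JSON body as-is."""
    with urllib.request.urlopen(build_voices_request(endpoint, api_key)) as resp:
        return json.load(resp)
```

Calling `list_voices("<your-scoring-endpoint-url>", "<your-api-key>")` should return whatever JSON structure the deployment produces; inspect it to find the voice IDs to pass as `voice_id`.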
Synthesize Speech
POST <scoring-endpoint>/tts/v1/voice
Content-Type: application/json
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to synthesize |
| voice_id | string | Yes | Voice ID (use /tts/v1/voices to list available voices) |
| model_id | string | Yes | inworld-tts-1.5-max |
| audio_config.audio_encoding | string | No | LINEAR16 (default), MP3, OGG_OPUS |
| audio_config.sample_rate_hertz | integer | No | 48000 (default), 24000, 16000 |
| audio_config.speaking_rate | float | No | Speaking speed, 0.5 to 1.5 (default 1.0) |
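The fields in the table can be assembled and checked on the client before a request is sent. The sketch below is one way to do that in Python. The helper name `build_synthesis_payload` is hypothetical, and rejecting values outside the table is a design choice of this sketch: the service may accept values that the table does not list.

```python
ALLOWED_ENCODINGS = {"LINEAR16", "MP3", "OGG_OPUS"}
ALLOWED_SAMPLE_RATES = {48000, 24000, 16000}


def build_synthesis_payload(text, voice_id, audio_encoding="LINEAR16",
                            sample_rate_hertz=48000, speaking_rate=1.0):
    """Return a JSON-serializable body for POST /tts/v1/voice."""
    if not text:
        raise ValueError("text is required")
    if not voice_id:
        raise ValueError("voice_id is required")
    # Assumption: the values listed in the table are the only accepted ones.
    if audio_encoding not in ALLOWED_ENCODINGS:
        raise ValueError(f"unsupported audio_encoding: {audio_encoding}")
    if sample_rate_hertz not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate_hertz: {sample_rate_hertz}")
    if not 0.5 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0.5 and 1.5")
    return {
        "text": text,
        "voice_id": voice_id,
        "model_id": "inworld-tts-1.5-max",
        "audio_config": {
            "audio_encoding": audio_encoding,
            "sample_rate_hertz": sample_rate_hertz,
            "speaking_rate": speaking_rate,
        },
    }
```

Serialize the returned dictionary with `json.dumps` and send it as the request body with `Content-Type: application/json`.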
ENDPOINT="<your-scoring-endpoint-url>"
API_KEY="<your-api-key>"
curl -X POST "${ENDPOINT}/tts/v1/voice" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{
    "text": "Hello, this is a test of the Inworld text to speech system.",
    "voice_id": "Craig",
    "model_id": "inworld-tts-1.5-max",
    "audio_config": {"audio_encoding": "LINEAR16", "sample_rate_hertz": 48000}
  }'
Response:
{
  "audioContent": "<base64-encoded audio>",
  "timepoints": [...]
}
The audioContent field contains base64-encoded audio in the requested format. Decode it to get the raw audio bytes.
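The decoding step can be sketched in Python as follows. Whether LINEAR16 output arrives with a WAV header is not stated here, so `to_wav_bytes` checks for one and wraps the data only when it is missing. Treating headerless data as mono 16-bit PCM is an assumption of this sketch, and both function names are hypothetical.

```python
import base64
import io
import wave


def decode_audio(response: dict) -> bytes:
    """Decode the base64 audioContent field into raw bytes."""
    return base64.b64decode(response["audioContent"])


def to_wav_bytes(audio: bytes, sample_rate_hertz: int = 48000,
                 channels: int = 1) -> bytes:
    """Return WAV bytes, adding a header if the input has none."""
    # Already a RIFF/WAVE container: return unchanged.
    if audio[:4] == b"RIFF" and audio[8:12] == b"WAVE":
        return audio
    # Assumption: headerless LINEAR16 is 16-bit PCM, mono by default.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate_hertz)
        wav.writeframes(audio)
    return buf.getvalue()
```

For MP3 or OGG_OPUS requests, the output of `decode_audio` can be written to a file directly, since those formats carry their own container.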
Streaming
POST <scoring-endpoint>/tts/v1/voice:stream
Content-Type: application/json
Accepts the same request body as /tts/v1/voice. Returns NDJSON (newline-delimited JSON) with audio chunks streamed as they are generated, enabling sub-250ms time-to-first-audio.
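An NDJSON stream can be consumed one line at a time, as in the Python sketch below. The schema of each streamed chunk is not documented here. The sketch assumes each line is a JSON object with a top-level base64 field named `audioContent`, mirroring the non-streaming response; verify this against the full API reference before relying on it. The function names are hypothetical.

```python
import base64
import json


def iter_ndjson(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if line:
            yield json.loads(line)


def collect_audio(lines, key="audioContent"):
    """Concatenate the decoded audio from every chunk that carries `key`."""
    audio = bytearray()
    for chunk in iter_ndjson(lines):
        # Assumption: audio sits in a top-level base64 field named `key`.
        if key in chunk:
            audio.extend(base64.b64decode(chunk[key]))
    return bytes(audio)
```

The response object returned by `urllib.request.urlopen` can be iterated line by line, so it can be passed to `iter_ndjson` directly. A realtime player would hand each decoded chunk to the audio device as it arrives instead of collecting them; simple concatenation is only meaningful if the chunks are headerless audio.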
For full API documentation, see the Inworld TTS API Reference.
Inworld TTS 1.5 tops blind human-preference leaderboards such as the Artificial Analysis and Hugging Face TTS Arenas. The models also perform well on stability and latency benchmarks.
Additional benchmarks available upon request.
| Metric | Description | TTS-1.5-Max |
|---|---|---|
| Latency | Time-to-first-audio chunk (P90) | <250ms |
| Output | Audio sample rate | 48kHz |
| Languages | Supported languages | 15+ |
Model Specifications
License: Custom
Training Data: February 2025
Last Updated: May 2026
Input Type: Text
Output Type: Audio
Provider: Theai-Inc
Languages: 15 Languages