Inworld TTS 1.5 Max
Version: 1
Theai-Inc · Last updated May 2026
Flagship text-to-speech model with highest quality and expressiveness for demanding applications.
Audio · Conversation · Low latency
Inworld TTS 1.5 Max is the flagship text-to-speech model in the Inworld TTS 1.5 series, delivering the highest quality and expressiveness for demanding voice applications. It is a state-of-the-art speech synthesis system designed for production-grade, realtime voice-enabled applications, producing the most natural and expressive speech in the lineup with sub-250ms time-to-first-audio. This makes it ideal for premium interactive dialogue, entertainment, and applications where voice quality is paramount.

Inworld TTS 1.5 Max is an autoregressive transformer-based speech language model (SpeechLM) with 8B parameters, paired with a high-resolution neural audio codec. The architecture is a two-stage generative system: the SpeechLM generates discrete audio tokens from text, and the neural codec decoder converts these tokens into high-fidelity 48kHz audio.

The model supports 15+ languages, customizable speaking speed (0.5x-1.5x), expressiveness adjustments, custom pronunciation, and character/word/phoneme/viseme timestamps for applications such as live captioning and lip-sync.

Intended Use

Primary Use Cases

Inworld TTS 1.5 Max sets a new standard for developers building voice-enabled applications at scale. Common use cases include:
  1. Customer Service. Fluid, multi-turn dialogue for contact centers, live support, sales/GTM, and more.
  2. Regulated Industries. Deployments in air-gapped or private environments for government and enterprise sectors requiring strict data control (e.g., healthcare, financial services).
  3. Education / Professional Development. Realtime synthesis for training, interviewing, and language learning.
  4. Companion Apps. Engaging experiences such as fitness coaching, guided meditation, personalized shopping.
  5. Interactive Media & Entertainment. Character/IP voices with emotional range and contextual awareness for gaming, advertising, storytelling, and branded experiences.
  6. Realtime Translation & Dubbing. Voice-enabled experiences with timestamp support.

Model Capabilities

  • Multilingual Support. Supports 15+ languages, including English, Hindi, Arabic, Hebrew, Chinese, and major European languages. Additional languages upon request.
  • Customization. Supports talking speed adjustments (0.5x to 1.5x), expressiveness adjustments, custom pronunciation, and emphasis markers (audio markups).
  • Timestamps. Powers applications such as live captioning, lip-sync, etc. with character, word, phoneme, and viseme timestamps.
  • Custom Voices. Available upon request. TTS 1.5 includes a rich library of studio-quality voices optimized for various use cases.
  • Data Control. On-premise deployment ensures no data (text or audio) leaves the customer's environment.

Out-of-Scope Use Cases

Inworld TTS 1.5 Max is designed for text-to-speech synthesis and is not intended for speech recognition, speaker identification, or general-purpose language understanding tasks. The model should not be used to generate speech that impersonates real individuals without their consent or to produce misleading audio content.

Responsible AI Considerations

Inworld TTS services provide text-to-speech capabilities, generating speech directly from content provided by customers. All uses of Inworld are subject to the acceptable use policy and terms of service.
  • Trust & Safety. The model generates audio only from text explicitly provided by the caller. It does not generate autonomous or unsolicited content.
  • Compliance. Suitable for environments subject to HIPAA, SOC 2, GDPR, or other regulatory requirements. The latest information is available at: https://inworld.ai/security
  • Voice Consent. Custom voice cloning is only available for voices with documented consent from the speaker.

Training Data

The training corpus for Inworld TTS 1.5 comprises a large-scale, multilingual audio-text dataset curated from a combination of publicly available sources and licensed third-party providers. The SpeechLM is trained in multiple stages: it is first pre-trained on millions of raw speech and text samples, then fine-tuned for zero-shot speech synthesis on hundreds of thousands of hours of high-quality text/audio pairs, and finally RL-aligned to synthesize natural-sounding audio according to human preferences. The audio codec is trained on the same data.

Minimum System Requirements

  • Hardware: NVIDIA H100 GPU
  • RAM: 64GB+ system memory
  • CPU: 8+ cores
  • OS: Ubuntu 22.04 LTS
  • Software: Docker, NVIDIA Container Toolkit, CUDA 13.0+

API Reference

After deploying the model, you receive a scoring endpoint URL and API key from Azure AI Foundry. All API paths below are relative to this endpoint URL.

List Available Voices

GET <scoring-endpoint>/tts/v1/voices
Returns a JSON array of available voice IDs and their metadata.
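As a sketch, the listing can be fetched with Python's standard library. The exact metadata schema is not documented here, so the `voice_id` field name in the helper below is an assumption carried over from the synthesis request body:

```python
import json
import urllib.request


def list_voices(endpoint: str, api_key: str) -> list:
    """GET the available voices from the deployed scoring endpoint."""
    req = urllib.request.Request(
        f"{endpoint}/tts/v1/voices",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def voice_ids(voices: list) -> list:
    # Assumption: each metadata entry carries a "voice_id" key,
    # matching the field used in the synthesis request below.
    return [v["voice_id"] for v in voices]
```

A typical flow would be `voice_ids(list_voices(ENDPOINT, API_KEY))` to pick an ID for the synthesis request.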

Synthesize Speech

POST <scoring-endpoint>/tts/v1/voice
Content-Type: application/json
Request body:
  • text (string, required): Text to synthesize
  • voice_id (string, required): Voice ID (use /tts/v1/voices to list available voices)
  • model_id (string, required): inworld-tts-1.5-max
  • audio_config.audio_encoding (string, optional): LINEAR16 (default), MP3, or OGG_OPUS
  • audio_config.sample_rate_hertz (integer, optional): 48000 (default), 24000, or 16000
  • audio_config.speaking_rate (float, optional): Speaking speed, 0.5 to 1.5 (default 1.0)
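The documented ranges can be enforced client-side before sending a request. The sketch below uses the field names and values from the table; the validation helper itself is ours, not part of the API:

```python
def make_audio_config(audio_encoding="LINEAR16", sample_rate_hertz=48000, speaking_rate=1.0):
    """Build an audio_config dict, validating against the documented ranges."""
    if audio_encoding not in ("LINEAR16", "MP3", "OGG_OPUS"):
        raise ValueError(f"unsupported audio_encoding: {audio_encoding}")
    if sample_rate_hertz not in (48000, 24000, 16000):
        raise ValueError(f"unsupported sample_rate_hertz: {sample_rate_hertz}")
    if not 0.5 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0.5 and 1.5")
    return {
        "audio_encoding": audio_encoding,
        "sample_rate_hertz": sample_rate_hertz,
        "speaking_rate": speaking_rate,
    }
```

Rejecting out-of-range values locally avoids a round trip to the endpoint for requests that would fail anyway.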
Example request:
ENDPOINT="<your-scoring-endpoint-url>"
API_KEY="<your-api-key>"

curl -X POST "${ENDPOINT}/tts/v1/voice" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{
    "text": "Hello, this is a test of the Inworld text to speech system.",
    "voice_id": "Craig",
    "model_id": "inworld-tts-1.5-max",
    "audio_config": {"audio_encoding": "LINEAR16", "sample_rate_hertz": 48000}
  }'
Response:
{
  "audioContent": "<base64-encoded audio>",
  "timepoints": [...]
}
The audioContent field contains base64-encoded audio in the requested format. Decode it to get the raw audio bytes.
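A minimal decoding sketch in Python, assuming LINEAR16 output is raw 16-bit PCM; the mono channel count is an assumption, since the response does not document it:

```python
import base64
import io
import wave


def decode_linear16(response: dict, sample_rate: int = 48000) -> bytes:
    """Decode the base64 audioContent field and wrap the PCM in a WAV container."""
    pcm = base64.b64decode(response["audioContent"])
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)             # assumption: mono output
        wav.setsampwidth(2)             # LINEAR16 = 16-bit samples
        wav.setframerate(sample_rate)   # match audio_config.sample_rate_hertz
        wav.writeframes(pcm)
    return buf.getvalue()
```

Writing the returned bytes to a `.wav` file gives a playable result; for MP3 or OGG_OPUS encodings the decoded bytes are already a complete file and need no container.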

Streaming

POST <scoring-endpoint>/tts/v1/voice:stream
Content-Type: application/json
Same request body as /tts/v1/voice. Returns NDJSON (newline-delimited JSON) with audio chunks streamed as they are generated, enabling sub-250ms time-to-first-audio. For full API documentation, see the Inworld TTS API Reference.
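Each NDJSON line can be parsed as soon as it arrives. The sketch below assumes each chunk object carries a base64 `audioContent` field mirroring the non-streaming response shape:

```python
import base64
import json


def iter_audio_chunks(lines):
    """Yield decoded audio bytes from an NDJSON stream, one chunk per line.

    `lines` is any iterable of JSON-encoded lines (e.g. an HTTP response
    iterated line by line). Assumption: each chunk object carries a base64
    "audioContent" field, as in the non-streaming response.
    """
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        obj = json.loads(line)
        if "audioContent" in obj:
            yield base64.b64decode(obj["audioContent"])
```

Feeding the decoded chunks straight into an audio player as they arrive is what realizes the sub-250ms time-to-first-audio in practice.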
Benchmarks

Inworld TTS 1.5 ranks at the top of blind human-preference evaluations such as the Artificial Analysis and Hugging Face TTS Arenas, and performs well on stability and latency benchmarks.
  • Latency: <250ms time-to-first-audio chunk (P90)
  • Output: 48kHz audio sample rate
  • Languages: 15+ supported languages
Additional benchmarks available upon request.
Model Specifications
  • License: Custom
  • Training Data: February 2025
  • Last Updated: May 2026
  • Input Type: Text
  • Output Type: Audio
  • Provider: Theai-Inc
  • Languages: 15 languages