Higgs-Audio-v3-Speech-to-Text

Version: 2

BosonAI • Last updated March 2026

High-performance automatic speech recognition model supporting 60+ languages with OpenAI Whisper-compatible API

Tags: Audio, Conversation, Instruction

Higgs-Audio-v3-Speech-to-Text is a high-performance automatic speech recognition (ASR) model developed by BosonAI. Built on a 1.7B-parameter architecture, it transcribes audio in 60+ languages and is served through an OpenAI Whisper-compatible API.

Key Features

Multilingual Support: Transcribes audio in 60+ languages including English, Chinese, Japanese, Korean, French, German, Spanish, and many more
Voice Activity Detection (VAD): Built-in Silero VAD automatically segments audio, handling long recordings and multi-speaker scenarios
Whisper-Compatible API: Drop-in replacement for OpenAI's Whisper API via the /v1/audio/transcriptions endpoint
High Accuracy: Reports a 13.12% Word Error Rate (WER) on a 500-sample English subset of Common Voice 15 (see Evaluation)

Model Architecture

The model uses a multi-component inference pipeline:

HiggsAudio3: Core audio understanding model (1.7B parameters)
Whisper-large-v3-turbo: Audio feature extraction
Silero VAD: Voice activity detection for audio segmentation
vLLM: High-throughput serving engine for optimized inference

Intended Use

This model is designed for:

Real-time and batch audio transcription
Multilingual speech-to-text applications
Call center transcription and analysis
Meeting transcription and summarization
Subtitle and caption generation
Voice-powered search and indexing

API Format

The model exposes an OpenAI Whisper-compatible REST API:

POST /v1/audio/transcriptions
Content-Type: multipart/form-data

Parameters:
  - file: Audio file (required)
  - language: Language code (optional, auto-detected if not specified)
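
The request described above can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the base URL, the absence of authentication headers, and the JSON response carrying a "text" field (as in OpenAI's Whisper API) are assumptions that this model card does not state, and the helper names are hypothetical.

```python
import json
import os
import urllib.request
import uuid
from typing import Optional


def build_transcription_request(base_url: str, audio_bytes: bytes, filename: str,
                                language: Optional[str] = None) -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = []
    # Required "file" field carrying the raw audio bytes.
    parts.append(
        (f'--{boundary}\r\n'
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         'Content-Type: application/octet-stream\r\n\r\n').encode("utf-8")
        + audio_bytes + b"\r\n"
    )
    # Optional "language" field; when omitted the service auto-detects the language.
    if language is not None:
        parts.append(
            (f'--{boundary}\r\n'
             'Content-Disposition: form-data; name="language"\r\n\r\n'
             f'{language}\r\n').encode("utf-8")
        )
    parts.append(f'--{boundary}--\r\n'.encode("utf-8"))
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def transcribe(base_url: str, path: str, language: Optional[str] = None) -> str:
    """Send one audio file and return the transcript text."""
    with open(path, "rb") as f:
        audio = f.read()
    req = build_transcription_request(base_url, audio, os.path.basename(path), language)
    with urllib.request.urlopen(req) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    # Assumption: Whisper-style JSON body of the form {"text": "..."}.
    return payload["text"]
```

A call might look like transcribe("http://localhost:8000", "meeting.wav", language="en"), where both the URL and the file name are placeholders. A real deployment may also require an API key or other headers, which this sketch omits.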

Supported Audio Formats

WAV, MP3, FLAC, OGG, M4A, and other common audio formats.

Hardware Requirements

Minimum: 1x NVIDIA A100 80GB GPU
Recommended SKU: Standard_NC24ads_A100_v4

Training and Development Notes

Training Data

The model was trained on publicly available speech datasets covering 60+ languages. Training data includes diverse speakers, accents, recording conditions, and audio quality levels to ensure robust performance across real-world scenarios.

Model Development

Developer: BosonAI
Architecture: HiggsAudio Understanding v3 (1.7B parameters)
Inference Engine: vLLM with custom optimizations
Audio Processing: Silero VAD for voice activity detection, Whisper-large-v3-turbo for audio feature extraction

Responsible AI Considerations

The model is designed for transcription purposes only and does not generate or synthesize speech
Audio content is processed in real time and is not stored by the model
Users should implement appropriate data handling practices for sensitive audio content
The model may produce errors that should be reviewed before use in critical applications

Evaluation

Benchmarks

Word Error Rate (WER)

Dataset	Language	Samples	WER
Common Voice 15	English	500	13.12%

Evaluation Methodology

WER is computed as the word-level Levenshtein (edit) distance between the reference transcript and the model output, divided by the number of words in the reference
Audio samples are processed with Voice Activity Detection (VAD) segmentation enabled
Results are measured on the Common Voice 15 dataset using 500 randomly selected English samples
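
To make the metric concrete, here is a minimal sketch of a Levenshtein-based WER for a single reference/hypothesis pair. It is an illustration rather than the evaluation script behind the figure above: it splits on whitespace and applies no text normalization (casing, punctuation), and how the reported score normalizes text or aggregates over the 500 samples is not stated in this card.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{score:.4f}")  # 0.3333
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.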

Limitations

Performance may vary across languages and accents
Very noisy audio environments may increase error rates
Extremely long audio files (>30 minutes) should be pre-segmented for optimal results
The model performs best on clear speech with minimal background noise
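
For the pre-segmentation suggested above, one possible approach for uncompressed PCM WAV input is to cut the file into fixed-length chunks with Python's standard wave module. This is only a sketch under stated assumptions: the 10-minute chunk length and the output naming scheme are illustrative choices, not recommendations from this card, and the function name is hypothetical.

```python
import wave
from typing import List


def split_wav(path: str, chunk_seconds: int = 600, prefix: str = "chunk") -> List[str]:
    """Split a PCM WAV file into consecutive chunks of at most chunk_seconds each."""
    out_paths: List[str] = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of input reached
            out_path = f"{prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channels, sample width, and frame rate from the source;
                # the frame count in the header is corrected when dst is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Fixed-length cuts can land in the middle of a word, so cutting at pauses or overlapping adjacent chunks slightly may give cleaner transcripts at the boundaries. Compressed formats such as MP3 or M4A would need to be decoded first, which this sketch does not cover.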

Model Specifications

License: Custom

Last Updated: March 2026

Input Type: Audio

Output Type: Text

Provider: BosonAI

Languages: 61

Quick Start