Higgs-Audio-v3-Speech-to-Text
Version: 2
Higgs-Audio-v3-Speech-to-Text is a high-performance automatic speech recognition (ASR) model developed by BosonAI. Built on a 1.7B-parameter architecture, it transcribes audio in 60+ languages and exposes an OpenAI Whisper-compatible API.
Key Features
- Multilingual Support: Transcribes audio in 60+ languages including English, Chinese, Japanese, Korean, French, German, Spanish, and many more
- Voice Activity Detection (VAD): Built-in Silero VAD automatically segments audio, handling long recordings and multi-speaker scenarios
- Whisper-Compatible API: Drop-in replacement for OpenAI's Whisper API via the /v1/audio/transcriptions endpoint
- High Accuracy: Achieves competitive Word Error Rate (WER) on standard benchmarks
Model Architecture
The model uses a multi-component inference pipeline:
- HiggsAudio3: Core audio understanding model (1.7B parameters)
- Whisper-large-v3-turbo: Audio feature extraction
- Silero VAD: Voice activity detection for audio segmentation
- vLLM: High-throughput serving engine for optimized inference
Intended Use
This model is designed for:
- Real-time and batch audio transcription
- Multilingual speech-to-text applications
- Call center transcription and analysis
- Meeting transcription and summarization
- Subtitle and caption generation
- Voice-powered search and indexing
API Format
The model exposes an OpenAI Whisper-compatible REST API:
POST /v1/audio/transcriptions
Content-Type: multipart/form-data
Parameters:
- file: Audio file (required)
- language: Language code (optional, auto-detected if not specified)
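As a minimal sketch of a call to this endpoint, the Python below builds the multipart/form-data request using only the standard library. The base URL `http://localhost:8000`, the helper names `build_transcription_request` and `transcribe`, and the assumption that the response is JSON (as in OpenAI's Whisper API) are illustrative and not confirmed by this model card. Only the `file` and `language` fields come from the parameter list above.

```python
import json
import os
import uuid
import urllib.request


def build_transcription_request(base_url, audio_bytes, filename, language=None):
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = []

    # Optional text field: language code (auto-detected by the server if omitted).
    if language is not None:
        parts.append(
            (
                f"--{boundary}\r\n"
                'Content-Disposition: form-data; name="language"\r\n\r\n'
                f"{language}\r\n"
            ).encode("utf-8")
        )

    # Required file field carrying the raw audio bytes.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(base_url, path, language=None):
    """Send an audio file and return the decoded response (assumed to be JSON)."""
    with open(path, "rb") as f:
        audio = f.read()
    req = build_transcription_request(base_url, audio, os.path.basename(path), language)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a deployment reachable at the assumed address, `transcribe("http://localhost:8000", "meeting.wav", language="en")` would send the file. Authentication headers, if the deployment requires them, would need to be added to the request.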
Supported Audio Formats
WAV, MP3, FLAC, OGG, M4A, and other common audio formats.
Hardware Requirements
- Minimum: 1x NVIDIA A100 80GB GPU
- Recommended SKU: Standard_NC24ads_A100_v4
Training and Development Notes
Training Data
The model was trained on publicly available speech datasets covering 60+ languages. Training data includes diverse speakers, accents, recording conditions, and audio quality levels to ensure robust performance across real-world scenarios.
Model Development
- Developer: BosonAI
- Architecture: HiggsAudio Understanding v3 (1.7B parameters)
- Inference Engine: vLLM with custom optimizations
- Audio Processing: Silero VAD for voice activity detection, Whisper-large-v3-turbo for audio feature extraction
Responsible AI Considerations
- The model is designed for transcription purposes only and does not generate or synthesize speech
- Audio content is processed in real time and is not stored by the model
- Users should implement appropriate data handling practices for sensitive audio content
- The model may produce errors that should be reviewed before use in critical applications
Evaluation
Benchmarks
Word Error Rate (WER)
| Dataset | Language | Samples | WER |
|---|---|---|---|
| Common Voice 15 | English | 500 | 13.12% |
Evaluation Methodology
- WER is computed as the word-level Levenshtein (edit) distance between the transcript and the reference, divided by the number of words in the reference
- Audio samples are processed with Voice Activity Detection (VAD) segmentation enabled
- Results are measured on 500 randomly selected English samples from the Common Voice 15 dataset
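The WER metric described above can be sketched in a few lines of Python. This is an illustrative implementation of the standard definition, WER = (substitutions + deletions + insertions) / reference word count, and not the evaluation script used for the reported number. The function name `word_error_rate` is hypothetical, and the sketch applies no text normalization (casing, punctuation), since the card does not specify any.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" involves one substitution and one deletion over six reference words, giving a WER of 2/6. Whether the reported figure aggregates errors over the whole sample set or averages per-utterance scores is not stated in this card.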
Limitations
- Performance may vary across languages and accents
- Very noisy audio environments may increase error rates
- Extremely long audio files (>30 minutes) should be pre-segmented for optimal results
- The model performs best on clear speech with minimal background noise
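One simple way to pre-segment a long recording is to cut it into fixed-length pieces before sending each piece for transcription. The sketch below does this for uncompressed WAV files with Python's standard `wave` module. The helper name `split_wav` and the output naming scheme are hypothetical, and the 30-minute default simply mirrors the limit mentioned above. Fixed-length cuts can fall in the middle of a word, so cutting at silences may give better results.

```python
import wave


def split_wav(src_path, out_prefix, max_seconds=1800):
    """Split a PCM WAV file into consecutive chunks of at most max_seconds each."""
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(max_seconds * src.getframerate())
        if frames_per_chunk <= 0:
            raise ValueError("max_seconds is too small for this sample rate")

        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of file reached
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channel count, sample width, and sample rate; the frame
                # count in the header is corrected when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(out_path)
            index += 1
    return paths
```

Each returned path can then be transcribed separately and the resulting texts concatenated in order.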
Model Specifications
License: Custom
Last Updated: March 2026
Input Type: Audio
Output Type: Text
Provider: BosonAI
Languages: 61