Higgs-Audio-v3-Speech-to-Text
Version: 2
Provider: BosonAI
Last updated: March 2026
High-performance automatic speech recognition model supporting 60+ languages with OpenAI Whisper-compatible API
Tags: Audio, Conversation, Instruction
Higgs-Audio-v3-Speech-to-Text is a high-performance automatic speech recognition (ASR) model developed by BosonAI. Built on a 1.7B parameter architecture, it delivers accurate transcription across 60+ languages with an OpenAI Whisper-compatible API interface.

Key Features

  • Multilingual Support: Transcribes audio in 60+ languages including English, Chinese, Japanese, Korean, French, German, Spanish, and many more
  • Voice Activity Detection (VAD): Built-in Silero VAD automatically segments audio, handling long recordings and multi-speaker scenarios
  • Whisper-Compatible API: Drop-in replacement for OpenAI's Whisper API via the /v1/audio/transcriptions endpoint
  • High Accuracy: Reports a 13.12% Word Error Rate (WER) on a 500-sample English subset of Common Voice 15

Model Architecture

The model uses a multi-component inference pipeline:
  • HiggsAudio3: Core audio understanding model (1.7B parameters)
  • Whisper-large-v3-turbo: Audio feature extraction
  • Silero VAD: Voice activity detection for audio segmentation
  • vLLM: High-throughput serving engine for optimized inference

Intended Use

This model is designed for:
  • Real-time and batch audio transcription
  • Multilingual speech-to-text applications
  • Call center transcription and analysis
  • Meeting transcription and summarization
  • Subtitle and caption generation
  • Voice-powered search and indexing

API Format

The model exposes an OpenAI Whisper-compatible REST API:
POST /v1/audio/transcriptions
Content-Type: multipart/form-data

Parameters:
  - file: Audio file (required)
  - language: Language code (optional, auto-detected if not specified)
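
As an illustration of the request shape above, the following is a minimal sketch that builds the multipart/form-data POST using only the Python standard library. The helper name `build_transcription_request`, the base URL, and the file name are hypothetical. The sketch also assumes no authentication header is needed, which a real deployment will most likely require, and it does not assume any particular response schema.

```python
import urllib.request
import uuid


def build_transcription_request(base_url, filename, audio_bytes, language=None):
    """Build (but do not send) a POST to /v1/audio/transcriptions.

    Hypothetical helper: only the endpoint path and the `file` and
    `language` form fields come from the model card.
    """
    boundary = uuid.uuid4().hex
    parts = []

    # Optional text field; when omitted the service auto-detects the language.
    if language is not None:
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="language"\r\n\r\n'
                f"{language}\r\n"
            ).encode("utf-8")
        )

    # Required file field carrying the raw audio bytes.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            f"Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )

    # Closing boundary terminates the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Example usage (assumes a server is reachable at this hypothetical address):
# with open("meeting.wav", "rb") as f:
#     req = build_transcription_request(
#         "http://localhost:8000", "meeting.wav", f.read(), language="en"
#     )
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Leaving out `language` exercises the auto-detection path described in the parameter list.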

Supported Audio Formats

WAV, MP3, FLAC, OGG, M4A, and other common audio formats.

Hardware Requirements

  • Minimum: 1x NVIDIA A100 80GB GPU
  • Recommended SKU: Standard_NC24ads_A100_v4

Training and Development Notes

Training Data

The model was trained on publicly available speech datasets covering 60+ languages. Training data includes diverse speakers, accents, recording conditions, and audio quality levels to ensure robust performance across real-world scenarios.

Model Development

  • Developer: BosonAI
  • Architecture: HiggsAudio Understanding v3 (1.7B parameters)
  • Inference Engine: vLLM with custom optimizations
  • Audio Processing: Silero VAD for voice activity detection, Whisper-large-v3-turbo for audio feature extraction

Responsible AI Considerations

  • The model is designed for transcription purposes only and does not generate or synthesize speech
  • Audio content is processed in real time and is not stored by the model
  • Users should implement appropriate data handling practices for sensitive audio content
  • The model may produce errors that should be reviewed before use in critical applications

Evaluation

Benchmarks

Word Error Rate (WER)

Dataset            Language                 WER
Common Voice 15    English (500 samples)    13.12%

Evaluation Methodology

  • WER is computed as the word-level Levenshtein (edit) distance between the reference and the hypothesis, divided by the number of reference words
  • Audio samples are processed with Voice Activity Detection (VAD) segmentation enabled
  • Results are measured on the Common Voice 15 dataset using 500 randomly selected English samples
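
For readers who want to reproduce the metric, here is a minimal sketch of a Levenshtein-based WER in Python. It splits on whitespace and applies no text normalization (casing, punctuation). The model card does not state which normalization was used for the reported figure, so results from this sketch may not match it exactly. The function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# One substitution out of three reference words.
print(word_error_rate("a b c", "a x c"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.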

Limitations

  • Performance may vary across languages and accents
  • Very noisy audio environments may increase error rates
  • Extremely long audio files (>30 minutes) should be pre-segmented for optimal results
  • The model performs best on clear speech with minimal background noise
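
The pre-segmentation advice above can be sketched as a small planning step that splits a recording into windows of at most 30 minutes. This is only an illustration: the function name `plan_segments` is hypothetical, and fixed-length cuts can fall in the middle of a word, so a real pipeline would likely move each boundary to a nearby pause.

```python
def plan_segments(total_seconds, max_seconds=30 * 60):
    """Return (start, end) windows in seconds, each at most max_seconds long."""
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")

    total = float(total_seconds)
    segments = []
    start = 0.0
    while start < total:
        # The last window is shortened to end exactly at the recording's end.
        end = min(start + max_seconds, total)
        segments.append((start, end))
        start = end
    return segments


# A 75-minute recording becomes two full windows plus a 15-minute remainder.
for start, end in plan_segments(75 * 60):
    print(f"{start:.0f}s to {end:.0f}s")
```

Each window can then be cut with any audio tool and submitted as a separate transcription request.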

Model Specifications

  • License: Custom
  • Last Updated: March 2026
  • Input Type: Audio
  • Output Type: Text
  • Provider: BosonAI
  • Languages: 61