Mistral AI Releases Voxtral Transcribe 2 Speech-to-Text Models

Hey there, tech enthusiasts! Mistral AI has just released Voxtral Transcribe 2, a new family of speech-to-text models that offers state-of-the-art transcription, speaker diarization, and low-latency real-time transcription.

This release includes two powerful models: Voxtral Mini Transcribe V2 for all your batch processing needs, and Voxtral Realtime, purpose-built for lightning-fast live applications.

What it's for

For heavy-duty batch transcription, there's Voxtral Mini Transcribe V2. It offers speaker diarization, which labels who said what, and context biasing: you can supply up to 100 words or phrases so that proper nouns, technical terms, and tricky jargon are more likely to be transcribed correctly. It also provides word-level timestamps and supports 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
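
To make diarization, word-level timestamps, and the 100-entry biasing limit more concrete, here is a minimal Python sketch. It does not call the Mistral API, and the `Word` and `Turn` shapes and both function names are hypothetical: the real response format may differ, so check the Voxtral Mini Transcribe V2 documentation before relying on any field name. The only number taken from the article is the cap of 100 biasing terms.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    """One transcribed word (hypothetical shape, not the real API schema)."""
    text: str
    start: float   # seconds from the start of the audio
    end: float     # seconds from the start of the audio
    speaker: str   # diarization label, e.g. "A" or "B"


@dataclass
class Turn:
    """A run of consecutive words spoken by the same speaker."""
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: Iterable[Word]) -> List[Turn]:
    """Merge consecutive words with the same speaker label into turns."""
    turns: List[Turn] = []
    for word in words:
        if turns and turns[-1].speaker == word.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = word.end
            turns[-1].text += " " + word.text
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns


def prepare_bias_terms(terms: Iterable[str], limit: int = 100) -> List[str]:
    """Strip blanks, drop case-insensitive duplicates, and enforce the cap.

    The default limit of 100 comes from the article; everything else here
    is an illustrative assumption.
    """
    seen = set()
    cleaned: List[str] = []
    for term in terms:
        candidate = term.strip()
        key = candidate.casefold()
        if candidate and key not in seen:
            seen.add(key)
            cleaned.append(candidate)
    if len(cleaned) > limit:
        raise ValueError(f"{len(cleaned)} biasing terms given, limit is {limit}")
    return cleaned
```

Grouping words into turns is usually the first step toward a readable "who said what" transcript, and cleaning the biasing list up front keeps a request within the stated limit.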

Then there's Voxtral Realtime, designed for live applications like voice agents and real-time subtitles. It uses a streaming architecture to deliver transcriptions with low latency, configurable down to under 200 ms. It's also an open-weights model, released under the Apache 2.0 license, and at 4B parameters it is small enough to target efficient edge deployment. The weights are available on Hugging Face. It supports the same 13 languages as its batch counterpart.
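
For a rough sense of what a 4B parameter footprint means on an edge device, the back-of-the-envelope Python sketch below estimates the memory needed for the weights alone at a few common numeric precisions. The article does not say which precision the released weights use, so the precisions listed are assumptions, and the estimate ignores activations, caches, and runtime overhead.

```python
# Bytes needed to store one parameter at common precisions.
# Which of these (if any) applies to the released weights is an assumption.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    params = 4e9  # the 4B figure quoted in the article
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_memory_gb(params, dtype):.1f} GB for weights only")
```

At 16-bit precision, 4 billion parameters take about 8 GB for the weights, and about 2 GB at 4 bits, which is why a model of this size is a plausible fit for laptops and other edge hardware.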

Why it matters

Why the excitement? Voxtral Transcribe 2 isn't just fast; it's also accurate and inexpensive. According to Mistral's announcement, Voxtral Mini Transcribe V2 achieves a word error rate of approximately 4% on the FLEURS benchmark and is priced at $0.003/min. Mistral reports that it outperforms GPT-4o mini Transcribe and Gemini 2.5 Flash in accuracy, and that it processes audio approximately 3x faster than ElevenLabs’ Scribe v2 while matching it on quality at a fraction of the cost.
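
The two headline numbers above are easy to sanity-check yourself. The Python sketch below shows the standard definition of word error rate (word-level edit distance divided by the number of reference words) and a cost estimate at the quoted $0.003/min. It is a simplified illustration: benchmark evaluations such as FLEURS typically normalize text (casing, punctuation) before scoring, which this sketch does not do, and the function names are made up for the example.

```python
PRICE_PER_MINUTE_USD = 0.003  # price quoted in the article


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def batch_cost_usd(audio_minutes: float) -> float:
    """Estimated cost of transcribing the given amount of audio."""
    return audio_minutes * PRICE_PER_MINUTE_USD


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.1%}")
    print(f"1,000 hours of audio: ${batch_cost_usd(1000 * 60):,.2f}")
```

In the example, one substituted word out of six reference words gives a word error rate of about 16.7%, and 1,000 hours of audio (60,000 minutes) comes to about $180 at the quoted price.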

For developers, Voxtral Realtime's open weights mean the model can run on your own hardware, which makes it possible to build privacy-first, real-time voice applications right on the edge. Together, the two models make advanced speech AI more accessible and affordable.

Where you get it

Ready to give it a spin? Mistral has launched a new audio playground in Mistral Studio where you can test Voxtral Transcribe 2 with diarization and timestamps and see the results firsthand.

Voxtral Mini Transcribe V2 is available now via API. For more technical details on its powerful batch processing capabilities, check out the Voxtral Mini Transcribe V2 documentation. Voxtral Realtime is available via API and as open weights on Hugging Face. Whether you're building a new voice assistant or need to analyze hours of meeting recordings, these models are ready to integrate into your workflows.

Read more: Voxtral Transcribe 2 Announcement for full details and links to the audio playground, docs, and Hugging Face.
