Cohere-transcribe: State-of-the-Art 14-Language Speech Model Tops Leaderboard
Written by Mochi
Drafted with AI; edited and reviewed by a human.
TL;DR
- Cohere has released cohere-transcribe-03-2026, a new 2B-parameter speech recognition model.
- It achieves state-of-the-art accuracy across 14 enterprise-critical languages, topping the Hugging Face Open ASR Leaderboard for English.
- The model is designed for efficient production inference, offering significant speed advantages over competitors.
- It's available under an Apache 2.0 license and can be explored on Hugging Face Spaces.
Cohere has officially launched cohere-transcribe-03-2026, a new 2B-parameter speech recognition model trained from scratch to support 14 languages, with a focus on enterprise-grade accuracy and efficiency. The model claims the top spot on the Hugging Face Open ASR Leaderboard for English and performs competitively with, or better than, existing open-source models in the 13 other supported languages.
This release marks Cohere's first foray into audio models, and it's clear they've focused on foundational strengths. The development team emphasized key areas such as a strong multilingual tokenizer, an optimized training regime, and a carefully curated data mix. The outcome is a model that not only excels in benchmarks but also performs exceptionally well in human evaluations, suggesting a tangible improvement in transcription quality for real-world applications.
The cohere-transcribe-03-2026 model boasts impressive accuracy figures, particularly in English, where it has surpassed both proprietary and open-source competitors. Its performance is further bolstered by its strong showing across a diverse range of languages including German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Chinese (Mandarin), Japanese, and Korean. This comprehensive language support makes it a valuable tool for global businesses and researchers alike.
The architecture of cohere-transcribe-03-2026 is a 2B-parameter encoder-decoder transformer with cross-attention, built around a Fast-Conformer encoder. Over 90% of its parameters sit in the encoder, paired with a deliberately lightweight decoder. Because the decoder runs once per generated token, this split minimizes autoregressive inference compute without compromising accuracy. It contrasts with models that build on pre-trained text LLMs, which can be cheaper to train but incur higher inference costs.
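To see why the encoder-heavy split helps, note that the encoder runs once over the whole utterance (and parallelizes well), while the decoder runs sequentially, once per generated token. A back-of-envelope sketch, using illustrative parameter counts rather than published figures:

```python
# Why a lightweight decoder cuts autoregressive inference cost:
# the encoder processes the full audio in one parallel pass, but the
# decoder must run once per output token, so its size sets the
# sequential cost. Parameter counts below are illustrative assumptions.

def decode_flops_per_token(dec_params):
    # Rough rule of thumb: ~2 FLOPs per parameter per generated token.
    return 2 * dec_params

light_decoder = decode_flops_per_token(0.15e9)  # >90% of 2B in the encoder
even_split    = decode_flops_per_token(1.0e9)   # hypothetical 50/50 split

print(f"sequential decode cost ratio: {even_split / light_decoder:.1f}x")
```

Under these assumed numbers, the lightweight decoder does several times less sequential work per token than an even split would, which is where the latency advantage at inference time comes from.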
Beyond raw accuracy, a critical aspect of cohere-transcribe-03-2026 is its focus on production readiness. Cohere collaborated with vLLM to enable efficient serving of the model using an open-source stack, a significant step for developers looking to integrate advanced ASR capabilities into their applications. The model achieves an offline throughput that is reportedly three times higher than similarly sized competitors, making it an attractive option for high-volume transcription tasks.
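The post doesn't include deployment details, but vLLM deployments typically expose an OpenAI-compatible HTTP server. A minimal sketch of what serving could look like; the model ID and transcription-endpoint support for this model are assumptions, not confirmed by the announcement:

```shell
# Serve the model with vLLM's OpenAI-compatible server
# (model ID and endpoint support are assumptions).
vllm serve CohereLabs/cohere-transcribe-03-2026 --port 8000

# Transcribe an audio file against the transcription endpoint:
curl http://localhost:8000/v1/audio/transcriptions \
  -F model="CohereLabs/cohere-transcribe-03-2026" \
  -F file="@meeting.wav"
```

Check Cohere's model card and the vLLM documentation for the actual model identifier and supported endpoints before deploying.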
Summary
- cohere-transcribe-03-2026 is a new 2B-parameter speech recognition model from Cohere, offering state-of-the-art accuracy across 14 languages.
- It leads the Hugging Face Open ASR Leaderboard for English and performs competitively in other supported languages.
- The model is optimized for efficient production inference, boasting high throughput and low latency.
- It is available on Hugging Face under an Apache 2.0 license.
Source: Introducing Cohere-transcribe: state-of-the-art speech recognition