Vividh-ASR Benchmark Fixes Whisper's Studio-Bias for Indic Languages

TL;DR

Adalat AI has introduced a new benchmark, Vividh-ASR, to diagnose "studio-bias" in ASR models for Indic languages, along with a fine-tuning recipe to fix it.
This bias causes models to perform poorly on spontaneous speech despite good performance on clean, studio-recorded audio.
A novel fine-tuning recipe for OpenAI's Whisper models significantly improves robustness across various acoustic conditions.
Fine-tuning Whisper with a high learning rate (2e-4) proved to be the most impactful change; the resulting models outperform larger public models.

Building accurate Automatic Speech Recognition (ASR) systems for Indic languages is difficult, in large part because of a prevalent issue known as "studio-bias": models perform well on clean, carefully recorded studio audio but degrade substantially on the natural, spontaneous speech of everyday conversation. Adalat AI has introduced the Vividh-ASR benchmark to diagnose this problem systematically and has developed a corresponding fine-tuning recipe to make models like OpenAI's Whisper more robust. The benchmark stratifies evaluation into four tiers of acoustic complexity, giving a granular view of where models falter.

The core of the problem lies in the gap between idealized training data and real-world usage. Even large Whisper models, when fine-tuned on typical datasets that lean towards cleaner speech, often struggle with the variability of spontaneous dialogue, which includes different accents, background noise, and informal speaking styles. The Vividh-ASR tiers, which range from studio and broadcast audio (Tiers A and B) to spontaneous speech (Tier C) and noisy conditions (Tier D), make this degradation visible. Because the benchmark stratifies by acoustic complexity rather than only by domain, it shows where existing models break down in practical scenarios.
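To make the tiered evaluation concrete, here is a minimal Python sketch of how Word Error Rate could be reported per tier instead of as a single pooled number. It is not the benchmark's actual evaluation code: the function names, the (tier, reference, hypothesis) sample format, and the English placeholder transcripts are assumptions made for illustration, and real evaluations usually apply text normalization before scoring.

```python
from collections import defaultdict


def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance computed over word tokens."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer_by_tier(samples):
    """Return {tier: WER} for an iterable of (tier, reference, hypothesis).

    WER per tier is total word errors divided by total reference words,
    so long utterances weigh more than short ones within a tier.
    """
    errors = defaultdict(int)
    ref_lengths = defaultdict(int)
    for tier, reference, hypothesis in samples:
        ref_words = reference.split()
        hyp_words = hypothesis.split()
        errors[tier] += word_edit_distance(ref_words, hyp_words)
        ref_lengths[tier] += len(ref_words)
    return {
        tier: errors[tier] / ref_lengths[tier]
        for tier in sorted(errors)
        if ref_lengths[tier] > 0
    }


# Hypothetical placeholder transcripts, not benchmark data.
samples = [
    ("A", "the court is in session", "the court is in session"),
    ("C", "please state your name", "please state name"),
    ("D", "the hearing is adjourned", "a hearing adjourned today"),
]

for tier, wer in wer_by_tier(samples).items():
    print(f"Tier {tier}: WER = {wer:.2f}")
```

Reporting one number per tier is what exposes studio-bias: a model can look strong on a pooled average dominated by clean audio while its Tier C and Tier D scores are far worse.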

Adalat AI's research revealed that standard fine-tuning approaches are not always optimal. Surprisingly, the conventional "easy-to-hard" curriculum learning strategy, in which models are trained on simpler examples first, offered no benefit and often degraded performance. Instead, the study found that fine-tuning Whisper with a high learning rate of 2e-4 was the single most important factor in achieving significant gains. Models fine-tuned with this one adjustment consistently outperformed existing public Hindi and Malayalam ASR models across all acoustic conditions covered by the benchmark.

For Malayalam, training on harder conditions first, a reversal of the standard curriculum, did yield further improvements on spontaneous and noisy speech. For Hindi, however, the high learning rate alone was sufficient to achieve state-of-the-art results without this more complex training order. This suggests that the optimal fine-tuning strategy can be language-dependent, while the high learning rate improved robustness in both languages studied. With this recipe, a 244M-parameter Whisper model surpasses publicly available models up to six times its size in overall Word Error Rate (WER), all without architectural changes or proprietary data.
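The difference between the two training orders can be sketched in a few lines of Python. This is only an illustration of the ordering idea, not the released recipe: the article does not say how many stages the reversed curriculum uses, whether each stage maps to exactly one tier, or whether the learning rate is held constant across stages, so the `Stage` type, the `build_schedule` helper, and the one-stage-per-tier layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One fine-tuning stage (hypothetical structure for illustration)."""
    tier: str
    learning_rate: float


# Assumed difficulty order: A (studio) is easiest, D (noisy) is hardest.
TIERS_EASY_TO_HARD = ["A", "B", "C", "D"]


def build_schedule(order="reverse", learning_rate=2e-4):
    """Return the list of stages for a given curriculum order.

    order="easy_to_hard" is the conventional curriculum;
    order="reverse" trains on the hardest tier first.
    """
    if order == "easy_to_hard":
        tiers = list(TIERS_EASY_TO_HARD)
    elif order == "reverse":
        tiers = list(reversed(TIERS_EASY_TO_HARD))
    else:
        raise ValueError(f"unknown order: {order!r}")
    return [Stage(tier=t, learning_rate=learning_rate) for t in tiers]


for stage in build_schedule("reverse"):
    print(f"train on Tier {stage.tier} at lr={stage.learning_rate}")
```

Under these assumptions, the Hindi result corresponds to dropping the staging entirely and keeping only the 2e-4 learning rate, while the Malayalam result corresponds to the reversed order.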

The researchers are releasing both the Vividh-ASR benchmark and fine-tuned Whisper models. The benchmark includes approximately 26 hours of data for Malayalam and 36 hours for Hindi, meticulously curated to represent the diverse acoustic challenges. Additionally, Adalat AI is making available Whisper Small and Medium models for both languages, including variants fine-tuned with a high learning rate and a Reverse Multi-Stage Fine-Tuning (R-MFT) approach. These resources aim to empower developers and researchers working on ASR for Indic languages to build more resilient and accurate applications that perform reliably in real-world conditions.

Summary

The Vividh-ASR benchmark addresses "studio-bias" in ASR models for Indic languages by evaluating performance across four acoustic complexity tiers.
A key finding is that fine-tuning Whisper models with a high learning rate (2e-4) is the most effective strategy for improving robustness.
Adalat AI is releasing the benchmark and fine-tuned Whisper models, including small and medium variants, to improve accuracy on spontaneous speech.
These efforts enable smaller models to outperform significantly larger, publicly available models, reducing studio-bias for languages like Hindi and Malayalam.

Source: Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages
