# Chilean Spanish ASR Leaderboard Configuration
# Your leaderboard name
TITLE = """<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard - Chilean Spanish </h1> </body> </html>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
on Chilean Spanish speech data from the Hugging Face Hub. \
\nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower is better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher is better). Models are ranked based on their Average WER, from lowest to highest. \
\nThis leaderboard focuses specifically on the **Chilean Spanish dialect**, evaluated on three datasets: Common Voice (Chilean Spanish), Google Chilean Spanish, and Datarisas. \
\n🙏 **Special thanks to [Modal](https://modal.com/) for providing compute credits that made this evaluation possible!**"""
# About section content
ABOUT_TEXT = """
## About This Leaderboard
This repository is a **streamlined, task-specific version** of the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) evaluation framework, specifically adapted for benchmarking ASR models on the **Chilean Spanish dialect**.
### What is the Open ASR Leaderboard?
The [Open ASR Leaderboard](https://github.com/huggingface/open_asr_leaderboard) is a comprehensive benchmarking framework developed by Hugging Face, NVIDIA NeMo, and the community to evaluate ASR models across multiple English datasets (LibriSpeech, AMI, VoxPopuli, Earnings-22, GigaSpeech, SPGISpeech, TED-LIUM). It supports various ASR frameworks including Transformers, NeMo, SpeechBrain, and more, providing standardized WER and RTFx metrics.
### How This Repository Differs
This Chilean Spanish adaptation makes the following key modifications to focus exclusively on Chilean Spanish ASR evaluation:
| Aspect | Original Open ASR Leaderboard | This Repository |
|--------|-------------------------------|-----------------|
| **Target Language** | English (primarily) | Chilean Spanish |
| **Dataset** | 7 English datasets (LibriSpeech, AMI, etc.) | 3 Chilean Spanish datasets (Common Voice, Google Chilean Spanish, Datarisas) |
| **Text Normalization** | English text normalizer | **Multilingual normalizer** preserving Spanish accents (á, é, í, ó, ú, ñ) |
| **Model Focus** | Broad coverage (~50+ models) | **11 selected models** optimized for multilingual/Spanish ASR |
| **Execution** | Local GPU execution | **Cloud-based** parallel execution via Modal |
---
## Models Evaluated
This repository evaluates **11 state-of-the-art ASR models** selected for their multilingual or Spanish language support:
| Model | Type | Framework | Parameters | Notes |
|-------|------|-----------|------------|-------|
| **openai/whisper-large-v3** | Multilingual | Transformers | 1.5B | OpenAI's flagship ASR model |
| **openai/whisper-large-v3-turbo** | Multilingual | Transformers | 809M | Faster Whisper variant |
| **surus-lat/whisper-large-v3-turbo-latam** | Multilingual | Transformers | 809M | Fine-tuned for Latin American Spanish |
| **openai/whisper-small** | Multilingual | Transformers | 244M | Reference baseline model |
| **rcastrovexler/whisper-small-es-cl** | Chilean Spanish | Transformers | 244M | The only Chilean Spanish fine-tune we found |
| **nvidia/canary-1b-v2** | Multilingual | NeMo | 1B | NVIDIA's multilingual ASR |
| **nvidia/parakeet-tdt-0.6b-v3** | Multilingual | NeMo | 0.6B | Lightweight, fast inference |
| **microsoft/Phi-4-multimodal-instruct** | Multimodal | Phi | 5.6B | Microsoft's multimodal LLM with audio |
| **mistralai/Voxtral-Mini-3B-2507** | Speech-to-text | Transformers | 3B | Mistral's ASR model |
| **elevenlabs/scribe_v1** | API-based | API | N/A | ElevenLabs' commercial ASR API |
| **facebookresearch/omniASR_LLM_7B** | Multilingual | OmniLingual ASR | 7B | FAIR's OmniLingual ASR with spa_Latn target |
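For reference, any Transformers-backed row above can be run with the standard ASR pipeline. A minimal sketch (the audio path is a placeholder; `generate_kwargs` applies to Whisper-style models):
```python
from transformers import pipeline

# any Transformers-backed model above works here; whisper-small is the lightest
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

result = asr(
    "sample_chilean_spanish.wav",  # placeholder path to a local audio file
    generate_kwargs={"language": "spanish", "task": "transcribe"},
)
print(result["text"])
```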
## Dataset
This evaluation uses a single test dataset that combines three sources of Chilean Spanish speech:
### [`astroza/es-cl-asr-test-only`](https://huggingface.co/datasets/astroza/es-cl-asr-test-only)
It aggregates three distinct Chilean Spanish speech datasets, covering different domains and speaking styles:
1. **Common Voice (Chilean Spanish filtered)**: Community-contributed recordings specifically filtered for Chilean Spanish dialects.
2. **Google Chilean Spanish** ([`ylacombe/google-chilean-spanish`](https://huggingface.co/datasets/ylacombe/google-chilean-spanish)):
- 7 hours of transcribed high-quality audio of Chilean Spanish sentences.
- Recorded by 31 volunteers.
- Intended for speech technologies.
- Restructured from the original OpenSLR archives for easier streaming.
3. **Datarisas** ([`astroza/chilean-jokes-festival-de-vina`](https://huggingface.co/datasets/astroza/chilean-jokes-festival-de-vina)):
- Audio fragments from comedy routines at the Festival de Viña del Mar.
- Represents spontaneous, colloquial Chilean Spanish.
- Captures humor and cultural expressions specific to Chile.
**Combined Dataset Properties:**
- **Language**: Spanish (Chilean variant)
- **Split**: `test`
- **Domain**: Mixed (formal recordings, volunteer speech, comedy performances)
- **Total Coverage**: Multiple speaking styles and contexts of Chilean Spanish
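Loading the combined test split follows the usual `datasets` pattern; a minimal sketch, assuming the standard audio/transcription column layout:
```python
from datasets import load_dataset

# single `test` split combining Common Voice (CL), Google Chilean Spanish and Datarisas
ds = load_dataset("astroza/es-cl-asr-test-only", split="test")
print(ds)     # column names and row count
print(ds[0])  # first example (audio + reference transcript)
```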
## Metrics
Following the Open ASR Leaderboard standard, we report:
- **WER (Word Error Rate)**: ⬇️ Lower is better - Measures transcription accuracy
- **RTFx (Inverse Real-Time Factor)**: ⬆️ Higher is better - Measures inference speed (audio_duration / transcription_time)
### Word Error Rate (WER)
Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It counts the errors in the system's output (substitutions, insertions, and deletions) as a proportion
of the number of words in the reference (correct) transcript. **A lower WER value indicates higher accuracy**.
Take the following example:
| Reference:  | el | gato | se | sentó | en | la | alfombra |
|-------------|----|------|----|-------|----|----|----------|
| Prediction: | el | gato | se | sentó | en | la |          |
| Label:      | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | D |
Here, we have:
* 0 substitutions
* 0 insertions
* 1 deletion ("alfombra" is missing)
This gives 1 error in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our
reference (N), which for this example is 7:
```
WER = (S + I + D) / N = (0 + 0 + 1) / 7 = 0.143
```
This gives a WER of 0.14, or 14%. For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing are removed from the references and predictions.
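The same figure can be reproduced with the Hugging Face [`evaluate`](https://huggingface.co/spaces/evaluate-metric/wer) WER metric linked above; the strings below are just the example from the table:
```python
import evaluate  # pip install evaluate jiwer

wer_metric = evaluate.load("wer")

references = ["el gato se sentó en la alfombra"]
predictions = ["el gato se sentó en la"]

# (S + I + D) / N = (0 + 0 + 1) / 7
print(wer_metric.compute(predictions=predictions, references=references))
# 0.14285714285714285
```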
### Inverse Real Time Factor (RTFx)
Inverse Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a
model to process a given amount of speech. It is defined as:
```
RTFx = (number of seconds of audio inferred) / (compute time in seconds)
```
Therefore, an RTFx of 1 means a system processes speech exactly as fast as it is spoken, while an RTFx of 2 means it processes speech twice as fast as real time.
Thus, **a higher RTFx value indicates lower latency**.
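In code, RTFx is simply a ratio of durations. A minimal sketch of the measurement (illustrative only, not the leaderboard's exact timing harness; `transcribe_fn` and `audio` are placeholders):
```python
import time

def measure_rtfx(transcribe_fn, audio, audio_duration_s):
    # transcribe_fn and audio stand in for whatever model
    # interface is being benchmarked
    start = time.perf_counter()
    transcribe_fn(audio)
    compute_time_s = time.perf_counter() - start
    return audio_duration_s / compute_time_s  # > 1 means faster than real time
```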
## Text Normalization for Spanish
This repository uses a **multilingual normalizer** configured to preserve Spanish characters:
```python
normalizer = BasicMultilingualTextNormalizer(remove_diacritics=False)
```
**What it does:**
- ✅ Preserves: `á, é, í, ó, ú, ñ, ü`
- ✅ Removes: bracketed annotations `[...]`, parenthesized asides `(...)`, punctuation and special symbols (including `¿`, `¡`, per the example below)
- ✅ Normalizes: whitespace, capitalization (converts to lowercase)
- ❌ Does NOT remove: accents or Spanish-specific characters
**Example:**
```
Input:  "¿Cómo estás? [ruido] (suspiro)"
Output: "cómo estás"
```
This is critical for Spanish evaluation, as diacritics change word meaning:
- `esta` (this) vs. `está` (is)
- `si` (if) vs. `sí` (yes)
- `el` (the) vs. `él` (he)
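For illustration, the behavior above can be approximated in a few lines; this is a minimal stand-in, not the actual `BasicMultilingualTextNormalizer` implementation used by the repo:
```python
import unicodedata

def normalize_es(text, remove_diacritics=False):
    # lowercase, drop [bracketed] and (parenthesized) spans,
    # replace punctuation with spaces while keeping accented letters
    out, depth = [], 0
    for ch in text.lower():
        if ch in "[(":
            depth += 1
        elif ch in "])":
            depth = max(depth - 1, 0)
        elif depth == 0:
            out.append(ch if ch.isalnum() or ch.isspace() else " ")
    result = " ".join("".join(out).split())
    if remove_diacritics:
        decomposed = unicodedata.normalize("NFD", result)
        result = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return result

print(normalize_es("¿Cómo estás? [ruido] (suspiro)"))  # -> "cómo estás"
```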
## How to reproduce our results
The ASR Leaderboard evaluation was conducted using [Modal](https://modal.com) for cloud-based distributed GPU evaluation.
For full details, head over to our repo: https://github.com/aastroza/open_asr_leaderboard_cl
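As a rough sketch of the pattern (function names, image contents and GPU type here are illustrative, not the repo's actual configuration), a Modal evaluation job looks like:
```python
import modal

app = modal.App("asr-eval-cl")
image = modal.Image.debian_slim().pip_install(
    "transformers", "datasets", "evaluate", "jiwer", "torch"
)

@app.function(gpu="A10G", image=image, timeout=3600)
def evaluate_model(model_id: str) -> dict:
    # placeholder body: load the model and the test split, transcribe,
    # then compute WER and RTFx as described above
    return {"model": model_id}

@app.local_entrypoint()
def main():
    models = ["openai/whisper-large-v3", "openai/whisper-large-v3-turbo"]
    # .map() fans the evaluations out across parallel cloud containers
    for result in evaluate_model.map(models):
        print(result)
```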
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{chilean-open-asr-leaderboard,
title={Open Automatic Speech Recognition Leaderboard - Chilean Spanish},
author={Instituto de Data Science UDD},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/spaces/idsudd/open_asr_leaderboard_cl}}
}
@misc{astroza2025chilean-dataset,
title={Chilean Spanish ASR Test Dataset},
author={Alonso Astroza},
year={2025},
howpublished={\url{https://huggingface.co/datasets/astroza/es-cl-asr-test-only}}
}
@misc{open-asr-leaderboard,
title={Open Automatic Speech Recognition Leaderboard},
author={Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and Hugging Face Team and Nvidia NeMo Team},
year={2023},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}}
}
"""