Spaces:
Running
Running
| # Chilean Spanish ASR Leaderboard Configuration | |
| # Your leaderboard name | |
| TITLE = """<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard - Chilean Spanish </h1> </body> </html>""" | |
| # What does your leaderboard evaluate? | |
| INTRODUCTION_TEXT = """📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \ | |
| on Chilean Spanish speech data from the Hugging Face Hub. \ | |
| \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower is better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher is better). Models are ranked based on their Average WER, from lowest to highest. \ | |
| \nThis leaderboard focuses specifically on **Chilean Spanish dialect** evaluation using three datasets: Common Voice (Chilean Spanish), Google Chilean Spanish, and Datarisas. | |
| 🙏 **Special thanks to [Modal](https://modal.com/) for providing compute credits that made this evaluation possible!**""" | |
| # About section content | |
| ABOUT_TEXT = """ | |
| ## About This Leaderboard | |
| This repository is a **streamlined, task-specific version** of the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) evaluation framework, specifically adapted for benchmarking ASR models on the **Chilean Spanish dialect**. | |
| ### What is the Open ASR Leaderboard? | |
| The [Open ASR Leaderboard](https://github.com/huggingface/open_asr_leaderboard) is a comprehensive benchmarking framework developed by Hugging Face, NVIDIA NeMo, and the community to evaluate ASR models across multiple English datasets (LibriSpeech, AMI, VoxPopuli, Earnings-22, GigaSpeech, SPGISpeech, TED-LIUM). It supports various ASR frameworks including Transformers, NeMo, SpeechBrain, and more, providing standardized WER and RTFx metrics. | |
| ### How This Repository Differs | |
| This Chilean Spanish adaptation makes the following key modifications to focus exclusively on Chilean Spanish ASR evaluation: | |
| | Aspect | Original Open ASR Leaderboard | This Repository | | |
| |--------|-------------------------------|-----------------| | |
| | **Target Language** | English (primarily) | Chilean Spanish | | |
| | **Dataset** | 7 English datasets (LibriSpeech, AMI, etc.) | 3 Chilean Spanish datasets (Common Voice, Google Chilean Spanish, Datarisas) | | |
| | **Text Normalization** | English text normalizer | **Multilingual normalizer** preserving Spanish accents (á, é, í, ó, ú, ñ) | | |
| | **Model Focus** | Broad coverage (~50+ models) | **10 selected models** optimized for multilingual/Spanish ASR | | |
| | **Execution** | Local GPU execution | **Cloud-based** parallel execution via Modal | | |
| --- | |
| ## Models Evaluated | |
| This repository evaluates **11 state-of-the-art ASR models** selected for their multilingual or Spanish language support: | |
| | Model | Type | Framework | Parameters | Notes | | |
| |-------|------|-----------|------------|-------| | |
| | **openai/whisper-large-v3** | Multilingual | Transformers | 1.5B | OpenAI's flagship ASR model | | |
| | **openai/whisper-large-v3-turbo** | Multilingual | Transformers | 809M | Faster Whisper variant | | |
| | **surus-lat/whisper-large-v3-turbo-latam** | Multilingual | Transformers | 809M | Fine-tuned model for Latam Spanish | | |
| | **openai/whisper-small** | Multilingual | Transformers | 244M | Reference baseline model | | |
| | **rcastrovexler/whisper-small-es-cl** | Chilean Spanish | Transformers | 244M | Only fine-tuned model found for Chilean Spanish | | |
| | **nvidia/canary-1b-v2** | Multilingual | NeMo | 1B | NVIDIA's multilingual ASR | | |
| | **nvidia/parakeet-tdt-0.6b-v3** | Multilingual | NeMo | 0.6B | Lightweight, fast inference | | |
| | **microsoft/Phi-4-multimodal-instruct** | Multimodal | Phi | 14B | Microsoft's multimodal LLM with audio | | |
| | **mistralai/Voxtral-Mini-3B-2507** | Speech-to-text | Transformers | 3B | Mistral's ASR model | | |
| | **elevenlabs/scribe_v1** | API-based | API | N/A | ElevenLabs' commercial ASR API | | |
| | **facebookresearch/omniASR_LLM_7B** | Multilingual | OmniLingual ASR | 7B | FAIR's OmniLingual ASR with spa_Latn target | | |
| ## Dataset | |
| This evaluation uses a comprehensive Chilean Spanish test dataset that combines three different sources of Chilean Spanish speech data: | |
| ### [`astroza/es-cl-asr-test-only`](https://huggingface.co/datasets/astroza/es-cl-asr-test-only) | |
| This dataset aggregates three distinct Chilean Spanish speech datasets to provide comprehensive coverage of different domains and speaking styles: | |
| 1. **Common Voice (Chilean Spanish filtered)**: Community-contributed recordings specifically filtered for Chilean Spanish dialects. | |
| 2. **Google Chilean Spanish** ([`ylacombe/google-chilean-spanish`](https://huggingface.co/datasets/ylacombe/google-chilean-spanish)): | |
| - 7 hours of transcribed high-quality audio of Chilean Spanish sentences. | |
| - Recorded by 31 volunteers. | |
| - Intended for speech technologies. | |
| - Restructured from original OpenSLR archives for easier streaming | |
| 3. **Datarisas** ([`astroza/chilean-jokes-festival-de-vina`](https://huggingface.co/datasets/astroza/chilean-jokes-festival-de-vina)): | |
| - Audio fragments from comedy routines at the Festival de Viña del Mar. | |
| - Represents spontaneous, colloquial Chilean Spanish. | |
| - Captures humor and cultural expressions specific to Chile. | |
| **Combined Dataset Properties:** | |
| - **Language**: Spanish (Chilean variant) | |
| - **Split**: `test` | |
| - **Domain**: Mixed (formal recordings, volunteer speech, comedy performances) | |
| - **Total Coverage**: Multiple speaking styles and contexts of Chilean Spanish | |
| ## Metrics | |
| Following the Open ASR Leaderboard standard, we report: | |
| - **WER (Word Error Rate)**: ⬇️ Lower is better - Measures transcription accuracy | |
| - **RTFx (Real-Time Factor)**: ⬆️ Higher is better - Measures inference speed (audio_duration / transcription_time) | |
| ### Word Error Rate (WER) | |
| Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It calculates the percentage | |
| of words in the system's output that differ from the reference (correct) transcript. **A lower WER value indicates higher accuracy**. | |
| Take the following example: | |
| | Reference: | el | gato | se | sentó | en | la | alfombra | | |
| |-------------|-----|-----|---------|-----|-----|-----| -----| | |
| | Prediction: | el | gato | **se** | sentó | en | la | | | | |
| | Label: | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | D | | |
| Here, we have: | |
| * 0 substitutions | |
| * 0 insertions | |
| * 1 deletion ("alfombra" is missing) | |
| This gives 1 error in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our | |
| reference (N), which for this example is 7: | |
| ``` | |
| WER = (S + I + D) / N = (0 + 0 + 1) / 7 = 0.143 | |
| ``` | |
| Giving a WER of 0.14, or 14%. For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing is removed from the references and predictions. | |
| ### Inverse Real Time Factor (RTFx) | |
| Inverse Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes an | |
| model to process a given amount of speech. It is defined as: | |
| ``` | |
| RTFx = (number of seconds of audio inferred) / (compute time in seconds) | |
| ``` | |
| Therefore, and RTFx of 1 means a system processes speech as fast as it's spoken, while an RTFx of 2 means it takes half the time. | |
| Thus, **a higher RTFx value indicates lower latency**. | |
| ## Text Normalization for Spanish | |
| This repository uses a **multilingual normalizer** configured to preserve Spanish characters: | |
| ```python | |
| normalizer = BasicMultilingualTextNormalizer(remove_diacritics=False) | |
| ``` | |
| **What it does:** | |
| - ✅ Preserves: `á, é, í, ó, ú, ñ, ü, ¿, ¡` | |
| - ✅ Removes: Brackets `[...]`, parentheses `(...)`, special symbols | |
| - ✅ Normalizes: Whitespace, capitalization (converts to lowercase) | |
| - ❌ Does NOT remove: Accents or Spanish-specific characters | |
| **Example:** | |
| ```python | |
| Input: "¿Cómo estás? [ruido] (suspiro)" | |
| Output: "cómo estás" | |
| ``` | |
| This is critical for Spanish evaluation, as diacritics change word meaning: | |
| - `esta` (this) vs. `está` (is) | |
| - `si` (if) vs. `sí` (yes) | |
| - `el` (the) vs. `él` (he) | |
| ## How to reproduce our results | |
| The ASR Leaderboard evaluation was conducted using [Modal](https://modal.com) for cloud-based distributed GPU evaluation. | |
| For more details head over to our repo at: https://github.com/aastroza/open_asr_leaderboard_cl | |
| """ | |
| CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results" | |
| CITATION_BUTTON_TEXT = r"""@misc{chilean-open-asr-leaderboard, | |
| title={Open Automatic Speech Recognition Leaderboard - Chilean Spanish}, | |
| author={Instituto de Data Science UDD}, | |
| year={2025}, | |
| publisher={Hugging Face}, | |
| howpublished={\url{https://huggingface.co/spaces/idsudd/open_asr_leaderboard_cl}} | |
| } | |
| @misc{astroza2025chilean-dataset, | |
| title={Chilean Spanish ASR Test Dataset}, | |
| author={Alonso Astroza}, | |
| year={2025}, | |
| howpublished={\url{https://huggingface.co/datasets/astroza/es-cl-asr-test-only}} | |
| } | |
| @misc{open-asr-leaderboard, | |
| title={Open Automatic Speech Recognition Leaderboard}, | |
| author={Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and Hugging Face Team and Nvidia NeMo Team}, | |
| year={2023}, | |
| publisher={Hugging Face}, | |
| howpublished={\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}} | |
| } | |
| """ | |