# Chilean Spanish ASR Leaderboard Configuration # Your leaderboard name TITLE = """

🤗 Open Automatic Speech Recognition Leaderboard - Chilean Spanish

""" # What does your leaderboard evaluate? INTRODUCTION_TEXT = """📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \ on Chilean Spanish speech data from the Hugging Face Hub. \ \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower is better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher is better). Models are ranked based on their Average WER, from lowest to highest. \ \nThis leaderboard focuses specifically on **Chilean Spanish dialect** evaluation using three datasets: Common Voice (Chilean Spanish), Google Chilean Spanish, and Datarisas. 🙏 **Special thanks to [Modal](https://modal.com/) for providing compute credits that made this evaluation possible!**""" # About section content ABOUT_TEXT = """ ## About This Leaderboard This repository is a **streamlined, task-specific version** of the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) evaluation framework, specifically adapted for benchmarking ASR models on the **Chilean Spanish dialect**. ### What is the Open ASR Leaderboard? The [Open ASR Leaderboard](https://github.com/huggingface/open_asr_leaderboard) is a comprehensive benchmarking framework developed by Hugging Face, NVIDIA NeMo, and the community to evaluate ASR models across multiple English datasets (LibriSpeech, AMI, VoxPopuli, Earnings-22, GigaSpeech, SPGISpeech, TED-LIUM). It supports various ASR frameworks including Transformers, NeMo, SpeechBrain, and more, providing standardized WER and RTFx metrics. ### How This Repository Differs This Chilean Spanish adaptation makes the following key modifications to focus exclusively on Chilean Spanish ASR evaluation: | Aspect | Original Open ASR Leaderboard | This Repository | |--------|-------------------------------|-----------------| | **Target Language** | English (primarily) | Chilean Spanish | | **Dataset** | 7 English datasets (LibriSpeech, AMI, etc.) | 3 Chilean Spanish datasets (Common Voice, Google Chilean Spanish, Datarisas) | | **Text Normalization** | English text normalizer | **Multilingual normalizer** preserving Spanish accents (á, é, í, ó, ú, ñ) | | **Model Focus** | Broad coverage (~50+ models) | **10 selected models** optimized for multilingual/Spanish ASR | | **Execution** | Local GPU execution | **Cloud-based** parallel execution via Modal | --- ## Models Evaluated This repository evaluates **11 state-of-the-art ASR models** selected for their multilingual or Spanish language support: | Model | Type | Framework | Parameters | Notes | |-------|------|-----------|------------|-------| | **openai/whisper-large-v3** | Multilingual | Transformers | 1.5B | OpenAI's flagship ASR model | | **openai/whisper-large-v3-turbo** | Multilingual | Transformers | 809M | Faster Whisper variant | | **surus-lat/whisper-large-v3-turbo-latam** | Multilingual | Transformers | 809M | Fine-tuned model for Latam Spanish | | **openai/whisper-small** | Multilingual | Transformers | 244M | Reference baseline model | | **rcastrovexler/whisper-small-es-cl** | Chilean Spanish | Transformers | 244M | Only fine-tuned model found for Chilean Spanish | | **nvidia/canary-1b-v2** | Multilingual | NeMo | 1B | NVIDIA's multilingual ASR | | **nvidia/parakeet-tdt-0.6b-v3** | Multilingual | NeMo | 0.6B | Lightweight, fast inference | | **microsoft/Phi-4-multimodal-instruct** | Multimodal | Phi | 14B | Microsoft's multimodal LLM with audio | | **mistralai/Voxtral-Mini-3B-2507** | Speech-to-text | Transformers | 3B | Mistral's ASR model | | **elevenlabs/scribe_v1** | API-based | API | N/A | ElevenLabs' commercial ASR API | | **facebookresearch/omniASR_LLM_7B** | Multilingual | OmniLingual ASR | 7B | FAIR's OmniLingual ASR with spa_Latn target | ## Dataset This evaluation uses a comprehensive Chilean Spanish test dataset that combines three different sources of Chilean Spanish speech data: ### [`astroza/es-cl-asr-test-only`](https://huggingface.co/datasets/astroza/es-cl-asr-test-only) This dataset aggregates three distinct Chilean Spanish speech datasets to provide comprehensive coverage of different domains and speaking styles: 1. **Common Voice (Chilean Spanish filtered)**: Community-contributed recordings specifically filtered for Chilean Spanish dialects. 2. **Google Chilean Spanish** ([`ylacombe/google-chilean-spanish`](https://huggingface.co/datasets/ylacombe/google-chilean-spanish)): - 7 hours of transcribed high-quality audio of Chilean Spanish sentences. - Recorded by 31 volunteers. - Intended for speech technologies. - Restructured from original OpenSLR archives for easier streaming 3. **Datarisas** ([`astroza/chilean-jokes-festival-de-vina`](https://huggingface.co/datasets/astroza/chilean-jokes-festival-de-vina)): - Audio fragments from comedy routines at the Festival de Viña del Mar. - Represents spontaneous, colloquial Chilean Spanish. - Captures humor and cultural expressions specific to Chile. **Combined Dataset Properties:** - **Language**: Spanish (Chilean variant) - **Split**: `test` - **Domain**: Mixed (formal recordings, volunteer speech, comedy performances) - **Total Coverage**: Multiple speaking styles and contexts of Chilean Spanish ## Metrics Following the Open ASR Leaderboard standard, we report: - **WER (Word Error Rate)**: ⬇️ Lower is better - Measures transcription accuracy - **RTFx (Real-Time Factor)**: ⬆️ Higher is better - Measures inference speed (audio_duration / transcription_time) ### Word Error Rate (WER) Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It calculates the percentage of words in the system's output that differ from the reference (correct) transcript. **A lower WER value indicates higher accuracy**. Take the following example: | Reference: | el | gato | se | sentó | en | la | alfombra | |-------------|-----|-----|---------|-----|-----|-----| -----| | Prediction: | el | gato | **se** | sentó | en | la | | | | Label: | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | D | Here, we have: * 0 substitutions * 0 insertions * 1 deletion ("alfombra" is missing) This gives 1 error in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our reference (N), which for this example is 7: ``` WER = (S + I + D) / N = (0 + 0 + 1) / 7 = 0.143 ``` Giving a WER of 0.14, or 14%. For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing is removed from the references and predictions. ### Inverse Real Time Factor (RTFx) Inverse Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes an model to process a given amount of speech. It is defined as: ``` RTFx = (number of seconds of audio inferred) / (compute time in seconds) ``` Therefore, and RTFx of 1 means a system processes speech as fast as it's spoken, while an RTFx of 2 means it takes half the time. Thus, **a higher RTFx value indicates lower latency**. ## Text Normalization for Spanish This repository uses a **multilingual normalizer** configured to preserve Spanish characters: ```python normalizer = BasicMultilingualTextNormalizer(remove_diacritics=False) ``` **What it does:** - ✅ Preserves: `á, é, í, ó, ú, ñ, ü, ¿, ¡` - ✅ Removes: Brackets `[...]`, parentheses `(...)`, special symbols - ✅ Normalizes: Whitespace, capitalization (converts to lowercase) - ❌ Does NOT remove: Accents or Spanish-specific characters **Example:** ```python Input: "¿Cómo estás? [ruido] (suspiro)" Output: "cómo estás" ``` This is critical for Spanish evaluation, as diacritics change word meaning: - `esta` (this) vs. `está` (is) - `si` (if) vs. `sí` (yes) - `el` (the) vs. `él` (he) ## How to reproduce our results The ASR Leaderboard evaluation was conducted using [Modal](https://modal.com) for cloud-based distributed GPU evaluation. For more details head over to our repo at: https://github.com/aastroza/open_asr_leaderboard_cl """ CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results" CITATION_BUTTON_TEXT = r"""@misc{chilean-open-asr-leaderboard, title={Open Automatic Speech Recognition Leaderboard - Chilean Spanish}, author={Instituto de Data Science UDD}, year={2025}, publisher={Hugging Face}, howpublished={\url{https://huggingface.co/spaces/idsudd/open_asr_leaderboard_cl}} } @misc{astroza2025chilean-dataset, title={Chilean Spanish ASR Test Dataset}, author={Alonso Astroza}, year={2025}, howpublished={\url{https://huggingface.co/datasets/astroza/es-cl-asr-test-only}} } @misc{open-asr-leaderboard, title={Open Automatic Speech Recognition Leaderboard}, author={Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and Hugging Face Team and Nvidia NeMo Team}, year={2023}, publisher={Hugging Face}, howpublished={\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}} } """