# Chilean Spanish ASR Leaderboard Configuration
# Your leaderboard name
TITLE = """<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard - Chilean Spanish </h1> </body> </html>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
on Chilean Spanish speech data from the Hugging Face Hub. \
\nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower is better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher is better). Models are ranked based on their Average WER, from lowest to highest. \
\nThis leaderboard focuses specifically on the **Chilean Spanish dialect**, evaluated on three datasets: Common Voice (Chilean Spanish), Google Chilean Spanish, and Datarisas. \
\n🙏 **Special thanks to [Modal](https://modal.com/) for providing compute credits that made this evaluation possible!**"""
# About section content
ABOUT_TEXT = """
## About This Leaderboard
This repository is a **streamlined, task-specific version** of the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) evaluation framework, specifically adapted for benchmarking ASR models on the **Chilean Spanish dialect**.
### What is the Open ASR Leaderboard?
The [Open ASR Leaderboard](https://github.com/huggingface/open_asr_leaderboard) is a comprehensive benchmarking framework developed by Hugging Face, NVIDIA NeMo, and the community to evaluate ASR models across multiple English datasets (LibriSpeech, AMI, VoxPopuli, Earnings-22, GigaSpeech, SPGISpeech, TED-LIUM). It supports various ASR frameworks including Transformers, NeMo, SpeechBrain, and more, providing standardized WER and RTFx metrics.
### How This Repository Differs
This Chilean Spanish adaptation makes the following key modifications to focus exclusively on Chilean Spanish ASR evaluation:
| Aspect | Original Open ASR Leaderboard | This Repository |
|--------|-------------------------------|-----------------|
| **Target Language** | English (primarily) | Chilean Spanish |
| **Dataset** | 7 English datasets (LibriSpeech, AMI, etc.) | 3 Chilean Spanish datasets (Common Voice, Google Chilean Spanish, Datarisas) |
| **Text Normalization** | English text normalizer | **Multilingual normalizer** preserving Spanish accents (á, é, í, ó, ú, ñ) |
| **Model Focus** | Broad coverage (~50+ models) | **11 selected models** optimized for multilingual/Spanish ASR |
| **Execution** | Local GPU execution | **Cloud-based** parallel execution via Modal |
---
## Models Evaluated
This repository evaluates **11 state-of-the-art ASR models** selected for their multilingual or Spanish language support:
| Model | Type | Framework | Parameters | Notes |
|-------|------|-----------|------------|-------|
| **openai/whisper-large-v3** | Multilingual | Transformers | 1.5B | OpenAI's flagship ASR model |
| **openai/whisper-large-v3-turbo** | Multilingual | Transformers | 809M | Faster Whisper variant |
| **surus-lat/whisper-large-v3-turbo-latam** | Multilingual | Transformers | 809M | Fine-tuned for Latin American Spanish |
| **openai/whisper-small** | Multilingual | Transformers | 244M | Reference baseline model |
| **rcastrovexler/whisper-small-es-cl** | Chilean Spanish | Transformers | 244M | The only Chilean Spanish fine-tune we found |
| **nvidia/canary-1b-v2** | Multilingual | NeMo | 1B | NVIDIA's multilingual ASR |
| **nvidia/parakeet-tdt-0.6b-v3** | Multilingual | NeMo | 0.6B | Lightweight, fast inference |
| **microsoft/Phi-4-multimodal-instruct** | Multimodal | Phi | 5.6B | Microsoft's multimodal LLM with audio |
| **mistralai/Voxtral-Mini-3B-2507** | Speech-to-text | Transformers | 3B | Mistral's ASR model |
| **elevenlabs/scribe_v1** | API-based | API | N/A | ElevenLabs' commercial ASR API |
| **facebookresearch/omniASR_LLM_7B** | Multilingual | OmniLingual ASR | 7B | FAIR's OmniLingual ASR with spa_Latn target |
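For reference, any Transformers-backed row above can be run with the standard ASR pipeline. A minimal sketch (the audio path is a placeholder; `generate_kwargs` applies to Whisper-style models):
```python
from transformers import pipeline

# any Transformers-backed model above works here; whisper-small is the lightest
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

result = asr(
    "sample_chilean_spanish.wav",  # placeholder path to a local audio file
    generate_kwargs={"language": "spanish", "task": "transcribe"},
)
print(result["text"])
```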
## Dataset
This evaluation uses a single test dataset that combines three sources of Chilean Spanish speech:
### [`astroza/es-cl-asr-test-only`](https://huggingface.co/datasets/astroza/es-cl-asr-test-only)
It aggregates three distinct Chilean Spanish speech datasets, covering different domains and speaking styles:
1. **Common Voice (Chilean Spanish filtered)**: Community-contributed recordings specifically filtered for Chilean Spanish dialects.
2. **Google Chilean Spanish** ([`ylacombe/google-chilean-spanish`](https://huggingface.co/datasets/ylacombe/google-chilean-spanish)):
- 7 hours of transcribed high-quality audio of Chilean Spanish sentences.
- Recorded by 31 volunteers.
- Intended for speech technologies.
- Restructured from the original OpenSLR archives for easier streaming.
3. **Datarisas** ([`astroza/chilean-jokes-festival-de-vina`](https://huggingface.co/datasets/astroza/chilean-jokes-festival-de-vina)):
- Audio fragments from comedy routines at the Festival de Viña del Mar.
- Represents spontaneous, colloquial Chilean Spanish.
- Captures humor and cultural expressions specific to Chile.
**Combined Dataset Properties:**
- **Language**: Spanish (Chilean variant)
- **Split**: `test`
- **Domain**: Mixed (formal recordings, volunteer speech, comedy performances)
- **Total Coverage**: Multiple speaking styles and contexts of Chilean Spanish
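Loading the combined test split follows the usual `datasets` pattern; a minimal sketch, assuming the standard audio/transcription column layout:
```python
from datasets import load_dataset

# single `test` split combining Common Voice (CL), Google Chilean Spanish and Datarisas
ds = load_dataset("astroza/es-cl-asr-test-only", split="test")
print(ds)     # column names and row count
print(ds[0])  # first example (audio + reference transcript)
```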
## Metrics
Following the Open ASR Leaderboard standard, we report:
- **WER (Word Error Rate)**: ⬇️ Lower is better - Measures transcription accuracy
- **RTFx (Inverse Real-Time Factor)**: ⬆️ Higher is better - Measures inference speed (audio_duration / transcription_time)
### Word Error Rate (WER)
Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It counts the errors in the system's output (substitutions, insertions, and deletions) as a proportion
of the number of words in the reference (correct) transcript. **A lower WER value indicates higher accuracy**.
Take the following example:
| Reference:  | el | gato | se | sentó | en | la | alfombra |
|-------------|----|------|----|-------|----|----|----------|
| Prediction: | el | gato | se | sentó | en | la |          |
| Label:      | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | D |
Here, we have:
* 0 substitutions
* 0 insertions
* 1 deletion ("alfombra" is missing)
This gives 1 error in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our
reference (N), which for this example is 7:
```
WER = (S + I + D) / N = (0 + 0 + 1) / 7 = 0.143
```
This gives a WER of 0.14, or 14%. For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing are removed from the references and predictions.
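The same figure can be reproduced with the Hugging Face [`evaluate`](https://huggingface.co/spaces/evaluate-metric/wer) WER metric linked above; the strings below are just the example from the table:
```python
import evaluate  # pip install evaluate jiwer

wer_metric = evaluate.load("wer")

references = ["el gato se sentó en la alfombra"]
predictions = ["el gato se sentó en la"]

# (S + I + D) / N = (0 + 0 + 1) / 7
print(wer_metric.compute(predictions=predictions, references=references))
# 0.14285714285714285
```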
### Inverse Real Time Factor (RTFx)
Inverse Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a
model to process a given amount of speech. It is defined as:
```
RTFx = (number of seconds of audio inferred) / (compute time in seconds)
```
Therefore, an RTFx of 1 means a system processes speech exactly as fast as it is spoken, while an RTFx of 2 means it processes speech twice as fast as real time.
Thus, **a higher RTFx value indicates lower latency**.
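In code, RTFx is simply a ratio of durations. A minimal sketch of the measurement (illustrative only, not the leaderboard's exact timing harness; `transcribe_fn` and `audio` are placeholders):
```python
import time

def measure_rtfx(transcribe_fn, audio, audio_duration_s):
    # transcribe_fn and audio stand in for whatever model
    # interface is being benchmarked
    start = time.perf_counter()
    transcribe_fn(audio)
    compute_time_s = time.perf_counter() - start
    return audio_duration_s / compute_time_s  # > 1 means faster than real time
```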
## Text Normalization for Spanish
This repository uses a **multilingual normalizer** configured to preserve Spanish characters:
```python
normalizer = BasicMultilingualTextNormalizer(remove_diacritics=False)
```
**What it does:**
- ✅ Preserves: `á, é, í, ó, ú, ñ, ü`
- ✅ Removes: bracketed annotations `[...]`, parenthesized asides `(...)`, punctuation and special symbols (including `¿`, `¡`, per the example below)
- ✅ Normalizes: whitespace, capitalization (converts to lowercase)
- ❌ Does NOT remove: accents or Spanish-specific characters
**Example:**
```
Input:  "¿Cómo estás? [ruido] (suspiro)"
Output: "cómo estás"
```
This is critical for Spanish evaluation, as diacritics change word meaning:
- `esta` (this) vs. `está` (is)
- `si` (if) vs. `sí` (yes)
- `el` (the) vs. `él` (he)
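For illustration, the behavior above can be approximated in a few lines; this is a minimal stand-in, not the actual `BasicMultilingualTextNormalizer` implementation used by the repo:
```python
import unicodedata

def normalize_es(text, remove_diacritics=False):
    # lowercase, drop [bracketed] and (parenthesized) spans,
    # replace punctuation with spaces while keeping accented letters
    out, depth = [], 0
    for ch in text.lower():
        if ch in "[(":
            depth += 1
        elif ch in "])":
            depth = max(depth - 1, 0)
        elif depth == 0:
            out.append(ch if ch.isalnum() or ch.isspace() else " ")
    result = " ".join("".join(out).split())
    if remove_diacritics:
        decomposed = unicodedata.normalize("NFD", result)
        result = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return result

print(normalize_es("¿Cómo estás? [ruido] (suspiro)"))  # -> "cómo estás"
```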
## How to reproduce our results
The ASR Leaderboard evaluation was conducted using [Modal](https://modal.com) for cloud-based distributed GPU evaluation.
For full details, head over to our repo: https://github.com/aastroza/open_asr_leaderboard_cl
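As a rough sketch of the pattern (function names, image contents and GPU type here are illustrative, not the repo's actual configuration), a Modal evaluation job looks like:
```python
import modal

app = modal.App("asr-eval-cl")
image = modal.Image.debian_slim().pip_install(
    "transformers", "datasets", "evaluate", "jiwer", "torch"
)

@app.function(gpu="A10G", image=image, timeout=3600)
def evaluate_model(model_id: str) -> dict:
    # placeholder body: load the model and the test split, transcribe,
    # then compute WER and RTFx as described above
    return {"model": model_id}

@app.local_entrypoint()
def main():
    models = ["openai/whisper-large-v3", "openai/whisper-large-v3-turbo"]
    # .map() fans the evaluations out across parallel cloud containers
    for result in evaluate_model.map(models):
        print(result)
```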
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{chilean-open-asr-leaderboard,
title={Open Automatic Speech Recognition Leaderboard - Chilean Spanish},
author={Instituto de Data Science UDD},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/spaces/idsudd/open_asr_leaderboard_cl}}
}
@misc{astroza2025chilean-dataset,
title={Chilean Spanish ASR Test Dataset},
author={Alonso Astroza},
year={2025},
howpublished={\url{https://huggingface.co/datasets/astroza/es-cl-asr-test-only}}
}
@misc{open-asr-leaderboard,
title={Open Automatic Speech Recognition Leaderboard},
author={Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and Hugging Face Team and Nvidia NeMo Team},
year={2023},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}}
}
"""