Parakeet-TDT-CTC-110M CoreML
NVIDIA's Parakeet-TDT-CTC-110M model converted to CoreML format for efficient inference on Apple Silicon.
Model Description
This is a hybrid ASR model with a shared Conformer encoder and two decoder heads:
- CTC Head: Fast greedy decoding, ideal for keyword spotting
- TDT Head: Token-Duration Transducer for high-quality transcription
Architecture
| Component | Description | Size |
|---|---|---|
| Preprocessor | Mel spectrogram extraction | ~1 MB |
| Encoder | Conformer encoder (shared) | ~400 MB |
| CTCHead | CTC output projection | ~4 MB |
| Decoder | TDT prediction network (LSTM) | ~25 MB |
| JointDecision | TDT joint network | ~6 MB |
Total size: ~436 MB
Performance
Benchmarked on the Earnings22 dataset (772 audio files):
| Metric | Value |
|---|---|
| Keyword Recall | 100% (1309/1309) |
| WER | 17.97% |
| RTFx (M4 Pro) | 358x real-time |
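To make the RTFx figure concrete, the sketch below converts a real-time factor into an estimated wall-clock processing time. It assumes RTFx is defined as audio duration divided by processing time, which is the common convention but is not spelled out in this card. The function name is illustrative and not part of this repository.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimate wall-clock time, assuming RTFx = audio duration / processing time."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# One hour of audio at the reported 358x real-time factor (M4 Pro)
one_hour = processing_seconds(3600.0, 358.0)
print(f"{one_hour:.1f} s")  # roughly ten seconds
```

Actual throughput will vary with hardware, audio length, and decoding mode.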
Requirements
- macOS 13+ (Ventura or later)
- Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
Installation
# Using uv (recommended)
uv sync
# Or using pip
pip install -e .
# For audio file support (WAV, MP3, etc.)
pip install -e ".[audio]"
Usage
Python Inference
from scripts.inference import ParakeetCoreML
# Load model (from current directory with .mlpackage files)
model = ParakeetCoreML(".")
# Transcribe with TDT (higher quality)
text = model.transcribe("audio.wav", mode="tdt")
print(text)
# Or use CTC for faster keyword spotting
text = model.transcribe("audio.wav", mode="ctc")
print(text)
Command Line
# TDT decoding (default, higher quality)
uv run scripts/inference.py --audio audio.wav
# CTC decoding (faster, good for keyword spotting)
uv run scripts/inference.py --audio audio.wav --mode ctc
Model Conversion
To convert from the original NeMo model:
# Install conversion dependencies
uv sync --extra convert
# Run conversion
uv run scripts/convert_nemo_to_coreml.py --output-dir ./model
This will:
- Download the original model from NVIDIA (nvidia/parakeet-tdt_ctc-110m)
- Convert each component to CoreML format
- Extract vocabulary and create metadata
File Structure
./
├── Preprocessor.mlpackage # Audio → Mel spectrogram
├── Encoder.mlpackage # Mel → Encoder features
├── CTCHead.mlpackage # Encoder → CTC log probs
├── Decoder.mlpackage # TDT prediction network
├── JointDecision.mlpackage # TDT joint network
├── vocab.json # Token vocabulary (1024 tokens)
├── metadata.json # Model configuration
├── pyproject.toml # Python dependencies
├── uv.lock # Locked dependencies
└── scripts/ # Inference & conversion scripts
Decoding Modes
TDT Mode (Recommended for Transcription)
- Uses Token-Duration Transducer decoding
- Higher accuracy (17.97% WER)
- Predicts both tokens and durations
- Best for full transcription tasks
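The idea of predicting both a token and a duration can be illustrated with a toy greedy loop. This is a minimal sketch, not the code in `scripts/inference.py`: `joint_step` is a hypothetical stand-in for the Decoder and JointDecision models, and `BLANK` and `max_symbols_per_frame` are assumptions made for the example.

```python
from typing import Callable, List, Tuple

BLANK = -1  # hypothetical blank marker used only in this toy example


def tdt_greedy_decode(
    num_frames: int,
    joint_step: Callable[[int, int], Tuple[int, int]],
    max_symbols_per_frame: int = 10,
) -> List[Tuple[int, int]]:
    """Toy greedy TDT loop returning (token, frame_index) pairs.

    joint_step(t, last_token) stands in for the prediction and joint
    networks and returns (token, duration) for encoder frame t.
    """
    t = 0
    last_token = BLANK
    symbols_here = 0  # tokens emitted without advancing in time
    emitted: List[Tuple[int, int]] = []
    while t < num_frames:
        token, duration = joint_step(t, last_token)
        if token != BLANK:
            emitted.append((token, t))
            last_token = token
            symbols_here += 1
        # Guarantee progress: a blank with zero duration, or too many
        # tokens on one frame, must still move to the next frame.
        if duration == 0 and (token == BLANK or symbols_here >= max_symbols_per_frame):
            duration = 1
        if duration > 0:
            symbols_here = 0
        t += duration
    return emitted


def toy_joint(t: int, last_token: int) -> Tuple[int, int]:
    """Scripted predictions standing in for a real model."""
    if t == 0:
        return (5, 2)  # emit token 5, then skip two frames
    if t == 3 and last_token != 7:
        return (7, 0)  # emit token 7 and stay on the same frame
    return (BLANK, 1)  # nothing to emit, advance one frame


print(tdt_greedy_decode(5, toy_joint))  # [(5, 0), (7, 3)]
```

Because the predicted duration lets the loop skip frames, a TDT decoder typically needs fewer joint-network evaluations than a frame-by-frame transducer.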
CTC Mode (Recommended for Keyword Spotting)
- Greedy CTC decoding
- Faster inference
- 100% keyword recall on Earnings22
- Best for detecting specific words/phrases
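Greedy CTC decoding takes the best-scoring token per frame, collapses consecutive repeats, and drops blanks. The sketch below shows that rule on plain Python lists. It is an illustration under assumptions rather than the repository's implementation: the function name and the choice of blank index are hypothetical, and the real blank ID should be taken from the model's configuration.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], blank_id: int) -> List[int]:
    """Collapse per-frame argmax predictions into a token ID sequence."""
    decoded: List[int] = []
    previous = None
    for scores in frame_scores:
        # Index of the highest-scoring class for this frame
        best = max(range(len(scores)), key=scores.__getitem__)
        # Keep a token only if it is not blank and not a repeat of the previous frame
        if best != previous and best != blank_id:
            decoded.append(best)
        previous = best
    return decoded


# Toy example with 3 classes, where class 2 is assumed to be the blank
frames = [
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
print(ctc_greedy_decode(frames, blank_id=2))  # [0, 0, 1]
```

The blank between the two runs of class 0 is what allows the same token to appear twice in the output.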
Custom Vocabulary / Keyword Spotting
For keyword spotting, CTC mode with custom vocabulary boosting achieves 100% recall:
import json

# Load custom vocabulary with token IDs
with open("custom_vocab.json") as f:
    keywords = json.load(f)  # {"keyword": [token_ids], ...}
# Run CTC decoding
tokens = model.decode_ctc(encoder_output)
# Check for keyword matches
# (is_subsequence is a helper that is not defined in this snippet)
for keyword, expected_ids in keywords.items():
    if is_subsequence(expected_ids, tokens):
        print(f"Found keyword: {keyword}")
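The snippet above relies on an `is_subsequence` helper that is not shown. One possible implementation is sketched here, assuming that "subsequence" means the keyword's token IDs appear in order with gaps allowed. A stricter contiguous variant, `is_contiguous_run`, is a hypothetical alternative added for comparison and is not taken from this repository.

```python
from typing import Sequence


def is_subsequence(needle: Sequence[int], haystack: Sequence[int]) -> bool:
    """True if every item of needle appears in haystack in order (gaps allowed)."""
    remaining = iter(haystack)
    # `token in remaining` consumes the iterator up to the first match,
    # so later tokens can only match later positions.
    return all(token in remaining for token in needle)


def is_contiguous_run(needle: Sequence[int], haystack: Sequence[int]) -> bool:
    """True if needle appears in haystack as one unbroken run."""
    target = list(needle)
    n = len(target)
    if n == 0:
        return True
    return any(list(haystack[i:i + n]) == target for i in range(len(haystack) - n + 1))


tokens = [12, 40, 7, 99, 3]
print(is_subsequence([40, 99], tokens))     # True: in order, with a gap
print(is_contiguous_run([40, 99], tokens))  # False: 7 sits in between
```

Allowing gaps is more tolerant of inserted tokens but can match unrelated text in long transcripts, so the contiguous check may be preferable when false positives matter.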
License
This model conversion is released under the Apache 2.0 License, same as the original NVIDIA model.
Citation
If you use this model, please cite the original NVIDIA work:
@misc{nvidia_parakeet_tdt_ctc,
title={Parakeet-TDT-CTC-110M},
author={NVIDIA},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/nvidia/parakeet-tdt_ctc-110m}
}
Acknowledgments
- Original model by NVIDIA NeMo
- CoreML conversion by FluidInference