Sukuma-TTS
The first Text-to-Speech model for Sukuma (Kisukuma), a Bantu language spoken by approximately 10 million people in northern Tanzania.
Model Description
Sukuma-TTS is a neural text-to-speech model fine-tuned on the Sukuma Voices dataset. It converts Sukuma text into natural-sounding speech, enabling voice-based applications for an underrepresented African language.
Model Details
| Property | Value |
|---|---|
| Base Model | Orpheus 3B v0.1 |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
| Audio Codec | SNAC 24kHz |
| Output Sample Rate | 24,000 Hz |
| Training Data | 19.56 hours of Sukuma speech |
| Language | Sukuma (ISO 639-3: suk) |
Training Configuration
| Parameter | Value |
|---|---|
| LoRA Rank | 64 |
| LoRA Alpha | 64 |
| Learning Rate | 2e-4 |
| Batch Size | 1 (effective: 4) |
| Epochs | 4 |
| Optimizer | AdamW 8-bit |
| LR Scheduler | Cosine |
| Max Sequence Length | 16,384 |
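For orientation, the hyperparameters above map roughly onto the following Unsloth/PEFT setup. This is a minimal sketch, not the authors' training script: the Orpheus checkpoint name is an assumption, and the data pipeline that interleaves text with SNAC audio tokens is omitted.

```python
# Sketch: mapping the table above onto an Unsloth + PEFT fine-tuning setup.
# Illustrative only -- the base checkpoint name is assumed and the actual
# text + SNAC-token data pipeline is not shown here.
from unsloth import FastLanguageModel
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="canopylabs/orpheus-3b-0.1-pretrained",  # assumed Orpheus 3B v0.1 checkpoint
    max_seq_length=16384,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,                       # LoRA rank
    lora_alpha=64,              # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,
)

args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size 4
    num_train_epochs=4,
    learning_rate=2e-4,
    optim="adamw_8bit",
    lr_scheduler_type="cosine",
    output_dir="sukuma-tts-lora",
)
```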
Evaluation Results
Mean Opinion Score (MOS)
| Audio Type | MOS Score |
|---|---|
| Sukuma-TTS (This Model) | 3.9 ± 0.15 |
| Human Recordings | 4.6 ± 0.1 |
Evaluated by native Sukuma speakers using a 5-point Likert scale.
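For readers unfamiliar with MOS, it is simply the mean of per-clip Likert ratings. A toy computation with invented ratings is shown below; the card does not state whether the ± denotes a standard deviation or a confidence interval.

```python
# Toy illustration of how a MOS is aggregated from listener ratings.
# The ratings are invented; real evaluations average over many clips and listeners.
import numpy as np

ratings = np.array([4, 4, 3, 5, 4, 4, 3, 4])   # 1-5 Likert scores
mos = ratings.mean()
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))
print(f"MOS = {mos:.2f} +/- {ci95:.2f} (95% CI)")
```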
ASR Evaluation (WER)
When synthesized speech was evaluated using our ASR model:
| Test Data | Word Error Rate |
|---|---|
| Original Speech | 25.19% |
| Synthesized Speech | 32.60% |
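WER counts word-level substitutions, insertions, and deletions against the reference transcript. Below is a toy example using the jiwer package with an invented ASR hypothesis (not actual model output).

```python
# Toy WER computation: one dropped word out of six reference words.
from jiwer import wer

reference  = "Umunhu ngwunuyo agabhalelaga chiza abhanhu bhakwe."
hypothesis = "Umunhu ngwunuyo agabhalelaga abhanhu bhakwe."   # invented ASR output

print(f"WER: {wer(reference, hypothesis):.2%}")   # 1 deletion / 6 words ~ 16.67%
```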
Usage
Installation
```bash
pip install transformers torch snac unsloth
```
Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("sartifyllc/sukuma-voices-tts")
tokenizer = AutoTokenizer.from_pretrained("sartifyllc/sukuma-voices-tts")

# Load SNAC decoder
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")

# Prepare input text
text = "<sukuma> Hunagwene oyombaga giki, nene nalinajo ijilanga."
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Add special tokens
start_token = torch.tensor([[128259]], dtype=torch.int64)
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)
input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)

# Generate audio tokens
model.eval()
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=4096,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.1,
        eos_token_id=128258,
    )

# Process generated tokens to audio
# (See the full inference code in the repository, or the sketch below.)
```
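The generated IDs still contain the interleaved SNAC codes. Continuing from the snippet above, the rough sketch below shows one way to decode them; the marker IDs, the 128266 offset, and the 7-codes-per-frame layout follow the public Orpheus examples and are assumptions here (the repository's inference script is authoritative). Saving uses soundfile, which is not in the install line above.

```python
# Rough sketch of turning the generated token IDs into a waveform with SNAC.
# Marker IDs and the 128266 offset are assumptions taken from public Orpheus
# examples; see the repository for the exact decoding logic.
import torch
import soundfile as sf

AUDIO_OFFSET = 128266          # assumed first audio-code token ID
SPEECH_START = 128257          # assumed start-of-speech marker
EOS = 128258

ids = generated_ids[0].tolist()
start = max(i for i, t in enumerate(ids) if t == SPEECH_START) + 1
codes = [t - AUDIO_OFFSET for t in ids[start:] if t != EOS]
codes = codes[: (len(codes) // 7) * 7]          # SNAC frames are 7 codes wide

# Redistribute the flat code stream into SNAC's 3 hierarchical codebooks
layer_1, layer_2, layer_3 = [], [], []
for i in range(0, len(codes), 7):
    f = codes[i : i + 7]
    layer_1.append(f[0])
    layer_2 += [f[1] - 4096, f[4] - 4 * 4096]
    layer_3 += [f[2] - 2 * 4096, f[3] - 3 * 4096, f[5] - 5 * 4096, f[6] - 6 * 4096]

with torch.no_grad():
    audio = snac_model.decode([
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0),
    ])

sf.write("output.wav", audio.squeeze().cpu().numpy(), 24000)
```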
Full Inference Script
For complete inference with audio output, see our GitHub repository.
```python
from sukuma_tts import SukumaTTS

# Initialize
tts = SukumaTTS.from_pretrained("sartifyllc/sukuma-voices-tts")

# Generate speech
audio = tts.synthesize("Umunhu ngwunuyo agabhalelaga chiza abhanhu bhakwe.")

# Save to file
tts.save_audio(audio, "output.wav")
```
Intended Use
Primary Use Cases
✅ Recommended:
- Voice assistants for Sukuma speakers
- Audiobook generation
- Educational applications
- Accessibility tools for visually impaired users
- Language preservation and documentation
- Research on low-resource TTS
Out-of-Scope Use
❌ Not Recommended:
- Generating deceptive or misleading audio
- Impersonating real individuals
- Creating harmful or offensive content
- Applications requiring real-time synthesis (model not optimized for speed)
Limitations
- Domain Specificity: Trained primarily on biblical text; may not capture colloquial speech patterns
- Single Speaker: Limited speaker diversity in training data
- Diacritics: Model uses non-diacritic Sukuma orthography
- Inference Speed: Not optimized for real-time applications
- Text Length: Very long inputs may produce degraded audio quality; see the chunking workaround sketched below
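A simple workaround for the text-length limitation is to split the input on sentence boundaries, synthesize each piece, and concatenate the clips. The sketch below uses the SukumaTTS wrapper from the Full Inference Script section; the splitting rule and pause length are arbitrary choices, and the wrapper is assumed to return a 1-D 24 kHz waveform.

```python
# Sketch of a long-text workaround: split into sentences, synthesize each,
# and concatenate the 24 kHz clips with short pauses in between.
import re
import numpy as np
from sukuma_tts import SukumaTTS

tts = SukumaTTS.from_pretrained("sartifyllc/sukuma-voices-tts")

long_text = (
    "Umunhu ngwunuyo agabhalelaga chiza abhanhu bhakwe. "
    "Uweyi agabhalangaga na bhanhu bhakwe inzila ja gwigulambigija."
)

sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", long_text) if s.strip()]
clips = [np.asarray(tts.synthesize(s)) for s in sentences]

pause = np.zeros(int(0.3 * 24000), dtype=clips[0].dtype)   # 300 ms of silence
pieces = []
for clip in clips:
    pieces.append(clip)
    pieces.append(pause)

audio = np.concatenate(pieces)
tts.save_audio(audio, "long_output.wav")
```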
Example Outputs
| Input Text (Sukuma) | English Translation |
|---|---|
| Umunhu ngwunuyo agabhalelaga chiza abhanhu bhakwe. | This person raises his people well. |
| Uweyi agabhalangaga na bhanhu bhakwe inzila ja gwigulambigija. | He teaches his people to be diligent. |
Technical Details
Architecture
```
Orpheus 3B (Base)
│
├── LoRA Adapters (r=64, α=64)
│   ├── q_proj, k_proj, v_proj, o_proj
│   └── gate_proj, up_proj, down_proj
│
└── SNAC Audio Codec (24 kHz)
    └── 3-layer hierarchical representation
```
Token Format
```
[START_HUMAN] [TEXT_TOKENS] [END_TEXT] [END_HUMAN] [START_AI] [START_SPEECH] [AUDIO_CODES] [END_SPEECH] [END_AI]
```
Training Data
This model was trained on the Sukuma Voices dataset:
| Metric | Value |
|---|---|
| Total Samples | 6,871 |
| Total Duration | 19.56 hours |
| Unique Vocabulary | 21,366 words |
| Source | Sukuma New Testament 2000 |
Dataset: sartifyllc/SUKUMA_VOICE
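The dataset can be pulled directly from the Hub with the datasets library. In the sketch below the split and column names ("audio", "text") are assumptions; check ds.column_names after loading.

```python
# Sketch: load the Sukuma Voices dataset and verify its size and duration.
# Split and column names are assumptions; decoding every clip is slow but
# adequate for a one-off sanity check.
from datasets import load_dataset

ds = load_dataset("sartifyllc/SUKUMA_VOICE", split="train")
print(len(ds), "samples")

hours = sum(len(ex["audio"]["array"]) / ex["audio"]["sampling_rate"] for ex in ds) / 3600
print(f"{hours:.2f} hours of audio")
```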
Citation
If you use this model, please cite:
```bibtex
@inproceedings{mgonzo2025sukuma,
  title={Learning from Scarcity: Building and Benchmarking Speech Technology for Sukuma},
  author={Mgonzo, Macton and Oketch, Kezia and Etori, Naome and Mang'eni, Winnie and Nyaki, Elizabeth and Mollel, Michael S.},
  booktitle={Proceedings of the Association for Computational Linguistics},
  year={2025}
}
```
Authors
| Name | Affiliation |
|---|---|
| Macton Mgonzo | Brown University |
| Kezia Oketch | University of Notre Dame |
| Naome Etori | University of Minnesota - Twin Cities |
| Winnie Mang'eni | Pawa AI |
| Elizabeth Nyaki | Pawa AI, Sartify Company Limited |
| Michael S. Mollel | Pawa AI, Sartify Company Limited |
Acknowledgments
We gratefully acknowledge Sartify Company Limited and Pawa AI for their support in data collection and model development.
License
This model is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Contact
- Correspondence: macton_mgonzo@brown.edu
- Issues: GitHub Issues
- HuggingFace: Mollel/Sukuma-TTS
Advancing speech technology for African languages, one model at a time.