Sukuma-TTS 🎙️

The first Text-to-Speech model for Sukuma (Kisukuma), a Bantu language spoken by approximately 10 million people in northern Tanzania.



Model Description

Sukuma-TTS is a neural text-to-speech model fine-tuned on the Sukuma Voices dataset. It converts Sukuma text into natural-sounding speech, enabling voice-based applications for an underrepresented African language.

Model Details

| Property | Value |
|---|---|
| Base Model | Orpheus 3B v0.1 |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
| Audio Codec | SNAC 24kHz |
| Output Sample Rate | 24,000 Hz |
| Training Data | 19.56 hours of Sukuma speech |
| Language | Sukuma (ISO 639-3: suk) |

Training Configuration

| Parameter | Value |
|---|---|
| LoRA Rank | 64 |
| LoRA Alpha | 64 |
| Learning Rate | 2e-4 |
| Batch Size | 1 (effective: 4) |
| Epochs | 4 |
| Optimizer | AdamW 8-bit |
| LR Scheduler | Cosine |
| Max Sequence Length | 16,384 |
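
As a back-of-envelope check of the configuration above (rank 64, alpha 64, micro-batch 1 with the effective batch of 4 implying gradient accumulation), the sketch below counts the extra parameters LoRA adds to one projection. The hidden size `d = 3072` is an assumption about the Orpheus 3B backbone, not a figure stated on this card.

```python
def lora_param_count(d_in, d_out, r):
    # LoRA learns a low-rank update B @ A for a frozen weight W
    # (d_out x d_in), where A is (r x d_in) and B is (d_out x r).
    return r * d_in + d_out * r

d, r = 3072, 64  # d is an assumed hidden size, r from the table above
per_projection = lora_param_count(d, d, r)
print(per_projection)  # extra trainable params per square projection

effective_batch = 1 * 4  # micro-batch 1 x gradient accumulation 4
print(effective_batch)
```

With these assumptions, each square attention projection gains roughly 0.4M trainable parameters while the 3B base stays frozen, which is what makes fine-tuning feasible at batch size 1.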

Evaluation Results

Mean Opinion Score (MOS)

| Audio Type | MOS Score |
|---|---|
| Sukuma-TTS (This Model) | 3.9 ± 0.15 |
| Human Recordings | 4.6 ± 0.1 |

Evaluated by native Sukuma speakers using a 5-point Likert scale.

ASR Evaluation (WER)

To assess intelligibility, we transcribed both original and synthesized speech with our Sukuma ASR model:

| Test Data | Word Error Rate |
|---|---|
| Original Speech | 25.19% |
| Synthesized Speech | 32.60% |
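
WER as used above is the word-level edit distance between the ASR transcript and the reference, divided by the reference length. A minimal self-contained implementation (the card's actual evaluation may use a library such as `jiwer`):

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over word tokens,
    normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

On this metric, the roughly 7-point gap between original and synthesized speech reflects how much intelligibility the TTS pipeline loses relative to the recordings.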

Usage

Installation

```bash
pip install transformers torch snac unsloth
```

Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("sartifyllc/sukuma-voices-tts")
tokenizer = AutoTokenizer.from_pretrained("sartifyllc/sukuma-voices-tts")

# Load the SNAC decoder used to turn generated audio codes into a waveform
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")

# Prepare input text (prefixed with the <sukuma> language tag)
text = "<sukuma> Hunagwene oyombaga giki, nene nalinajo ijilanga."
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Wrap the text in the prompt's special tokens
start_token = torch.tensor([[128259]], dtype=torch.int64)         # start of human turn
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)  # end of text, end of human turn
input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)

# Generate audio tokens
model.eval()
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=4096,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.1,
        eos_token_id=128258,  # end-of-speech token
    )

# Process generated tokens to audio
# (See full inference code in repository)
```

Full Inference Script

For complete inference with audio output, see our GitHub repository.

```python
from sukuma_tts import SukumaTTS

# Initialize
tts = SukumaTTS.from_pretrained("sartifyllc/sukuma-voices-tts")

# Generate speech
audio = tts.synthesize("Umunhu ngwunuyo agabhalelaga chiza abhanhu bhakwe.")

# Save to file
tts.save_audio(audio, "output.wav")
```

Intended Use

Primary Use Cases

✅ Recommended:

  • Voice assistants for Sukuma speakers
  • Audiobook generation
  • Educational applications
  • Accessibility tools for visually impaired users
  • Language preservation and documentation
  • Research on low-resource TTS

Out-of-Scope Use

โŒ Not Recommended:

  • Generating deceptive or misleading audio
  • Impersonating real individuals
  • Creating harmful or offensive content
  • Applications requiring real-time synthesis (model not optimized for speed)

Limitations

  1. Domain Specificity: Trained primarily on biblical text; may not capture colloquial speech patterns
  2. Single Speaker: Limited speaker diversity in training data
  3. Diacritics: Model uses non-diacritic Sukuma orthography
  4. Inference Speed: Not optimized for real-time applications
  5. Text Length: Very long texts may produce degraded audio quality
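
One practical mitigation for limitation 5 is to split long inputs at sentence boundaries and synthesize each chunk separately. A minimal sketch; the `max_chars` threshold is an illustrative parameter, not a documented model limit:

```python
import re

def chunk_text(text, max_chars=200):
    """Split text into sentence-aligned chunks of at most ~max_chars
    characters, so each chunk can be synthesized on its own."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

The resulting audio segments can then be concatenated, optionally with short silences between chunks.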

Example Outputs

| Input Text (Sukuma) | English Translation |
|---|---|
| Umunhu ngwunuyo agabhalelaga chiza abhanhu bhakwe. | This person raises his people well. |
| Uweyi agabhalangaga na bhanhu bhakwe inzila ja gwigulambigija. | He teaches his people to be diligent. |

Technical Details

Architecture

```
Orpheus 3B (Base)
    │
    ├── LoRA Adapters (r=64, α=64)
    │   ├── q_proj, k_proj, v_proj, o_proj
    │   └── gate_proj, up_proj, down_proj
    │
    └── SNAC Audio Codec (24kHz)
        └── 3-layer hierarchical representation
```

Token Format

```
[START_HUMAN] [TEXT_TOKENS] [END_TEXT] [END_HUMAN] [START_AI] [START_SPEECH] [AUDIO_CODES] [END_SPEECH] [END_AI]
```

Training Data

This model was trained on the Sukuma Voices dataset:

| Metric | Value |
|---|---|
| Total Samples | 6,871 |
| Total Duration | 19.56 hours |
| Unique Vocabulary | 21,366 words |
| Source | Sukuma New Testament 2000 |
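
A quick sanity check of the statistics above: dividing total duration by sample count gives the average clip length.

```python
# Derived from the dataset table: 19.56 hours over 6,871 samples.
total_hours = 19.56
n_samples = 6871
avg_seconds = total_hours * 3600 / n_samples
print(round(avg_seconds, 2))  # ~10.25 seconds per clip
```

Clips averaging around ten seconds are typical for single-speaker TTS corpora, which fits the read-speech (New Testament) source.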

📊 Dataset: sartifyllc/SUKUMA_VOICE

Citation

If you use this model, please cite:

```bibtex
@inproceedings{mgonzo2025sukuma,
  title={Learning from Scarcity: Building and Benchmarking Speech Technology for Sukuma},
  author={Mgonzo, Macton and Oketch, Kezia and Etori, Naome and Mang'eni, Winnie and Nyaki, Elizabeth and Mollel, Michael S.},
  booktitle={Proceedings of the Association for Computational Linguistics},
  year={2025}
}
```

Authors

| Name | Affiliation |
|---|---|
| Macton Mgonzo | Brown University |
| Kezia Oketch | University of Notre Dame |
| Naome Etori | University of Minnesota - Twin Cities |
| Winnie Mang'eni | Pawa AI |
| Elizabeth Nyaki | Pawa AI, Sartify Company Limited |
| Michael S. Mollel | Pawa AI, Sartify Company Limited |

Acknowledgments

We gratefully acknowledge Sartify Company Limited and Pawa AI for their support in data collection and model development.

License

This model is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Contact


Advancing speech technology for African languages, one model at a time.

Sartify • Pawa AI • Dataset
