Africa-BBPE Tokenizer

📋 Model Card

Model Overview

  • Model Name: NaolBM/Africa-BBPE
  • Model Type: Byte-Level BPE Tokenizer
  • Training Corpus: NaolBM/african-corpus (trained from scratch, no base model)
  • Vocabulary Size: 50,000
  • Languages Supported: Amharic, Swahili, Hausa, Oromo, Yoruba, Tigrinya, English

🎯 Performance Benchmarks

Tokenizer Battle Results

Comparison against the Gemma-3 and Qwen-3 tokenizers (token count per benchmark sample; lower is better):

| Language | Africa-BBPE | Gemma-3 | Qwen-3 | Winner |
|---|---|---|---|---|
| Amharic | 4 | 15 | 18 | 🇪🇹 Africa-BBPE |
| Swahili | 8 | 10 | 12 | 🇰🇪 Africa-BBPE |
| Hausa | 11 | 12 | 12 | 🇳🇬 Africa-BBPE |
| Oromo | 5 | 11 | 10 | 🇪🇹 Africa-BBPE |
| Yoruba | 9 | 8 | 8 | 🇳🇬 Gemma-3 |
| Tigrinya | 5 | 13 | 21 | 🇪🇷 Africa-BBPE |
| English | 7 | 4 | 3 | 🇬🇧 Qwen-3 |
| Code-switching | 10 | 17 | 21 | 🌍 Africa-BBPE |

Summary Statistics

| Metric | Africa-BBPE | Gemma-3 | Qwen-3 |
|---|---|---|---|
| 🏆 Wins | 6 | 1 | 1 |
| 📊 Total Tokens | 59 | 90 | 105 |
| ⚡ Avg Tokens/Sample | 7.38 | 11.25 | 13.13 |

Language Family Performance

Average tokens per sample, by language family (lower is better):

| Language Family | Africa-BBPE | Gemma-3 | Qwen-3 |
|---|---|---|---|
| Semitic (Ge'ez) | 4.5 | 14.0 | 19.5 |
| Cushitic | 5.0 | 11.0 | 10.0 |
| Bantu | 8.0 | 10.0 | 12.0 |
| Chadic | 11.0 | 12.0 | 12.0 |
| Benue-Congo | 9.0 | 8.0 | 8.0 |
| Germanic | 7.0 | 4.0 | 3.0 |
| Code-switching | 10.0 | 17.0 | 21.0 |

๐ŸŒ Language Support

| Language | Code | Script | Tokenization Efficiency |
|---|---|---|---|
| Amharic | am | Ge'ez | ⭐⭐⭐⭐⭐ (4 tokens avg) |
| Tigrinya | ti | Ge'ez | ⭐⭐⭐⭐⭐ (5 tokens avg) |
| Oromo | om | Latin | ⭐⭐⭐⭐⭐ (5 tokens avg) |
| Swahili | sw | Latin | ⭐⭐⭐⭐ (8 tokens avg) |
| Hausa | ha | Latin | ⭐⭐⭐ (11 tokens avg) |
| Yoruba | yo | Latin | ⭐⭐⭐ (9 tokens avg) |
| English | en | Latin | ⭐⭐ (7 tokens avg) |
| Code-switching | Mixed | Mixed | ⭐⭐⭐⭐⭐ (10 tokens avg) |

🚀 Usage

from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("NaolBM/Africa-BBPE")

# Example usage
text = "አማርኛ ቋንቋ በኢትዮጵያ"  # "the Amharic language, in Ethiopia"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(f"Tokens: {tokens}")
print(f"Token IDs: {ids}")
print(f"Number of tokens: {len(ids)}")

📊 Training Data

  • Dataset: NaolBM/african-corpus
  • Total Rows: 35,344,339
  • Languages: Amharic, Swahili, Hausa, Oromo, Yoruba, Tigrinya, English

Dataset Composition

| Language | Rows | Percentage |
|---|---|---|
| Swahili | 14,125,925 | 39.97% |
| Amharic | 10,815,255 | 30.60% |
| Hausa | 7,144,077 | 20.21% |
| English | 2,119,719 | 6.00% |
| Oromo | 881,450 | 2.49% |
| Yoruba | 245,837 | 0.70% |
| Tigrinya | 12,076 | 0.03% |
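The card credits the Hugging Face Tokenizers library (see Acknowledgments) but does not publish the training script. A byte-level BPE like this one can be trained roughly as follows; the toy in-memory corpus, the `vocab_size`, and the special-token names here are illustrative assumptions, not the actual training configuration:

```python
from tokenizers import ByteLevelBPETokenizer

# Toy multilingual corpus standing in for NaolBM/african-corpus;
# the real run trained on ~35M rows across seven languages.
corpus = [
    "Habari ya dunia",      # Swahili
    "Sannu duniya",         # Hausa
    "Akkam, addunyaa!",     # Oromo
    "Bawo ni, ayé",         # Yoruba
    "አማርኛ ቋንቋ",              # Amharic (Ge'ez script)
    "Hello, world",         # English
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=500,         # the released tokenizer uses 50,000
    min_frequency=1,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>"],  # assumed, not confirmed
)

enc = tokenizer.encode("Habari ya dunia")
print(enc.tokens, enc.ids)
```

Byte-level BPE starts from the 256-byte alphabet, which is why no character of any script can ever be out-of-vocabulary; efficiency on Ge'ez then comes from learning merges over its multi-byte UTF-8 sequences.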

๐Ÿ” Key Strengths

  • ✅ Nearly 4x more efficient than Gemma-3 on Amharic (4 vs 15 tokens)
  • ✅ Over 4x more efficient than Qwen-3 on Tigrinya (5 vs 21 tokens)
  • ✅ Up to 2x more efficient on code-switched text (10 vs 17-21 tokens)
  • ✅ Best-in-class in 6 of 8 language categories
  • ✅ Optimized for Ge'ez script (Amharic, Tigrinya)
  • ✅ Excellent on Cushitic languages (Oromo)

📈 Efficiency Gains

Compared to Gemma-3:

  • 73% fewer tokens for Amharic
  • 62% fewer tokens for Tigrinya
  • 55% fewer tokens for Oromo
  • 34% fewer tokens overall (59 vs 90)

Compared to Qwen-3:

  • 78% fewer tokens for Amharic
  • 76% fewer tokens for Tigrinya
  • 52% fewer tokens for code-switching
  • 44% fewer tokens overall
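These percentages can be recomputed from the battle-table counts: each one is (baseline − Africa-BBPE) / baseline, with the overall figures using the total-token row (59 vs 90 and 105). A quick check of a few of the comparisons:

```python
def reduction(ours: int, baseline: int) -> float:
    """Percentage of tokens saved relative to a baseline tokenizer."""
    return 100 * (baseline - ours) / baseline

# Counts taken from the battle table (Africa-BBPE vs baseline).
print(f"Amharic vs Gemma-3:  {reduction(4, 15):.0f}% fewer tokens")   # ~73%
print(f"Tigrinya vs Qwen-3:  {reduction(5, 21):.0f}% fewer tokens")   # ~76%
print(f"Overall vs Gemma-3:  {reduction(59, 90):.0f}% fewer tokens")  # ~34%
print(f"Overall vs Qwen-3:   {reduction(59, 105):.0f}% fewer tokens") # ~44%
```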

📜 License

MIT

๐Ÿ™ Acknowledgments

  • Trained on NaolBM/african-corpus
  • Built with Hugging Face Tokenizers library
  • All original dataset contributors
