# Africa-BBPE Tokenizer
## Model Card

### Model Overview
- Model Name: NaolBM/Africa-BBPE
- Model Type: Byte-Level BPE Tokenizer
- Training Corpus: NaolBM/african-corpus
- Vocabulary Size: 50,000
- Languages Supported: Amharic, Swahili, Hausa, Oromo, Yoruba, Tigrinya, English
## Performance Benchmarks

### Tokenizer Battle Results
Token counts per sample (lower is better), compared against the Gemma-3 and Qwen-3 tokenizers:
| Language | Africa-BBPE | Gemma-3 | Qwen-3 | Winner |
|---|---|---|---|---|
| Amharic | 4 | 15 | 18 | Africa-BBPE |
| Swahili | 8 | 10 | 12 | Africa-BBPE |
| Hausa | 11 | 12 | 12 | Africa-BBPE |
| Oromo | 5 | 11 | 10 | Africa-BBPE |
| Yoruba | 9 | 8 | 8 | Gemma-3 |
| Tigrinya | 5 | 13 | 21 | Africa-BBPE |
| English | 7 | 4 | 3 | Qwen-3 |
| Code-switching | 10 | 17 | 21 | Africa-BBPE |
### Summary Statistics
| Metric | Africa-BBPE | Gemma-3 | Qwen-3 |
|---|---|---|---|
| Wins | 6 | 1 | 1 |
| Total Tokens | 59 | 90 | 105 |
| Avg Tokens/Sample | 7.38 | 11.25 | 13.13 |
### Language Family Performance
Average tokens per sample by language family:
| Language Family | Africa-BBPE | Gemma-3 | Qwen-3 |
|---|---|---|---|
| Semitic (Ge'ez) | 4.5 | 14.0 | 19.5 |
| Cushitic | 5.0 | 11.0 | 10.0 |
| Bantu | 8.0 | 10.0 | 12.0 |
| Chadic | 11.0 | 12.0 | 12.0 |
| Benue-Congo | 9.0 | 8.0 | 8.0 |
| Germanic | 7.0 | 4.0 | 3.0 |
| Code-switching | 10.0 | 17.0 | 21.0 |
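The summary statistics above follow directly from the per-sample token counts in the battle table. A minimal sanity-check sketch (the counts below are copied from that table):

```python
# Per-sample token counts, copied from the battle table above
# (order: Amharic, Swahili, Hausa, Oromo, Yoruba, Tigrinya,
# English, code-switching).
counts = {
    "Africa-BBPE": [4, 8, 11, 5, 9, 5, 7, 10],
    "Gemma-3": [15, 10, 12, 11, 8, 13, 4, 17],
    "Qwen-3": [18, 12, 12, 10, 8, 21, 3, 21],
}

# Recompute the Total Tokens and Avg Tokens/Sample rows.
for name, toks in counts.items():
    total = sum(toks)
    avg = total / len(toks)
    print(f"{name}: total={total}, avg={avg} tokens/sample")
```

The recomputed totals (59, 90, 105) match the summary table.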
## Language Support
| Language | Code | Script | Tokenization Efficiency |
|---|---|---|---|
| Amharic | am | Ge'ez | ⭐⭐⭐⭐⭐ (4 tokens avg) |
| Tigrinya | ti | Ge'ez | ⭐⭐⭐⭐⭐ (5 tokens avg) |
| Oromo | om | Latin | ⭐⭐⭐⭐⭐ (5 tokens avg) |
| Swahili | sw | Latin | ⭐⭐⭐⭐ (8 tokens avg) |
| Hausa | ha | Latin | ⭐⭐⭐ (11 tokens avg) |
| Yoruba | yo | Latin | ⭐⭐⭐ (9 tokens avg) |
| English | en | Latin | ⭐⭐ (7 tokens avg) |
| Code-switching | Mixed | Mixed | ⭐⭐⭐⭐⭐ (10 tokens avg) |
## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("NaolBM/Africa-BBPE")

# Example usage with Amharic text ("Hello, world")
text = "ሰላም ለዓለም"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print(f"Tokens: {tokens}")
print(f"Token IDs: {ids}")
print(f"Number of tokens: {len(ids)}")
```
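The large Ge'ez-script gains reported above are consistent with how byte-level BPE works: it starts from raw UTF-8 bytes, and every Ethiopic character occupies 3 bytes, so a tokenizer without learned Ge'ez merges can fall back to roughly one token per byte. A minimal sketch of that worst-case arithmetic (plain Python, no tokenizer required):

```python
# Each Ethiopic (Ge'ez) code point encodes to 3 bytes in UTF-8, so a
# byte-level tokenizer with no learned Ge'ez merges could spend up to
# three tokens per character in the byte-fallback worst case.
word = "ሰላም"  # Amharic for "hello/peace"
chars = len(word)
raw = word.encode("utf-8")
print(chars)      # 3 characters
print(len(raw))   # 9 bytes -> up to 9 fallback tokens before merges
```

Merges learned from Ge'ez-heavy training data collapse these byte runs back into whole characters and words, which is what the 4-5 token averages for Amharic and Tigrinya reflect.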
## Training Data
- Dataset: NaolBM/african-corpus
- Total Rows: 35,344,339
- Languages: Amharic, Swahili, Hausa, Oromo, Yoruba, Tigrinya, English
### Dataset Composition
| Language | Rows | Percentage |
|---|---|---|
| Swahili | 14,125,925 | 39.97% |
| Amharic | 10,815,255 | 30.60% |
| Hausa | 7,144,077 | 20.21% |
| English | 2,119,719 | 6.00% |
| Oromo | 881,450 | 2.49% |
| Yoruba | 245,837 | 0.70% |
| Tigrinya | 12,076 | 0.03% |
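The composition percentages follow directly from the row counts; a quick sanity check in plain Python (row counts copied from the table above):

```python
# Rows per language, copied from the dataset composition table.
rows = {
    "Swahili": 14_125_925,
    "Amharic": 10_815_255,
    "Hausa": 7_144_077,
    "English": 2_119_719,
    "Oromo": 881_450,
    "Yoruba": 245_837,
    "Tigrinya": 12_076,
}

total = sum(rows.values())
print(total)  # 35344339 -- matches the Total Rows figure
for lang, n in rows.items():
    print(f"{lang}: {100 * n / total:.2f}%")
```

Note the long tail: Tigrinya contributes only 0.03% of rows, yet the tokenizer still averages 5 tokens on the Tigrinya sample, likely because Tigrinya shares the Ge'ez script with the much larger Amharic portion.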
## Key Strengths
- ✅ Nearly 4x more efficient than Gemma-3 on Amharic (4 vs 15 tokens)
- ✅ Over 4x more efficient than Qwen-3 on Tigrinya (5 vs 21 tokens)
- ✅ Roughly 2x more efficient on code-switched text (10 vs 17-21 tokens)
- ✅ Best-in-class in 6 of 8 language categories
- ✅ Optimized for the Ge'ez script (Amharic, Tigrinya)
- ✅ Strong on Cushitic languages (Oromo)
## Efficiency Gains
Compared to Gemma-3:
- 73% fewer tokens for Amharic
- 62% fewer tokens for Tigrinya
- 55% fewer tokens for Oromo
- 34% fewer tokens overall

Compared to Qwen-3:
- 78% fewer tokens for Amharic
- 76% fewer tokens for Tigrinya
- 52% fewer tokens for code-switched text
- 44% fewer tokens overall
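The overall figures can be derived from the Avg Tokens/Sample row of the summary statistics; a minimal sketch (the helper name is illustrative, not part of any library):

```python
def pct_fewer(ours: float, baseline: float) -> float:
    """Percent fewer tokens than a baseline tokenizer."""
    return 100 * (baseline - ours) / baseline

# Per-sample averages from the summary statistics table above.
print(f"vs Gemma-3: {pct_fewer(7.38, 11.25):.0f}% fewer tokens overall")
print(f"vs Qwen-3:  {pct_fewer(7.38, 13.13):.0f}% fewer tokens overall")
```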
## License
MIT
## Acknowledgments
- Trained on NaolBM/african-corpus
- Built with the Hugging Face Tokenizers library
- All original dataset contributors
## Links
- Tokenizer: NaolBM/Africa-BBPE
- Dataset: NaolBM/african-corpus