Africa-BBPE Tokenizer

📋 Model Card

Model Overview

  • Model Name: NaolBM/Africa-BBPE
  • Model Type: Byte-Level BPE Tokenizer
  • Training Corpus: NaolBM/african-corpus (trained from scratch, no base model)
  • Vocabulary Size: 50,000
  • Languages Supported: Amharic, Swahili, Hausa, Oromo, Yoruba, Tigrinya, English

🎯 Performance Benchmarks

Tokenizer Battle Results

Comparison against the Gemma-3 and Qwen-3 tokenizers (token count per benchmark sample; lower is better):

| Language | Africa-BBPE | Gemma-3 | Qwen-3 | Winner |
|---|---|---|---|---|
| Amharic | 4 | 15 | 18 | 🇪🇹 Africa-BBPE |
| Swahili | 8 | 10 | 12 | 🇰🇪 Africa-BBPE |
| Hausa | 11 | 12 | 12 | 🇳🇬 Africa-BBPE |
| Oromo | 5 | 11 | 10 | 🇪🇹 Africa-BBPE |
| Yoruba | 9 | 8 | 8 | 🇳🇬 Gemma-3 |
| Tigrinya | 5 | 13 | 21 | 🇪🇷 Africa-BBPE |
| English | 7 | 4 | 3 | 🇬🇧 Qwen-3 |
| Code-switching | 10 | 17 | 21 | 🌍 Africa-BBPE |

Summary Statistics

| Metric | Africa-BBPE | Gemma-3 | Qwen-3 |
|---|---|---|---|
| 🏆 Wins | 6 | 1 | 1 |
| 📊 Total Tokens | 59 | 90 | 105 |
| ⚡ Avg Tokens/Sample | 7.38 | 11.25 | 13.13 |

Language Family Performance

Average tokens per sample, by language family (lower is better):

| Language Family | Africa-BBPE | Gemma-3 | Qwen-3 |
|---|---|---|---|
| Semitic (Ge'ez) | 4.5 | 14.0 | 19.5 |
| Cushitic | 5.0 | 11.0 | 10.0 |
| Bantu | 8.0 | 10.0 | 12.0 |
| Chadic | 11.0 | 12.0 | 12.0 |
| Benue-Congo | 9.0 | 8.0 | 8.0 |
| Germanic | 7.0 | 4.0 | 3.0 |
| Code-switching | 10.0 | 17.0 | 21.0 |

๐ŸŒ Language Support

| Language | Code | Script | Tokenization Efficiency |
|---|---|---|---|
| Amharic | am | Ge'ez | ⭐⭐⭐⭐⭐ (4 tokens avg) |
| Tigrinya | ti | Ge'ez | ⭐⭐⭐⭐⭐ (5 tokens avg) |
| Oromo | om | Latin | ⭐⭐⭐⭐⭐ (5 tokens avg) |
| Swahili | sw | Latin | ⭐⭐⭐⭐ (8 tokens avg) |
| Hausa | ha | Latin | ⭐⭐⭐ (11 tokens avg) |
| Yoruba | yo | Latin | ⭐⭐⭐ (9 tokens avg) |
| English | en | Latin | ⭐⭐ (7 tokens avg) |
| Code-switching | Mixed | Mixed | ⭐⭐⭐⭐⭐ (10 tokens avg) |

🚀 Usage

from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("NaolBM/Africa-BBPE")

# Example usage
text = "አማርኛ ቋንቋ በኢትዮጵያ"  # "the Amharic language, in Ethiopia"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(f"Tokens: {tokens}")
print(f"Token IDs: {ids}")
print(f"Number of tokens: {len(ids)}")

📊 Training Data

  • Dataset: NaolBM/african-corpus
  • Total Rows: 35,344,339
  • Languages: Amharic, Swahili, Hausa, Oromo, Yoruba, Tigrinya, English

Dataset Composition

| Language | Rows | Percentage |
|---|---|---|
| Swahili | 14,125,925 | 39.97% |
| Amharic | 10,815,255 | 30.60% |
| Hausa | 7,144,077 | 20.21% |
| English | 2,119,719 | 6.00% |
| Oromo | 881,450 | 2.49% |
| Yoruba | 245,837 | 0.70% |
| Tigrinya | 12,076 | 0.03% |
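The card credits the Hugging Face Tokenizers library (see Acknowledgments) but does not publish the training script. A byte-level BPE like this one can be trained roughly as follows; the toy in-memory corpus, the `vocab_size`, and the special-token names here are illustrative assumptions, not the actual training configuration:

```python
from tokenizers import ByteLevelBPETokenizer

# Toy multilingual corpus standing in for NaolBM/african-corpus;
# the real run trained on ~35M rows across seven languages.
corpus = [
    "Habari ya dunia",      # Swahili
    "Sannu duniya",         # Hausa
    "Akkam, addunyaa!",     # Oromo
    "Bawo ni, ayé",         # Yoruba
    "አማርኛ ቋንቋ",              # Amharic (Ge'ez script)
    "Hello, world",         # English
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=500,         # the released tokenizer uses 50,000
    min_frequency=1,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>"],  # assumed, not confirmed
)

enc = tokenizer.encode("Habari ya dunia")
print(enc.tokens, enc.ids)
```

Byte-level BPE starts from the 256-byte alphabet, which is why no character of any script can ever be out-of-vocabulary; efficiency on Ge'ez then comes from learning merges over its multi-byte UTF-8 sequences.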

๐Ÿ” Key Strengths

  • ✅ Nearly 4x more efficient than Gemma-3 on Amharic (4 vs 15 tokens)
  • ✅ Over 4x more efficient than Qwen-3 on Tigrinya (5 vs 21 tokens)
  • ✅ Up to 2x more efficient on code-switched text (10 vs 17-21 tokens)
  • ✅ Best-in-class in 6 of 8 language categories
  • ✅ Optimized for Ge'ez script (Amharic, Tigrinya)
  • ✅ Excellent on Cushitic languages (Oromo)

📈 Efficiency Gains

Compared to Gemma-3:

  • 73% fewer tokens for Amharic
  • 62% fewer tokens for Tigrinya
  • 55% fewer tokens for Oromo
  • 34% fewer tokens overall (59 vs 90)

Compared to Qwen-3:

  • 78% fewer tokens for Amharic
  • 76% fewer tokens for Tigrinya
  • 52% fewer tokens for code-switching
  • 44% fewer tokens overall
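These percentages can be recomputed from the battle-table counts: each one is (baseline − Africa-BBPE) / baseline, with the overall figures using the total-token row (59 vs 90 and 105). A quick check of a few of the comparisons:

```python
def reduction(ours: int, baseline: int) -> float:
    """Percentage of tokens saved relative to a baseline tokenizer."""
    return 100 * (baseline - ours) / baseline

# Counts taken from the battle table (Africa-BBPE vs baseline).
print(f"Amharic vs Gemma-3:  {reduction(4, 15):.0f}% fewer tokens")   # ~73%
print(f"Tigrinya vs Qwen-3:  {reduction(5, 21):.0f}% fewer tokens")   # ~76%
print(f"Overall vs Gemma-3:  {reduction(59, 90):.0f}% fewer tokens")  # ~34%
print(f"Overall vs Qwen-3:   {reduction(59, 105):.0f}% fewer tokens") # ~44%
```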

📜 License

MIT

๐Ÿ™ Acknowledgments

  • Trained on NaolBM/african-corpus
  • Built with Hugging Face Tokenizers library
  • All original dataset contributors
