miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization
miLLi 1.0 is a hybrid tokenizer specifically engineered for the Azerbaijani language, addressing the limitations of standard statistical models (e.g., BPE, WordPiece) in processing agglutinative morphologies. By integrating a rule-based root dictionary with statistical learning, the model prioritizes morphological integrity and semantic root preservation over purely frequency-based compression.
The model introduces a dynamic Phonological Restoration algorithm designed to map allomorphic variations (e.g., vowel loss, consonant mutations) back to their canonical root forms during the pre-tokenization phase.
1. Methodology and Architecture
The architecture of miLLi 1.0 is built upon a three-stage hybrid pipeline:
Linguistic Pre-processing:
- Utilization of a cleaned root dictionary based on Mozilla's az.dic.
- Implementation of an Aho-Corasick-based Trie structure for efficient root matching (see the sketch after this list).
- Case Handling: Application of a special <UPPER> token strategy to consolidate the vocabulary and preserve Named Entity Recognition (NER) signals without case-sensitivity redundancy.
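The pre-processing stage can be sketched roughly as follows. This is a minimal illustration, not the shipped implementation: the toy root list, the helper names (RootTrie, pre_tokenize), and the exact <UPPER> marker placement are assumptions, and a plain character trie stands in for the Aho-Corasick automaton built from the cleaned az.dic roots.

```python
# Sketch of the pre-processing stage: trie-backed root lookup plus <UPPER> case handling.
# Roots, helper names, and marker placement are illustrative assumptions only.

class RootTrie:
    """Character trie supporting longest-prefix root matching."""

    def __init__(self, roots):
        self.root = {}
        for word in roots:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True  # end-of-root marker

    def longest_root(self, word):
        """Return the longest dictionary root that prefixes `word`, or None."""
        node, best = self.root, None
        for i, ch in enumerate(word):
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                best = word[: i + 1]
        return best


def pre_tokenize(word, trie):
    """Lowercase capitalized words, emit <UPPER>, and split off the dictionary root."""
    pieces = []
    if word[:1].isupper():
        pieces.append("<UPPER>")   # preserve the casing signal for NER without a cased vocabulary
        word = word.lower()
    root = trie.longest_root(word)
    if root is None:
        return pieces + [word]     # unmatched words fall through to the BPE stage later
    suffix = word[len(root):]
    return pieces + [root] + ([suffix] if suffix else [])


# Toy example (roots are illustrative, not taken from az.dic):
trie = RootTrie(["vətən", "bayraq"])
print(pre_tokenize("Vətənimizin", trie))  # ['<UPPER>', 'vətən', 'imizin']
```

Emitting <UPPER> as a separate marker lets a single lowercase root entry serve both cased and uncased surface forms while still signaling capitalization downstream.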
Phonological Restoration:
- A dynamic algorithm that identifies phonetically modified stems (e.g., q→ğ and k→y mutations) and restores them to their lemma forms before segmentation (illustrated in the sketch after this list).
- Adopts a "Longest Restored Match" principle to ensure valid morphological segmentation.
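A simplified sketch of the restoration step, assuming only the two consonant mutations mentioned above (q→ğ, k→y). The mutation table, helper names, and candidate ordering are illustrative; the released algorithm additionally handles other allomorphic variations such as vowel loss.

```python
# Sketch of phonological restoration: undo stem-final consonant mutations before root lookup.
# The mutation table and candidate generation are assumptions for illustration; the released
# model applies a broader rule set under the "Longest Restored Match" principle.

MUTATIONS = {"ğ": "q", "y": "k"}  # surface consonant -> canonical root-final consonant

def restoration_candidates(word, roots):
    """Yield (restored_root, suffix) pairs whose restored stem is a dictionary root."""
    for i in range(len(word), 0, -1):          # longest stems first
        stem, suffix = word[:i], word[i:]
        if stem in roots:
            yield stem, suffix
        last = stem[-1]
        if last in MUTATIONS:
            restored = stem[:-1] + MUTATIONS[last]
            if restored in roots:
                yield restored, suffix

def longest_restored_match(word, roots):
    """Return the first (i.e. longest) candidate, mirroring the Longest Restored Match rule."""
    return next(restoration_candidates(word, roots), None)

# Toy example: "bayraq" surfaces as "bayrağı" (q→ğ); restoration maps it back to the lemma.
roots = {"bayraq", "çörək"}
print(longest_restored_match("bayrağı", roots))   # ('bayraq', 'ı')
print(longest_restored_match("çörəyi", roots))    # ('çörək', 'i')
```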
Statistical Subword Segmentation:
- A Byte-Pair Encoding (BPE) model trained on the CulturaX Azerbaijani corpus (500k-line subset) is applied to the remaining suffixes and out-of-vocabulary terms (a training sketch follows below).
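The statistical stage can be approximated with the Hugging Face tokenizers library. In the sketch below, the corpus path, vocabulary size, and special-token list are assumptions rather than the released training configuration; the point is only that the BPE model is trained once on raw text and then applied to whatever material the linguistic stages leave unresolved.

```python
# Sketch of the statistical fallback stage (path and hyperparameters are illustrative,
# not the released training setup).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,                         # assumption; not the released vocabulary size
    special_tokens=["[UNK]", "<UPPER>"],      # keep the case marker atomic
)
bpe.train(files=["culturax_az_500k.txt"], trainer=trainer)  # hypothetical corpus file

# At inference, the BPE model only sees what the linguistic stages could not resolve,
# e.g. an unmatched suffix or an out-of-vocabulary word:
print(bpe.encode("yüksəkliklərdə").tokens)
```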
2. Empirical Evaluation
The performance of miLLi 1.0 was evaluated using a dual-strategy approach: Quantitative Efficiency (measured on the Tatoeba corpus) and Qualitative Accuracy (measured on a curated Morphological Challenge Set).
2.1. Quantitative Analysis: Token Efficiency
Metric: Token/Word (T/W) Ratio (Lower indicates higher compression)
Evaluations on the Tatoeba corpus (~4,500 sentences) demonstrate that miLLi 1.0 offers a balanced representation. While local statistical models achieve higher compression through whole-word memorization, miLLi 1.0 significantly outperforms global multilingual models.
| Model | Category | T/W Ratio |
|---|---|---|
| aLLMA | Local (Statistical) | 1.418 |
| AzeBERT | Local (Statistical) | 1.571 |
| miLLi 1.0 | Local (Hybrid) | 1.955 |
| GPT-4o | Global (SOTA) | 2.387 |
| mBERT | Global (Multilingual) | 2.521 |
| GPT-3.5 | Global (Legacy) | 3.490 |
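For reproducibility, the T/W ratio can be measured with a short loop over a sentence-per-line file, as in the sketch below; the file path and whitespace word splitting are assumptions, not necessarily the exact protocol behind the table.

```python
# Sketch of the Token/Word ratio measurement (corpus path and word splitting are assumptions).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)

def token_word_ratio(sentences):
    """Total subword tokens divided by total whitespace-separated words."""
    n_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

with open("tatoeba_az.txt", encoding="utf-8") as f:   # hypothetical sentence-per-line file
    sentences = [line.strip() for line in f if line.strip()]

print(f"T/W ratio: {token_word_ratio(sentences):.3f}")
```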
2.2. Qualitative Analysis: Linguistic Robustness
Metrics: Morphological Boundary Accuracy (MBA) & Root Consistency Rate (RCR)
This evaluation measures the model's ability to correctly identify the linguistic root and restore phonetically modified stems.
- MBA: Measures the percentage of words split correctly at the root-suffix boundary.
| Model | MBA (%) |
|---|---|
| miLLi 1.0 | 53.0% |
| XLM-RoBERTa | 38.0% |
| aLLMA | 16.0% |
| mBERT | 11.0% |
| GPT-4o | 4.0% |
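The MBA figure above can be scored roughly as in the sketch below. The challenge-set format (word plus gold root) and the matching rule (the first non-marker piece must equal the gold root) are assumptions for illustration; the curated Morphological Challenge Set and the exact scoring protocol are not reproduced here.

```python
# Sketch of Morphological Boundary Accuracy scoring (data format and matching rule are assumptions).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)

def mba(challenge_pairs):
    """Share of words whose first non-marker token equals the gold root."""
    correct = 0
    for word, gold_root in challenge_pairs:
        pieces = [t for t in tokenizer.tokenize(word) if t != "<UPPER>"]
        # Strip common subword prefix markers, if the tokenizer uses any:
        if pieces and pieces[0].lstrip("▁Ġ#") == gold_root:
            correct += 1
    return correct / len(challenge_pairs)

# Toy challenge pairs (illustrative, not the curated set used for the table above):
pairs = [("bayrağı", "bayraq"), ("kitablarımız", "kitab")]
print(f"MBA: {mba(pairs):.1%}")
```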
3. Usage
The model is compatible with the Hugging Face `transformers` library. Due to the custom Python logic required for phonological restoration, the `trust_remote_code=True` parameter is mandatory.
Installation

```bash
pip install transformers tokenizers
```

Example

```python
from transformers import AutoTokenizer

# Load the tokenizer (the custom pre-tokenization logic requires trust_remote_code)
tokenizer = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)

text = "Vətənimizin bayrağı yüksəkliklərdə dalğalanır."

# Tokenize
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode
input_ids = tokenizer.encode(text)
print(input_ids)
```
4. Limitations
- Dictionary Dependence: The restoration capability is strictly limited to the coverage of the underlying root dictionary. Neologisms and dialectisms not present in the dictionary are processed via standard BPE without restoration.
- Computational Latency: The pre-tokenization layer (Trie search and rule-based restoration) introduces a slight inference latency compared to purely C++-optimized tokenizers.
- Sequence Length: The use of the <UPPER> token for capitalization handling results in a marginal increase in sequence length compared to cased tokenizers.
5. Citation
If you use miLLi 1.0 in your research, please cite it as follows:
APA Style: Rahimov, E. (2025). miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization. Hugging Face. https://huggingface.co/elshadrahimov/miLLi-1.0
BibTeX:

```bibtex
@misc{rahimov2025milli,
  author       = {Rahimov, Elshad},
  title        = {miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/elshadrahimov/miLLi-1.0}},
  note         = {Hugging Face Model Hub}
}
```