miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization

miLLi 1.0 is a hybrid tokenizer specifically engineered for the Azerbaijani language, addressing the limitations of standard statistical models (e.g., BPE, WordPiece) in processing agglutinative morphologies. By integrating a rule-based root dictionary with statistical learning, the model prioritizes morphological integrity and semantic root preservation over purely frequency-based compression.

The model introduces a dynamic Phonological Restoration algorithm designed to map allomorphic variations (e.g., vowel loss, consonant mutations) back to their canonical root forms during the pre-tokenization phase.

1. Methodology and Architecture

The architecture of miLLi 1.0 is built upon a three-stage hybrid pipeline:

  1. Linguistic Pre-processing:

    • Utilization of a cleaned root dictionary based on Mozilla's az.dic.
    • Implementation of an Aho-Corasick based Trie structure for efficient root matching.
    • Case Handling: Application of a special <UPPER> token strategy to consolidate vocabulary and preserve Named Entity Recognition (NER) signals without case-sensitivity redundancy.
  2. Phonological Restoration:

    • A dynamic algorithm that identifies phonetically modified stems (e.g., q-ğ, k-y mutations) and restores them to their lemma forms before segmentation.
    • Adopts a "Longest Restored Match" principle to ensure valid morphological segmentation (illustrated in the sketch after this list).
  3. Statistical Subword Segmentation:

    • A Byte-Pair Encoding (BPE) model trained on the CulturaX Azerbaijani corpus (500k line subset) is applied to the remaining suffixes and out-of-vocabulary terms.
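
As a concrete illustration of stages 1 and 2, the sketch below shows root matching with phonological restoration under a "Longest Restored Match" policy on a toy dictionary. This is a simplified illustration, not the model's actual implementation: the three-word root set, the two mutation rules, and the helper names are hypothetical stand-ins for the full az.dic dictionary and the Aho-Corasick Trie.

ROOTS = {"qonaq", "çörək", "bayraq"}       # toy stand-in for the cleaned az.dic root dictionary
MUTATIONS = {"ğ": "q", "y": "k"}           # stem-final consonant mutations to undo (ğ→q, y→k)

def restore_stem(candidate):
    """Return the canonical root for a possibly mutated stem, or None."""
    if candidate in ROOTS:
        return candidate
    if candidate and candidate[-1] in MUTATIONS:
        restored = candidate[:-1] + MUTATIONS[candidate[-1]]
        if restored in ROOTS:
            return restored
    return None

def split_root(word):
    """Longest Restored Match: try the longest prefix first and shrink until a root is found."""
    for end in range(len(word), 0, -1):
        root = restore_stem(word[:end])
        if root is not None:
            return root, word[end:]        # (canonical root, remaining suffix string)
    return None                            # no dictionary root: the whole word falls through to BPE

# "bayrağı" (flag + accusative): the stem-final q of "bayraq" surfaces as ğ before a vowel
print(split_root("bayrağı"))               # -> ('bayraq', 'ı')

In the actual pipeline, the remaining suffix string (or any word for which no dictionary root is found) is then segmented by the BPE model described in stage 3.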

2. Empirical Evaluation

The performance of miLLi 1.0 was evaluated using a dual-strategy approach: Quantitative Efficiency (measured on the Tatoeba corpus) and Qualitative Accuracy (measured on a curated Morphological Challenge Set).

2.1. Quantitative Analysis: Token Efficiency

Metric: Token/Word (T/W) Ratio (Lower indicates higher compression)

Evaluations on the Tatoeba corpus (~4,500 sentences) show that miLLi 1.0 strikes a balance between compression and morphological transparency: local statistical models achieve higher compression through whole-word memorization, while miLLi 1.0 significantly outperforms global multilingual models.

| Model      | Category              | T/W Ratio |
|------------|-----------------------|-----------|
| aLLMA      | Local (Statistical)   | 1.418     |
| AzeBERT    | Local (Statistical)   | 1.571     |
| miLLi 1.0  | Local (Hybrid)        | 1.955     |
| GPT-4o     | Global (SOTA)         | 2.387     |
| mBERT      | Global (Multilingual) | 2.521     |
| GPT-3.5    | Global (Legacy)       | 3.490     |
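
For reference, the T/W ratio is the total number of generated tokens divided by the total number of whitespace-separated words in the evaluation corpus. A minimal sketch of this computation, assuming only the standard Hugging Face tokenize() method and a hypothetical sentence list standing in for the Tatoeba subset:

def token_word_ratio(tokenizer, sentences):
    """T/W ratio: total generated tokens divided by total whitespace-separated words."""
    total_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# Hypothetical two-sentence stand-in for the evaluation set:
# ratio = token_word_ratio(tokenizer, [
#     "Vətənimizin bayrağı yüksəkliklərdə dalğalanır.",
#     "Bakı Azərbaycanın paytaxtıdır.",
# ])
# print(f"T/W ratio: {ratio:.3f}")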

2.2. Qualitative Analysis: Linguistic Robustness

Metrics: Morphological Boundary Accuracy (MBA) & Root Consistency Rate (RCR)

This evaluation measures the model's ability to correctly identify the linguistic root and restore phonetically modified stems.

  • MBA: Measures the percentage of words split correctly at the root-suffix boundary.

| Model       | MBA (%) |
|-------------|---------|
| miLLi 1.0   | 53.0    |
| XLM-RoBERTa | 38.0    |
| aLLMA       | 16.0    |
| mBERT       | 11.0    |
| GPT-4o      | 4.0     |
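
Morphological Boundary Accuracy can be scored by checking whether a tokenizer's first subword, stripped of any continuation markers, matches the annotated root. The sketch below is an assumption about how such an evaluation could be written, not the exact script behind the table above; the (word, gold_root) pairs and the marker handling are illustrative.

def boundary_accuracy(tokenizer, pairs):
    """Share of words whose first subword (minus continuation markers) equals the gold root."""
    correct = 0
    for word, gold_root in pairs:
        tokens = tokenizer.tokenize(word)
        first = tokens[0].lstrip("▁").replace("##", "") if tokens else ""
        if first == gold_root:
            correct += 1
    return correct / len(pairs)

# Hypothetical gold annotations:
# pairs = [("bayrağı", "bayraq"), ("kitablar", "kitab")]
# print(f"MBA: {boundary_accuracy(tokenizer, pairs):.1%}")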

3. Usage

The model is compatible with the Hugging Face transformers library. Due to the custom Python logic required for phonological restoration, the trust_remote_code=True parameter is mandatory.

Installation

pip install transformers tokenizers

Example

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)

text = "Vətənimizin bayrağı yüksəkliklərdə dalğalanır."

# Tokenize
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode
input_ids = tokenizer.encode(text)
print(input_ids)
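
Assuming the custom tokenizer otherwise follows the standard transformers API (an assumption, not verified against the remote code), the usual encode/decode round trip should also apply:

# Encode to input IDs and decode back to text
encoded = tokenizer(text)
print(encoded["input_ids"])

decoded = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
print(decoded)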

4. Limitations

  • Dictionary Dependence: The restoration capability is strictly limited to the coverage of the underlying root dictionary. Neologisms and dialectisms not present in the dictionary are processed via standard BPE without restoration.
  • Computational Latency: The pre-tokenization layer (Trie search and rule-based restoration) introduces slight inference latency compared to purely C++-optimized tokenizers.
  • Sequence Length: The use of the <UPPER> token for capitalization handling results in a marginal increase in sequence length compared to cased tokenizers.

5. Citation

If you use miLLi 1.0 in your research, please cite it as follows:

APA Style: Rahimov, E. (2025). miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization. Hugging Face. https://huggingface.co/elshadrahimov/miLLi-1.0

BibTeX:

@misc{rahimov2025milli,
  author = {Rahimov, Elshad},
  title = {miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization},
  year = {2025},
  howpublished = {\url{https://huggingface.co/elshadrahimov/miLLi-1.0}},
  note = {Hugging Face Model Hub}
}
