mBERT fine-tuned for Kyrgyz punctuation

bert-base-multilingual-cased (mBERT) fine-tuned on the 16,028-sentence Kyrgyz punctuation-restoration dataset (the original 14,028 sentences plus 2,000 question-augmentation sentences), as part of Uvalieva & Muhametjanova, "Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models" (under revision at ACM TALLIP, manuscript ID TALLIP-26-0124).

Task

Token-level classification over four labels:

Label     Meaning
O         no punctuation after this token
COMMA     insert a comma after this token
PERIOD    insert a period after this token
QUESTION  insert a question mark after this token
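
To make the label scheme concrete, the sketch below derives (token, label) pairs from an already punctuated sentence. It is an illustration only: the function name `labels_from_text`, the whitespace tokenization, and the lowercasing step are assumptions, not the preprocessing used in the paper.

```python
# Illustrative only: maps a word's trailing punctuation mark to one of the four labels.
PUNCT2LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def labels_from_text(text: str) -> list[tuple[str, str]]:
    """Split on whitespace and turn each word's trailing mark into a label."""
    pairs = []
    for word in text.split():
        label = PUNCT2LABEL.get(word[-1], "O")
        token = word[:-1] if label != "O" else word
        if token:  # skip a punctuation mark standing on its own
            # Lowercasing is an assumption, chosen to match the lowercase
            # input used in the usage example of this card.
            pairs.append((token.lower(), label))
    return pairs

pairs = labels_from_text("Мен мектепке барам.")
print(pairs)
# [('мен', 'O'), ('мектепке', 'O'), ('барам', 'PERIOD')]
```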

Main metrics

Weighted F1 on the 15% held-out test split of the augmented corpus: 0.930. See the paper and the accompanying GitHub repository for per-class numbers and cross-model comparisons.
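
For readers unfamiliar with the metric: weighted F1 averages the per-class F1 scores, weighting each class by its number of gold tokens (its support). The pure-Python sketch below shows the computation on toy labels. The function name `weighted_f1` and the toy data are made up for illustration and are not the paper's evaluation code.

```python
from collections import Counter

def weighted_f1(gold: list[str], pred: list[str]) -> float:
    """Support-weighted mean of per-class F1 scores (expects non-empty input)."""
    support = Counter(gold)
    total = 0.0
    for label, n_gold in support.items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        n_pred = sum(p == label for p in pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_gold  # weight each class by its gold count
    return total / len(gold)

# Toy example: the single COMMA is missed, everything else is correct.
gold = ["O", "O", "COMMA", "O", "PERIOD"]
pred = ["O", "O", "O", "O", "PERIOD"]
score = weighted_f1(gold, pred)
print(round(score, 3))  # 0.714
```

Because O tokens dominate punctuation data, a weighted score can stay high even when a rare class is predicted poorly, which is why the per-class numbers are worth checking as well.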

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("Zarinaaa/mbert-kyrgyz-punctuation")
model = AutoModelForTokenClassification.from_pretrained("Zarinaaa/mbert-kyrgyz-punctuation").eval()

ID2LABEL = {0: "O", 1: "COMMA", 2: "PERIOD", 3: "QUESTION"}

def restore_punctuation(text: str) -> str:
    words = text.split()
    if not words:  # nothing to punctuate; avoids tokenizing an empty list
        return text
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**enc).logits.squeeze(0)
    preds = logits.argmax(dim=-1).tolist()
    word_ids = enc.word_ids(batch_index=0)

    # Take the label of the LAST subtoken of each word
    label_per_word = [None] * len(words)
    for i in range(len(word_ids) - 1, -1, -1):
        wid = word_ids[i]
        if wid is None:
            continue
        if label_per_word[wid] is None:
            label_per_word[wid] = ID2LABEL[preds[i]]

    out = []
    for w, lab in zip(words, label_per_word):
        if lab == "COMMA":
            out.append(w + ",")
        elif lab == "PERIOD":
            out.append(w + ".")
        elif lab == "QUESTION":
            out.append(w + "?")
        else:
            out.append(w)
    s = " ".join(out)
    return s[0].upper() + s[1:] if s else s

print(restore_punctuation("мен мектепке барам"))
# → "Мен мектепке барам."
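
The last-subtoken rule in `restore_punctuation` can be checked without loading the model. The toy example below uses hand-written `word_ids` and predictions (both invented for illustration, not real tokenizer or model output) and applies the same right-to-left scan. The helper name `last_subtoken_labels` is hypothetical.

```python
# Hypothetical alignment for three words; None marks special tokens
# such as [CLS] and [SEP]. The second word is split into three subtokens.
word_ids = [None, 0, 1, 1, 1, 2, None]
preds = ["O", "O", "O", "COMMA", "PERIOD", "O", "O"]  # one label per subtoken

def last_subtoken_labels(word_ids, preds, n_words):
    """Keep, for each word, the prediction made on its final subtoken."""
    labels = [None] * n_words
    for i in range(len(word_ids) - 1, -1, -1):  # scan right to left
        wid = word_ids[i]
        if wid is not None and labels[wid] is None:
            # The first hit from the right is the word's last subtoken.
            labels[wid] = preds[i]
    return labels

labels = last_subtoken_labels(word_ids, preds, n_words=3)
print(labels)  # ['O', 'PERIOD', 'O']
```

The COMMA predicted on the middle subtoken of the second word is ignored; only the PERIOD on its final subtoken counts. Words cut off by truncation keep the label `None`, which `restore_punctuation` treats like O.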

Training details

  • Dataset: 16,028 Kyrgyz sentences (144,564 tokens), 76.5/8.5/15 train/validation/test split, seed 42
  • Base model: bert-base-multilingual-cased
  • Optimizer: AdamW, lr 5e-5, weight decay 0.01, warmup ratio 0.1
  • Batch size 16, max length 256, 5 epochs, FP16
  • Hardware: single NVIDIA RTX 5080
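
As a rough consistency check on the figures above, the snippet below estimates the split sizes and the number of optimizer steps. It assumes the three percentages are train/validation/test in that order, no gradient accumulation, and simple rounding; the exact counts in the paper may differ by a sentence or two, and the warmup length by a step.

```python
import math

n_sentences = 16_028
fractions = {"train": 0.765, "validation": 0.085, "test": 0.15}  # assumed order

# 76.5/8.5/15 is what an 85/15 split followed by a 90/10 split of the
# larger part would produce: 0.85 * 0.9 = 0.765 and 0.85 * 0.1 = 0.085.
sizes = {name: round(n_sentences * frac) for name, frac in fractions.items()}

batch_size, epochs, warmup_ratio = 16, 5, 0.1
steps_per_epoch = math.ceil(sizes["train"] / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)  # roughly the first 10% of steps

print(sizes)
print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the model sees roughly 12,000 training sentences per epoch and a few thousand optimizer steps in total, so the learning rate ramps up over the first few hundred steps.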

Code and data

Citation

@article{uvalieva2026kyrgyz,
  author  = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title   = {Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models},
  journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
  year    = {2026},
  note    = {Under revision}
}