mBERT fine-tuned for Kyrgyz punctuation

This model is bert-base-multilingual-cased (mBERT) fine-tuned on the 16,028-sentence Kyrgyz punctuation-restoration dataset (the original 14,028 sentences plus 2,000 question-augmentation sentences). It was trained as part of Uvalieva & Muhametjanova, "Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models" (under revision at ACM TALLIP, manuscript ID TALLIP-26-0124).

Task

Token-level classification over four labels:

Label	Meaning
`O`	no punctuation after this token
`COMMA`	insert a comma after this token
`PERIOD`	insert a period after this token
`QUESTION`	insert a question mark after this token
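
To make the label scheme concrete, the sketch below derives (word, label) pairs from an already-punctuated sentence. It is an illustration only: the helper name `text_to_labels` and the rule of stripping a single trailing mark are assumptions, not the authors' preprocessing code.

```python
# Hypothetical helper: maps a trailing punctuation mark to one of the four labels.
MARK2LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def text_to_labels(sentence: str) -> list[tuple[str, str]]:
    """Pair each word with the label of the mark that followed it."""
    pairs = []
    for token in sentence.split():
        label = MARK2LABEL.get(token[-1], "O")
        word = token[:-1] if label != "O" else token
        if word:  # skip tokens that consisted only of a punctuation mark
            pairs.append((word, label))
    return pairs

print(text_to_labels("Мен мектепке барам."))
```

Each word carries the label of the mark to its right, so restoring punctuation amounts to predicting one label per word.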

Main metrics

Weighted F1 on the 15% held-out test split of the augmented corpus: 0.930. See the paper and the accompanying GitHub repository for per-class numbers and cross-model comparisons.
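
Weighted F1 is the mean of the per-class F1 scores, with each class weighted by its number of true tokens (its support). The pure-Python sketch below shows that computation on toy labels invented for illustration. It is not the evaluation script behind the reported 0.930, and the function name `weighted_f1` is hypothetical.

```python
from collections import Counter

def weighted_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Support-weighted mean of per-class F1 over the labels present in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n_true / total
    return score

# Toy data: one COMMA is missed and predicted as O
y_true = ["O", "O", "COMMA", "O", "PERIOD"]
y_pred = ["O", "O", "O", "O", "PERIOD"]
print(round(weighted_f1(y_true, y_pred), 3))  # → 0.714
```

Because the weights follow class support, frequent labels contribute more to the score than rare ones, which is why the per-class numbers are worth checking as well.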

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("Zarinaaa/mbert-kyrgyz-punctuation")
model = AutoModelForTokenClassification.from_pretrained("Zarinaaa/mbert-kyrgyz-punctuation").eval()

ID2LABEL = {0: "O", 1: "COMMA", 2: "PERIOD", 3: "QUESTION"}

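def restore_punctuation(text: str) -> str:
    """Insert the predicted punctuation mark after each whitespace-separated word."""
    words = text.split()
    if not words:  # empty or whitespace-only input: nothing to tag
        return ""
    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True, max_length=256)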
    with torch.no_grad():
        logits = model(**enc).logits.squeeze(0)
    preds = logits.argmax(dim=-1).tolist()
    word_ids = enc.word_ids(batch_index=0)

    # Take the label of the LAST subtoken of each word.
    # Words cut off by truncation keep None and are emitted without punctuation.
    label_per_word = [None] * len(words)
    for i in range(len(word_ids) - 1, -1, -1):
        wid = word_ids[i]
        if wid is None:
            continue
        if label_per_word[wid] is None:
            label_per_word[wid] = ID2LABEL[preds[i]]

    out = []
    for w, lab in zip(words, label_per_word):
        if lab == "COMMA":
            out.append(w + ",")
        elif lab == "PERIOD":
            out.append(w + ".")
        elif lab == "QUESTION":
            out.append(w + "?")
        else:
            out.append(w)
    s = " ".join(out)
    return s[0].upper() + s[1:] if s else s

print(restore_punctuation("мен мектепке барам"))
# → "Мен мектепке барам."
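
Because the tokenizer call above truncates at `max_length=256` subword tokens, words past that limit receive no label and come back unpunctuated. One possible workaround, sketched here as an assumption rather than a documented feature of the model, is to split long input into fixed-size word windows and restore each window separately. The helper name `chunk_words` and the default window of 100 words are hypothetical.

```python
from typing import Iterator

def chunk_words(text: str, size: int = 100) -> Iterator[str]:
    """Yield consecutive windows of at most `size` whitespace-separated words."""
    words = text.split()
    for start in range(0, len(words), size):
        yield " ".join(words[start:start + size])

# Each window can then be passed to restore_punctuation() from the usage code:
# restored = " ".join(restore_punctuation(chunk) for chunk in chunk_words(long_text))
print(list(chunk_words("a b c d e", size=2)))
```

Two caveats apply: a window boundary can fall in the middle of a sentence, which may degrade predictions near the cut, and `restore_punctuation` capitalizes the first letter of every window it receives.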

Training details

Dataset: 16,028 Kyrgyz sentences (144,564 tokens), 76.5/8.5/15 train/validation/test split, seed 42
Base model: bert-base-multilingual-cased
Optimizer: AdamW, lr 5e-5, weight decay 0.01, warmup ratio 0.1
Schedule: batch size 16, max sequence length 256, 5 epochs, FP16
Hardware: single NVIDIA RTX 5080
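
The 76.5/8.5/15 split is consistent with holding out 15% for testing and then dividing the remaining 85% into 90% train and 10% validation (0.85 * 0.9 = 0.765). That two-stage reading is an inference from the percentages, not something stated above. Under that assumption, and assuming simple rounding of the split sizes and no gradient accumulation, the sketch below estimates how many optimizer and warmup steps the listed hyperparameters imply.

```python
import math

N_SENTENCES = 16_028                       # from the dataset description
BATCH_SIZE, EPOCHS, WARMUP_RATIO = 16, 5, 0.1

# Assumed two-stage split: 15% test, then 10% of the remainder for validation
n_test = round(N_SENTENCES * 0.15)
n_val = round((N_SENTENCES - n_test) * 0.10)
n_train = N_SENTENCES - n_test - n_val

steps_per_epoch = math.ceil(n_train / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print("train/val/test sentences:", n_train, n_val, n_test)
print("steps per epoch, total, warmup:", steps_per_epoch, total_steps, warmup_steps)
```

The exact counts depend on how the original training script rounded the splits, so these figures are estimates only.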

Code and data

Citation

@article{uvalieva2026kyrgyz,
  author  = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title   = {Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models},
  journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
  year    = {2026},
  note    = {Under revision}
}
