mBERT fine-tuned for Kyrgyz punctuation
Fine-tuned on the 16,028-sentence Kyrgyz punctuation-restoration dataset (original 14,028 + 2,000 question-augmentation sentences) as part of Uvalieva & Muhametjanova, Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models (under revision at ACM TALLIP, manuscript ID TALLIP-26-0124).
Task
Token-level classification over four labels:
| Label | Meaning |
|---|---|
| `O` | no punctuation after this token |
| `COMMA` | insert a comma after this token |
| `PERIOD` | insert a period after this token |
| `QUESTION` | insert a question mark after this token |
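To make the labeling scheme concrete, here is a minimal sketch of the inverse direction: turning an already punctuated sentence into (word, label) pairs. The helper name `text_to_labels` and the lowercasing step are assumptions made for illustration; the preprocessing actually used for the dataset may differ (see the GitHub repository).

```python
# Hypothetical helper, not part of the released code.
# Maps a trailing punctuation mark to the label of the word it follows.
PUNCT2LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def text_to_labels(text: str) -> list[tuple[str, str]]:
    pairs = []
    for token in text.split():
        label = "O"
        if token[-1] in PUNCT2LABEL:
            label = PUNCT2LABEL[token[-1]]
            token = token[:-1]  # strip the mark from the word itself
        if token:  # skip stand-alone punctuation
            # Lowercasing is an assumption about the model's input format.
            pairs.append((token.lower(), label))
    return pairs

pairs = text_to_labels("Ооба, мен мектепке барам.")
print(pairs)
# → [('ооба', 'COMMA'), ('мен', 'O'), ('мектепке', 'O'), ('барам', 'PERIOD')]
```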
Main metrics
Weighted F1 on the 15% held-out test split of the augmented corpus: 0.930. See the paper and the accompanying GitHub repository for per-class numbers and cross-model comparisons.
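For readers unfamiliar with the metric, weighted F1 is the per-class F1 averaged with weights proportional to each class's share of the true labels. The sketch below computes it on toy data. The function name and the toy labels are illustrative only, and whether the reported score was computed per token or per word is not stated here.

```python
from collections import Counter

def weighted_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Per-class F1, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score

# Toy example, not real model output.
y_true = ["O", "O", "COMMA", "PERIOD"]
y_pred = ["O", "COMMA", "COMMA", "PERIOD"]
toy_score = weighted_f1(y_true, y_pred)
print(round(toy_score, 3))
# → 0.75
```

Because `O` dominates punctuation data, a weighted average is pulled toward performance on `O`; the per-class numbers in the paper give a fuller picture.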
Usage
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    import torch

    tokenizer = AutoTokenizer.from_pretrained("Zarinaaa/mbert-kyrgyz-punctuation")
    model = AutoModelForTokenClassification.from_pretrained("Zarinaaa/mbert-kyrgyz-punctuation").eval()

    ID2LABEL = {0: "O", 1: "COMMA", 2: "PERIOD", 3: "QUESTION"}

    def restore_punctuation(text: str) -> str:
        words = text.split()
        enc = tokenizer(words, is_split_into_words=True,
                        return_tensors="pt", truncation=True, max_length=256)
        with torch.no_grad():
            logits = model(**enc).logits.squeeze(0)
        preds = logits.argmax(dim=-1).tolist()
        word_ids = enc.word_ids(batch_index=0)

        # Take the label of the LAST subtoken of each word
        label_per_word = [None] * len(words)
        for i in range(len(word_ids) - 1, -1, -1):
            wid = word_ids[i]
            if wid is None:
                continue
            if label_per_word[wid] is None:
                label_per_word[wid] = ID2LABEL[preds[i]]

        out = []
        for w, lab in zip(words, label_per_word):
            if lab == "COMMA":
                out.append(w + ",")
            elif lab == "PERIOD":
                out.append(w + ".")
            elif lab == "QUESTION":
                out.append(w + "?")
            else:
                out.append(w)
        s = " ".join(out)
        return s[0].upper() + s[1:] if s else s

    print(restore_punctuation("мен мектепке барам"))
    # → "Мен мектепке барам."
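Note that `restore_punctuation` truncates its input at 256 subword tokens, so words past that limit come back without punctuation. One possible workaround is to split long inputs into smaller word chunks first. The sketch below assumes a hypothetical helper `chunk_words` and an arbitrary chunk size of 100 words; neither comes from the released code, and a word count is only a rough proxy for the subword limit.

```python
def chunk_words(words: list[str], max_words: int = 100) -> list[list[str]]:
    """Split a word list into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be positive")
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]

chunks = chunk_words("a b c d e".split(), max_words=2)
print(chunks)
# → [['a', 'b'], ['c', 'd'], ['e']]
```

Each chunk can then be joined with spaces and passed to `restore_punctuation`. Two caveats: the model sees no context across chunk boundaries, and `restore_punctuation` capitalizes the first letter of every chunk it returns, so the joined output may need post-processing.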
Training details
- Dataset: 16,028 Kyrgyz sentences (144,564 tokens), 76.5/8.5/15 train/validation/test split, seed 42
- Base model: `bert-base-multilingual-cased`
- Optimizer: AdamW, lr 5e-5, weight decay 0.01, warmup ratio 0.1
- Batch size 16, max length 256, 5 epochs, FP16
- Hardware: single NVIDIA RTX 5080
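As a rough illustration of what these numbers imply, the sketch below reproduces a 76.5/8.5/15 split and derives approximate step counts. It assumes the split was made by holding out 15% for testing and then 10% of the remainder for validation, with a plain seeded shuffle and no gradient accumulation. These are guesses for illustration; the exact procedure, and therefore the exact split sizes, are defined by the code in the GitHub repository.

```python
import math
import random

def split_sizes(n: int, test_frac: float = 0.15, val_frac_of_rest: float = 0.10):
    """Return (train, val, test) sizes; 0.85 * 0.10 = 8.5% validation overall."""
    n_test = round(n * test_frac)
    n_val = round((n - n_test) * val_frac_of_rest)
    return n - n_test - n_val, n_val, n_test

n_train, n_val, n_test = split_sizes(16028)

# Seeded shuffle of sentence indices (assumed; a stratified split would differ).
indices = list(range(16028))
random.Random(42).shuffle(indices)
test_idx = indices[:n_test]
val_idx = indices[n_test:n_test + n_val]
train_idx = indices[n_test + n_val:]

# Step counts for batch size 16, 5 epochs, warmup ratio 0.1.
steps_per_epoch = math.ceil(n_train / 16)
total_steps = 5 * steps_per_epoch
warmup_steps = math.ceil(0.1 * total_steps)

print(n_train, n_val, n_test)
# → 12262 1362 2404
print(steps_per_epoch, total_steps, warmup_steps)
# → 767 3835 384
```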
Code and data
- GitHub: https://github.com/Zarina33/kyrgyz-punctuation-restoration
- Dataset: https://huggingface.co/datasets/Zarinaaa/kyrgyz-punctuation-dataset
Citation
@article{uvalieva2026kyrgyz,
author = {Uvalieva, Zarina and Muhametjanova, Gulshat},
title = {Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models},
journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
year = {2026},
note = {Under revision}
}