KhmerTagger: Inverse Text Normalization for Khmer ASR

KhmerTagger is a model for inverse text normalization (ITN) of Khmer automatic speech recognition (ASR) output. It restores punctuation and recognizes numbers to make raw ASR text more readable.

Model Description

The model uses XLM-RoBERTa as its encoder, followed by a bidirectional LSTM layer and two classification heads:

Punctuation head: Predicts punctuation marks (space, comma, question mark, exclamation mark, etc.)
Number head: Identifies and tags numeric entities in the text
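
The layout described above can be sketched roughly as follows. This is an illustrative sketch, not the repository's actual KhmerTagger class: the class name TaggerSketch, the lstm_size value, and the StandInEncoder (a plain embedding layer used so the example runs offline) are all assumptions. In practice the encoder would be XLM-RoBERTa, whose base variant produces 768-dimensional hidden states.

```python
import torch
from torch import nn


class StandInEncoder(nn.Module):
    """Hypothetical placeholder for XLM-RoBERTa so the sketch runs offline."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        # Returns per-token hidden states: (batch, seq_len, hidden_size).
        return self.embed(input_ids)


class TaggerSketch(nn.Module):
    """Encoder -> bidirectional LSTM -> two per-token classification heads."""

    def __init__(self, encoder, hidden_size, lstm_size=256,
                 n_punct_features=5, n_num_features=3):
        super().__init__()
        self.encoder = encoder
        self.lstm = nn.LSTM(hidden_size, lstm_size,
                            batch_first=True, bidirectional=True)
        # A bidirectional LSTM concatenates both directions: 2 * lstm_size.
        self.punct_head = nn.Linear(2 * lstm_size, n_punct_features)
        self.num_head = nn.Linear(2 * lstm_size, n_num_features)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask)
        lstm_out, _ = self.lstm(hidden)
        # One logit vector per token from each head.
        return self.punct_head(lstm_out), self.num_head(lstm_out)


# Dummy batch: 2 sequences of 7 token ids each (values are arbitrary).
model = TaggerSketch(StandInEncoder(vocab_size=100, hidden_size=32), hidden_size=32)
input_ids = torch.randint(0, 100, (2, 7))
punct_logits, num_logits = model(input_ids)
```

Both heads read the same LSTM output, so each token receives one punctuation prediction and one number-tag prediction in a single forward pass.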

Usage

from transformers import XLMRobertaTokenizer
import torch
from model import KhmerTagger

# Load tokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")

# Build the model (5 punctuation classes, 3 number-tag classes) and load the trained weights
model = KhmerTagger(n_punct_features=5, n_num_features=3)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu", weights_only=True))
model.eval()

# Your inference code here...
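
One possible way to turn per-token predictions into readable text is sketched below. The label inventories PUNCT_LABELS and NUM_LABELS are hypothetical: the actual id-to-label mapping is not documented here and must be taken from the repository. The sketch also assumes one prediction per token and ignores subword alignment. Predicted ids would typically come from taking the argmax over the last dimension of each head's logits.

```python
# Hypothetical label inventories; the real mapping may differ.
PUNCT_LABELS = {0: "", 1: " ", 2: ",", 3: "?", 4: "!"}
NUM_LABELS = {0: "O", 1: "B", 2: "I"}  # assumed BIO-style number tags


def apply_punctuation(tokens, punct_ids):
    """Append the predicted punctuation mark (if any) after each token."""
    return "".join(tok + PUNCT_LABELS[i] for tok, i in zip(tokens, punct_ids))


def number_spans(num_ids):
    """Return (start, end) token index pairs for tagged number spans."""
    spans = []
    start = None
    for i, label_id in enumerate(num_ids):
        tag = NUM_LABELS[label_id]
        if tag == "I" and start is not None:
            continue  # still inside the current span
        if start is not None:
            spans.append((start, i))  # close the open span
            start = None
        if tag == "B":
            start = i  # open a new span
    if start is not None:
        spans.append((start, len(num_ids)))
    return spans


# Placeholder tokens and ids, for illustration only.
text = apply_punctuation(["a", "b", "c", "d"], [0, 1, 2, 3])
spans = number_spans([0, 1, 2, 0, 1])
```

The spans use half-open indexing, so tokens[start:end] yields the tokens of each number entity, which can then be passed to a number formatter.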

Training

The model was trained on 1.5 million tokens of Khmer news data and achieved 97.2% accuracy on the validation set.

Citation

@misc{khmertagger2025,
  author = {Seanghay Yath},
  title = {KhmerTagger: Inverse Text Normalization for Khmer Automatic Speech Recognition},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/seanghay/khmertagger}},
  note = {Open source project for Khmer punctuation restoration and number recognition using XLM-RoBERTa}
}
