KhmerTagger: Inverse Text Normalization for Khmer ASR
KhmerTagger is a model for inverse text normalization (ITN) of Khmer Automatic Speech Recognition (ASR) outputs. It performs punctuation restoration and number recognition to improve readability of raw ASR text.
Model Description
The model uses XLM-RoBERTa as its encoder, followed by a bidirectional LSTM layer and two token-level classification heads:
- Punctuation head: Predicts punctuation marks (space, comma, question mark, exclamation mark, etc.)
- Number head: Identifies and tags numeric entities in the text
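The layers that sit on top of the encoder can be sketched roughly as follows. This is a minimal illustration, not the project's actual `model.py`: the class name `TaggerHeads`, the LSTM width of 256, and the use of random tensors in place of real XLM-RoBERTa outputs are all assumptions made for the example. The input width of 768 matches the hidden size of `xlm-roberta-base`, and the output sizes match the `n_punct_features=5` and `n_num_features=3` values used in the Usage section.

```python
import torch
from torch import nn


class TaggerHeads(nn.Module):
    """Illustrative BiLSTM plus two per-token classifiers (hypothetical name)."""

    def __init__(self, hidden_size=768, lstm_size=256,
                 n_punct_features=5, n_num_features=3):
        super().__init__()
        # Bidirectional LSTM over the encoder's per-token hidden states.
        # lstm_size=256 is an assumed value, not taken from the project.
        self.lstm = nn.LSTM(hidden_size, lstm_size,
                            batch_first=True, bidirectional=True)
        # A bidirectional LSTM concatenates both directions: 2 * lstm_size.
        self.punct_head = nn.Linear(2 * lstm_size, n_punct_features)
        self.num_head = nn.Linear(2 * lstm_size, n_num_features)

    def forward(self, encoder_states):
        # encoder_states: (batch, seq_len, hidden_size)
        lstm_out, _ = self.lstm(encoder_states)
        # One logit vector per token from each head.
        return self.punct_head(lstm_out), self.num_head(lstm_out)


heads = TaggerHeads()
# Random stand-in for encoder output: 2 sequences of 7 tokens each.
fake_states = torch.randn(2, 7, 768)
punct_logits, num_logits = heads(fake_states)
print(tuple(punct_logits.shape), tuple(num_logits.shape))
```

Both heads read the same LSTM output, so each token receives a punctuation prediction and a number-tag prediction in a single forward pass.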
Usage
from transformers import XLMRobertaTokenizer
import torch
from model import KhmerTagger
# Load tokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
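# Build the model with 5 punctuation classes and 3 number classes,
# then load the trained weights onto the CPU
model = KhmerTagger(n_punct_features=5, n_num_features=3)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu", weights_only=True))
# Switch to evaluation mode (disables dropout)
model.eval()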
# Your inference code here...
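Once the punctuation head has produced one label per token, those labels still have to be turned back into text. The sketch below shows one way this post-processing step might look. The label map `ID2PUNCT` and the helper `apply_punctuation` are hypothetical: the real id-to-mark order is defined by the trained checkpoint and is not stated here, so treat the mapping as a placeholder.

```python
# Hypothetical label map with 5 classes. The actual order used by the
# checkpoint is an assumption and must be checked against the project.
ID2PUNCT = {0: "", 1: " ", 2: ",", 3: "?", 4: "!"}


def apply_punctuation(tokens, punct_ids, id2punct=ID2PUNCT):
    """Append the predicted mark (if any) after each token and join the result."""
    if len(tokens) != len(punct_ids):
        raise ValueError("tokens and punct_ids must have the same length")
    pieces = []
    for token, label in zip(tokens, punct_ids):
        # Label 0 maps to the empty string, so the token is left unchanged.
        pieces.append(token + id2punct[label])
    # Join with no separator: spacing comes only from predicted labels.
    return "".join(pieces)


# Placeholder tokens stand in for Khmer words.
restored = apply_punctuation(["a", "b", "c"], [0, 1, 3])
print(restored)  # ab c?
```

Because spaces are predicted as labels rather than assumed between tokens, the pieces are joined with an empty separator.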
Training
The model was trained on 1.5 million tokens of Khmer news data and achieved 97.2% accuracy on the validation set.
Citation
@misc{khmertagger2025,
author = {Seanghay Yath},
title = {KhmerTagger: Inverse Text Normalization for Khmer Automatic Speech Recognition},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/seanghay/khmertagger}},
note = {Open source project for Khmer punctuation restoration and number recognition using XLM-RoBERTa}
}