---
tags:
- khmer
- nlp
- punctuation-restoration
- inverse-text-normalization
- asr
- xlm-roberta
license: mit
language:
- km
---

# KhmerTagger: Inverse Text Normalization for Khmer ASR

KhmerTagger is a model for inverse text normalization (ITN) of Khmer Automatic Speech Recognition (ASR) outputs. It performs punctuation restoration and number recognition to improve readability of raw ASR text.

## Model Description

The model uses XLM-RoBERTa as its encoder, with a bidirectional LSTM layer and two classification heads:

- **Punctuation head**: Predicts punctuation marks (space, comma, question mark, exclamation mark, etc.)
- **Number head**: Identifies and tags numeric entities in the text
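
The layout described above can be sketched roughly as follows. This is a minimal illustration, not the released `KhmerTagger` implementation: the class name `TaggerSketch`, the layer order (encoder, then BiLSTM, then both heads), the `lstm_dim` size, and the toy embedding used in place of XLM-RoBERTa are all assumptions made so the example runs without downloading weights.

```python
import torch
from torch import nn


class TaggerSketch(nn.Module):
    """Illustrative encoder -> BiLSTM -> two-head tagger (hypothetical, not the released model)."""

    def __init__(self, encoder, hidden_size, n_punct_features=5, n_num_features=3, lstm_dim=256):
        super().__init__()
        # Any module mapping (batch, seq) token ids to (batch, seq, hidden_size) states.
        self.encoder = encoder
        self.lstm = nn.LSTM(hidden_size, lstm_dim, batch_first=True, bidirectional=True)
        # A bidirectional LSTM concatenates both directions, hence 2 * lstm_dim.
        self.punct_head = nn.Linear(2 * lstm_dim, n_punct_features)
        self.num_head = nn.Linear(2 * lstm_dim, n_num_features)

    def forward(self, input_ids):
        hidden = self.encoder(input_ids)
        features, _ = self.lstm(hidden)
        # One set of logits per token for each head.
        return self.punct_head(features), self.num_head(features)


# Stand-in encoder (assumption): a plain embedding table instead of XLM-RoBERTa.
toy_encoder = nn.Embedding(num_embeddings=100, embedding_dim=32)
sketch = TaggerSketch(toy_encoder, hidden_size=32, lstm_dim=16)
punct_logits, num_logits = sketch(torch.randint(0, 100, (2, 7)))
print(punct_logits.shape, num_logits.shape)
```

Both heads read the same BiLSTM features, so each token receives a punctuation prediction and a number-tag prediction in a single forward pass.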

## Usage

```python
from transformers import XLMRobertaTokenizer
import torch
from model import KhmerTagger  # defined in a local model.py, not in transformers

# Load the tokenizer of the base encoder
tokenizer = XLMRobertaTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")

# Build the model and load the trained weights on CPU
model = KhmerTagger(n_punct_features=5, n_num_features=3)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu", weights_only=True))
model.eval()  # inference mode: disables dropout

# Your inference code here...
```
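
Once the punctuation head has produced one label per token, the labels still have to be turned back into text. The sketch below shows one way that step might look. The `PUNCT_LABELS` mapping and the `apply_punctuation` helper are hypothetical: this card does not document the actual label ids, so check them against the training code before relying on them.

```python
# Hypothetical label inventory (assumption): the real id-to-mark mapping is
# defined by the training code and is not documented in this card.
PUNCT_LABELS = {0: "", 1: " ", 2: ",", 3: "?", 4: "!"}


def apply_punctuation(tokens, label_ids, labels=PUNCT_LABELS):
    """Append the predicted mark after each token and join the result."""
    if len(tokens) != len(label_ids):
        raise ValueError("tokens and label_ids must have the same length")
    return "".join(tok + labels[i] for tok, i in zip(tokens, label_ids))


# Example with made-up predictions: a space after the first word,
# a question mark after the last one.
restored = apply_punctuation(["សួស្តី", "អ្នក", "សុខសប្បាយ", "ទេ"], [1, 0, 0, 3])
print(restored)
```

Treating the space as a predicted label fits the list of marks given for the punctuation head, since raw Khmer text is not separated into words by whitespace.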

## Training

The model was trained on 1.5 million tokens of Khmer news data and achieved 97.2% accuracy on the validation set.

## Citation

```bibtex
@misc{khmertagger2025,
  author = {Seanghay Yath},
  title = {KhmerTagger: Inverse Text Normalization for Khmer Automatic Speech Recognition},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/seanghay/khmertagger}},
  note = {Open source project for Khmer punctuation restoration and number recognition using XLM-RoBERTa}
}
```