	---
	tags:
	- khmer
	- nlp
	- punctuation-restoration
	- inverse-text-normalization
	- asr
	- xlm-roberta
	license: mit
	language:
	- km
	---

	# KhmerTagger: Inverse Text Normalization for Khmer ASR

	KhmerTagger is a model for inverse text normalization (ITN) of Khmer Automatic Speech Recognition (ASR) outputs. It restores punctuation and recognizes numbers to make raw ASR text more readable.

	## Model Description

	The model uses XLM-RoBERTa as its encoder, followed by a bidirectional LSTM layer and two classification heads:
	- Punctuation head: Predicts punctuation marks (space, comma, question mark, exclamation mark, etc.)
	- Number head: Identifies and tags numeric entities in the text
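
To illustrate how the outputs of the two heads could be combined into readable text, here is a minimal sketch in plain Python. The label inventories `PUNCT_LABELS` and `NUM_TAGS` and the helper `apply_tags` are hypothetical: this card does not document the actual classes or their order, so the mapping below is an assumption chosen only for illustration.

```python
# Hypothetical label inventories; the real class order is not documented here.
PUNCT_LABELS = ["", " ", ",", "?", "!"]  # index 0 = nothing follows the token
NUM_TAGS = ["O", "B-NUM", "I-NUM"]       # BIO-style tags, assumed for illustration


def apply_tags(tokens, punct_ids, num_ids):
    """Rebuild text from tokens and collect number spans.

    Returns (text, number_spans), where number_spans holds one list of
    tokens per contiguous B-NUM/I-NUM run.
    """
    if not (len(tokens) == len(punct_ids) == len(num_ids)):
        raise ValueError("tokens and label sequences must have equal length")

    pieces = []
    spans = []
    current = None  # the number span being extended, if any
    for token, punct_id, num_id in zip(tokens, punct_ids, num_ids):
        # Append the predicted mark (possibly empty) right after the token.
        pieces.append(token + PUNCT_LABELS[punct_id])
        tag = NUM_TAGS[num_id]
        if tag == "B-NUM":
            current = [token]
            spans.append(current)
        elif tag == "I-NUM" and current is not None:
            current.append(token)
        else:
            # "O", or an I-NUM with no open span, closes any current span.
            current = None
    return "".join(pieces), spans


# Placeholder tokens stand in for Khmer words.
text, spans = apply_tags(["a", "b", "c", "d"], [1, 0, 2, 3], [0, 1, 2, 0])
print(text)   # a bc,d?
print(spans)  # [['b', 'c']]
```

Treating the space as a predicted class alongside the other marks means a single pass over the tokens restores both spacing and punctuation.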

	## Usage

	```python
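	import torch
	from transformers import XLMRobertaTokenizer

	# KhmerTagger is imported from a local module named `model`,
	# which must be on the Python import path.
	from model import KhmerTagger

	# Load tokenizer
	tokenizer = XLMRobertaTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")

	# Build the model and load the weights on CPU from a local pytorch_model.bin
	model = KhmerTagger(n_punct_features=5, n_num_features=3)
	model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu", weights_only=True))
	model.eval()  # switch to inference mode (disables dropout)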

	# Your inference code here...
	```
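
The snippet above stops before inference. As a sketch of the step that typically follows a forward pass, the helper below picks the highest-scoring class per token for each head. It assumes, hypothetically, that the model yields one score vector per token for each head; the actual return type of `KhmerTagger` is not documented here. Plain lists with made-up scores stand in for tensors so the example runs without the weights, and `argmax_per_token` is an illustrative name, not part of the project.

```python
def argmax_per_token(scores):
    """Return the index of the highest score for every token.

    `scores` is a list of per-token score lists, i.e. shape
    (sequence_length, n_classes). Ties resolve to the lowest index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Made-up scores for a three-token input: 5 punctuation classes and
# 3 number classes, matching n_punct_features=5 and n_num_features=3.
punct_scores = [
    [2.1, 0.3, -1.0, -0.5, -2.0],
    [0.1, 1.7, 0.2, -0.3, -1.1],
    [-0.4, 0.0, 0.1, 2.5, 0.3],
]
num_scores = [
    [1.5, -0.2, -0.9],
    [-0.3, 2.0, 0.1],
    [0.2, -0.1, 1.8],
]

punct_ids = argmax_per_token(punct_scores)
num_ids = argmax_per_token(num_scores)
print(punct_ids)  # [0, 1, 3]
print(num_ids)    # [0, 1, 2]
```

With PyTorch tensors of shape `(sequence_length, n_classes)`, the equivalent operation is `logits.argmax(dim=-1)`.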

	## Training

	The model was trained on 1.5 million tokens of Khmer news data and achieved 97.2% accuracy on the validation set.

	## Citation

	```bibtex
	@misc{khmertagger2025,
	author = {Seanghay Yath},
	title = {KhmerTagger: Inverse Text Normalization for Khmer Automatic Speech Recognition},
	year = {2025},
	publisher = {GitHub},
	journal = {GitHub repository},
	howpublished = {\url{https://github.com/seanghay/khmertagger}},
	note = {Open source project for Khmer punctuation restoration and number recognition using XLM-RoBERTa}
	}
	```