mT5 Latin Punctuator (mt5-large)
Model Description
This model is a fine-tuned version of google/mt5-large designed for automatic punctuation restoration in Latin text.
It takes unpunctuated Latin text as input and generates a fully punctuated version of the same text, adding sentence boundaries and internal punctuation. The model was optimized for Classical and Medieval Latin prose (approx. 400–1500 AD), making it suitable for digital editions and post-processing OCR/ATR outputs.
- Task: Sequence-to-Sequence Punctuation Restoration
- Base Model: mT5-Large
- Language: Latin (`la`)
- Developer: Michael Schonhardt (TU Darmstadt)
Note: This model operates on continuous text streams. While it restores sentence boundaries (`.`, `?`, `!`) and capitalization, it is not designed to predict paragraph breaks or layout features.
Intended Use
Primary Use Cases
- OCR Post-processing: Restoring punctuation to raw OCR output from early printed books or manuscripts.
- NLP Pipeline Preparation: Pre-formatting Latin text for downstream tasks like Machine Translation or Named Entity Recognition.
- Pedagogical Aid: Helping students or readers navigate raw Latin texts by providing suggested sentence structures.
Out-of-Scope / Limitations
- Not a Critical Editor: The model makes probabilistic guesses based on training data. In ambiguous cases, it may insert punctuation that disagrees with specific editorial conventions.
- Diplomatic Transcription: The model expects expanded spelling based on normalisation conventions and may struggle with raw diplomatic transcriptions containing heavy abbreviations.
Usage
Important: The model was trained with the prefix "punctuate: " and expects lowercased input. The helper function below handles this automatically.
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "mschonhardt/mt5-latin-punctuator-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

def punctuate(text: str) -> str:
    # Preprocessing: add prefix and lowercase as per training script
    input_text = "punctuate: " + text.lower()
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        truncation=True,
        max_length=1024,
    ).to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_length=1024,
            num_beams=4,
            early_stopping=True,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example usage
text = "gallia est omnis divisa in partes tres"
print(punctuate(text))
# Output: Gallia est omnis divisa in partes tres.
```
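Inputs are truncated at 1,024 tokens, so longer documents (for example full OCR pages) need to be processed in chunks. The sketch below shows one possible approach using the `punctuate` helper defined above; the window size of 150 words is an arbitrary assumption, not part of the original setup.

```python
def punctuate_long(text: str, window: int = 150) -> str:
    # Split on whitespace into fixed-size word windows and punctuate each
    # window independently; adjust `window` for your material.
    words = text.split()
    chunks = [" ".join(words[i:i + window]) for i in range(0, len(words), window)]
    return " ".join(punctuate(chunk) for chunk in chunks)
```

Because chunks are punctuated independently, sentence boundaries near chunk edges are less reliable; overlapping windows or re-splitting at predicted sentence ends can mitigate this.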
Training Data
The model was trained on a custom dataset derived from the Latin Text Archive (LTA) as well as data created as part of the Burchards Dekret Digital Project.
Data Preparation & Cleaning
The dataset was processed using the following pipeline (a code sketch follows the list):
- Artifact Removal: Line number references (e.g., "667 - 668") were removed via regex.
- Normalization: Whitespace was normalized to single spaces.
- Source/Target Formatting:
  - Source: Text was lowercased, punctuation (`.`, `,`, `;`, `?`, `!`) was removed, and the prefix `"punctuate: "` was added.
  - Target: Original text with case and punctuation preserved.
- Splitting: The data was split into Train, Validation, and Test sets using a random seed of 42.
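A minimal sketch of this pipeline, assuming plain-text input; the regex, the 80/10/10 split ratio, and the function names are illustrative, not the original training script.

```python
import re
import random

PUNCTUATION = ".,;?!"

def make_pair(raw_text: str) -> dict:
    """Build one source/target pair following the steps described above."""
    # Artifact removal: drop line-number references such as "667 - 668"
    text = re.sub(r"\b\d+\s*-\s*\d+\b", " ", raw_text)
    # Normalization: collapse whitespace to single spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Target: original casing and punctuation preserved
    target = text
    # Source: lowercased, punctuation removed, training prefix added
    source = text.lower().translate(str.maketrans("", "", PUNCTUATION))
    source = re.sub(r"\s+", " ", source).strip()
    return {"source": "punctuate: " + source, "target": target}

def split_dataset(pairs: list, seed: int = 42):
    """Shuffle with the fixed seed and split (80/10/10 is an assumption)."""
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    return pairs[: int(0.8 * n)], pairs[int(0.8 * n): int(0.9 * n)], pairs[int(0.9 * n):]
```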
Data Source & Acknowledgements
We gratefully acknowledge that the training data originates from the Latin Text Archive (LTA) (Prof. Dr. Bernhard Jussen, Dr. Tim Geelhaar) including data from Monumenta Germaniae Historica, Corpus Corporum and IRHT.
Training Procedure
The model was fine-tuned on a high-performance compute node using the transformers Seq2SeqTrainer.
Hardware Configuration
- GPU: NVIDIA L40S (45 GB VRAM)
- Precision: `bfloat16` (BF16) mixed precision
- Optimizer: Adafactor
Hyperparameters
| Parameter | Value |
|---|---|
| Batch Size | 4 |
| Grad Accumulation | 16 steps |
| Effective Batch Size | 64 |
| Learning Rate | 5e-5 (Linear Decay) |
| Max Sequence Length | 1024 (Source and Target) |
| Gradient Checkpointing | True |
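The configuration above can be expressed with `Seq2SeqTrainingArguments`; the following is a sketch based on the tables, where the output directory and epoch count are assumptions rather than the original training script.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-latin-punctuator-large",  # assumed name
    per_device_train_batch_size=4,            # batch size
    gradient_accumulation_steps=16,           # -> effective batch size of 64
    learning_rate=5e-5,
    lr_scheduler_type="linear",               # linear decay
    bf16=True,                                # BF16 mixed precision
    optim="adafactor",
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=1024,               # max target length
    num_train_epochs=2,                       # stopped early at ~1.6 epochs
)
```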
Training Dynamics
Early Stopping: Training was stopped at ~1.6 epochs as validation gains diminished.
Validation Loss Progression:
| Epoch | Validation Loss |
|---|---|
| 0.2 | 0.1448 |
| 0.6 | 0.1137 |
| 1.0 | 0.1043 |
| 1.4 | 0.0982 |
| 1.6 | 0.0973 (Best) |
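The model card does not state whether the run was stopped manually or via a callback; as an illustration only, early stopping on the validation loss can be wired into the `Seq2SeqTrainer` with `EarlyStoppingCallback`. The `model`, `tokenizer`, `training_args`, and tokenized `train_dataset` / `val_dataset` objects below are assumed to exist (see the sketches above), and the training arguments must enable periodic evaluation with `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"`.

```python
from transformers import Seq2SeqTrainer, EarlyStoppingCallback

# Illustration only: not necessarily how the original run was stopped.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```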
Example Outputs
1. Classical Latin (Caesar)
Input: gallia est omnis divisa in partes tres quarum unam incolunt belgae aliam aquitani tertiam qui ipsorum lingua celtae nostra galli appellantur
Output: Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.
2. Medieval Medical Latin (Mondino de' Liuzzi)
Input: Consilium ad conceptionem mondini bononiensis de leuciis quia ex relatis principio sterilitatis causam fore duplicem horum coniugatorum unam ocultam scilicet disproportionem eorum nobis ignotam cuius remotioni altissimus provideat quia nostrum non est nec posse alicuius artis et aliam manifestam existimo esse complectionem eorum declinare ad frigidum et declinabilem in processu etatis etiam ad frigidum et humidum humiditatem accidentali corporibus eorum est subveniendum triplici instrumento medici declinante ad calidum et siccum quo ad medicinas et ad calidum et humidum quantum ad dictam
Output: Consilium ad conceptionem Mondini Bononiensis de leuciis. Quia ex relatis principio sterilitatis causam fore duplicem horum coniugatorum, unam ocultam, scilicet disproportionem eorum nobis ignotam, cuius remotioni Altissimus provideat, quia nostrum non est nec posse alicuius artis, et aliam manifestam existimo esse complectionem eorum declinare ad frigidum et declinabilem in processu etatis etiam ad frigidum et humidum. Humiditatem accidentali corporibus eorum est subveniendum triplici instrumento medici declinante ad calidum et siccum quo ad medicinas et ad calidum et humidum quantum ad dictam.
Model Development and Funding
This model was developed by Michael Schonhardt as part of the Digital Editing Toolkit.
Affiliation:
Fachgebiet Digitale Editorik und Kulturgeschichte des Mittelalters
(Digital Editing and Cultural History of the Middle Ages)
Technische Universität Darmstadt
In cooperation with the Academy of Sciences and Literature | Mainz
Funding: The development was funded by the Hessian Ministry of Higher Education, Research, Science and the Arts (HMWK) within the LOEWE Exploration project: "Embedding the Past".
The project focuses on developing transparent and responsible AI methods for the semantic analysis and editorial processing of historical documents.
Citation
If you use this model in your research, please cite it as follows:
```bibtex
@misc{schonhardt-2025-latin-punctuator,
  author       = {Schonhardt, Michael},
  title        = {mT5 Latin Punctuator (mt5-large)},
  year         = {2025},
  publisher    = {Hugging Face},
  doi          = {10.5281/zenodo.17777660},
  howpublished = {\url{https://huggingface.co/mschonhardt/mt5-latin-punctuator-large}},
  note         = {Part of the LOEWE Exploration project 'Embedding the Past'. Data provided by the LTA.}
}
```
Please also cite the original mT5 paper:
```bibtex
@inproceedings{xue-etal-2021-mt5,
  title     = "m{T}5: A Massively Multilingual Pre-trained Text-to-Text Transformer",
  author    = "Xue, Linting and Constant, Noah and Roberts, Adam and Kale, Mihir and Al-Rfou, Rami and Siddhant, Aditya and Barua, Aditya and Raffel, Colin",
  editor    = "Toutanova, Kristina and Rumshisky, Anna and Zettlemoyer, Luke and Hakkani-Tur, Dilek and Beltagy, Iz and Bethard, Steven and Cotterell, Ryan and Chakraborty, Tanmoy and Zhou, Yichao",
  booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  month     = jun,
  year      = "2021",
  address   = "Online",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2021.naacl-main.41/",
  doi       = "10.18653/v1/2021.naacl-main.41",
  pages     = "483--498"
}
```