mT5 Latin Punctuator (mt5-large)

Model Description

This model is a fine-tuned version of google/mt5-large designed for automatic punctuation restoration in Latin text.

It takes unpunctuated Latin text as input and generates a fully punctuated version of the same text, adding sentence boundaries and internal punctuation. The model was optimized for Classical and Medieval Latin prose (approx. 400–1500 AD), making it suitable for digital editions and post-processing OCR/ATR outputs.

  • Task: Sequence-to-Sequence Punctuation Restoration
  • Base Model: mT5-Large
  • Language: Latin (la)
  • Developer: Michael Schonhardt (TU Darmstadt)

Note: This model operates on continuous text streams. While it restores sentence boundaries (., ?, !) and capitalization, it is not designed to predict paragraph breaks or layout features.


Intended Use

Primary Use Cases

  • OCR Post-processing: Restoring punctuation to raw OCR output from early printed books or manuscripts.
  • NLP Pipeline Preparation: Pre-formatting Latin text for downstream tasks like Machine Translation or Named Entity Recognition.
  • Pedagogical Aid: Helping students or readers navigate raw Latin texts by providing suggested sentence structures.

Out-of-Scope / Limitations

  • Not a Critical Editor: The model makes probabilistic guesses based on training data. In ambiguous cases, it may insert punctuation that disagrees with specific editorial conventions.
  • Diplomatic Transcription: The model expects spelling expanded according to normalization conventions and may struggle with raw diplomatic transcriptions that contain heavy abbreviations.

Usage

Important: The model was trained with the prefix "punctuate: " and expects lowercased input. The helper function below handles this automatically.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "mschonhardt/mt5-latin-punctuator-large" 
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

def punctuate(text: str) -> str:
    # Preprocessing: Add prefix and lowercase as per training script
    input_text = "punctuate: " + text.lower()
    
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        truncation=True,
        max_length=1024,
    ).to(device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_length=1024,
            num_beams=4,
            early_stopping=True,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example usage
text = "gallia est omnis divisa in partes tres"
print(punctuate(text))
# Output: Gallia est omnis divisa in partes tres.
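
Inputs longer than 1024 tokens are truncated, so long passages (for example, full OCR pages) can be punctuated chunk by chunk. The helper below is a minimal sketch built on the punctuate() function above; the word-based chunk size and the simple whitespace split are illustrative assumptions, and chunk boundaries may cut across sentences.

def punctuate_long(text: str, max_words: int = 150) -> str:
    """Punctuate a long passage chunk by chunk using punctuate() above.

    Chunk boundaries may fall mid-sentence; for critical work, overlap
    chunks or review the joins manually.
    """
    words = text.split()
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    return " ".join(punctuate(chunk) for chunk in chunks)

# Example: punctuate_long(raw_ocr_page_text)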

Training Data

The model was trained on a custom dataset derived from the Latin Text Archive (LTA) as well as data created as part of the Burchards Dekret Digital Project.

Data Preparation & Cleaning

The dataset was processed using the following pipeline (a code sketch follows the list):

  1. Artifact Removal: Line number references (e.g., "667 - 668") were removed via regex.
  2. Normalization: Whitespace was normalized to single spaces.
  3. Source/Target Formatting:
    • Source: Text was lowercased, punctuation (.,;?!) was removed, and the prefix "punctuate: " was added.
    • Target: Original text with case and punctuation preserved.
  4. Splitting: The data was split into Train, Validation, and Test sets using a random seed of 42.
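
A minimal sketch of the four steps above. The regex, punctuation set, helper names, and split ratios are illustrative assumptions; they are not taken from the original training script.

import re
import random

PUNCT_TO_STRIP = ".,;?!"

def make_pair(raw_text: str) -> dict:
    # Step 1: drop line-number references such as "667 - 668" (illustrative regex)
    text = re.sub(r"\b\d+\s*-\s*\d+\b", " ", raw_text)
    # Step 2: normalize whitespace to single spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Step 3: target keeps case and punctuation; source is lowercased,
    # stripped of .,;?! and prefixed with "punctuate: "
    target = text
    source = text.lower().translate(str.maketrans("", "", PUNCT_TO_STRIP))
    source = "punctuate: " + re.sub(r"\s+", " ", source).strip()
    return {"source": source, "target": target}

# Step 4: reproducible train/validation/test split with seed 42
# (the split ratios are assumptions; the card does not state them)
def split_dataset(pairs, seed=42, val_frac=0.05, test_frac=0.05):
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    train = shuffled[: n - n_val - n_test]
    val = shuffled[n - n_val - n_test : n - n_test]
    test = shuffled[n - n_test :]
    return train, val, test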

Data Source & Acknowledgements

We gratefully acknowledge that the training data originates from the Latin Text Archive (LTA) (Prof. Dr. Bernhard Jussen, Dr. Tim Geelhaar) and includes data from the Monumenta Germaniae Historica, the Corpus Corporum, and the IRHT.


Training Procedure

The model was fine-tuned on a high-performance compute node using the transformers Seq2SeqTrainer.

Hardware Configuration

  • GPU: NVIDIA L40S (45 GB VRAM)
  • Precision: bfloat16 (BF16) mixed precision
  • Optimizer: Adafactor

Hyperparameters

Parameter                Value
Batch Size               4
Grad Accumulation        16 steps
Effective Batch Size     64
Learning Rate            5e-5 (Linear Decay)
Max Sequence Length      1024 (Source and Target)
Gradient Checkpointing   True
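
The sketch below shows how these settings could map onto the transformers Seq2SeqTrainingArguments used with the Seq2SeqTrainer; the output directory and any option not listed in the table above are assumptions, not the documented training configuration.

from transformers import Seq2SeqTrainingArguments

# Assumed mapping of the listed hyperparameters onto Seq2SeqTrainingArguments.
# output_dir and every option not shown in the table above are illustrative only.
training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-latin-punctuator-large",  # hypothetical path
    per_device_train_batch_size=4,            # Batch Size: 4
    gradient_accumulation_steps=16,           # 4 x 16 = effective batch size 64
    learning_rate=5e-5,
    lr_scheduler_type="linear",               # linear decay
    optim="adafactor",
    bf16=True,                                # BF16 mixed precision
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=1024,               # max target length during evaluation
)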

Training Dynamics

Early Stopping: Training was stopped at ~1.6 epochs as validation gains diminished.

Validation Loss Progression:

Epoch   Validation Loss
0.2     0.1448
0.6     0.1137
1.0     0.1043
1.4     0.0982
1.6     0.0973 (Best)

Example Outputs

1. Classical Latin (Caesar)

Input: gallia est omnis divisa in partes tres quarum unam incolunt belgae aliam aquitani tertiam qui ipsorum lingua celtae nostra galli appellantur

Output: Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.

2. Medieval Medical Latin (Mondino de' Liuzzi)

Input: Consilium ad conceptionem mondini bononiensis de leuciis quia ex relatis principio sterilitatis causam fore duplicem horum coniugatorum unam ocultam scilicet disproportionem eorum nobis ignotam cuius remotioni altissimus provideat quia nostrum non est nec posse alicuius artis et aliam manifestam existimo esse complectionem eorum declinare ad frigidum et declinabilem in processu etatis etiam ad frigidum et humidum humiditatem accidentali corporibus eorum est subveniendum triplici instrumento medici declinante ad calidum et siccum quo ad medicinas et ad calidum et humidum quantum ad dictam

Output: Consilium ad conceptionem Mondini Bononiensis de leuciis. Quia ex relatis principio sterilitatis causam fore duplicem horum coniugatorum, unam ocultam, scilicet disproportionem eorum nobis ignotam, cuius remotioni Altissimus provideat, quia nostrum non est nec posse alicuius artis, et aliam manifestam existimo esse complectionem eorum declinare ad frigidum et declinabilem in processu etatis etiam ad frigidum et humidum. Humiditatem accidentali corporibus eorum est subveniendum triplici instrumento medici declinante ad calidum et siccum quo ad medicinas et ad calidum et humidum quantum ad dictam.


Model Development and Funding

This model was developed by Michael Schonhardt as part of the Digital Editing Toolkit.

Affiliation: Fachgebiet Digitale Editorik und Kulturgeschichte des Mittelalters
(Digital Editing and Cultural History of the Middle Ages),
Technische Universität Darmstadt, in cooperation with the Academy of Sciences and Literature | Mainz

Funding: The development was funded by the Hessian Ministry of Higher Education, Research, Science and the Arts (HMWK) within the LOEWE Exploration project: "Embedding the Past".

The project focuses on developing transparent and responsible AI methods for the semantic analysis and editorial processing of historical documents.


Citation

If you use this model in your research, please cite it as follows:

@misc{schonhardt-2025-latin-punctuator,
  author       = {Schonhardt, Michael},
  title        = {mT5 Latin Punctuator (mt5-large)},
  year         = {2025},
  publisher    = {Hugging Face},
  doi          = {10.5281/zenodo.17777660},
  howpublished = {\url{https://huggingface.co/mschonhardt/mt5-latin-punctuator-large}},
  note         = {Part of the LOEWE Exploration 'Embedding the Past'. Data provided by LTA}
}

Please also cite the original mT5 paper:

@inproceedings{xue-etal-2021-mt5,
    title = "m{T}5: A Massively Multilingual Pre-trained Text-to-Text Transformer",
    author = "Xue, Linting  and
      Constant, Noah  and
      Roberts, Adam  and
      Kale, Mihir  and
      Al-Rfou, Rami  and
      Siddhant, Aditya  and
      Barua, Aditya  and
      Raffel, Colin",
    editor = "Toutanova, Kristina  and
      Rumshisky, Anna  and
      Zettlemoyer, Luke  and
      Hakkani-Tur, Dilek  and
      Beltagy, Iz  and
      Bethard, Steven  and
      Cotterell, Ryan  and
      Chakraborty, Tanmoy  and
      Zhou, Yichao",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.41/",
    doi = "10.18653/v1/2021.naacl-main.41",
    pages = "483--498"
    
}