Mangosteen: An Open Thai Corpus for Language Model Pretraining
Paper
• 2507.14664 • Published
Mangosteen, a 47-billion-token Thai corpus built with a Thai-adapted pipeline, improves language model performance on Thai benchmarks.
Note Raw data for Thai Dolma - Commoncrawl - Fineweb2
Note Fineweb2 LD
Note + Quality filtering
Note + Deduplication
Note Fineweb2 by our pipeline (LD + Quality filtering + Deduplication + Content filtering)
Note Commoncrawl LD
Note + Quality filtering
Note + Deduplication
Note Commoncrawl by our pipeline (LD + Quality filtering + Deduplication + Content filtering)
Note FastText Model
Note Non-Common Crawl subset
Note Common Crawl subset
Note CPT base model
Note SFT model