The midtraining datasets for KORMo-10B were collected from diverse, publicly available sources.
Note: Long Context Training
English
- princeton-nlp/prolong-data-64K (7.21B tokens, sampled)
Korean
- kormo-lm/ko_synth_kosmopedia (0.51B tokens, sampled)
- kormo-lm/korean_web (1.44B tokens, sampled)
Note: Reasoning Mid-Training
English
- nvidia/Nemotron-Post-Training-Dataset-v1 (~144.75B tokens)
- open-thoughts/OpenThoughts3-1.2M (~5.46B tokens)
Korean
- kormo-lm/midtrain-Nemotron-post-train-translated (~2.83B tokens)
- kormo-lm/ko_reasoning_synth_ko_mlp (~7.05B tokens)
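For readers who want to assemble a mixture like the ones above, the sketch below shows one way to stream and interleave the long-context sources with the Hugging Face `datasets` library. It is a minimal sketch, not the KORMo-10B pipeline: it assumes each repository loads as a standard `train` split with a compatible schema, and the sampling probabilities are simply proportional to the token counts listed above rather than the actual published ratios.

```python
import itertools

from datasets import interleave_datasets, load_dataset

# Stream the three long-context sources listed above. This assumes each
# repository exposes a standard "train" split readable by `datasets` and
# that the sources share a compatible schema (e.g., a common text column);
# pre-packed releases may need their own loading path in practice.
english = load_dataset("princeton-nlp/prolong-data-64K", split="train", streaming=True)
ko_synth = load_dataset("kormo-lm/ko_synth_kosmopedia", split="train", streaming=True)
ko_web = load_dataset("kormo-lm/korean_web", split="train", streaming=True)

# Illustrative mixing weights, proportional to the listed token counts
# (7.21B / 0.51B / 1.44B). The actual KORMo-10B sampling ratios are not
# given here, so treat these numbers as placeholders.
counts = [7.21, 0.51, 1.44]
total = sum(counts)
mixture = interleave_datasets(
    [english, ko_synth, ko_web],
    probabilities=[c / total for c in counts],
    seed=42,
)

# Peek at a few examples drawn from the mixed stream.
for example in itertools.islice(mixture, 3):
    print(sorted(example.keys()))
```

`interleave_datasets` draws each example from one source at random according to the given probabilities, which is a common way to realize token-budget sampling ratios without materializing the full corpora; the same pattern applies to the reasoning mid-training sources, with weights scaled to their much larger token budgets.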