The midtraining datasets for KORMo-10B were collected from diverse, publicly available sources.
Note: Long Context Training
English
- princeton-nlp/prolong-data-64K (7.21B tokens, sampled)
Korean
- kormo-lm/ko_synth_kosmopedia (0.51B tokens, sampled)
- kormo-lm/korean_web (1.44B tokens, sampled)
Note: Reasoning Mid-Training
English
- nvidia/Nemotron-Post-Training-Dataset-v1 (~144.75B tokens)
- open-thoughts/OpenThoughts3-1.2M (~5.46B tokens)
Korean
- kormo-lm/midtrain-Nemotron-post-train-translated (~2.83B tokens)
- kormo-lm/ko_reasoning_synth_ko_mlp (~7.05B tokens)
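For readers who want to assemble a mixture like the ones above, the sketch below shows one way to stream and interleave the long-context sources with the Hugging Face `datasets` library. It is a minimal sketch, not the KORMo-10B pipeline: it assumes each repository loads as a standard `train` split with a compatible schema, and the sampling probabilities are simply proportional to the token counts listed above rather than the actual published ratios.

```python
import itertools

from datasets import interleave_datasets, load_dataset

# Stream the three long-context sources listed above. This assumes each
# repository exposes a standard "train" split readable by `datasets` and
# that the sources share a compatible schema (e.g., a common text column);
# pre-packed releases may need their own loading path in practice.
english = load_dataset("princeton-nlp/prolong-data-64K", split="train", streaming=True)
ko_synth = load_dataset("kormo-lm/ko_synth_kosmopedia", split="train", streaming=True)
ko_web = load_dataset("kormo-lm/korean_web", split="train", streaming=True)

# Illustrative mixing weights, proportional to the listed token counts
# (7.21B / 0.51B / 1.44B). The actual KORMo-10B sampling ratios are not
# given here, so treat these numbers as placeholders.
counts = [7.21, 0.51, 1.44]
total = sum(counts)
mixture = interleave_datasets(
    [english, ko_synth, ko_web],
    probabilities=[c / total for c in counts],
    seed=42,
)

# Peek at a few examples drawn from the mixed stream.
for example in itertools.islice(mixture, 3):
    print(sorted(example.keys()))
```

`interleave_datasets` draws each example from one source at random according to the given probabilities, which is a common way to realize token-budget sampling ratios without materializing the full corpora; the same pattern applies to the reasoning mid-training sources, with weights scaled to their much larger token budgets.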