KORMo-Team/dclm-baseline-filtered
The pretraining datasets for KORMo-10B were collected from diverse, publicly available sources.
Note: Stage 1 Pretraining Datasets
English
- kormo-lm/dclm-baseline (~1000B tokens)
Korean
- kormo-lm/korean_web (~42.5B tokens)
Note: Stage 2 Pretraining Datasets
English
- kormo-lm/UltraFineWeb (~793B tokens)
- kormo-lm/math_finemath_3plus (~37.3B tokens)
- kormo-lm/code_stack_edu (~152B tokens)
- kormo-lm/cosmopedia (~25B tokens)
- kormo-lm/reasoning_synth_OCR (~0.65B tokens)
- kormo-lm/reasoning_synth_OMR (~3.19B tokens)
Korean
- kormo-lm/ko_web_korean_opensource (~5.57B tokens)
- kormo-lm/ko_synth_fineweb2 (~10.97B tokens)
- kormo-lm/ko_synth_kosmopedia (~4.07B tokens)
- kormo-lm/ko_synth_UltraFineWeb (~41.69B tokens)
- kormo-lm/ko_reasoning_synth_ko_mlp (~7.05B tokens)
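As a minimal sketch of how a mixture like the Stage 1 split could be streamed and weighted with the Hugging Face `datasets` library: the repository IDs are taken from the lists above (some may not be publicly downloadable), and the interleaving ratio is an assumption derived from the stated token counts, not a confirmed KORMo training setting.

```python
# Illustrative only: stream two corpora from the Hub and interleave them
# roughly in proportion to their stated token counts (~1000B vs ~42.5B).
from datasets import load_dataset, interleave_datasets

# Repository IDs are taken from the lists above; availability is assumed.
english = load_dataset("KORMo-Team/dclm-baseline-filtered", split="train", streaming=True)
korean = load_dataset("kormo-lm/korean_web", split="train", streaming=True)

# ~96% English / ~4% Korean, matching the approximate token-count ratio.
mixed = interleave_datasets([english, korean], probabilities=[0.96, 0.04], seed=42)

# Peek at a few mixed examples without downloading the full corpora.
for example in mixed.take(3):
    print(example.keys())
```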