License: apache-2.0
OpenSeek-Mid-v1
OpenSeek-Mid-v1 is a 10.61-billion-parameter language model grown from Qwen3-4B-Base through a two-stage model expansion pipeline and trained on only 2 trillion tokens of fully open-source data.
Despite having 25% fewer parameters and using 18x less training data, OpenSeek-Mid-v1 matches or surpasses Qwen3-14B-Base across multiple benchmarks.
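For a quick smoke test, the snippet below loads the model with the standard transformers causal-LM API. The repo id is a placeholder and the exact path may differ; since this is a base model, a plain completion prompt is used.

```python
# Minimal inference sketch; assumes the checkpoint loads with the standard
# transformers causal-LM API. The repo id below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenSeek/OpenSeek-Mid-v1"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "The key idea behind model growth is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```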
Highlights
- Model Growth, Not From-Scratch Training: Grown from Qwen3-4B via width expansion + partial depth stacking, inheriting the seed model's learned representations.
- Extreme Data Efficiency: Matches Qwen3-14B-Base (~36T tokens) with only 2T tokens of training, an 18x reduction in data requirement.
- Muon Optimizer: Spectral whitening ensures expanded dimensions are effectively utilized, delivering significant gains over AdamW in the model growth setting.
- Fully Open-Source Data: All training data comes from publicly available datasets (NemotronCC-v2, Stack-Edu, Dolmino, CCI, etc.).
Architecture
| Specification | Value |
|---|---|
| Parameters | 10.61B |
| Layers | 56 |
| Hidden Size (d_model) | 2560 |
| FFN Intermediate Size (d_FFN) | 19456 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Sequence Length | 8192 |
| Vocabulary Size | same as Qwen3-4B |
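For reference, a minimal sketch of how these dimensions map onto a Qwen3-style transformers config. Fields not listed above (vocabulary size, RoPE settings, context window) are simply inherited from the Qwen3-4B base config; this is an illustration, not the released config file.

```python
# Illustrative only: shape an OpenSeek-Mid-v1-like config from the Qwen3-4B base config.
# Unlisted fields (vocab size, RoPE theta, context window, ...) are inherited unchanged.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-4B-Base")
config.num_hidden_layers = 56      # 36 seed layers + 20 duplicated layers (depth stacking)
config.hidden_size = 2560          # d_model, unchanged from Qwen3-4B
config.intermediate_size = 19456   # d_FFN doubled from 9728 by width expansion
config.num_attention_heads = 32
config.num_key_value_heads = 8     # grouped-query attention
print(config)
```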
Growth Pipeline
- Qwen3-4B (4.02B, 36L)
- Width expansion (d_FFN: 9728 → 19456, SNR = 10 dB) → Width-Expanded (7.10B, 36L)
- Partial depth stacking (layers 14–34 × 2) → OpenSeek-Mid-v1 (10.61B, 56L)
- Continual pretraining with Muon (2T tokens) → Final Model
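The two growth operators can be pictured as in the sketch below. This is a hedged illustration under simple assumptions (duplicate-and-perturb width expansion at the stated SNR, duplication of a contiguous layer block); it is not the project's actual growth code, and details such as expanding the matching FFN down-projection are only noted in the comments.

```python
# Illustrative sketch of the two growth operators; not the project's implementation.
import copy
import torch

def expand_ffn_rows(weight: torch.Tensor, snr_db: float = 10.0) -> torch.Tensor:
    """Duplicate the rows of an FFN up/gate projection (d_FFN x d_model) and perturb
    the copy with Gaussian noise at the given signal-to-noise ratio (10 dB here).
    The matching down projection (d_model x d_FFN) must have its columns expanded
    as well (and is typically rescaled to preserve the function); that step is omitted."""
    noise_std = weight.std() * 10 ** (-snr_db / 20)   # noise level implied by the SNR
    noisy_copy = weight + noise_std * torch.randn_like(weight)
    return torch.cat([weight, noisy_copy], dim=0)     # d_FFN: 9728 -> 19456

def stack_layers(layers: list, start: int = 14, end: int = 34) -> list:
    """Partial depth stacking: insert copies of layers[start:end] (20 layers) right
    after the originals, growing 36 layers into 56."""
    duplicated = [copy.deepcopy(layer) for layer in layers[start:end]]
    return layers[:end] + duplicated + layers[end:]
```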
Training
Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | Muon |
| Sequence Length | 8192 |
| Global Batch Size | 2048 sequences |
| Peak Learning Rate | 1e-4 |
| LR Schedule | Cosine with linear warmup |
| Warmup Steps | 1000 |
| Weight Decay | 0.1 |
| Training Framework | FlagScale (FlagOS) |
| Total Training Tokens | ~2.06T |
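Muon orthogonalizes the momentum of each 2-D weight matrix before applying it (the "spectral whitening" referenced in the highlights), typically via a few Newton-Schulz iterations. The sketch below follows the commonly published formulation of Muon and is not necessarily the exact FlagScale implementation; per-matrix scale factors and the handling of embeddings and 1-D parameters (which usually stay on AdamW) are omitted.

```python
# Core of a Muon-style update for one 2-D weight matrix (public formulation, simplified).
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace g by an orthogonal matrix spanning the same subspace."""
    a, b, c = 3.4445, -4.7750, 2.0315            # quintic iteration coefficients
    x = g / (g.norm() + 1e-7)                    # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum, lr=1e-4, beta=0.95, weight_decay=0.1):
    """One step: momentum accumulation, orthogonalized update, decoupled weight decay."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    weight.mul_(1.0 - lr * weight_decay).add_(update, alpha=-lr)
```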
Stage 1: Broad Knowledge Acquisition (1.36T tokens)
Stage 1 Data Mixture
| Category | Proportion | Tokens (B) |
|---|---|---|
| Web | 42% | ~571B |
| Math | 20% | ~272B |
| Code | 20% | ~272B |
| STEM | 15% | ~204B |
| Multilingual | 3% | ~41B |
Stage 2: Capability Specialization (0.70T tokens)
Stage 2 Data Mixture
| Category | Proportion | Tokens (B) | Delta vs. Stage 1 |
|---|---|---|---|
| Web | 35% | ~245B | -7% |
| Math | 20% | ~140B | 0% |
| Code | 24% | ~168B | +4% |
| STEM | 18% | ~126B | +3% |
| Multilingual | 3% | ~21B | 0% |
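The token budgets above follow directly from the stage size and the mixture proportions; a quick arithmetic check for Stage 2:

```python
# Reproduce the Stage 2 token budgets from the mixture proportions (simple arithmetic check).
stage2_tokens_b = 700  # 0.70T tokens, in billions
stage2_mixture = {"Web": 0.35, "Math": 0.20, "Code": 0.24, "STEM": 0.18, "Multilingual": 0.03}

for category, proportion in stage2_mixture.items():
    print(f"{category:>12}: ~{stage2_tokens_b * proportion:.0f}B tokens")
# -> Web ~245B, Math ~140B, Code ~168B, STEM ~126B, Multilingual ~21B
```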
Detailed Dataset Composition
Stage 1 (%) and Stage 2 (%) denote each dataset's sampling weight within the respective stage. "–" indicates the dataset is not used in that stage.
Web
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-HQ-Syn | 798.41 | 23.24 | 19.36 |
| Nemotron-CC-v2-Diverse-QA (×5 shards) | 340.81 | 9.92 | 8.26 |
| Nemotron-CC-v2-HQ (×5 shards) | 303.82 | 8.84 | 7.36 |
| dolmino-mix-1124-wiki | 3.82 | 0.15 | 0.18 |
| dolmino-mix-1124-stackexchange | 1.30 | 0.05 | 0.06 |
Math
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-SFT-MATH | 207.46 | 11.70 | 11.70 |
| Nemotron-CC-Math-v1-4plus-MIND | 74.34 | 4.19 | 4.19 |
| Nemotron-CC-Math-v1-4plus | 53.37 | 3.01 | 3.01 |
| Dolmino-math | 11.17 | 0.63 | 0.63 |
| OpenMathInstruct-2 | 5.30 | 0.30 | 0.30 |
| OpenMathReasoning-4k | 2.48 | 0.14 | 0.14 |
| NuminaMath-1.5 | 0.38 | 0.02 | 0.02 |
Code
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Code-v1-Syn | 171.53 | 9.05 | 10.86 |
| Nemotron-SFT-Code | 57.47 | 3.03 | 3.64 |
| stack-edu-Java | 31.70 | 1.06 | 1.27 |
| stack-edu-Markdown | 26.64 | 0.38 | 0.45 |
| stack-edu-Python | 18.27 | 1.54 | 1.85 |
| stack-edu-Cpp | 12.62 | 1.11 | 1.33 |
| stack-edu-JavaScript | 8.99 | 1.00 | 1.20 |
| stack-edu-SQL | 8.23 | 0.37 | 0.44 |
| github-issue | 8.46 | 0.25 | 0.30 |
| stack-edu-PHP | 7.43 | 0.25 | 0.30 |
| stack-edu-CSharp | 7.26 | 0.37 | 0.44 |
| stack-edu-C | 4.80 | 0.43 | 0.52 |
| stack-edu-Shell | 2.60 | 0.01 | 0.01 |
| stack-edu-TypeScript | 2.51 | 0.18 | 0.22 |
| OpenCodeInstruct | 1.59 | – | 0.10 |
| stack-edu-Swift | 1.53 | 0.06 | 0.07 |
| stack-edu-Rust | 1.45 | 0.05 | 0.06 |
| stack-edu-Go | 1.42 | 0.03 | 0.04 |
| kaggle-notebooks | 1.42 | 0.65 | 0.78 |
| stack-edu-Ruby | 1.36 | 0.01 | 0.01 |
| OpenCodeReasoning-2-cpp-4k | 0.76 | 0.04 | 0.05 |
| OpenCodeReasoning-2-python-4k | 0.58 | 0.03 | 0.04 |
| github-code-review | 0.32 | – | 0.02 |
STEM & Science
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Specialized-v1 (×4 shards) | 276.83 | 10.55 | 12.73 |
| Nemotron-Pretraining-SFT-v1-General | 86.93 | 3.31 | 4.00 |
| dolmino-mix-1124-pes2o | 60.19 | 0.50 | 0.50 |
| Nemotron-Pretraining-Specialized-v1.1 | 9.04 | – | 0.42 |
| OpenScienceReasoning-2-4k | 1.72 | 0.07 | 0.08 |
| MegaScience | 0.98 | 0.04 | 0.04 |
Multilingual
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-Translated-Diverse-QA | 135.80 | 1.74 | 1.74 |
| CCI4_0-Zh-High | 98.76 | 1.26 | 1.26 |
Checkpoint Merging
The final model is a weighted average of 5 complementary checkpoints, each selected for a unique strength:
| Checkpoint | Weight | Role | Key Metric |
|---|---|---|---|
| iter 169984 | 0.30 | Code anchor | MBPP 78.84 |
| iter 219136 | 0.25 | Reasoning lead | GPQA-d 44.39 |
| iter 174080 | 0.15 | Code peak | EvalPlus 68.88 |
| iter 190464 | 0.15 | Math bridge | GPQA-d 42.86 |
| iter 217088 | 0.15 | General boost | BBH 82.84 |
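A weighted parameter average of this kind can be computed as in the generic model-souping sketch below. The file names are hypothetical and this is not the project's actual merging script; it assumes all five checkpoints share the same architecture and state-dict keys.

```python
# Generic weighted checkpoint average ("souping"); file names are hypothetical.
import torch

checkpoints = {                    # checkpoint path -> merge weight (from the table above)
    "iter_169984.pt": 0.30,
    "iter_219136.pt": 0.25,
    "iter_174080.pt": 0.15,
    "iter_190464.pt": 0.15,
    "iter_217088.pt": 0.15,
}

merged = None
for path, weight in checkpoints.items():
    state = torch.load(path, map_location="cpu")
    if merged is None:
        merged = {k: weight * v.float() for k, v in state.items()}
    else:
        for k, v in state.items():
            merged[k] += weight * v.float()

torch.save(merged, "openseek_mid_v1_merged.pt")
```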
Evaluation Results
All evaluations were conducted with lm-eval-harness under consistent settings across models.
| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3.5-9B | Nemotron-12B | Gemma3-12B | Qwen3-14B | OpenSeek-Mid-v1 |
|---|---|---|---|---|---|---|---|
| Training tokens | 36T | 36T | 36T | 20T | 12T | 36T | 2T |
| MMLU (5-shot) | 72.72 | 76.57 | 78.64 | 78.07 | 73.28 | 80.57 | 79.31 |
| MMLU-Pro (5-shot CoT) | 49.31 | 52.35 | 58.48 | 57.57 | 41.16 | 56.00 | 66.57 |
| AGIEval-en (0-shot) | 45.92 | 49.09 | 45.15 | 49.20 | 44.89 | 52.83 | 52.18 |
| BBH (3-shot CoT) | 71.20 | 77.75 | 82.23 | 69.65 | 73.78 | 78.71 | 82.55 |
| HellaSwag (5-shot) | 75.36 | 79.47 | 81.04 | 83.13 | 83.45 | 82.05 | 81.81 |
| Winogrande (5-shot) | 71.90 | 77.51 | 76.80 | 79.24 | 80.35 | 79.40 | 79.24 |
| PIQA (5-shot) | 78.89 | 81.39 | 81.61 | 82.97 | 81.80 | 83.30 | 83.19 |
| OpenBookQA (5-shot) | 45.00 | 49.00 | 50.00 | 50.20 | 49.60 | 50.80 | 49.80 |
| ARC-C (0-shot) | 51.19 | 56.91 | 56.83 | 60.58 | 64.68 | 59.30 | 62.12 |
| GSM8K (4-shot CoT) | 84.31 | 86.73 | 85.60 | 81.43 | 72.02 | 90.07 | 89.16 |
| MATH (4-shot CoT) | 50.16 | 52.48 | 56.16 | 57.30 | 43.30 | 59.70 | 65.88 |
| GPQA-diamond (3-shot CoT) | 32.65 | 35.71 | 37.76 | 31.12 | 23.47 | 37.76 | 45.41 |
| MBPP (0-shot) | 73.81 | 75.66 | 77.51 | 73.81 | 73.28 | 84.92 | 76.19 |
| EvalPlus Avg (0-shot) | 63.96 | 67.95 | 59.54 | 61.20 | 53.48 | 73.41 | 66.45 |
| Avg General | 62.39 | 66.67 | 67.86 | 65.04 | 60.98 | 69.22 | 70.75 |
| Avg All | 61.88 | 65.61 | 66.24 | 65.39 | 61.32 | 69.20 | 69.99 |
- Avg General: average of knowledge, reasoning, and commonsense benchmarks (MMLU, MMLU-Pro, AGIEval-en, BBH, HellaSwag, Winogrande, PIQA, OpenBookQA, ARC-C).
- Avg All: average of all benchmarks above, including math, STEM, and code (+ GSM8K, MATH, GPQA-diamond, MBPP, EvalPlus Avg).
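A subset of these numbers can be re-run from Python with lm-eval-harness (v0.4+). The call below is only a hedged example: the exact harness version, batch size, and generation settings behind the reported scores are not specified here, and the repo id is a placeholder.

```python
# Hedged reproduction sketch with lm-eval-harness; settings here are illustrative only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=OpenSeek/OpenSeek-Mid-v1,dtype=bfloat16",  # placeholder repo id
    tasks=["mmlu"],      # MMLU, 5-shot, matching the table row above
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```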
Citation
If you find this work useful, please cite:
    @misc{openseek-mid-v1,
      title={OpenSeek-Mid-v1: Efficient Language Model Scaling via Seed Model Expansion},
      year={2026},
      note={Technical report coming soon}
    }
Acknowledgements
This project was built using open-source data and tools, including NemotronCC-v2, Stack-Edu, Dolmino, CCI, OpenMathInstruct, OpenCodeReasoning, and FlagOS.