

License: Apache-2.0

OpenSeek-Mid-v1

OpenSeek-Mid-v1 is a 10.61-billion-parameter language model grown from Qwen3-4B-Base through a two-stage model expansion pipeline and trained on only 2 trillion tokens of fully open-source data.

Despite having roughly 25% fewer parameters and training on 18× less data, OpenSeek-Mid-v1 matches or surpasses Qwen3-14B-Base across multiple benchmarks.

(Figure: benchmark results comparing OpenSeek-Mid-v1 with baseline models)

Highlights

  • Model Growth, Not From-Scratch Training: Grown from Qwen3-4B via width expansion + partial depth stacking, inheriting the seed model's learned representations.
  • Extreme Data Efficiency: Matches Qwen3-14B-Base (~36T tokens) with only 2T tokens of training β€” an 18x reduction in data requirement.
  • Muon Optimizer: Spectral whitening ensures expanded dimensions are effectively utilized, delivering significant gains over AdamW in the model growth setting.
  • Fully Open-Source Data: All training data comes from publicly available datasets (NemotronCC-v2, Stack-Edu, Dolmino, CCI, etc.).

Architecture

| Specification | Value |
|---|---|
| Parameters | 10.61B |
| Layers | 56 |
| Hidden Size (d_model) | 2560 |
| FFN Intermediate Size (d_FFN) | 19456 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Sequence Length | 8192 |
| Vocabulary Size | Same as Qwen3-4B |
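As a sanity check, the 10.61B figure can be approximated from these specs. The sketch below assumes a Qwen3-style block: head dimension 128 (as in Qwen3-4B), grouped-query attention with the head counts above, a SwiGLU FFN (gate/up/down projections), untied input/output embeddings, and the Qwen3 vocabulary of 151,936 tokens; norm and bias parameters are ignored. These assumptions are ours, not stated in the card.

```python
# Rough parameter-count estimate for OpenSeek-Mid-v1 from the table above.
d_model  = 2560
d_ffn    = 19456
n_layers = 56
n_heads  = 32
n_kv     = 8
head_dim = 128        # assumed (Qwen3-4B convention)
vocab    = 151_936    # assumed (Qwen3 tokenizer size)

q_out  = n_heads * head_dim   # 4096
kv_out = n_kv * head_dim      # 1024

attn = d_model * q_out + q_out * d_model + 2 * d_model * kv_out  # Q, O, K, V
ffn  = 3 * d_model * d_ffn                                       # gate + up + down
total = n_layers * (attn + ffn) + 2 * vocab * d_model            # + embed + lm_head

print(f"{total / 1e9:.2f}B")  # 10.61B
```

The estimate lands on 10.61B only with untied embeddings, which suggests the expanded model keeps a separate LM head.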

Growth Pipeline

Qwen3-4B (4.02B, 36L)
    │  Width expansion (d_FFN: 9728 → 19456, SNR = 10 dB)
    ▼
Width-Expanded (7.10B, 36L)
    │  Partial depth stacking (layers 14–34 × 2)
    ▼
OpenSeek-Mid-v1 (10.61B, 56L)
    │  Continual pretraining with Muon (2T tokens)
    ▼
Final Model
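The two growth operators can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the released recipe: we assume the new FFN columns are noisy duplicates of existing ones (noise scaled to the stated 10 dB SNR) and that "layers 14–34" is 0-indexed with an exclusive end, so 20 blocks are duplicated (36 + 20 = 56).

```python
import numpy as np

rng = np.random.default_rng(0)

def widen_ffn(w, new_cols, snr_db=10.0):
    # Duplicate existing columns up to the new width, perturbing the copies
    # with Gaussian noise at the given SNR so they break symmetry in training.
    reps = w[:, : new_cols - w.shape[1]]
    snr = 10.0 ** (snr_db / 10.0)
    noise_std = np.sqrt(reps.var() / snr)
    noise = rng.standard_normal(reps.shape, dtype=np.float32) * noise_std
    return np.concatenate([w, reps + noise], axis=1)

def stack_partial(layers, start, stop):
    # Insert a duplicate of layers[start:stop] right after the original block
    # (0-indexed, end-exclusive).
    return layers[:stop] + layers[start:stop] + layers[stop:]

# Stand-in for one FFN up-projection of Qwen3-4B (d_model x d_FFN).
w_up = rng.standard_normal((2560, 9728), dtype=np.float32)
w_wide = widen_ffn(w_up, 19456)        # 9728 -> 19456
layers = list(range(36))               # 36 transformer blocks
grown = stack_partial(layers, 14, 34)  # duplicate 20 blocks -> 56

print(w_wide.shape, len(grown))  # (2560, 19456) 56
```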

Training

Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | Muon |
| Sequence Length | 8192 |
| Global Batch Size | 2048 sequences |
| Peak Learning Rate | 1e-4 |
| LR Schedule | Cosine with linear warmup |
| Warmup Steps | 1000 |
| Weight Decay | 0.1 |
| Training Framework | FlagScale (FlagOS) |
| Total Training Tokens | ~2.06T |
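The schedule above can be sketched as follows. The total step count (~122,785) is inferred from ~2.06T tokens / (2048 sequences × 8192 tokens per step); the final LR floor is an assumption, since the card does not state one.

```python
import math

def lr_at(step, peak=1e-4, warmup=1000, total=122_785, floor=0.0):
    # Linear warmup to the peak LR, then cosine decay to the floor.
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))

print(lr_at(500), lr_at(1000))  # 5e-05 0.0001
```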

Stage 1: Broad Knowledge Acquisition (1.36T tokens)

Stage 1 Data Mixture

| Category | Proportion | Tokens (B) |
|---|---|---|
| Web | 42% | ~571 |
| Math | 20% | ~272 |
| Code | 20% | ~272 |
| STEM | 15% | ~204 |
| Multilingual | 3% | ~41 |

Stage 2: Capability Specialization (0.70T tokens)

Stage 2 Data Mixture

| Category | Proportion | Tokens (B) | Delta vs. Stage 1 |
|---|---|---|---|
| Web | 35% | ~245 | -7% |
| Math | 20% | ~140 | — |
| Code | 24% | ~168 | +4% |
| STEM | 18% | ~126 | +3% |
| Multilingual | 3% | ~21 | — |
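The two mixtures are internally consistent, which a few lines of arithmetic confirm: each stage's proportions sum to 100%, the per-category token counts follow from the stage budgets (1.36T and 0.70T), and the stages add up to the ~2.06T total reported in the training table.

```python
# Arithmetic check of the two-stage data mixture (token counts in billions).
stage1 = {"Web": 0.42, "Math": 0.20, "Code": 0.20, "STEM": 0.15, "Multilingual": 0.03}
stage2 = {"Web": 0.35, "Math": 0.20, "Code": 0.24, "STEM": 0.18, "Multilingual": 0.03}

assert abs(sum(stage1.values()) - 1.0) < 1e-9
assert abs(sum(stage2.values()) - 1.0) < 1e-9

s1_tokens = {k: round(v * 1360) for k, v in stage1.items()}  # 1.36T budget
s2_tokens = {k: round(v * 700) for k, v in stage2.items()}   # 0.70T budget
total_T = (sum(s1_tokens.values()) + sum(s2_tokens.values())) / 1000

print(s1_tokens["Web"], s2_tokens["Code"], total_T)  # 571 168 2.06
```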

Detailed Dataset Composition

Stage 1 (%) and Stage 2 (%) denote each dataset's sampling weight within the respective stage. "—" indicates the dataset is not used in that stage.

Web

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-HQ-Syn | 798.41 | 23.24 | 19.36 |
| Nemotron-CC-v2-Diverse-QA (×5 shards) | 340.81 | 9.92 | 8.26 |
| Nemotron-CC-v2-HQ (×5 shards) | 303.82 | 8.84 | 7.36 |
| dolmino-mix-1124-wiki | 3.82 | 0.15 | 0.18 |
| dolmino-mix-1124-stackexchange | 1.30 | 0.05 | 0.06 |

Math

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-SFT-MATH | 207.46 | 11.70 | 11.70 |
| Nemotron-CC-Math-v1-4plus-MIND | 74.34 | 4.19 | 4.19 |
| Nemotron-CC-Math-v1-4plus | 53.37 | 3.01 | 3.01 |
| Dolmino-math | 11.17 | 0.63 | 0.63 |
| OpenMathInstruct-2 | 5.30 | 0.30 | 0.30 |
| OpenMathReasoning-4k | 2.48 | 0.14 | 0.14 |
| NuminaMath-1.5 | 0.38 | 0.02 | 0.02 |

Code

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Code-v1-Syn | 171.53 | 9.05 | 10.86 |
| Nemotron-SFT-Code | 57.47 | 3.03 | 3.64 |
| stack-edu-Java | 31.70 | 1.06 | 1.27 |
| stack-edu-Markdown | 26.64 | 0.38 | 0.45 |
| stack-edu-Python | 18.27 | 1.54 | 1.85 |
| stack-edu-Cpp | 12.62 | 1.11 | 1.33 |
| stack-edu-JavaScript | 8.99 | 1.00 | 1.20 |
| stack-edu-SQL | 8.23 | 0.37 | 0.44 |
| github-issue | 8.46 | 0.25 | 0.30 |
| stack-edu-PHP | 7.43 | 0.25 | 0.30 |
| stack-edu-CSharp | 7.26 | 0.37 | 0.44 |
| stack-edu-C | 4.80 | 0.43 | 0.52 |
| stack-edu-Shell | 2.60 | 0.01 | 0.01 |
| stack-edu-TypeScript | 2.51 | 0.18 | 0.22 |
| OpenCodeInstruct | 1.59 | — | 0.10 |
| stack-edu-Swift | 1.53 | 0.06 | 0.07 |
| stack-edu-Rust | 1.45 | 0.05 | 0.06 |
| stack-edu-Go | 1.42 | 0.03 | 0.04 |
| kaggle-notebooks | 1.42 | 0.65 | 0.78 |
| stack-edu-Ruby | 1.36 | 0.01 | 0.01 |
| OpenCodeReasoning-2-cpp-4k | 0.76 | 0.04 | 0.05 |
| OpenCodeReasoning-2-python-4k | 0.58 | 0.03 | 0.04 |
| github-code-review | 0.32 | — | 0.02 |

STEM & Science

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Specialized-v1 (×4 shards) | 276.83 | 10.55 | 12.73 |
| Nemotron-Pretraining-SFT-v1-General | 86.93 | 3.31 | 4.00 |
| dolmino-mix-1124-pes2o | 60.19 | 0.50 | 0.50 |
| Nemotron-Pretraining-Specialized-v1.1 | 9.04 | — | 0.42 |
| OpenScienceReasoning-2-4k | 1.72 | 0.07 | 0.08 |
| MegaScience | 0.98 | 0.04 | 0.04 |

Multilingual

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-Translated-Diverse-QA | 135.80 | 1.74 | 1.74 |
| CCI4_0-Zh-High | 98.76 | 1.26 | 1.26 |

Checkpoint Merging

The final model is a weighted average of 5 complementary checkpoints, each selected for a unique strength:

| Checkpoint | Weight | Role | Key Metric |
|---|---|---|---|
| iter 169984 | 0.30 | Code anchor | MBPP 78.84 |
| iter 219136 | 0.25 | Reasoning lead | GPQA-d 44.39 |
| iter 174080 | 0.15 | Code peak | EvalPlus 68.88 |
| iter 190464 | 0.15 | Math bridge | GPQA-d 42.86 |
| iter 217088 | 0.15 | General boost | BBH 82.84 |
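Weighted checkpoint averaging amounts to one convex combination per parameter tensor. A minimal sketch, using the weights from the table (which sum to 1.0) and toy state dicts in place of real checkpoints:

```python
import numpy as np

# Per-checkpoint merge weights from the table above; they sum to 1.0.
weights = {169984: 0.30, 219136: 0.25, 174080: 0.15, 190464: 0.15, 217088: 0.15}
assert abs(sum(weights.values()) - 1.0) < 1e-12

# Toy stand-ins: real checkpoints map parameter names to full-size tensors.
rng = np.random.default_rng(0)
ckpts = {it: {"w": rng.normal(size=(4, 4))} for it in weights}

# Elementwise weighted average over matching parameter names.
merged = {name: sum(weights[it] * ckpts[it][name] for it in weights)
          for name in ckpts[169984]}
```

Because the weights form a convex combination, the merged tensors stay in the span of the five checkpoints rather than drifting in scale.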

Evaluation Results

All evaluations conducted via lm-eval-harness with consistent settings.

| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3.5-9B | Nemotron-12B | Gemma3-12B | Qwen3-14B | OpenSeek-Mid-v1 |
|---|---|---|---|---|---|---|---|
| Training tokens | 36T | 36T | 36T | 20T | 12T | 36T | 2T |
| MMLU (5-shot) | 72.72 | 76.57 | 78.64 | 78.07 | 73.28 | 80.57 | 79.31 |
| MMLU-Pro (5-shot CoT) | 49.31 | 52.35 | 58.48 | 57.57 | 41.16 | 56.00 | 66.57 |
| AGIEval-en (0-shot) | 45.92 | 49.09 | 45.15 | 49.20 | 44.89 | 52.83 | 52.18 |
| BBH (3-shot CoT) | 71.20 | 77.75 | 82.23 | 69.65 | 73.78 | 78.71 | 82.55 |
| HellaSwag (5-shot) | 75.36 | 79.47 | 81.04 | 83.13 | 83.45 | 82.05 | 81.81 |
| Winogrande (5-shot) | 71.90 | 77.51 | 76.80 | 79.24 | 80.35 | 79.40 | 79.24 |
| PIQA (5-shot) | 78.89 | 81.39 | 81.61 | 82.97 | 81.80 | 83.30 | 83.19 |
| OpenBookQA (5-shot) | 45.00 | 49.00 | 50.00 | 50.20 | 49.60 | 50.80 | 49.80 |
| ARC-C (0-shot) | 51.19 | 56.91 | 56.83 | 60.58 | 64.68 | 59.30 | 62.12 |
| GSM8K (4-shot CoT) | 84.31 | 86.73 | 85.60 | 81.43 | 72.02 | 90.07 | 89.16 |
| MATH (4-shot CoT) | 50.16 | 52.48 | 56.16 | 57.30 | 43.30 | 59.70 | 65.88 |
| GPQA-diamond (3-shot CoT) | 32.65 | 35.71 | 37.76 | 31.12 | 23.47 | 37.76 | 45.41 |
| MBPP (0-shot) | 73.81 | 75.66 | 77.51 | 73.81 | 73.28 | 84.92 | 76.19 |
| EvalPlus Avg (0-shot) | 63.96 | 67.95 | 59.54 | 61.20 | 53.48 | 73.41 | 66.45 |
| Avg General | 62.39 | 66.67 | 67.86 | 65.04 | 60.98 | 69.22 | 70.75 |
| Avg All | 61.88 | 65.61 | 66.24 | 65.39 | 61.32 | 69.20 | 69.99 |
  • Avg General: average of knowledge, reasoning, and commonsense benchmarks (MMLU, MMLU-Pro, AGIEval-en, BBH, HellaSwag, Winogrande, PIQA, OpenBookQA, ARC-C).
  • Avg All: average of all benchmarks above, including math, STEM, and code (+ GSM8K, MATH, GPQA-diamond, MBPP, EvalPlus Avg).

Citation

If you find this work useful, please cite:

@misc{openseek-mid-v1,
  title={OpenSeek-Mid-v1: Efficient Language Model Scaling via Seed Model Expansion},
  year={2026},
  note={Technical report coming soon}
}

Acknowledgements

This project was built using open-source data and tools, including NemotronCC-v2, Stack-Edu, Dolmino, CCI, OpenMathInstruct, OpenCodeReasoning, and FlagOS.
