License: apache-2.0
OpenSeek-Mid-v1
OpenSeek-Mid-v1 is a 10.61-billion-parameter language model grown from Qwen3-4B-Base through a two-stage model expansion pipeline and trained on only 2 trillion tokens of fully open-source data.
Despite having 25% fewer parameters and using 18x less training data, OpenSeek-Mid-v1 matches or surpasses Qwen3-14B-Base across multiple benchmarks.
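For a quick smoke test, the snippet below loads the model with the standard transformers causal-LM API. The repo id is a placeholder and the exact path may differ; since this is a base model, a plain completion prompt is used.

```python
# Minimal inference sketch; assumes the checkpoint loads with the standard
# transformers causal-LM API. The repo id below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenSeek/OpenSeek-Mid-v1"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "The key idea behind model growth is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```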
Highlights
- Model Growth, Not From-Scratch Training: Grown from Qwen3-4B via width expansion + partial depth stacking, inheriting the seed model's learned representations.
- Extreme Data Efficiency: Matches Qwen3-14B-Base (~36T tokens) with only 2T tokens of training, an 18x reduction in data requirement.
- Muon Optimizer: Spectral whitening ensures expanded dimensions are effectively utilized, delivering significant gains over AdamW in the model growth setting.
- Fully Open-Source Data: All training data comes from publicly available datasets (NemotronCC-v2, Stack-Edu, Dolmino, CCI, etc.).
Architecture
| Specification | Value |
|---|---|
| Parameters | 10.61B |
| Layers | 56 |
| Hidden Size (d_model) | 2560 |
| FFN Intermediate Size (d_FFN) | 19456 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Sequence Length | 8192 |
| Vocabulary Size | same as Qwen3-4B |
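For reference, a minimal sketch of how these dimensions map onto a Qwen3-style transformers config. Fields not listed above (vocabulary size, RoPE settings, context window) are simply inherited from the Qwen3-4B base config; this is an illustration, not the released config file.

```python
# Illustrative only: shape an OpenSeek-Mid-v1-like config from the Qwen3-4B base config.
# Unlisted fields (vocab size, RoPE theta, context window, ...) are inherited unchanged.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-4B-Base")
config.num_hidden_layers = 56      # 36 seed layers + 20 duplicated layers (depth stacking)
config.hidden_size = 2560          # d_model, unchanged from Qwen3-4B
config.intermediate_size = 19456   # d_FFN doubled from 9728 by width expansion
config.num_attention_heads = 32
config.num_key_value_heads = 8     # grouped-query attention
print(config)
```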
Growth Pipeline
- Qwen3-4B (4.02B, 36L)
- Width expansion (d_FFN: 9728 → 19456, SNR = 10 dB) → Width-Expanded (7.10B, 36L)
- Partial depth stacking (layers 14–34 × 2) → OpenSeek-Mid-v1 (10.61B, 56L)
- Continual pretraining with Muon (2T tokens) → Final Model
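The two growth operators can be pictured as in the sketch below. This is a hedged illustration under simple assumptions (duplicate-and-perturb width expansion at the stated SNR, duplication of a contiguous layer block); it is not the project's actual growth code, and details such as expanding the matching FFN down-projection are only noted in the comments.

```python
# Illustrative sketch of the two growth operators; not the project's implementation.
import copy
import torch

def expand_ffn_rows(weight: torch.Tensor, snr_db: float = 10.0) -> torch.Tensor:
    """Duplicate the rows of an FFN up/gate projection (d_FFN x d_model) and perturb
    the copy with Gaussian noise at the given signal-to-noise ratio (10 dB here).
    The matching down projection (d_model x d_FFN) must have its columns expanded
    as well (and is typically rescaled to preserve the function); that step is omitted."""
    noise_std = weight.std() * 10 ** (-snr_db / 20)   # noise level implied by the SNR
    noisy_copy = weight + noise_std * torch.randn_like(weight)
    return torch.cat([weight, noisy_copy], dim=0)     # d_FFN: 9728 -> 19456

def stack_layers(layers: list, start: int = 14, end: int = 34) -> list:
    """Partial depth stacking: insert copies of layers[start:end] (20 layers) right
    after the originals, growing 36 layers into 56."""
    duplicated = [copy.deepcopy(layer) for layer in layers[start:end]]
    return layers[:end] + duplicated + layers[end:]
```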
Training
Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | Muon |
| Sequence Length | 8192 |
| Global Batch Size | 2048 sequences |
| Peak Learning Rate | 1e-4 |
| LR Schedule | Cosine with linear warmup |
| Warmup Steps | 1000 |
| Weight Decay | 0.1 |
| Training Framework | FlagScale (FlagOS) |
| Total Training Tokens | ~2.06T |
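Muon orthogonalizes the momentum of each 2-D weight matrix before applying it (the "spectral whitening" referenced in the highlights), typically via a few Newton-Schulz iterations. The sketch below follows the commonly published formulation of Muon and is not necessarily the exact FlagScale implementation; per-matrix scale factors and the handling of embeddings and 1-D parameters (which usually stay on AdamW) are omitted.

```python
# Core of a Muon-style update for one 2-D weight matrix (public formulation, simplified).
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace g by an orthogonal matrix spanning the same subspace."""
    a, b, c = 3.4445, -4.7750, 2.0315            # quintic iteration coefficients
    x = g / (g.norm() + 1e-7)                    # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum, lr=1e-4, beta=0.95, weight_decay=0.1):
    """One step: momentum accumulation, orthogonalized update, decoupled weight decay."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    weight.mul_(1.0 - lr * weight_decay).add_(update, alpha=-lr)
```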
Stage 1: Broad Knowledge Acquisition (1.36T tokens)
Stage 1 Data Mixture
| Category | Proportion | Tokens (B) |
|---|---|---|
| Web | 42% | ~571B |
| Math | 20% | ~272B |
| Code | 20% | ~272B |
| STEM | 15% | ~204B |
| Multilingual | 3% | ~41B |
Stage 2: Capability Specialization (0.70T tokens)
Stage 2 Data Mixture
| Category | Proportion | Tokens (B) | Delta vs. Stage 1 |
|---|---|---|---|
| Web | 35% | ~245B | -7% |
| Math | 20% | ~140B | 0% |
| Code | 24% | ~168B | +4% |
| STEM | 18% | ~126B | +3% |
| Multilingual | 3% | ~21B | 0% |
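The token budgets above follow directly from the stage size and the mixture proportions; a quick arithmetic check for Stage 2:

```python
# Reproduce the Stage 2 token budgets from the mixture proportions (simple arithmetic check).
stage2_tokens_b = 700  # 0.70T tokens, in billions
stage2_mixture = {"Web": 0.35, "Math": 0.20, "Code": 0.24, "STEM": 0.18, "Multilingual": 0.03}

for category, proportion in stage2_mixture.items():
    print(f"{category:>12}: ~{stage2_tokens_b * proportion:.0f}B tokens")
# -> Web ~245B, Math ~140B, Code ~168B, STEM ~126B, Multilingual ~21B
```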
Detailed Dataset Composition
Stage 1 (%) and Stage 2 (%) denote each dataset's sampling weight within the respective stage. "–" indicates the dataset is not used in that stage.
Web
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-HQ-Syn | 798.41 | 23.24 | 19.36 |
| Nemotron-CC-v2-Diverse-QA (×5 shards) | 340.81 | 9.92 | 8.26 |
| Nemotron-CC-v2-HQ (×5 shards) | 303.82 | 8.84 | 7.36 |
| dolmino-mix-1124-wiki | 3.82 | 0.15 | 0.18 |
| dolmino-mix-1124-stackexchange | 1.30 | 0.05 | 0.06 |
Math
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-SFT-MATH | 207.46 | 11.70 | 11.70 |
| Nemotron-CC-Math-v1-4plus-MIND | 74.34 | 4.19 | 4.19 |
| Nemotron-CC-Math-v1-4plus | 53.37 | 3.01 | 3.01 |
| Dolmino-math | 11.17 | 0.63 | 0.63 |
| OpenMathInstruct-2 | 5.30 | 0.30 | 0.30 |
| OpenMathReasoning-4k | 2.48 | 0.14 | 0.14 |
| NuminaMath-1.5 | 0.38 | 0.02 | 0.02 |
Code
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Code-v1-Syn | 171.53 | 9.05 | 10.86 |
| Nemotron-SFT-Code | 57.47 | 3.03 | 3.64 |
| stack-edu-Java | 31.70 | 1.06 | 1.27 |
| stack-edu-Markdown | 26.64 | 0.38 | 0.45 |
| stack-edu-Python | 18.27 | 1.54 | 1.85 |
| stack-edu-Cpp | 12.62 | 1.11 | 1.33 |
| stack-edu-JavaScript | 8.99 | 1.00 | 1.20 |
| stack-edu-SQL | 8.23 | 0.37 | 0.44 |
| github-issue | 8.46 | 0.25 | 0.30 |
| stack-edu-PHP | 7.43 | 0.25 | 0.30 |
| stack-edu-CSharp | 7.26 | 0.37 | 0.44 |
| stack-edu-C | 4.80 | 0.43 | 0.52 |
| stack-edu-Shell | 2.60 | 0.01 | 0.01 |
| stack-edu-TypeScript | 2.51 | 0.18 | 0.22 |
| OpenCodeInstruct | 1.59 | – | 0.10 |
| stack-edu-Swift | 1.53 | 0.06 | 0.07 |
| stack-edu-Rust | 1.45 | 0.05 | 0.06 |
| stack-edu-Go | 1.42 | 0.03 | 0.04 |
| kaggle-notebooks | 1.42 | 0.65 | 0.78 |
| stack-edu-Ruby | 1.36 | 0.01 | 0.01 |
| OpenCodeReasoning-2-cpp-4k | 0.76 | 0.04 | 0.05 |
| OpenCodeReasoning-2-python-4k | 0.58 | 0.03 | 0.04 |
| github-code-review | 0.32 | – | 0.02 |
STEM & Science
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Specialized-v1 (×4 shards) | 276.83 | 10.55 | 12.73 |
| Nemotron-Pretraining-SFT-v1-General | 86.93 | 3.31 | 4.00 |
| dolmino-mix-1124-pes2o | 60.19 | 0.50 | 0.50 |
| Nemotron-Pretraining-Specialized-v1.1 | 9.04 | – | 0.42 |
| OpenScienceReasoning-2-4k | 1.72 | 0.07 | 0.08 |
| MegaScience | 0.98 | 0.04 | 0.04 |
Multilingual
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-Translated-Diverse-QA | 135.80 | 1.74 | 1.74 |
| CCI4_0-Zh-High | 98.76 | 1.26 | 1.26 |
Checkpoint Merging
The final model is a weighted average of 5 complementary checkpoints, each selected for a unique strength:
| Checkpoint | Weight | Role | Key Metric |
|---|---|---|---|
| iter 169984 | 0.30 | Code anchor | MBPP 78.84 |
| iter 219136 | 0.25 | Reasoning lead | GPQA-d 44.39 |
| iter 174080 | 0.15 | Code peak | EvalPlus 68.88 |
| iter 190464 | 0.15 | Math bridge | GPQA-d 42.86 |
| iter 217088 | 0.15 | General boost | BBH 82.84 |
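A weighted parameter average of this kind can be computed as in the generic model-souping sketch below. The file names are hypothetical and this is not the project's actual merging script; it assumes all five checkpoints share the same architecture and state-dict keys.

```python
# Generic weighted checkpoint average ("souping"); file names are hypothetical.
import torch

checkpoints = {                    # checkpoint path -> merge weight (from the table above)
    "iter_169984.pt": 0.30,
    "iter_219136.pt": 0.25,
    "iter_174080.pt": 0.15,
    "iter_190464.pt": 0.15,
    "iter_217088.pt": 0.15,
}

merged = None
for path, weight in checkpoints.items():
    state = torch.load(path, map_location="cpu")
    if merged is None:
        merged = {k: weight * v.float() for k, v in state.items()}
    else:
        for k, v in state.items():
            merged[k] += weight * v.float()

torch.save(merged, "openseek_mid_v1_merged.pt")
```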
Evaluation Results
All evaluations were conducted with lm-eval-harness under consistent settings across models.
| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3.5-9B | Nemotron-12B | Gemma3-12B | Qwen3-14B | OpenSeek-Mid-v1 |
|---|---|---|---|---|---|---|---|
| Training tokens | 36T | 36T | 36T | 20T | 12T | 36T | 2T |
| MMLU (5-shot) | 72.72 | 76.57 | 78.64 | 78.07 | 73.28 | 80.57 | 79.31 |
| MMLU-Pro (5-shot CoT) | 49.31 | 52.35 | 58.48 | 57.57 | 41.16 | 56.00 | 66.57 |
| AGIEval-en (0-shot) | 45.92 | 49.09 | 45.15 | 49.20 | 44.89 | 52.83 | 52.18 |
| BBH (3-shot CoT) | 71.20 | 77.75 | 82.23 | 69.65 | 73.78 | 78.71 | 82.55 |
| HellaSwag (5-shot) | 75.36 | 79.47 | 81.04 | 83.13 | 83.45 | 82.05 | 81.81 |
| Winogrande (5-shot) | 71.90 | 77.51 | 76.80 | 79.24 | 80.35 | 79.40 | 79.24 |
| PIQA (5-shot) | 78.89 | 81.39 | 81.61 | 82.97 | 81.80 | 83.30 | 83.19 |
| OpenBookQA (5-shot) | 45.00 | 49.00 | 50.00 | 50.20 | 49.60 | 50.80 | 49.80 |
| ARC-C (0-shot) | 51.19 | 56.91 | 56.83 | 60.58 | 64.68 | 59.30 | 62.12 |
| GSM8K (4-shot CoT) | 84.31 | 86.73 | 85.60 | 81.43 | 72.02 | 90.07 | 89.16 |
| MATH (4-shot CoT) | 50.16 | 52.48 | 56.16 | 57.30 | 43.30 | 59.70 | 65.88 |
| GPQA-diamond (3-shot CoT) | 32.65 | 35.71 | 37.76 | 31.12 | 23.47 | 37.76 | 45.41 |
| MBPP (0-shot) | 73.81 | 75.66 | 77.51 | 73.81 | 73.28 | 84.92 | 76.19 |
| EvalPlus Avg (0-shot) | 63.96 | 67.95 | 59.54 | 61.20 | 53.48 | 73.41 | 66.45 |
| Avg General | 62.39 | 66.67 | 67.86 | 65.04 | 60.98 | 69.22 | 70.75 |
| Avg All | 61.88 | 65.61 | 66.24 | 65.39 | 61.32 | 69.20 | 69.99 |
- Avg General: average of knowledge, reasoning, and commonsense benchmarks (MMLU, MMLU-Pro, AGIEval-en, BBH, HellaSwag, Winogrande, PIQA, OpenBookQA, ARC-C).
- Avg All: average of all benchmarks above, including math, STEM, and code (+ GSM8K, MATH, GPQA-diamond, MBPP, EvalPlus Avg).
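A subset of these numbers can be re-run from Python with lm-eval-harness (v0.4+). The call below is only a hedged example: the exact harness version, batch size, and generation settings behind the reported scores are not specified here, and the repo id is a placeholder.

```python
# Hedged reproduction sketch with lm-eval-harness; settings here are illustrative only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=OpenSeek/OpenSeek-Mid-v1,dtype=bfloat16",  # placeholder repo id
    tasks=["mmlu"],      # MMLU, 5-shot, matching the table row above
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```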
Citation
If you find this work useful, please cite:
    @misc{openseek-mid-v1,
      title={OpenSeek-Mid-v1: Efficient Language Model Scaling via Seed Model Expansion},
      year={2026},
      note={Technical report coming soon}
    }
Acknowledgements
This project was built using open-source data and tools, including NemotronCC-v2, Stack-Edu, Dolmino, CCI, OpenMathInstruct, OpenCodeReasoning, and FlagOS.