richardprobe committed on
Commit a552f64 · verified · 1 Parent(s): 07781ae

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +58 -0
  2. meta.json +14 -0
  3. model.pt +3 -0
  4. report.md +269 -0
README.md ADDED
@@ -0,0 +1,58 @@
# NanoChat Model

A 561M parameter language model trained using the NanoChat framework.

## Model Details

- **Architecture**: Transformer d20 (20 layers, 561M parameters)
- **Training Pipeline**: Pretraining → Midtraining → Supervised Fine-Tuning (SFT)
- **Tokenizer**: Custom BPE tokenizer with a 65,536-token vocabulary
- **Training Hardware**: 8x NVIDIA A100-SXM4-80GB (see `report.md`)
- **Training Time**: ~8 hours wall clock (see `report.md`)

## Files

- `model.pt`: The final SFT checkpoint
- `meta.json`: Model metadata
- `tokenizer.model`: Custom BPE tokenizer
- `report.md`: Full training report (if available)

## Usage

To use this model with NanoChat:

```python
import torch

# Load the model
checkpoint = torch.load('model.pt', map_location='cpu')
model_state_dict = checkpoint['model']
```

For full usage instructions, see the [NanoChat repository](https://github.com/karpathy/nanochat).
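
The checkpoint stores only the weights; the architecture needed to rebuild the module lives in `meta.json` (shown later in this commit). A minimal sketch using only `json` and `torch` for inspection — instantiating the actual model still requires the model code from the nanochat repository — might look like this:

```python
import json
import torch

# Architecture hyperparameters stored next to the checkpoint (meta.json in this repo).
with open('meta.json') as f:
    meta = json.load(f)
cfg = meta['model_config']
print(f"d{cfg['n_layer']}: {cfg['n_embd']} dims, {cfg['n_head']} heads, vocab {cfg['vocab_size']}")

# The SFT checkpoint keeps the weights under the 'model' key.
# (On recent PyTorch you may need weights_only=False if the file contains non-tensor objects.)
checkpoint = torch.load('model.pt', map_location='cpu')
state_dict = checkpoint['model']

# Rough parameter count straight from the state dict; it should come out near 561M.
n_params = sum(t.numel() for t in state_dict.values())
print(f"~{n_params / 1e6:.0f}M parameters across {len(state_dict)} tensors")
```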

## Training Configuration

The model was trained using the speedrun configuration with:
- Batch size tuned for an 8-GPU node
- Mixed precision (bfloat16) training
- Custom curriculum through pretraining, midtraining, and SFT phases

## Performance

See `report.md` for detailed performance metrics and evaluation results.

## License

This model inherits the license from the NanoChat project.

## Citation

If you use this model, please cite the original NanoChat repository:
```
@software{nanochat,
  author = {Andrej Karpathy},
  title  = {NanoChat},
  url    = {https://github.com/karpathy/nanochat}
}
```
meta.json ADDED
@@ -0,0 +1,14 @@
{
  "step": 650,
  "val_loss": 1.0146265029907227,
  "mmlu_acc": 0.3037109375,
  "arc_easy_acc": 0.33203125,
  "model_config": {
    "sequence_len": 2048,
    "vocab_size": 65536,
    "n_layer": 20,
    "n_head": 10,
    "n_kv_head": 10,
    "n_embd": 1280
  }
}
model.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b70d8dd66d4d0c73aa1d6076876a263d1f8cfa3a93aa0f4edf6d4fe4a7003b53
size 2076230219
report.md ADDED
@@ -0,0 +1,269 @@
# nanochat training report

Generated: 2025-10-19 19:54:27

## Environment

### Git Information
- Branch: master
- Commit: d6d86cb (dirty)
- Message: update readme with a link to the CPU|MPS branch

### Hardware
- Platform: Linux
- CPUs: 240 cores (240 logical)
- Memory: 1771.7 GB
- GPUs: 8x NVIDIA A100-SXM4-80GB
- GPU Memory: 634.0 GB total
- CUDA Version: 12.8
- Hourly Rate: $14.32/hour

### Software
- Python: 3.10.12
- PyTorch: 2.8.0+cu128

### Bloat
- Characters: 357,831
- Lines: 8,718
- Files: 44
- Tokens (approx): 89,457
- Dependencies (uv.lock lines): 2,004

Run started: 2025-10-19 19:54:32

---

## Tokenizer training
timestamp: 2025-10-19 19:56:03

- max_chars: 2,000,000,000
- doc_cap: 10,000
- vocab_size: 65,536
- train_time: 71.4154
- num_special_tokens: 9
- token_bytes_min: 1
- token_bytes_max: 32
- token_bytes_mean: 6.9197
- token_bytes_std: 2.8748

## Tokenizer evaluation
timestamp: 2025-10-19 19:56:15

### Comparison with GPT-2

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
| korean | 893 | 745 | 1.20 | 712 | 1.25 | +4.4% |
| code | 1259 | 576 | 2.19 | 492 | 2.56 | +14.6% |
| math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
| science | 1112 | 260 | 4.28 | 228 | 4.88 | +12.3% |
| fwe-train | 4208518 | 900364 | 4.67 | 856883 | 4.91 | +4.8% |
| fwe-val | 4908443 | 1059062 | 4.63 | 1010352 | 4.86 | +4.6% |

### Comparison with GPT-4

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
| korean | 893 | 364 | 2.45 | 712 | 1.25 | -95.6% |
| code | 1259 | 309 | 4.07 | 492 | 2.56 | -59.2% |
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
| science | 1112 | 249 | 4.47 | 228 | 4.88 | +8.4% |
| fwe-train | 4208518 | 874799 | 4.81 | 856883 | 4.91 | +2.0% |
| fwe-val | 4908443 | 1029691 | 4.77 | 1010352 | 4.86 | +1.9% |
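
The ratio and relative-diff columns are derived directly from the byte and token counts: the ratio is bytes per token (higher means better compression), and the relative diff is the token-count saving versus the baseline tokenizer. A small sketch reproducing two rows of the GPT-4 table above:

```python
# Reproduce the derived columns of the tokenizer comparison tables.
def ratio(n_bytes: int, n_tokens: int) -> float:
    """Compression ratio: bytes encoded per token."""
    return n_bytes / n_tokens

def relative_diff(baseline_tokens: int, ours_tokens: int) -> float:
    """Token-count saving relative to the baseline tokenizer, in percent."""
    return 100.0 * (baseline_tokens - ours_tokens) / baseline_tokens

# 'news' row: 1819 bytes, 387 GPT-4 tokens vs 375 of ours.
print(ratio(1819, 387), ratio(1819, 375), relative_diff(387, 375))   # ~4.70, ~4.85, ~+3.1

# 'korean' row: our tokenizer needs far more tokens than GPT-4 here.
print(relative_diff(364, 712))                                       # ~-95.6
```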

## Base model training
timestamp: 2025-10-20 03:02:00

- run: speedrun
- depth: 20
- max_seq_len: 2048
- num_iterations: -1
- target_flops: -1.0000
- target_param_data_ratio: 20
- device_batch_size: 32
- total_batch_size: 524,288
- embedding_lr: 0.2000
- unembedding_lr: 0.0040
- weight_decay: 0.0000
- matrix_lr: 0.0200
- grad_clip: 1.0000
- eval_every: 250
- eval_tokens: 10,485,760
- core_metric_every: 2000
- core_metric_max_per_task: 500
- sample_every: 2000
- model_tag:
- Number of parameters: 560,988,160
- Number of FLOPs per token: 3.491758e+09
- Calculated number of iterations: 21,400
- Number of training tokens: 11,219,763,200
- Tokens : Params ratio: 20.0000
- DDP world size: 8
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- Minimum validation bpb: 0.8143
- Final validation bpb: 0.8143
- CORE metric estimate: 0.2133
- MFU %: 21.02%
- Total training flops: 3.917670e+19
- Total training time: 394.42m
- Peak memory usage: 75374.27MiB
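
The schedule above is internally consistent and can be re-derived from the parameter count, the tokens:params target of 20, and the batch size; a short sketch using only the values reported in this section:

```python
# Re-derive the base-training schedule from the reported values.
n_params        = 560_988_160      # Number of parameters
tokens_per_step = 524_288          # total_batch_size (tokens per optimizer step)
data_ratio      = 20               # target_param_data_ratio (tokens : params)
flops_per_token = 3.491758e9       # Number of FLOPs per token

train_tokens = data_ratio * n_params                # 11,219,763,200 training tokens
iterations   = train_tokens // tokens_per_step      # 21,400 iterations
total_flops  = flops_per_token * train_tokens       # ~3.918e19 total training FLOPs

print(f"{train_tokens:,} tokens, {iterations:,} steps, {total_flops:.3e} FLOPs")
```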

## Base model loss
timestamp: 2025-10-20 03:03:28

- train bpb: 0.8171
- val bpb: 0.8144
- sample 0: <|bos|>The capital of France is Paris. The capital of France is Paris. The capital of France is Paris.
- sample 1: <|bos|>The chemical symbol of gold is Au. The chemical symbol of gold is Au. The chemical symbol of gold is
- sample 2: <|bos|>If yesterday was Friday, then tomorrow will be Friday, and so on. This is a very common way of thinking about the
- sample 3: <|bos|>The opposite of hot is cold. The opposite of cold is hot. The opposite of hot is cold.
- sample 4: <|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,
- sample 5: <|bos|>My favorite color is red. I love it. I love it. I love it. I love
- sample 6: <|bos|>If 5*x + 3 = 13, then x is a multiple of 5. If 5*x + 3 =

## Base model evaluation
timestamp: 2025-10-20 03:10:53

- Model: base_model (step 21400)
- CORE metric: 0.2084
- hellaswag_zeroshot: 0.2626
- jeopardy: 0.1068
- bigbench_qa_wikidata: 0.5118
- arc_easy: 0.5325
- arc_challenge: 0.1274
- copa: 0.4000
- commonsense_qa: 0.0274
- piqa: 0.3645
- openbook_qa: 0.1200
- lambada_openai: 0.3813
- hellaswag: 0.2631
- winograd: 0.2234
- winogrande: 0.0545
- bigbench_dyck_languages: 0.1270
- agi_eval_lsat_ar: 0.0489
- bigbench_cs_algorithms: 0.3545
- bigbench_operators: 0.1429
- bigbench_repeat_copy_logic: 0.0312
- squad: 0.2391
- coqa: 0.2176
- boolq: -0.1267
- bigbench_language_identification: 0.1740
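
The per-task numbers here appear to be centered accuracies in the style of the DCLM CORE benchmark, i.e. raw accuracy rescaled against the random-guessing baseline, which is why a task scored below chance (boolq above) can come out negative. A sketch of that rescaling, under this assumption:

```python
def centered(acc: float, chance: float) -> float:
    """Rescale raw accuracy against the random-guess baseline (CORE-style centering)."""
    return (acc - chance) / (1.0 - chance)

# boolq is a yes/no task (chance = 0.5); a raw accuracy slightly below chance,
# around 0.437, maps to roughly the -0.127 reported above.
print(centered(0.4367, 0.5))   # ~ -0.127
```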

## Midtraining
timestamp: 2025-10-20 03:29:50

- run: speedrun
- dtype: bfloat16
- max_seq_len: 2048
- device_batch_size: 32
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- init_lr_frac: 1.0000
- weight_decay: 0.0000
- eval_every: 150
- eval_tokens: 10,485,760
- total_batch_size: 524,288
- dry_run: 0
- Number of iterations: 765
- DDP world size: 8
- Minimum validation bpb: 0.3963

## Chat evaluation mid
timestamp: 2025-10-20 03:48:39

- source: mid
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 8
- model_tag: None
- step: None
- max_problems: None
- ARC-Easy: 0.3119
- ARC-Challenge: 0.2927
- MMLU: 0.2975
- GSM8K: 0.0402
- HumanEval: 0.0976
- ChatCORE metric: 0.0681

## Chat SFT
timestamp: 2025-10-20 03:53:11

- run: speedrun
- source: mid
- dtype: bfloat16
- device_batch_size: 4
- num_epochs: 1
- max_iterations: -1
- target_examples_per_step: 32
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- weight_decay: 0.0000
- init_lr_frac: 0.0200
- eval_every: 100
- eval_steps: 100
- eval_metrics_every: 200
- Training rows: 20,843
- Number of iterations: 651
- Training loss: 1.1234
- Validation loss: 1.0146
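
The iteration count follows from the dataset size and the effective batch: with `device_batch_size = 4` on the 8-GPU node used throughout this run, each step can consume the targeted 32 examples without gradient accumulation, so one epoch over 20,843 rows comes out to the 651 steps reported (assuming the trailing partial batch is dropped). A quick check:

```python
# SFT step count from the values reported above.
training_rows   = 20_843
device_batch    = 4        # device_batch_size
world_size      = 8        # assumed DDP world size (8 GPUs, as in the earlier stages)
target_per_step = 32       # target_examples_per_step

examples_per_step = device_batch * world_size         # 32, matching target_examples_per_step
steps_per_epoch   = training_rows // examples_per_step
print(examples_per_step, steps_per_epoch)              # 32 651
```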

## Chat evaluation sft
timestamp: 2025-10-20 04:09:28

- source: sft
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 8
- model_tag: None
- step: None
- max_problems: None
- ARC-Easy: 0.3338
- ARC-Challenge: 0.3046
- MMLU: 0.2955
- GSM8K: 0.0599
- HumanEval: 0.1220
- ChatCORE metric: 0.0854

## Summary

- Characters: 357,831
- Lines: 8,718
- Files: 44
- Tokens (approx): 89,457
- Dependencies (uv.lock lines): 2,004

| Metric        | BASE   | MID    | SFT    | RL |
|---------------|--------|--------|--------|----|
| CORE          | 0.2084 | -      | -      | -  |
| ARC-Challenge | -      | 0.2927 | 0.3046 | -  |
| ARC-Easy      | -      | 0.3119 | 0.3338 | -  |
| GSM8K         | -      | 0.0402 | 0.0599 | -  |
| HumanEval     | -      | 0.0976 | 0.1220 | -  |
| MMLU          | -      | 0.2975 | 0.2955 | -  |
| ChatCORE      | -      | 0.0681 | 0.0854 | -  |

Total wall clock time: 8h14m