Evgueni Poloukarov committed on
Commit 20e09fc · 1 Parent(s): 0405814

docs: comprehensive Session 11 summary with GPU upgrade path and blocker status

Files changed (1)
  1. doc/activity.md +180 -19
doc/activity.md CHANGED
@@ -145,36 +145,197 @@ Pushed to GitHub: https://github.com/evgspacdmy/fbmc_chronos2
  - [ ] Calculate MAE D+1 through D+14 (pending forecast)
  - [ ] Document results in activity.md (pending evaluation)

- ### Next Steps (Resume Here Next Session)

- **PRIORITY 1**: Wait for HF Space rebuild or trigger manual Factory Rebuild
- - Go to: https://huggingface.co/spaces/evgueni-p/fbmc-chronos2/settings
- - Click "Factory Rebuild" if auto-rebuild hasn't triggered
- - Wait ~5 minutes for rebuild completion

- **PRIORITY 2**: Validate memory fix works
  ```bash
- cd /c/Users/evgue/projects/fbmc_chronos2
- .venv/Scripts/python.exe scripts/evaluate_october_2024.py
  ```

- **PRIORITY 3**: If successful, proceed with original Day 4 plan
- - Calculate MAE metrics for D+1 through D+14
- - Update activity.md with Session 11 results
- - Create HANDOVER_GUIDE.md
- - Archive test scripts
- - Commit and push final results

- **PRIORITY 4**: If still OOM, consider alternatives:
- - Further reduce batch_size to 16 or 8
- - Reduce context_hours from 256h to 128h
- - Document as limitation requiring A100 40GB GPU for production

  ### Key Learnings

- 1. **Always check default parameters** - Libraries often have defaults optimized for different use cases
  2. **batch_size doesn't affect quality** - It's purely a computational optimization parameter
  3. **Memory usage isn't linear** - Transformer attention creates quadratic memory growth
  4. **Test with realistic data sizes** - Smoke tests (1 border) can hide multi-border issues
  5. **Document assumptions** - User correctly challenged the historical features assumption
  6. **HF Space rebuild delays** - May need manual trigger, not instant after push
 
  - [ ] Calculate MAE D+1 through D+14 (pending forecast)
  - [ ] Document results in activity.md (pending evaluation)

+ ### CRITICAL Git Workflow Issue Discovered
+
+ **Problem**: Code pushed to GitHub but NOT deploying to HF Space

+ **Investigation**:
+ - Local repo uses `master` branch
+ - HF Space uses `main` branch
+ - Was only pushing: `git push origin master` (GitHub only)
+ - HF Space never received the updates!

+ **Solution** (added to CLAUDE.md Rule 30):
  ```bash
+ git push origin master      # Push to GitHub (master branch)
+ git push hf-new master:main # Push to HF Space (main branch) - NOTE: master:main mapping!
  ```

+ **Files Created**:
+ - `DEPLOYMENT_NOTES.md` - Troubleshooting guide for HF Space deployment
+ - Updated `CLAUDE.md` Rule 30 with branch mapping
+
+ **Commits**:
+ - `38f4bc1` - docs: add CRITICAL git workflow rule for HF Space deployment
+ - `caf0333` - docs: update activity.md with Session 11 progress
+ - `7a9aff9` - fix: reduce batch_size to 32 and quantiles to 3 for GPU memory optimization
+
+ ### Deployment Attempts & Results
+
+ #### Attempt 1: Initial batch_size=32 fix (commit 7a9aff9)
+ - Pushed to both remotes with correct branch mapping
+ - Waited 3 minutes for rebuild
+ - **Result**: Space still running OLD code (line 196 traceback, no batch_size parameter)
+
+ #### Attempt 2: Version bump to force rebuild (commit 239885b)
+ - Changed version string: v1.1.0 → v1.2.0
+ - Pushed to both remotes
+ - **Result**: New code deployed! (line 204 traceback confirms torch.inference_mode())
+ - Smoke test (1 border): ✓ SUCCESS
+ - Full forecast (38 borders): ✗ STILL OOM on first border (18.04 GB baseline)
+
+ #### Attempt 3: Reduce context window 256h → 128h (commit 4be9db4)
+ - Reduced `context_hours: int = 256` → `128`
+ - Version bump: v1.2.0 → v1.3.0
+ - **Result**: Memory dropped slightly (17.96 GB), still OOM on first border
+ - **Analysis**: L4 GPU (22 GB) is fundamentally insufficient for the full 38-border workload
+
+ ### GPU Memory Analysis
+
+ **Baseline Memory Usage** (before inference):
+ - Model weights (bfloat16): ~2 GB
+ - Dataset in memory: ~1 GB
+ - **PyTorch workspace cache**: ~15 GB (the main culprit!)
+ - **Total**: ~18 GB
+
+ **Attention Computation Needs**:
+ - Single border attention: 10.75 GB
+ - **Available on L4**: 22 - 18 = 4 GB
+ - **Shortfall**: 10.75 - 4 = 6.75 GB ❌
+
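These figures can be cross-checked at runtime with PyTorch's standard CUDA memory introspection calls; a minimal sketch (the helper name and log format are illustrative, not taken from the project code):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    """Print live tensor allocations, allocator cache, and driver-level free VRAM on device 0."""
    allocated = torch.cuda.memory_allocated() / 1e9  # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 1e9    # allocator cache ("workspace"), includes allocated
    free, total = torch.cuda.mem_get_info()          # free/total bytes reported by the driver
    print(f"[{tag}] allocated={allocated:.2f} GB "
          f"reserved={reserved:.2f} GB free={free / 1e9:.2f}/{total / 1e9:.2f} GB")

log_gpu_memory("baseline")  # e.g. right after model + dataset load, before the first border
```
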
+ **PyTorch Workspace Cache Explanation**:
+ - CUDA Caching Allocator pre-allocates memory for efficiency
+ - Temporary "scratch space" for attention, matmul, convolutions
+ - Set `expandable_segments:True` to reduce fragmentation (line 17)
+ - But on the 22 GB L4, this still leaves only ~4 GB for inference
+
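The `expandable_segments` setting only takes effect if it is in the environment before CUDA is initialized; a sketch of that ordering plus a cache flush between borders (where exactly this lives in `chronos_inference.py` is an assumption):

```python
import os

# Must be set before torch touches CUDA, so keep it at the very top of the module.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (import deliberately placed after the env var)

def release_cached_memory() -> None:
    """Hand the allocator's unused cached blocks back to the driver between borders."""
    torch.cuda.empty_cache()
```
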
+ **Why Smoke Test Succeeds but Full Forecast Fails**:
+ - Smoke test: 1 border × 7 days = smaller memory footprint
+ - Full forecast: 38 borders × 14 days = larger context, hits OOM on the **first** border
+ - Not a border-to-border accumulation issue - the ~18 GB baseline is already too high before the first border runs
+
+ ### GPU Upgrade Path
+
+ #### Attempt 4: Upgrade to A10G-small (24 GB) - commit deace48
+ ```yaml
+ suggested_hardware: l4x1 → a10g-small
+ ```
+ - **Rationale**: 2 GB extra headroom (24 vs 22 GB)
+ - **Result**: Not tested (moved to A100)
+
+ #### Attempt 5: Upgrade to A100-large (40-80 GB) - commit 0405814
+ ```yaml
+ suggested_hardware: a10g-small → a100-large
+ ```
+ - **Rationale**: 40-80 GB VRAM easily handles 18 GB baseline + 11 GB attention
+ - **Result**: **Space PAUSED** - requires higher tier access or manual approval
+
+ ### Current Blocker: HF Space PAUSED
+
+ **Error**:
+ ```
+ ValueError: The current space is in the invalid state: PAUSED.
+ Please contact the owner to fix this.
+ ```

+ **Likely Causes**:
+ 1. A100-large requires Pro/Enterprise tier
+ 2. Billing/quota check triggered
+ 3. Manual approval needed for high-tier GPU
+
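The PAUSED state can be confirmed (and, if the account tier allows, cleared) without the web UI; a sketch using the `huggingface_hub` client, assuming a token with access to the Space is configured locally:

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the locally configured HF token
runtime = api.get_space_runtime("evgueni-p/fbmc-chronos2")
print(runtime.stage, runtime.hardware, runtime.requested_hardware)

if runtime.stage == "PAUSED":
    # Owner-only; will fail again if the requested A100 is not available on the account tier.
    api.restart_space("evgueni-p/fbmc-chronos2")
```
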
+ **Resolution Options** (for tomorrow):
+ 1. **Check HF account tier** - Verify available GPU options
+ 2. **Approve A100 access** - If available on current tier
+ 3. **Downgrade to A10G-large** - 24 GB might be sufficient with optimizations
+ 4. **Process in batches** - Run 5-10 borders at a time on L4 (see the sketch after this list)
+ 5. **Run locally** - If GPU available (requires dataset download)
+
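Option 4 could look roughly like the sketch below; `run_border_forecast` is a hypothetical stand-in for the per-border call in `src/forecasting/chronos_inference.py`, and the chunk size of 5 is arbitrary:

```python
from typing import Callable, Dict, List

import torch

def forecast_in_chunks(
    borders: List[str],
    run_border_forecast: Callable[[str], object],  # hypothetical per-border inference call
    chunk_size: int = 5,
) -> Dict[str, object]:
    """Forecast a few borders at a time, flushing the CUDA allocator cache between chunks."""
    results: Dict[str, object] = {}
    for start in range(0, len(borders), chunk_size):
        for border in borders[start:start + chunk_size]:
            results[border] = run_border_forecast(border)
        torch.cuda.empty_cache()  # release cached blocks before the next chunk
    return results
```
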
+ ### Session 11 Summary
+
+ **Achievements**:
+ - ✓ Identified OOM root cause: default batch_size=256 and 9 quantiles
+ - ✓ Implemented memory optimizations: batch_size=32, 3 quantiles
+ - ✓ Fixed critical git workflow issue (master vs main)
+ - ✓ Created deployment documentation
+ - ✓ Reduced context window 256h → 128h
+ - ✓ Smoke test working (1 border succeeds)
+ - ✓ Identified L4 GPU insufficient for full workload
+
+ **Commits Created** (all pushed to both GitHub and HF Space):
+ ```
+ 0405814 - perf: upgrade to A100-large GPU (40-80GB) for multivariate forecasting
+ deace48 - perf: upgrade to A10G GPU (24GB) for memory headroom
+ 4be9db4 - perf: reduce context window from 256h to 128h to fit L4 GPU memory
+ 239885b - fix: force rebuild with version bump to v1.2.0 (batch_size=32 optimization)
+ 38f4bc1 - docs: add CRITICAL git workflow rule for HF Space deployment
+ caf0333 - docs: update activity.md with Session 11 progress
+ 7a9aff9 - fix: reduce batch_size to 32 and quantiles to 3 for GPU memory optimization
+ ```
+
+ **Files Created/Modified**:
+ - `DEPLOYMENT_NOTES.md` - HF Space troubleshooting guide
+ - `CLAUDE.md` Rule 30 - Mandatory dual-remote push workflow
+ - `README.md` - GPU hardware specification
+ - `src/forecasting/chronos_inference.py` - Memory optimizations
+ - `scripts/evaluate_october_2024.py` - Evaluation script
+
+ **Outstanding Tasks**:
+ - [ ] Resolve HF Space PAUSED status (check tier, approve GPU, or downgrade)
+ - [ ] Complete October 2024 evaluation (38 borders × 14 days)
+ - [ ] Calculate MAE metrics D+1 through D+14
+ - [ ] Create HANDOVER_GUIDE.md for quant analyst
+ - [ ] Archive test scripts to archive/testing/
+ - [ ] Commit and push final results
+
+ ### Next Steps (Resume Here Next Session)
+
+ **PRIORITY 1**: Resolve HF Space PAUSED Status
+ 1. Check HF Space: https://huggingface.co/spaces/evgueni-p/fbmc-chronos2
+ 2. Verify account tier and available GPU options
+ 3. Choose resolution:
+    - **Option A**: Approve A100-large (if available on tier) - RECOMMENDED
+    - **Option B**: Downgrade to `a10g-large` (24 GB) in README.md (a hardware-request sketch follows this list)
+    - **Option C**: Revert to `l4x1` and run batched inference (5-10 borders at a time)
+    - **Option D**: Run evaluation locally if GPU available
+
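If Option A or B is chosen and the tier allows it, the hardware can also be requested through `huggingface_hub` rather than by editing README.md (the call shown targets Option B; `SpaceHardware.A100_LARGE` would correspond to Option A):

```python
from huggingface_hub import HfApi, SpaceHardware

api = HfApi()  # requires a token with write access to the Space
# Request the 24 GB A10G for the fbmc-chronos2 Space.
api.request_space_hardware("evgueni-p/fbmc-chronos2", hardware=SpaceHardware.A10G_LARGE)
```
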
+ **PRIORITY 2**: Complete October 2024 Evaluation
+ 1. Once Space is running, execute:
+ ```bash
+ cd C:/Users/evgue/projects/fbmc_chronos2
+ .venv/Scripts/python.exe scripts/evaluate_october_2024.py
+ ```
+ 2. Verify parquet output (not debug .txt)
+ 3. Check forecast shape: (336 hours, 38 borders × 3 quantiles) - see the check sketch below
+
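A quick check for steps 2-3; the output path and the wide column layout (one column per border/quantile pair) are assumptions about what `evaluate_october_2024.py` writes:

```python
import pandas as pd

# Hypothetical output path - adjust to the file the evaluation script actually writes.
forecast = pd.read_parquet("data/forecasts/october_2024_forecast.parquet")

assert len(forecast) == 336, f"expected 336 hourly rows (14 days), got {len(forecast)}"
assert forecast.shape[1] == 38 * 3, f"expected 114 columns (38 borders x 3 quantiles), got {forecast.shape[1]}"
print(forecast.index.min(), "->", forecast.index.max())
```
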
+ **PRIORITY 3**: Calculate MAE Metrics
+ 1. Run MAE calculation for D+1 through D+14 (see the sketch below)
+ 2. Compare against target: 134 MW (must be <150 MW)
+ 3. Document results in activity.md
+
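A sketch of the per-horizon MAE, assuming hourly P50 forecasts and actuals as DataFrames indexed by timestamp with one column per border (the frame names and evaluation start date are placeholders):

```python
import pandas as pd

def mae_by_horizon(forecast_p50: pd.DataFrame, actuals: pd.DataFrame, start: pd.Timestamp) -> pd.Series:
    """MAE in MW for each horizon day D+1..D+14, averaged across all borders."""
    errors = (forecast_p50 - actuals.reindex_like(forecast_p50)).abs()
    horizon_day = ((errors.index - start) / pd.Timedelta("1D")).astype(int) + 1  # hours 0-23 -> D+1, etc.
    return errors.groupby(horizon_day).mean().mean(axis=1).rename("MAE_MW")

# mae = mae_by_horizon(forecast_p50, actuals, start=pd.Timestamp("2024-10-01"))
# Target from the plan: stay below 150 MW (134 MW reference).
```
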
+ **PRIORITY 4**: Handover Documentation
+ 1. Create `HANDOVER_GUIDE.md` for quant analyst
+ 2. Archive test scripts to `archive/testing/`
+ 3. Final commit and push
+
+ **Key Files for Tomorrow**:
+ - `evaluation_run.log` - Last evaluation attempt logs
+ - `DEPLOYMENT_NOTES.md` - HF Space troubleshooting
+ - `scripts/evaluate_october_2024.py` - Evaluation script
+ - Current Space status: **PAUSED** (A100-large pending approval)
+
+ **Git Status**:
+ - Latest commit: `0405814` (A100-large GPU upgrade)
+ - All changes pushed to both GitHub and HF Space
+ - Branch: master (local) → main (HF Space)

  ### Key Learnings

+ 1. **Always check default parameters** - Libraries often have defaults optimized for different use cases (batch_size=256!)
  2. **batch_size doesn't affect quality** - It's purely a computational optimization parameter
  3. **Memory usage isn't linear** - Transformer attention creates quadratic memory growth
+ 4. **Git branch mapping critical** - Local master ≠ HF Space main, must use `master:main` in push
+ 5. **PyTorch workspace cache** - Pre-allocated memory can consume ~15 GB on large models
+ 6. **GPU sizing matters** - L4 (22 GB) is insufficient for the full multivariate workload; A100 (40-80 GB) needed
  7. **Test with realistic data sizes** - Smoke tests (1 border) can hide multi-border issues
  8. **Document assumptions** - User correctly challenged the historical features assumption
  9. **HF Space rebuild delays** - May need manual trigger, not instant after push