Evgueni Poloukarov committed on
Commit 20e09fc · 1 Parent(s): 0405814

docs: comprehensive Session 11 summary with GPU upgrade path and blocker status

Files changed (1)
  1. doc/activity.md +180 -19
doc/activity.md CHANGED
@@ -145,36 +145,197 @@ Pushed to GitHub: https://github.com/evgspacdmy/fbmc_chronos2
  - [ ] Calculate MAE D+1 through D+14 (pending forecast)
  - [ ] Document results in activity.md (pending evaluation)

- ### Next Steps (Resume Here Next Session)

- **PRIORITY 1**: Wait for HF Space rebuild or trigger manual Factory Rebuild
- - Go to: https://huggingface.co/spaces/evgueni-p/fbmc-chronos2/settings
- - Click "Factory Rebuild" if auto-rebuild hasn't triggered
- - Wait ~5 minutes for rebuild completion

- **PRIORITY 2**: Validate memory fix works
  ```bash
- cd /c/Users/evgue/projects/fbmc_chronos2
- .venv/Scripts/python.exe scripts/evaluate_october_2024.py
  ```

- **PRIORITY 3**: If successful, proceed with original Day 4 plan
- - Calculate MAE metrics for D+1 through D+14
- - Update activity.md with Session 11 results
- - Create HANDOVER_GUIDE.md
- - Archive test scripts
- - Commit and push final results

- **PRIORITY 4**: If still OOM, consider alternatives:
- - Further reduce batch_size to 16 or 8
- - Reduce context_hours from 256h to 128h
- - Document as limitation requiring A100 40GB GPU for production

  ### Key Learnings

- 1. **Always check default parameters** - Libraries often have defaults optimized for different use cases
  2. **batch_size doesn't affect quality** - It's purely a computational optimization parameter
  3. **Memory usage isn't linear** - Transformer attention creates quadratic memory growth
  4. **Test with realistic data sizes** - Smoke tests (1 border) can hide multi-border issues
  5. **Document assumptions** - User correctly challenged the historical features assumption
  6. **HF Space rebuild delays** - May need manual trigger, not instant after push
 
  - [ ] Calculate MAE D+1 through D+14 (pending forecast)
  - [ ] Document results in activity.md (pending evaluation)

+ ### CRITICAL Git Workflow Issue Discovered
+
+ **Problem**: Code pushed to GitHub but NOT deploying to HF Space

+ **Investigation**:
+ - Local repo uses `master` branch
+ - HF Space uses `main` branch
+ - Was only pushing: `git push origin master` (GitHub only)
+ - HF Space never received the updates!

+ **Solution** (added to CLAUDE.md Rule 30):
  ```bash
+ git push origin master      # Push to GitHub (master branch)
+ git push hf-new master:main # Push to HF Space (main branch) - NOTE: master:main mapping!
  ```

+ **Files Created**:
+ - `DEPLOYMENT_NOTES.md` - Troubleshooting guide for HF Space deployment
+ - Updated `CLAUDE.md` Rule 30 with branch mapping
+
+ **Commits**:
+ - `38f4bc1` - docs: add CRITICAL git workflow rule for HF Space deployment
+ - `caf0333` - docs: update activity.md with Session 11 progress
+ - `7a9aff9` - fix: reduce batch_size to 32 and quantiles to 3 for GPU memory optimization
+
+ ### Deployment Attempts & Results
+
+ #### Attempt 1: Initial batch_size=32 fix (commit 7a9aff9)
+ - Pushed to both remotes with correct branch mapping
+ - Waited 3 minutes for rebuild
+ - **Result**: Space still running OLD code (line 196 traceback, no batch_size parameter)
+
+ #### Attempt 2: Version bump to force rebuild (commit 239885b)
+ - Changed version string: v1.1.0 → v1.2.0
+ - Pushed to both remotes
+ - **Result**: New code deployed! (line 204 traceback confirms torch.inference_mode())
+ - Smoke test (1 border): ✓ SUCCESS
+ - Full forecast (38 borders): ✗ STILL OOM on first border (18.04 GB baseline)
+
+ #### Attempt 3: Reduce context window 256h → 128h (commit 4be9db4)
+ - Reduced `context_hours: int = 256` → `128`
+ - Version bump: v1.2.0 → v1.3.0
+ - **Result**: Memory dropped slightly (17.96 GB), still OOM on first border
+ - **Analysis**: L4 GPU (22 GB) is fundamentally insufficient for the full 38-border workload
+
+ ### GPU Memory Analysis
+
+ **Baseline Memory Usage** (before inference):
+ - Model weights (bfloat16): ~2 GB
+ - Dataset in memory: ~1 GB
+ - **PyTorch workspace cache**: ~15 GB (the main culprit!)
+ - **Total**: ~18 GB
+
+ **Attention Computation Needs**:
+ - Single border attention: 10.75 GB
+ - **Available on L4**: 22 - 18 = 4 GB
+ - **Shortfall**: 10.75 - 4 = 6.75 GB ❌
+
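These figures can be cross-checked at runtime with PyTorch's standard CUDA memory introspection calls; a minimal sketch (the helper name and log format are illustrative, not taken from the project code):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    """Print live tensor allocations, allocator cache, and driver-level free VRAM on device 0."""
    allocated = torch.cuda.memory_allocated() / 1e9  # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 1e9    # allocator cache ("workspace"), includes allocated
    free, total = torch.cuda.mem_get_info()          # free/total bytes reported by the driver
    print(f"[{tag}] allocated={allocated:.2f} GB "
          f"reserved={reserved:.2f} GB free={free / 1e9:.2f}/{total / 1e9:.2f} GB")

log_gpu_memory("baseline")  # e.g. right after model + dataset load, before the first border
```
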
+ **PyTorch Workspace Cache Explanation**:
+ - CUDA Caching Allocator pre-allocates memory for efficiency
+ - Temporary "scratch space" for attention, matmul, convolutions
+ - Set `expandable_segments:True` to reduce fragmentation (line 17)
+ - But on the 22 GB L4, this still leaves only ~4 GB for inference
+
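The `expandable_segments` setting only takes effect if it is in the environment before CUDA is initialized; a sketch of that ordering plus a cache flush between borders (where exactly this lives in `chronos_inference.py` is an assumption):

```python
import os

# Must be set before torch touches CUDA, so keep it at the very top of the module.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (import deliberately placed after the env var)

def release_cached_memory() -> None:
    """Hand the allocator's unused cached blocks back to the driver between borders."""
    torch.cuda.empty_cache()
```
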
+ **Why Smoke Test Succeeds but Full Forecast Fails**:
+ - Smoke test: 1 border × 7 days = smaller memory footprint
+ - Full forecast: 38 borders × 14 days = larger context, hits OOM on the **first** border
+ - Not a border-to-border accumulation issue - the ~18 GB baseline is already too high before the first border runs
+
+ ### GPU Upgrade Path
+
+ #### Attempt 4: Upgrade to A10G-small (24 GB) - commit deace48
+ ```yaml
+ suggested_hardware: l4x1 → a10g-small
+ ```
+ - **Rationale**: 2 GB extra headroom (24 vs 22 GB)
+ - **Result**: Not tested (moved to A100)
+
+ #### Attempt 5: Upgrade to A100-large (40-80 GB) - commit 0405814
+ ```yaml
+ suggested_hardware: a10g-small → a100-large
+ ```
+ - **Rationale**: 40-80 GB VRAM easily handles 18 GB baseline + 11 GB attention
+ - **Result**: **Space PAUSED** - requires higher tier access or manual approval
+
+ ### Current Blocker: HF Space PAUSED
+
+ **Error**:
+ ```
+ ValueError: The current space is in the invalid state: PAUSED.
+ Please contact the owner to fix this.
+ ```

+ **Likely Causes**:
+ 1. A100-large requires Pro/Enterprise tier
+ 2. Billing/quota check triggered
+ 3. Manual approval needed for high-tier GPU
+
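The PAUSED state can be confirmed (and, if the account tier allows, cleared) without the web UI; a sketch using the `huggingface_hub` client, assuming a token with access to the Space is configured locally:

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the locally configured HF token
runtime = api.get_space_runtime("evgueni-p/fbmc-chronos2")
print(runtime.stage, runtime.hardware, runtime.requested_hardware)

if runtime.stage == "PAUSED":
    # Owner-only; will fail again if the requested A100 is not available on the account tier.
    api.restart_space("evgueni-p/fbmc-chronos2")
```
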
+ **Resolution Options** (for tomorrow):
+ 1. **Check HF account tier** - Verify available GPU options
+ 2. **Approve A100 access** - If available on current tier
+ 3. **Downgrade to A10G-large** - 24 GB might be sufficient with optimizations
+ 4. **Process in batches** - Run 5-10 borders at a time on L4 (see the sketch after this list)
+ 5. **Run locally** - If GPU available (requires dataset download)
+
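Option 4 could look roughly like the sketch below; `run_border_forecast` is a hypothetical stand-in for the per-border call in `src/forecasting/chronos_inference.py`, and the chunk size of 5 is arbitrary:

```python
from typing import Callable, Dict, List

import torch

def forecast_in_chunks(
    borders: List[str],
    run_border_forecast: Callable[[str], object],  # hypothetical per-border inference call
    chunk_size: int = 5,
) -> Dict[str, object]:
    """Forecast a few borders at a time, flushing the CUDA allocator cache between chunks."""
    results: Dict[str, object] = {}
    for start in range(0, len(borders), chunk_size):
        for border in borders[start:start + chunk_size]:
            results[border] = run_border_forecast(border)
        torch.cuda.empty_cache()  # release cached blocks before the next chunk
    return results
```
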
+ ### Session 11 Summary
+
+ **Achievements**:
+ - ✓ Identified OOM root cause: default batch_size=256 and 9 quantiles
+ - ✓ Implemented memory optimizations: batch_size=32, 3 quantiles
+ - ✓ Fixed critical git workflow issue (master vs main)
+ - ✓ Created deployment documentation
+ - ✓ Reduced context window 256h → 128h
+ - ✓ Smoke test working (1 border succeeds)
+ - ✓ Identified L4 GPU insufficient for full workload
+
+ **Commits Created** (all pushed to both GitHub and HF Space):
+ ```
+ 0405814 - perf: upgrade to A100-large GPU (40-80GB) for multivariate forecasting
+ deace48 - perf: upgrade to A10G GPU (24GB) for memory headroom
+ 4be9db4 - perf: reduce context window from 256h to 128h to fit L4 GPU memory
+ 239885b - fix: force rebuild with version bump to v1.2.0 (batch_size=32 optimization)
+ 38f4bc1 - docs: add CRITICAL git workflow rule for HF Space deployment
+ caf0333 - docs: update activity.md with Session 11 progress
+ 7a9aff9 - fix: reduce batch_size to 32 and quantiles to 3 for GPU memory optimization
+ ```
+
+ **Files Created/Modified**:
+ - `DEPLOYMENT_NOTES.md` - HF Space troubleshooting guide
+ - `CLAUDE.md` Rule 30 - Mandatory dual-remote push workflow
+ - `README.md` - GPU hardware specification
+ - `src/forecasting/chronos_inference.py` - Memory optimizations
+ - `scripts/evaluate_october_2024.py` - Evaluation script
+
+ **Outstanding Tasks**:
+ - [ ] Resolve HF Space PAUSED status (check tier, approve GPU, or downgrade)
+ - [ ] Complete October 2024 evaluation (38 borders × 14 days)
+ - [ ] Calculate MAE metrics D+1 through D+14
+ - [ ] Create HANDOVER_GUIDE.md for quant analyst
+ - [ ] Archive test scripts to archive/testing/
+ - [ ] Commit and push final results
+
+ ### Next Steps (Resume Here Next Session)
+
+ **PRIORITY 1**: Resolve HF Space PAUSED Status
+ 1. Check HF Space: https://huggingface.co/spaces/evgueni-p/fbmc-chronos2
+ 2. Verify account tier and available GPU options
+ 3. Choose resolution:
+    - **Option A**: Approve A100-large (if available on tier) - RECOMMENDED
+    - **Option B**: Downgrade to `a10g-large` (24 GB) in README.md (a hardware-request sketch follows this list)
+    - **Option C**: Revert to `l4x1` and run batched inference (5-10 borders at a time)
+    - **Option D**: Run evaluation locally if GPU available
+
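If Option A or B is chosen and the tier allows it, the hardware can also be requested through `huggingface_hub` rather than by editing README.md (the call shown targets Option B; `SpaceHardware.A100_LARGE` would correspond to Option A):

```python
from huggingface_hub import HfApi, SpaceHardware

api = HfApi()  # requires a token with write access to the Space
# Request the 24 GB A10G for the fbmc-chronos2 Space.
api.request_space_hardware("evgueni-p/fbmc-chronos2", hardware=SpaceHardware.A10G_LARGE)
```
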
+ **PRIORITY 2**: Complete October 2024 Evaluation
+ 1. Once Space is running, execute:
+ ```bash
+ cd C:/Users/evgue/projects/fbmc_chronos2
+ .venv/Scripts/python.exe scripts/evaluate_october_2024.py
+ ```
+ 2. Verify parquet output (not debug .txt)
+ 3. Check forecast shape: (336 hours, 38 borders × 3 quantiles) - see the check sketch below
+
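A quick check for steps 2-3; the output path and the wide column layout (one column per border/quantile pair) are assumptions about what `evaluate_october_2024.py` writes:

```python
import pandas as pd

# Hypothetical output path - adjust to the file the evaluation script actually writes.
forecast = pd.read_parquet("data/forecasts/october_2024_forecast.parquet")

assert len(forecast) == 336, f"expected 336 hourly rows (14 days), got {len(forecast)}"
assert forecast.shape[1] == 38 * 3, f"expected 114 columns (38 borders x 3 quantiles), got {forecast.shape[1]}"
print(forecast.index.min(), "->", forecast.index.max())
```
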
+ **PRIORITY 3**: Calculate MAE Metrics
+ 1. Run MAE calculation for D+1 through D+14 (see the sketch below)
+ 2. Compare against target: 134 MW (must be <150 MW)
+ 3. Document results in activity.md
+
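A sketch of the per-horizon MAE, assuming hourly P50 forecasts and actuals as DataFrames indexed by timestamp with one column per border (the frame names and evaluation start date are placeholders):

```python
import pandas as pd

def mae_by_horizon(forecast_p50: pd.DataFrame, actuals: pd.DataFrame, start: pd.Timestamp) -> pd.Series:
    """MAE in MW for each horizon day D+1..D+14, averaged across all borders."""
    errors = (forecast_p50 - actuals.reindex_like(forecast_p50)).abs()
    horizon_day = ((errors.index - start) / pd.Timedelta("1D")).astype(int) + 1  # hours 0-23 -> D+1, etc.
    return errors.groupby(horizon_day).mean().mean(axis=1).rename("MAE_MW")

# mae = mae_by_horizon(forecast_p50, actuals, start=pd.Timestamp("2024-10-01"))
# Target from the plan: stay below 150 MW (134 MW reference).
```
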
+ **PRIORITY 4**: Handover Documentation
+ 1. Create `HANDOVER_GUIDE.md` for quant analyst
+ 2. Archive test scripts to `archive/testing/`
+ 3. Final commit and push
+
+ **Key Files for Tomorrow**:
+ - `evaluation_run.log` - Last evaluation attempt logs
+ - `DEPLOYMENT_NOTES.md` - HF Space troubleshooting
+ - `scripts/evaluate_october_2024.py` - Evaluation script
+ - Current Space status: **PAUSED** (A100-large pending approval)
+
+ **Git Status**:
+ - Latest commit: `0405814` (A100-large GPU upgrade)
+ - All changes pushed to both GitHub and HF Space
+ - Branch: master (local) → main (HF Space)

  ### Key Learnings

+ 1. **Always check default parameters** - Libraries often have defaults optimized for different use cases (batch_size=256!)
  2. **batch_size doesn't affect quality** - It's purely a computational optimization parameter
  3. **Memory usage isn't linear** - Transformer attention creates quadratic memory growth
+ 4. **Git branch mapping critical** - Local master ≠ HF Space main, must use `master:main` in push
+ 5. **PyTorch workspace cache** - Pre-allocated memory can consume ~15 GB on large models
+ 6. **GPU sizing matters** - L4 (22 GB) is insufficient for the full multivariate workload; A100 (40-80 GB) needed
  7. **Test with realistic data sizes** - Smoke tests (1 border) can hide multi-border issues
  8. **Document assumptions** - User correctly challenged the historical features assumption
  9. **HF Space rebuild delays** - May need manual trigger, not instant after push