Evgueni Poloukarov committed · 20e09fc
Parent(s): 0405814
docs: comprehensive Session 11 summary with GPU upgrade path and blocker status
doc/activity.md CHANGED (+180 -19)
@@ -145,36 +145,197 @@ Pushed to GitHub: https://github.com/evgspacdmy/fbmc_chronos2
- [ ] Calculate MAE D+1 through D+14 (pending forecast)
- [ ] Document results in activity.md (pending evaluation)

### CRITICAL Git Workflow Issue Discovered

**Problem**: Code pushed to GitHub but NOT deploying to HF Space

**Investigation**:
- Local repo uses `master` branch
- HF Space uses `main` branch
- Was only pushing `git push origin master` (GitHub only)
- HF Space never received the updates!

**Solution** (added to CLAUDE.md Rule 30):
```bash
git push origin master        # Push to GitHub (master branch)
git push hf-new master:main   # Push to HF Space (main branch) - NOTE: master:main mapping!
```

**Files Created**:
- `DEPLOYMENT_NOTES.md` - Troubleshooting guide for HF Space deployment
- Updated `CLAUDE.md` Rule 30 with branch mapping

**Commits**:
- `38f4bc1` - docs: add CRITICAL git workflow rule for HF Space deployment
- `caf0333` - docs: update activity.md with Session 11 progress
- `7a9aff9` - fix: reduce batch_size to 32 and quantiles to 3 for GPU memory optimization
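
One way to confirm whether the Space repo actually received a pushed commit (the failure mode investigated above) is to list its recent commits via `huggingface_hub`. A minimal sketch, assuming the Space id `evgueni-p/fbmc-chronos2` used later in this document and a configured HF token:

```python
# Sketch: check that the HF Space repo contains a commit pushed from local master.
# Assumes huggingface_hub is installed and a token with read access is configured.
from huggingface_hub import HfApi

api = HfApi()
commits = api.list_repo_commits("evgueni-p/fbmc-chronos2", repo_type="space")

# Print the most recent commits on the Space.
for c in commits[:5]:
    print(c.commit_id[:7], c.title)

# If a SHA pushed locally is missing here, the push never reached the Space
# (wrong remote, or missing master:main branch mapping).
```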

### Deployment Attempts & Results

#### Attempt 1: Initial batch_size=32 fix (commit 7a9aff9)
- Pushed to both remotes with correct branch mapping
- Waited 3 minutes for rebuild
- **Result**: Space still running OLD code (line 196 traceback, no batch_size parameter)

#### Attempt 2: Version bump to force rebuild (commit 239885b)
- Changed version string: v1.1.0 → v1.2.0
- Pushed to both remotes
- **Result**: New code deployed! (line 204 traceback confirms torch.inference_mode())
- Smoke test (1 border): ✓ SUCCESS
- Full forecast (38 borders): ✗ STILL OOM on first border (18.04 GB baseline)

#### Attempt 3: Reduce context window 256h → 128h (commit 4be9db4)
- Reduced `context_hours: int = 256` → `128`
- Version bump: v1.2.0 → v1.3.0
- **Result**: Memory dropped slightly (17.96 GB), still OOM on first border
- **Analysis**: L4 GPU (22 GB) fundamentally insufficient

### GPU Memory Analysis

**Baseline Memory Usage** (before inference):
- Model weights (bfloat16): ~2 GB
- Dataset in memory: ~1 GB
- **PyTorch workspace cache**: ~15 GB (the main culprit!)
- **Total**: ~18 GB

**Attention Computation Needs**:
- Single border attention: 10.75 GB
- **Available on L4**: 22 - 18 = 4 GB
- **Shortfall**: 10.75 - 4 = 6.75 GB ❌
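
The shortfall written out as plain arithmetic (the figures are the estimates listed above; nothing here is newly measured):

```python
# Back-of-the-envelope L4 memory budget (GB), using the estimates above.
total_vram      = 22.0   # L4 VRAM
weights         = 2.0    # model weights in bfloat16
dataset         = 1.0    # dataset held in memory
workspace_cache = 15.0   # PyTorch workspace cache (the main culprit)

baseline  = weights + dataset + workspace_cache   # ~18 GB before inference starts
available = total_vram - baseline                  # ~4 GB left for inference
attention = 10.75                                  # single-border attention requirement
shortfall = attention - available                  # ~6.75 GB -> guaranteed OOM

print(f"baseline={baseline:.1f} GB  available={available:.1f} GB  shortfall={shortfall:.2f} GB")

# Halving the context window (Attempt 3) only moved the observed baseline from
# 18.04 GB to 17.96 GB: the workspace cache, not per-border attention, is what
# pins the budget on a 22 GB card.
```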

**PyTorch Workspace Cache Explanation**:
- CUDA Caching Allocator pre-allocates memory for efficiency
- Temporary "scratch space" for attention, matmul, convolutions
- Set `expandable_segments:True` to reduce fragmentation (line 17)
- But on 22 GB L4, leaves only ~4 GB for inference
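
The `expandable_segments` setting is read from the `PYTORCH_CUDA_ALLOC_CONF` environment variable and must be set before the first CUDA allocation (hence line 17 of the inference script). A minimal sketch of setting it and inspecting what the caching allocator is actually holding, using only standard `torch.cuda` calls:

```python
import os

# Must be set before torch initialises CUDA (i.e. at the very top of the entrypoint).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

def report_gpu_memory(tag: str) -> None:
    """Print allocated vs. reserved (cached) GPU memory in GB."""
    allocated = torch.cuda.memory_allocated() / 1e9   # tensors currently in use
    reserved = torch.cuda.memory_reserved() / 1e9     # allocator workspace incl. cache
    print(f"[{tag}] allocated={allocated:.2f} GB  reserved={reserved:.2f} GB")

if torch.cuda.is_available():
    report_gpu_memory("before inference")
    # ... run a forecast here ...
    torch.cuda.empty_cache()   # returns cached blocks to the driver (does not free live tensors)
    report_gpu_memory("after empty_cache")
```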

**Why Smoke Test Succeeds but Full Forecast Fails**:
- Smoke test: 1 border × 7 days = smaller memory footprint
- Full forecast: 38 borders × 14 days = larger context, hits OOM on the **first** border
- Not a border-to-border accumulation issue - baseline too high

### GPU Upgrade Path

#### Attempt 4: Upgrade to A10G-small (24 GB) - commit deace48
```yaml
suggested_hardware: l4x1 → a10g-small
```
- **Rationale**: 2 GB extra headroom (24 vs 22 GB)
- **Result**: Not tested (moved to A100)

#### Attempt 5: Upgrade to A100-large (40-80 GB) - commit 0405814
```yaml
suggested_hardware: a10g-small → a100-large
```
- **Rationale**: 40-80 GB VRAM easily handles the 18 GB baseline + 11 GB attention
- **Result**: **Space PAUSED** - requires higher-tier access or manual approval

### Current Blocker: HF Space PAUSED

**Error**:
```
ValueError: The current space is in the invalid state: PAUSED.
Please contact the owner to fix this.
```

**Likely Causes**:
1. A100-large requires Pro/Enterprise tier
2. Billing/quota check triggered
3. Manual approval needed for high-tier GPU

**Resolution Options** (for tomorrow):
1. **Check HF account tier** - Verify available GPU options
2. **Approve A100 access** - If available on current tier
3. **Downgrade to A10G-large** - 24 GB might be sufficient with optimizations
4. **Process in batches** - Run 5-10 borders at a time on L4
5. **Run locally** - If GPU available (requires dataset download)
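
Options 1-3 can also be checked and applied programmatically through `huggingface_hub` instead of the web UI. A hedged sketch; whether the account tier actually permits a given hardware flavour is exactly the open question, and the Space id is the one from Next Steps below:

```python
from huggingface_hub import HfApi

SPACE_ID = "evgueni-p/fbmc-chronos2"  # Space id used elsewhere in this doc
api = HfApi()  # assumes a token with write access to the Space is configured

# Inspect the current runtime state (should currently report the PAUSED stage).
runtime = api.get_space_runtime(SPACE_ID)
print("stage:", runtime.stage, "hardware:", runtime.hardware)

# Request a different hardware flavour, e.g. fall back to A10G if A100 is not
# available on the current tier (Option 3 above).
api.request_space_hardware(SPACE_ID, hardware="a10g-small")

# Restart the Space once hardware/billing is sorted out.
api.restart_space(SPACE_ID)
```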

### Session 11 Summary

**Achievements**:
- ✓ Identified root cause: batch_size=256, 9 quantiles
- ✓ Implemented memory optimizations: batch_size=32, 3 quantiles
- ✓ Fixed critical git workflow issue (master vs main)
- ✓ Created deployment documentation
- ✓ Reduced context window 256h → 128h
- ✓ Smoke test working (1 border succeeds)
- ✓ Identified L4 GPU insufficient for full workload

**Commits Created** (all pushed to both GitHub and HF Space):
```
0405814 - perf: upgrade to A100-large GPU (40-80GB) for multivariate forecasting
deace48 - perf: upgrade to A10G GPU (24GB) for memory headroom
4be9db4 - perf: reduce context window from 256h to 128h to fit L4 GPU memory
239885b - fix: force rebuild with version bump to v1.2.0 (batch_size=32 optimization)
38f4bc1 - docs: add CRITICAL git workflow rule for HF Space deployment
caf0333 - docs: update activity.md with Session 11 progress
7a9aff9 - fix: reduce batch_size to 32 and quantiles to 3 for GPU memory optimization
```

**Files Created/Modified**:
- `DEPLOYMENT_NOTES.md` - HF Space troubleshooting guide
- `CLAUDE.md` Rule 30 - Mandatory dual-remote push workflow
- `README.md` - GPU hardware specification
- `src/forecasting/chronos_inference.py` - Memory optimizations
- `scripts/evaluate_october_2024.py` - Evaluation script

**Outstanding Tasks**:
- [ ] Resolve HF Space PAUSED status (check tier, approve GPU, or downgrade)
- [ ] Complete October 2024 evaluation (38 borders × 14 days)
- [ ] Calculate MAE metrics D+1 through D+14
- [ ] Create HANDOVER_GUIDE.md for quant analyst
- [ ] Archive test scripts to archive/testing/
- [ ] Commit and push final results

### Next Steps (Resume Here Next Session)

**PRIORITY 1**: Resolve HF Space PAUSED Status
1. Check HF Space: https://huggingface.co/spaces/evgueni-p/fbmc-chronos2
2. Verify account tier and available GPU options
3. Choose resolution:
   - **Option A**: Approve A100-large (if available on tier) - RECOMMENDED
   - **Option B**: Downgrade to `a10g-large` (24 GB) in README.md
   - **Option C**: Revert to `l4x1` and run batched inference (5-10 borders at a time; see the sketch after this list)
   - **Option D**: Run evaluation locally if GPU available
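
For Option C, a sketch of what batched inference over border groups could look like. `forecast_fn` is a hypothetical stand-in for the real per-border entrypoint in `src/forecasting/chronos_inference.py` (name and signature assumed); the point is simply to process 5-10 borders per call and release cached GPU memory between groups:

```python
from typing import Callable, List

import torch

def run_batched(borders: List[str],
                forecast_fn: Callable[[List[str]], object],
                group_size: int = 8) -> list:
    """Run `forecast_fn` over groups of 5-10 borders, freeing cache between groups.

    `forecast_fn` is a placeholder for the project's actual forecast call;
    `group_size` of 5-10 assumes each group fits the L4's remaining headroom.
    """
    results = []
    for i in range(0, len(borders), group_size):
        group = borders[i:i + group_size]
        results.append(forecast_fn(group))
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks between groups
    return results
```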

**PRIORITY 2**: Complete October 2024 Evaluation
1. Once Space is running, execute:
   ```bash
   cd C:/Users/evgue/projects/fbmc_chronos2
   .venv/Scripts/python.exe scripts/evaluate_october_2024.py
   ```
2. Verify parquet output (not debug .txt)
3. Check forecast shape: (336 hours, 38 borders × 3 quantiles)
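
A quick sanity check for steps 2-3 once the run finishes. The output path below is a placeholder; substitute whatever `evaluate_october_2024.py` actually writes:

```python
import pandas as pd

# Placeholder path -- use the actual parquet written by evaluate_october_2024.py.
forecast = pd.read_parquet("outputs/october_2024_forecast.parquet")

# 14 days x 24 h = 336 hourly steps expected.
assert len(forecast) == 336, f"expected 336 hourly rows, got {len(forecast)}"

# 38 borders x 3 quantiles = 114 forecast columns (plus any index/metadata columns).
print(f"rows={len(forecast)}  columns={forecast.shape[1]}")
print(forecast.columns[:6].tolist())  # eyeball the border/quantile naming scheme
```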

**PRIORITY 3**: Calculate MAE Metrics
1. Run MAE calculation for D+1 through D+14
2. Compare against target: 134 MW (must be <150 MW)
3. Document results in activity.md
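
A sketch of the per-horizon MAE computation for step 1, assuming a forecast series and an actuals series aligned on the same hourly index for one border (column names and alignment are assumptions; the real evaluation lives in `scripts/evaluate_october_2024.py`):

```python
import pandas as pd

def mae_by_horizon(forecast: pd.Series, actual: pd.Series,
                   hours_per_day: int = 24) -> pd.Series:
    """MAE for D+1 ... D+14, given 336 aligned hourly values for one border."""
    errors = (forecast - actual).abs().to_numpy()
    n_days = len(errors) // hours_per_day             # 14 for a 336-hour window
    daily = errors[: n_days * hours_per_day].reshape(n_days, hours_per_day)
    return pd.Series(daily.mean(axis=1),
                     index=[f"D+{d}" for d in range(1, n_days + 1)],
                     name="MAE [MW]")

# Illustrative usage, column names hypothetical:
# mae = mae_by_horizon(forecast["BORDER_q0.5"], actuals["BORDER"])
# print(mae.round(1), "\nworst horizon:", mae.idxmax())
```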

**PRIORITY 4**: Handover Documentation
1. Create `HANDOVER_GUIDE.md` for quant analyst
2. Archive test scripts to `archive/testing/`
3. Final commit and push

**Key Files for Tomorrow**:
- `evaluation_run.log` - Last evaluation attempt logs
- `DEPLOYMENT_NOTES.md` - HF Space troubleshooting
- `scripts/evaluate_october_2024.py` - Evaluation script
- Current Space status: **PAUSED** (A100-large pending approval)

**Git Status**:
- Latest commit: `0405814` (A100-large GPU upgrade)
- All changes pushed to both GitHub and HF Space
- Branch: master (local) → main (HF Space)

### Key Learnings

1. **Always check default parameters** - Libraries often have defaults optimized for different use cases (batch_size=256!)
2. **batch_size doesn't affect quality** - It's purely a computational optimization parameter
3. **Memory usage isn't linear** - Transformer attention creates quadratic memory growth
4. **Git branch mapping critical** - Local master ≠ HF Space main; must use `master:main` in the push
5. **PyTorch workspace cache** - Pre-allocated memory can consume ~15 GB on large models
6. **GPU sizing matters** - L4 (22 GB) insufficient for multivariate forecasting; need A100 (40-80 GB)
7. **Test with realistic data sizes** - Smoke tests (1 border) can hide multi-border issues
8. **Document assumptions** - User correctly challenged the historical features assumption
9. **HF Space rebuild delays** - May need manual trigger, not instant after push