Evgueni Poloukarov and Claude committed on
Commit 73e9c10 · 1 Parent(s): d2f9ff2

fix: revert PyTorch + redirect cache to /tmp for A100


- Revert torch>=2.4.0 back to torch>=2.0.0 (2.4+ broke L40)
- Add cache redirection to /tmp to prevent the 50GB storage limit from being exceeded
  (the most common cause of silent A100 failures, per the HF forums)

Environment variables set before imports:
- TORCH_HOME=/tmp/torch_cache
- HF_HOME=/tmp/hf_home
- TRANSFORMERS_CACHE=/tmp/transformers_cache
- HUB_DIR=/tmp/torch_hub

Co-Authored-By: Claude <[email protected]>

Files changed (2)
  1. app.py +8 -0
  2. requirements.txt +2 -2
app.py CHANGED
@@ -12,6 +12,14 @@ FORCE REBUILD: Optimized for 96GB VRAM with memory profiling diagnostics
  import os
  os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
 
+ # Redirect ALL caches to /tmp to prevent 50GB storage limit exceeded
+ # This is the most common cause of "no logs" silent failures on A100 Spaces
+ # See: https://discuss.huggingface.co/t/how-to-fix-workload-evicted-storage-limit-exceeded-50g-error-in-huggingface-spaces/169258
+ os.environ['TORCH_HOME'] = '/tmp/torch_cache'
+ os.environ['HF_HOME'] = '/tmp/hf_home'
+ os.environ['TRANSFORMERS_CACHE'] = '/tmp/transformers_cache'
+ os.environ['HUB_DIR'] = '/tmp/torch_hub'
+
  import sys
  print(f"[STARTUP] Python version: {sys.version}", flush=True)
  print(f"[STARTUP] Python path: {sys.path[:3]}", flush=True)
requirements.txt CHANGED
@@ -2,7 +2,7 @@
  gradio==4.44.0
 
  # Core ML/Data
- torch>=2.4.0
+ torch>=2.0.0
  transformers>=4.35.0
  chronos-forecasting>=1.2.0
  datasets>=2.14.0
@@ -19,4 +19,4 @@ altair>=5.0.0
  python-dotenv
  tqdm
 
- # Cache bust: v1.6.0 - PyTorch 2.4+ for H200/SM_90 + 1125h context
+ # Cache bust: v1.7.0 - Revert PyTorch + cache to /tmp for A100
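
Because the torch pin has now moved from >=2.4.0 (added for H200/SM_90) back to >=2.0.0 (after 2.4+ broke the L40 Space, per the commit message), it helps to log which version and which GPU actually got resolved at runtime. A minimal sketch, not part of this commit; the [CHECK] prefix mirrors the existing [STARTUP] logging style but is an illustrative addition:

# Minimal sketch (illustrative, not from this commit): log the resolved torch
# version and the GPU compute capability so version/GPU mismatches show up
# in the Space logs (A100 = SM_80, L40 = SM_89, H200 = SM_90).
import torch

print(f"[CHECK] torch version: {torch.__version__}", flush=True)
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"[CHECK] GPU: {name} (SM_{major}{minor})", flush=True)
else:
    print("[CHECK] CUDA not available at startup", flush=True)

On an A100 this should report SM_80; seeing SM_89 or SM_90 instead would point to the kind of version/architecture mismatch the pin changes have been chasing.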