| # Hugging Face Spaces Deployment - Troubleshooting Guide | |
| ## **Your Local Fix Applied** | |
| Great news! The core issue has been resolved locally. The problem was that the downloaded model doesn't contain `actor_critic` weights, but the code assumed it did. This caused a `NoneType` error when clicking to start the game. | |
| **Fixed**: The app now properly detects when `actor_critic` weights are missing and falls back to human control mode instead of crashing. | |
| ## **Potential HF Spaces Issues & Solutions** | |
| ### **Issue 1: Model Download Timeouts** | |
| **Symptoms:** | |
| - "Model loading timed out" message | |
| - App shows loading forever | |
| - Click doesn't start the game | |
| **Root Cause:** Model downloads on HF Spaces can be slower than on a local machine, so the 5-minute timeout may not be enough. | |
| **Solution:** | |
| ```python | |
| # In app.py, update the timeout in _load_model_from_url_async(): | |
| success = await asyncio.wait_for(future, timeout=900.0) # 15 minutes instead of 5 | |
| ``` | |
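If the right value is not known in advance, the timeout can be read from an environment variable so it can be tuned from the Space settings without a code change. This is a minimal sketch, not the app's actual code: the variable name `MODEL_LOAD_TIMEOUT` and the helpers `get_load_timeout` and `load_with_timeout` are hypothetical.

```python
import asyncio
import os


def get_load_timeout(default: float = 900.0) -> float:
    """Read the timeout in seconds from MODEL_LOAD_TIMEOUT (hypothetical name)."""
    raw = os.environ.get("MODEL_LOAD_TIMEOUT", "")
    try:
        value = float(raw)
    except ValueError:
        # Unset or not a number: keep the default.
        return default
    return value if value > 0 else default


async def load_with_timeout(awaitable, timeout: float) -> bool:
    """Return True if loading finished in time, False if it timed out."""
    try:
        await asyncio.wait_for(awaitable, timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False
```

Returning `False` on timeout, instead of letting the exception propagate, lets the caller show the "Model loading timed out" message and keep the app running.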
| ### **Issue 2: Memory Limitations** | |
| **Symptoms:** | |
| - App crashes during model loading | |
| - "Out of memory" errors in logs | |
| - Models load but inference fails | |
| **Root Cause:** HF Spaces free tier has only 16GB RAM. | |
| **Quick Fix:** Force CPU-only mode | |
| ```python | |
| # Add at the top of app.py, before torch is imported, so CUDA is never initialized | |
| import os | |
| os.environ["CUDA_VISIBLE_DEVICES"] = "" # Force CPU mode for HF Spaces | |
| ``` | |
| **Better Solution:** Add memory management | |
| ```python | |
| # Add memory cleanup after model loading | |
| import gc | |
| import torch | |
| gc.collect() # Release unreferenced Python objects | |
| if torch.cuda.is_available(): | |
|     torch.cuda.empty_cache() # Return cached GPU memory to the driver | |
| ``` | |
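To judge how close the app gets to the memory limit, it can help to log peak memory use right after model loading. A minimal sketch using only the standard library follows; the helper name `peak_rss_mb` is hypothetical, and the `resource` module is available on Unix-like systems only.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident memory of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


print(f"Peak memory so far: {peak_rss_mb():.0f} MiB")
```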
| ### **Issue 3: WebSocket Connection Failures** | |
| **Symptoms:** | |
| - "Connection Error" or "Disconnected" status | |
| - Click works but no response | |
| - Frequent reconnections | |
| **Root Cause:** HF Spaces proxy/domain restrictions. | |
| **Solution:** Update the WebSocket connection code in the HTML template: | |
| ```javascript | |
| // Replace the connectWebSocket function in app.py HTML | |
| function connectWebSocket() { | |
|     // The app itself is served from a *.hf.space host, so check for that as well | |
|     const host = window.location.hostname; | |
|     const isHFSpaces = host.endsWith('.hf.space') || host.includes('huggingface.co'); | |
|     const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:'; | |
|     const wsUrl = `${protocol}//${window.location.host}/ws`; | |
|     ws = new WebSocket(wsUrl); | |
|     // Longer timeout for HF Spaces | |
|     const timeout = isHFSpaces ? 30000 : 10000; | |
|     const connectTimer = setTimeout(() => { | |
|         if (ws.readyState !== WebSocket.OPEN) { | |
|             ws.close(); | |
|             setTimeout(connectWebSocket, 5000); // Retry after 5s | |
|         } | |
|     }, timeout); | |
|     ws.onopen = function(event) { | |
|         clearTimeout(connectTimer); | |
|         statusEl.textContent = 'Connected'; | |
|         statusEl.style.color = '#00ff00'; | |
|         // Re-send start if user already clicked | |
|         if (gameStarted && !gamePlaying) { | |
|             ws.send(JSON.stringify({ type: 'start' })); | |
|         } | |
|     }; | |
| } | |
| ``` | |
| ### **Issue 4: Actor-Critic Model Missing** | |
| **Already Fixed!** The app now handles this gracefully: | |
| - Detects missing `actor_critic` weights | |
| - Falls back to human control mode | |
| - Shows proper warning messages | |
| - Game still works (user can control manually) | |
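The detection and fallback listed above can be illustrated with a small sketch. It is not the code in `app.py`: the key prefix `actor_critic.`, the function names, and the mode strings are assumptions for illustration, and the check runs on any mapping of checkpoint keys, so it needs no model to try out.

```python
from typing import Mapping


def has_actor_critic_weights(state_dict: Mapping[str, object],
                             prefix: str = "actor_critic.") -> bool:
    """True if any checkpoint key starts with the (assumed) actor-critic prefix."""
    return any(key.startswith(prefix) for key in state_dict)


def choose_control_mode(state_dict: Mapping[str, object]) -> str:
    """Pick 'ai' when weights exist, otherwise fall back to 'human'."""
    return "ai" if has_actor_critic_weights(state_dict) else "human"


# Hypothetical checkpoint that contains no actor-critic weights.
checkpoint = {"denoiser.conv.weight": None, "denoiser.conv.bias": None}
print(choose_control_mode(checkpoint))
```

Deciding the mode once at load time means the click handler never touches a missing model, which is what caused the original `NoneType` error.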
| ### **Issue 5: Dockerfile Optimization** | |
| **Update your Dockerfile for HF Spaces:** | |
| ```dockerfile | |
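| # Add these optimizations | |
| # Note: Docker itself does not read SHM_SIZE (shared memory is sized by the runtime, e.g. --shm-size), | |
| # so this variable only has an effect if the app reads it | |
| ENV SHM_SIZE=2g | |
| # Only takes effect when a CUDA device is in use | |
| ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 | |
| ENV OMP_NUM_THREADS=4 | |
| # Add health check (requires curl in the image and the /health endpoint in app.py) | |
| HEALTHCHECK --interval=30s --timeout=10s --start-period=60s \ | |
|     CMD curl --fail http://localhost:7860/health || exit 1 | |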
| ``` | |
| ## **Quick Deployment Checklist** | |
| ### **Before Deploying:** | |
| 1. **Test locally with conda**: `conda activate diamond && python run_web_demo.py` | |
| 2. **Verify the fix works**: Clicking should now start the game (even without actor_critic weights) | |
| 3. **Check model download**: Confirm the HF model URL is reachable from your network | |
| ### **For HF Spaces Deployment:** | |
| 1. **Update timeout values:** | |
| ```python | |
| # In app.py line ~153 | |
| success = await asyncio.wait_for(future, timeout=900.0) # 15 min | |
| ``` | |
| 2. **Add health check endpoint:** | |
| ```python | |
| @app.get("/health") | |
| async def health_check(): | |
|     return { | |
|         "status": "healthy", | |
|         "models_ready": game_engine.models_ready, | |
|         "actor_critic_loaded": game_engine.actor_critic_loaded | |
|     } | |
| ``` | |
| 3. **Force CPU mode for free tier:** | |
| ```python | |
| # Add at app.py startup, before torch is imported | |
| import os | |
| os.environ["CUDA_VISIBLE_DEVICES"] = "" | |
| ``` | |
| 4. **Update Dockerfile** with the optimizations above | |
| 5. **Test WebSocket connection** - add the improved connection handling | |
| ## **Debugging on HF Spaces** | |
| ### **Check Logs:** | |
| 1. Go to your Space page on Hugging Face | |
| 2. Click "Logs" tab | |
| 3. Look for these messages: | |
| - Expected after the fix: `"Actor-critic model exists but has no trained weights - using dummy mode!"` | |
| - Expected after the fix: `"WebPlayEnv set to human control mode"` | |
| - Indicates a problem: `"Model loading timed out"` | |
| - Indicates a problem: `"WebSocket error"` | |
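When the log is long, a short script can sort lines by these messages. The sketch below assumes the log text has been copied out of the Logs tab; the function name `classify_log_line` and the sample line are hypothetical, and the message strings are the ones quoted above.

```python
# Messages that appear when the missing-weights fallback works as intended.
EXPECTED = (
    "Actor-critic model exists but has no trained weights - using dummy mode!",
    "WebPlayEnv set to human control mode",
)
# Messages that point to a deployment problem.
PROBLEMS = ("Model loading timed out", "WebSocket error")


def classify_log_line(line: str) -> str:
    """Return 'problem', 'expected', or 'other' for a single log line."""
    if any(message in line for message in PROBLEMS):
        return "problem"
    if any(message in line for message in EXPECTED):
        return "expected"
    return "other"


# Hypothetical log line, for illustration only.
print(classify_log_line("WARNING: Model loading timed out"))
```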
| ### **Test Health Endpoint:** | |
| - Visit: `https://your-space.hf.space/health` | |
| - Should return JSON with status info | |
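The same check can be scripted. This is a minimal sketch using only the standard library; the helper names are hypothetical, the JSON fields are those returned by the `/health` endpoint from the deployment checklist, and the URL in the usage note is the placeholder used above.

```python
import json
import urllib.request


def summarize_health(body: bytes) -> str:
    """Turn the /health JSON payload into a one-line summary."""
    data = json.loads(body)
    if data.get("status") != "healthy":
        return "unhealthy"
    if not data.get("models_ready"):
        return "models still loading"
    if data.get("actor_critic_loaded"):
        return "ready (AI control)"
    return "ready (human control)"


def fetch_health(base_url: str, timeout: float = 10.0) -> str:
    """Request <base_url>/health and summarize the response."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
        return summarize_health(response.read())
```

For example, `fetch_health("https://your-space.hf.space")` reports whether the models are still loading or which control mode is active.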
| ### **Browser Console:** | |
| - Open Developer Tools (F12) | |
| - Check for WebSocket connection errors | |
| - Look for JavaScript errors during click | |
| ## **Expected Behavior After Fixes** | |
| 1. **App loads** → Shows loading progress bar | |
| 2. **Models initialize** → Either loads actor_critic OR shows "no trained weights" | |
| 3. **User clicks game area** → Game starts immediately (no hanging) | |
| 4. **If actor_critic missing** → User gets manual control (still playable!) | |
| 5. **If actor_critic loaded** → AI takes control automatically | |
| ## **If Issues Persist** | |
| **Quick Diagnostic:** | |
| ```python | |
| # Add this test endpoint to app.py (uses torch, so `import torch` must be present) | |
| @app.get("/debug") | |
| async def debug_info(): | |
|     return { | |
|         "models_ready": game_engine.models_ready, | |
|         "actor_critic_loaded": game_engine.actor_critic_loaded, | |
|         "loading_status": game_engine.loading_status, | |
|         "game_started": game_engine.game_started, | |
|         "obs_shape": str(game_engine.obs.shape) if game_engine.obs is not None else "None", | |
|         "connected_clients": len(connected_clients), | |
|         "cuda_available": torch.cuda.is_available(), | |
|         "device_count": torch.cuda.device_count() if torch.cuda.is_available() else 0 | |
|     } | |
| ``` | |
| Visit the `/debug` endpoint to see the current state. | |
| **Most Common Issue:** If clicking still doesn't work on HF Spaces, the cause is usually the WebSocket connection. Update the connection handling as described in Issue 3. | |
| The core model/clicking issue is now fixed. The remaining items are deployment optimizations for the HF Spaces environment. | |