# 🚀 Hugging Face Spaces Deployment - Troubleshooting Guide
## ✅ **Your Local Fix Applied**
Great news! The core issue has been resolved locally. The problem was that the downloaded model doesn't contain `actor_critic` weights, but the code assumed it did. This caused a `NoneType` error when clicking to start the game.
**Fixed**: The app now properly detects when `actor_critic` weights are missing and falls back to human control mode instead of crashing.
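One way to detect the missing weights is to inspect the checkpoint's parameter names before loading them. This is a minimal sketch, not the code in `app.py`: it assumes the checkpoint is a flat mapping whose actor-critic parameters share an `actor_critic.` key prefix, and both that prefix and the helper name `has_actor_critic_weights` are hypothetical.

```python
from typing import Any, Mapping

# Assumed key prefix for actor-critic parameters; adjust to the real checkpoint layout.
ACTOR_CRITIC_PREFIX = "actor_critic."


def has_actor_critic_weights(state_dict: Mapping[str, Any]) -> bool:
    """Return True if any parameter name starts with the actor-critic prefix."""
    return any(key.startswith(ACTOR_CRITIC_PREFIX) for key in state_dict)


if __name__ == "__main__":
    # Stand-in for a checkpoint that only contains other model components.
    checkpoint = {"denoiser.conv.weight": None}
    if not has_actor_critic_weights(checkpoint):
        print("No actor_critic weights found: falling back to human control")
```

Checking the keys up front lets the app choose human control before any code touches a model object that was never populated.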
## 🔍 **Potential HF Spaces Issues & Solutions**
### **Issue 1: Model Download Timeouts** ⏰
**Symptoms:**
- "Model loading timed out" message
- App shows loading forever
- Click doesn't start the game
**Root Cause:** Model downloads on HF Spaces can be slower than on a local machine, so the 5-minute timeout may not be enough.
**Solution:**
```python
# In app.py, update the timeout in _load_model_from_url_async():
success = await asyncio.wait_for(future, timeout=900.0) # 15 minutes instead of 5
```
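Rather than hard-coding the value, the timeout could be read from the environment and a timeout turned into a normal return value. The sketch below is an assumption about how this might look, not the code in `app.py`: the `MODEL_LOAD_TIMEOUT_S` variable and the `wait_for_model` helper are hypothetical names.

```python
import asyncio
import os

# Hypothetical environment variable; defaults to the 15 minutes suggested above.
MODEL_LOAD_TIMEOUT_S = float(os.environ.get("MODEL_LOAD_TIMEOUT_S", "900"))


async def wait_for_model(awaitable, timeout: float = MODEL_LOAD_TIMEOUT_S) -> bool:
    """Await a model-loading task; return False instead of raising on timeout."""
    try:
        return bool(await asyncio.wait_for(awaitable, timeout=timeout))
    except asyncio.TimeoutError:
        return False


async def _demo() -> None:
    # asyncio.sleep(delay, result) stands in for the real loading future.
    print(await wait_for_model(asyncio.sleep(0.01, True), timeout=1.0))
    print(await wait_for_model(asyncio.sleep(1.0, True), timeout=0.01))


if __name__ == "__main__":
    asyncio.run(_demo())
```

Note that `asyncio.wait_for` cancels the awaitable on timeout. If the load runs in a thread executor, the worker thread itself keeps running, so a retry may overlap with the earlier download.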
### **Issue 2: Memory Limitations** 💾
**Symptoms:**
- App crashes during model loading
- "Out of memory" errors in logs
- Models load but inference fails
**Root Cause:** The HF Spaces free tier has only 16 GB of RAM.
**Quick Fix:** Force CPU-only mode
```python
# Add at the very top of app.py, before torch is imported,
# so that CUDA is never initialized
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # Force CPU mode for HF Spaces
```
**Better Solution:** Add memory management
```python
# Add memory cleanup after model loading
import gc

import torch

gc.collect()  # Release unreferenced Python objects
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Return cached GPU memory to the driver
```
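To confirm that memory is actually the problem, it can help to log the process's peak memory right after model loading and compare it with the 16 GB limit. This is a minimal sketch using only the standard library; the helper name `peak_rss_mb` is hypothetical, and the `resource` module is available on Unix systems only.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident memory of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    print(f"Peak memory so far: {peak_rss_mb():.0f} MiB")
```

Calling this once after loading and once after the first inference step shows which phase drives the peak.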
### **Issue 3: WebSocket Connection Failures** 🔌
**Symptoms:**
- "Connection Error" or "Disconnected" status
- Click works but no response
- Frequent reconnections
**Root Cause:** The HF Spaces proxy and domain setup can delay or drop WebSocket connections.
**Solution:** Update the WebSocket connection code in the HTML template:
```javascript
// Replace the connectWebSocket function in the HTML template in app.py
function connectWebSocket() {
    // Spaces are served from *.hf.space (and embedded on huggingface.co)
    const host = window.location.hostname;
    const isHFSpaces = host.endsWith('.hf.space') || host.includes('huggingface.co');
    const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:';
    const wsUrl = `${protocol}//${window.location.host}/ws`;
    ws = new WebSocket(wsUrl);

    // Longer timeout for HF Spaces
    const timeout = isHFSpaces ? 30000 : 10000;
    const connectTimer = setTimeout(() => {
        if (ws.readyState !== WebSocket.OPEN) {
            ws.close();
            setTimeout(connectWebSocket, 5000); // Retry after 5s
        }
    }, timeout);

    ws.onopen = function(event) {
        clearTimeout(connectTimer);
        statusEl.textContent = 'Connected';
        statusEl.style.color = '#00ff00';
        // Re-send start if user already clicked
        if (gameStarted && !gamePlaying) {
            ws.send(JSON.stringify({ type: 'start' }));
        }
    };
    // Keep any existing onmessage/onclose/onerror handlers from the original function
}
```
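To test the socket from a script rather than the browser, the same URL derivation can be mirrored in Python. This is an illustrative sketch: the helper name `websocket_url` is hypothetical, and the `/ws` path is taken from the JavaScript above.

```python
from urllib.parse import urlsplit, urlunsplit


def websocket_url(page_url: str, path: str = "/ws") -> str:
    """Map a page URL to its WebSocket URL: https -> wss, anything else -> ws."""
    parts = urlsplit(page_url)
    scheme = "wss" if parts.scheme == "https" else "ws"
    # Keep host and port, replace the path, drop query and fragment.
    return urlunsplit((scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(websocket_url("https://your-space.hf.space/"))
    print(websocket_url("http://localhost:7860"))
```

A page served over HTTPS must use `wss:`, because browsers block insecure `ws:` connections from secure pages. That is why the protocol is derived from the page URL instead of being hard-coded.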
### **Issue 4: Actor-Critic Model Missing** 🧠
**Already Fixed!** ✅ The app now handles this gracefully:
- Detects missing `actor_critic` weights
- Falls back to human control mode
- Shows proper warning messages
- Game still works (user can control manually)
### **Issue 5: Dockerfile Optimization** 🐳
**Update your Dockerfile for HF Spaces:**
```dockerfile
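# Add these optimizations
# Note: SHM_SIZE is a plain environment variable here; Docker itself sizes
# /dev/shm through the --shm-size run option, not through ENV
ENV SHM_SIZE=2g
# Only takes effect when a CUDA device is in use
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
ENV OMP_NUM_THREADS=4
# Add health check (requires curl to be installed in the image)
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s \
    CMD curl --fail http://localhost:7860/health || exit 1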
```
## 🚀 **Quick Deployment Checklist**
### **Before Deploying:**
1. ✅ **Test locally with conda**: `conda activate diamond && python run_web_demo.py`
2. ✅ **Verify the fix works**: Clicking should now start the game (even without `actor_critic` weights)
3. ✅ **Check model download**: Test internet connectivity to the HF model URL
### **For HF Spaces Deployment:**
1. **Update timeout values:**
   ```python
   # In app.py line ~153
   success = await asyncio.wait_for(future, timeout=900.0)  # 15 min
   ```
2. **Add health check endpoint:**
   ```python
   @app.get("/health")
   async def health_check():
       return {
           "status": "healthy",
           "models_ready": game_engine.models_ready,
           "actor_critic_loaded": game_engine.actor_critic_loaded,
       }
   ```
3. **Force CPU mode for free tier:**
   ```python
   # Add at app.py startup, before torch is imported
   import os
   os.environ["CUDA_VISIBLE_DEVICES"] = ""
   ```
4. **Update Dockerfile** with the optimizations above
5. **Test WebSocket connection** - add the improved connection handling
## 🔧 **Debugging on HF Spaces**
### **Check Logs:**
1. Go to your Space page on Hugging Face
2. Click the "Logs" tab
3. Look for these messages:
   - ✅ `"Actor-critic model exists but has no trained weights - using dummy mode!"`
   - ✅ `"WebPlayEnv set to human control mode"`
   - ❌ `"Model loading timed out"`
   - ❌ `"WebSocket error"`
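For long logs, the same check can be scripted against a copy of the log text. The sketch below searches for the four messages listed above; the helper name `classify_log` and the idea of pasting the log into a string are assumptions for illustration.

```python
# Messages quoted in this guide; True marks an expected line, False a problem.
KNOWN_MESSAGES = {
    "Actor-critic model exists but has no trained weights - using dummy mode!": True,
    "WebPlayEnv set to human control mode": True,
    "Model loading timed out": False,
    "WebSocket error": False,
}


def classify_log(text: str) -> dict[str, list[str]]:
    """Group log lines that contain a known message into expected vs. problems."""
    found: dict[str, list[str]] = {"expected": [], "problems": []}
    for line in text.splitlines():
        for message, is_ok in KNOWN_MESSAGES.items():
            if message in line:
                found["expected" if is_ok else "problems"].append(line.strip())
    return found


if __name__ == "__main__":
    sample = "INFO WebPlayEnv set to human control mode\nERROR Model loading timed out"
    print(classify_log(sample))
```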
### **Test Health Endpoint:**
- Visit: `https://your-space.hf.space/health`
- Should return JSON with status info
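The same check can be run from a script, which is handy while waiting for a slow model download. This is a sketch using only the standard library. It assumes the `/health` response carries the `status`, `models_ready`, and `actor_critic_loaded` fields from the checklist. The helper names are hypothetical and `your-space.hf.space` is a placeholder.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    with urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout) as response:
        return json.load(response)


def summarize_health(payload: dict) -> str:
    """Turn the health payload into a one-line status."""
    if payload.get("status") != "healthy":
        return "unhealthy"
    if not payload.get("models_ready"):
        return "models still loading"
    if payload.get("actor_critic_loaded"):
        return "ready (AI control)"
    return "ready (human control)"


if __name__ == "__main__":
    # Placeholder URL; replace with the address of your own Space.
    print(summarize_health(fetch_health("https://your-space.hf.space")))
```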
### **Browser Console:**
- Open Developer Tools (F12)
- Check for WebSocket connection errors
- Look for JavaScript errors during click
## 🎯 **Expected Behavior After Fixes**
1. **App loads** → Shows loading progress bar
2. **Models initialize** → Either loads `actor_critic` or shows "no trained weights"
3. **User clicks game area** → Game starts immediately (no hanging)
4. **If `actor_critic` missing** → User gets manual control (still playable!)
5. **If `actor_critic` loaded** → AI takes control automatically
## 🆘 **If Issues Persist**
**Quick Diagnostic:**
```python
# Add this test endpoint to app.py
@app.get("/debug")
async def debug_info():
    return {
        "models_ready": game_engine.models_ready,
        "actor_critic_loaded": game_engine.actor_critic_loaded,
        "loading_status": game_engine.loading_status,
        "game_started": game_engine.game_started,
        "obs_shape": str(game_engine.obs.shape) if game_engine.obs is not None else "None",
        "connected_clients": len(connected_clients),
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count() if torch.cuda.is_available() else 0,
    }
```
Visit the `/debug` endpoint to see the current state.
**Most Common Issue:** If clicking still doesn't work on HF Spaces, it's usually the WebSocket connection. Update the connection handling as described above.
The core model/clicking issue is now fixed - the remaining items are deployment optimizations for HF Spaces' specific environment! 🎉