# Hugging Face Spaces Deployment - Troubleshooting Guide
## ✅ **Your Local Fix Applied**
Great news! The core issue has been resolved locally. The problem was that the downloaded model doesn't contain `actor_critic` weights, but the code assumed it did. This caused a `NoneType` error when clicking to start the game.
**Fixed**: The app now properly detects when `actor_critic` weights are missing and falls back to human control mode instead of crashing.
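The detection step can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: it assumes the checkpoint is a flat state dict whose actor-critic parameters share an `actor_critic.` key prefix, and the names `has_actor_critic_weights` and `choose_control_mode` are hypothetical.

```python
from typing import Any, Mapping


def has_actor_critic_weights(state_dict: Mapping[str, Any],
                             prefix: str = "actor_critic.") -> bool:
    """Return True if any parameter key belongs to the actor-critic module.

    Assumption: checkpoint keys are namespaced by module name,
    e.g. "actor_critic.lstm.weight".
    """
    return any(key.startswith(prefix) for key in state_dict)


def choose_control_mode(state_dict: Mapping[str, Any]) -> str:
    """Pick "ai" when policy weights exist, otherwise fall back to "human"."""
    return "ai" if has_actor_critic_weights(state_dict) else "human"
```

Checking the keys before building the policy is what avoids the `NoneType` error: the fallback decision is made once, up front, instead of failing on the first click.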
## **Potential HF Spaces Issues & Solutions**
### **Issue 1: Model Download Timeouts** ⏰
**Symptoms:**
- "Model loading timed out" message
- App shows loading forever
- Click doesn't start the game
**Root Cause:** Model downloads on HF Spaces can be slow, so the 5-minute timeout may not be enough.
**Solution:**
```python
# In app.py, update the timeout in _load_model_from_url_async():
success = await asyncio.wait_for(future, timeout=900.0) # 15 minutes instead of 5
```
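Instead of hard-coding the value, the timeout could be made configurable and the timeout case handled explicitly. The sketch below is one possible shape, not the app's code: the `MODEL_LOAD_TIMEOUT_S` environment variable and the `load_with_timeout` helper are assumptions introduced for illustration.

```python
import asyncio
import os
from typing import Awaitable


def get_load_timeout(default: float = 900.0) -> float:
    """Read the timeout in seconds from a hypothetical MODEL_LOAD_TIMEOUT_S variable."""
    return float(os.environ.get("MODEL_LOAD_TIMEOUT_S", default))


async def load_with_timeout(loader: Awaitable[bool], timeout: float) -> bool:
    """Await the loader; return False instead of raising when it times out."""
    try:
        return await asyncio.wait_for(loader, timeout=timeout)
    except asyncio.TimeoutError:
        print(f"Model loading timed out after {timeout:.0f}s")
        return False
```

Returning `False` rather than letting the exception propagate lets the caller show the "Model loading timed out" message and keep the app responsive.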
### **Issue 2: Memory Limitations** 💾
**Symptoms:**
- App crashes during model loading
- "Out of memory" errors in logs
- Models load but inference fails
**Root Cause:** The HF Spaces free tier provides only 16 GB of RAM.
**Quick Fix:** Force CPU-only mode
```python
# Add at the very top of app.py, before torch is imported,
# so that CUDA is never initialized
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "" # Force CPU mode for HF Spaces
```
**Better Solution:** Add memory management
```python
# Add memory cleanup after model loading
import gc

import torch

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```
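To judge how close the app gets to the RAM limit, peak memory can be logged after model loading. This is a small sketch using only the standard library; `peak_rss_mb` is a hypothetical helper, and the `resource` module is available on Unix-like systems only (which includes the Linux containers used by Spaces).

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in megabytes.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


print(f"Peak memory so far: {peak_rss_mb():.0f} MB")
```

Calling this once right after the models are loaded and once after the first inference step shows which phase is responsible if the Space runs out of memory.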
### **Issue 3: WebSocket Connection Failures**
**Symptoms:**
- "Connection Error" or "Disconnected" status
- Click works but no response
- Frequent reconnections
**Root Cause:** HF Spaces proxy/domain restrictions.
**Solution:** Update the WebSocket connection code in the HTML template:
```javascript
// Replace the connectWebSocket function in the HTML template in app.py
function connectWebSocket() {
    // Spaces are served from *.hf.space (and embedded in huggingface.co pages)
    const host = window.location.hostname;
    const isHFSpaces = host.endsWith('.hf.space') || host.includes('huggingface.co');
    const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:';
    const wsUrl = `${protocol}//${window.location.host}/ws`;
    ws = new WebSocket(wsUrl);

    // Longer timeout for HF Spaces
    const timeout = isHFSpaces ? 30000 : 10000;
    const connectTimer = setTimeout(() => {
        if (ws.readyState !== WebSocket.OPEN) {
            ws.close();
            setTimeout(connectWebSocket, 5000); // Retry after 5s
        }
    }, timeout);

    ws.onopen = function(event) {
        clearTimeout(connectTimer);
        statusEl.textContent = 'Connected';
        statusEl.style.color = '#00ff00';
        // Re-send start if user already clicked
        if (gameStarted && !gamePlaying) {
            ws.send(JSON.stringify({ type: 'start' }));
        }
    };

    // Keep any existing ws.onmessage / ws.onclose / ws.onerror handlers
    // from the original function here.
}
```
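When checking the endpoint from outside the browser, it helps to know exactly which URL the page will try. The same derivation can be mirrored in Python; `websocket_url` is a hypothetical helper, and the `/ws` path is taken from the JavaScript above.

```python
from urllib.parse import urlsplit


def websocket_url(page_url: str, path: str = "/ws") -> str:
    """Map an http(s) page URL to the matching ws(s) endpoint on the same host."""
    parts = urlsplit(page_url)
    scheme = "wss" if parts.scheme == "https" else "ws"
    return f"{scheme}://{parts.netloc}{path}"
```

For a Space served over HTTPS this always yields a `wss://` URL; a plain `ws://` connection from an HTTPS page is blocked by browsers as mixed content, which is one way the "Disconnected" status can appear.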
### **Issue 4: Actor-Critic Model Missing**
**Already Fixed!** ✅
The app now handles this gracefully:
- Detects missing `actor_critic` weights
- Falls back to human control mode
- Shows proper warning messages
- Game still works (user can control manually)
### **Issue 5: Dockerfile Optimization** 🐳
**Update your Dockerfile for HF Spaces:**
```dockerfile
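# Add these optimizations
# Note: SHM_SIZE is only an environment variable; Docker sizes /dev/shm via
# the --shm-size run flag, so this line does not resize shared memory itself.
ENV SHM_SIZE=2g
# Only takes effect when CUDA is in use
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
ENV OMP_NUM_THREADS=4

# Add health check (requires curl to be installed in the image)
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s \
    CMD curl --fail http://localhost:7860/health || exit 1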
```
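The `HEALTHCHECK` above relies on `curl`, which slim base images often do not include. A standard-library alternative could look like the sketch below; `is_healthy` is a hypothetical helper, and the URL assumes the app listens on port 7860 and exposes `/health` as in the Dockerfile snippet.

```python
import urllib.error
import urllib.request


def is_healthy(url: str = "http://localhost:7860/health", timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    # An empty ProxyHandler bypasses any configured proxy, since the target is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx responses.
        return False
```

If this were saved as `healthcheck.py` (an assumed file name), the Dockerfile could call it with `CMD python -c "import sys, healthcheck; sys.exit(0 if healthcheck.is_healthy() else 1)"` instead of `curl`.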
## **Quick Deployment Checklist**
### **Before Deploying:**
1. ✅ **Test locally with conda**: `conda activate diamond && python run_web_demo.py`
2. ✅ **Verify the fix works**: Click should now work (even without actor_critic weights)
3. ✅ **Check model download**: Test internet connectivity for HF model URL
### **For HF Spaces Deployment:**
1. **Update timeout values:**
```python
# In app.py line ~153
success = await asyncio.wait_for(future, timeout=900.0) # 15 min
```
2. **Add health check endpoint:**
```python
@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "models_ready": game_engine.models_ready,
        "actor_critic_loaded": game_engine.actor_critic_loaded,
    }
```
3. **Force CPU mode for free tier:**
```python
# Add at app.py startup, before torch is imported
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```
4. **Update Dockerfile** with the optimizations above
5. **Test WebSocket connection** - add the improved connection handling
## **Debugging on HF Spaces**
### **Check Logs:**
1. Go to your Space page on Hugging Face
2. Click "Logs" tab
3. Look for these messages:
   - ✅ `"Actor-critic model exists but has no trained weights - using dummy mode!"`
   - ✅ `"WebPlayEnv set to human control mode"`
   - ❌ `"Model loading timed out"`
   - ❌ `"WebSocket error"`
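
For long logs, the same check can be scripted. The sketch below assumes the log text has been copied out of the "Logs" tab into a string; `scan_logs` and the two marker tuples are hypothetical names, while the message strings are the ones listed above.

```python
GOOD_MARKERS = (
    "Actor-critic model exists but has no trained weights - using dummy mode!",
    "WebPlayEnv set to human control mode",
)
BAD_MARKERS = (
    "Model loading timed out",
    "WebSocket error",
)


def scan_logs(log_text: str) -> dict[str, list[str]]:
    """Group the known messages found in the log text into expected and problem lists."""
    return {
        "expected": [m for m in GOOD_MARKERS if m in log_text],
        "problems": [m for m in BAD_MARKERS if m in log_text],
    }
```

An empty `problems` list together with both expected messages suggests the fallback to human control is working as intended.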
### **Test Health Endpoint:**
- Visit: `https://your-space.hf.space/health`
- Should return JSON with status info
### **Browser Console:**
- Open Developer Tools (F12)
- Check for WebSocket connection errors
- Look for JavaScript errors during click
## 🎯 **Expected Behavior After Fixes**
1. **App loads** → Shows loading progress bar
2. **Models initialize** → Either loads actor_critic OR shows "no trained weights"
3. **User clicks game area** → Game starts immediately (no hanging)
4. **If actor_critic missing** → User gets manual control (still playable!)
5. **If actor_critic loaded** → AI takes control automatically
## **If Issues Persist**
**Quick Diagnostic:**
```python
# Add this test endpoint to app.py
@app.get("/debug")
async def debug_info():
    return {
        "models_ready": game_engine.models_ready,
        "actor_critic_loaded": game_engine.actor_critic_loaded,
        "loading_status": game_engine.loading_status,
        "game_started": game_engine.game_started,
        "obs_shape": str(game_engine.obs.shape) if game_engine.obs is not None else "None",
        "connected_clients": len(connected_clients),
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count() if torch.cuda.is_available() else 0,
    }
```
Visit the `/debug` endpoint to see the current state.
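
The JSON returned by `/debug` can be turned into rough hints. The helper below is a sketch under the assumption that the response has the fields shown in the endpoint above; `diagnose` is a hypothetical name and the hint texts only restate the issues covered in this guide.

```python
from typing import Any, Mapping


def diagnose(debug: Mapping[str, Any]) -> list[str]:
    """Translate the /debug payload into a list of likely causes."""
    hints = []
    if not debug.get("models_ready"):
        hints.append("Models are not ready: check for a download timeout or an out-of-memory crash.")
    elif not debug.get("actor_critic_loaded"):
        hints.append("No actor_critic weights: expect human control mode.")
    if debug.get("connected_clients", 0) == 0:
        hints.append("No connected clients: check the WebSocket connection.")
    return hints
```

An empty list means none of the conditions covered here apply, not that the deployment is guaranteed to be healthy.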
**Most Common Issue:** If clicking still doesn't work on HF Spaces, it's usually the WebSocket connection. Update the connection handling as described above.
The core model/clicking issue is now fixed; the remaining items are deployment optimizations specific to the HF Spaces environment.