Hugging Face Spaces Deployment - Troubleshooting Guide
Your Local Fix Applied
Great news! The core issue has been resolved locally. The problem was that the downloaded model doesn't contain actor_critic weights, but the code assumed it did. This caused a NoneType error when clicking to start the game.
Fixed: The app now properly detects when actor_critic weights are missing and falls back to human control mode instead of crashing.
Potential HF Spaces Issues & Solutions
Issue 1: Model Download Timeouts
Symptoms:
- "Model loading timed out" message
- App shows loading forever
- Click doesn't start the game
Root Cause: The network on HF Spaces can be slower than a local connection, so the 5-minute timeout may not be enough.
Solution:
# In app.py, update the timeout in _load_model_from_url_async():
success = await asyncio.wait_for(future, timeout=900.0) # 15 minutes instead of 5
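A minimal sketch of how the timeout could be made configurable instead of hard-coded, assuming the loader is awaited with asyncio.wait_for as shown above. The MODEL_LOAD_TIMEOUT environment variable and both helper names are hypothetical and not part of app.py.

```python
import asyncio
import os


def get_load_timeout(default: float = 900.0) -> float:
    """Read the timeout in seconds from MODEL_LOAD_TIMEOUT (hypothetical name)."""
    raw = os.environ.get("MODEL_LOAD_TIMEOUT", "")
    try:
        value = float(raw)
    except ValueError:
        return default  # unset or not a number: keep the 15-minute default
    return value if value > 0 else default


async def load_with_timeout(awaitable, timeout: float) -> bool:
    """Return True if loading finished in time, False if it timed out."""
    try:
        await asyncio.wait_for(awaitable, timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False
```

Note that if the awaited future wraps work running in a thread, a timeout only stops the waiting; the thread itself keeps running in the background.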
Issue 2: Memory Limitations
Symptoms:
- App crashes during model loading
- "Out of memory" errors in logs
- Models load but inference fails
Root Cause: The HF Spaces free tier is limited to 16 GB of RAM, which model loading and inference can exhaust.
Quick Fix: Force CPU-only mode
# Add at the top of app.py
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "" # Force CPU mode for HF Spaces
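The override is safest when it is set before torch is imported, so that CUDA never sees a device. A sketch that applies it only when the app runs inside a Space, assuming Spaces sets the SPACE_ID variable in the container environment. The helper name force_cpu_on_spaces is hypothetical.

```python
import os
from typing import MutableMapping


def force_cpu_on_spaces(environ: MutableMapping[str, str] = os.environ) -> bool:
    """Hide GPUs from CUDA when running inside a Space.

    Assumes HF Spaces sets SPACE_ID in the container environment.
    Returns True if the override was applied.
    """
    if "SPACE_ID" not in environ:
        return False  # local run: leave GPU visibility untouched
    environ["CUDA_VISIBLE_DEVICES"] = ""
    return True


# Call this at the very top of app.py, before the first `import torch`.
force_cpu_on_spaces()
```

This keeps local GPU runs working while still forcing CPU mode on the free tier.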
Better Solution: Add memory management
# Add memory cleanup after model loading
import gc
import torch

gc.collect()  # release unreferenced Python objects
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return cached GPU memory to the driver
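To see how close model loading comes to the memory limit, peak memory can be logged alongside the cleanup. A sketch using only the standard library, assuming a Unix container (the resource module is not available on Windows). The function names are hypothetical.

```python
import gc
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident memory of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def cleanup_and_report(label: str) -> float:
    """Run a garbage-collection pass, then log and return peak memory."""
    gc.collect()
    peak = peak_rss_mb()
    print(f"[memory] {label}: peak RSS {peak:.0f} MiB")
    return peak
```

The value is a high-water mark, so it does not drop after cleanup; it shows the worst case reached so far, which is the number that matters for out-of-memory crashes.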
Issue 3: WebSocket Connection Failures
Symptoms:
- "Connection Error" or "Disconnected" status
- Click works but no response
- Frequent reconnections
Root Cause: HF Spaces serves the app through its own proxy and domain, which can slow or block the WebSocket handshake.
Solution: Update the WebSocket connection code in the HTML template:
// Replace the connectWebSocket function in app.py HTML
function connectWebSocket() {
    // Spaces apps are usually served from a *.hf.space domain
    const host = window.location.hostname;
    const isHFSpaces = host.endsWith('.hf.space') || host.includes('huggingface.co');
    const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:';
    const wsUrl = `${protocol}//${window.location.host}/ws`;
    ws = new WebSocket(wsUrl);

    // Longer timeout for HF Spaces
    const timeout = isHFSpaces ? 30000 : 10000;
    const connectTimer = setTimeout(() => {
        if (ws.readyState !== WebSocket.OPEN) {
            ws.close();
            setTimeout(connectWebSocket, 5000); // Retry after 5s
        }
    }, timeout);

    ws.onopen = function(event) {
        clearTimeout(connectTimer);
        statusEl.textContent = 'Connected';
        statusEl.style.color = '#00ff00';
        // Re-send start if user already clicked
        if (gameStarted && !gamePlaying) {
            ws.send(JSON.stringify({ type: 'start' }));
        }
    };
}
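Because the client re-sends the start message after a reconnect, the server may receive it more than once. A sketch of a handler that tolerates duplicates, assuming JSON messages with a type field as in the client code. The handle_client_message function and the state dictionary are hypothetical and not the actual app.py handler.

```python
import json


def handle_client_message(raw: str, state: dict) -> str:
    """Process one WebSocket text message and return what happened.

    `state` is a hypothetical per-connection dict with a "game_started" flag.
    """
    try:
        message = json.loads(raw)
    except json.JSONDecodeError:
        return "ignored: invalid JSON"
    if not isinstance(message, dict):
        return "ignored: not an object"

    if message.get("type") == "start":
        if state.get("game_started"):
            return "ignored: already started"  # duplicate after a reconnect
        state["game_started"] = True
        return "started"
    return "ignored: unknown type"
```

Treating a repeated start as a no-op means a flaky connection cannot restart a game that is already running.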
Issue 4: Actor-Critic Model Missing
Already Fixed! The app now handles this gracefully:
- Detects missing actor_critic weights
- Falls back to human control mode
- Shows proper warning messages
- Game still works (user can control manually)
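The detection and fallback described above can be sketched as follows. This is not the code in app.py: the "actor_critic." key prefix is an assumption about the checkpoint layout, and the function names and the "ai" / "human" mode strings are hypothetical.

```python
from typing import Mapping


def has_actor_critic_weights(state_dict: Mapping[str, object],
                             prefix: str = "actor_critic.") -> bool:
    """True if any checkpoint key belongs to the actor-critic module.

    The "actor_critic." key prefix is an assumption about the checkpoint layout.
    """
    return any(key.startswith(prefix) for key in state_dict)


def choose_control_mode(state_dict: Mapping[str, object]) -> str:
    """Pick "ai" when weights exist, otherwise fall back to "human"."""
    if has_actor_critic_weights(state_dict):
        return "ai"
    print("Warning: no actor_critic weights found, using human control mode")
    return "human"
```

Checking the checkpoint keys up front avoids the NoneType error that occurs when the code assumes the weights are present.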
Issue 5: Dockerfile Optimization
Update your Dockerfile for HF Spaces:
# Add these optimizations
ENV SHM_SIZE=2g
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
ENV OMP_NUM_THREADS=4
# Add health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s \
CMD curl --fail http://localhost:7860/health || exit 1
Quick Deployment Checklist
Before Deploying:
- Test locally with conda: conda activate diamond && python run_web_demo.py
- Verify the fix works: Click should now work (even without actor_critic weights)
- Check model download: Test internet connectivity for HF model URL
For HF Spaces Deployment:
1. Update timeout values:
   # In app.py line ~153
   success = await asyncio.wait_for(future, timeout=900.0)  # 15 min
2. Add health check endpoint:
   @app.get("/health")
   async def health_check():
       return {
           "status": "healthy",
           "models_ready": game_engine.models_ready,
           "actor_critic_loaded": game_engine.actor_critic_loaded
       }
3. Force CPU mode for free tier:
   # Add at app.py startup
   os.environ["CUDA_VISIBLE_DEVICES"] = ""
4. Update Dockerfile with the optimizations above
5. Test WebSocket connection - add the improved connection handling
Debugging on HF Spaces
Check Logs:
- Go to your Space page on HuggingFace
- Click "Logs" tab
- Look for these messages:
  - "Actor-critic model exists but has no trained weights - using dummy mode!"
  - "WebPlayEnv set to human control mode"
  - "Model loading timed out"
  - "WebSocket error"
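When the logs are long, the messages listed above can be picked out with a small script. A sketch that scans copied log text for them; the grouping into expected and problem messages is an interpretation of this guide, and the names LOG_HINTS and classify_log_lines are hypothetical.

```python
# Substrings of the messages listed above, mapped to what they indicate.
# The grouping is an interpretation of this guide, not taken from app.py.
LOG_HINTS = {
    "has no trained weights": "expected: actor_critic fallback active",
    "human control mode": "expected: actor_critic fallback active",
    "Model loading timed out": "problem: see Issue 1 (timeouts)",
    "WebSocket error": "problem: see Issue 3 (WebSocket)",
}


def classify_log_lines(log_text: str) -> list[tuple[str, str]]:
    """Return (hint, line) pairs for log lines matching a known message."""
    matches = []
    for line in log_text.splitlines():
        for needle, hint in LOG_HINTS.items():
            if needle in line:
                matches.append((hint, line.strip()))
                break  # one hint per line is enough
    return matches
```

Paste the contents of the Logs tab into a string or file and pass it to classify_log_lines to get only the relevant lines.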
Test Health Endpoint:
- Visit: https://your-space.hf.space/health
- Should return JSON with status info
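The same check can be scripted. A sketch using only the standard library, assuming the health endpoint returns the status, models_ready and actor_critic_loaded fields described in this guide. The function names are hypothetical, and the URL is the placeholder used above.

```python
import json
import urllib.request


def fetch_health(base_url: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    url = base_url.rstrip("/") + "/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_health(payload: dict) -> str:
    """Turn the health payload into a one-line status."""
    if not payload.get("models_ready"):
        return "models still loading"
    if payload.get("actor_critic_loaded"):
        return "ready: AI control"
    return "ready: human control (no actor_critic weights)"
```

For example, `print(summarize_health(fetch_health("https://your-space.hf.space")))` prints the current state once the placeholder is replaced with your Space's address.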
Browser Console:
- Open Developer Tools (F12)
- Check for WebSocket connection errors
- Look for JavaScript errors during click
Expected Behavior After Fixes
- App loads → Shows loading progress bar
- Models initialize → Either loads actor_critic OR shows "no trained weights"
- User clicks game area → Game starts immediately (no hanging)
- If actor_critic missing → User gets manual control (still playable!)
- If actor_critic loaded → AI takes control automatically
If Issues Persist
Quick Diagnostic:
# Add this test endpoint to app.py
@app.get("/debug")
async def debug_info():
    return {
        "models_ready": game_engine.models_ready,
        "actor_critic_loaded": game_engine.actor_critic_loaded,
        "loading_status": game_engine.loading_status,
        "game_started": game_engine.game_started,
        "obs_shape": str(game_engine.obs.shape) if game_engine.obs is not None else "None",
        "connected_clients": len(connected_clients),
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count() if torch.cuda.is_available() else 0
    }
Visit the /debug endpoint to see the current state.
Most Common Issue: If clicking still doesn't work on HF Spaces, it's usually the WebSocket connection. Update the connection handling as described above.
The core model/clicking issue is now fixed - the remaining items are deployment optimizations for HF Spaces' specific environment!