
🚀 Hugging Face Spaces Deployment - Troubleshooting Guide

✅ Your Local Fix Applied

Great news! The core issue has been resolved locally. The problem was that the downloaded model doesn't contain actor_critic weights, but the code assumed it did. This caused a NoneType error when clicking to start the game.

Fixed: The app now properly detects when actor_critic weights are missing and falls back to human control mode instead of crashing.
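
The fallback described above can be sketched roughly as follows. This is only an illustration, assuming the checkpoint is a flat state dict whose actor-critic parameters share an `actor_critic.` key prefix. The function name, the prefix, and the mode strings are hypothetical and are not taken from app.py.

```python
from typing import Any, Mapping

# Hypothetical prefix: assumes actor-critic parameters are stored under it.
ACTOR_CRITIC_PREFIX = "actor_critic."


def select_control_mode(state_dict: Mapping[str, Any]) -> str:
    """Return "ai" if the checkpoint has actor-critic weights, else "human"."""
    has_weights = any(key.startswith(ACTOR_CRITIC_PREFIX) for key in state_dict)
    return "ai" if has_weights else "human"


if __name__ == "__main__":
    # Made-up checkpoint keys with no actor-critic entries.
    checkpoint = {"world_model.layer.weight": [0.1], "world_model.layer.bias": [0.0]}
    print(select_control_mode(checkpoint))
```

Deciding the mode once, right after loading, means the click handler never has to touch a model that may be `None`.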

🔍 Potential HF Spaces Issues & Solutions

Issue 1: Model Download Timeouts ⏰

Symptoms:

  • "Model loading timed out" message
  • App shows loading forever
  • Click doesn't start the game

Root Cause: Network throughput on HF Spaces can be lower than on a local machine, so the 5-minute timeout may not be enough for the model download.

Solution:

# In app.py, update the timeout in _load_model_from_url_async():
success = await asyncio.wait_for(future, timeout=900.0)  # 15 minutes instead of 5

Issue 2: Memory Limitations 💾

Symptoms:

  • App crashes during model loading
  • "Out of memory" errors in logs
  • Models load but inference fails

Root Cause: HF Spaces free tier has only 16GB RAM.

Quick Fix: Force CPU-only mode

# Add at the very top of app.py, before torch is imported
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # Hide all GPUs so PyTorch runs on CPU on HF Spaces

Better Solution: Add memory management

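# Add memory cleanup after model loading
import gc
import torch

gc.collect()  # Free unreferenced Python objects left over from loading
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Release unused cached GPU memory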
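
To see how close model loading gets to the RAM limit, it can help to log the process's peak memory around the cleanup step. The sketch below uses only the standard library and assumes a Unix host, because the `resource` module is not available on Windows. The helper names are illustrative.

```python
import gc
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident memory of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def cleanup_and_report(label: str) -> float:
    """Run garbage collection, then print and return the peak memory so far."""
    gc.collect()
    peak = peak_rss_mb()
    print(f"[{label}] peak RSS: {peak:.1f} MiB")
    return peak


if __name__ == "__main__":
    cleanup_and_report("after model load")
```

The reported value is a high-water mark, so it does not drop after `gc.collect()`. It shows the worst case reached so far, which is the number to compare against the Space's memory limit.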

Issue 3: WebSocket Connection Failures 🔌

Symptoms:

  • "Connection Error" or "Disconnected" status
  • Click works but no response
  • Frequent reconnections

Root Cause: HF Spaces proxy/domain restrictions.

Solution: Update the WebSocket connection code in the HTML template:

// Replace the connectWebSocket function in app.py HTML
function connectWebSocket() {
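    // Spaces apps are served from *.hf.space (embedded in an iframe on huggingface.co)
    const host = window.location.hostname;
    const isHFSpaces = host.endsWith('.hf.space') || host.includes('huggingface.co');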
    const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:';
    const wsUrl = `${protocol}//${window.location.host}/ws`;
    
    ws = new WebSocket(wsUrl);
    
    // Longer timeout for HF Spaces
    const timeout = isHFSpaces ? 30000 : 10000;
    
    const connectTimer = setTimeout(() => {
        if (ws.readyState !== WebSocket.OPEN) {
            ws.close();
            setTimeout(connectWebSocket, 5000); // Retry after 5s
        }
    }, timeout);
    
    ws.onopen = function(event) {
        clearTimeout(connectTimer);
        statusEl.textContent = 'Connected';
        statusEl.style.color = '#00ff00';
        
        // Re-send start if user already clicked
        if (gameStarted && !gamePlaying) {
            ws.send(JSON.stringify({ type: 'start' }));
        }
    };
}

Issue 4: Actor-Critic Model Missing 🧠

Already Fixed! ✅ The app now handles this gracefully:

  • Detects missing actor_critic weights
  • Falls back to human control mode
  • Shows proper warning messages
  • Game still works (user can control manually)

Issue 5: Dockerfile Optimization 🐳

Update your Dockerfile for HF Spaces:

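# Add these optimizations
# Note: Docker itself does not read SHM_SIZE; shared memory is sized with the
# --shm-size runtime option, so this variable only matters if your own scripts use it.
ENV SHM_SIZE=2g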
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
ENV OMP_NUM_THREADS=4

# Add health check (requires curl in the image and the /health endpoint in app.py)
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s \
  CMD curl --fail http://localhost:7860/health || exit 1

🚀 Quick Deployment Checklist

Before Deploying:

  1. ✅ Test locally with conda: conda activate diamond && python run_web_demo.py
  2. ✅ Verify the fix works: Click should now work (even without actor_critic weights)
  3. ✅ Check model download: Test internet connectivity for HF model URL

For HF Spaces Deployment:

  1. Update timeout values:

    # In app.py line ~153
    success = await asyncio.wait_for(future, timeout=900.0)  # 15 min
    
  2. Add health check endpoint:

    @app.get("/health")
    async def health_check():
        return {
            "status": "healthy",
            "models_ready": game_engine.models_ready,
            "actor_critic_loaded": game_engine.actor_critic_loaded
        }
    
  3. Force CPU mode for free tier:

    # Add at app.py startup, before torch is imported (needs "import os")
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    
  4. Update Dockerfile with the optimizations above

  5. Test WebSocket connection - add the improved connection handling

🔧 Debugging on HF Spaces

Check Logs:

  1. Go to your Space page on Hugging Face
  2. Click "Logs" tab
  3. Look for these messages:
    • ✅ "Actor-critic model exists but has no trained weights - using dummy mode!"
    • ✅ "WebPlayEnv set to human control mode"
    • ❌ "Model loading timed out"
    • ❌ "WebSocket error"
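
If you copy the log output into a file, a small script can flag the lines worth reading. This is a sketch: the message fragments come from the list above, while the function names and the three-way classification are illustrative assumptions.

```python
# Message fragments copied from the list above.
EXPECTED = (
    "Actor-critic model exists but has no trained weights",
    "WebPlayEnv set to human control mode",
)
PROBLEMS = ("Model loading timed out", "WebSocket error")


def classify_log_line(line: str) -> str:
    """Return "problem", "expected", or "other" for one log line."""
    if any(fragment in line for fragment in PROBLEMS):
        return "problem"
    if any(fragment in line for fragment in EXPECTED):
        return "expected"
    return "other"


def find_problems(log_text: str) -> list:
    """Return every line of log_text that matches a known problem message."""
    return [ln for ln in log_text.splitlines() if classify_log_line(ln) == "problem"]


if __name__ == "__main__":
    sample = "WebPlayEnv set to human control mode\nModel loading timed out\n"
    for line in find_problems(sample):
        print("PROBLEM:", line)
```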

Test Health Endpoint:

  • Visit: https://your-space.hf.space/health
  • Should return JSON with status info
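
The same check can be scripted. The sketch below assumes the `/health` endpoint returns the three fields shown in the deployment checklist (`status`, `models_ready`, `actor_critic_loaded`). The function names and summary strings are illustrative, and the base URL is the placeholder used above.

```python
import json
import urllib.request


def fetch_health(base_url: str, timeout_s: float = 10.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)


def summarize_health(payload: dict) -> str:
    """Turn the health payload into a one-line status."""
    if payload.get("status") != "healthy":
        return "unhealthy"
    if not payload.get("models_ready"):
        return "models still loading"
    if not payload.get("actor_critic_loaded"):
        return "ready (human control)"
    return "ready (AI control)"


if __name__ == "__main__":
    # Against a live Space: summarize_health(fetch_health("https://your-space.hf.space"))
    sample = {"status": "healthy", "models_ready": True, "actor_critic_loaded": False}
    print(summarize_health(sample))
```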

Browser Console:

  • Open Developer Tools (F12)
  • Check for WebSocket connection errors
  • Look for JavaScript errors during click

🎯 Expected Behavior After Fixes

  1. App loads → Shows loading progress bar
  2. Models initialize → Either loads actor_critic OR shows "no trained weights"
  3. User clicks game area → Game starts immediately (no hanging)
  4. If actor_critic missing → User gets manual control (still playable!)
  5. If actor_critic loaded → AI takes control automatically

🆘 If Issues Persist

Quick Diagnostic:

# Add this test endpoint to app.py
@app.get("/debug")
async def debug_info():
    return {
        "models_ready": game_engine.models_ready,
        "actor_critic_loaded": game_engine.actor_critic_loaded,
        "loading_status": game_engine.loading_status,
        "game_started": game_engine.game_started,
        "obs_shape": str(game_engine.obs.shape) if game_engine.obs is not None else "None",
        "connected_clients": len(connected_clients),
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count() if torch.cuda.is_available() else 0
    }

Visit the /debug endpoint to see the current state.

Most Common Issue: If clicking still doesn't work on HF Spaces, it's usually the WebSocket connection. Update the connection handling as described above.

The core model/clicking issue is now fixed - the remaining items are deployment optimizations for HF Spaces' specific environment! 🎉