# 🚀 Hugging Face Spaces Deployment - Troubleshooting Guide

## ✅ **Your Local Fix Applied**
Great news! The core issue has been resolved locally. The problem was that the downloaded model doesn't contain `actor_critic` weights, but the code assumed it did. This caused a `NoneType` error when clicking to start the game.

**Fixed**: The app now properly detects when `actor_critic` weights are missing and falls back to human control mode instead of crashing.
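
A minimal sketch of that detection, assuming the checkpoint is a flat state dict whose actor-critic parameters carry an `actor_critic.` key prefix. The prefix, the function names, and the mode strings are illustrative and not taken from `app.py`:

```python
from typing import Any, Mapping

# Assumed key prefix for actor-critic parameters; adjust to the real checkpoint layout.
ACTOR_CRITIC_PREFIX = "actor_critic."


def has_actor_critic_weights(state_dict: Mapping[str, Any]) -> bool:
    """Return True if at least one key belongs to the actor-critic module."""
    return any(key.startswith(ACTOR_CRITIC_PREFIX) for key in state_dict)


def select_control_mode(state_dict: Mapping[str, Any]) -> str:
    """Pick "ai" when actor-critic weights exist, otherwise fall back to "human"."""
    return "ai" if has_actor_critic_weights(state_dict) else "human"
```

Checking the keys before touching the module avoids the `NoneType` access entirely, rather than catching the error after it happens.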

## 🔍 **Potential HF Spaces Issues & Solutions**

### **Issue 1: Model Download Timeouts** ⏰

**Symptoms:**
- "Model loading timed out" message
- App shows loading forever
- Click doesn't start the game

**Root Cause:** The network on HF Spaces can be slower than a local connection, so the 5-minute timeout may not be enough.

**Solution:**
```python
# In app.py, update the timeout in _load_model_from_url_async():
success = await asyncio.wait_for(future, timeout=900.0)  # 15 minutes instead of 5
```
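
If the timeout needs tuning per environment, it can be read from configuration instead of being hard-coded. The following is a sketch only: `MODEL_LOAD_TIMEOUT_S` is a hypothetical environment variable and `load_with_timeout` is an illustrative wrapper, neither of which exists in `app.py`:

```python
import asyncio
import os

# Hypothetical environment variable; defaults to the 15 minutes suggested above.
MODEL_LOAD_TIMEOUT_S = float(os.environ.get("MODEL_LOAD_TIMEOUT_S", "900"))


async def load_with_timeout(load_coro, timeout_s: float = MODEL_LOAD_TIMEOUT_S) -> bool:
    """Await a model-loading coroutine, returning False instead of raising on timeout."""
    try:
        return bool(await asyncio.wait_for(load_coro, timeout=timeout_s))
    except asyncio.TimeoutError:
        # wait_for cancels the awaited task before raising the timeout error.
        print(f"Model loading timed out after {timeout_s:.0f}s")
        return False
```

Returning `False` keeps the caller's existing `success` check working while still surfacing the "Model loading timed out" message in the logs.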

### **Issue 2: Memory Limitations** 💾

**Symptoms:**
- App crashes during model loading
- "Out of memory" errors in logs
- Models load but inference fails

**Root Cause:** The HF Spaces free tier has only 16 GB of RAM.

**Quick Fix:** Force CPU-only mode
```python
# Add at the very top of app.py, before torch is imported
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # Hide all GPUs so PyTorch runs on CPU
```

**Better Solution:** Add memory management
```python
# Add memory cleanup after model loading
import gc
import torch

gc.collect()  # Release unreferenced Python objects
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Return cached GPU memory to the driver
```
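
To confirm whether memory is actually the limiting factor, it can help to log the process's peak memory after loading. This is a sketch using only the standard library; `peak_rss_mb` is an illustrative helper, and the `resource` module is available on Unix-like systems only (which includes the Linux containers Spaces run):

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


print(f"Peak memory so far: {peak_rss_mb():.0f} MB")
```

Calling this once before and once after model loading shows how much of the 16 GB budget the models consume.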

### **Issue 3: WebSocket Connection Failures** 🔌

**Symptoms:**
- "Connection Error" or "Disconnected" status
- Click works but no response
- Frequent reconnections

**Root Cause:** HF Spaces proxy/domain restrictions.

**Solution:** Update the WebSocket connection code in the HTML template:
```javascript
// Replace the connectWebSocket function in app.py HTML
function connectWebSocket() {
    // Spaces are served from *.hf.space as well as huggingface.co
    const isHFSpaces = /(\.hf\.space|huggingface\.co)$/.test(window.location.hostname);
    const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:';
    const wsUrl = `${protocol}//${window.location.host}/ws`;
    
    ws = new WebSocket(wsUrl);
    
    // Longer timeout for HF Spaces
    const timeout = isHFSpaces ? 30000 : 10000;
    
    const connectTimer = setTimeout(() => {
        if (ws.readyState !== WebSocket.OPEN) {
            ws.close();
            setTimeout(connectWebSocket, 5000); // Retry after 5s
        }
    }, timeout);
    
    ws.onopen = function(event) {
        clearTimeout(connectTimer);
        statusEl.textContent = 'Connected';
        statusEl.style.color = '#00ff00';
        
        // Re-send start if user already clicked
        if (gameStarted && !gamePlaying) {
            ws.send(JSON.stringify({ type: 'start' }));
        }
    };
}
```

### **Issue 4: Actor-Critic Model Missing** 🧠

**Already Fixed!** ✅ The app now handles this gracefully:
- Detects missing `actor_critic` weights
- Falls back to human control mode  
- Shows proper warning messages
- Game still works (user can control manually)

### **Issue 5: Dockerfile Optimization** 🐳

**Update your Dockerfile for HF Spaces:**
```dockerfile
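# Add these optimizations
# SHM_SIZE is only a plain environment variable here; Docker itself sizes
# /dev/shm via `--shm-size` at run time, so the app must read this value
ENV SHM_SIZE=2g
# Only takes effect when a CUDA device is in use
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
ENV OMP_NUM_THREADS=4

# Add health check (requires curl to be installed in the image)
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s \
  CMD curl --fail http://localhost:7860/health || exit 1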
```

## 🚀 **Quick Deployment Checklist**

### **Before Deploying:**
1. ✅ **Test locally with conda**: `conda activate diamond && python run_web_demo.py`
2. ✅ **Verify the fix works**: Click should now work (even without actor_critic weights)
3. ✅ **Check model download**: Test internet connectivity for HF model URL

### **For HF Spaces Deployment:**

1. **Update timeout values:**
   ```python
   # In app.py line ~153
   success = await asyncio.wait_for(future, timeout=900.0)  # 15 min
   ```

2. **Add health check endpoint:**
   ```python
   @app.get("/health")
   async def health_check():
       return {
           "status": "healthy",
           "models_ready": game_engine.models_ready,
           "actor_critic_loaded": game_engine.actor_critic_loaded
       }
   ```

3. **Force CPU mode for free tier:**
   ```python
   # Add at app.py startup, before torch is imported
   import os
   os.environ["CUDA_VISIBLE_DEVICES"] = ""
   ```

4. **Update Dockerfile** with the optimizations above

5. **Test WebSocket connection** - add the improved connection handling

## 🔧 **Debugging on HF Spaces**

### **Check Logs:**
1. Go to your Space page on Hugging Face
2. Click the "Logs" tab
3. Look for these messages:
   - ✅ `"Actor-critic model exists but has no trained weights - using dummy mode!"`
   - ✅ `"WebPlayEnv set to human control mode"`
   - ❌ `"Model loading timed out"`
   - ❌ `"WebSocket error"`
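
For long logs, the same check can be scripted. The sketch below assumes the log has been copied into a list of lines; `scan_logs` and the two marker tuples are illustrative names, and the marker substrings are taken from the list above:

```python
from typing import Dict, Iterable, List

# Messages that indicate the expected fallback path.
EXPECTED_MARKERS = (
    "Actor-critic model exists but has no trained weights",
    "WebPlayEnv set to human control mode",
)
# Messages that indicate a real problem.
PROBLEM_MARKERS = (
    "Model loading timed out",
    "WebSocket error",
)


def scan_logs(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group log lines into expected fallback messages and problem messages."""
    found: Dict[str, List[str]] = {"expected": [], "problem": []}
    for line in lines:
        if any(marker in line for marker in PROBLEM_MARKERS):
            found["problem"].append(line)
        elif any(marker in line for marker in EXPECTED_MARKERS):
            found["expected"].append(line)
    return found
```

An empty `"problem"` list together with both expected messages suggests the fallback to human control is working as intended.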

### **Test Health Endpoint:**
- Visit: `https://your-space.hf.space/health`
- Should return JSON with status info
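
Because model loading can take several minutes, polling the endpoint is often more practical than refreshing it by hand. This sketch assumes the `/health` endpoint from the checklist is deployed and returns a `models_ready` field; `fetch_json` and `wait_until_ready` are illustrative helpers:

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_ready(
    url: str,
    attempts: int = 30,
    delay_s: float = 10.0,
    fetch: Callable[[str], dict] = fetch_json,
) -> Optional[dict]:
    """Poll the health endpoint until models_ready is true; return None if it never is."""
    for attempt in range(attempts):
        try:
            status = fetch(url)
            if status.get("models_ready"):
                return status
        except (OSError, ValueError):
            pass  # Space still starting, or the body was not valid JSON yet
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return None
```

For example, `wait_until_ready("https://your-space.hf.space/health")` waits up to about five minutes with the default settings and returns the final status payload, or `None` if the models never became ready.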

### **Browser Console:**
- Open Developer Tools (F12)
- Check for WebSocket connection errors
- Look for JavaScript errors during click

## 🎯 **Expected Behavior After Fixes**

1. **App loads** → Shows loading progress bar
2. **Models initialize** → Either loads actor_critic OR shows "no trained weights"  
3. **User clicks game area** → Game starts immediately (no hanging)
4. **If actor_critic missing** → User gets manual control (still playable!)
5. **If actor_critic loaded** → AI takes control automatically

## 🆘 **If Issues Persist**

**Quick Diagnostic:**
```python
# Add this test endpoint to app.py
@app.get("/debug")
async def debug_info():
    return {
        "models_ready": game_engine.models_ready,
        "actor_critic_loaded": game_engine.actor_critic_loaded,
        "loading_status": game_engine.loading_status,
        "game_started": game_engine.game_started,
        "obs_shape": str(game_engine.obs.shape) if game_engine.obs is not None else "None",
        "connected_clients": len(connected_clients),
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count() if torch.cuda.is_available() else 0
    }
```

Visit the `/debug` endpoint to see the current state.

**Most Common Issue:** If clicking still doesn't work on HF Spaces, it's usually the WebSocket connection. Update the connection handling as described above.

The core model/clicking issue is now fixed - the remaining items are deployment optimizations for HF Spaces' specific environment! 🎉