🔧 Troubleshooting: Slow/Hanging Generation

Your Issue

✅ Deployed successfully
⚠️ App loads, but generation takes forever or hangs


πŸ” Step 1: Check HF Space Logs

How to Access Logs:

  1. Go to: https://huggingface.co/spaces/nocapdev/my-gradio-momask
  2. Click "Logs" tab (top right)
  3. Look for these specific messages:

🎯 Common Issues & Solutions

Issue A: Models Not Found

Look for in logs:

ERROR: Model checkpoints not found!
Looking for: ./checkpoints

Why: The checkpoints/ directory wasn't uploaded to HF Spaces

Solutions:

  1. Upload models to Space (if <10GB total):

    # In your local directory
    git clone https://huggingface.co/spaces/nocapdev/my-gradio-momask
    cd my-gradio-momask
    cp -r /path/to/checkpoints ./
    git add checkpoints/
    git commit -m "Add model checkpoints"
    git push
    
  2. Use Git LFS for large files (recommended):

    git lfs install
    git lfs track "checkpoints/**/*.tar"
    git lfs track "checkpoints/**/*.pth"
    git add .gitattributes
    git add checkpoints/
    git commit -m "Add models with LFS"
    git push
    
  3. Host on HF Model Hub (best for very large files):

    • Upload checkpoints to HF Model Hub
    • Modify app.py to download on startup
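
A minimal sketch of such a startup download, using `huggingface_hub.snapshot_download`. The `repo_id` below is a placeholder — point it at whatever Model Hub repo you upload the checkpoints to:

```python
import os

def ensure_checkpoints(local_dir="./checkpoints",
                       repo_id="your-username/momask-checkpoints"):
    """Download model checkpoints from the HF Model Hub on first startup.

    `repo_id` is a placeholder -- replace it with your own model repo.
    """
    if os.path.isdir(local_dir):
        return local_dir  # already present, nothing to download
    # Lazy import so the app can still start and print a clear error
    # if huggingface_hub is missing from requirements.txt.
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return local_dir
```

Call `ensure_checkpoints()` at the top of `app.py`, before any model loading.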

Issue B: Using CPU Instead of GPU

Look for in logs:

Using device: cpu

Why: HF Spaces free tier uses CPU. Models are very slow on CPU.

Impact: Generation can take 5-30 minutes on CPU vs 10-30 seconds on GPU

Solutions:

  1. Upgrade to GPU Space (costs money):

    • Go to Space Settings
    • Change hardware to T4 GPU (~$0.60/hour)
    • Or use A10G for faster inference
  2. Optimize for CPU (free but slower):

    • Reduce time_steps from 18 to 10
    • Use smaller batch processing
    • Add timeout warnings
  3. Use CPU optimizations: Add to app.py:

    # Set CPU threads
    torch.set_num_threads(4)
    # Use CPU-optimized operations
    torch.set_float32_matmul_precision('medium')
    

Issue C: Out of Memory

Look for in logs:

Killed
SIGKILL
OutOfMemoryError

Why: Models too large for available RAM

Solutions:

  1. Upgrade Space hardware (HF Space Settings)
  2. Reduce model size:
    • Use FP16 instead of FP32
    • Reduce batch sizes
  3. Add memory monitoring
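
For the monitoring step, a stdlib-only helper (no extra dependencies) is enough to see how close the process gets to the RAM limit before it is OOM-killed. Note that `ru_maxrss` units differ between Linux and macOS:

```python
import resource
import sys

def log_peak_memory(label=""):
    """Print the process's peak resident set size (RSS).

    Call this between pipeline stages so the Space logs show memory
    growth before the process gets OOM-killed.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux, bytes on macOS.
    if sys.platform == "darwin":
        peak //= 1024
    print(f"[mem] {label}: peak RSS ~{peak / 1024:.0f} MB")
```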

Issue D: Stuck During Generation

Look for in logs:

[1/4] Generating motion tokens...
[Nothing else appears]

Why:

  • CPU inference is very slow (can take 10-20 minutes)
  • Infinite loop in model
  • Process timeout

Solutions:

  1. Wait longer - CPU generation can take 10-30 minutes!
  2. Check if it's actually running:
    • Look for CPU usage in HF Space metrics
  3. Add a timeout (Unix only; must run in the main thread):
    import signal
    def timeout_handler(signum, frame):
        raise TimeoutError("Generation timed out")
    signal.signal(signal.SIGALRM, timeout_handler)  # register the handler
    signal.alarm(600)  # raise TimeoutError after 10 minutes
    # call signal.alarm(0) when generation finishes to cancel the timer
    

📊 What You Should See in Logs

Healthy Startup:

Using device: cuda  # or cpu
Loading models...
✓ VQ model loaded
✓ Transformer loaded
✓ Residual model loaded
✓ Length estimator loaded
Models loaded successfully!
Running on local URL: http://0.0.0.0:7860

Healthy Generation (with my updates):

======================================================================
Generating motion for: 'a person walks forward'
======================================================================
[1/4] Generating motion tokens...
✓ Generated 80 frames
[2/4] Converting to BVH format...
✓ BVH conversion complete
[3/4] Rendering video...
✓ Video saved to ./gradio_outputs/motion_12345.mp4
[4/4] Complete!
======================================================================

Unhealthy - Models Missing:

ERROR: Model checkpoints not found!
Looking for: ./checkpoints
The model files are not included in this Space.

Unhealthy - Error During Init:

ERROR during initialization:
======================================================================
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints/...'

🚀 Quick Fix: Redeploy with Better Logging

I've updated app.py with:

  • ✅ Auto CPU/GPU detection
  • ✅ Better error messages
  • ✅ Progress indicators
  • ✅ Graceful failure if models missing
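
One way the graceful-failure check could be written (a sketch; `check_models` is a hypothetical helper, but its messages match the error log excerpts above):

```python
import os
import sys

def check_models(checkpoint_dir="./checkpoints"):
    """Fail fast with a readable log message instead of hanging later."""
    if not os.path.isdir(checkpoint_dir):
        print("ERROR: Model checkpoints not found!", file=sys.stderr)
        print(f"Looking for: {checkpoint_dir}", file=sys.stderr)
        print("The model files are not included in this Space.", file=sys.stderr)
        return False
    return True
```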

To deploy the update:

python deploy.py

💡 Immediate Actions

Action 1: Check Logs NOW

  1. Go to Logs tab on your Space
  2. Copy the last 50 lines
  3. Look for any ERROR messages
  4. Share them if you need help

Action 2: Verify Models

# On your local machine, check model sizes
ls -lh checkpoints/
du -sh checkpoints/  # total size

# If very large (>5GB), you'll need Git LFS

Action 3: Expected Timings

| Hardware | Generation Time   |
|----------|-------------------|
| Free CPU | 10-30 minutes ⚠️  |
| T4 GPU   | 20-60 seconds ✅  |
| A10G GPU | 10-30 seconds ✅  |

If using free tier: Be patient! First generation takes longer.


🎯 Next Steps

  1. Check logs - Most important!
  2. Redeploy updated app.py - Better error handling
  3. Share log output - So I can help debug

To redeploy:

python deploy.py

Then check logs again to see the new detailed output!


πŸ“ Share These from Logs:

Copy and share:

  1. Lines showing "Using device: X"
  2. Any lines with "ERROR" or "FAIL"
  3. Last 20 lines when you submitted a prompt
  4. Any "Killed" or "SIGKILL" messages

This will help identify the exact issue!