🔧 Troubleshooting: Slow/Hanging Generation

Your Issue

✅ Deployed successfully
⚠️ App loads, but generation takes forever or hangs


πŸ” Step 1: Check HF Space Logs

How to Access Logs:

  1. Go to: https://huggingface.co/spaces/nocapdev/my-gradio-momask
  2. Click "Logs" tab (top right)
  3. Look for these specific messages:

🎯 Common Issues & Solutions

Issue A: Models Not Found

Look for in logs:

ERROR: Model checkpoints not found!
Looking for: ./checkpoints

Why: The checkpoints/ directory wasn't uploaded to HF Spaces

Solutions:

  1. Upload models to Space (if <10GB total):

    # In your local directory
    git clone https://huggingface.co/spaces/nocapdev/my-gradio-momask
    cd my-gradio-momask
    cp -r /path/to/checkpoints ./
    git add checkpoints/
    git commit -m "Add model checkpoints"
    git push
    
  2. Use Git LFS for large files (recommended):

    git lfs install
    git lfs track "checkpoints/**/*.tar"
    git lfs track "checkpoints/**/*.pth"
    git add .gitattributes
    git add checkpoints/
    git commit -m "Add models with LFS"
    git push
    
  3. Host on HF Model Hub (best for very large files):

    • Upload checkpoints to HF Model Hub
    • Modify app.py to download on startup
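
A minimal sketch of such a startup download, using `huggingface_hub.snapshot_download`. The `repo_id` below is a placeholder — point it at whatever Model Hub repo you upload the checkpoints to:

```python
import os

def ensure_checkpoints(local_dir="./checkpoints",
                       repo_id="your-username/momask-checkpoints"):
    """Download model checkpoints from the HF Model Hub on first startup.

    `repo_id` is a placeholder -- replace it with your own model repo.
    """
    if os.path.isdir(local_dir):
        return local_dir  # already present, nothing to download
    # Lazy import so the app can still start and print a clear error
    # if huggingface_hub is missing from requirements.txt.
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return local_dir
```

Call `ensure_checkpoints()` at the top of `app.py`, before any model loading.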

Issue B: Using CPU Instead of GPU

Look for in logs:

Using device: cpu

Why: HF Spaces free tier uses CPU. Models are very slow on CPU.

Impact: Generation can take 5-30 minutes on CPU vs 10-30 seconds on GPU

Solutions:

  1. Upgrade to GPU Space (costs money):

    • Go to Space Settings
    • Change hardware to T4 GPU (~$0.60/hour)
    • Or use A10G for faster inference
  2. Optimize for CPU (free but slower):

    • Reduce time_steps from 18 to 10
    • Use smaller batch processing
    • Add timeout warnings
  3. Use CPU optimizations: Add to app.py:

    # Set CPU threads
    torch.set_num_threads(4)
    # Use CPU-optimized operations
    torch.set_float32_matmul_precision('medium')
    

Issue C: Out of Memory

Look for in logs:

Killed
SIGKILL
OutOfMemoryError

Why: Models too large for available RAM

Solutions:

  1. Upgrade Space hardware (HF Space Settings)
  2. Reduce model size:
    • Use FP16 instead of FP32
    • Reduce batch sizes
  3. Add memory monitoring
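
For the monitoring step, a stdlib-only helper (no extra dependencies) is enough to see how close the process gets to the RAM limit before it is OOM-killed. Note that `ru_maxrss` units differ between Linux and macOS:

```python
import resource
import sys

def log_peak_memory(label=""):
    """Print the process's peak resident set size (RSS).

    Call this between pipeline stages so the Space logs show memory
    growth before the process gets OOM-killed.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux, bytes on macOS.
    if sys.platform == "darwin":
        peak //= 1024
    print(f"[mem] {label}: peak RSS ~{peak / 1024:.0f} MB")
```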

Issue D: Stuck During Generation

Look for in logs:

[1/4] Generating motion tokens...
[Nothing else appears]

Why:

  • CPU inference is very slow (can take 10-20 minutes)
  • Infinite loop in model
  • Process timeout

Solutions:

  1. Wait longer - CPU generation can take 10-30 minutes!
  2. Check if it's actually running:
    • Look for CPU usage in HF Space metrics
  3. Add a timeout (Unix only; must run in the main thread):
    import signal
    def timeout_handler(signum, frame):
        raise TimeoutError("Generation timed out")
    signal.signal(signal.SIGALRM, timeout_handler)  # register the handler
    signal.alarm(600)  # raise TimeoutError after 10 minutes
    # call signal.alarm(0) when generation finishes to cancel the timer
    

📊 What You Should See in Logs

Healthy Startup:

Using device: cuda  # or cpu
Loading models...
✓ VQ model loaded
✓ Transformer loaded
✓ Residual model loaded
✓ Length estimator loaded
Models loaded successfully!
Running on local URL: http://0.0.0.0:7860

Healthy Generation (with my updates):

======================================================================
Generating motion for: 'a person walks forward'
======================================================================
[1/4] Generating motion tokens...
✓ Generated 80 frames
[2/4] Converting to BVH format...
✓ BVH conversion complete
[3/4] Rendering video...
✓ Video saved to ./gradio_outputs/motion_12345.mp4
[4/4] Complete!
======================================================================

Unhealthy - Models Missing:

ERROR: Model checkpoints not found!
Looking for: ./checkpoints
The model files are not included in this Space.

Unhealthy - Error During Init:

ERROR during initialization:
======================================================================
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints/...'

🚀 Quick Fix: Redeploy with Better Logging

I've updated app.py with:

  • ✅ Auto CPU/GPU detection
  • ✅ Better error messages
  • ✅ Progress indicators
  • ✅ Graceful failure if models missing
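
One way the graceful-failure check could be written (a sketch; `check_models` is a hypothetical helper, but its messages match the error log excerpts above):

```python
import os
import sys

def check_models(checkpoint_dir="./checkpoints"):
    """Fail fast with a readable log message instead of hanging later."""
    if not os.path.isdir(checkpoint_dir):
        print("ERROR: Model checkpoints not found!", file=sys.stderr)
        print(f"Looking for: {checkpoint_dir}", file=sys.stderr)
        print("The model files are not included in this Space.", file=sys.stderr)
        return False
    return True
```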

To deploy the update:

python deploy.py

💡 Immediate Actions

Action 1: Check Logs NOW

  1. Go to Logs tab on your Space
  2. Copy the last 50 lines
  3. Look for any ERROR messages
  4. Share them if you need help

Action 2: Verify Models

# On your local machine, check model sizes
ls -lh checkpoints/
du -sh checkpoints/  # total size

# If very large (>5GB), you'll need Git LFS

Action 3: Expected Timings

| Hardware | Generation Time   |
|----------|-------------------|
| Free CPU | 10-30 minutes ⚠️  |
| T4 GPU   | 20-60 seconds ✅  |
| A10G GPU | 10-30 seconds ✅  |

If using free tier: Be patient! First generation takes longer.


🎯 Next Steps

  1. Check logs - Most important!
  2. Redeploy updated app.py - Better error handling
  3. Share log output - So I can help debug

To redeploy:

python deploy.py

Then check logs again to see the new detailed output!


πŸ“ Share These from Logs:

Copy and share:

  1. Lines showing "Using device: X"
  2. Any lines with "ERROR" or "FAIL"
  3. Last 20 lines when you submitted a prompt
  4. Any "Killed" or "SIGKILL" messages

This will help identify the exact issue!