VOID: Video Object and Interaction Deletion
VOID removes objects from videos along with all interactions they induce on the scene: not just secondary effects like shadows and reflections, but also physical interactions, such as objects falling when a person is removed.
Project Page | Paper | GitHub | Demo
Quick Start
The included notebook handles setup, downloads models, runs inference on a sample video, and displays the result. Requires a GPU with 40GB+ VRAM (e.g., A100).
Model Details
VOID is built on CogVideoX-Fun-V1.5-5b-InP and fine-tuned for video inpainting with interaction-aware quadmask conditioning: a 4-value mask that encodes the primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep).
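As an illustration of how the four mask categories compose into a single frame, here is a minimal sketch using the 0/63/127/255 encoding described in the input-format section. The helper function is hypothetical, not part of the repo:

```python
import numpy as np

def build_quadmask(remove, overlap, affected):
    """Compose one quadmask frame from boolean HxW region masks.

    Everything not covered by a region is background (keep).
    Later assignments win where regions overlap.
    """
    quadmask = np.full(remove.shape, 255, dtype=np.uint8)  # 255 = keep
    quadmask[affected] = 127   # regions perturbed by the removal
    quadmask[overlap] = 63     # object/scene overlap regions
    quadmask[remove] = 0       # primary object to remove
    return quadmask

# Toy 4x4 example: remove the top-left block, mark one overlap
# pixel and one affected pixel.
remove = np.zeros((4, 4), dtype=bool); remove[:2, :2] = True
overlap = np.zeros((4, 4), dtype=bool); overlap[2, 2] = True
affected = np.zeros((4, 4), dtype=bool); affected[3, 3] = True
qm = build_quadmask(remove, overlap, affected)
```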
Checkpoints
| File | Description | Required? |
|---|---|---|
| `void_pass1.safetensors` | Base inpainting model | Yes |
| `void_pass2.safetensors` | Warped-noise refinement for temporal consistency | Optional |
Pass 1 is sufficient for most videos. Pass 2 adds optical flow-warped latent initialization for improved temporal consistency on longer clips.
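The idea behind flow-warped noise initialization is that the starting noise for each frame is carried along the optical flow from the previous frame, so the noise "sticks" to scene content instead of flickering independently. A minimal nearest-neighbor sketch of the warping step; flow estimation and the actual latent pipeline are out of scope, and all names here are illustrative:

```python
import numpy as np

def warp_noise(noise, flow):
    """Warp a noise field along a backward flow (nearest neighbor).

    noise: HxW array for the previous frame.
    flow:  HxWx2 array of (dy, dx) offsets; for each target pixel
           (y, x), the source pixel is (y + dy, x + dx), clipped
           to the frame bounds.
    """
    h, w = noise.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return noise[src_y, src_x]

rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 8))
zero_flow = np.zeros((8, 8, 2))
# Zero flow leaves the noise unchanged; a constant flow shifts it rigidly.
warped = warp_noise(noise, zero_flow)
```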
Architecture
- Base: CogVideoX 3D Transformer (5B parameters)
- Input: Video + quadmask + text prompt describing the scene after removal
- Resolution: 384x672 (default)
- Max frames: 197
- Scheduler: DDIM
- Precision: BF16 with FP8 quantization for memory efficiency
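The 40GB+ VRAM recommendation in the quick start lines up with back-of-envelope weight sizes for a 5B-parameter transformer (these are rough estimates, not measurements; real usage adds activations, the VAE, and the text encoder):

```python
# Illustrative weight-memory arithmetic for a 5B-parameter model.
params = 5e9
bf16_gb = params * 2 / 1e9   # BF16: 2 bytes per parameter
fp8_gb = params * 1 / 1e9    # FP8: 1 byte per parameter
print(bf16_gb, fp8_gb)
```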
Usage
From the Notebook
The easiest way is to clone the repo and run `notebook.ipynb`:
```bash
git clone https://github.com/netflix/void-model.git
cd void-model
```
From the CLI
```bash
# Install dependencies
pip install -r requirements.txt

# Download the base model
huggingface-cli download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
  --local-dir ./CogVideoX-Fun-V1.5-5b-InP

# Download VOID checkpoints
huggingface-cli download netflix/void-model \
  --local-dir .

# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
  --config config/quadmask_cogvideox.py \
  --config.data.data_rootdir="./sample" \
  --config.experiment.run_seqs="lime" \
  --config.experiment.save_path="./outputs" \
  --config.video_model.transformer_path="./void_pass1.safetensors"
```
Input Format
Each video needs three files in a folder:
```
my-video/
  input_video.mp4   # source video
  quadmask_0.mp4    # 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
  prompt.json       # {"bg": "description of scene after removal"}
```
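Setting up a new input folder can be scripted. A minimal sketch that writes the prompt file and sanity-checks a decoded quadmask frame; the folder path, helper names, and the example frame are placeholders:

```python
import json
from pathlib import Path

import numpy as np

VALID_QUADMASK_VALUES = {0, 63, 127, 255}  # remove / overlap / affected / keep

def write_prompt(folder, bg_description):
    """Write prompt.json in the {"bg": ...} format the repo expects."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "prompt.json").write_text(json.dumps({"bg": bg_description}))

def check_quadmask_frame(frame):
    """Verify a mask frame contains only the four legal values.

    Note: lossy mp4 encoding can perturb pixel values slightly, so a
    production check might snap to the nearest legal value instead.
    """
    return set(np.unique(frame)) <= VALID_QUADMASK_VALUES

write_prompt("my-video", "an empty kitchen counter")
loaded = json.loads(Path("my-video/prompt.json").read_text())
ok = check_quadmask_frame(np.array([[0, 63], [127, 255]], dtype=np.uint8))
```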
The repo includes a mask generation pipeline (VLM-MASK-REASONER/) that creates quadmasks from raw videos using SAM2 + Gemini.
Training
Trained on paired counterfactual videos generated from two sources:
- HUMOTO: human-object interactions rendered in Blender with physics simulation
- Kubric: object-only interactions using Google Scanned Objects
Training was run on 8x A100 80GB GPUs using DeepSpeed ZeRO Stage 2. See the GitHub repo for full training instructions and data generation code.
Citation
```bibtex
@misc{motamed2026void,
  title={VOID: Video Object and Interaction Deletion},
  author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
  year={2026},
  eprint={2604.02296},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02296}
}
```