VOID: Video Object and Interaction Deletion
VOID removes objects from videos along with all interactions they induce on the scene: not just secondary effects like shadows and reflections, but also physical interactions, such as objects falling when a person is removed.
Project Page | Paper | GitHub | Demo
Quick Start
The included notebook handles setup, downloads models, runs inference on a sample video, and displays the result. Requires a GPU with 40GB+ VRAM (e.g., A100).
Model Details
VOID is built on CogVideoX-Fun-V1.5-5b-InP and fine-tuned for video inpainting with interaction-aware quadmask conditioning: a 4-value mask that encodes the primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep).
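As an illustration of how the four mask categories compose into a single frame, here is a minimal sketch using the 0/63/127/255 encoding described in the input-format section. The helper function is hypothetical, not part of the repo:

```python
import numpy as np

def build_quadmask(remove, overlap, affected):
    """Compose one quadmask frame from boolean HxW region masks.

    Everything not covered by a region is background (keep).
    Later assignments win where regions overlap.
    """
    quadmask = np.full(remove.shape, 255, dtype=np.uint8)  # 255 = keep
    quadmask[affected] = 127   # regions perturbed by the removal
    quadmask[overlap] = 63     # object/scene overlap regions
    quadmask[remove] = 0       # primary object to remove
    return quadmask

# Toy 4x4 example: remove the top-left block, mark one overlap
# pixel and one affected pixel.
remove = np.zeros((4, 4), dtype=bool); remove[:2, :2] = True
overlap = np.zeros((4, 4), dtype=bool); overlap[2, 2] = True
affected = np.zeros((4, 4), dtype=bool); affected[3, 3] = True
qm = build_quadmask(remove, overlap, affected)
```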
Checkpoints
| File | Description | Required? |
|---|---|---|
| `void_pass1.safetensors` | Base inpainting model | Yes |
| `void_pass2.safetensors` | Warped-noise refinement for temporal consistency | Optional |
Pass 1 is sufficient for most videos. Pass 2 adds optical flow-warped latent initialization for improved temporal consistency on longer clips.
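The idea behind flow-warped noise initialization is that the starting noise for each frame is carried along the optical flow from the previous frame, so the noise "sticks" to scene content instead of flickering independently. A minimal nearest-neighbor sketch of the warping step; flow estimation and the actual latent pipeline are out of scope, and all names here are illustrative:

```python
import numpy as np

def warp_noise(noise, flow):
    """Warp a noise field along a backward flow (nearest neighbor).

    noise: HxW array for the previous frame.
    flow:  HxWx2 array of (dy, dx) offsets; for each target pixel
           (y, x), the source pixel is (y + dy, x + dx), clipped
           to the frame bounds.
    """
    h, w = noise.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return noise[src_y, src_x]

rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 8))
zero_flow = np.zeros((8, 8, 2))
# Zero flow leaves the noise unchanged; a constant flow shifts it rigidly.
warped = warp_noise(noise, zero_flow)
```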
Architecture
- Base: CogVideoX 3D Transformer (5B parameters)
- Input: Video + quadmask + text prompt describing the scene after removal
- Resolution: 384x672 (default)
- Max frames: 197
- Scheduler: DDIM
- Precision: BF16 with FP8 quantization for memory efficiency
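The 40GB+ VRAM recommendation in the quick start lines up with back-of-envelope weight sizes for a 5B-parameter transformer (these are rough estimates, not measurements; real usage adds activations, the VAE, and the text encoder):

```python
# Illustrative weight-memory arithmetic for a 5B-parameter model.
params = 5e9
bf16_gb = params * 2 / 1e9   # BF16: 2 bytes per parameter
fp8_gb = params * 1 / 1e9    # FP8: 1 byte per parameter
print(bf16_gb, fp8_gb)
```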
Usage
From the Notebook
The easiest way is to clone the repo and run `notebook.ipynb`:
```bash
git clone https://github.com/netflix/void-model.git
cd void-model
```
From the CLI
```bash
# Install dependencies
pip install -r requirements.txt

# Download the base model
huggingface-cli download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
  --local-dir ./CogVideoX-Fun-V1.5-5b-InP

# Download VOID checkpoints
huggingface-cli download netflix/void-model \
  --local-dir .

# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
  --config config/quadmask_cogvideox.py \
  --config.data.data_rootdir="./sample" \
  --config.experiment.run_seqs="lime" \
  --config.experiment.save_path="./outputs" \
  --config.video_model.transformer_path="./void_pass1.safetensors"
```
Input Format
Each video needs three files in a folder:
```
my-video/
  input_video.mp4   # source video
  quadmask_0.mp4    # 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
  prompt.json       # {"bg": "description of scene after removal"}
```
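Setting up a new input folder can be scripted. A minimal sketch that writes the prompt file and sanity-checks a decoded quadmask frame; the folder path, helper names, and the example frame are placeholders:

```python
import json
from pathlib import Path

import numpy as np

VALID_QUADMASK_VALUES = {0, 63, 127, 255}  # remove / overlap / affected / keep

def write_prompt(folder, bg_description):
    """Write prompt.json in the {"bg": ...} format the repo expects."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "prompt.json").write_text(json.dumps({"bg": bg_description}))

def check_quadmask_frame(frame):
    """Verify a mask frame contains only the four legal values.

    Note: lossy mp4 encoding can perturb pixel values slightly, so a
    production check might snap to the nearest legal value instead.
    """
    return set(np.unique(frame)) <= VALID_QUADMASK_VALUES

write_prompt("my-video", "an empty kitchen counter")
loaded = json.loads(Path("my-video/prompt.json").read_text())
ok = check_quadmask_frame(np.array([[0, 63], [127, 255]], dtype=np.uint8))
```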
The repo includes a mask generation pipeline (VLM-MASK-REASONER/) that creates quadmasks from raw videos using SAM2 + Gemini.
Training
Trained on paired counterfactual videos generated from two sources:
- HUMOTO: human-object interactions rendered in Blender with physics simulation
- Kubric: object-only interactions using Google Scanned Objects
Training was run on 8x A100 80GB GPUs using DeepSpeed ZeRO Stage 2. See the GitHub repo for full training instructions and data generation code.
Citation
```bibtex
@misc{motamed2026void,
  title={VOID: Video Object and Interaction Deletion},
  author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
  year={2026},
  eprint={2604.02296},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02296}
}
```