Cosmos 2B Action-Conditioned World Model — LIBERO Spatial

Self-contained checkpoint repository for running Cosmos Predict 2.5 as an action-conditioned world model on the LIBERO-Spatial benchmark, designed for use with the RLinf reinforcement-learning framework.

Repository Contents

File                        Size       Description
libero-spatial-2b-19k.pt    11.89 GB   Cosmos 2B DiT checkpoint (19k iterations on LIBERO-Spatial)
resnet_rm.pth               43 MB      ResNet reward model for binary success/fail prediction
tokenizer/tokenizer.pth     485 MB     Cosmos video VAE tokenizer
dataset/ (400 × .npy)       77 MB      LIBERO-Spatial initial-state trajectories (seed images)
dataset_statistics.json     2 KB       Action normalization statistics (mean/std)

Model Details

  • Architecture: Cosmos 2B DiT (2048 hidden dim, 28 blocks, 16 heads)
  • Base Model: Cosmos-Predict2.5-2B
  • Training Data: LIBERO-Spatial (400 train + 100 val demonstrations, 10 spatial reasoning tasks)
  • Training Iterations: 19,000 (schedule scaled from the Bridge-dataset baseline)
  • Resolution: 256 × 320 @ 4 FPS
  • Frame Prediction: 12 future frames per inference step
  • Action Space: 7D (x, y, z, roll, pitch, yaw, gripper) × 8 steps with stride 3
  • Denoising: 10 steps, RectifiedFlow 2AB solver, CFG guidance = 7 (see the sampler sketch after this list)
  • Training Duration: ~6.3 hours on A100 80GB
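
The denoising and guidance bullets interact: each of the 10 denoising steps evaluates the DiT twice for classifier-free guidance, which is where most of the per-step forward-pass count under Performance comes from. Below is a minimal first-order (Euler) rectified-flow sampler sketch with CFG; the actual pipeline uses a second-order 2AB (Adams-Bashforth) solver, and dit, its conditioning signature, and the time convention (t = 1 is pure noise) are illustrative assumptions rather than the real Cosmos API.

import torch

def sample_rectified_flow(dit, latents, actions, num_steps=10, guidance=7.0):
    """Euler rectified-flow sampler with classifier-free guidance.

    Illustrative only: `dit(x, t, actions)` is a stand-in for the
    action-conditioned DiT; Cosmos uses a 2nd-order (2AB) solver.
    """
    x = latents  # noise-initialized video latents, shape (B, ...)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)  # t=1: noise, t=0: data
    for i in range(num_steps):
        t = ts[i].expand(x.shape[0])
        v_cond = dit(x, t, actions)  # action-conditioned velocity
        v_uncond = dit(x, t, None)   # unconditional velocity
        # CFG: extrapolate toward the action-conditioned prediction
        v = v_uncond + guidance * (v_cond - v_uncond)
        x = x + (ts[i + 1] - ts[i]) * v  # Euler step (dt is negative)
    return x

Each iteration makes two DiT calls (conditioned + unconditioned), so 10 steps account for 20 of the 22 forward passes reported under Performance.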

Quick Start with RLinf

1. Download

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tayalmanan/cosmos-robotics",
    local_dir="models/Cosmos-Predict2.5-LIBERO-Spatial",
)

2. Install Cosmos dependencies (no OpenSora required)

cd RLinf
bash requirements/install.sh cosmos_world_model

3. Run GRPO training

# Set the model directory in the config, then:
bash examples/embodiment/run_embodiment.sh cosmos_libero_spatial_grpo_openvlaoft

The training config expects a single cosmos_model_dir path. All sub-paths (DiT checkpoint, reward model, tokenizer, dataset) are resolved relative to it:

cosmos_model_dir: "models/Cosmos-Predict2.5-LIBERO-Spatial"
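
How the sub-paths are derived is not shown in the config itself; the hypothetical helper below sketches one plausible resolution using the file names listed under Repository Contents (RLinf's actual loader may differ):

from pathlib import Path

def resolve_cosmos_paths(cosmos_model_dir: str) -> dict:
    """Illustrative path resolution; not RLinf's actual loader."""
    root = Path(cosmos_model_dir)
    return {
        "dit_checkpoint": root / "libero-spatial-2b-19k.pt",
        "reward_model": root / "resnet_rm.pth",
        "tokenizer": root / "tokenizer" / "tokenizer.pth",
        "dataset_dir": root / "dataset",
        "action_stats": root / "dataset_statistics.json",
    }

paths = resolve_cosmos_paths("models/Cosmos-Predict2.5-LIBERO-Spatial")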

Standalone Inference

from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="tayalmanan/cosmos-robotics",
    filename="libero-spatial-2b-19k.pt",
)

See the Cosmos Predict 2.5 docs for standalone inference usage.
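
To sanity-check the download before wiring it into the Cosmos pipeline, the checkpoint can be opened as an ordinary PyTorch file (assuming a standard torch.save layout; newer PyTorch versions may need weights_only=False if the file wraps more than raw tensors):

import torch

# Load on CPU so no GPU is needed just to inspect the file.
state = torch.load(checkpoint_path, map_location="cpu")

# Checkpoints are usually either a raw state dict or a wrapper dict;
# the top-level keys reveal which layout this file uses.
keys = state.keys() if isinstance(state, dict) else []
print(list(keys)[:10])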

Reward Model

The resnet_rm.pth checkpoint is a ResNet-based binary reward model that predicts task success from a single RGB frame (see the scoring sketch after this list):

  • Architecture: ResNet (7×7 conv stem → 4 blocks, 64→128→256→512 channels → AdaptiveAvgPool → Linear → Sigmoid)
  • Output: binary {0, 1} success label (sigmoid output rounded)
  • Input: 256 × 320 RGB observation, normalized to [-1, 1]
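
The exact model class is not shipped with the checkpoint, but the layout above matches a standard ResNet-18 trunk with a one-unit head. A hedged scoring sketch under that assumption (verify the state-dict keys rather than trusting strict=False silently):

import torch
import torch.nn as nn
from torchvision.models import resnet18

# Assumption: the described layout (7x7 conv stem, 64->128->256->512
# blocks, adaptive average pool, linear head, sigmoid) lines up with
# torchvision's resnet18 plus a 1-unit output layer.
model = resnet18(weights=None)
model.fc = nn.Linear(512, 1)
state = torch.load("resnet_rm.pth", map_location="cpu")
model.load_state_dict(state, strict=False)  # check for skipped keys
model.eval()

frame = torch.rand(3, 256, 320)       # RGB observation in [0, 1]
x = frame * 2.0 - 1.0                 # normalize to [-1, 1]
with torch.no_grad():
    prob = torch.sigmoid(model(x.unsqueeze(0)))
reward = int(prob.round().item())     # binary {0, 1} success reward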

Action Format

Actions are 7D vectors (SE(3) + gripper), normalized using dataset_statistics.json:

{
  "action_mean": [...],
  "action_std": [...]
}

Normalized actions are then multiplied by the configured action_scaler (20.0) before being fed to the DiT.
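
A minimal sketch of that preprocessing, assuming dataset_statistics.json stores flat 7-element lists and the scaler is applied after z-scoring (the helper is illustrative, not RLinf's actual code):

import json
import numpy as np

ACTION_SCALER = 20.0  # the action_scaler value from the config

with open("dataset_statistics.json") as f:
    stats = json.load(f)
mean = np.asarray(stats["action_mean"])  # shape (7,)
std = np.asarray(stats["action_std"])    # shape (7,)

def normalize_actions(actions: np.ndarray) -> np.ndarray:
    """Z-score a (T, 7) action chunk, then scale for the DiT."""
    return (actions - mean) / std * ACTION_SCALER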

Performance (RLinf GRPO Training)

Metric                      Value
Epoch time                  ~9 min (4× A100 80GB)
GPU memory                  ~50 GB per GPU
DiT forward passes / step   22 (10 steps × 2 CFG + 2 final)
Batch size                  16 envs per worker

Citation

@misc{cosmos-libero-2b,
  author = {Tayal, Manan},
  title = {Cosmos 2B Action-Conditioned World Model — LIBERO Spatial},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/tayalmanan/cosmos-robotics}}
}

@inproceedings{liu2023libero,
  title={LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
  author={Liu, Bo and Zhu, Yifeng and Gao, Chongkai and Feng, Yihao and Liu, Qiang and Zhu, Yuke and Stone, Peter},
  booktitle={NeurIPS 2023 Datasets and Benchmarks Track},
  year={2023}
}

License

Released under the NVIDIA Open Model License.
