Cosmos 2B Action-Conditioned World Model — LIBERO Spatial

Self-contained checkpoint repository for running Cosmos Predict 2.5 as an action-conditioned world model on the LIBERO-Spatial benchmark, designed for use with the RLinf reinforcement-learning framework.

Repository Contents

File                        Size       Description
libero-spatial-2b-19k.pt    11.89 GB   Cosmos 2B DiT checkpoint (19k iterations on LIBERO-Spatial)
resnet_rm.pth               43 MB      ResNet reward model for binary success/fail prediction
tokenizer/tokenizer.pth     485 MB     Cosmos video VAE tokenizer
dataset/ (400 × .npy)       77 MB      LIBERO-Spatial initial-state trajectories (seed images)
dataset_statistics.json     2 KB       Action normalization statistics (mean/std)

Model Details

  • Architecture: Cosmos 2B DiT (2048 hidden dim, 28 blocks, 16 heads)
  • Base Model: Cosmos-Predict2.5-2B
  • Training Data: LIBERO-Spatial (400 train + 100 val demonstrations, 10 spatial reasoning tasks)
  • Training Iterations: 19,000 (schedule scaled from the Bridge-dataset baseline)
  • Resolution: 256 × 320 @ 4 FPS
  • Frame Prediction: 12 future frames per inference step
  • Action Space: 7D (x, y, z, roll, pitch, yaw, gripper) × 8 steps with stride 3
  • Denoising: 10 steps, RectifiedFlow 2AB solver, CFG guidance = 7 (see the sampler sketch after this list)
  • Training Duration: ~6.3 hours on A100 80GB
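
The denoising and guidance bullets interact: each of the 10 denoising steps evaluates the DiT twice for classifier-free guidance, which is where most of the per-step forward-pass count under Performance comes from. Below is a minimal first-order (Euler) rectified-flow sampler sketch with CFG; the actual pipeline uses a second-order 2AB (Adams-Bashforth) solver, and dit, its conditioning signature, and the time convention (t = 1 is pure noise) are illustrative assumptions rather than the real Cosmos API.

import torch

def sample_rectified_flow(dit, latents, actions, num_steps=10, guidance=7.0):
    """Euler rectified-flow sampler with classifier-free guidance.

    Illustrative only: `dit(x, t, actions)` is a stand-in for the
    action-conditioned DiT; Cosmos uses a 2nd-order (2AB) solver.
    """
    x = latents  # noise-initialized video latents, shape (B, ...)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)  # t=1: noise, t=0: data
    for i in range(num_steps):
        t = ts[i].expand(x.shape[0])
        v_cond = dit(x, t, actions)  # action-conditioned velocity
        v_uncond = dit(x, t, None)   # unconditional velocity
        # CFG: extrapolate toward the action-conditioned prediction
        v = v_uncond + guidance * (v_cond - v_uncond)
        x = x + (ts[i + 1] - ts[i]) * v  # Euler step (dt is negative)
    return x

Each iteration makes two DiT calls (conditioned + unconditioned), so 10 steps account for 20 of the 22 forward passes reported under Performance.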

Quick Start with RLinf

1. Download

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tayalmanan/cosmos-robotics",
    local_dir="models/Cosmos-Predict2.5-LIBERO-Spatial",
)

2. Install Cosmos dependencies (no OpenSora required)

cd RLinf
bash requirements/install.sh cosmos_world_model

3. Run GRPO training

# Set the model directory in the config, then:
bash examples/embodiment/run_embodiment.sh cosmos_libero_spatial_grpo_openvlaoft

The training config expects a single cosmos_model_dir path. All sub-paths (DiT checkpoint, reward model, tokenizer, dataset) are resolved relative to it:

cosmos_model_dir: "models/Cosmos-Predict2.5-LIBERO-Spatial"
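
How the sub-paths are derived is not shown in the config itself; the hypothetical helper below sketches one plausible resolution using the file names listed under Repository Contents (RLinf's actual loader may differ):

from pathlib import Path

def resolve_cosmos_paths(cosmos_model_dir: str) -> dict:
    """Illustrative path resolution; not RLinf's actual loader."""
    root = Path(cosmos_model_dir)
    return {
        "dit_checkpoint": root / "libero-spatial-2b-19k.pt",
        "reward_model": root / "resnet_rm.pth",
        "tokenizer": root / "tokenizer" / "tokenizer.pth",
        "dataset_dir": root / "dataset",
        "action_stats": root / "dataset_statistics.json",
    }

paths = resolve_cosmos_paths("models/Cosmos-Predict2.5-LIBERO-Spatial")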

Standalone Inference

from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="tayalmanan/cosmos-robotics",
    filename="libero-spatial-2b-19k.pt",
)

See the Cosmos Predict 2.5 docs for standalone inference usage.
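
To sanity-check the download before wiring it into the Cosmos pipeline, the checkpoint can be opened as an ordinary PyTorch file (assuming a standard torch.save layout; newer PyTorch versions may need weights_only=False if the file wraps more than raw tensors):

import torch

# Load on CPU so no GPU is needed just to inspect the file.
state = torch.load(checkpoint_path, map_location="cpu")

# Checkpoints are usually either a raw state dict or a wrapper dict;
# the top-level keys reveal which layout this file uses.
keys = state.keys() if isinstance(state, dict) else []
print(list(keys)[:10])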

Reward Model

The resnet_rm.pth checkpoint is a ResNet-based binary reward model that predicts task success from a single RGB frame (see the scoring sketch after this list):

  • Architecture: ResNet (7×7 conv stem → 4 blocks, 64→128→256→512 channels → AdaptiveAvgPool → Linear → Sigmoid)
  • Output: binary {0, 1} success label (sigmoid output rounded)
  • Input: 256 × 320 RGB observation, normalized to [-1, 1]
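
The exact model class is not shipped with the checkpoint, but the layout above matches a standard ResNet-18 trunk with a one-unit head. A hedged scoring sketch under that assumption (verify the state-dict keys rather than trusting strict=False silently):

import torch
import torch.nn as nn
from torchvision.models import resnet18

# Assumption: the described layout (7x7 conv stem, 64->128->256->512
# blocks, adaptive average pool, linear head, sigmoid) lines up with
# torchvision's resnet18 plus a 1-unit output layer.
model = resnet18(weights=None)
model.fc = nn.Linear(512, 1)
state = torch.load("resnet_rm.pth", map_location="cpu")
model.load_state_dict(state, strict=False)  # check for skipped keys
model.eval()

frame = torch.rand(3, 256, 320)       # RGB observation in [0, 1]
x = frame * 2.0 - 1.0                 # normalize to [-1, 1]
with torch.no_grad():
    prob = torch.sigmoid(model(x.unsqueeze(0)))
reward = int(prob.round().item())     # binary {0, 1} success reward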

Action Format

Actions are 7D vectors (SE(3) + gripper), normalized using dataset_statistics.json:

{
  "action_mean": [...],
  "action_std": [...]
}

Normalized actions are then multiplied by the configured action_scaler (20.0) before being fed to the DiT.
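
A minimal sketch of that preprocessing, assuming dataset_statistics.json stores flat 7-element lists and the scaler is applied after z-scoring (the helper is illustrative, not RLinf's actual code):

import json
import numpy as np

ACTION_SCALER = 20.0  # the action_scaler value from the config

with open("dataset_statistics.json") as f:
    stats = json.load(f)
mean = np.asarray(stats["action_mean"])  # shape (7,)
std = np.asarray(stats["action_std"])    # shape (7,)

def normalize_actions(actions: np.ndarray) -> np.ndarray:
    """Z-score a (T, 7) action chunk, then scale for the DiT."""
    return (actions - mean) / std * ACTION_SCALER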

Performance (RLinf GRPO Training)

Metric                      Value
Epoch time                  ~9 min (4× A100 80GB)
GPU memory                  ~50 GB per GPU
DiT forward passes / step   22 (10 steps × 2 CFG + 2 final)
Batch size                  16 envs per worker

Citation

@misc{cosmos-libero-2b,
  author = {Tayal, Manan},
  title = {Cosmos 2B Action-Conditioned World Model — LIBERO Spatial},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/tayalmanan/cosmos-robotics}}
}

@inproceedings{liu2023libero,
  title={LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
  author={Liu, Bo and Zhu, Yifeng and Gao, Chongkai and Feng, Yihao and Liu, Qiang and Zhu, Yuke and Stone, Peter},
  booktitle={NeurIPS 2023 Datasets and Benchmarks Track},
  year={2023}
}

License

Released under the NVIDIA Open Model License.
