# Cosmos 2B Action-Conditioned World Model – LIBERO Spatial
Self-contained checkpoint repository for running Cosmos Predict 2.5 as an action-conditioned world model on the LIBERO-Spatial benchmark, designed for use with the RLinf reinforcement-learning framework.
## Repository Contents
| File | Size | Description |
|---|---|---|
| `libero-spatial-2b-19k.pt` | 11.89 GB | Cosmos 2B DiT checkpoint (19k iterations on LIBERO-Spatial) |
| `resnet_rm.pth` | 43 MB | ResNet reward model for binary success/fail prediction |
| `tokenizer/tokenizer.pth` | 485 MB | Cosmos video VAE tokenizer |
| `dataset/` (400 × `.npy`) | 77 MB | LIBERO-Spatial initial-state trajectories (seed images) |
| `dataset_statistics.json` | 2 KB | Action normalization statistics (mean/std) |
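The seed trajectories in `dataset/` are plain `.npy` files and can be inspected directly with NumPy. A minimal sketch (the individual filenames inside `dataset/` are not listed here, so the glob below simply picks the first one):

```python
import glob

import numpy as np

# Pick any one of the 400 seed files; names are repo-specific, so we glob.
seed_path = sorted(glob.glob("models/Cosmos-Predict2.5-LIBERO-Spatial/dataset/*.npy"))[0]
seed = np.load(seed_path, allow_pickle=True)

# Print what the file actually holds before assuming a particular layout.
print(type(seed), getattr(seed, "shape", None), getattr(seed, "dtype", None))
```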
## Model Details
- Architecture: Cosmos 2B DiT (2048 hidden dim, 28 blocks, 16 heads)
- Base Model: Cosmos-1.0-Diffusion-7B-Text2World
- Training Data: LIBERO-Spatial (400 train + 100 val demonstrations, 10 spatial reasoning tasks)
- Training Iterations: 19,000 (scaled from Bridge dataset baseline)
- Resolution: 256 × 320 @ 4 FPS
- Frame Prediction: 12 future frames per inference step
- Action Space: 7D (x, y, z, roll, pitch, yaw, gripper) × 8 steps with stride 3 (see the shape sketch after this list)
- Denoising: 10 steps, RectifiedFlow 2AB solver, CFG guidance = 7
- Training Duration: ~6.3 hours on A100 80GB
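To make these settings concrete, here is a rough sketch of the tensor shapes they imply. It is purely illustrative; the actual RLinf/Cosmos dataloaders define the real layout:

```python
import torch

batch = 1

# Conditioning: an 8-step chunk of 7D actions (x, y, z, roll, pitch, yaw, gripper),
# normalized and scaled as described under "Action Format" below.
actions = torch.zeros(batch, 8, 7)

# Seed observation at the model resolution (256 x 320).
seed_frame = torch.zeros(batch, 3, 256, 320)

# The model predicts 12 future frames per inference step at 4 FPS.
predicted_frames = torch.zeros(batch, 12, 3, 256, 320)

# Each prediction runs 10 denoising steps with CFG guidance scale 7.
num_denoise_steps, cfg_scale = 10, 7.0
```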
## Quick Start with RLinf
### 1. Download
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tayalmanan/cosmos-robotics",
    local_dir="models/Cosmos-Predict2.5-LIBERO-Spatial",
)
```
### 2. Install Cosmos dependencies (no OpenSora required)
```bash
cd RLinf
bash requirements/install.sh cosmos_world_model
```
### 3. Run GRPO training
```bash
# Set the model directory in the config, then:
bash examples/embodiment/run_embodiment.sh cosmos_libero_spatial_grpo_openvlaoft
```
The training config expects a single `cosmos_model_dir` path. All sub-paths (DiT checkpoint, reward model, tokenizer, dataset) are resolved relative to it:

```yaml
cosmos_model_dir: "models/Cosmos-Predict2.5-LIBERO-Spatial"
```
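For reference, a minimal sketch of how the sub-paths resolve; the filenames match the table above, but the key names below are illustrative rather than the ones RLinf actually uses:

```python
from pathlib import Path

cosmos_model_dir = Path("models/Cosmos-Predict2.5-LIBERO-Spatial")

# Hypothetical key names; only the relative file locations are taken from this repo.
paths = {
    "dit_checkpoint": cosmos_model_dir / "libero-spatial-2b-19k.pt",
    "reward_model": cosmos_model_dir / "resnet_rm.pth",
    "tokenizer": cosmos_model_dir / "tokenizer" / "tokenizer.pth",
    "dataset": cosmos_model_dir / "dataset",
    "action_stats": cosmos_model_dir / "dataset_statistics.json",
}

for name, p in paths.items():
    print(f"{name}: {p} (exists: {p.exists()})")
```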
## Standalone Inference
```python
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="tayalmanan/cosmos-robotics",
    filename="libero-spatial-2b-19k.pt",
)
```
See the Cosmos Predict 2.5 docs for standalone inference usage.
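Before wiring the checkpoint into an inference pipeline, it can be sanity-checked with plain PyTorch, reusing `checkpoint_path` from the snippet above. This only inspects the file; it does not build the model:

```python
import torch

# Load on CPU just to look at the contents; the full DiT checkpoint is ~12 GB.
state = torch.load(checkpoint_path, map_location="cpu")

# Checkpoints are often wrapped (e.g. {"model": ..., "ema": ...}), so list the
# top-level keys before assuming a flat state dict.
if isinstance(state, dict):
    print(list(state.keys())[:10])
```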
## Reward Model
`resnet_rm.pth` is a ResNet-based binary reward model that predicts task success from a single RGB frame:
- Architecture: ResNet (Conv7 → 4 blocks 64→128→256→512 → AdaptiveAvgPool → Linear → Sigmoid)
- Output: Binary {0, 1} after rounding
- Input: 256 × 320 RGB observation, normalized to [-1, 1]
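A short sketch of the input/output convention described above; only the preprocessing and thresholding are shown, since the ResNet class itself ships with the RLinf Cosmos integration and `reward_model` below is just a placeholder for however you instantiate it and load `resnet_rm.pth`:

```python
import numpy as np
import torch

def preprocess_frame(rgb: np.ndarray) -> torch.Tensor:
    """Convert a uint8 HxWx3 (256x320) RGB frame to a [-1, 1] NCHW tensor."""
    x = torch.from_numpy(rgb).float() / 127.5 - 1.0  # [0, 255] -> [-1, 1]
    return x.permute(2, 0, 1).unsqueeze(0)           # HWC -> 1x3xHxW

frame = np.zeros((256, 320, 3), dtype=np.uint8)      # placeholder observation
obs = preprocess_frame(frame)

# reward_model = ...  # ResNet loaded from resnet_rm.pth (instantiation not shown)
# prob = reward_model(obs)                   # sigmoid output in [0, 1]
# success = int(torch.round(prob).item())    # binary {0, 1} after rounding
```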
## Action Format
Actions are 7D vectors (SE(3) + gripper), normalized using `dataset_statistics.json`:

```json
{
  "action_mean": [...],
  "action_std": [...]
}
```
Normalized actions are further scaled by `action_scaler: 20.0` before being fed to the DiT.
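A minimal sketch of that normalization, assuming the usual mean/std convention; the exact order of operations in the training code may differ:

```python
import json

import numpy as np

with open("models/Cosmos-Predict2.5-LIBERO-Spatial/dataset_statistics.json") as f:
    stats = json.load(f)

mean = np.asarray(stats["action_mean"])
std = np.asarray(stats["action_std"])
action_scaler = 20.0  # value quoted in this card

def normalize_actions(raw: np.ndarray) -> np.ndarray:
    """raw: (..., 7) SE(3)+gripper commands -> scaled, normalized DiT conditioning."""
    return (raw - mean) / std * action_scaler

chunk = np.zeros((8, 7))            # one 8-step action chunk
conditioning = normalize_actions(chunk)
```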
## Performance (RLinf GRPO Training)
| Metric | Value |
|---|---|
| Epoch time | ~9 min (4× A100 80GB) |
| GPU memory | ~50 GB per GPU |
| DiT forward passes / step | 22 (10 steps × 2 CFG + 2 final) |
| Batch size | 16 envs per worker |
## Citation
```bibtex
@misc{cosmos-libero-2b,
  author       = {Tayal, Manan},
  title        = {Cosmos 2B Action-Conditioned World Model -- LIBERO Spatial},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/tayalmanan/cosmos-robotics}}
}

@inproceedings{liu2023libero,
  title     = {LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
  author    = {Liu, Bo and Zhu, Yifeng and Gao, Chongkai and Feng, Yihao and Liu, Qiang and Zhu, Yuke and Stone, Peter},
  booktitle = {NeurIPS 2023 Datasets and Benchmarks Track},
  year      = {2023}
}
```
## License
Released under the NVIDIA Open Model License.