# SmolVLA (VLA-Arena Fine-tuned)
## About VLA-Arena
VLA-Arena is a comprehensive benchmark designed to quantitatively characterize the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing toward generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this with a structured task design framework that quantifies difficulty along three orthogonal axes:
- Task Structure: 170+ tasks grouped into four key dimensions:
  - Safety: Operating reliably under strict constraints.
  - Distractor: Handling environmental unpredictability.
  - Extrapolation: Generalizing to unseen scenarios.
  - Long Horizon: Executing complex, multi-step tasks.
- Language Command: Variations in instruction complexity.
- Visual Observation: Perturbations in visual input.
Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on L0 tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints.
## Model Overview
This model is SmolVLA, a lightweight VLA policy fine-tuned on demonstration data generated from VLA-Arena. It prioritizes computational efficiency by freezing the heavy vision backbone and training only a specialized Action Expert.
It is particularly notable for its large Action Chunking capability: it predicts 50 steps into the future in a single inference pass, yielding smooth, temporally consistent motion.
## Model Architecture
SmolVLA adopts a modular design where the perceptual tower is kept static to preserve pre-trained visual features, while a dedicated expert network learns the specific sensorimotor mappings for the arena tasks.
| Component | Description |
|---|---|
| Vision Encoder | Frozen (Pre-trained weights preserved) |
| Policy Head | Action Expert (Trainable) |
| Input Modalities | 2 RGB Images ($256 \times 256$) + State Vector (8-dim) |
| Action Space | 7-DoF Continuous Control |
| Prediction Horizon | 50 steps (Action Chunking) |
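
The table above implies a simple parameter-grouping rule. Below is a minimal PyTorch sketch of that rule; the attribute names `vision_encoder` and `action_expert` are hypothetical stand-ins, since the actual SmolVLA module names may differ:

```python
import torch.nn as nn

def freeze_vision_train_expert(policy: nn.Module) -> list[nn.Parameter]:
    """Freeze the perceptual tower; return only the trainable expert parameters."""
    # Keep the pre-trained visual features intact.
    for param in policy.vision_encoder.parameters():  # hypothetical attribute
        param.requires_grad = False

    # Only the action expert is optimized during fine-tuning.
    return [p for p in policy.action_expert.parameters() if p.requires_grad]  # hypothetical attribute
```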
### Key Feature: Large Action Chunking ($N_{act} = 50$)
Unlike models that predict a single step or a small chunk per inference, SmolVLA is configured to output a trajectory of 50 steps at once. This reduces how often high-level inference must run and promotes long-term temporal consistency during execution, as sketched below.
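
Schematically, chunked execution means the policy is queried only once every 50 environment steps. Here is a sketch of such a control loop, where `policy.predict_chunk` and the `env` interface are illustrative stand-ins rather than the actual SmolVLA API:

```python
N_ACT = 50  # actions produced per inference pass

def run_episode(env, policy, max_steps: int = 1_000) -> None:
    """Chunked control: one policy call drives N_ACT low-level steps."""
    obs = env.reset()
    for _ in range(0, max_steps, N_ACT):
        # A single forward pass yields an (N_ACT, 7) action trajectory.
        chunk = policy.predict_chunk(obs)  # illustrative inference call
        for action in chunk:
            obs, _, done, _ = env.step(action)
            if done:
                return
```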
## Training Details
### Dataset
This model was trained on the `VLA-Arena/VLA_Arena_L0_L_lerobot_smolvla` dataset. The data is formatted for the LeRobot pipeline, supporting efficient loading of image pairs and state vectors.
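
A sketch of how such a dataset is typically loaded with LeRobot; the import path reflects the `LeRobotDataset` class in recent versions of the library and may change between releases:

```python
from torch.utils.data import DataLoader
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Pull the fine-tuning data from the Hub in LeRobot format.
dataset = LeRobotDataset("VLA-Arena/VLA_Arena_L0_L_lerobot_smolvla")
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

batch = next(iter(loader))
print(batch.keys())  # image, state, and action keys defined by the dataset
```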
### Hyperparameters
Training follows an "Expert Only" strategy: the policy head is optimized while the vision encoder stays frozen.
| Parameter | Value |
|---|---|
| Max Training Steps | 100,000 |
| Total Batch Size | 64 |
| Optimizer | AdamW |
| Learning Rate ($\eta$) | $1.0 \times 10^{-4}$ (Peak) |
| Weight Decay | $1.0 \times 10^{-10}$ |
| Gradient Clip Norm | 10.0 |
| Image Normalization | ImageNet statistics (enabled) |
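
These values map directly onto a standard PyTorch setup. A sketch, where `trainable_params` would come from the freezing step shown earlier:

```python
import torch

def make_optimizer(trainable_params):
    """AdamW over the action expert only, per the table above."""
    return torch.optim.AdamW(trainable_params, lr=1.0e-4, weight_decay=1.0e-10)

def clipped_step(optimizer, trainable_params, loss):
    """One update with the gradient-norm clip of 10.0 from the table."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(trainable_params, max_norm=10.0)
    optimizer.step()
```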
### Scheduler Configuration
A cosine decay schedule stabilizes training: a short warm-up phase is followed by a gradual decay to the minimum learning rate.
| Parameter | Value |
|---|---|
| Warmup Steps | 1,000 |
| Decay Steps | 30,000 |
| Min Learning Rate | $2.5 \times 10^{-6}$ |
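
Combined with the peak learning rate above, the schedule can be written in closed form. A sketch assuming a linear warm-up (the warm-up shape is not specified in the configuration):

```python
import math

def learning_rate(step: int,
                  peak_lr: float = 1.0e-4,
                  min_lr: float = 2.5e-6,
                  warmup_steps: int = 1_000,
                  decay_steps: int = 30_000) -> float:
    """Warm up to peak_lr, then cosine-decay to min_lr and hold."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # assumed linear warm-up
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```

Note that with 100,000 total training steps, the learning rate reaches the minimum after 31,000 steps and holds there for the remainder of training.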
### Input/Output Configuration
| Parameter | Value |
|---|---|
| Observation Steps ($N_{obs}$) | 1 |
| Action Steps ($N_{act}$) | 50 |
| State Input Shape | $[8]$ |
| Action Output Shape | $[7]$ |
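
A minimal stand-in illustrating the I/O contract above; `DummyPolicy` is purely illustrative and not the actual SmolVLA class or forward signature:

```python
import torch
from torch import nn

class DummyPolicy(nn.Module):
    """Stand-in with the same I/O contract as the fine-tuned policy."""
    def forward(self, images: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        batch = images.shape[0]
        return torch.zeros(batch, 50, 7)  # one chunk of 50 seven-DoF actions

policy = DummyPolicy()
images = torch.zeros(1, 2, 3, 256, 256)  # two RGB views at 256x256
state = torch.zeros(1, 8)                # 8-dim state vector
assert policy(images, state).shape == (1, 50, 7)
```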
## Evaluation & Usage
This model is designed to be evaluated within the VLA-Arena benchmark ecosystem. It has been tested across 11 specialized suites with difficulty levels ranging from L0 (Basic) to L2 (Advanced).
For detailed evaluation instructions, metrics, and scripts, please refer to the VLA-Arena repository.