GR00T Locomotion Policy Architecture

Model: GR00T-WholeBodyControl-Walk
Source: NVIDIA GR00T Whole Body Control
Total Parameters: 470,578

Table of Contents

  1. Overview
  2. Input Space (Observation)
  3. Output Space (Action)
  4. Network Architecture
  5. Forward Pass Data Flow
  6. Comparison with Unitree Policy
  7. Training Considerations

Overview

The GR00T policy is an asymmetric actor-critic architecture designed for bipedal locomotion with whole-body control. It uses:

  • A temporal observation stack (6 frames × 86D = 516D)
  • An estimator network (encoder) that compresses the full stacked observation into a compact latent space
  • An actor network (decoder) that generates actions from the latent representation plus the most recent observation frame

Key Features

  • Temporal context: 6-frame history for stability and dynamics understanding
  • Privileged learning: The estimator encodes the full 29-DOF state, while the actor commands only 15 DOF (legs + waist)
  • Height & orientation commands: Explicit control over base height and orientation
  • ELU activations: For smoother gradients compared to ReLU

Input Space (Observation)

Single Frame (86D)

One observation frame contains:

[0:3]     velocity_commands (3D)     - [vx, vy, yaw_rate] desired velocities
[3]       height_command (1D)        - desired base height
[4:7]     orientation_command (3D)   - [roll, pitch, yaw] desired orientation
[7:10]    angular_velocity (3D)      - IMU angular velocity (scaled by 0.5)
[10:13]   gravity_vector (3D)        - projected gravity direction
[13:42]   joint_positions (29D)      - all joint angles relative to default pose
[42:71]   joint_velocities (29D)     - all joint velocities
[71:86]   previous_actions (15D)     - actions from previous timestep
---
Total: 86D per frame
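
For concreteness, here is a minimal sketch of how a single frame could be unpacked into named slices. The slice boundaries come directly from the layout above; the function name and dictionary keys are illustrative, not part of the released interface.

import torch

def split_frame(frame: torch.Tensor) -> dict:
    """Unpack one 86D observation frame into named slices (layout as above)."""
    assert frame.shape[-1] == 86
    return {
        "velocity_commands":   frame[..., 0:3],    # [vx, vy, yaw_rate]
        "height_command":      frame[..., 3:4],    # desired base height
        "orientation_command": frame[..., 4:7],    # [roll, pitch, yaw]
        "angular_velocity":    frame[..., 7:10],   # IMU gyro, already scaled by 0.5
        "gravity_vector":      frame[..., 10:13],  # projected gravity direction
        "joint_positions":     frame[..., 13:42],  # 29 joints, relative to default pose
        "joint_velocities":    frame[..., 42:71],  # 29 joints
        "previous_actions":    frame[..., 71:86],  # 15D action from the previous step
    }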

Stacked Observation (516D)

The policy receives 6 consecutive frames stacked together:

obs_516d = [frame_t-5, frame_t-4, frame_t-3, frame_t-2, frame_t-1, frame_t]
         = 6 × 86D = 516D

Temporal reasoning: The history allows the policy to:

  • Infer velocities and accelerations
  • Predict contact events and transitions
  • Maintain gait stability across steps
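
A minimal sketch of how the 6-frame stack could be maintained at inference time, assuming the history is seeded with copies of the first frame after a reset (the training code may zero-initialize instead):

from collections import deque
import torch

HISTORY_LEN, FRAME_DIM = 6, 86
history = deque(maxlen=HISTORY_LEN)  # rolling window of the last 6 frames

def stack_observation(new_frame: torch.Tensor) -> torch.Tensor:
    """Append the newest 86D frame and return the 516D stacked observation,
    ordered oldest -> newest so the last 86 values are the most recent frame."""
    assert new_frame.shape[-1] == FRAME_DIM
    if not history:
        history.extend([new_frame] * HISTORY_LEN)  # seed on first call / after reset
    else:
        history.append(new_frame)                  # automatically drops the oldest frame
    return torch.cat(list(history), dim=-1)        # (516,)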

Observation Scaling

# From Unitree training config
ang_vel_scale = 0.5          # Angular velocity scaling
dof_pos_scale = 1.0          # Joint position scaling  
dof_vel_scale = 0.05         # Joint velocity scaling
cmd_scale = [2.0, 2.0, 0.5]  # [vx, vy, yaw] command scaling
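
Putting the scales together, one frame might be assembled as sketched below. The scale values come from the config above; the raw_* argument names and the choice to pass the height and orientation commands through unscaled are assumptions for illustration.

import torch

ang_vel_scale, dof_pos_scale, dof_vel_scale = 0.5, 1.0, 0.05
cmd_scale = torch.tensor([2.0, 2.0, 0.5])

def build_frame(raw_cmd_vel, height_cmd, orient_cmd, raw_ang_vel,
                projected_gravity, raw_qpos, raw_qvel, prev_action,
                default_angles):
    """Assemble one scaled 86D frame following the layout in the previous section."""
    return torch.cat([
        raw_cmd_vel * cmd_scale,                      # 3D velocity commands
        height_cmd,                                   # 1D height command
        orient_cmd,                                   # 3D orientation command
        raw_ang_vel * ang_vel_scale,                  # 3D IMU angular velocity
        projected_gravity,                            # 3D gravity direction
        (raw_qpos - default_angles) * dof_pos_scale,  # 29D joint positions (relative)
        raw_qvel * dof_vel_scale,                     # 29D joint velocities
        prev_action,                                  # 15D previous action
    ], dim=-1)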

Default Joint Positions (Standing Pose)

default_angles = [
    # Left leg (6D)
    -0.1,  0.0,  0.0,  0.3, -0.2, 0.0,  # hip_pitch, hip_roll, hip_yaw, knee, ankle_pitch, ankle_roll
    
    # Right leg (6D)  
    -0.1,  0.0,  0.0,  0.3, -0.2, 0.0,
    
    # Waist (3D)
    0.0, 0.0, 0.0,  # yaw, roll, pitch
    
    # Left arm (7D)
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,  # shoulder_pitch/roll/yaw, elbow, wrist_roll/pitch/yaw
    
    # Right arm (7D)
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
]

Output Space (Action)

Action Vector (15D)

[0:6]     left_leg_deltas (6D)      - position offsets for left leg joints
[6:12]    right_leg_deltas (6D)     - position offsets for right leg joints
[12:15]   waist_deltas (3D)         - position offsets for waist joints
---
Total: 15D

Action Scaling & Application

import numpy as np

action_scale = 0.25  # Action multiplier

# Target joint positions for the 15 controlled joints (numpy arrays assumed)
target_pos = np.asarray(default_angles[:15]) + action * action_scale

Controlled DOFs

The policy controls only the lower body:

  • 12 leg joints: Full leg control for locomotion
  • 3 waist joints: Yaw, roll, pitch for balance

The arms (14 DOF) are NOT controlled by the locomotion policy and can be:

  • Held at fixed positions, e.g., a slight forward/outward pose (see the sketch after this list)
  • Controlled by a separate arm policy
  • Used for manipulation while walking
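
For the fixed-arm case, a hedged sketch of how the 15D action could be expanded into a full 29-joint position target, assuming the joint ordering of default_angles above and a downstream controller that tracks absolute joint positions:

import numpy as np

ACTION_SCALE = 0.25

def full_joint_targets(action_15d: np.ndarray, default_angles: np.ndarray) -> np.ndarray:
    """Build a 29D position target: legs + waist from the policy, arms held at defaults."""
    targets = default_angles.copy()            # start from the standing pose (29D)
    targets[:15] += ACTION_SCALE * action_15d  # apply policy deltas to legs + waist
    # targets[15:29] remain at the default arm pose (or come from a separate arm policy)
    return targets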

Network Architecture

Overview Diagram

Input (516D)
    │
    ├──────────────────────────────────────────────┐
    │                                              │
    v                                              v
┌─────────────────┐                        ┌────────────────┐
│    ESTIMATOR    │                        │  Input Slice   │
│    (Encoder)    │                        │   (Last 86D)   │
└─────────────────┘                        └────────────────┘
    │                                              │
    │ (35D latent)                                 │
    │                                              │
    ├──► [:3] (3D) ────────────────────────┐       │
    │                                      │       │
    └──► [3:] (32D) ──► Normalize ──┐      │       │
                          (32D)     v      v       v
                                 Concat (121D = 86 + 3 + 32)
                                              │
                                              v
                                      ┌─────────────┐
                                      │    ACTOR    │
                                      │  (Decoder)  │
                                      └─────────────┘
                                              │
                                              v
                                        Output (15D)

1. Estimator Network (Encoder)

Purpose: Encode full 516D observation into a compact 35D latent representation.

Architecture

Estimator(
    # Layer 1: Initial compression
    Linear(516, 256),      # Weight: [256, 516] = 132,096 params
    ELU(),                 # Smooth activation
    
    # Layer 2: Feature refinement  
    Linear(256, 256),      # Weight: [256, 256] = 65,536 params
    ELU(),
    
    # Layer 3: Final encoding
    Linear(256, 35),       # Weight: [35, 256] = 8,960 params
)

Layer Details

Layer 1: Linear(516 → 256) + ELU()

  • Input: 516D stacked observations (6 frames × 86D)
  • Output: 256D intermediate features
  • Parameters: 132,096 + 256 (bias) = 132,352
  • Purpose: Initial dimensionality reduction while preserving temporal information

Layer 2: Linear(256 → 256) + ELU()

  • Input: 256D intermediate features
  • Output: 256D refined features
  • Parameters: 65,536 + 256 (bias) = 65,792
  • Purpose: Feature refinement and non-linear transformation

Layer 3: Linear(256 → 35)

  • Input: 256D refined features
  • Output: 35D latent encoding
  • Parameters: 8,960 + 35 (bias) = 8,995
  • Purpose: Final compression to latent space

Total Estimator Parameters: 207,139
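
As a cross-check, the layer spec above can be written as a plain nn.Sequential. This is a sketch: the module layout and names will not match the exported checkpoint exactly.

import torch.nn as nn

estimator = nn.Sequential(
    nn.Linear(516, 256), nn.ELU(),   # Layer 1: 132,352 params
    nn.Linear(256, 256), nn.ELU(),   # Layer 2: 65,792 params
    nn.Linear(256, 35),              # Layer 3: 8,995 params
)

assert sum(p.numel() for p in estimator.parameters()) == 207_139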

Latent Space Breakdown (35D)

The 35D latent is split into two components:

  1. First 3D (indices 0-2): Direct features (not normalized)
  2. Last 32D (indices 3-34): Dynamics features (L2-normalized for stability)

Normalization:

latent_first = latent[:, :3]     # (batch, 3)
latent_rest = latent[:, 3:]      # (batch, 32)

# L2 normalize the rest
latent_rest_norm = torch.clamp(
    torch.norm(latent_rest, p=2, dim=-1, keepdim=True),
    min=1e-8
)
latent_rest_normalized = latent_rest / latent_rest_norm  # (batch, 32)
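
For reference, the clamp-and-divide above matches PyTorch's built-in L2 normalization, which divides by max(||x||_2, eps); whether the released code actually calls it this way is not specified here.

import torch.nn.functional as F

# Equivalent formulation: divide by max(||x||_2, eps)
latent_rest_normalized = F.normalize(latent_rest, p=2, dim=-1, eps=1e-8)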

2. Actor Network (Decoder)

Purpose: Decode latent representation + recent observation into motor actions.

Actor Input Construction (121D)

actor_input = torch.cat([
    obs_recent,             # Last 86D from input (most recent frame)
    latent_first,           # First 3D from estimator (direct features)
    latent_rest_normalized  # Last 32D from estimator (normalized dynamics)
], dim=-1)

# Total: 86D + 3D + 32D = 121D

Architecture

Actor(
    # Layer 1: Initial expansion
    Linear(121, 512),      # Weight: [512, 121] = 61,952 params
    ELU(),
    
    # Layer 2: Feature processing
    Linear(512, 256),      # Weight: [256, 512] = 131,072 params
    ELU(),
    
    # Layer 3: Feature refinement
    Linear(256, 256),      # Weight: [256, 256] = 65,536 params
    ELU(),
    
    # Layer 4: Action head
    Linear(256, 15),       # Weight: [15, 256] = 3,840 params
)

Layer Details

Layer 1: Linear(121 → 512) + ELU()

  • Input: 121D (recent obs + latent)
  • Output: 512D expanded features
  • Parameters: 61,952 + 512 (bias) = 62,464
  • Purpose: Expand compressed representation for action generation

Layer 2: Linear(512 → 256) + ELU()

  • Input: 512D expanded features
  • Output: 256D intermediate features
  • Parameters: 131,072 + 256 (bias) = 131,328
  • Purpose: Process features with large capacity

Layer 3: Linear(256 → 256) + ELU()

  • Input: 256D intermediate features
  • Output: 256D refined features
  • Parameters: 65,536 + 256 (bias) = 65,792
  • Purpose: Additional refinement layer for complex motor patterns

Layer 4: Linear(256 → 15)

  • Input: 256D refined features
  • Output: 15D actions (leg + waist deltas)
  • Parameters: 3,840 + 15 (bias) = 3,855
  • Purpose: Final action prediction layer

Total Actor Parameters: 263,439
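
As with the estimator, the actor layer spec can be cross-checked as a plain nn.Sequential sketch (module names illustrative, not the checkpoint's):

import torch.nn as nn

actor = nn.Sequential(
    nn.Linear(121, 512), nn.ELU(),   # Layer 1: 62,464 params
    nn.Linear(512, 256), nn.ELU(),   # Layer 2: 131,328 params
    nn.Linear(256, 256), nn.ELU(),   # Layer 3: 65,792 params
    nn.Linear(256, 15),              # Layer 4: 3,855 params
)

assert sum(p.numel() for p in actor.parameters()) == 263_439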


Forward Pass Data Flow

Complete Forward Pass

def forward(obs_516d):
    """
    Args:
        obs_516d: (batch, 516) stacked observations
        
    Returns:
        actions: (batch, 15) joint position deltas
    """
    # 1. Encode full observation
    latent = estimator(obs_516d)  # (batch, 35)
    
    # 2. Split and normalize latent
    latent_first = latent[:, :3]              # (batch, 3)
    latent_rest = latent[:, 3:]               # (batch, 32)
    
    latent_rest_norm = torch.clamp(
        torch.norm(latent_rest, p=2, dim=-1, keepdim=True),
        min=1e-8
    )
    latent_rest_normalized = latent_rest / latent_rest_norm  # (batch, 32)
    
    # 3. Extract recent observation
    obs_recent = obs_516d[:, -86:]  # Last frame (batch, 86)
    
    # 4. Concatenate actor input
    actor_input = torch.cat([
        obs_recent,              # 86D
        latent_first,            # 3D
        latent_rest_normalized   # 32D
    ], dim=-1)  # Total: 121D
    
    # 5. Generate actions
    actions = actor(actor_input)  # (batch, 15)
    
    return actions
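
A quick end-to-end sanity check, reusing the estimator and actor sketches defined earlier (hypothetical module names, not the released API):

import torch

total_params = sum(p.numel() for m in (estimator, actor) for p in m.parameters())
assert total_params == 470_578          # matches the reported total

obs = torch.randn(1, 516)               # one stacked observation (6 frames × 86D)
with torch.no_grad():
    actions = forward(obs)              # forward() as defined above
assert actions.shape == (1, 15)         # leg + waist position deltas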

Computational Cost

Forward pass operations:
1. Estimator: 516 → 256 → 256 → 35     (~207K params)
2. Normalization: L2 norm on 32D       (~32 ops)
3. Concatenation: Build 121D input     (memory copy)
4. Actor: 121 → 512 → 256 → 256 → 15   (~263K params)

Total: ~470K multiply-adds per inference

References

  1. GR00T Project: https://github.com/NVlabs/gr00t_wbc
  2. Paper: "GR00T: Whole-Body Control for Humanoid Robots" (NVIDIA Research)
  3. Unitree SDK: https://github.com/unitreerobotics/unitree_sdk2_python

License

The GR00T model is provided by NVIDIA under the NVIDIA Open Model License.
See NVIDIA Open Model License in the model directory for details.


Last Updated: 2025-01-20
