GR00T Locomotion Policy Architecture

Model: GR00T-WholeBodyControl-Walk
Source: NVIDIA GR00T Whole Body Control
Total Parameters: 470,578

Table of Contents

  1. Overview
  2. Input Space (Observation)
  3. Output Space (Action)
  4. Network Architecture
  5. Forward Pass Data Flow
  6. Comparison with Unitree Policy
  7. Training Considerations

Overview

The GR00T policy is an asymmetric actor-critic architecture designed for bipedal locomotion with whole-body control. It uses:

  • A temporal observation stack (6 frames × 86D = 516D)
  • An estimator network (encoder) that compresses the full stacked observation into a compact latent space
  • An actor network (decoder) that generates actions from the latent representation plus the most recent observation frame

Key Features

  • Temporal context: 6-frame history for stability and dynamics understanding
  • Privileged learning: The estimator encodes the full 29-DOF state, while the actor commands only 15 DOF (legs + waist)
  • Height & orientation commands: Explicit control over base height and orientation
  • ELU activations: For smoother gradients compared to ReLU

Input Space (Observation)

Single Frame (86D)

One observation frame contains:

[0:3]     velocity_commands (3D)     - [vx, vy, yaw_rate] desired velocities
[3]       height_command (1D)        - desired base height
[4:7]     orientation_command (3D)   - [roll, pitch, yaw] desired orientation
[7:10]    angular_velocity (3D)      - IMU angular velocity (scaled by 0.5)
[10:13]   gravity_vector (3D)        - projected gravity direction
[13:42]   joint_positions (29D)      - all joint angles relative to default pose
[42:71]   joint_velocities (29D)     - all joint velocities
[71:86]   previous_actions (15D)     - actions from previous timestep
---
Total: 86D per frame
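
For concreteness, here is a minimal sketch of how a single frame could be unpacked into named slices. The slice boundaries come directly from the layout above; the function name and dictionary keys are illustrative, not part of the released interface.

import torch

def split_frame(frame: torch.Tensor) -> dict:
    """Unpack one 86D observation frame into named slices (layout as above)."""
    assert frame.shape[-1] == 86
    return {
        "velocity_commands":   frame[..., 0:3],    # [vx, vy, yaw_rate]
        "height_command":      frame[..., 3:4],    # desired base height
        "orientation_command": frame[..., 4:7],    # [roll, pitch, yaw]
        "angular_velocity":    frame[..., 7:10],   # IMU gyro, already scaled by 0.5
        "gravity_vector":      frame[..., 10:13],  # projected gravity direction
        "joint_positions":     frame[..., 13:42],  # 29 joints, relative to default pose
        "joint_velocities":    frame[..., 42:71],  # 29 joints
        "previous_actions":    frame[..., 71:86],  # 15D action from the previous step
    }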

Stacked Observation (516D)

The policy receives 6 consecutive frames stacked together:

obs_516d = [frame_t-5, frame_t-4, frame_t-3, frame_t-2, frame_t-1, frame_t]
         = 6 × 86D = 516D

Temporal reasoning: The history allows the policy to:

  • Infer velocities and accelerations
  • Predict contact events and transitions
  • Maintain gait stability across steps
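
A minimal sketch of how the 6-frame stack could be maintained at inference time, assuming the history is seeded with copies of the first frame after a reset (the training code may zero-initialize instead):

from collections import deque
import torch

HISTORY_LEN, FRAME_DIM = 6, 86
history = deque(maxlen=HISTORY_LEN)  # rolling window of the last 6 frames

def stack_observation(new_frame: torch.Tensor) -> torch.Tensor:
    """Append the newest 86D frame and return the 516D stacked observation,
    ordered oldest -> newest so the last 86 values are the most recent frame."""
    assert new_frame.shape[-1] == FRAME_DIM
    if not history:
        history.extend([new_frame] * HISTORY_LEN)  # seed on first call / after reset
    else:
        history.append(new_frame)                  # automatically drops the oldest frame
    return torch.cat(list(history), dim=-1)        # (516,)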

Observation Scaling

# From Unitree training config
ang_vel_scale = 0.5          # Angular velocity scaling
dof_pos_scale = 1.0          # Joint position scaling  
dof_vel_scale = 0.05         # Joint velocity scaling
cmd_scale = [2.0, 2.0, 0.5]  # [vx, vy, yaw] command scaling
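
Putting the scales together, one frame might be assembled as sketched below. The scale values come from the config above; the raw_* argument names and the choice to pass the height and orientation commands through unscaled are assumptions for illustration.

import torch

ang_vel_scale, dof_pos_scale, dof_vel_scale = 0.5, 1.0, 0.05
cmd_scale = torch.tensor([2.0, 2.0, 0.5])

def build_frame(raw_cmd_vel, height_cmd, orient_cmd, raw_ang_vel,
                projected_gravity, raw_qpos, raw_qvel, prev_action,
                default_angles):
    """Assemble one scaled 86D frame following the layout in the previous section."""
    return torch.cat([
        raw_cmd_vel * cmd_scale,                      # 3D velocity commands
        height_cmd,                                   # 1D height command
        orient_cmd,                                   # 3D orientation command
        raw_ang_vel * ang_vel_scale,                  # 3D IMU angular velocity
        projected_gravity,                            # 3D gravity direction
        (raw_qpos - default_angles) * dof_pos_scale,  # 29D joint positions (relative)
        raw_qvel * dof_vel_scale,                     # 29D joint velocities
        prev_action,                                  # 15D previous action
    ], dim=-1)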

Default Joint Positions (Standing Pose)

default_angles = [
    # Left leg (6D)
    -0.1,  0.0,  0.0,  0.3, -0.2, 0.0,  # hip_pitch, hip_roll, hip_yaw, knee, ankle_pitch, ankle_roll
    
    # Right leg (6D)  
    -0.1,  0.0,  0.0,  0.3, -0.2, 0.0,
    
    # Waist (3D)
    0.0, 0.0, 0.0,  # yaw, roll, pitch
    
    # Left arm (7D)
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,  # shoulder_pitch/roll/yaw, elbow, wrist_roll/pitch/yaw
    
    # Right arm (7D)
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
]

Output Space (Action)

Action Vector (15D)

[0:6]     left_leg_deltas (6D)      - position offsets for left leg joints
[6:12]    right_leg_deltas (6D)     - position offsets for right leg joints
[12:15]   waist_deltas (3D)         - position offsets for waist joints
---
Total: 15D

Action Scaling & Application

import numpy as np

action_scale = 0.25  # Action multiplier

# Target joint positions for the 15 controlled joints (numpy arrays assumed)
target_pos = np.asarray(default_angles[:15]) + action * action_scale

Controlled DOFs

The policy controls only the lower body:

  • 12 leg joints: Full leg control for locomotion
  • 3 waist joints: Yaw, roll, pitch for balance

The arms (14 DOF) are NOT controlled by the locomotion policy and can be:

  • Held at fixed positions, e.g., a slight forward/outward pose (see the sketch after this list)
  • Controlled by a separate arm policy
  • Used for manipulation while walking
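
For the fixed-arm case, a hedged sketch of how the 15D action could be expanded into a full 29-joint position target, assuming the joint ordering of default_angles above and a downstream controller that tracks absolute joint positions:

import numpy as np

ACTION_SCALE = 0.25

def full_joint_targets(action_15d: np.ndarray, default_angles: np.ndarray) -> np.ndarray:
    """Build a 29D position target: legs + waist from the policy, arms held at defaults."""
    targets = default_angles.copy()            # start from the standing pose (29D)
    targets[:15] += ACTION_SCALE * action_15d  # apply policy deltas to legs + waist
    # targets[15:29] remain at the default arm pose (or come from a separate arm policy)
    return targets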

Network Architecture

Overview Diagram

Input (516D)
    │
    ├──────────────────────────────────────────────┐
    │                                              │
    v                                              v
┌─────────────────┐                        ┌────────────────┐
│    ESTIMATOR    │                        │  Input Slice   │
│    (Encoder)    │                        │   (Last 86D)   │
└─────────────────┘                        └────────────────┘
    │                                              │
    │ (35D latent)                                 │
    │                                              │
    ├──► [:3] (3D) ────────────────────────┐       │
    │                                      │       │
    └──► [3:] (32D) ──► Normalize ──┐      │       │
                          (32D)     v      v       v
                                 Concat (121D = 86 + 3 + 32)
                                              │
                                              v
                                      ┌─────────────┐
                                      │    ACTOR    │
                                      │  (Decoder)  │
                                      └─────────────┘
                                              │
                                              v
                                        Output (15D)

1. Estimator Network (Encoder)

Purpose: Encode full 516D observation into a compact 35D latent representation.

Architecture

Estimator(
    # Layer 1: Initial compression
    Linear(516, 256),      # Weight: [256, 516] = 132,096 params
    ELU(),                 # Smooth activation
    
    # Layer 2: Feature refinement  
    Linear(256, 256),      # Weight: [256, 256] = 65,536 params
    ELU(),
    
    # Layer 3: Final encoding
    Linear(256, 35),       # Weight: [35, 256] = 8,960 params
)

Layer Details

Layer 1: Linear(516 → 256) + ELU()

  • Input: 516D stacked observations (6 frames × 86D)
  • Output: 256D intermediate features
  • Parameters: 132,096 + 256 (bias) = 132,352
  • Purpose: Initial dimensionality reduction while preserving temporal information

Layer 2: Linear(256 → 256) + ELU()

  • Input: 256D intermediate features
  • Output: 256D refined features
  • Parameters: 65,536 + 256 (bias) = 65,792
  • Purpose: Feature refinement and non-linear transformation

Layer 3: Linear(256 → 35)

  • Input: 256D refined features
  • Output: 35D latent encoding
  • Parameters: 8,960 + 35 (bias) = 8,995
  • Purpose: Final compression to latent space

Total Estimator Parameters: 207,139
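
As a cross-check, the layer spec above can be written as a plain nn.Sequential. This is a sketch: the module layout and names will not match the exported checkpoint exactly.

import torch.nn as nn

estimator = nn.Sequential(
    nn.Linear(516, 256), nn.ELU(),   # Layer 1: 132,352 params
    nn.Linear(256, 256), nn.ELU(),   # Layer 2: 65,792 params
    nn.Linear(256, 35),              # Layer 3: 8,995 params
)

assert sum(p.numel() for p in estimator.parameters()) == 207_139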

Latent Space Breakdown (35D)

The 35D latent is split into two components:

  1. First 3D (indices 0-2): Direct features (not normalized)
  2. Last 32D (indices 3-34): Dynamics features (L2-normalized for stability)

Normalization:

latent_first = latent[:, :3]     # (batch, 3)
latent_rest = latent[:, 3:]      # (batch, 32)

# L2 normalize the rest
latent_rest_norm = torch.clamp(
    torch.norm(latent_rest, p=2, dim=-1, keepdim=True),
    min=1e-8
)
latent_rest_normalized = latent_rest / latent_rest_norm  # (batch, 32)
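
For reference, the clamp-and-divide above matches PyTorch's built-in L2 normalization, which divides by max(||x||_2, eps); whether the released code actually calls it this way is not specified here.

import torch.nn.functional as F

# Equivalent formulation: divide by max(||x||_2, eps)
latent_rest_normalized = F.normalize(latent_rest, p=2, dim=-1, eps=1e-8)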

2. Actor Network (Decoder)

Purpose: Decode latent representation + recent observation into motor actions.

Actor Input Construction (121D)

actor_input = torch.cat([
    obs_recent,             # Last 86D from input (most recent frame)
    latent_first,           # First 3D from estimator (direct features)
    latent_rest_normalized  # Last 32D from estimator (normalized dynamics)
], dim=-1)

# Total: 86D + 3D + 32D = 121D

Architecture

Actor(
    # Layer 1: Initial expansion
    Linear(121, 512),      # Weight: [512, 121] = 61,952 params
    ELU(),
    
    # Layer 2: Feature processing
    Linear(512, 256),      # Weight: [256, 512] = 131,072 params
    ELU(),
    
    # Layer 3: Feature refinement
    Linear(256, 256),      # Weight: [256, 256] = 65,536 params
    ELU(),
    
    # Layer 4: Action head
    Linear(256, 15),       # Weight: [15, 256] = 3,840 params
)

Layer Details

Layer 1: Linear(121 → 512) + ELU()

  • Input: 121D (recent obs + latent)
  • Output: 512D expanded features
  • Parameters: 61,952 + 512 (bias) = 62,464
  • Purpose: Expand compressed representation for action generation

Layer 2: Linear(512 → 256) + ELU()

  • Input: 512D expanded features
  • Output: 256D intermediate features
  • Parameters: 131,072 + 256 (bias) = 131,328
  • Purpose: Process features with large capacity

Layer 3: Linear(256 → 256) + ELU()

  • Input: 256D intermediate features
  • Output: 256D refined features
  • Parameters: 65,536 + 256 (bias) = 65,792
  • Purpose: Additional refinement layer for complex motor patterns

Layer 4: Linear(256 → 15)

  • Input: 256D refined features
  • Output: 15D actions (leg + waist deltas)
  • Parameters: 3,840 + 15 (bias) = 3,855
  • Purpose: Final action prediction layer

Total Actor Parameters: 263,439
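
As with the estimator, the actor layer spec can be cross-checked as a plain nn.Sequential sketch (module names illustrative, not the checkpoint's):

import torch.nn as nn

actor = nn.Sequential(
    nn.Linear(121, 512), nn.ELU(),   # Layer 1: 62,464 params
    nn.Linear(512, 256), nn.ELU(),   # Layer 2: 131,328 params
    nn.Linear(256, 256), nn.ELU(),   # Layer 3: 65,792 params
    nn.Linear(256, 15),              # Layer 4: 3,855 params
)

assert sum(p.numel() for p in actor.parameters()) == 263_439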


Forward Pass Data Flow

Complete Forward Pass

def forward(obs_516d):
    """
    Args:
        obs_516d: (batch, 516) stacked observations
        
    Returns:
        actions: (batch, 15) joint position deltas
    """
    # 1. Encode full observation
    latent = estimator(obs_516d)  # (batch, 35)
    
    # 2. Split and normalize latent
    latent_first = latent[:, :3]              # (batch, 3)
    latent_rest = latent[:, 3:]               # (batch, 32)
    
    latent_rest_norm = torch.clamp(
        torch.norm(latent_rest, p=2, dim=-1, keepdim=True),
        min=1e-8
    )
    latent_rest_normalized = latent_rest / latent_rest_norm  # (batch, 32)
    
    # 3. Extract recent observation
    obs_recent = obs_516d[:, -86:]  # Last frame (batch, 86)
    
    # 4. Concatenate actor input
    actor_input = torch.cat([
        obs_recent,              # 86D
        latent_first,            # 3D
        latent_rest_normalized   # 32D
    ], dim=-1)  # Total: 121D
    
    # 5. Generate actions
    actions = actor(actor_input)  # (batch, 15)
    
    return actions
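
A quick end-to-end sanity check, reusing the estimator and actor sketches defined earlier (hypothetical module names, not the released API):

import torch

total_params = sum(p.numel() for m in (estimator, actor) for p in m.parameters())
assert total_params == 470_578          # matches the reported total

obs = torch.randn(1, 516)               # one stacked observation (6 frames × 86D)
with torch.no_grad():
    actions = forward(obs)              # forward() as defined above
assert actions.shape == (1, 15)         # leg + waist position deltas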

Computational Cost

Forward pass operations:
1. Estimator: 516 → 256 → 256 → 35     (~207K params)
2. Normalization: L2 norm on 32D       (~32 ops)
3. Concatenation: Build 121D input     (memory copy)
4. Actor: 121 → 512 → 256 → 256 → 15   (~263K params)

Total: ~470K multiply-adds per inference

References

  1. GR00T Project: https://github.com/NVlabs/gr00t_wbc
  2. Paper: "GR00T: Whole-Body Control for Humanoid Robots" (NVIDIA Research)
  3. Unitree SDK: https://github.com/unitreerobotics/unitree_sdk2_python

License

The GR00T model is provided by NVIDIA under the NVIDIA Open Model License.
See NVIDIA Open Model License in the model directory for details.


Last Updated: 2025-01-20
