GR00T Locomotion Policy Architecture
Model: GR00T-WholeBodyControl-Walk
Source: NVIDIA GR00T Whole Body Control
Total Parameters: 470,578
Table of Contents
- Overview
- Input Space (Observation)
- Output Space (Action)
- Network Architecture
- Forward Pass Data Flow
- Comparison with Unitree Policy
- Training Considerations
Overview
The GR00T policy is an asymmetric actor-critic architecture designed for bipedal locomotion with whole-body control. It uses:
- A temporal observation stack (6 frames × 86D = 516D)
- An estimator network (encoder) that processes full observations into a compressed latent space
- An actor network (decoder) that generates actions from the latent representation + recent observations
Key Features
- Temporal context: 6-frame history for stability and dynamics understanding
- Privileged learning: the estimator sees all 29 DOF, while the actor controls only 15 DOF (legs + waist)
- Height & orientation commands: Explicit control over base height and orientation
- ELU activations: For smoother gradients compared to ReLU
Input Space (Observation)
Single Frame (86D)
One observation frame contains:
[0:3] velocity_commands (3D) - [vx, vy, yaw_rate] desired velocities
[3] height_command (1D) - desired base height
[4:7] orientation_command (3D) - [roll, pitch, yaw] desired orientation
[7:10] angular_velocity (3D) - IMU angular velocity (scaled by 0.5)
[10:13] gravity_vector (3D) - projected gravity direction
[13:42] joint_positions (29D) - all joint angles relative to default pose
[42:71] joint_velocities (29D) - all joint velocities
[71:86] previous_actions (15D) - actions from previous timestep
---
Total: 86D per frame
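To make the layout concrete, here is a minimal sketch that assembles one 86D frame from hypothetical state and command variables (the variable names and placeholder values are assumptions, not the project's API):

```python
import numpy as np

# Hypothetical state/command variables -- names and values are illustrative only.
vel_cmd    = np.zeros(3)                 # [vx, vy, yaw_rate]
height_cmd = np.array([0.78])            # desired base height (placeholder value)
orient_cmd = np.zeros(3)                 # [roll, pitch, yaw]
ang_vel    = np.zeros(3)                 # IMU angular velocity (already scaled by 0.5)
gravity    = np.array([0.0, 0.0, -1.0])  # projected gravity direction
q_rel      = np.zeros(29)                # joint positions relative to default pose
dq         = np.zeros(29)                # joint velocities (already scaled by 0.05)
prev_act   = np.zeros(15)                # previous 15D action

frame = np.concatenate([vel_cmd, height_cmd, orient_cmd, ang_vel,
                        gravity, q_rel, dq, prev_act])
assert frame.shape == (86,)              # 3+1+3+3+3+29+29+15
```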
Stacked Observation (516D)
The policy receives 6 consecutive frames stacked together:
obs_516d = [frame_t-5, frame_t-4, frame_t-3, frame_t-2, frame_t-1, frame_t]
         = 6 × 86D = 516D
Temporal reasoning: The history allows the policy to:
- Infer velocities and accelerations
- Predict contact events and transitions
- Maintain gait stability across steps
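A minimal sketch of the frame stacking, assuming a simple deque-based history buffer (the buffer itself is an implementation assumption; only the 6 × 86D layout comes from the spec above):

```python
from collections import deque
import numpy as np

HISTORY_LEN, FRAME_DIM = 6, 86
history = deque([np.zeros(FRAME_DIM)] * HISTORY_LEN, maxlen=HISTORY_LEN)

def push_frame(frame):
    """Append the newest 86D frame and return the 516D stacked observation,
    ordered oldest-to-newest: [frame_t-5, ..., frame_t]."""
    history.append(frame)
    return np.concatenate(list(history))   # shape (516,)
```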
Observation Scaling
# From Unitree training config
ang_vel_scale = 0.5 # Angular velocity scaling
dof_pos_scale = 1.0 # Joint position scaling
dof_vel_scale = 0.05 # Joint velocity scaling
cmd_scale = [2.0, 2.0, 0.5] # [vx, vy, yaw] command scaling
Default Joint Positions (Standing Pose)
default_angles = [
    # Left leg (6D)
    -0.1, 0.0, 0.0, 0.3, -0.2, 0.0,       # hip_pitch, hip_roll, hip_yaw, knee, ankle_pitch, ankle_roll
    # Right leg (6D)
    -0.1, 0.0, 0.0, 0.3, -0.2, 0.0,
    # Waist (3D)
    0.0, 0.0, 0.0,                        # yaw, roll, pitch
    # Left arm (7D)
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,    # shoulder_pitch/roll/yaw, elbow, wrist_roll/pitch/yaw
    # Right arm (7D)
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
]
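Putting the scales and the default pose together, a hedged sketch of how raw robot state could map to the observation terms above (the raw_* names are illustrative; the scale constants and default_angles from above are assumed to be in scope):

```python
import numpy as np

default_q = np.asarray(default_angles)               # 29D default pose (above)

# Hypothetical raw robot state -- names are illustrative.
raw_q, raw_dq, raw_ang_vel = np.zeros(29), np.zeros(29), np.zeros(3)
raw_cmd = np.array([0.5, 0.0, 0.0])                  # desired [vx, vy, yaw_rate]

obs_joint_pos = (raw_q - default_q) * dof_pos_scale  # 29D, relative to default pose
obs_joint_vel = raw_dq * dof_vel_scale               # 29D
obs_ang_vel   = raw_ang_vel * ang_vel_scale          # 3D
obs_cmd       = raw_cmd * np.asarray(cmd_scale)      # 3D scaled velocity command
```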
Output Space (Action)
Action Vector (15D)
[0:6] left_leg_deltas (6D) - position offsets for left leg joints
[6:12] right_leg_deltas (6D) - position offsets for right leg joints
[12:15] waist_deltas (3D) - position offsets for waist joints
---
Total: 15D
Action Scaling & Application
action_scale = 0.25   # Action multiplier

# Target joint positions for the 15 controlled joints
# (element-wise; default_angles and action are treated as numpy arrays here)
target_pos = default_angles[:15] + action * action_scale
Controlled DOFs
The policy controls only the lower body:
- 12 leg joints: Full leg control for locomotion
- 3 waist joints: Yaw, roll, pitch for balance
The arms (14 DOF) are NOT controlled by the locomotion policy and can be:
- Held at fixed positions (e.g., slight forward/outward pose)
- Controlled by a separate arm policy
- Used for manipulation while walking
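In deployment this split means the 15D policy output has to be merged with a hold pose for the 14 arm joints before commanding all 29 actuators. A minimal sketch, assuming the arms are held at their default angles unless an explicit hold pose is given (function and variable names are hypothetical):

```python
import numpy as np

ACTION_SCALE = 0.25
default_q = np.asarray(default_angles)      # 29D default pose from above

def build_joint_targets(action_15d, arm_hold=None):
    """Return a 29D joint-position target: policy deltas on the first 15 joints
    (legs + waist), arms held at the default pose unless arm_hold (14D) is given."""
    target = default_q.copy()
    target[:15] += action_15d * ACTION_SCALE
    if arm_hold is not None:
        target[15:] = arm_hold
    return target
```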
Network Architecture
Overview Diagram
                Input (516D)
                     |
         +-----------+-----------+
         |                       |
         v                       v
 +---------------+      +-----------------+
 |   ESTIMATOR   |      |   Input Slice   |
 |   (Encoder)   |      |   (Last 86D)    |
 +---------------+      +-----------------+
         |                       |
    (35D latent)                 |
         |                       |
   +-----+-----+                 |
   |           |                 |
   v           v                 |
[:3] (3D)  [3:] (32D)            |
   |           |                 |
   |       Normalize             |
   |         (32D)               |
   |           |                 |
   +-----+-----+-----------------+
         |
         v
   Concat (121D)
         |
         v
 +---------------+
 |     ACTOR     |
 |   (Decoder)   |
 +---------------+
         |
         v
   Output (15D)
1. Estimator Network (Encoder)
Purpose: Encode full 516D observation into a compact 35D latent representation.
Architecture
Estimator(
    # Layer 1: Initial compression
    Linear(516, 256),   # Weight: [256, 516] = 132,096 params
    ELU(),              # Smooth activation

    # Layer 2: Feature refinement
    Linear(256, 256),   # Weight: [256, 256] = 65,536 params
    ELU(),

    # Layer 3: Final encoding
    Linear(256, 35),    # Weight: [35, 256] = 8,960 params
)
Layer Details
Layer 1: Linear(516 → 256) + ELU()
- Input: 516D stacked observations (6 frames × 86D)
- Output: 256D intermediate features
- Parameters: 132,096 + 256 (bias) = 132,352
- Purpose: Initial dimensionality reduction while preserving temporal information
Layer 2: Linear(256 → 256) + ELU()
- Input: 256D intermediate features
- Output: 256D refined features
- Parameters: 65,536 + 256 (bias) = 65,792
- Purpose: Feature refinement and non-linear transformation
Layer 3: Linear(256 → 35)
- Input: 256D refined features
- Output: 35D latent encoding
- Parameters: 8,960 + 35 (bias) = 8,995
- Purpose: Final compression to latent space
Total Estimator Parameters: 207,139
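The documented total can be reproduced with a quick weights-plus-biases count over the three layers:

```python
# (in_features, out_features) of the three estimator Linear layers
layers = [(516, 256), (256, 256), (256, 35)]
total = sum(i * o + o for i, o in layers)   # weights + biases per layer
print(total)                                # 207139
```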
Latent Space Breakdown (35D)
The 35D latent is split into two components:
- First 3D (indices 0-2): Direct features (not normalized)
- Last 32D (indices 3-34): Dynamics features (L2-normalized for stability)
Normalization:
latent_first = latent[:, :3]   # (batch, 3)
latent_rest = latent[:, 3:]    # (batch, 32)

# L2 normalize the rest
latent_rest_norm = torch.clamp(
    torch.norm(latent_rest, p=2, dim=-1, keepdim=True),
    min=1e-8
)
latent_rest_normalized = latent_rest / latent_rest_norm   # (batch, 32)
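For reference, the manual clamp-and-divide above matches PyTorch's built-in L2 normalization when given the same epsilon:

```python
import torch
import torch.nn.functional as F

latent_rest = torch.randn(4, 32)   # dummy (batch, 32) slice for illustration
# Same result as the clamp + divide: F.normalize divides by max(||x||_2, eps).
latent_rest_normalized = F.normalize(latent_rest, p=2, dim=-1, eps=1e-8)
```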
2. Actor Network (Decoder)
Purpose: Decode latent representation + recent observation into motor actions.
Actor Input Construction (121D)
actor_input = torch.cat([
    obs_recent,               # Last 86D from input (most recent frame)
    latent_first,             # First 3D from estimator (direct features)
    latent_rest_normalized    # Last 32D from estimator (normalized dynamics)
], dim=-1)
# Total: 86D + 3D + 32D = 121D
Architecture
Actor(
    # Layer 1: Initial expansion
    Linear(121, 512),   # Weight: [512, 121] = 61,952 params
    ELU(),

    # Layer 2: Feature processing
    Linear(512, 256),   # Weight: [256, 512] = 131,072 params
    ELU(),

    # Layer 3: Feature refinement
    Linear(256, 256),   # Weight: [256, 256] = 65,536 params
    ELU(),

    # Layer 4: Action head
    Linear(256, 15),    # Weight: [15, 256] = 3,840 params
)
Layer Details
Layer 1: Linear(121 → 512) + ELU()
- Input: 121D (recent obs + latent)
- Output: 512D expanded features
- Parameters: 61,952 + 512 (bias) = 62,464
- Purpose: Expand compressed representation for action generation
Layer 2: Linear(512 → 256) + ELU()
- Input: 512D expanded features
- Output: 256D intermediate features
- Parameters: 131,072 + 256 (bias) = 131,328
- Purpose: Process features with large capacity
Layer 3: Linear(256 → 256) + ELU()
- Input: 256D intermediate features
- Output: 256D refined features
- Parameters: 65,536 + 256 (bias) = 65,792
- Purpose: Additional refinement layer for complex motor patterns
Layer 4: Linear(256 → 15)
- Input: 256D refined features
- Output: 15D actions (leg + waist deltas)
- Parameters: 3,840 + 15 (bias) = 3,855
- Purpose: Final action prediction layer
Total Actor Parameters: 263,439
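The same count reproduces the actor total and, together with the estimator, the 470,578 figure at the top of this document:

```python
# (in_features, out_features) of the four actor Linear layers
actor_layers = [(121, 512), (512, 256), (256, 256), (256, 15)]
actor_total = sum(i * o + o for i, o in actor_layers)
print(actor_total)             # 263439
print(actor_total + 207139)    # 470578 = total policy parameters
```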
Forward Pass Data Flow
Complete Forward Pass
def forward(obs_516d):
    """
    Args:
        obs_516d: (batch, 516) stacked observations
    Returns:
        actions: (batch, 15) joint position deltas
    """
    # 1. Encode full observation
    latent = estimator(obs_516d)                 # (batch, 35)

    # 2. Split and normalize latent
    latent_first = latent[:, :3]                 # (batch, 3)
    latent_rest = latent[:, 3:]                  # (batch, 32)
    latent_rest_norm = torch.clamp(
        torch.norm(latent_rest, p=2, dim=-1, keepdim=True),
        min=1e-8
    )
    latent_rest_normalized = latent_rest / latent_rest_norm  # (batch, 32)

    # 3. Extract recent observation
    obs_recent = obs_516d[:, -86:]               # Last frame (batch, 86)

    # 4. Concatenate actor input
    actor_input = torch.cat([
        obs_recent,               # 86D
        latent_first,             # 3D
        latent_rest_normalized    # 32D
    ], dim=-1)                    # Total: 121D

    # 5. Generate actions
    actions = actor(actor_input)                 # (batch, 15)
    return actions
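A self-contained smoke test of this data flow, using nn.Sequential stand-ins with the documented layer sizes (the stand-ins are illustrative and untrained; the real checkpoint's module structure may differ):

```python
import torch
import torch.nn as nn

# Stand-in networks with the documented layer sizes (not the actual checkpoint).
estimator = nn.Sequential(nn.Linear(516, 256), nn.ELU(),
                          nn.Linear(256, 256), nn.ELU(),
                          nn.Linear(256, 35))
actor = nn.Sequential(nn.Linear(121, 512), nn.ELU(),
                      nn.Linear(512, 256), nn.ELU(),
                      nn.Linear(256, 256), nn.ELU(),
                      nn.Linear(256, 15))

obs = torch.randn(2, 516)          # batch of 2 stacked observations
with torch.no_grad():
    actions = forward(obs)         # forward() as defined above
assert actions.shape == (2, 15)
```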
Computational Cost
Forward pass operations:
1. Estimator: 516 → 256 → 256 → 35 (~207K params)
2. Normalization: L2 norm on 32D (~32 ops)
3. Concatenation: Build 121D input (memory copy)
4. Actor: 121 → 512 → 256 → 256 → 15 (~263K params)
Total: ~470K multiply-adds per inference
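The multiply-add figure is roughly one MAC per weight element per sample; summing the weight matrices of both networks confirms the order of magnitude:

```python
# [out, in] shapes of every weight matrix (estimator first, then actor)
weight_shapes = [(256, 516), (256, 256), (35, 256),
                 (512, 121), (256, 512), (256, 256), (15, 256)]
macs = sum(o * i for o, i in weight_shapes)
print(macs)   # 468992 -- about 470K multiply-adds per sample
```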
References
- GR00T Project: https://github.com/NVlabs/gr00t_wbc
- Paper: "GR00T: Whole-Body Control for Humanoid Robots" (NVIDIA Research)
- Unitree SDK: https://github.com/unitreerobotics/unitree_sdk2_python
License
The GR00T model is provided by NVIDIA under the NVIDIA Open Model License.
See NVIDIA Open Model License in the model directory for details.
Last Updated: 2025-01-20