GR00T-N1.5-3B Finetuned for SO-101 Table Cleanup Tasks
Model Description
This is a finetuned version of NVIDIA's GR00T-N1.5-3B foundation model, specifically adapted for SO-101 robot arm table cleanup tasks. The model has been trained on a custom dataset of 80 episodes (47,513 frames) demonstrating table cleanup behaviors using dual-camera observations.
Key Features
- Base Model: NVIDIA GR00T-N1.5-3B (2.7B parameters)
- Embodiment: SO-101 robot arm with 6-DOF control
- Task Domain: Table cleanup and manipulation tasks
- Input Modalities: Dual camera views (front + wrist cameras) + proprioceptive state
- Output: 6-DOF joint space actions (5 arm joints + gripper)
Model Architecture
The model combines:
- Vision-Language Model: Eagle 2.5 backbone (frozen during finetuning)
- Action Head: Flow-matching diffusion transformer with 16 layers
- Projector: MLP adapter connecting vision encoder to LLM
- Embodiment Head: Custom action head for SO-101 robot configuration
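The sketch below is a toy illustration of this data flow (frozen VLM tokens → projector → flow-matching action head → action chunk). Module names and dimensions are placeholders chosen for the example, not the real GR00T implementation.

# Toy schematic of the GR00T-N1.5 data flow; widths and module choices are
# assumptions made for this sketch only.
import torch
import torch.nn as nn

D_VLM, D_ACT, HORIZON, DOF = 2048, 1024, 16, 6   # assumed dimensions

backbone = nn.Identity()                 # stand-in for the frozen Eagle 2.5 VLM
projector = nn.Linear(D_VLM, D_ACT)      # stand-in for the tuned MLP adapter
action_head = nn.TransformerEncoder(     # stand-in for the 16-layer flow-matching DiT
    nn.TransformerEncoderLayer(d_model=D_ACT, nhead=8, batch_first=True),
    num_layers=16,
)
readout = nn.Linear(D_ACT, DOF)          # maps action tokens to per-step joint values

vlm_tokens = backbone(torch.randn(1, 32, D_VLM))        # fused image + text tokens
cond = projector(vlm_tokens[..., :D_ACT])               # conditioning for the action head
noise = torch.randn(1, HORIZON, D_ACT)                   # flow matching starts from noise
tokens = action_head(torch.cat([cond, noise], dim=1))    # denoise conditioned on VLM tokens
actions = readout(tokens[:, -HORIZON:, :])               # (1, 16, 6) action chunk
print(actions.shape)                                      # torch.Size([1, 16, 6])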
Technical Specifications
- Action Space: 5-DOF arm joints + gripper (6 total DOF)
- Action Horizon: 16 timesteps (predicts future action sequence)
- State Space: 5-DOF normalized joint positions + gripper state
- Vision Input: Dual cameras (480×640×3) at 30 FPS
- Model Precision: bfloat16 during training, float32 for inference
- Control Type: Delta joint space control (relative movements; see the sketch after this list for how a chunk of deltas would be applied)
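To make the delta control type concrete, here is a minimal sketch of applying one predicted chunk, assuming the deltas are added to the current joint readings and clipped to joint limits before being sent to the arm. The limit values and the send_joint_command callback are hypothetical, not SO-101 specifications.

# Minimal sketch of integrating a delta-joint-space action chunk. The limit
# values and the send_joint_command() helper are hypothetical placeholders.
import numpy as np

JOINT_LOW  = np.array([-110.0, -100.0, -100.0, -95.0, -160.0, 0.0])   # assumed limits
JOINT_HIGH = np.array([ 110.0,  100.0,  100.0,  95.0,  160.0, 100.0])

def apply_action_chunk(current_joints, arm_deltas, gripper_deltas, send_joint_command):
    """Integrate a (16, 5) arm chunk and a (16,) gripper chunk from the current pose."""
    joints = current_joints.copy()                       # (6,) = 5 arm joints + gripper
    for arm_d, grip_d in zip(arm_deltas, gripper_deltas):
        joints[:5] += arm_d                              # relative arm movement
        joints[5]  += grip_d                             # relative gripper movement
        joints = np.clip(joints, JOINT_LOW, JOINT_HIGH)  # keep within assumed limits
        send_joint_command(joints)                       # robot-specific transport
    return joints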
Training Details
Dataset
- Source: SO-101 table cleanup demonstrations
- Episodes: 80 total episodes
- Frames: 47,513 total frames
- Tasks: 4 different table cleanup tasks
- Cameras: Front camera + wrist camera
- Data Format: LeRobot-compatible schema (see the loading sketch after this list)
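Because the data follows the LeRobot schema, it can be loaded with the Isaac GR00T dataset tooling. The sketch below assumes the LeRobotSingleDataset loader and the so100_dualcam data config from the public Isaac-GR00T repository; exact class and config names may differ across versions.

# Sketch of loading the demonstrations with Isaac GR00T tooling; class/config
# names follow the public repo and this card's so100_dualcam config, but
# treat them as assumptions.
from gr00t.data.dataset import LeRobotSingleDataset
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.experiment.data_config import DATA_CONFIG_MAP

data_config = DATA_CONFIG_MAP["so100_dualcam"]
dataset = LeRobotSingleDataset(
    dataset_path="path/to/so101_table_cleanup",    # LeRobot-format dataset directory
    modality_configs=data_config.modality_config(),
    transforms=data_config.transform(),
    embodiment_tag=EmbodimentTag.NEW_EMBODIMENT,
)
print(len(dataset), dataset[0].keys())  # frame count and per-frame modality keys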
Training Configuration
- Learning Rate: 1e-4 with cosine scheduling
- Batch Size: 32 per device
- Max Steps: 10,000
- Optimizer: AdamW (β₁=0.95, β₂=0.999)
- Weight Decay: 1e-5
- Gradient Clipping: 1.0
- Warmup: 5% of total steps
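The values above map directly onto a standard AdamW plus warmup-cosine setup. A minimal sketch follows; the model here is only a placeholder for the trainable GR00T modules.

# Optimizer/schedule implied by the hyperparameters above (AdamW, lr 1e-4,
# betas (0.95, 0.999), weight decay 1e-5, 5% warmup, cosine decay).
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # placeholder for the projector + action head parameters
max_steps = 10_000

optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.95, 0.999), weight_decay=1e-5
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * max_steps),  # 5% warmup
    num_training_steps=max_steps,
)
# Inside the training loop, gradients are clipped to 1.0 before each step:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)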
Finetuning Strategy
- Frozen Components: Vision encoder, language model
- Tuned Components: Projector, diffusion model, embodiment-specific action head
- Training Loss: Flow matching loss + FLARE objective
- Inference Steps: 4 denoising steps
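A sketch of enforcing the frozen/tuned split above with requires_grad flags. The parameter-name prefixes are illustrative assumptions, not the actual GR00T module names.

# Illustrative frozen/tuned split; the name prefixes are assumptions.
TUNED_PREFIXES = ("backbone.projector", "action_head")  # adapter + diffusion model + embodiment head

def set_trainable(model):
    for name, param in model.named_parameters():
        # Vision encoder and language model stay frozen; projector, diffusion
        # transformer, and embodiment-specific head remain trainable.
        param.requires_grad = name.startswith(TUNED_PREFIXES)
    tuned = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {tuned:,}")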
Performance Metrics
Based on training logs:
- Final Training Loss: ~0.04 (converged from initial ~0.96)
- Training Steps: 1,460 completed (of the 10,000 configured maximum)
- Gradient Norm: Stabilized around 0.5
- Learning Rate: Decayed to ~9.75e-5
Usage
Basic Inference
from gr00t.model.policy import Gr00tPolicy
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.experiment.data_config import DATA_CONFIG_MAP
import numpy as np

# Build the modality config and transforms from the same so100_dualcam data
# config used by the inference server example below
data_config = DATA_CONFIG_MAP["so100_dualcam"]
modality_config = data_config.modality_config()
transforms = data_config.transform()

# Load the finetuned model
policy = Gr00tPolicy(
    model_path="path/to/your/finetuned/model",
    modality_config=modality_config,
    modality_transform=transforms,
    embodiment_tag=EmbodimentTag.NEW_EMBODIMENT,  # Custom SO-101 embodiment
    device="cuda",
)

# Prepare an observation in the expected format
obs = {
    "video.front": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "video.wrist": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "state.single_arm": np.array([[-1.1961363e-16, 1.1968981e-10, 4.2663229e-10, 6.3043515e-10, 3.8023183e-12]]),
    "state.gripper": np.array([[-2.5943134e-10]]),
    "annotation.human.task_description": ["pick up the cup from the table"],
}

# Run inference
action_chunk = policy.get_action(obs)
# Returns: {"action.single_arm": (16, 5), "action.gripper": (16,)}
Server-Client Setup
# Start inference server
python scripts/inference_service.py \
    --model-path /path/to/your/finetuned/model \
    --server \
    --data-config so100_dualcam \
    --embodiment-tag new_embodiment
# Run client for evaluation
python scripts/inference_service.py --client
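The running server can also be queried from Python. This sketch assumes the ExternalRobotInferenceClient from the Isaac-GR00T repository (gr00t.eval.service) and its default host/port; the observation contents are placeholders.

# Query the inference server from Python; client class and port follow the
# public Isaac-GR00T repo and should be treated as assumptions.
import numpy as np
from gr00t.eval.service import ExternalRobotInferenceClient

client = ExternalRobotInferenceClient(host="localhost", port=5555)

obs = {
    "video.front": np.zeros((1, 480, 640, 3), dtype=np.uint8),
    "video.wrist": np.zeros((1, 480, 640, 3), dtype=np.uint8),
    "state.single_arm": np.zeros((1, 5), dtype=np.float32),
    "state.gripper": np.zeros((1, 1), dtype=np.float32),
    "annotation.human.task_description": ["pick up the cup from the table"],
}
action_chunk = client.get_action(obs)
print(action_chunk["action.single_arm"].shape)  # (16, 5)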
Input/Output Format
Input Observation
obs = {
    "video.front": np.ndarray,        # (1, 480, 640, 3) front camera frames
    "video.wrist": np.ndarray,        # (1, 480, 640, 3) wrist camera frames
    "state.single_arm": np.ndarray,   # (1, 5) normalized joint positions [pan, lift, elbow, wrist_flex, wrist_roll]
    "state.gripper": np.ndarray,      # (1, 1) normalized gripper position
    "annotation.human.task_description": [str],  # Language instruction as a list
}
Example Input:
obs = {
    "video.front": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "video.wrist": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "state.single_arm": np.array([[-1.1961363e-16, 1.1968981e-10, 4.2663229e-10, 6.3043515e-10, 3.8023183e-12]]),
    "state.gripper": np.array([[-2.5943134e-10]]),
    "annotation.human.task_description": ["pick up the cup from the table"],
}
Output Action
{
    "action.single_arm": np.ndarray,  # (16, 5) action horizon × joint deltas
    "action.gripper": np.ndarray,     # (16,) gripper actions over the horizon
}
Example Output:
{
    "action.single_arm": np.array([
        [ -8.586029, -19.553513, 21.609299, 62.635612, 1.1289978],
        [ -7.937084, -20.358055, 23.032593, 63.238243, 2.4241562],
        # ... 14 more timesteps
        [ -6.097515, -48.115795, 44.834297, 64.58016, 3.4453125],
    ]),  # Shape: (16, 5)
    "action.gripper": np.array([
        15.843709, 14.624388, 11.970064, 11.224205, 10.709647, 10.889213,
        7.1574492, 7.342965, 4.87047, 3.5736556, 2.6360812, 0.70475096,
        1.0765471, 0.68617123, 0.5908453, 0.559207,
    ]),  # Shape: (16,)
}
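A small sanity check that a returned chunk matches the documented shapes before it is sent to the robot; the helper name is ours, not part of the GR00T API.

# Validate a policy output against the format documented above.
import numpy as np

def check_action_chunk(chunk, horizon=16, arm_dof=5):
    assert chunk["action.single_arm"].shape == (horizon, arm_dof)
    assert chunk["action.gripper"].shape == (horizon,)
    assert np.isfinite(chunk["action.single_arm"]).all()
    assert np.isfinite(chunk["action.gripper"]).all()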
Hardware Requirements
Training
- GPU: H100, L40, RTX 4090, or A6000
- Memory: 24GB+ VRAM recommended
- CUDA: Version 12.4
- Python: 3.10
Inference
- GPU: RTX 3090, RTX 4090, or A6000
- Memory: 8GB+ VRAM
- Latency: ~48ms per inference (H100)
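A rough way to measure per-call latency on your own hardware, reusing the policy and obs from the Basic Inference example; timings will vary with GPU, precision, and denoising steps.

# Simple wall-clock latency probe for policy.get_action(obs).
import time

def time_inference(policy, obs, n=20):
    policy.get_action(obs)                      # warmup (CUDA kernels, caches)
    start = time.perf_counter()
    for _ in range(n):
        policy.get_action(obs)
    return (time.perf_counter() - start) / n * 1000.0  # mean ms per call

# print(f"{time_inference(policy, obs):.1f} ms per action chunk")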
Limitations and Considerations
- Domain Specificity: Model is specialized for table cleanup tasks and may not generalize to other manipulation domains
- Robot Configuration: Optimized for SO-101 robot arm; adaptation required for other embodiments
- Camera Setup: Requires specific dual-camera configuration (front + wrist)
- Action Space: Limited to 6-DOF joint space control
- Safety: Model outputs should be validated and constrained for safe robot operation
Citation
If you use this model, please cite the original GR00T paper:
@article{gr00tn1_2025,
  title   = {GR00T N1: An Open Foundation Model for Generalist Humanoid Robots},
  author  = {NVIDIA},
  journal = {arXiv preprint arXiv:2503.14734},
  year    = {2025}
}
License
This model is released under the Apache License 2.0. See the LICENSE file for details.
Acknowledgments
- Built on NVIDIA's GR00T-N1.5 foundation model
- Dataset based on LeRobot SO-101 demonstrations
- Training infrastructure provided by NVIDIA Isaac GR00T framework