GR00T-N1.5-3B Finetuned for SO-101 Table Cleanup Tasks

Model Description

This is a finetuned version of NVIDIA's GR00T-N1.5-3B foundation model, specifically adapted for SO-101 robot arm table cleanup tasks. The model has been trained on a custom dataset of 80 episodes (47,513 frames) demonstrating table cleanup behaviors using dual-camera observations.

Key Features

  • Base Model: NVIDIA GR00T-N1.5-3B (2.7B parameters)
  • Embodiment: SO-101 robot arm with 6-DOF control
  • Task Domain: Table cleanup and manipulation tasks
  • Input Modalities: Dual camera views (front + wrist cameras) + proprioceptive state
  • Output: 6-DOF joint space actions (5 arm joints + gripper)

Model Architecture

The model combines:

  • Vision-Language Model: Eagle 2.5 backbone (frozen during finetuning)
  • Action Head: Flow-matching diffusion transformer with 16 layers
  • Projector: MLP adapter connecting vision encoder to LLM
  • Embodiment Head: Custom action head for SO-101 robot configuration

Technical Specifications

  • Action Space: 5-DOF arm joints + gripper (6 total DOF)
  • Action Horizon: 16 timesteps (predicts future action sequence)
  • State Space: 5-DOF normalized joint positions + gripper state
  • Vision Input: Dual cameras (480×640×3) at 30 FPS
  • Model Precision: bfloat16 during training, float32 for inference
  • Control Type: Delta joint space control (relative movements)
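Because the control type is delta joint-space, each of the 16 predicted timesteps is interpreted relative to the arm's current joint positions. Below is a minimal worked sketch of that interpretation, assuming the deltas are simply added to the measured positions; the exact convention is defined by the embodiment transform.

import numpy as np

current_joints = np.zeros(5)                           # placeholder measured arm positions
delta = np.array([-8.59, -19.55, 21.61, 62.64, 1.13])  # one predicted horizon step (5 arm joints)
target_joints = current_joints + delta                 # delta control: relative update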

Training Details

Dataset

  • Source: SO-101 table cleanup demonstrations
  • Episodes: 80 total episodes
  • Frames: 47,513 total frames
  • Tasks: 4 different table cleanup tasks
  • Cameras: Front camera + wrist camera
  • Data Format: LeRobot compatible schema
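Because the demonstrations follow the LeRobot-compatible schema, the dataset can be inspected directly with the lerobot library. The sketch below is illustrative only: the repo_id is a placeholder, and the import path may differ across lerobot versions.

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Placeholder repo_id -- substitute the actual dataset location
dataset = LeRobotDataset("your-username/so101-table-cleanup")

print(dataset.num_episodes)  # expected: 80
print(len(dataset))          # expected: 47,513 frames
sample = dataset[0]          # dict with camera frames, proprioceptive state, and actions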

Training Configuration

  • Learning Rate: 1e-4 with cosine scheduling
  • Batch Size: 32 per device
  • Max Steps: 10,000
  • Optimizer: AdamW (β₁=0.95, β₂=0.999)
  • Weight Decay: 1e-5
  • Gradient Clipping: 1.0
  • Warmup: 5% of total steps
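For reference, the sketch below reproduces this optimizer and schedule setup in plain PyTorch; it is not the actual training script, and the trainable module is a stand-in for the projector and action-head parameters.

import torch
from torch import nn
from transformers import get_cosine_schedule_with_warmup

trainable = nn.Linear(8, 8)  # stand-in for the tuned modules (projector + action head)

max_steps = 10_000
warmup_steps = int(0.05 * max_steps)  # 5% warmup

optimizer = torch.optim.AdamW(
    trainable.parameters(),
    lr=1e-4,
    betas=(0.95, 0.999),
    weight_decay=1e-5,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=max_steps
)

# Per step, gradients are clipped to a max norm of 1.0 before optimizer.step():
# torch.nn.utils.clip_grad_norm_(trainable.parameters(), max_norm=1.0)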

Finetuning Strategy

  • Frozen Components: Vision encoder, language model
  • Tuned Components: Projector, diffusion model, embodiment-specific action head
  • Training Loss: Flow matching loss + FLARE objective
  • Inference Steps: 4 denoising steps
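In code, this strategy amounts to freezing the backbone parameters and leaving only the adapter and action-generation modules trainable. The sketch below illustrates the pattern with hypothetical attribute names (vision_encoder, language_model, projector, action_head); the actual module paths in the GR00T codebase may differ.

from torch import nn

def configure_trainable(model: nn.Module) -> list:
    """Freeze the VLM backbone; leave projector and action head trainable.

    Attribute names are illustrative, not the real GR00T module paths.
    """
    for p in model.vision_encoder.parameters():
        p.requires_grad_(False)
    for p in model.language_model.parameters():
        p.requires_grad_(False)
    for p in model.projector.parameters():
        p.requires_grad_(True)
    for p in model.action_head.parameters():
        p.requires_grad_(True)
    return [p for p in model.parameters() if p.requires_grad]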

Performance Metrics

Based on training logs (captured at step 1,460 of the 10,000 configured steps):

  • Training Loss: decreased from ~0.96 to ~0.04
  • Gradient Norm: stabilized around 0.5
  • Learning Rate: decayed to ~9.75e-5

Usage

Basic Inference

from gr00t.model.policy import Gr00tPolicy
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.experiment.data_config import DATA_CONFIG_MAP
import numpy as np

# Build the modality config and transforms from the same data config
# used by the inference server below ("so100_dualcam")
data_config = DATA_CONFIG_MAP["so100_dualcam"]
modality_config = data_config.modality_config()
modality_transform = data_config.transform()

# Load the finetuned model
policy = Gr00tPolicy(
    model_path="path/to/your/finetuned/model",
    modality_config=modality_config,
    modality_transform=modality_transform,
    embodiment_tag=EmbodimentTag.NEW_EMBODIMENT,  # custom SO-101 embodiment
    device="cuda",
)

# Prepare observation in correct format
obs = {
    "video.front": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "video.wrist": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "state.single_arm": np.array([[-1.1961363e-16, 1.1968981e-10, 4.2663229e-10, 6.3043515e-10, 3.8023183e-12]]),
    "state.gripper": np.array([[-2.5943134e-10]]),
    "annotation.human.task_description": ["pick up the cup from the table"],
}

# Run inference
action_chunk = policy.get_action(obs)
# Returns: {"action.single_arm": (16, 5), "action.gripper": (16,)}

Server-Client Setup

# Start inference server
python scripts/inference_service.py \
    --model-path /path/to/your/finetuned/model \
    --server \
    --data-config so100_dualcam \
    --embodiment-tag new_embodiment

# Run client for evaluation
python scripts/inference_service.py --client

Input/Output Format

Input Observation

obs = {
    "video.front": np.ndarray,      # (1, 480, 640, 3) front camera frames
    "video.wrist": np.ndarray,      # (1, 480, 640, 3) wrist camera frames
    "state.single_arm": np.ndarray, # (1, 5) normalized joint positions [pan, lift, elbow, wrist_flex, wrist_roll]
    "state.gripper": np.ndarray,    # (1, 1) normalized gripper position
    "annotation.human.task_description": [str]  # Language instruction as list
}

Example Input:

obs = {
    "video.front": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "video.wrist": np.random.randint(0, 256, (1, 480, 640, 3), dtype=np.uint8),
    "state.single_arm": np.array([[-1.1961363e-16, 1.1968981e-10, 4.2663229e-10, 6.3043515e-10, 3.8023183e-12]]),
    "state.gripper": np.array([[-2.5943134e-10]]),
    "annotation.human.task_description": ["pick up the cup from the table"],
}

Output Action

{
    "action.single_arm": np.ndarray,  # (16, 5) action horizon × joint deltas
    "action.gripper": np.ndarray      # (16,) gripper actions over horizon
}

Example Output:

{
    "action.single_arm": np.array([
        [ -8.586029,  -19.553513,   21.609299,   62.635612,    1.1289978],
        [ -7.937084,  -20.358055,   23.032593,   63.238243,    2.4241562],
        # ... 14 more timesteps
        [ -6.097515,  -48.115795,   44.834297,   64.58016,     3.4453125]
    ]),  # Shape: (16, 5)
    
    "action.gripper": np.array([
        15.843709, 14.624388, 11.970064, 11.224205, 10.709647, 10.889213,
        7.1574492, 7.342965, 4.87047, 3.5736556, 2.6360812, 0.70475096,
        1.0765471, 0.68617123, 0.5908453, 0.559207
    ])  # Shape: (16,)
}
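A typical control loop executes part of the 16-step chunk and then queries the policy again (receding-horizon execution). The sketch below is a hypothetical driver loop: get_observation and send_command stand in for your robot's I/O and are not part of the GR00T API; policy is the object loaded in the Basic Inference example.

import numpy as np

EXECUTE_STEPS = 8  # execute part of the 16-step chunk, then re-plan

def get_observation() -> dict:
    """Placeholder: return an obs dict in the input format shown above."""
    raise NotImplementedError

def send_command(arm_action: np.ndarray, gripper_action: float) -> None:
    """Placeholder: forward one timestep of actions to the SO-101 controller."""
    raise NotImplementedError

while True:
    obs = get_observation()
    chunk = policy.get_action(obs)    # policy from the Basic Inference example
    arm = chunk["action.single_arm"]  # (16, 5)
    grip = chunk["action.gripper"]    # (16,)
    for t in range(EXECUTE_STEPS):
        send_command(arm[t], float(grip[t]))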

Hardware Requirements

Training

  • GPU: H100, L40, RTX 4090, or A6000
  • Memory: 24GB+ VRAM recommended
  • CUDA: Version 12.4
  • Python: 3.10

Inference

  • GPU: RTX 3090, RTX 4090, or A6000
  • Memory: 8GB+ VRAM
  • Latency: ~48ms per inference (H100)

Limitations and Considerations

  1. Domain Specificity: Model is specialized for table cleanup tasks and may not generalize to other manipulation domains
  2. Robot Configuration: Optimized for SO-101 robot arm; adaptation required for other embodiments
  3. Camera Setup: Requires specific dual-camera configuration (front + wrist)
  4. Action Space: Limited to 6-DOF joint space control
  5. Safety: Model outputs should be validated and constrained for safe robot operation (see the clipping sketch below)
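One basic safeguard is clamping the predicted actions before they reach the hardware. The sketch below shows a simple per-joint clip; the limit values are placeholders and must be replaced with the real SO-101 joint and gripper ranges.

import numpy as np

# Placeholder limits -- substitute the actual SO-101 joint/gripper ranges
ARM_LOW = np.array([-90.0, -90.0, -90.0, -90.0, -90.0])
ARM_HIGH = np.array([90.0, 90.0, 90.0, 90.0, 90.0])
GRIP_LOW, GRIP_HIGH = 0.0, 100.0

def clamp_chunk(chunk: dict) -> dict:
    """Clip a predicted action chunk to safe ranges before execution."""
    return {
        "action.single_arm": np.clip(chunk["action.single_arm"], ARM_LOW, ARM_HIGH),
        "action.gripper": np.clip(chunk["action.gripper"], GRIP_LOW, GRIP_HIGH),
    }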

Citation

If you use this model, please cite the original GR00T paper:

@article{gr00t_n1_2025,
  title={GR00T N1: An Open Foundation Model for Generalist Humanoid Robots},
  author={NVIDIA},
  journal={arXiv preprint arXiv:2503.14734},
  year={2025}
}

License

This model is released under the Apache License 2.0. See the LICENSE file for details.

Acknowledgments

  • Built on NVIDIA's GR00T-N1.5 foundation model
  • Dataset based on LeRobot SO-101 demonstrations
  • Training infrastructure provided by NVIDIA Isaac GR00T framework