
Description:

GN1.6-Tuned-Arena-GR1-PlaceItemCloseDoor-Task is a fine-tuned NVIDIA Isaac GR00T N1.6 model that performs sequential manipulation tasks provided in IsaacLab Arena, including putting an object in the fridge and closing the fridge door.

Isaac GR00T N1.6-3B is the medium-sized version of our model built using pre-trained vision and language encoders, and uses a flow matching action transformer to model a chunk of actions conditioned on vision, language and proprioception.

This model is ready for non-commercial use only.

License/Terms of Use

Nvidia License
You are responsible for ensuring that your use of NVIDIA provided Models complies with all applicable laws.

Deployment Geography:

Global

Release Date:

GitHub: 03/18/2026 via isaac-sim/IsaacLab-Arena
Hugging Face: 03/18/2026 via nvidia/GN1.6-Tuned-Arena-GR1-PlaceItemCloseDoor-Task

Use Case:

  • Researchers, Academics, Open-Source Community: AI-driven robotics research and algorithm development.
  • Developers: Integrate and customize AI for various robotic applications.
  • Startups & Companies: Accelerate robotics development and reduce training costs.

Reference(s):

Eagle VLM: Chen, Guo, et al. "Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models." arXiv:2504.15271 (2025).

Rectified Flow: Liu, Xingchao, and Chengyue Gong. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." The Eleventh International Conference on Learning Representations (2023).

Flow Matching Policy: Black, Kevin, et al. "π0: A Vision-Language-Action Flow Model for General Robot Control." arXiv preprint arXiv:2410.24164 (2024).

Model Architecture:

Architecture Type: Vision Transformer, Multilayer Perceptron, Flow Matching Transformer

GR00T N1.6 uses vision and text transformers to encode the robot's image observations and text instructions. The architecture handles a varying number of views per embodiment by concatenating image token embeddings from all frames into a sequence, followed by language token embeddings.

To model proprioception and a sequence of actions conditioned on observations, GR00T N1.6 uses a flow matching transformer. The flow matching transformer interleaves self-attention over proprioception and actions with cross-attention to the vision and language embeddings. During training, the input actions are corrupted by randomly interpolating between the clean action vector and a gaussian noise vector. At inference time, the policy first samples a gaussian noise vector and iteratively reconstructs a continuous-value action using its velocity prediction.
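The corruption and sampling procedure described above can be sketched as follows. This is a minimal illustration of the rectified-flow recipe, not the model's actual implementation; the function names, the 10-step Euler integration, and the convention that flow time t = 1 corresponds to clean actions are all assumptions.

```python
import torch

def corrupt_actions(actions, noise):
    """Training-time corruption: randomly interpolate between the clean
    action chunk and a Gaussian noise vector at a flow time t in [0, 1]."""
    t = torch.rand(actions.shape[0], 1, 1)          # one random t per batch element
    x_t = (1.0 - t) * noise + t * actions           # linear interpolation
    velocity_target = actions - noise               # rectified-flow velocity target
    return x_t, t, velocity_target

def sample_actions(velocity_fn, chunk_shape, steps=10):
    """Inference: start from Gaussian noise and iteratively reconstruct a
    continuous-value action chunk by Euler-integrating predicted velocities."""
    x = torch.randn(chunk_shape)                    # initial Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((chunk_shape[0], 1, 1), i * dt)
        x = x + dt * velocity_fn(x, t)              # follow the predicted velocity
    return x
```

In practice `velocity_fn` would be the flow matching transformer, conditioned on the vision, language, and proprioception embeddings.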

In GR00T N1.6, the MLP connector between the vision-language features and the diffusion transformer (DiT) has been modified for improved performance on our sim benchmarks. The model was also trained jointly with flow-matching and world-modeling objectives.

Network Architecture: The schematic diagram is shown in the illustration above.

  • Red, Green, Blue (RGB) camera frames are processed through a pre-trained vision transformer (SigLIP 2).
  • Text is encoded by a pre-trained transformer (T5).
  • Robot proprioception is encoded using a multi-layer perceptron (MLP) indexed by the embodiment ID. To handle variable-dimension proprioception, inputs are padded to a configurable max length before feeding into the MLP.
  • Actions are encoded, and velocity predictions decoded, by an MLP, one per unique embodiment.
  • The flow matching transformer is implemented as a diffusion transformer (DiT), in which the diffusion step conditioning is implemented using adaptive layer normalization (AdaLN).
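The padding-plus-per-embodiment-MLP pattern for proprioception can be sketched as below. This is an illustrative reconstruction, not the released code: `MAX_STATE_DIM`, the hidden width, and the two-layer GELU MLP are assumed values standing in for the model's configurable equivalents.

```python
import torch
import torch.nn as nn

MAX_STATE_DIM = 64  # assumed stand-in for the configurable max proprioception length

class StateEncoder(nn.Module):
    """Zero-pads variable-dimension proprioception to a fixed length and
    encodes it with an MLP selected by embodiment ID (illustrative sketch)."""

    def __init__(self, num_embodiments, hidden_dim=256):
        super().__init__()
        # One MLP per unique embodiment, indexed by embodiment ID.
        self.mlps = nn.ModuleList(
            nn.Sequential(
                nn.Linear(MAX_STATE_DIM, hidden_dim),
                nn.GELU(),
                nn.Linear(hidden_dim, hidden_dim),
            )
            for _ in range(num_embodiments)
        )

    def forward(self, state, embodiment_id):
        # Pad the trailing (feature) dimension up to the configured maximum.
        pad = MAX_STATE_DIM - state.shape[-1]
        state = torch.nn.functional.pad(state, (0, pad))
        return self.mlps[embodiment_id](state)
```

The same indexing-by-embodiment idea applies to the action encoder and velocity decoder MLPs mentioned above.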

Number of Model Parameters: 3B

Input:

Input Type:

  • Vision: Image Frames
  • State: Robot Proprioception
  • Language Instruction: Text

Input Format:

  • Vision: Variable number of 224x224 uint8 image frames, coming from robot cameras
  • State: Floating Point
  • Language Instruction: String

Input Parameters:

  • Vision: Two-Dimensional (2D) - Red, Green, Blue (RGB) image, square
  • State: One-Dimensional (1D) - Floating number vector
  • Language Instruction: One-Dimensional (1D) - String
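The input specification above can be expressed as a small validation helper. The dictionary keys (`video`, `state`, `language`) are hypothetical names for illustration; only the shapes and dtypes come from the spec.

```python
import numpy as np

def make_observation(frames, proprio, instruction):
    """Assemble one policy observation from the three input modalities
    (field names here are illustrative, not the model's actual schema)."""
    frames = np.asarray(frames, dtype=np.uint8)
    assert frames.ndim == 4 and frames.shape[1:3] == (224, 224) and frames.shape[3] == 3, \
        "expected (num_views, 224, 224, 3) uint8 RGB frames"
    return {
        "video": frames,                                 # variable number of camera views
        "state": np.asarray(proprio, dtype=np.float32),  # 1-D proprioception vector
        "language": str(instruction),                    # task instruction text
    }
```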

Output:

Output Type(s): Actions
Output Format: Continuous-value vectors
Output Parameters: Two-Dimensional (2D)
Other Properties Related to Output: The continuous-value vectors correspond to different motor controls on a robot, which depend on the degrees of freedom of the robot embodiment.
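Since the action dimension must match the embodiment's degrees of freedom, a deployment typically validates and clips the predicted chunk against per-joint limits before sending it to the controller. A minimal sketch, with a hypothetical function name and example limits:

```python
import numpy as np

def to_motor_commands(action_chunk, joint_low, joint_high):
    """Validate a predicted continuous action chunk of shape (T, dof) against
    the embodiment's DoF count and clip it to per-joint limits (illustrative)."""
    action_chunk = np.asarray(action_chunk, dtype=np.float32)
    assert action_chunk.shape[-1] == len(joint_low) == len(joint_high), \
        "action dimension must match the embodiment's degrees of freedom"
    return np.clip(action_chunk, joint_low, joint_high)
```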

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Model Version(s):

Version 1.6.

Post-Training Dataset:

Trained with dataset nvidia/Arena-GR1-Manipulation-PlaceItemCloseDoor-Task

Data Collection Method

State: Hybrid: Automated, Automatic/Sensors, Synthetic, Human

Ten human-teleoperated demonstrations were collected through a depth camera and keyboard in Isaac Lab. From these, 100 demonstrations were generated automatically using MimicGen [1], a synthetic motion trajectory generation framework. Each demonstration is generated at 50 Hz.

Labeling Method

Not Applicable

Inference:

Acceleration Engine(s): PyTorch

Test Hardware: All of the below:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace

Supported Operating System(s):

  • Linux

Software Integration

Runtime Engine(s): Not Applicable

Supported Hardware Microarchitecture Compatibility: All of the below:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace

Preferred/Supported Operating System(s): Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
