Model Card for TBD-VLA

TBD-VLA (Temporal Block Diffusion Vision Language Action Model) is a Vision-Language-Action policy based on Block Discrete Denoising Diffusion. It uses a Qwen3-VL vision-language backbone and predicts robot action chunks through temporal-level block diffusion.

The model was introduced in the paper TBD-VLA: Temporal Block Diffusion Vision Language Action Model by Sung-Wook Lee, Xuhui Kang, and Yen-Ling Kuo.

This policy was trained and pushed to the Hub using a fork of LeRobot, the same repository cloned in the installation steps below. See the full documentation here.


How to Get Started with the Model

Installation

git clone https://github.com/TBD-VLA/lerobot.git
cd lerobot
uv python install 3.12
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[libero]"
uv pip install -U transformers
uv pip install -U accelerate

Train from scratch

python src/lerobot/scripts/lerobot_train.py \
  --policy.type=tbdvla \
  --output_dir=$OUTPUT_DIR \
  --dataset.repo_id=sean1295/libero_all \
  --job_name=tbdvla_experiment \
  --steps=150000 \
  --batch_size=4 \
  --save_freq=20000 \
  --log_freq=1000 \
  --policy.device=cuda \
  --policy.n_bins=512 \
  --policy.block_temporal_size=4 \
  --policy.n_diffusion_steps=2 \
  --policy.gripper_dims='[-1]' \
  --policy.chunk_size=16 \
  --policy.n_action_steps=16 \
  --policy.gradient_checkpointing=true \
  --policy.push_to_hub=false \
  --wandb.enable=false

Training writes checkpoints to the directory given by --output_dir, at the interval set by --save_freq (every 20000 steps in the command above).

Multi-GPU training

accelerate launch --multi_gpu --num_processes=4 \
  src/lerobot/scripts/lerobot_train.py \
  --wandb.enable=false \
  --num_workers=4 \
  --policy.type=tbdvla \
  --policy.push_to_hub=false \
  --dataset.repo_id=sean1295/libero_all \
  --steps=150000 \
  --save_freq=20000 \
  --log_freq=1000 \
  --batch_size=16 \
  --policy.device=cuda \
  --policy.n_bins=512 \
  --policy.block_temporal_size=4 \
  --policy.n_diffusion_steps=2 \
  --policy.gripper_dims='[-1]' \
  --policy.chunk_size=16 \
  --policy.n_action_steps=16
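
The two training commands differ in how many samples each optimizer step consumes. The sketch below works through that arithmetic under one assumption that this card does not confirm: that --batch_size is applied per process, as is the default when launching with Accelerate, so the effective batch size is the per-process value times --num_processes. The helper names are hypothetical.

```python
def effective_batch_size(num_processes, per_process_batch_size):
    """Samples consumed per optimizer step, assuming a per-process batch size."""
    return num_processes * per_process_batch_size


def samples_seen(steps, num_processes, per_process_batch_size):
    """Total training samples consumed over a full run."""
    return steps * effective_batch_size(num_processes, per_process_batch_size)


# Single-GPU command: --batch_size=4, --steps=150000
single_gpu_total = samples_seen(150_000, num_processes=1, per_process_batch_size=4)

# Multi-GPU command: --num_processes=4, --batch_size=16, --steps=150000
multi_gpu_total = samples_seen(150_000, num_processes=4, per_process_batch_size=16)

print(single_gpu_total)  # 600000
print(multi_gpu_total)   # 9600000
```

Under that assumption the multi-GPU run sees 16 times as many samples for the same --steps value, so the two commands are not equivalent training budgets.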

Evaluate the policy/run inference

uv run python src/lerobot/scripts/lerobot_eval.py \
  --policy.path=$CKPT_DIR \
  --env.type=libero \
  --env.task=libero_10 \
  --eval.n_episodes=50 \
  --eval.batch_size=1 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  --policy.n_action_steps=12 \
  --policy.n_diffusion_steps=2 \
  --policy.compile_model=true

Set --policy.path to a local checkpoint directory or to a Hub repository ID such as sean1295/tbdvla_libero (this model).


Model Details

  • Model type: Vision-Language-Action policy
  • Architecture: Block Diffusion VLA
  • VLM backbone: Qwen/Qwen3-VL-2B-Instruct
  • License: apache-2.0

Architecture

TBD-VLA contains the following main components:

| Component | Description |
|---|---|
| VLM backbone | Qwen3-VL model used for vision-language conditioning |
| Action tokenizer | Discretizes continuous robot actions into token bins |
| Block denoising module | Performs block-temporal denoising over action chunks |
| Pre/post-processors | Handle normalization, device transfer, and action conversion |
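
To make the action tokenizer's role concrete, here is a minimal sketch of discretizing a continuous action value into one of n_bins tokens and mapping it back. It assumes uniform bins over a normalized range of [-1, 1] and decoding to bin centres; the actual tokenizer in modeling_tbdvla.py may use a different range or binning scheme, and the function names are hypothetical.

```python
def encode_action(value, n_bins=512, low=-1.0, high=1.0):
    """Map a continuous value in [low, high] to an integer bin index."""
    clipped = min(max(value, low), high)
    # Scale to [0, n_bins) and clamp the upper edge into the last bin.
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)


def decode_action(index, n_bins=512, low=-1.0, high=1.0):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


# Assumed example values, already normalized to [-1, 1].
tokens = [encode_action(a) for a in (-1.0, 0.0, 0.37, 1.0)]
recovered = [decode_action(t) for t in tokens]

print(tokens)  # [0, 256, 350, 511]
```

With 512 bins over a range of width 2, each bin is about 0.0039 wide, so the round-trip error of this scheme is at most half of that per action dimension.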

Files

| File | Description |
|---|---|
| configuration_tbdvla.py | TBDVLAConfig dataclass with policy hyperparameters |
| modeling_tbdvla.py | TBDVLAPolicy implementation, including model, training loss, and inference |
| processor_tbdvla.py | Pre/post-processing pipelines for normalization and device transfer |
| __init__.py | Exports TBDVLAConfig, TBDVLAPolicy, and make_tbdvla_pre_post_processors |

Key Parameters

Model Architecture

| Parameter | Description | Default |
|---|---|---|
| --policy.vlm_checkpoint | Qwen3-VL model ID | Qwen/Qwen3-VL-2B-Instruct |
| --policy.num_vlm_layers | Number of VLM layers to use (-1 = all) | -1 |

Diffusion / Block Denoising

| Parameter | Description | Default |
|---|---|---|
| --policy.block_temporal_size | Temporal steps per block | 4 |
| --policy.n_diffusion_steps | Number of denoising steps at inference | 2 |
| --policy.chunk_size | Action chunk length (must be a multiple of block_temporal_size) | 16 |
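
The relationship between chunk_size and block_temporal_size can be sketched as a simple partition of the action chunk into temporal blocks. The helper below is hypothetical and only illustrates the divisibility constraint. The pass count at the end additionally assumes that blocks are denoised one after another with n_diffusion_steps passes each, which this card does not state.

```python
def split_into_blocks(chunk, block_temporal_size):
    """Partition a chunk of timesteps into consecutive temporal blocks."""
    if len(chunk) % block_temporal_size != 0:
        raise ValueError(
            f"chunk_size={len(chunk)} is not a multiple of "
            f"block_temporal_size={block_temporal_size}"
        )
    return [
        chunk[start:start + block_temporal_size]
        for start in range(0, len(chunk), block_temporal_size)
    ]


# Defaults from the table: chunk_size=16, block_temporal_size=4.
chunk = list(range(16))  # stand-in for 16 timesteps of actions
blocks = split_into_blocks(chunk, 4)

# Assumption: each block is denoised in turn with n_diffusion_steps passes.
n_diffusion_steps = 2
total_passes = len(blocks) * n_diffusion_steps

print(len(blocks))    # 4
print(total_passes)   # 8
```

A chunk_size of 18 with the same block size would raise the ValueError, which is the constraint noted in the table.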

Training Hyperparameters

| Parameter | Description | Default |
|---|---|---|
| --policy.n_bins | Number of action discretization bins | 512 |
| --policy.n_obs_steps | Number of observation steps (only 1 supported) | 1 |
| --policy.max_task_tokens | Max task/language tokens fed to the VLM | 64 |
| --policy.use_state | Include proprioceptive state input | true |
| --policy.state_dropout_p | Dropout probability for state input | 0.0 |
| --policy.image_resolution | Resize images to this resolution | 256,256 |
| --policy.gradient_checkpointing | Enable gradient checkpointing (saves VRAM) | false |
| --policy.precision | Training precision (float16, bfloat16, float32) | bfloat16 |
| --policy.attn_implementation | Attention backend (eager, sdpa, flex_attention) | sdpa |
| --policy.optimizer_lr | AdamW learning rate | 1e-4 |

Inference Hyperparameters

| Parameter | Description | Default |
|---|---|---|
| --policy.n_action_steps | Steps executed per inference (must be <= chunk_size) | 12 |
| --policy.gripper_dims | Gripper dimension indices | [-1] |
| --policy.expectation_sample | Use expectation-based sampling | true |
| --policy.compile_model | Wrap the VLM forward in torch.compile | false |
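
The interaction between n_action_steps and chunk_size can be illustrated with a short calculation: the policy predicts chunk_size actions per inference call but executes only the first n_action_steps before predicting again. The sketch below counts inference calls for an episode under that reading. The episode length of 520 steps is an assumed example value, and the helper name is hypothetical.

```python
import math


def plan_inference_calls(episode_steps, chunk_size=16, n_action_steps=12):
    """Return (inference calls needed, predicted actions unused per call)."""
    if not 0 < n_action_steps <= chunk_size:
        raise ValueError("n_action_steps must be in the range 1..chunk_size")
    calls = math.ceil(episode_steps / n_action_steps)
    unused_per_call = chunk_size - n_action_steps
    return calls, unused_per_call


# Evaluation setting from this card: chunk_size=16, n_action_steps=12.
print(plan_inference_calls(520))          # (44, 4)

# Executing the full chunk means fewer calls, but replanning happens less often.
print(plan_inference_calls(520, 16, 16))  # (33, 0)
```

Lower values of n_action_steps trade extra inference calls for more frequent replanning from fresh observations.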

BibTeX

@article{lee2026tbdvlatemporalblockdiffusion,
      title={TBD-VLA: Temporal Block Diffusion Vision Language Action Model},
      author={Lee, Sung-Wook and Kang, Xuhui and Kuo, Yen-Ling},
      journal={arXiv preprint},
      year={2026},
      url={https://arxiv.org/abs/2606.07895}
}