Model Card for TBD-VLA

TBD-VLA (Temporal Block Diffusion Vision Language Action Model) is a Vision-Language-Action policy based on Block Discrete Denoising Diffusion. It uses a Qwen3-VL vision-language backbone and predicts robot action chunks through temporal-level block diffusion.

The model was introduced in the paper TBD-VLA: Temporal Block Diffusion Vision Language Action Model by Sung-Wook Lee, Xuhui Kang, and Yen-Ling Kuo.

Project Webpage: https://tbd-vla.github.io/
Code: https://github.com/TBD-VLA/lerobot (LeRobot fork)

This policy has been trained and pushed to the Hub using a LeRobot fork. See the full documentation here.

How to Get Started with the Model

Installation

git clone https://github.com/TBD-VLA/lerobot.git
cd lerobot
uv python install 3.12
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[libero]"
uv pip install -U transformers
uv pip install -U accelerate

Train from scratch

python src/lerobot/scripts/lerobot_train.py \
  --policy.type=tbdvla \
  --output_dir=$OUTPUT_DIR \
  --dataset.repo_id=sean1295/libero_all \
  --job_name=tbdvla_experiment \
  --steps=150000 \
  --batch_size=4 \
  --save_freq=20000 \
  --log_freq=1000 \
  --policy.device=cuda \
  --policy.n_bins=512 \
  --policy.block_temporal_size=4 \
  --policy.n_diffusion_steps=2 \
  --policy.gripper_dims='[-1]' \
  --policy.chunk_size=16 \
  --policy.n_action_steps=16 \
  --policy.gradient_checkpointing=true \
  --policy.push_to_hub=false \
  --wandb.enable=false

Checkpoints are written to the directory set by --output_dir, one every --save_freq steps.

Multi-GPU training

accelerate launch --multi_gpu --num_processes=4 \
  src/lerobot/scripts/lerobot_train.py \
  --wandb.enable=false \
  --num_workers=4 \
  --policy.type=tbdvla \
  --policy.push_to_hub=false \
  --dataset.repo_id=sean1295/libero_all \
  --steps=150000 \
  --save_freq=20000 \
  --log_freq=1000 \
  --batch_size=16 \
  --policy.device=cuda \
  --policy.n_bins=512 \
  --policy.block_temporal_size=4 \
  --policy.n_diffusion_steps=2 \
  --policy.gripper_dims='[-1]' \
  --policy.chunk_size=16 \
  --policy.n_action_steps=16
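
The single-GPU and multi-GPU commands use different --batch_size values, so it can help to compare how many samples each run consumes. The sketch below assumes that --batch_size is applied per process (the usual convention when launching with accelerate) and that no gradient accumulation is used; neither is stated in this card. The helper names are hypothetical.

```python
def effective_batch_size(per_process_batch_size: int, num_processes: int) -> int:
    """Samples consumed per optimizer step across all processes."""
    # Assumption: --batch_size is per process, with no gradient accumulation.
    return per_process_batch_size * num_processes


def samples_seen(steps: int, per_process_batch_size: int, num_processes: int) -> int:
    """Total training samples drawn over a full run."""
    return steps * effective_batch_size(per_process_batch_size, num_processes)


# Values taken from the two training commands in this card.
single_gpu = samples_seen(steps=150_000, per_process_batch_size=4, num_processes=1)
multi_gpu = samples_seen(steps=150_000, per_process_batch_size=16, num_processes=4)
print(single_gpu, multi_gpu)
```

Under these assumptions the multi-GPU command draws 16 times as many samples as the single-GPU command for the same --steps, so the two runs are not directly comparable.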

Evaluate the policy/run inference

uv run python src/lerobot/scripts/lerobot_eval.py \
  --policy.path=$CKPT_DIR \
  --env.type=libero \
  --env.task=libero_10 \
  --eval.n_episodes=50 \
  --eval.batch_size=1 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  --policy.n_action_steps=12 \
  --policy.n_diffusion_steps=2 \
  --policy.compile_model=true

Use --policy.path to point to a local or Hub checkpoint.
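
The evaluation command sets --policy.n_action_steps=12 while the training commands use --policy.chunk_size=16. A minimal sketch of what that implies, assuming the common action-chunking convention in which each inference call predicts a full chunk, the first n_action_steps actions are executed, and the rest are discarded before re-planning. The function name and the 520-step episode length are hypothetical.

```python
import math


def plan_inference_calls(
    episode_len: int, chunk_size: int = 16, n_action_steps: int = 12
) -> tuple[int, int]:
    """Return (inference calls per episode, predicted actions discarded per call)."""
    if not 1 <= n_action_steps <= chunk_size:
        raise ValueError("n_action_steps must be in [1, chunk_size]")
    # Each call supplies n_action_steps executed actions; the last call may be partial.
    calls = math.ceil(episode_len / n_action_steps)
    discarded_per_call = chunk_size - n_action_steps
    return calls, discarded_per_call


calls, discarded = plan_inference_calls(episode_len=520)
print(calls, discarded)
```

A smaller n_action_steps re-plans more often at the cost of more inference calls; setting it equal to chunk_size executes every predicted action.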

Model Details

Model type: Vision-Language-Action policy
Architecture: Block Diffusion VLA
VLM backbone: Qwen/Qwen3-VL-2B-Instruct
License: apache-2.0

Architecture

TBD-VLA contains the following main components:

Component	Description
VLM backbone	Qwen3-VL model used for vision-language conditioning
Action tokenizer	Discretizes continuous robot actions into token bins
Block denoising module	Performs block-temporal denoising over action chunks
Pre/post-processors	Handle normalization, device transfer, and action conversion
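
To make the action tokenizer row concrete, here is a minimal sketch of discretizing a continuous action value into one of n_bins tokens and mapping it back. It assumes uniform bins over actions normalized to [-1, 1]; the actual binning scheme and value range used by TBD-VLA are not specified in this card, and the function names are hypothetical.

```python
def encode(value: float, n_bins: int = 512, low: float = -1.0, high: float = 1.0) -> int:
    """Map a continuous action value to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    width = (high - low) / n_bins
    # The upper edge would fall into bin n_bins, so clamp it into the last bin.
    return min(int((clipped - low) / width), n_bins - 1)


def decode(token: int, n_bins: int = 512, low: float = -1.0, high: float = 1.0) -> float:
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width


token = encode(0.3)
print(token, decode(token))
```

With uniform bins the round-trip error is at most half a bin width, which is 2 / 512 / 2 for the default of 512 bins on [-1, 1].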

Files

File	Description
`configuration_tbdvla.py`	`TBDVLAConfig` dataclass with policy hyperparameters
`modeling_tbdvla.py`	`TBDVLAPolicy` implementation, including model, training loss, and inference
`processor_tbdvla.py`	Pre/post-processing pipelines for normalization and device transfer
`__init__.py`	Exports `TBDVLAConfig`, `TBDVLAPolicy`, and `make_tbdvla_pre_post_processors`

Key Parameters

Model Architecture

Parameter	Description	Default
`--policy.vlm_checkpoint`	Qwen3-VL model ID	`Qwen/Qwen3-VL-2B-Instruct`
`--policy.num_vlm_layers`	Number of VLM layers to use (-1 = all)	-1

Diffusion / Block Denoising

Parameter	Description	Default
`--policy.block_temporal_size`	Temporal steps per block	4
`--policy.n_diffusion_steps`	Number of denoising steps at inference	2
`--policy.chunk_size`	Action chunk length (must be a multiple of `block_temporal_size`)	16
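
The relationship between these three parameters can be sketched as follows. The partition of a chunk into temporal blocks follows directly from the table. The count of denoising passes additionally assumes that blocks are denoised one after another with n_diffusion_steps passes each, which this card does not confirm. The function name is hypothetical.

```python
def split_into_blocks(chunk_size: int = 16, block_temporal_size: int = 4) -> list[range]:
    """Partition the timestep indices of one action chunk into temporal blocks."""
    if chunk_size % block_temporal_size != 0:
        raise ValueError("chunk_size must be a multiple of block_temporal_size")
    return [
        range(start, start + block_temporal_size)
        for start in range(0, chunk_size, block_temporal_size)
    ]


blocks = split_into_blocks()
n_diffusion_steps = 2

# Assumption: each block gets n_diffusion_steps denoising passes, block by block.
denoising_passes = len(blocks) * n_diffusion_steps
print([list(b) for b in blocks], denoising_passes)
```

With the defaults, a 16-step chunk splits into 4 blocks of 4 timesteps each.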

Training Hyperparameters

Parameter	Description	Default
`--policy.n_bins`	Number of action discretization bins	512
`--policy.n_obs_steps`	Number of observation steps (only 1 supported)	1
`--policy.max_task_tokens`	Max task/language tokens fed to the VLM	64
`--policy.use_state`	Include proprioceptive state input	true
`--policy.state_dropout_p`	Dropout probability for state input	0.0
`--policy.image_resolution`	Resize images to this resolution	256,256
`--policy.gradient_checkpointing`	Enable gradient checkpointing (saves VRAM)	false
`--policy.precision`	Training precision (`float16`, `bfloat16`, `float32`)	`bfloat16`
`--policy.attn_implementation`	Attention backend (`eager`, `sdpa`, `flex_attention`)	`sdpa`
`--policy.optimizer_lr`	AdamW learning rate	1e-4

Inference Hyperparameters

Parameter	Description	Default
`--policy.n_action_steps`	Action steps executed per inference call (must be <= `chunk_size`)	12
`--policy.gripper_dims`	Gripper dimension indices	[-1]
`--policy.expectation_sample`	Use expectation-based sampling	true
`--policy.compile_model`	Wrap the VLM forward in `torch.compile`	false
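
The card does not define expectation-based sampling. One plausible reading, shown here purely as an assumption, is that a continuous action is decoded as the probability-weighted mean of the bin centers rather than as the center of the most likely bin. The sketch uses 4 bins on [-1, 1] for readability, and all function names are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over per-bin logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(n_bins: int, low: float = -1.0, high: float = 1.0) -> list[float]:
    """Centers of n_bins uniform bins on [low, high] (uniform binning is assumed)."""
    width = (high - low) / n_bins
    return [low + (i + 0.5) * width for i in range(n_bins)]


def decode_argmax(logits: list[float], centers: list[float]) -> float:
    """Pick the center of the most likely bin (first one on ties)."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return centers[best]


def decode_expectation(logits: list[float], centers: list[float]) -> float:
    """Probability-weighted mean of the bin centers."""
    return sum(p * c for p, c in zip(softmax(logits), centers))


centers = bin_centers(4)
logits = [0.0, 0.0, 1.0, 1.0]
print(decode_argmax(logits, centers), decode_expectation(logits, centers))
```

In this reading, argmax decoding snaps to a single bin center, while expectation decoding blends neighboring bins and can return values between centers.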

BibTeX

@article{lee2026tbdvlatemporalblockdiffusion,
      title={TBD-VLA: Temporal Block Diffusion Vision Language Action Model},
      author={Lee, Sung-Wook and Kang, Xuhui and Kuo, Yen-Ling},
      journal={arXiv preprint},
      year={2026},
      url={https://arxiv.org/abs/2606.07895}
}