How to use sean1295/tbdvla_libero with LeRobot:
TBD-VLA (Temporal Block Diffusion Vision Language Action Model) is a Vision-Language-Action policy based on Block Discrete Denoising Diffusion. It uses a Qwen3-VL vision-language backbone and predicts robot action chunks through temporal-level block diffusion.
The model was introduced in the paper TBD-VLA: Temporal Block Diffusion Vision Language Action Model by Sung-Wook Lee, Xuhui Kang, and Yen-Ling Kuo.
This policy has been trained and pushed to the Hub using a LeRobot fork. See the full documentation here.
git clone https://github.com/TBD-VLA/lerobot.git
cd lerobot
uv python install 3.12
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[libero]"
uv pip install -U transformers
uv pip install -U accelerate
python src/lerobot/scripts/lerobot_train.py \
--policy.type=tbdvla \
--output_dir=$OUTPUT_DIR \
--dataset.repo_id=sean1295/libero_all \
--job_name=tbdvla_experiment \
--steps=150000 \
--batch_size=4 \
--save_freq=20000 \
--log_freq=1000 \
--policy.device=cuda \
--policy.n_bins=512 \
--policy.block_temporal_size=4 \
--policy.n_diffusion_steps=2 \
--policy.gripper_dims=[-1] \
--policy.chunk_size=16 \
--policy.n_action_steps=16 \
--policy.gradient_checkpointing=true \
--policy.push_to_hub=false \
--wandb.enable=false
Training writes checkpoints to the directory given by --output_dir, one every --save_freq steps.
accelerate launch --multi_gpu --num_processes=4 \
src/lerobot/scripts/lerobot_train.py \
--wandb.enable=false \
--num_workers=4 \
--policy.type=tbdvla \
--policy.push_to_hub=false \
--dataset.repo_id=sean1295/libero_all \
--steps=150000 \
--save_freq=20000 \
--log_freq=1000 \
--batch_size=16 \
--policy.device=cuda \
--policy.n_bins=512 \
--policy.block_temporal_size=4 \
--policy.n_diffusion_steps=2 \
--policy.gripper_dims=[-1] \
--policy.chunk_size=16 \
--policy.n_action_steps=16
uv run python src/lerobot/scripts/lerobot_eval.py \
--policy.path=$CKPT_DIR \
--env.type=libero \
--env.task=libero_10 \
--eval.n_episodes=50 \
--eval.batch_size=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
--policy.n_action_steps=12 \
--policy.n_diffusion_steps=2 \
--policy.compile_model=true
Use --policy.path to point to a local or Hub checkpoint.
TBD-VLA contains the following main components:
| Component | Description |
|---|---|
| VLM backbone | Qwen3-VL model used for vision-language conditioning |
| Action tokenizer | Discretizes continuous robot actions into token bins |
| Block denoising module | Performs block-temporal denoising over action chunks |
| Pre/post-processors | Handle normalization, device transfer, and action conversion |
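To make the action tokenizer row concrete, here is a minimal sketch of discretizing continuous actions into bins and mapping them back. It assumes uniform bins over actions normalized to [-1, 1]; the actual binning scheme in TBD-VLA may differ, and the `encode` and `decode` helpers are hypothetical names used only for illustration.

```python
N_BINS = 512  # matches the --policy.n_bins default


def encode(actions, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map continuous actions in [low, high] to integer bin indices.

    Assumption: uniform bins; values outside the range are clipped.
    """
    tokens = []
    for a in actions:
        clipped = min(max(a, low), high)
        scaled = (clipped - low) / (high - low)  # in [0, 1]
        # The upper edge (scaled == 1.0) falls into the last bin.
        tokens.append(min(int(scaled * n_bins), n_bins - 1))
    return tokens


def decode(tokens, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


actions = [-1.0, -0.25, 0.0, 0.999, 1.0]
tokens = encode(actions)
recovered = decode(tokens)
max_error = max(abs(r - a) for r, a in zip(recovered, actions))
```

Under these assumptions the round-trip error is bounded by half a bin width, which is 1/512 for 512 bins over a range of width 2.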
| File | Description |
|---|---|
| configuration_tbdvla.py | TBDVLAConfig dataclass with policy hyperparameters |
| modeling_tbdvla.py | TBDVLAPolicy implementation, including model, training loss, and inference |
| processor_tbdvla.py | Pre/post-processing pipelines for normalization and device transfer |
| __init__.py | Exports TBDVLAConfig, TBDVLAPolicy, and make_tbdvla_pre_post_processors |
| Parameter | Description | Default |
|---|---|---|
| --policy.vlm_checkpoint | Qwen3-VL model ID | Qwen/Qwen3-VL-2B-Instruct |
| --policy.num_vlm_layers | Number of VLM layers to use (-1 = all) | -1 |
| Parameter | Description | Default |
|---|---|---|
| --policy.block_temporal_size | Temporal steps per block | 4 |
| --policy.n_diffusion_steps | Number of denoising steps at inference | 2 |
| --policy.chunk_size | Action chunk length (must be a multiple of block_temporal_size) | 16 |
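The relationship between chunk_size and block_temporal_size can be illustrated with a short sketch. It only shows how a chunk divides into temporal blocks and why the divisibility constraint matters; it makes no claim about how the policy denoises each block, and `split_into_blocks` is a hypothetical helper, not part of the LeRobot fork.

```python
def split_into_blocks(chunk, block_temporal_size):
    """Split an action chunk into consecutive temporal blocks.

    Raises ValueError when the chunk length is not a multiple of
    block_temporal_size, mirroring the constraint on --policy.chunk_size.
    """
    if block_temporal_size <= 0:
        raise ValueError("block_temporal_size must be positive")
    if len(chunk) % block_temporal_size != 0:
        raise ValueError(
            f"chunk length {len(chunk)} is not a multiple of "
            f"block_temporal_size {block_temporal_size}"
        )
    return [
        chunk[i:i + block_temporal_size]
        for i in range(0, len(chunk), block_temporal_size)
    ]


# Defaults from the table: chunk_size=16, block_temporal_size=4.
chunk = list(range(16))  # stand-in for 16 action timesteps
blocks = split_into_blocks(chunk, 4)
```

With the defaults, a 16-step chunk yields four blocks of four timesteps each.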
| Parameter | Description | Default |
|---|---|---|
| --policy.n_bins | Number of action discretization bins | 512 |
| --policy.n_obs_steps | Number of observation steps (only 1 supported) | 1 |
| --policy.max_task_tokens | Max task/language tokens fed to the VLM | 64 |
| --policy.use_state | Include proprioceptive state input | true |
| --policy.state_dropout_p | Dropout probability for state input | 0.0 |
| --policy.image_resolution | Resize images to this resolution | 256,256 |
| --policy.gradient_checkpointing | Enable gradient checkpointing (saves VRAM) | false |
| --policy.precision | Training precision (float16, bfloat16, float32) | bfloat16 |
| --policy.attn_implementation | Attention backend (eager, sdpa, flex_attention) | sdpa |
| --policy.optimizer_lr | AdamW learning rate | 1e-4 |
| Parameter | Description | Default |
|---|---|---|
| --policy.n_action_steps | Steps executed per inference (must be <= chunk_size) | 12 |
| --policy.gripper_dims | Gripper dimension indices | [-1] |
| --policy.expectation_sample | Use expectation-based sampling | true |
| --policy.compile_model | Wrap the VLM forward in torch.compile | false |
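The interaction between n_action_steps and chunk_size can be sketched as a receding-horizon loop: the policy predicts a full chunk, only the first n_action_steps actions are executed, and a new chunk is predicted once those are used up. This is a simplified illustration under that assumption; `run_episode` and `dummy_predict` are hypothetical names, and the real rollout logic lives in the LeRobot evaluation script.

```python
from collections import deque


def run_episode(predict_chunk, n_env_steps, n_action_steps=12, chunk_size=16):
    """Execute n_env_steps actions, re-planning whenever the queue is empty."""
    if not 0 < n_action_steps <= chunk_size:
        raise ValueError("n_action_steps must be in (0, chunk_size]")
    queue = deque()
    executed = []
    n_calls = 0
    for t in range(n_env_steps):
        if not queue:
            chunk = predict_chunk(t)
            if len(chunk) != chunk_size:
                raise ValueError("policy returned a chunk of the wrong length")
            # Keep only the first n_action_steps; the tail is discarded.
            queue.extend(chunk[:n_action_steps])
            n_calls += 1
        executed.append(queue.popleft())
    return executed, n_calls


def dummy_predict(t):
    """Stand-in policy: tags each action with (prediction time, index)."""
    return [(t, k) for k in range(16)]


executed, n_calls = run_episode(dummy_predict, n_env_steps=30)
```

With 12 of 16 predicted actions executed per call, a 30-step rollout needs three policy calls (at steps 0, 12, and 24). A smaller n_action_steps re-plans more often at the cost of more inference calls.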
@article{lee2026tbdvlatemporalblockdiffusion,
title={TBD-VLA: Temporal Block Diffusion Vision Language Action Model},
author={Lee, Sung-Wook and Kang, Xuhui and Kuo, Yen-Ling},
journal={arXiv preprint},
year={2026},
url={https://arxiv.org/abs/2606.07895}
}
Base model
Qwen/Qwen3-VL-2B-Instruct