EGSPO & EGSPO-SA: Entropy-Guided Stepwise Policy Optimization for Diffusion LLMs
This repository contains task-specific checkpoints for the EGSPO and EGSPO-SA methods, trained on the Dream-7B-Instruct base model, from our paper:
"Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages"
📄 Paper (arXiv:2603.12554) | 💻 Code (GitHub)
Overview
We introduce principled RL methods for diffusion language models that:
- Formulate diffusion generation as a finite-horizon MDP
- Derive exact policy gradients decomposed over denoising steps
- Use entropy-guided step selection for compute-efficient training
- Estimate stepwise advantages via one-step denoising completions
Base Model: Dream-7B-Instruct (Diffusion Language Model adapted from Qwen2.5 via continued pretraining)
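The entropy-guided step selection listed above can be illustrated with a minimal sketch. It assumes each denoising step is summarized by one scalar: the mean Shannon entropy over that step's predictive distributions. That aggregation is an assumption made for illustration, not the paper's definition. The function names are hypothetical and do not come from the released code.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(distributions):
    """Mean entropy over the distributions predicted at one denoising step (assumed aggregation)."""
    return sum(entropy(d) for d in distributions) / len(distributions)


def select_high_entropy_steps(step_entropies, k):
    """Return the indices of the k highest-entropy steps, in trajectory order."""
    order = sorted(range(len(step_entropies)),
                   key=lambda t: step_entropies[t],
                   reverse=True)
    return sorted(order[:k])


# Toy trajectory with 6 denoising steps and a budget of K = 3 steps.
toy_entropies = [0.1, 2.0, 0.3, 1.5, 0.05, 0.9]
selected = select_high_entropy_steps(toy_entropies, k=3)
print(selected)
```

Only the selected steps would then contribute gradient terms, which is where the compute savings come from: the training setup in this card keeps 8 of 128 steps per trajectory.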
Available Checkpoints
We provide checkpoints for two methods across four tasks:
| Task | EGSPO checkpoint | EGSPO-SA checkpoint | Best score (EGSPO / EGSPO-SA) |
|---|---|---|---|
| Sudoku | ✅ sudoku/egspo | ✅ sudoku/egspo-sa | 90.3% / 90.6% |
| Countdown | ✅ countdown/egspo | ✅ countdown/egspo-sa | 76.6% / 82.4% |
| GSM8K | ✅ gsm8k/egspo | ✅ gsm8k/egspo-sa | 82.3% / 86.1% |
| MATH500 | ✅ math/egspo | ✅ math/egspo-sa | 50.4% / 47.8% |
GSM8K and MATH500 are evaluated with 1-token/step decoding to match the ESPO baseline protocol; our models are trained with 2-token/step decoding.
Method Comparison
- EGSPO: Entropy-guided step selection with sequence-level advantages
- EGSPO-SA: EGSPO + stepwise advantages for improved credit assignment
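EGSPO-SA matches or outperforms EGSPO on three of the four tasks, with the largest gains on Countdown (+5.8 points) and GSM8K (+3.8 points). On MATH500, EGSPO scores higher (50.4% vs. 47.8%).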
Usage
Loading a Checkpoint
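from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

# Load the base model. Dream ships custom modeling code, so trust_remote_code=True
# is needed, and the Dream model card loads it through AutoModel.
base_model = AutoModel.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
# Load the EGSPO-SA LoRA adapter for GSM8K
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-dream-7b",
    subfolder="gsm8k/egspo-sa",
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B", trust_remote_code=True
)
# Example inference
prompt = "Solve: John has 5 apples and gives 2 to Mary. How many does he have left?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Dream decodes by iterative denoising via diffusion_generate (see the Dream model card),
# not the autoregressive generate. steps=128 with 256 new tokens gives 2 tokens per step.
outputs = model.diffusion_generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=256,
    steps=128,
    temperature=0.9,
    return_dict_in_generate=True,
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))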
Loading Different Tasks/Methods
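# PeftModel.from_pretrained injects LoRA layers into the model it receives,
# so reload base_model before attaching a different adapter.
# EGSPO for Sudoku
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-dream-7b",
    subfolder="sudoku/egspo",
)
# EGSPO-SA for Countdown (attach to a freshly loaded base_model)
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-dream-7b",
    subfolder="countdown/egspo-sa",
)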
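Because every checkpoint lives under a `task/method` subfolder, a small helper can build and validate the subfolder string before it is passed to `PeftModel.from_pretrained`. This is a sketch: the helper name and constants are hypothetical, and the valid values are taken from the checkpoint table above (note that the MATH500 folder is `math`).

```python
REPO_ID = "fatemehdoudi97/egspo-dream-7b"
TASKS = ("sudoku", "countdown", "gsm8k", "math")
METHODS = ("egspo", "egspo-sa")


def adapter_subfolder(task: str, method: str) -> str:
    """Return the repository subfolder for a task/method pair, e.g. 'gsm8k/egspo-sa'."""
    task, method = task.lower(), method.lower()
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {METHODS}")
    return f"{task}/{method}"


# Intended use (requires the base model and peft, so it is left as a comment):
# model = PeftModel.from_pretrained(
#     base_model, REPO_ID, subfolder=adapter_subfolder("math", "egspo")
# )
print(adapter_subfolder("GSM8K", "EGSPO-SA"))
```

Validating the pair up front turns a typo such as `math500/egspo` into an immediate `ValueError` instead of a failed download.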
Performance Results
Comparison with Baselines
| Method | Sudoku | Countdown | GSM8K | MATH500 |
|---|---|---|---|---|
| Dream-7B-Instruct | 14.0 | 17.4 | 81.3 | 48.0 |
| ESPO | 72.3 | 68.8 | 82.3 | 50.3 |
| EGSPO (Ours) | 90.3 | 76.6 | 82.3 | 50.4 |
| EGSPO-SA (Ours) | 90.6 | 82.4 | 86.1 | 47.8 |
All results are zero-shot scores (%) at the best generation length among 128, 256, and 512 tokens. GSM8K and MATH500 use 1-token/step decoding to match ESPO's evaluation protocol.
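To read the table as differences rather than absolute scores, the snippet below recomputes per-task gains in percentage points. The numbers are copied from the table above; the `gain` helper is a hypothetical convenience, not part of the released code.

```python
RESULTS = {
    "Dream-7B-Instruct": {"Sudoku": 14.0, "Countdown": 17.4, "GSM8K": 81.3, "MATH500": 48.0},
    "ESPO":              {"Sudoku": 72.3, "Countdown": 68.8, "GSM8K": 82.3, "MATH500": 50.3},
    "EGSPO":             {"Sudoku": 90.3, "Countdown": 76.6, "GSM8K": 82.3, "MATH500": 50.4},
    "EGSPO-SA":          {"Sudoku": 90.6, "Countdown": 82.4, "GSM8K": 86.1, "MATH500": 47.8},
}


def gain(method: str, baseline: str) -> dict:
    """Per-task difference (percentage points) of `method` over `baseline`."""
    return {
        task: round(RESULTS[method][task] - RESULTS[baseline][task], 1)
        for task in RESULTS[method]
    }


for method in ("EGSPO", "EGSPO-SA"):
    print(method, "vs ESPO:", gain(method, "ESPO"))
```

Against ESPO, EGSPO-SA gains 18.3 points on Sudoku, 13.6 on Countdown, and 3.8 on GSM8K, and trails by 2.5 points on MATH500.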
Training Details
Hyperparameters
- Base Model: Dream-7B-Instruct
- Fine-tuning: LoRA (rank=128, α=64)
- Optimizer: Adam (β₁=0.9, β₂=0.99, weight_decay=0.1)
- Learning Rate: 3×10⁻⁵
- Completions per prompt (G): 16
- Denoising Steps: 128 (training)
- Step Budget (K): 8 high-entropy steps per trajectory
- Temperature: 0.9 (0.2 for Countdown)
- Generation Length: 256 tokens (training)
- Decoding: 2 tokens per denoising step
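The hyperparameters above can be collected into a single configuration object with a few consistency checks. This is a sketch only: the class and field names are hypothetical, and the check that denoising steps × tokens per step equals the generation length assumes every step unmasks exactly the same number of tokens. The `lora_scaling` property uses the standard LoRA factor α / rank.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EGSPOTrainConfig:  # hypothetical name, values from the list above
    lora_rank: int = 128
    lora_alpha: int = 64
    learning_rate: float = 3e-5
    adam_betas: tuple = (0.9, 0.99)
    weight_decay: float = 0.1
    completions_per_prompt: int = 16  # G
    denoising_steps: int = 128
    step_budget: int = 8              # K
    temperature: float = 0.9
    generation_length: int = 256
    tokens_per_step: int = 2

    def __post_init__(self):
        # Assumption: each denoising step unmasks exactly tokens_per_step tokens.
        if self.denoising_steps * self.tokens_per_step != self.generation_length:
            raise ValueError("denoising_steps * tokens_per_step must equal generation_length")
        if not 0 < self.step_budget <= self.denoising_steps:
            raise ValueError("step_budget must be in (0, denoising_steps]")

    @property
    def lora_scaling(self) -> float:
        """Standard LoRA scaling factor alpha / rank."""
        return self.lora_alpha / self.lora_rank

    @property
    def selected_step_fraction(self) -> float:
        """Fraction of denoising steps kept by the step budget."""
        return self.step_budget / self.denoising_steps


default_cfg = EGSPOTrainConfig()
countdown_cfg = replace(default_cfg, temperature=0.2)  # Countdown uses temperature 0.2
print(default_cfg.lora_scaling, default_cfg.selected_step_fraction)
```

With these values, the LoRA update is scaled by 0.5 and the step budget touches 8 / 128 = 6.25% of the denoising steps in each trajectory.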
Method-Specific Settings
EGSPO:
- Entropy-guided step selection
- Sequence-level advantages (λ = 0)
EGSPO-SA:
- Entropy-guided step selection
- Stepwise advantages with one-step denoising baseline
- λ schedule per task:
  - Sudoku: λ = 1.0 (constant)
  - Countdown: λ = 0.2, halved every 500 steps
  - GSM8K: λ = 0.5, halved every 500 steps
  - MATH500: λ = 0.5, halved every 500 steps
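The per-task λ schedules above amount to a step decay. The sketch below assumes the halving is applied discretely at every multiple of 500 training steps (rather than as a continuous decay), which the list does not spell out. The function and table names are hypothetical.

```python
# (initial lambda, halving interval in training steps; None means constant)
LAMBDA_SCHEDULES = {
    "sudoku": (1.0, None),
    "countdown": (0.2, 500),
    "gsm8k": (0.5, 500),
    "math": (0.5, 500),
}


def lambda_at(task: str, step: int) -> float:
    """Value of lambda for `task` at training step `step` (0-indexed)."""
    initial, halve_every = LAMBDA_SCHEDULES[task]
    if halve_every is None:
        return initial
    # Assumed discrete schedule: halve once per completed interval.
    return initial * 0.5 ** (step // halve_every)


for step in (0, 499, 500, 1000):
    print(step, lambda_at("gsm8k", step), lambda_at("sudoku", step))
```

Under this reading, GSM8K training uses λ = 0.5 for steps 0 to 499, 0.25 for steps 500 to 999, and 0.125 from step 1000, while Sudoku stays at 1.0 throughout.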
Citation
If you use these checkpoints, please cite our paper:
@article{kunde2026Reinforcement,
title={Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages},
author={Kunde, Vishnu Teja and Doudi, Fatemeh and Farahbakhsh, Mahdi and Kalathil, Dileep and Narayanan, Krishna and Chamberland, Jean-Francois},
journal={arXiv preprint arXiv:2603.12554},
year={2026}
}
License
Apache 2.0
Contact
For questions or issues:
- Open an issue on GitHub
- Contact: Fatemeh Doudi (fatemehdoudi@tamu.edu)