EGSPO & EGSPO-SA: Entropy-Guided Stepwise Policy Optimization for Diffusion LLMs

This repository contains task-specific checkpoints for the EGSPO and EGSPO-SA methods from our paper, fine-tuned from the Dream-7B-Instruct base model:

"Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages"

📄 Paper (arXiv:2603.12554) | 💻 Code (GitHub)

Overview

We introduce principled RL methods for diffusion language models that:

Formulate diffusion generation as a finite-horizon MDP
Derive exact policy gradients decomposed over denoising steps
Use entropy-guided step selection for compute-efficient training
Estimate stepwise advantages via one-step denoising completions
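
As a rough illustration of the first two points, the standard score-function (REINFORCE) gradient for a finite-horizon MDP with a terminal reward can be written as below. The notation is assumed for this sketch and may differ from the paper's: $q$ is the prompt, $x_T, \dots, x_0$ are the partially masked states from fully masked to fully decoded, and $R$ is the task reward.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{q,\; x_{T:0} \sim \pi_\theta}
    \left[ \sum_{t=1}^{T}
      \nabla_\theta \log \pi_\theta\!\left(x_{t-1} \mid x_t, q\right)\, R(x_0, q)
    \right]
```

Each term of the sum belongs to one denoising step, which is what allows training on a selected subset of steps instead of the full trajectory.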

Base Model: Dream-7B-Instruct (Diffusion Language Model adapted from Qwen2.5 via continued pretraining)

Available Checkpoints

We provide checkpoints for two methods across four tasks:

Task	EGSPO	EGSPO-SA	Best score (EGSPO / EGSPO-SA)
Sudoku	✅ `sudoku/egspo`	✅ `sudoku/egspo-sa`	90.3% / 90.6%
Countdown	✅ `countdown/egspo`	✅ `countdown/egspo-sa`	76.6% / 82.4%
GSM8K	✅ `gsm8k/egspo`	✅ `gsm8k/egspo-sa`	82.3% / 86.1%
MATH500	✅ `math/egspo`	✅ `math/egspo-sa`	50.4% / 47.8%

GSM8K and MATH500 are evaluated with 1-token/step decoding to match the ESPO baseline protocol; our models are trained with 2-token/step decoding.

Method Comparison

EGSPO: Entropy-guided step selection with sequence-level advantages
EGSPO-SA: EGSPO + stepwise advantages for improved credit assignment

EGSPO-SA matches or outperforms EGSPO on Sudoku, Countdown, and GSM8K, with the largest gain on Countdown (76.6% to 82.4%); on MATH500, EGSPO scores higher (50.4% vs. 47.8%).
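
To make "sequence-level advantages" concrete, here is a minimal sketch that assumes a group-mean baseline over the completions sampled for one prompt. That baseline is a common choice in group-based RL for language models and is an assumption here, not a statement of the paper's exact estimator; the function name and the toy rewards are hypothetical.

```python
from statistics import mean


def sequence_level_advantages(rewards):
    """Return one advantage per completion: its reward minus the group mean.

    Assumption: a group-mean baseline; the paper's estimator may differ.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Four hypothetical completions for one prompt, rewarded 1.0 when correct.
advantages = sequence_level_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

With a sequence-level advantage, every denoising step of a completion shares the same scalar; stepwise advantages instead assign a separate value to each selected step.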

Usage

Loading a Checkpoint

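from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "Dream-org/Dream-v0-Instruct-7B"

# Load the base model. Dream ships custom modeling code on the Hub,
# so it is loaded through AutoModel with trust_remote_code=True.
base_model = AutoModel.from_pretrained(
    BASE_MODEL,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Attach the EGSPO-SA LoRA adapter for GSM8K
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-dream-7b",
    subfolder="gsm8k/egspo-sa",
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)

# Example inference
prompt = "Solve: John has 5 apples and gives 2 to Mary. How many does he have left?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Dream decodes by iterative denoising via diffusion_generate (defined in its
# remote code) rather than the autoregressive generate(). 256 new tokens over
# 128 steps corresponds to the 2-tokens-per-step setting used in training.
outputs = model.diffusion_generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=256,
    steps=128,
    temperature=0.9,
    return_dict_in_generate=True,
)

print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))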

Loading Different Tasks/Methods

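# Use a freshly loaded base_model for each adapter: PeftModel.from_pretrained
# injects the LoRA layers into base_model in place.

# EGSPO for Sudoku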
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-dream-7b",
    subfolder="sudoku/egspo"
)

# EGSPO-SA for Countdown
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-dream-7b",
    subfolder="countdown/egspo-sa"
)
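
The subfolder names follow a `task/method` pattern, as listed in the checkpoint table. A small helper along these lines can guard against typos before a download is attempted; `adapter_subfolder` is a hypothetical name for this sketch and is not part of the repository.

```python
# Folder names taken from the "Available Checkpoints" table.
TASKS = ("sudoku", "countdown", "gsm8k", "math")
METHODS = ("egspo", "egspo-sa")


def adapter_subfolder(task: str, method: str) -> str:
    """Build the `subfolder` argument for a given task and method."""
    task, method = task.lower(), method.lower()
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {METHODS}")
    return f"{task}/{method}"


print(adapter_subfolder("GSM8K", "EGSPO-SA"))
```

The returned string can be passed as `subfolder=` to `PeftModel.from_pretrained`. Note that the MATH500 checkpoints live under `math/`, not `math500/`.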

Performance Results

Comparison with Baselines

Method	Sudoku	Countdown	GSM8K	MATH500
Dream-7B-Instruct	14.0	17.4	81.3	48.0
ESPO	72.3	68.8	82.3	50.3
EGSPO (Ours)	90.3	76.6	82.3	50.4
EGSPO-SA (Ours)	90.6	82.4	86.1	47.8

All results are 0-shot scores (%) at the best generation length among 128, 256, and 512 tokens. GSM8K and MATH500 use 1-token/step decoding to match ESPO's evaluation protocol.
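
The decoding rate fixes the number of denoising steps for a given generation length. The sketch below assumes tokens are unmasked at a constant rate, so that steps = length / tokens per step; `denoising_steps` is a hypothetical helper name.

```python
def denoising_steps(gen_length: int, tokens_per_step: int) -> int:
    """Number of denoising steps needed to decode gen_length tokens.

    Assumption: a constant number of tokens is unmasked at every step.
    """
    if tokens_per_step <= 0 or gen_length % tokens_per_step:
        raise ValueError("gen_length must be a positive multiple of tokens_per_step")
    return gen_length // tokens_per_step


# Evaluation lengths from the note above, at 1 and 2 tokens per step.
for gen_length in (128, 256, 512):
    print(gen_length, denoising_steps(gen_length, 1), denoising_steps(gen_length, 2))
```

For example, the training setting of 256 tokens at 2 tokens per step gives 128 steps, while evaluating the same length at 1 token per step doubles that to 256.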

Training Details

Hyperparameters

Base Model: Dream-7B-Instruct
Fine-tuning: LoRA (rank=128, α=64)
Optimizer: Adam (β₁=0.9, β₂=0.99, weight_decay=0.1)
Learning Rate: 3×10⁻⁵
Completions per prompt (G): 16
Denoising Steps: 128 (training)
Step Budget (K): 8 high-entropy steps per trajectory
Temperature: 0.9 (0.2 for Countdown)
Generation Length: 256 tokens (training)
Decoding: 2 tokens per denoising step
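
For reference, the values above can be gathered into a single config object. This is only a sketch: the class and field names are hypothetical and do not come from the training code, and the two derived properties are simple arithmetic on the listed values.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the hyperparameter list; names are illustrative.
    lora_rank: int = 128
    lora_alpha: int = 64
    learning_rate: float = 3e-5
    adam_betas: tuple = (0.9, 0.99)
    weight_decay: float = 0.1
    completions_per_prompt: int = 16
    denoising_steps: int = 128
    step_budget: int = 8
    temperature: float = 0.9
    gen_length: int = 256

    @property
    def lora_scale(self) -> float:
        """Standard LoRA scaling factor alpha / rank."""
        return self.lora_alpha / self.lora_rank

    @property
    def selected_step_fraction(self) -> float:
        """Share of denoising steps that receive a gradient update."""
        return self.step_budget / self.denoising_steps


default_cfg = TrainConfig()
countdown_cfg = replace(default_cfg, temperature=0.2)  # Countdown override
print(default_cfg.lora_scale, default_cfg.selected_step_fraction)
```

With these values, the LoRA update is scaled by 0.5 and only 8 of 128 steps (6.25%) are trained per trajectory. When wiring this into real training code, `lora_rank` and `lora_alpha` would map to `r` and `lora_alpha` in `peft.LoraConfig`, and the optimizer values to `lr`, `betas`, and `weight_decay` in `torch.optim.Adam`.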

Method-Specific Settings

EGSPO:

Entropy-guided step selection
Sequence-level advantages (λ = 0)
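
Entropy-guided step selection can be sketched as a top-K choice over per-step entropy scores. How a step's entropy is aggregated (for example, averaged over the positions still masked) is not specified in this card, so the sketch assumes one scalar per step is already available; both function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy in nats of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_high_entropy_steps(step_entropies, k):
    """Return the indices of the k highest-entropy steps, in trajectory order."""
    ranked = sorted(
        range(len(step_entropies)),
        key=lambda t: step_entropies[t],
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy trajectory with 6 denoising steps and a budget of K = 3.
step_entropies = [0.10, 1.20, 0.05, 0.90, 1.50, 0.30]
print(select_high_entropy_steps(step_entropies, k=3))  # [1, 3, 4]
```

In the training setup described above, the same selection would pick K = 8 steps out of 128, and only those steps would contribute to the policy-gradient update.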

EGSPO-SA:

Entropy-guided step selection
Stepwise advantages with one-step denoising baseline
λ schedule per task:
- Sudoku: λ = 1.0 (constant)
- Countdown: λ = 0.2, halved every 500 steps
- GSM8K: λ = 0.5, halved every 500 steps
- MATH500: λ = 0.5, halved every 500 steps
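
The per-task λ schedules can be expressed as a small lookup plus a decay rule. The sketch assumes a staircase schedule, where λ is halved at each multiple of 500 training steps rather than decayed smoothly; the names `LAMBDA_SCHEDULES` and `lambda_at` are hypothetical.

```python
# (initial lambda, halving interval in training steps or None for constant)
LAMBDA_SCHEDULES = {
    "sudoku": (1.0, None),
    "countdown": (0.2, 500),
    "gsm8k": (0.5, 500),
    "math": (0.5, 500),
}


def lambda_at(task: str, step: int) -> float:
    """Value of lambda for a task at a given training step.

    Assumption: staircase decay, halved at every multiple of the interval.
    """
    initial, halve_every = LAMBDA_SCHEDULES[task]
    if halve_every is None:
        return initial
    return initial * 0.5 ** (step // halve_every)


for step in (0, 499, 500, 1000):
    print(step, lambda_at("gsm8k", step), lambda_at("sudoku", step))
```

Under this reading, GSM8K starts at λ = 0.5, drops to 0.25 at step 500 and to 0.125 at step 1000, while Sudoku stays at 1.0 throughout.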

Citation

If you use these checkpoints, please cite our paper:

@article{kunde2026Reinforcement,
  title={Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages},
  author={Kunde, Vishnu Teja and Doudi, Fatemeh and Farahbakhsh, Mahdi and Kalathil, Dileep and Narayanan, Krishna and Chamberland, Jean-Francois},
  journal={arXiv preprint arXiv:2603.12554},
  year={2026}
}

License

Apache 2.0

Contact

For questions or issues:

Open an issue on GitHub
Contact: Fatemeh Doudi (fatemehdoudi@tamu.edu)

Paper: arXiv:2603.12554 (published Mar 13)