EGSPO & EGSPO-SA: Entropy-Guided Stepwise Policy Optimization for Diffusion LLMs

This repository contains task-specific checkpoints for EGSPO and EGSPO-SA methods from our paper for DREAM-7B base model:

"Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages"

📄 Paper (arXiv:2603.12554) | 💻 Code (GitHub)

Overview

We introduce principled RL methods for diffusion language models that:

  • Formulate diffusion generation as a finite-horizon MDP
  • Derive exact policy gradients decomposed over denoising steps
  • Use entropy-guided step selection for compute-efficient training
  • Estimate stepwise advantages via one-step denoising completions

Base Model: Dream-7B-Instruct (Diffusion Language Model adapted from Qwen2.5 via continued pretraining)

Available Checkpoints

We provide checkpoints for two methods across four tasks:

Task EGSPO EGSPO-SA Best Performance
Sudoku sudoku/egspo sudoku/egspo-sa 90.3% / 90.6%
Countdown countdown/egspo countdown/egspo-sa 76.6% / 82.4%
GSM8K gsm8k/egspo gsm8k/egspo-sa 82.3% / 86.1%
MATH500 math/egspo math/egspo-sa 50.4% / 47.8%

GSM8K and MATH500 are evaluated with 1-token/step decoding to match the ESPO baseline protocol; our models are trained with 2-token/step decoding.

Method Comparison

  • EGSPO: Entropy-guided step selection with sequence-level advantages
  • EGSPO-SA: EGSPO + stepwise advantages for improved credit assignment

EGSPO-SA generally achieves better performance, especially on logical reasoning tasks.

Usage

Loading a Checkpoint

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B",
    torch_dtype="auto",
    device_map="auto"
)

# Load EGSPO-SA checkpoint for GSM8K
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-dream-7b",
    subfolder="gsm8k/egspo-sa"
)

tokenizer = AutoTokenizer.from_pretrained("Dream-org/Dream-v0-Instruct-7B")

# Example inference
prompt = "Solve: John has 5 apples and gives 2 to Mary. How many does he have left?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_length=256,
    temperature=0.9,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Loading Different Tasks/Methods

# EGSPO for Sudoku
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-dream-7b",
    subfolder="sudoku/egspo"
)

# EGSPO-SA for Countdown
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-dream-7b",
    subfolder="countdown/egspo-sa"
)

Performance Results

Comparison with Baselines

Method Sudoku Countdown GSM8K MATH500
Dream-7B-Instruct 14.0 17.4 81.3 48.0
ESPO 72.3 68.8 82.3 50.3
EGSPO (Ours) 90.3 76.6 82.3 50.4
EGSPO-SA (Ours) 90.6 82.4 86.1 47.8

All results are 0-shot at the best generation length (128/256/512). GSM8K and MATH500 use 1-token/step decoding to match ESPO's evaluation protocol.

Training Details

Hyperparameters

  • Base Model: Dream-7B-Instruct
  • Fine-tuning: LoRA (rank=128, α=64)
  • Optimizer: Adam (β₁=0.9, β₂=0.99, weight_decay=0.1)
  • Learning Rate: 3×10⁻⁵
  • Completions per prompt (G): 16
  • Denoising Steps: 128 (training)
  • Step Budget (K): 8 high-entropy steps per trajectory
  • Temperature: 0.9 (0.2 for Countdown)
  • Generation Length: 256 tokens (training)
  • Decoding: 2 tokens per denoising step

Method-Specific Settings

EGSPO:

  • Entropy-guided step selection
  • Sequence-level advantages (λ = 0)

EGSPO-SA:

  • Entropy-guided step selection
  • Stepwise advantages with one-step denoising baseline
  • λ schedule per task:
    • Sudoku: λ = 1.0 (constant)
    • Countdown: λ = 0.2, halved every 500 steps
    • GSM8K: λ = 0.5, halved every 500 steps
    • MATH500: λ = 0.5, halved every 500 steps

Citation

If you use these checkpoints, please cite our paper:

@article{kunde2026Reinforcement,
  title={Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages},
  author={Kunde, Vishnu Teja and Doudi, Fatemeh and Farahbakhsh, Mahdi and Kalathil, Dileep and Narayanan, Krishna and Chamberland, Jean-Francois},
  journal={arXiv preprint arXiv:2603.12554},
  year={2026}
}

License

Apache 2.0

Contact

For questions or issues:

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for fatemehdoudi97/egspo-dream-7b