# PasoDoble: Better LLM Reasoning via Dual-Play
This repository hosts models developed with the PasoDoble framework, a novel dual-play approach for LLMs presented in the paper *Better LLM Reasoning via Dual-Play*.

PasoDoble improves the reasoning performance of Large Language Models (LLMs) by adversarially training two models: a "Proposer", which generates challenging questions with ground-truth answers, and a "Solver", which attempts to solve them. Through this setup, the two models learn from each other iteratively, sustaining competition and mutual evolution while reducing reliance on external supervision.
- Project Page: https://hcy123902.github.io/PasoDoble
- Code Repository: https://github.com/HCY123902/PasoDoble
## Abstract Summary
PasoDoble addresses the reliance of LLMs on external supervision by introducing a dual-play adversarial learning framework. It trains a Proposer to generate challenging questions with ground-truth answers and a Solver to solve them. The Proposer is rewarded for generating valid, difficult questions, while the Solver is rewarded for correct answers, and both are updated jointly to prevent reward hacking. An optional offline paradigm further enhances training stability. This dual-play approach improves LLM reasoning performance without external supervision during training.
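To make the reward structure concrete, here is a minimal, illustrative sketch of a single dual-play update step. All names here (`generate_question`, `attempt`, `is_valid`, `update`) are hypothetical placeholders for this sketch, not the repository's actual API; see the code repository for the real training loop.

```python
# Illustrative sketch of one PasoDoble-style dual-play step (hypothetical API).
def dual_play_step(proposer, solver):
    # Proposer generates a question together with its ground-truth answer.
    question, answer = proposer.generate_question()

    # Solver attempts the question; correctness is checked against the answer.
    prediction = solver.attempt(question)
    solved = (prediction == answer)

    # Solver is rewarded for answering correctly.
    solver_reward = 1.0 if solved else 0.0

    # Proposer is rewarded for questions that are valid (well-formed, with a
    # consistent ground-truth answer) yet hard for the current Solver.
    proposer_reward = 1.0 if (proposer.is_valid(question, answer) and not solved) else 0.0

    # Both models are updated jointly, which the paper notes helps
    # prevent reward hacking.
    proposer.update(question, proposer_reward)
    solver.update(prediction, solver_reward)
```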
## Setup
To explore the PasoDoble project's core implementation and reproduce experiments, follow these setup steps:
```bash
conda create -n pasodoble python=3.10.16
conda activate pasodoble

git clone https://github.com/PasoDoble-Cornell/PasoDoble.git
cd PasoDoble
pip install -r requirements.txt

# Install flash-attention separately
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# (Optional) If your current binutils version is lower than 2.38, upgrade with:
conda install -c conda-forge binutils=2.40

mkdir history_record
```
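As a quick sanity check (not part of the original instructions), you can verify that the pinned flash-attention wheel imports cleanly against your PyTorch build:

```python
# Confirms the flash-attention wheel matches the installed torch/CUDA build.
import torch
import flash_attn

print(torch.__version__, flash_attn.__version__)
```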
## Sample Usage
The PasoDoble models can be loaded and used with the `transformers` library for text generation. Below is an example using the `PasoDoble-Cornell/Qwen2.5-3b-solver-online` model. Remember to replace `model_id` with the specific model checkpoint you intend to use.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "PasoDoble-Cornell/Qwen2.5-3b-solver-online"  # Example model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "What is the capital of France?"},
]

# Format the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)

# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
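Equivalently, the same checkpoint can be run through the high-level `pipeline` API. This is a standard `transformers` usage pattern (assuming a recent `transformers` version that accepts chat-format inputs in pipelines), not something specific to PasoDoble:

```python
from transformers import pipeline
import torch

# Same example checkpoint as above, loaded via the text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model="PasoDoble-Cornell/Qwen2.5-3b-solver-online",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
# The pipeline returns the full chat, with the assistant reply as the last turn.
print(pipe(messages, max_new_tokens=50)[0]["generated_text"][-1]["content"])
```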
For more details on training and advanced usage, please refer to the official GitHub repository.
## Trained Checkpoints
The following PasoDoble Solver checkpoints are available:
| Model | Training | Download |
|---|---|---|
| PasoDoble Qwen2.5-0.5B | online | 🤗 HuggingFace |
| PasoDoble Qwen2.5-0.5B | offline | 🤗 HuggingFace |
| PasoDoble Qwen2.5-1.5B | online | 🤗 HuggingFace |
| PasoDoble Qwen2.5-1.5B | offline | 🤗 HuggingFace |
| PasoDoble Qwen2.5-3B | online | 🤗 HuggingFace |
| PasoDoble Qwen2.5-3B | offline | 🤗 HuggingFace |
| PasoDoble Qwen3-0.6B | online | 🤗 HuggingFace |
| PasoDoble Qwen3-0.6B | offline | 🤗 HuggingFace |
| PasoDoble Qwen3-1.7B | online | 🤗 HuggingFace |
| PasoDoble Qwen3-1.7B | offline | 🤗 HuggingFace |
| PasoDoble Qwen3-4B | online | 🤗 HuggingFace |
| PasoDoble Qwen3-4B | offline | 🤗 HuggingFace |
## Citation
If you find PasoDoble useful for your research, please cite our paper:
```bibtex
@article{zhang2025pasodoble,
  title={Better LLM Reasoning via Dual-Play},
  author={Zhengxin Zhang and Chengyu Huang and Aochong Oliver Li and Claire Cardie},
  eprint={2511.11881},
  archivePrefix={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2511.11881}
}
```