PasoDoble: Better LLM Reasoning via Dual-Play

This repository hosts models developed within the PasoDoble framework, a novel LLM dual-play approach presented in the paper Better LLM Reasoning via Dual-Play.

PasoDoble is designed to improve the reasoning performance of Large Language Models (LLMs) by adversarially training two models: a "Proposer", which generates challenging questions with ground-truth answers, and a "Solver", which attempts to solve them. The framework lets the two models learn from each other iteratively, fostering sustained competition and mutual evolution and thereby reducing reliance on external supervision.

Project Page: https://hcy123902.github.io/PasoDoble
Code Repository: https://github.com/HCY123902/PasoDoble

Abstract Summary

PasoDoble addresses LLMs' reliance on external supervision by introducing a dual-play adversarial learning framework. It trains a Proposer to generate challenging questions with ground-truth answers and a Solver to solve them. The Proposer is rewarded for generating valid, difficult questions, while the Solver is rewarded for correct answers; both are updated jointly to prevent reward hacking. An optional offline paradigm further enhances training stability. This dual-play approach improves LLM reasoning performance without external supervision during training.
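
To make the reward structure concrete, below is a minimal, hypothetical sketch of one dual-play round. The function and method names, the number of Solver attempts, and the exact reward shaping are illustrative assumptions, not the paper's implementation.

# Hypothetical sketch of one PasoDoble round (illustrative only; the actual
# reward shaping and update rules are defined in the paper and repository).

def dual_play_round(proposer, solver, n_attempts=8):
    # The Proposer emits a question together with its ground-truth answer.
    question, answer = proposer.generate_question()

    # The Solver makes several attempts; its empirical accuracy proxies difficulty.
    attempts = [solver.solve(question) for _ in range(n_attempts)]
    accuracy = sum(a == answer for a in attempts) / n_attempts

    # Solver reward: correctness of its answers.
    solver_reward = accuracy

    # Proposer reward: the question must be valid (solvable at least once)
    # yet difficult, so harder valid questions earn more.
    proposer_reward = (1.0 - accuracy) if accuracy > 0.0 else 0.0

    # Both models are updated jointly on these rewards to deter reward hacking.
    return proposer_reward, solver_reward

Here `proposer` and `solver` are assumed to expose `generate_question()` and `solve()` methods; in the actual framework both roles are LLMs trained with reinforcement learning.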

Setup

To explore the PasoDoble project's core implementation and reproduce experiments, follow these setup steps:

conda create -n pasodoble python=3.10.16
conda activate pasodoble

git clone https://github.com/HCY123902/PasoDoble.git
cd PasoDoble
pip install -r requirements.txt

# Install flash-attention separately
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# (Optional) If your current binutils version is lower than 2.38, upgrade with
conda install -c conda-forge binutils=2.40

# Create the directory used to store training history records
mkdir history_record
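
After installation, a quick sanity check (our own snippet, not part of the repository) confirms that the key dependencies import cleanly:

# Environment sanity check (not part of the PasoDoble repository).
import torch
import transformers
import flash_attn

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"transformers {transformers.__version__}")
print(f"flash-attn {flash_attn.__version__}")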

Sample Usage

The PasoDoble models can be loaded and used with the transformers library for text generation. Below is an example using the PasoDoble-Cornell/Qwen2.5-3b-solver-online model. Remember to replace model_id with the specific model checkpoint you intend to use.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "PasoDoble-Cornell/Qwen2.5-3b-solver-online" # Example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model in bfloat16 and place it automatically on available devices
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "What is the capital of France?"},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Passing the full tokenizer output also supplies the attention mask
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id
)
# Strip the prompt tokens, keeping only the newly generated completion
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
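
Since the Solver checkpoints are trained for reasoning, a math-style prompt is more representative of their intended use. The snippet below reuses the tokenizer and model loaded above; the prompt and generation settings are our own illustrative choices, not official recommendations.

# Reuses `tokenizer` and `model` from the example above.
messages = [
    {"role": "user", "content": "If 3x + 7 = 22, what is x? Show your reasoning."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Greedy decoding with a larger token budget so the reasoning chain is not truncated
generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.batch_decode(
    generated_ids[:, model_inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)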

For more details on training and advanced usage, please refer to the official GitHub repository.

Trained Checkpoints

The following PasoDoble Solver checkpoints are available:

| Model | Training | Download |
|---|---|---|
| PasoDoble Qwen2.5-0.5B | online | 🤗 HuggingFace |
| PasoDoble Qwen2.5-0.5B | offline | 🤗 HuggingFace |
| PasoDoble Qwen2.5-1.5B | online | 🤗 HuggingFace |
| PasoDoble Qwen2.5-1.5B | offline | 🤗 HuggingFace |
| PasoDoble Qwen2.5-3B | online | 🤗 HuggingFace |
| PasoDoble Qwen2.5-3B | offline | 🤗 HuggingFace |
| PasoDoble Qwen3-0.6B | online | 🤗 HuggingFace |
| PasoDoble Qwen3-0.6B | offline | 🤗 HuggingFace |
| PasoDoble Qwen3-1.7B | online | 🤗 HuggingFace |
| PasoDoble Qwen3-1.7B | offline | 🤗 HuggingFace |
| PasoDoble Qwen3-4B | online | 🤗 HuggingFace |
| PasoDoble Qwen3-4B | offline | 🤗 HuggingFace |
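
Checkpoints can also be fetched programmatically with huggingface_hub. The repo ID below is the example model from the usage section; we assume the other checkpoints follow the same naming pattern.

# Download a checkpoint locally (repo ID taken from the example above;
# other checkpoints are assumed to follow the same naming pattern).
from huggingface_hub import snapshot_download

local_dir = snapshot_download("PasoDoble-Cornell/Qwen2.5-3b-solver-online")
print(f"Checkpoint files downloaded to {local_dir}")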

Citation

If you find PasoDoble useful for your research, please cite our paper:

@article{zhang2025pasodoble,
  title={Better LLM Reasoning via Dual-Play},
  author={Zhengxin Zhang and Chengyu Huang and Aochong Oliver Li and Claire Cardie},
  eprint={2511.11881},
  archivePrefix={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2511.11881}
}