Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (arXiv:2305.18290)
We introduce Moreh AI Model Hub with AMD GPU, an AI model hosting platform powered by AMD MI250 GPUs. You can now test live inference of this model at Moreh AI Model Hub.
MoMo-72B-lora-1.8.7-DPO is trained via Direct Preference Optimization (DPO) with MoMo-72B-LoRA-V1.4 as its base model, with several hyperparameter optimizations.
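Conceptually, DPO optimizes the policy directly against pairwise preference data using the log-ratio objective from the paper referenced above. The following is a minimal PyTorch sketch of that loss for illustration only; the function name, tensor arguments, and the `beta` value are assumptions, not Moreh's actual training code (which runs on the MoAI platform).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratio of chosen vs. rejected completions under the trainable policy
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    # The same log-ratio under the frozen reference (SFT) model
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Loss decreases as the policy prefers the chosen answer more strongly than the reference does
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```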
MoMo-72B-LoRA-V1.4 is trained via Supervised Fine-Tuning (SFT) using LoRA, with the QWEN-72B model as its base model.
Note that we did not use any form of weight merging.
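As a rough illustration of the SFT-with-LoRA setup described above, a LoRA adapter can be attached to the base model along these lines. This is a minimal sketch using the `peft` library; the rank, alpha, dropout, and target-module names are assumed values, not the hyperparameters actually used for MoMo-72B-LoRA-V1.4, and `Qwen/Qwen-72B` is assumed to be the Hub id of the QWEN-72B base model.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA settings; the real hyperparameters are not published in this card.
lora_config = LoraConfig(
    r=16,                       # adapter rank (assumed)
    lora_alpha=32,              # scaling factor for the adapter update (assumed)
    lora_dropout=0.05,
    target_modules=["c_attn"],  # module names are architecture-specific (assumed here for QWEN's fused attention projection)
    task_type="CAUSAL_LM",
)

# Loading the 72B base model requires substantial GPU memory.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B", trust_remote_code=True)
# The base weights stay frozen and unmerged; only the small adapter matrices are trained.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```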
For the leaderboard submission, the trained weights are realigned for compatibility with Llama.
MoMo-72B is trained using Moreh's MoAI platform, which simplifies the training of large-scale models, and AMD's MI250 GPUs.
| Model | ARC | MMLU | TruthfulQA | GSM8K |
|---|---|---|---|---|
| V1.8.7 (result < 0.1, %) | TBU | TBU | 0.44 | 0.47 |
```python
# pip install transformers==4.35.2
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("moreh/MoMo-72B-lora-1.8.7-DPO")
model = AutoModelForCausalLM.from_pretrained(
    "moreh/MoMo-72B-lora-1.8.7-DPO",
    torch_dtype=torch.float16,  # half precision; the 72B weights are impractical in float32
    device_map="auto",          # requires `accelerate` to place shards across available GPUs
)
```
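Once loaded, the model can be queried with the standard `generate` API; the prompt and decoding parameters below are illustrative only.

```python
# Example prompt; adjust the decoding parameters as needed.
prompt = "What is Direct Preference Optimization?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```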