Add DPO support for DeepSpeed-Chat (#828)
* Add label_smoothing while calculating step2 DPO loss in DeepSpeed-Chat.

* Add training scripts for step2 DPO in DeepSpeed-Chat.

* Remove unused packages and format the code of step2 DPO in DeepSpeed-Chat.

* Update training scripts of step2 DPO in DeepSpeed-Chat.

* Follow upstream fixes.

* Update README.md for Step2 DPO finetuning.

* Add opt 350M training log demo for step 2 dpo finetuning in DeepSpeed-Chat.

* Address the formatting issue in step2 dpo finetuning in DeepSpeed-Chat.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
3 people authored Jan 6, 2025
1 parent 476f600 commit 1842b4f
Showing 12 changed files with 7,216 additions and 0 deletions.
@@ -0,0 +1,26 @@
# πŸ• Direct Preference Optimization (DPO) finetuning
[Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290) is a novel approach to preference learning, which directly optimizes the policy without explicit reward modeling or reinforcement learning. It leverages a specific parameterization of the reward model that enables the extraction of the corresponding optimal policy in closed form. By using a simple classification loss, DPO aligns language models with human preferences, avoiding the complexity and instability often associated with RLHF.
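
Concretely, DPO casts preference learning as binary classification on log-probability margins between the policy and a frozen reference model. The sketch below illustrates this loss, including the `label_smoothing` term mentioned in the commit message (which gives the conservative/cDPO variant); the function and tensor names are illustrative, not the exact identifiers used in the DeepSpeed-Chat code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, label_smoothing=0.0):
    """Minimal DPO loss sketch; inputs are per-sequence summed log-probs.

    With label_smoothing > 0 this is the conservative DPO (cDPO) variant,
    which treats preference labels as noisy with that probability.
    """
    # Log-ratio of policy vs. reference for chosen and rejected responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    # Binary classification loss on the margin; label smoothing softens it.
    loss = (-F.logsigmoid(logits) * (1 - label_smoothing)
            - F.logsigmoid(-logits) * label_smoothing)
    return loss.mean()
```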

As the paper's subtitle says, "Your Language Model is Secretly a Reward Model." Consequently, the training arguments and training process for DPO are largely the same as those for reward model finetuning, shown in [step2 "Reward Model (RM) finetuning"](../step2_reward_model_finetuning/README.md). After DPO training, you will have a model that is aligned with human preferences.

## πŸƒ How to train the model

We provide a training script for OPT-350m, which you can test by launching the following command:

```bash
training_scripts/opt/single_node/run_350m.sh
```

We also provide a training script for Llama 2, which you can test by launching the following command:

```bash
training_scripts/llama2/run_llama2_7b.sh
```

## πŸƒ How to evaluate the DPO checkpoint?

The DPO checkpoint is simply a language model, so it can be evaluated in the same way as in [step1 "Supervised Finetuning"](../step1_supervised_finetuning/README.md).
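
For instance, assuming the run saves a Hugging Face-format model directory as the other DeepSpeed-Chat steps do, the checkpoint can be loaded with `transformers` and prompted directly; the path below is a placeholder for your own output directory, and the Human/Assistant prompt format follows the datasets used in this pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at the output directory of your DPO run.
ckpt = "./output/dpo-opt-350m"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

prompt = "Human: How do I stay focused while studying?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```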

## πŸ’ Datasets

Because DPO treats the language model as a reward model, the dataset for DPO is in the same format as that used for reward model fine-tuning. Each item in the dataset includes one "chosen" and one "rejected" output for the same input.
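
As a concrete illustration, a single preference pair might look like the sketch below; the `prompt`/`chosen`/`rejected` field names mirror the Dahoas/rm-static-style datasets used in the reward model step, and the text itself is invented for the example.

```python
# One preference pair: a shared prompt with a preferred and a dispreferred response.
sample = {
    "prompt": "Human: What's a good way to learn a new language?\n\nAssistant:",
    "chosen": " Practice a little every day and focus on conversations you actually need.",
    "rejected": " Just memorize the dictionary.",
}
```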