πŸ‘¨β€πŸ‘¨β€πŸ‘§β€πŸ‘§ GRPO (#2565)
* init grpo [ci skip]

* initial version

* refine args defs

* model card

* initial doc

* fix badges

* fix spaces

* try link to super in doc

* temperature, fix indexing, and std=0.0

* grpo script for cli

* peft support

* move data preparation in `compute_loss`

* weird doc trial

* fix device and some logging

* unwrap_model_for_generation for distributed setting

* Compat with distrib training

* revert grpo config doc trial (didn't work)

* test

* allow model to be str and processing_class to be none; fix loss computation

* advantage is always 0.0: don't log

* fix peft not installed

* proper reward model for testing

* fix script for cli

* add trl grpo to cli doc

* test peft

* flush left

* fix reward calculation

* new reward model

* support any reward model

* fix reward processing class def

* log reward std

* fix reward logging

* fix grad computation

* skip embed layer in test

* remove optimizer_cls_and_kwargs

* improve GRPO default args

* reduce mem usage for grpo test

* reduce mem usage in test grpo

* reduce memory usage for test

* Fix the test

* remove redundant

* fix min version

* Update test_grpo_trainer.py

* Update test_grpo_trainer.py

* Fix test, finally found the solution!

* some doc

* Update doc-builder workflow to use specific commit sha

* more doc

* advantages

* drop cancel fo no grad

* logged metrics [ci skip]

* completion col is ignored [ci skip]

* fix latex

* double space? ~?

* try a latex fix

* with branch

* Empty commit

* Empty commit

* double space seems to be the solution
qgallouedec authored Jan 20, 2025
1 parent 88514d5 commit 0f5ffad
Showing 18 changed files with 975 additions and 14 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build_pr_documentation.yml
@@ -9,7 +9,7 @@ concurrency:

jobs:
build:
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@e4fcf608695cf4bddb8c7f4f72aa15fa14110a94
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -68,6 +68,8 @@
title: Online DPO
- local: gkd_trainer
title: GKD
- local: grpo_trainer
title: GRPO
- local: kto_trainer
title: KTO
- local: nash_md_trainer
1 change: 1 addition & 0 deletions docs/source/clis.mdx
@@ -7,6 +7,7 @@ Currently supported CLIs are:
#### Training commands

- `trl dpo`: fine-tune a LLM with DPO
- `trl grpo`: fine-tune a LLM with GRPO
- `trl kto`: fine-tune a LLM with KTO
- `trl sft`: fine-tune a LLM with SFT

1 change: 1 addition & 0 deletions docs/source/dataset_formats.mdx
@@ -270,6 +270,7 @@ Choosing the right dataset type depends on the task you are working on and the s
| [`CPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
| [`DPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
| [`GKDTrainer`] | [Prompt-completion](#prompt-completion) |
| [`GRPOTrainer`] | [Prompt-only](#prompt-only) |
| [`IterativeSFTTrainer`] | [Unpaired preference](#unpaired-preference) |
| [`KTOTrainer`] | [Unpaired preference](#unpaired-preference) or [Preference (explicit prompt recommended)](#preference) |
| [`NashMDTrainer`] | [Prompt-only](#prompt-only) |
123 changes: 123 additions & 0 deletions docs/source/grpo_trainer.md
@@ -0,0 +1,123 @@
# GRPO Trainer

[![](https://img.shields.io/badge/All_models-GRPO-blue)](https://huggingface.co/models?other=grpo,trl)

## Overview

TRL supports the GRPO Trainer for training language models, as described in the paper [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300) by [Zhihong Shao](https://huggingface.co/syhia), [Peiyi Wang](https://huggingface.co/peiyiwang89), [Qihao Zhu](https://huggingface.co/zqh11), Runxin Xu, [Junxiao Song](https://huggingface.co/haha-point), Mingchuan Zhang, Y. K. Li, Y. Wu, [Daya Guo](https://huggingface.co/guoday).

The abstract from the paper is the following:

> Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

This post-training method was contributed by [Quentin Gallouédec](https://huggingface.co/qgallouedec).

## Quick start

This example demonstrates how to train a model using the GRPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B) as the base model and the [RM-Gemma-2B model](https://huggingface.co/weqweasdas/RM-Gemma-2B) as the reward model. We use the prompts from the [TLDR dataset](https://huggingface.co/datasets/trl-lib/tldr) (the completion column is ignored). You can view the data in the dataset here:

<iframe
src="https://huggingface.co/datasets/trl-lib/tldr/embed/viewer/default/train?row=0"
frameborder="0"
width="100%"
height="560px"
></iframe>
Below is the script to train the model. We use PEFT to reduce the memory requirements.

```python
# train_grpo.py
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Load the dataset
dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",
    learning_rate=1e-5,
    logging_steps=10,
    gradient_accumulation_steps=16,
    max_completion_length=128,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_model="weqweasdas/RM-Gemma-2B",
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(task_type="CAUSAL_LM"),
)

trainer.train()
```

Execute the script using the following command:

```bash
accelerate launch train_grpo.py
```

Distributed across 8 GPUs, the training takes approximately 1 day.

![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/grpo_curves.png)

## Looking deeper into the GRPO method

GRPO is an online learning algorithm, meaning it improves iteratively by using the data generated by the trained model itself during training. The intuition behind the GRPO objective is to maximize the advantage of the generated completions, while ensuring that the model remains close to the reference policy. To understand how GRPO works, it can be broken down into four main steps: **Generating completions**, **computing the advantage**, **estimating the KL divergence**, and **computing the loss**.

![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/grpo_visual.png)

### Generating completions

At each training step, we sample a batch of prompts and generate a set of \\( G \\) completions for each prompt (denoted as \\( o_i \\)).

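For intuition, the sketch below shows what this step can look like with plain `transformers` generation. It is illustrative only: the model name, the value of \\( G \\), and the sampling settings are placeholder choices, not the trainer's internal implementation.

```python
# Illustrative sketch only: sample G completions per prompt with Hugging Face generate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

G = 4  # number of completions per prompt
prompt = "Summarize: The cat sat on the mat all day and refused to move."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        do_sample=True,             # sampling makes the G completions differ
        max_new_tokens=64,
        num_return_sequences=G,     # G completions o_1, ..., o_G for the same prompt
        pad_token_id=tokenizer.eos_token_id,
    )

# Strip the prompt tokens to keep only the completions
completions = tokenizer.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```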
### Computing the advantage

For each of the \\( G \\) sequences, we compute the reward using a reward model. To align with the comparative nature of reward models (they are typically trained on datasets of comparisons between outputs for the same question), the advantage is calculated to reflect these relative comparisons. It is normalized as follows:

$$\hat{A}_{i,t} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$

This approach gives the method its name: **Group Relative Policy Optimization (GRPO)**.

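As a small, self-contained illustration of this normalization (the reward values below are made up; in practice each \\( r_i \\) is the scalar score the reward model assigns to completion \\( o_i \\)):

```python
import torch

# Made-up rewards for a group of G = 4 completions of the same prompt
rewards = torch.tensor([1.2, 0.4, -0.3, 0.9])

# Group-relative advantage: (r_i - mean(r)) / std(r)
advantages = (rewards - rewards.mean()) / rewards.std()
print(advantages)  # completions above the group mean get positive advantages
```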
### Estimating the KL divergence

KL divergence is estimated using the approximator introduced by [Schulman et al. (2020)](http://joschu.net/blog/kl-approx.html). The approximator is defined as follows:

$$\mathbb{D}_{\text{KL}}\left[\pi_\theta \|\pi_{\text{ref}}\right] = \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1,
$$

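Given per-token log-probabilities of the sampled tokens under the current policy and the reference model, this estimator can be sketched as follows (the tensors below are placeholders, not values produced by the trainer):

```python
import torch

# Placeholder per-token log-probabilities of the sampled completion tokens
ref_logprobs = torch.tensor([-1.1, -0.7, -2.3])     # log pi_ref(o_t | q, o_<t)
policy_logprobs = torch.tensor([-1.0, -0.9, -2.0])  # log pi_theta(o_t | q, o_<t)

# Estimator above: ratio - log(ratio) - 1, with ratio = pi_ref / pi_theta
log_ratio = ref_logprobs - policy_logprobs
per_token_kl = torch.exp(log_ratio) - log_ratio - 1  # non-negative by construction
```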
### Computing the loss

The objective is to maximize the advantage while ensuring that the model remains close to the reference policy. Consequently, the loss is defined as follows:

$$
\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\left[\pi_\theta(o_{i,t} \mid q, o_{i,< t})\right]_{\text{no grad}}} \hat{A}_{i,t} - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
$$

where the first term represents the scaled advantage and the second term penalizes deviations from the reference policy through KL divergence.

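Below is a minimal sketch of this simplified loss for a single completion, written with placeholder tensors. The `exp(logp - logp.detach())` trick evaluates to 1 but routes the gradient through the current policy's log-probabilities, matching the \\( \pi_\theta / [\pi_\theta]_{\text{no grad}} \\) term above.

```python
import torch

policy_logprobs = torch.tensor([-1.0, -0.9, -2.0], requires_grad=True)  # log pi_theta per token
ref_logprobs = torch.tensor([-1.1, -0.7, -2.3])                         # log pi_ref per token
advantages = torch.tensor([0.8, 0.8, 0.8])  # the same group-relative advantage for every token of o_i
beta = 0.04  # placeholder KL coefficient

# pi_theta / [pi_theta]_no_grad: equals 1 in value but carries the policy gradient
ratio = torch.exp(policy_logprobs - policy_logprobs.detach())

# Per-token KL approximation (same estimator as in the previous section)
log_ratio = ref_logprobs - policy_logprobs
per_token_kl = torch.exp(log_ratio) - log_ratio - 1

per_token_loss = -(ratio * advantages - beta * per_token_kl)
loss = per_token_loss.mean()  # averaged over the completion's tokens (and over the G completions in a batch)
loss.backward()
```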
In the original paper, this formulation is generalized to account for multiple updates after each generation by leveraging the **clipped surrogate objective**:

$$
\mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})} \hat{A}_{i,t}, \, \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \beta \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right],
$$

where \\(\text{clip}(\cdot, 1 - \epsilon, 1 + \epsilon) \\) ensures that updates do not deviate excessively from the reference policy by bounding the policy ratio between \\( 1 - \epsilon \\) and \\( 1 + \epsilon \\).
However, in TRL, as in the original paper, we perform only one update per generation, so the loss can be simplified to the first form.

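For completeness, the clipped term of the generalized objective can be sketched as follows (placeholder values; `epsilon = 0.2` is an arbitrary choice here):

```python
import torch

ratio = torch.tensor([0.7, 1.05, 1.4])     # pi_theta / pi_theta_old per token (placeholder)
advantages = torch.tensor([0.8, 0.8, 0.8])
epsilon = 0.2

unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
surrogate = torch.min(unclipped, clipped)  # the min keeps updates conservative in both directions
```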
## Logged metrics

The GRPO Trainer logs the following metrics:

- `reward`: The average reward.
- `reward_std`: The average standard deviation within reward groups.
- `kl`: The average KL divergence between the model and the reference model calculated on completions.

## GRPOTrainer

[[autodoc]] GRPOTrainer

## GRPOConfig

[[autodoc]] GRPOConfig
2 changes: 1 addition & 1 deletion docs/source/kto_trainer.mdx
@@ -115,7 +115,7 @@ Each choice of `beta` has a maximum learning rate it can tolerate before learnin
### Imbalanced data

The `desirable_weight` and `undesirable_weight` of the [`KTOConfig`] refer to the weights placed on the losses for desirable/positive and undesirable/negative examples.
By default, they are both 1. However, if you have more of one or the other, then you should upweight the less common type such that the ratio of (`desirable_weight` \\(\times\\) number of positives) to (`undesirable_weight` \\(\times\\) number of negatives) is in the range 1:1 to 4:3.
By default, they are both 1. However, if you have more of one or the other, then you should upweight the less common type such that the ratio of (`desirable_weight` \\(\times\\) number of positives) to (`undesirable_weight` \\(\times\\) number of negatives) is in the range 1:1 to 4:3.

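As a quick worked example of this rule, with made-up counts:

```python
# Made-up counts: desirable examples are the rarer type here
num_desirable, num_undesirable = 200, 800
undesirable_weight = 1.0
desirable_weight = 5.0  # upweight the rarer type

ratio = (desirable_weight * num_desirable) / (undesirable_weight * num_undesirable)
print(ratio)  # 1.25, which falls inside the recommended 1:1 to 4:3 (i.e. 1.0 to ~1.33) range
```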
## Logged metrics

23 changes: 23 additions & 0 deletions scripts/generate_tiny_models.py
@@ -43,6 +43,7 @@
    Idefics2ForConditionalGeneration,
    LlamaConfig,
    LlamaForCausalLM,
    LlamaForSequenceClassification,
    LlavaConfig,
    LlavaForConditionalGeneration,
    LlavaNextConfig,
@@ -57,6 +58,7 @@
    Phi3ForCausalLM,
    Qwen2Config,
    Qwen2ForCausalLM,
    Qwen2ForSequenceClassification,
    SiglipVisionConfig,
    T5Config,
    T5ForConditionalGeneration,
@@ -131,6 +133,7 @@ def push_to_hub(model, tokenizer, prefix=None, suffix=None):
    model = model_class(config)
    push_to_hub(model, tokenizer, "tiny", suffix)


# A slightly bigger model, required for vLLM testing
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
config = Qwen2Config(
@@ -144,6 +147,26 @@ def push_to_hub(model, tokenizer, prefix=None, suffix=None):
model = Qwen2ForCausalLM(config)
push_to_hub(model, tokenizer, "small", "2.5")


# Reward models
for model_id, config_class, model_class, suffix in [
    ("meta-llama/Llama-3.2-1B-Instruct", LlamaConfig, LlamaForSequenceClassification, "3.2"),
    ("Qwen/Qwen2.5-32B-Instruct", Qwen2Config, Qwen2ForSequenceClassification, "2.5"),
]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    config = config_class(
        vocab_size=tokenizer.vocab_size + len(tokenizer.added_tokens_encoder.keys()),
        hidden_size=8,
        num_attention_heads=4,
        num_key_value_heads=2,
        num_hidden_layers=2,
        intermediate_size=32,
        num_labels=1,
    )
    model = model_class(config)
    push_to_hub(model, tokenizer, "tiny", suffix)


# Encoder-decoder models
for model_id, config_class, model_class, suffix in [
    ("google/flan-t5-small", T5Config, T5ForConditionalGeneration, None),
6 changes: 6 additions & 0 deletions tests/test_cli.py
@@ -35,6 +35,12 @@ def test_env(self, mock_stdout):
        main()
        self.assertIn("TRL version: ", mock_stdout.getvalue().strip())

    def test_grpo(self):
        with tempfile.TemporaryDirectory() as tmp_dir:  # Create a temporary directory
            command = f"trl grpo --output_dir {tmp_dir} --model_name_or_path trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 --reward_model_name_or_path trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5 --dataset_name trl-internal-testing/zen --dataset_config standard_prompt_only --num_generations 3 --max_completion_length 32 --report_to none"
            with patch("sys.argv", command.split(" ")):
                main()

    def test_kto(self):
        with tempfile.TemporaryDirectory() as tmp_dir:  # Create a temporary directory
            command = f"trl kto --output_dir {tmp_dir} --model_name_or_path trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 --dataset_name trl-internal-testing/zen --dataset_config standard_unpaired_preference --report_to none"
148 changes: 148 additions & 0 deletions tests/test_grpo_trainer.py
@@ -0,0 +1,148 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import tempfile
import unittest

import torch
from datasets import load_dataset
from parameterized import parameterized
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from transformers.testing_utils import require_peft
from transformers.utils import is_peft_available

from trl import GRPOConfig, GRPOTrainer


if is_peft_available():
    from peft import LoraConfig


class GRPOTrainerTester(unittest.TestCase):
    def test_init_minimal(self):
        # Test that GRPOTrainer can be instantiated with only model, reward_model and train_dataset
        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
        GRPOTrainer(
            model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
            reward_model="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
            train_dataset=dataset,
        )

    @parameterized.expand([("standard_prompt_only",), ("conversational_prompt_only",)])
    def test_training(self, config_name):
        dataset = load_dataset("trl-internal-testing/zen", config_name, split="train")

        with tempfile.TemporaryDirectory() as tmp_dir:
            training_args = GRPOConfig(
                output_dir=tmp_dir,
                learning_rate=0.1,  # increase the learning rate to speed up the test
                per_device_train_batch_size=2,  # reduce the batch size to reduce memory usage
                num_generations=3,  # reduce the number of generations to reduce memory usage
                max_completion_length=32,  # reduce the completion length to reduce memory usage
                report_to="none",
            )
            trainer = GRPOTrainer(
                model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
                reward_model="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
                args=training_args,
                train_dataset=dataset,
            )

            previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

            trainer.train()

            self.assertIsNotNone(trainer.state.log_history[-1]["train_loss"])

            # Check that the params have changed
            for n, param in previous_trainable_params.items():
                new_param = trainer.model.get_parameter(n)
                self.assertFalse(torch.equal(param, new_param), f"Parameter {n} has not changed.")

    @require_peft
    def test_training_peft(self):
        model = AutoModelForCausalLM.from_pretrained("trl-internal-testing/tiny-Qwen2ForCausalLM-2.5")
        base_param_names = [f"base_model.model.{n}" for n, _ in model.named_parameters()]
        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")

        with tempfile.TemporaryDirectory() as tmp_dir:
            training_args = GRPOConfig(
                output_dir=tmp_dir,
                learning_rate=0.1,  # increase the learning rate to speed up the test
                per_device_train_batch_size=2,  # reduce the batch size to reduce memory usage
                num_generations=3,  # reduce the number of generations to reduce memory usage
                max_completion_length=32,  # reduce the completion length to reduce memory usage
                report_to="none",
            )
            trainer = GRPOTrainer(
                model=model,
                reward_model="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
                args=training_args,
                train_dataset=dataset,
                peft_config=LoraConfig(),
            )

            previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

            trainer.train()

            self.assertIsNotNone(trainer.state.log_history[-1]["train_loss"])

            # Check the peft params have changed and the base model params have not changed
            for n, param in previous_trainable_params.items():
                new_param = trainer.model.get_parameter(n)
                if n in base_param_names:  # We expect the base model params to be the same
                    self.assertTrue(torch.allclose(param, new_param), f"Parameter {n} has changed.")
                elif "base_layer" not in n:  # We expect the peft params to be different (except for the base layer)
                    self.assertFalse(torch.allclose(param, new_param), f"Parameter {n} has not changed.")

    def test_training_different_reward_model(self):
        # Use a reward model different from the model: different chat template, tokenization, etc.
        dataset = load_dataset("trl-internal-testing/zen", "conversational_prompt_only", split="train")
        reward_model_id = "trl-internal-testing/tiny-LlamaForSequenceClassification-3.2"
        reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_id)
        reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_id)
        # By default, the trainer uses the eos token as the padding token. However, for Llama models, the eos token
        # appears in the chat template. Using it as a pad token disrupts the reward calculation, as the calculation
        # considers the score of the last token before the first pad token. To ensure correct reward calculations,
        # we use a separate pad token instead.
        reward_tokenizer.pad_token = "<|finetune_right_pad_id|>"

        with tempfile.TemporaryDirectory() as tmp_dir:
            training_args = GRPOConfig(
                output_dir=tmp_dir,
                learning_rate=0.1,  # increase the learning rate to speed up the test
                per_device_train_batch_size=2,  # reduce the batch size to reduce memory usage
                num_generations=3,  # reduce the number of generations to reduce memory usage
                max_completion_length=32,  # reduce the completion length to reduce memory usage
                report_to="none",
            )
            trainer = GRPOTrainer(
                model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
                reward_model=reward_model,
                args=training_args,
                train_dataset=dataset,
                reward_processing_class=reward_tokenizer,
            )

            previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

            trainer.train()

            self.assertIsNotNone(trainer.state.log_history[-1]["train_loss"])

            # Check the params have changed
            for n, param in previous_trainable_params.items():
                new_param = trainer.model.get_parameter(n)
                self.assertFalse(torch.equal(param, new_param), f"Parameter {n} has not changed.")
