Commit

Merge branch 'main' into fix-grpo-logits-calc
qgallouedec authored Jan 31, 2025
2 parents 3332a22 + bf69191 commit 41f195f
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -139,7 +139,7 @@ trainer.train()

### `GRPOTrainer`

-`GRPOTrainer` implements a [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models]([https://huggingface.co/papers/2402.14740](https://huggingface.co/papers/2402.03300)) for reinforcement learning. Group Relative Policy Optimization (GRPO) is more performant than PPO and was used to train [Deepseek AI's R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).
+`GRPOTrainer` implements the [Group Relative Policy Optimization (GRPO) algorithm](https://huggingface.co/papers/2402.03300) that is more memory-efficient than PPO and was used to train [Deepseek AI's R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).

```python
from datasets import load_dataset