WIP: RLOOV2 #2724

mnoukhov · 2025-01-31T19:53:38Z

What does this PR do?

Following #2567, this PR introduces an RLOOv2 trainer that:

- follows the structure of OnlineDPOTrainer to leverage Trainer's train and training_step
- adds GRPOTrainer's new reward_func so it can be used for reasoning / math tasks
- follows SFTTrainer to automatically preprocess the dataset
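
To illustrate the reward_func point, here is a minimal sketch of a rule-based math reward in the callable style GRPOTrainer accepts (a function that receives prompts and completions and returns one float per completion). The function name, the `answer` dataset column, and the "last number in the completion" heuristic are assumptions for illustration, not code from this PR. It also assumes completions are plain strings rather than chat messages.

```python
import re


def exact_answer_reward(prompts, completions, answer, **kwargs):
    """Return 1.0 when the last number in a completion equals the reference, else 0.0.

    `answer` is a hypothetical dataset column passed through as a keyword
    argument, one reference string per completion.
    """
    rewards = []
    for completion, reference in zip(completions, answer):
        # Collect every integer or decimal in the text; treat the last as the prediction.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        predicted = numbers[-1] if numbers else None
        rewards.append(1.0 if predicted == str(reference) else 0.0)
    return rewards
```

A function like this needs no reward model, which is what makes it convenient for reasoning / math data where answers can be checked directly.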

and, particularly useful for Open R1, it follows the current RLOOTrainer in that it:

- allows for generating multiple minibatches of data and then training for multiple "ppo epochs" on that data, which doesn't seem to be present in GRPOTrainer
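
The generate-then-reuse pattern described above can be sketched as a two-phase loop. This is only a schematic under stated assumptions: the function and parameter names are hypothetical, integer indices stand in for real rollout minibatches, and no model update is actually performed.

```python
def ppo_epoch_schedule(num_rollout_minibatches, num_ppo_epochs):
    """List the (epoch, minibatch) update order for one generation phase."""
    # Phase 1: generate every minibatch with the current policy.
    # Indices stand in for the stored completions and rewards.
    rollout_buffer = list(range(num_rollout_minibatches))

    # Phase 2: make several optimisation passes over the same stored data.
    schedule = []
    for epoch in range(num_ppo_epochs):
        for minibatch in rollout_buffer:
            schedule.append((epoch, minibatch))
    return schedule


# Two minibatches reused for two epochs gives four updates per generation phase.
print(ppo_epoch_schedule(2, 2))
```

After the first update the stored rollouts no longer come from the current policy, which is why this setup involves off-policy learning.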

Overall, it updates RLOOTrainer to get all the benefits of Trainer and the latest GRPOTrainer format, while maintaining all of its existing functionality.
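
For readers unfamiliar with what RLOO computes, the following is a minimal sketch of the leave-one-out baseline that gives the method its name: each completion's advantage is its reward minus the mean reward of the other completions for the same prompt. The function name and the nested-list input format are assumptions for illustration and are not taken from this PR.

```python
def rloo_advantages(rewards_per_prompt):
    """Compute leave-one-out advantages for each group of rewards.

    `rewards_per_prompt` is a list of lists: one inner list of k rewards
    for the k completions sampled from the same prompt.
    """
    advantages = []
    for rewards in rewards_per_prompt:
        k = len(rewards)
        if k < 2:
            raise ValueError("leave-one-out needs at least 2 samples per prompt")
        total = sum(rewards)
        # Baseline for sample i is the mean of the other k - 1 rewards.
        advantages.append([r - (total - r) / (k - 1) for r in rewards])
    return advantages
```

When all completions for a prompt receive the same reward, every advantage is zero, so that prompt contributes no gradient signal.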

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[x] Did you read the contributor guideline, Pull Request section?
[x] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Did you write any new necessary tests?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

combines online dpo, grpo and adds multiple ppo epochs and off-policy learning

mnoukhov added 2 commits January 31, 2025 14:19

init rloov2 from grpo

85aa087

first wip of rloov2

bbc81be

mnoukhov mentioned this pull request Jan 31, 2025

WIP: Base Online Trainer #2567

Closed

4 tasks