Support ReMax Algorithm #2955
base: main
Conversation
Thanks for this great work! It seems very close to GRPO; can you summarize the key differences to make the reviewing a bit easier for me?
Hi, ReMax differs from GRPO in two key aspects: baseline estimation and advantage calculation.

Key Conceptual Differences

- Baseline Estimation: ReMax uses the reward of a greedy-decoded response as a per-prompt baseline, whereas GRPO estimates the baseline from the mean reward of a group of sampled responses.
- Advantage Calculation: ReMax takes the advantage to be the sampled reward minus the greedy baseline reward, without standard-deviation normalization; GRPO normalizes each sampled reward by the mean and standard deviation of its group.
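To make the contrast concrete, here is a minimal sketch of the two advantage computations. The function names and tensor shapes are my own illustration, not the PR's actual code:

```python
import torch


def remax_advantages(sample_rewards: torch.Tensor,
                     greedy_rewards: torch.Tensor) -> torch.Tensor:
    """ReMax: subtract the reward of the greedy-decoded response
    (one scalar baseline per prompt) from each sampled reward.

    sample_rewards: (num_prompts, num_samples)
    greedy_rewards: (num_prompts,)
    """
    return sample_rewards - greedy_rewards.unsqueeze(-1)


def grpo_advantages(sample_rewards: torch.Tensor,
                    eps: float = 1e-4) -> torch.Tensor:
    """GRPO: normalize each sampled reward by the mean and std
    of its own group of samples for the same prompt."""
    mean = sample_rewards.mean(dim=-1, keepdim=True)
    std = sample_rewards.std(dim=-1, keepdim=True)
    return (sample_rewards - mean) / (std + eps)
```

Note that the ReMax baseline requires one extra greedy generation per prompt, but no group statistics.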
Implementation Details

The implementation follows the structure of the GRPO trainer, with the modifications below.

Key Modifications
Additional Changes
I also provide an introduction to ReMax in the docs. If you have additional questions, please feel free to let me know.
Thanks! Can you try to integrate the very latest changes in GRPO?
incorporate latest changes in grpo to remax
Hi, I’ve integrated the latest changes from GRPO. Below is a summary of the updates:
Let me know if you have any questions or need further details!
currently, the remax trainer file is a copy of the remax config file... is that a mistake?
Hi @kashif, thank you for pointing that out! It was my mistake to copy the wrong content. I’ve now fixed it.
What does this PR do?
This PR adds the ReMax component to TRL, implementing the algorithm from the ICML 2024 paper ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models.
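At its core, ReMax is REINFORCE with the greedy-decoded reward as a variance-reducing baseline. The following sketch shows the resulting policy-gradient loss; the function signature and tensor layout are my own assumptions for illustration, not the PR's trainer code:

```python
import torch


def remax_loss(logps: torch.Tensor,            # (batch, seq_len) per-token log-probs
               mask: torch.Tensor,             # (batch, seq_len) 1 for response tokens
               rewards: torch.Tensor,          # (batch,) sampled-response rewards
               greedy_rewards: torch.Tensor,   # (batch,) greedy-baseline rewards
               ) -> torch.Tensor:
    # REINFORCE with a greedy baseline: maximize
    # E[(r - r_greedy) * sum_t log pi(a_t | s_t)].
    # The advantage is detached so gradients flow only through the log-probs.
    advantages = (rewards - greedy_rewards).detach()
    seq_logp = (logps * mask).sum(dim=-1)
    return -(advantages * seq_logp).mean()
```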
Key features include:
Before submitting
Who can review?
Anyone with experience in reinforcement learning for language models would be great to review this PR. The implementation involves both training and generation components.
Test Results
All tests have been successfully executed:
Click to see the test details
Empirical Effectiveness
The effectiveness of ReMax is compared against GRPO by fine-tuning Qwen-2.5-3B-Instruct on the GSM8K dataset with an accuracy reward.
Click to see the training code
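The training code is collapsed above; for readers unfamiliar with the setup, a typical GSM8K accuracy reward checks whether the final number in the completion matches the reference answer. The sketch below is illustrative only (the function name and answer-extraction regex are my own, not necessarily what the PR uses):

```python
import re


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the last number in the completion equals the
    reference answer, else 0.0 (illustrative GSM8K-style reward)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == str(ground_truth) else 0.0
```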