
GRPO: compute rewards over eval set #5

Merged
sidnarayanan merged 4 commits into rollback from rl-eval on Feb 17, 2025
Conversation

@sidnarayanan (Collaborator) commented on Feb 14, 2025:

Basically what the title says. The Trainer's eval loop is built around losses/logits, so I am abusing the return and call signatures of prediction_step and compute_metrics, respectively. This lets us compute and log rewards without rewriting the whole eval loop and logging mechanism.

I also pulled in the changes from huggingface#2776, which are needed to unwrap the model for generation in the eval loop.

Most of the code diff is me pulling code out of compute_loss into separate methods so they can be reused in prediction_step.
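Conceptually, the override looks something like the sketch below. This is illustrative only, not the PR's actual code: per-reward-function scores are returned where the eval loop expects logits, and compute_metrics unpacks the gathered array into named metrics. The function names echo the commit messages, but their signatures here (e.g. passing reward_func_names explicitly) and the metric key names are assumptions.

# Illustrative sketch only (not the PR's actual code). Per-reward-function
# scores ride in the slot the Trainer's eval loop treats as logits, and
# compute_metrics turns the gathered array into named reward metrics.
import numpy as np
from transformers import EvalPrediction


def prediction_step_sketch(rewards_per_func: np.ndarray):
    # Stand-in for the overridden prediction_step, which must return
    # (loss, logits, labels); loss and labels are unused here.
    return None, rewards_per_func, None


def compute_reward_metrics(eval_pred: EvalPrediction, reward_func_names: list[str]) -> dict[str, float]:
    # eval_pred.predictions is the concatenation of what prediction_step put
    # in the "logits" slot: shape (num_eval_samples, num_reward_funcs).
    rewards = np.asarray(eval_pred.predictions)
    metrics = {
        f"rewards/{name}": float(rewards[:, i].mean())
        for i, name in enumerate(reward_func_names)
    }
    metrics["reward"] = float(rewards.sum(axis=-1).mean())
    return metrics


# Tiny usage example: two reward functions over four eval samples.
fake = EvalPrediction(predictions=np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 0.0], [1.0, 1.0]]), label_ids=None)
print(compute_reward_metrics(fake, ["format", "accuracy"]))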

Comment on lines -177 to -200

    # From https://github.com/huggingface/trl/pull/2700/files
    sync_ref_model: bool = field(
        default=False,
        metadata={
            "help": "Whether to synchronize the reference model with the active model every `ref_model_sync_steps` "
            "steps, using the `ref_model_mixup_alpha` parameter."
        },
    )
    ref_model_mixup_alpha: float = field(
        default=0.9,
        metadata={
            "help": "α parameter from the TR-DPO paper, which controls the mix between the current policy and the "
            "previous reference policy during updates. The reference policy is updated according to the equation: "
            "`π_ref = α * π_θ + (1 - α) * π_ref_prev`. To use this parameter, you must set `sync_ref_model=True`."
        },
    )
    ref_model_sync_steps: int = field(
        default=64,
        metadata={
            "help": "τ parameter from the TR-DPO paper, which determines how frequently the current policy is "
            "synchronized with the reference policy. To use this parameter, you must set `sync_ref_model=True`."
        },
    )
@sidnarayanan (Collaborator, Author) commented on Feb 14, 2025:

Reference-model syncing doesn't work for large models, and checkpointing the best epochs (which eval metrics enable) lets us do the reset manually, so I'm removing this configuration.
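For context on the manual alternative: once eval rewards are reported as metrics, the stock Trainer checkpoint-selection knobs can take over from automatic reference-model syncing. A minimal sketch; the metric name "eval_reward" assumes compute_metrics reports a "reward" key (which the Trainer prefixes with "eval_"), and the step counts are placeholders.

# Sketch of eval-metric-driven checkpointing (values are placeholders).
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs/grpo",
    eval_strategy="steps",                # run evaluation periodically
    eval_steps=100,
    save_strategy="steps",                # checkpoint on the same cadence
    save_steps=100,
    save_total_limit=3,
    load_best_model_at_end=True,          # keep/restore the best-scoring checkpoint
    metric_for_best_model="eval_reward",  # assumed metric key from compute_metrics
    greater_is_better=True,
)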

@jamesbraza left a comment:

Approving

I am wondering, do you run their unit tests, or does this break them?

@sidnarayanan (Collaborator, Author) replied:
I haven't been, but I don't think this will break them: upstream only tests GRPO eval in one place, and that test doesn't actually check the outputs, just that it runs.
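For reference, the kind of assertion upstream doesn't make would look roughly like the sketch below. It is hypothetical and unverified against this fork: the tiny model checkpoint, the config values, and the "eval_reward" key prefix are all assumptions.

# Hypothetical test sketch: evaluation should surface reward metrics.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer


def length_reward(completions, **kwargs):
    # Toy reward function: longer completions score higher.
    return [float(len(c)) for c in completions]


def test_eval_reports_rewards(tmp_path):
    data = Dataset.from_dict({"prompt": ["Hello", "Goodbye", "Hi", "Bye"]})
    args = GRPOConfig(
        output_dir=str(tmp_path),
        per_device_eval_batch_size=2,
        num_generations=2,
        max_completion_length=8,
        report_to="none",
    )
    trainer = GRPOTrainer(
        model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",  # any tiny causal LM works
        reward_funcs=length_reward,
        args=args,
        train_dataset=data,
        eval_dataset=data,
    )
    metrics = trainer.evaluate()
    # Exact key names are an assumption; the point is that rewards get logged.
    assert any(k.startswith("eval_reward") for k in metrics), metrics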

@sidnarayanan sidnarayanan merged commit 46dd496 into rollback Feb 17, 2025
@sidnarayanan sidnarayanan deleted the rl-eval branch February 17, 2025 00:33
@sidnarayanan sidnarayanan changed the title GRPO: compute losses over eval set GRPO: compute rewards over eval set Feb 19, 2025
jamesbraza added a commit that referenced this pull request Feb 26, 2025
* Ported compute_reward_metrics and integrated into GRPOTrainer.__init__

* Decomposed _compute_rewards_per_func out of _generate_and_score_completions

* Decomposed _extract_completions from _generate_and_score_completions

* Implemented a custom generation_config route in _generate_and_score_completions

* Decomposed _generate, and plugged it into prediction_step
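Taken together, these commits amount to roughly the structure sketched below. Only the method names come from the commit messages; the signatures and docstrings are assumptions for orientation, not the actual diff.

# Illustrative skeleton of the refactor; not the actual code.
class PatchedGRPOTrainer:
    def _generate(self, model, prompts, generation_config=None):
        """Run generation, optionally with a custom generation_config."""

    def _extract_completions(self, prompt_completion_ids, prompts):
        """Decode generated token ids back into completion text."""

    def _compute_rewards_per_func(self, inputs, completions):
        """Score completions with every reward function -> (batch, n_funcs)."""

    def _generate_and_score_completions(self, inputs):
        """Training path: compose the three helpers above (used by compute_loss)."""

    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        """Eval path: reuse _generate and _compute_rewards_per_func, returning
        rewards where the eval loop expects logits (see the sketch above)."""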
jamesbraza added further commits referencing this pull request on Feb 26, Feb 27, Feb 28, and Mar 8, 2025, each with the same commit message as above.