GRPO: compute rewards over eval set #5
Conversation
```python
# From https://github.com/huggingface/trl/pull/2700/files
sync_ref_model: bool = field(
    default=False,
    metadata={
        "help": "Whether to synchronize the reference model with the active model every `ref_model_sync_steps` "
        "steps, using the `ref_model_mixup_alpha` parameter."
    },
)
ref_model_mixup_alpha: float = field(
    default=0.9,
    metadata={
        "help": "α parameter from the TR-DPO paper, which controls the mix between the current policy and the "
        "previous reference policy during updates. The reference policy is updated according to the equation: "
        "`π_ref = α * π_θ + (1 - α) * π_ref_prev`. To use this parameter, you must set `sync_ref_model=True`."
    },
)
ref_model_sync_steps: int = field(
    default=64,
    metadata={
        "help": "τ parameter from the TR-DPO paper, which determines how frequently the current policy is "
        "synchronized with the reference policy. To use this parameter, you must set `sync_ref_model=True`."
    },
)
```
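For reference, the `ref_model_mixup_alpha` equation in the help text above corresponds to a parameter-wise soft update of the reference model. A minimal sketch of that update (not the trl implementation; `policy_model` and `ref_model` are assumed to be two modules with identical architectures):

```python
import torch

@torch.no_grad()
def soft_update_ref_model(policy_model: torch.nn.Module,
                          ref_model: torch.nn.Module,
                          alpha: float = 0.9) -> None:
    # Parameter-wise application of: pi_ref = alpha * pi_theta + (1 - alpha) * pi_ref_prev
    for ref_param, policy_param in zip(ref_model.parameters(), policy_model.parameters()):
        ref_param.mul_(1.0 - alpha).add_(policy_param, alpha=alpha)
```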
Reference-model syncing doesn't work for large models, and checkpointing the best epochs (which eval metrics enable) lets us do the reset manually. So I'm removing the configuration.
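A rough sketch of what that manual reset could look like, assuming a best checkpoint saved during training and a trainer holding a separate frozen reference policy (the function name and the `trainer.ref_model` attribute are illustrative, not the exact trainer API):

```python
from transformers import AutoModelForCausalLM

def reset_ref_model_from_best_checkpoint(trainer, best_checkpoint_dir: str) -> None:
    # Load the checkpoint selected as best via the eval metrics...
    best_model = AutoModelForCausalLM.from_pretrained(best_checkpoint_dir)
    # ...and copy its weights into the frozen reference policy.
    trainer.ref_model.load_state_dict(best_model.state_dict())
    trainer.ref_model.eval()
```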
Approving
I am wondering, do you run their unit tests, or does this break them?
* Ported compute_reward_metrics and integrated into GRPOTrainer.__init__
* Decomposed _compute_rewards_per_func out of _generate_and_score_completions
* Decomposed _extract_completions from _generate_and_score_completions
* Implemented a custom generation_config route in _generate_and_score_completions
* Decomposed _generate, and plugged it into prediction_step
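For context on the second bullet, a hypothetical sketch of what a decomposed per-function reward helper can look like once it is pulled out of `_generate_and_score_completions` (the signature and internals here are illustrative, not the exact code in the diff):

```python
import torch

def _compute_rewards_per_func(prompts, completions, reward_funcs):
    # One row per completion, one column per reward function.
    rewards_per_func = torch.zeros(len(completions), len(reward_funcs))
    for j, reward_func in enumerate(reward_funcs):
        # Each reward function scores the whole batch of (prompt, completion) pairs.
        scores = reward_func(prompts=prompts, completions=completions)
        rewards_per_func[:, j] = torch.tensor(scores, dtype=torch.float32)
    return rewards_per_func
```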
Basically what the title says. `Trainer`'s eval loop is built around losses/logits, so I am abusing the return and call signatures of `prediction_step` and `compute_metrics` respectively. This lets us compute and log rewards without rewriting the whole eval loop and logging mechanism. I also pulled in the changes from huggingface#2776, which are needed to unwrap the model for generation in the eval loop.

Most of the code diff is me pulling code out of `compute_loss` into separate methods so they can be reused in `prediction_step`.
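To make the "abuse" concrete, here is a minimal sketch of the idea, assuming `prediction_step` smuggles the per-function reward matrix out through the logits slot of its return value, so that `compute_metrics` receives it via `EvalPrediction.predictions` (the reward-function names are illustrative, and this is not the exact code in the diff):

```python
import numpy as np
from transformers import EvalPrediction

REWARD_FUNC_NAMES = ["format_reward", "accuracy_reward"]  # illustrative names

def compute_reward_metrics(eval_pred: EvalPrediction) -> dict:
    # `predictions` carries the (num_samples, num_reward_funcs) reward matrix
    # that prediction_step returned in place of logits.
    rewards = np.asarray(eval_pred.predictions)
    metrics = {f"reward/{name}": float(rewards[:, j].mean())
               for j, name in enumerate(REWARD_FUNC_NAMES)}
    metrics["reward/total"] = float(rewards.sum(axis=1).mean())
    return metrics
```

Passing this as `compute_metrics=compute_reward_metrics` lets the existing eval loop and logging handle the rest.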