GRPO: compute rewards over eval set #5
Conversation
```python
# From https://github.com/huggingface/trl/pull/2700/files
sync_ref_model: bool = field(
    default=False,
    metadata={
        "help": "Whether to synchronize the reference model with the active model every `ref_model_sync_steps` "
        "steps, using the `ref_model_mixup_alpha` parameter."
    },
)
ref_model_mixup_alpha: float = field(
    default=0.9,
    metadata={
        "help": "α parameter from the TR-DPO paper, which controls the mix between the current policy and the "
        "previous reference policy during updates. The reference policy is updated according to the equation: "
        "`π_ref = α * π_θ + (1 - α) * π_ref_prev`. To use this parameter, you must set `sync_ref_model=True`."
    },
)
ref_model_sync_steps: int = field(
    default=64,
    metadata={
        "help": "τ parameter from the TR-DPO paper, which determines how frequently the current policy is "
        "synchronized with the reference policy. To use this parameter, you must set `sync_ref_model=True`."
    },
)
```
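For reference, the `ref_model_mixup_alpha` equation in the help text above corresponds to a parameter-wise soft update of the reference model. A minimal sketch of that update (not the trl implementation; `policy_model` and `ref_model` are assumed to be two modules with identical architectures):

```python
import torch

@torch.no_grad()
def soft_update_ref_model(policy_model: torch.nn.Module,
                          ref_model: torch.nn.Module,
                          alpha: float = 0.9) -> None:
    # Parameter-wise application of: pi_ref = alpha * pi_theta + (1 - alpha) * pi_ref_prev
    for ref_param, policy_param in zip(ref_model.parameters(), policy_model.parameters()):
        ref_param.mul_(1.0 - alpha).add_(policy_param, alpha=alpha)
```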
Reference-model syncing doesn't work for large models, and checkpointing the best epochs (which eval metrics enable) lets us do the reset manually. So I'm removing the configuration.
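A rough sketch of what that manual reset could look like, assuming a best checkpoint saved during training and a trainer holding a separate frozen reference policy (the function name and the `trainer.ref_model` attribute are illustrative, not the exact trainer API):

```python
from transformers import AutoModelForCausalLM

def reset_ref_model_from_best_checkpoint(trainer, best_checkpoint_dir: str) -> None:
    # Load the checkpoint selected as best via the eval metrics...
    best_model = AutoModelForCausalLM.from_pretrained(best_checkpoint_dir)
    # ...and copy its weights into the frozen reference policy.
    trainer.ref_model.load_state_dict(best_model.state_dict())
    trainer.ref_model.eval()
```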
Approving
I am wondering, do you run their unit tests, or does this break them?
* Ported compute_reward_metrics and integrated into GRPOTrainer.__init__
* Decomposed _compute_rewards_per_func out of _generate_and_score_completions
* Decomposed _extract_completions from _generate_and_score_completions
* Implemented a custom generation_config route in _generate_and_score_completions
* Decomposed _generate, and plugged it into prediction_step
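For context on the second bullet, a hypothetical sketch of what a decomposed per-function reward helper can look like once it is pulled out of `_generate_and_score_completions` (the signature and internals here are illustrative, not the exact code in the diff):

```python
import torch

def _compute_rewards_per_func(prompts, completions, reward_funcs):
    # One row per completion, one column per reward function.
    rewards_per_func = torch.zeros(len(completions), len(reward_funcs))
    for j, reward_func in enumerate(reward_funcs):
        # Each reward function scores the whole batch of (prompt, completion) pairs.
        scores = reward_func(prompts=prompts, completions=completions)
        rewards_per_func[:, j] = torch.tensor(scores, dtype=torch.float32)
    return rewards_per_func
```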
Basically what the title says. `Trainer`'s eval loop is built around losses/logits, so I am abusing the return and call signatures of `prediction_step` and `compute_metrics` respectively. This lets us compute and log rewards without rewriting the whole eval loop and logging mechanism. I also pulled in the changes from huggingface#2776, which are needed to unwrap the model for generation in the eval loop.

Most of the code diff is me pulling code out of `compute_loss` into separate methods so they can be reused in `prediction_step`.
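To make the "abuse" concrete, here is a minimal sketch of the idea, assuming `prediction_step` smuggles the per-function reward matrix out through the logits slot of its return value, so that `compute_metrics` receives it via `EvalPrediction.predictions` (the reward-function names are illustrative, and this is not the exact code in the diff):

```python
import numpy as np
from transformers import EvalPrediction

REWARD_FUNC_NAMES = ["format_reward", "accuracy_reward"]  # illustrative names

def compute_reward_metrics(eval_pred: EvalPrediction) -> dict:
    # `predictions` carries the (num_samples, num_reward_funcs) reward matrix
    # that prediction_step returned in place of logits.
    rewards = np.asarray(eval_pred.predictions)
    metrics = {f"reward/{name}": float(rewards[:, j].mean())
               for j, name in enumerate(REWARD_FUNC_NAMES)}
    metrics["reward/total"] = float(rewards.sum(axis=1).mean())
    return metrics
```

Passing this as `compute_metrics=compute_reward_metrics` lets the existing eval loop and logging handle the rest.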