Process reward models #2241

SalmanMohammadi · 2025-01-07T12:37:29Z

Fixing `num_labels` for reward models

Reward models should be initialized with `num_labels=1`. I've added a `num_labels` field, which is set in `cfg_kwargs` rather than `model_kwargs` because transformers reads this value from `config.num_labels`. This also allows us to correctly initialize process reward models with `num_labels=2`.
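To illustrate why the value has to travel through the config, here is a minimal sketch using stand-in classes. `ToyConfig`, `ToyClassifier`, and `build_model` are hypothetical names invented for this example; they are not axolotl or transformers APIs. They only mimic the idea that the classification head is sized from `config.num_labels`, so the value must go into the config kwargs rather than the model kwargs.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a model config object."""
    hidden_size: int = 8
    num_labels: int = 2  # illustrative default for this sketch only


class ToyClassifier:
    """Hypothetical stand-in for a classification model.

    The head size is read from config.num_labels, not from a
    constructor argument, mirroring the behaviour described above.
    """

    def __init__(self, config: ToyConfig, **model_kwargs):
        self.config = config
        self.model_kwargs = model_kwargs  # unrelated loading options
        self.head_shape = (config.hidden_size, config.num_labels)


def build_model(num_labels: int, **model_kwargs) -> ToyClassifier:
    # num_labels is routed into the config kwargs so the head is sized from it.
    cfg_kwargs = {"num_labels": num_labels}
    return ToyClassifier(ToyConfig(**cfg_kwargs), **model_kwargs)


reward_model = build_model(num_labels=1)  # single scalar score
process_rm = build_model(num_labels=2)    # two classes per scored position
print(reward_model.head_shape, process_rm.head_shape)
```

In this toy setup, passing `num_labels` as a model kwarg instead would leave the config at its default and the head would be sized incorrectly for a reward model.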

Adding support for process reward model training

I've added support for the `PRMTrainer` from trl, and for the corresponding stepwise-supervised dataset format. Please see a screenshot of a successful training run below.

Resulting in the following trained model: https://huggingface.co/smohammadi/Qwen2.5-3B-MathShepherd
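For readers unfamiliar with stepwise supervision, the following is a rough sketch of how one such example might be turned into token-level labels. It assumes the dataset row has `prompt`, `completions`, and `labels` fields, with one boolean label per reasoning step. The tokenizer (`toy_tokenize`), the separator id (`SEP_ID`), and the helper `build_stepwise_example` are all hypothetical and exist only for illustration; this is not the code in `stepwise_supervised.py` or in trl.

```python
IGNORE_INDEX = -100  # conventional "ignore this position" label for the loss
SEP_ID = 0           # hypothetical id for the step-separator token


def toy_tokenize(text: str) -> list[int]:
    """Hypothetical tokenizer: one fake id per whitespace-separated word."""
    return [len(word) for word in text.split()]


def build_stepwise_example(prompt: str, completions: list[str], labels: list[bool]) -> dict:
    """Concatenate prompt and steps; supervise only the last token of each step."""
    input_ids = toy_tokenize(prompt)
    token_labels = [IGNORE_INDEX] * len(input_ids)  # prompt is never supervised

    for step, is_correct in zip(completions, labels):
        step_ids = toy_tokenize(step) + [SEP_ID]
        input_ids.extend(step_ids)
        # Mask every token in the step except the final (separator) token,
        # which carries the step's correctness label as 0 or 1.
        token_labels.extend([IGNORE_INDEX] * (len(step_ids) - 1) + [int(is_correct)])

    return {"input_ids": input_ids, "labels": token_labels}


example = build_stepwise_example(
    prompt="What is 2 + 3 ?",
    completions=["Add 2 and 3", "The answer is 6"],
    labels=[True, False],
)
print(example["labels"])
```

The two possible label values per supervised position are what make `num_labels=2` the natural setting for a process reward model in this sketch.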

src/axolotl/prompt_strategies/stepwise_supervised.py

winglian

A couple of minor things, but should be good to go after.

src/axolotl/core/trainer_builder.py

src/axolotl/prompt_strategies/stepwise_supervised.py

src/axolotl/utils/config/models/input/v0_4_1/__init__.py

tests/e2e/test_process_reward_model_llama.py

winglian

Looks great to me. Thanks!

SalmanMohammadi added 2 commits January 7, 2025 12:34

adding model_cfg to set num_labels

2ca689c

using a num_labels field instead

9f5a8e0

SalmanMohammadi changed the title ~~Setting num_labels for reward models~~ Process reward models Jan 7, 2025

SalmanMohammadi added 4 commits January 7, 2025 13:57

Merge branch 'main' into fix_reward_model

9229435

linting

3baaa76

WIP stepwise prompt tokenizer

f81b174

this should work?

0630baa

SalmanMohammadi commented Jan 8, 2025

src/axolotl/prompt_strategies/stepwise_supervised.py

SalmanMohammadi added 8 commits January 8, 2025 21:21

trainer working?

796fd14

pushing to runpod

a6ee075

fixing saving

57050d4

updating conf

0f94239

merging main

1291d22

updating config, adding docs

3107e2a

adding stepwise supervision docpage

034f303

updating tests

0f0662b

SalmanMohammadi requested a review from winglian January 22, 2025 18:47

SalmanMohammadi marked this pull request as ready for review January 23, 2025 11:24

SalmanMohammadi commented Jan 23, 2025

src/axolotl/prompt_strategies/stepwise_supervised.py

SalmanMohammadi added 3 commits January 23, 2025 20:33

adding test for dataset

a302348

fixing tests

71b0f39

linting

5ee3876

winglian reviewed Jan 24, 2025

SalmanMohammadi added 4 commits January 25, 2025 19:33

addressing some comments

bba11c7

adding additional cfg fields support

f55fb7d

updating tests, fixing cfg

88fddfc

fixing tests

f27ca55

SalmanMohammadi requested a review from winglian January 28, 2025 11:08

SalmanMohammadi added 2 commits January 28, 2025 12:09

updating loss

9d5bb17

Update test_process_reward_model_smollm2.py

b2e5ac7

SalmanMohammadi added 2 commits January 28, 2025 14:59

updating loss values and seed

5b53586

dumb pre-commit

b88d37a

winglian approved these changes Jan 28, 2025

Merge branch 'main' into fix_reward_model

2ccf31f

SalmanMohammadi added the ready to merge label Jan 28, 2025

winglian merged commit 54dd7ab into main Jan 29, 2025
11 checks passed

winglian deleted the fix_reward_model branch January 29, 2025 05:08
