
Reward trainer documentation section "adding-a-margin-to-the-loss" should mention Llama 3.1 #2493

Open
Kallinteris-Andreas opened this issue Dec 17, 2024 · 0 comments

@Kallinteris-Andreas

https://huggingface.co/docs/trl/en/reward_trainer#adding-a-margin-to-the-loss
suggests a technique used in Llama 2, but that technique was not used again in Llama 3, so it is no longer a state-of-the-art technique (it could still be kept, as it may be useful for something).
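
For reference, here is a minimal sketch of the pairwise loss the linked docs section describes, in plain PyTorch (this is an illustration, not TRL's actual implementation; the function name and signature are made up). With a margin it corresponds to the Llama 2 objective; passing no margin gives the simplified Llama 3.1 objective:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(rewards_chosen: torch.Tensor,
                         rewards_rejected: torch.Tensor,
                         margin: torch.Tensor | None = None) -> torch.Tensor:
    """Bradley-Terry reward-modeling loss.

    Llama 2 style:   -log sigmoid(r_chosen - r_rejected - margin)
    Llama 3.1 style: -log sigmoid(r_chosen - r_rejected)   (margin term removed)
    """
    diff = rewards_chosen - rewards_rejected
    if margin is not None:
        diff = diff - margin
    return -F.logsigmoid(diff).mean()

# Toy example: same reward gaps, with and without the margin term.
r_c = torch.tensor([1.2, 0.8])
r_r = torch.tensor([0.3, 0.9])
print(pairwise_reward_loss(r_c, r_r, margin=torch.tensor([0.5, 0.5])))  # Llama 2 style
print(pairwise_reward_loss(r_c, r_r))                                   # Llama 3.1 style
```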

From the Llama 3.1 paper, Section 4.1.2 (Reward Modeling):

> We train a reward model (RM) covering different capabilities on top of the pre-trained checkpoint. The training objective is the same as Llama 2 except that we remove the margin term in the loss, as we observe diminishing improvements after data scaling. Following Llama 2, we use all of our preference data for reward modeling after filtering out samples with similar responses. In addition to standard preference pair of (chosen, rejected) response, annotations also create a third "edited response" for some prompts, where the chosen response from the pair is further edited for improvement (see Section 4.2.1). Hence, each preference ranking sample has two or three responses with clear ranking (edited > chosen > rejected). We concatenate the prompt and multiple responses into a single row during training with responses randomly shuffled. This is an approximation to the standard scenario of putting the responses in separate rows and computing the scores, but in our ablations, this approach improves training efficiency without a loss in accuracy.
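
For illustration, a two- or three-way ranking like this could be expanded into the standard (chosen, rejected) pairs that pairwise reward training expects. This sketch shows the "separate rows" baseline the paper mentions, not their single-row concatenation trick, and the function name here is hypothetical:

```python
from itertools import combinations

def ranking_to_pairs(responses_best_to_worst: list[str]) -> list[tuple[str, str]]:
    # E.g. ["edited", "chosen", "rejected"] ->
    #   [("edited", "chosen"), ("edited", "rejected"), ("chosen", "rejected")]
    # Each higher-ranked response is treated as "chosen" vs. every lower-ranked one.
    return list(combinations(responses_best_to_worst, 2))

print(ranking_to_pairs(["edited", "chosen", "rejected"]))
```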
