From the Llama 3.1 paper, Section 4.1.2 (Reward Modeling):
We train a reward model (RM) covering different capabilities on top of the pre-trained checkpoint. The
training objective is the same as Llama 2 except that we remove the margin term in the loss, as we observe
diminishing improvements after data scaling. Following Llama 2, we use all of our preference data for reward
modeling after filtering out samples with similar responses. In addition to standard preference pair of (chosen,
rejected) response, annotations also create a third “edited response” for some prompts, where the chosen
response from the pair is further edited for improvement (see Section 4.2.1). Hence, each preference ranking
sample has two or three responses with clear ranking (edited > chosen > rejected). We concatenate the
prompt and multiple responses into a single row during training with responses randomly shuffled. This is an
approximation to the standard scenario of putting the responses in separate rows and computing the scores,
but in our ablations, this approach improves training efficiency without a loss in accuracy.
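For reference, here is a minimal PyTorch sketch of how one such ranked sample (two or three responses with edited > chosen > rejected) could be turned into a pairwise reward-model loss. The pairwise reduction and the helper function are my own reading of the paragraph, not code from the paper, which does not spell out the exact loss over triples.

```python
import torch
import torch.nn.functional as F

def ranking_reward_loss(scores: torch.Tensor) -> torch.Tensor:
    """Pairwise loss for one preference sample.

    `scores` holds the reward model's scalar outputs for the responses of a
    single sample, ordered best-to-worst (e.g. [edited, chosen, rejected] or
    just [chosen, rejected]). One -log(sigmoid(r_better - r_worse)) term is
    accumulated for every ordered pair; this pairwise reduction is an
    assumption, not something the paper states explicitly.
    """
    losses = []
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            losses.append(-F.logsigmoid(scores[i] - scores[j]))
    return torch.stack(losses).mean()

# A two-response sample (chosen > rejected) and a three-response sample
# (edited > chosen > rejected); in practice the scores come from the RM.
print(ranking_reward_loss(torch.tensor([1.2, -0.3])))
print(ranking_reward_loss(torch.tensor([1.5, 1.2, -0.3])))
```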
https://huggingface.co/docs/trl/en/reward_trainer#adding-a-margin-to-the-loss
suggests a technique used in Llama 2, but that technique was not used again in Llama 3, so it is no longer a state-of-the-art technique (it could still be kept, as it may be useful in some settings).
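To make the difference concrete, here is a minimal sketch of the margin variant described in the linked TRL docs next to the plain pairwise loss that Llama 3.1 reverts to. The tensor names and values are illustrative, not TRL's internal API.

```python
import torch
import torch.nn.functional as F

# Scalar reward-model outputs for a batch of (chosen, rejected) pairs.
rewards_chosen = torch.tensor([1.2, 0.7, 2.1])
rewards_rejected = torch.tensor([-0.3, 0.9, 1.0])
# Per-pair margin, e.g. derived from how strongly annotators preferred the
# chosen response (the Llama 2-style signal; illustrative values).
margin = torch.tensor([1.0, 0.5, 2.0])

# Llama 2-style loss with a margin term, as in the linked TRL docs.
loss_with_margin = -F.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()

# Llama 3.1-style loss with the margin term removed.
loss_plain = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

print(loss_with_margin.item(), loss_plain.item())
```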