fix doc

huggingface · Jan 17, 2025 · 67adbfe
1 parent b2f017f
commit 67adbfe
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/docs/source/rloo_trainer.md b/docs/source/rloo_trainer.md
@@ -276,7 +276,7 @@ The [Reinforce++](https://hijkzzz.notion.site/reinforce-plus-plus) report by Jia
 - Clipping rewards: limiting reward values within a specific range to mitigate the impact of extreme rewards on model updates, thus preventing gradient explosion
 - Normalizing rewards: scaling rewards to have a mean of 0 and a standard deviation of 1, which helps in stabilizing the training process
 - Normalizing advantages: scaling advantages to have a mean of 0 and a standard deviation of 1, which helps in stabilizing the training process
-- Using token-level KL penalty that  vs. sequence-level KL penalty (default)
+- Using token-level KL penalty that is defined as equation (1) of the report vs. sequence-level KL penalty (default)
 
 These options are available via the appropriate arguments in the [`RLOOConfig`] class.
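
The four options listed in this hunk can be illustrated with a minimal pure-Python sketch. It is not the TRL implementation: the function names, the `clip_range` and `kl_coef` parameters, and the `eps` stabilizer are hypothetical, chosen only for illustration. The token-level variant assumes equation (1) of the report places the sequence score on the final token and subtracts a scaled KL term at every token.

```python
import math


def clip_rewards(rewards, clip_range):
    """Clamp each reward into [-clip_range, clip_range] (hypothetical helper)."""
    return [max(-clip_range, min(clip_range, r)) for r in rewards]


def normalize(values, eps=1e-8):
    """Shift and scale values to mean 0 and (population) std of about 1.

    The same helper can be applied to rewards or to advantages.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]


def sequence_level_reward(score, per_token_kl, kl_coef):
    """One scalar per sequence: score minus the KL summed over all tokens."""
    return score - kl_coef * sum(per_token_kl)


def token_level_rewards(score, per_token_kl, kl_coef):
    """One reward per token: -kl_coef * KL everywhere, score added on the last token.

    Assumption: this mirrors the shape of equation (1) in the Reinforce++ report.
    """
    rewards = [-kl_coef * kl for kl in per_token_kl]
    rewards[-1] += score
    return rewards
```

Under these assumptions the token-level rewards sum to the sequence-level reward, so the two variants differ in how credit is distributed across tokens rather than in the total penalty.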