Question about actor_criterion
#9
Comments
Good catch! This part is a simplified version of PPO: basically I only take one sample at a time, which makes it a strictly on-policy method. Then the ratio between the new and old log-probs is 1, and the clamping in the policy loss isn't really useful. But I need to save memory when I make experiences, so I can't just keep the gradients from the first pass. In the full-fledged version of PPO (from my naive understanding), you generate many samples (experiences) first and then use those experiences to update the actor. Because those samples were generated before the model was updated, we now have to deal with the off-policy issue, and that is when the trust-region clamping (the clipped surrogate objective) becomes useful. So this is more about improving sample efficiency. Again, thanks for the suggestion. I will update this part of the code to make multiple experiences and add a sampler in the next few days. Updates:
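For reference, here is a minimal sketch of the clipped surrogate objective being discussed (the function name and `clip_eps` are illustrative, not taken from this repo). It shows why the clamp is inert in the strictly on-policy case: if the experience is generated and consumed with the same weights, the two log-prob tensors match, so the ratio is exactly 1.

```python
import torch

def ppo_actor_loss(curr_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the current policy and the policy that
    # generated the experience; with a single on-policy pass the two
    # log-prob tensors are identical, so the ratio is 1 everywhere.
    ratio = torch.exp(curr_log_probs - old_log_probs)
    # Clipped surrogate objective: take the pessimistic minimum of the
    # unclipped and clipped terms, negated to form a loss to minimize.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```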
Looks like the DeepSpeed team released a better implementation recently.
Nice! I will look at that too.
Hi @ethanyanjiali,
Thanks for your great repo. I enjoy reading your code.
In the `fit` function of the `PPOTrainer` class in the file `trainers.py`, when you calculate the loss for the actor, it seems like the `curr_actor_log_probs` and the `experience.actor_log_probs` are the same here? `curr_actor_log_probs` is calculated by this, and `experience.actor_log_prob` is calculated by the same function call inside the `self.make_experience` function with the same input values (`experience.completion`, `experience.attention_mask`, `experience.num_actions`). Could you please double-check that? Maybe I misunderstood some step?
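For what it's worth, here is a self-contained toy sketch of both sides of this discussion (a tiny linear policy stands in for the actor; names like `action_log_probs` are made up and not from `trainers.py`). When experiences are generated and consumed with the same weights, the old and current log-probs are identical and the ratio is 1; once several update steps are run over the same stored experiences, they diverge and the clipping starts to matter.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, hidden, batch = 8, 16, 32

# Toy stand-in for the actor; the real repo uses a language model.
policy = torch.nn.Linear(hidden, vocab)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def action_log_probs(states, actions):
    logits = policy(states)
    return F.log_softmax(logits, dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)

# Rollout phase (analogous to make_experience): old log-probs are frozen here.
states = torch.randn(batch, hidden)
with torch.no_grad():
    actions = torch.distributions.Categorical(logits=policy(states)).sample()
    old_log_probs = action_log_probs(states, actions)
advantages = torch.randn(batch)  # placeholder advantages

# Update phase (analogous to fit): several passes over the same experiences.
for step in range(3):
    curr_log_probs = action_log_probs(states, actions)
    ratio = torch.exp(curr_log_probs - old_log_probs)
    # On the very first pass the weights have not moved, so ratio == 1 and
    # clipping is inert; after the first optimizer step the two diverge.
    print(step, torch.allclose(ratio, torch.ones_like(ratio)))
    clipped = torch.clamp(ratio, 0.8, 1.2)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```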