
Question about actor_criterion #9

Open
vuthede opened this issue Mar 28, 2023 · 3 comments

Comments


vuthede commented Mar 28, 2023

Hi @ethanyanjiali,
Thanks for your great repo. I enjoy reading your code.
In the fit function of the PPOTrainer class in trainers.py, when you calculate the loss for the actor:

    actor_loss = self.actor_criterion(curr_actor_log_probs,
                                      experience.actor_log_probs,
                                      experience.advantage,
                                      experience.action_mask)

it seems like curr_actor_log_probs and experience.actor_log_probs are the same here:

  • The curr_actor_log_probs is calculated by

        self.actor.forward_actor(
            experience.completion, experience.attention_mask, experience.num_actions)

  • The experience.actor_log_probs is calculated by the same function call inside self.make_experience, with the same input values (experience.completion, experience.attention_mask, experience.num_actions).

Could you please double-check that? Maybe I misunderstood a step? A small self-contained check of what I mean is sketched below.
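Here is a tiny self-contained check of what I mean (a toy linear "actor" and random inputs, not the repo's code): repeating the same forward pass on the same inputs before any optimizer step gives numerically identical log-probs, so the only difference between the two tensors is whether a gradient graph is attached.

    # Toy illustration, not the repo's code: stand-in actor and inputs.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    actor = nn.Linear(8, 4)        # stand-in for self.actor.forward_actor
    tokens = torch.randn(2, 8)     # stand-in for (completion, attention_mask, num_actions)

    with torch.no_grad():          # "make_experience" pass: values only, no gradients kept
        old_log_probs = torch.log_softmax(actor(tokens), dim=-1)

    # "fit" pass over the same inputs, this time with gradients attached
    curr_log_probs = torch.log_softmax(actor(tokens), dim=-1)

    print(old_log_probs.requires_grad, curr_log_probs.requires_grad)  # False True
    print(torch.allclose(curr_log_probs, old_log_probs))              # True
    print((curr_log_probs - old_log_probs).abs().max().item())        # 0.0, so exp(0) gives a ratio of exactly 1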

ethanyanjiali commented Mar 29, 2023

Good catch! This part is a simplified version of PPO: I only take one sample at a time, which makes it a strictly on-policy method. The ratio between the new and old policies is then 1, and the clamping in the policy loss isn't really useful. But I need to save memory when I make experiences, so I can't just keep the gradients from the first pass.
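For context, a generic sketch of the clipped surrogate (not necessarily the exact actor_criterion in this repo, and the action mask is omitted): when the current and old log-probs coincide, the ratio is exactly 1, so the min/clamp never changes the objective.

    # Generic PPO clipped policy loss (sketch; the repo's actor_criterion may differ in details).
    import torch

    def clipped_policy_loss(curr_log_probs, old_log_probs, advantage, eps=0.2):
        # ratio of new to old policy probabilities; exactly 1 when the two log-prob tensors coincide
        ratio = torch.exp(curr_log_probs - old_log_probs)
        surr1 = ratio * advantage
        surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
        # when ratio == 1, surr1 == surr2, so the min/clamp is a no-op
        return -torch.min(surr1, surr2).mean()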

In a full-fledged version of PPO (from my naive understanding), you first generate many samples (experiences), and then use those experiences to update the actor. Because the samples were generated before the model updates, you now have to deal with the off-policy issue, and that is when the trust-region clamping (the clipped surrogate objective) becomes useful. So this is more about improving sample efficiency.
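A rough outline of that collect-then-update loop (method names like make_experience, update_actor, update_critic and the buffer/sampler here are placeholders, not this repo's API): the data becomes off-policy after the first optimizer step, which is when the clipped ratio starts doing real work.

    # Illustrative outline only; method names are placeholders, not the repo's API.
    import random

    def train_iteration(trainer, prompts, ppo_epochs=4, batch_size=8):
        # 1) roll out experiences with the current (soon-to-be "old") policy
        buffer = [trainer.make_experience(p) for p in prompts]

        # 2) reuse them for several epochs of minibatch updates; after the first
        #    optimizer step the samples are off-policy, so the clipped ratio in
        #    the policy loss actually constrains the update
        for _ in range(ppo_epochs):
            random.shuffle(buffer)
            for start in range(0, len(buffer), batch_size):
                minibatch = buffer[start:start + batch_size]
                trainer.update_actor(minibatch)   # placeholder: recompute curr log-probs, apply clipped loss
                trainer.update_critic(minibatch)  # placeholder: value-function update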

Again, thanks for the suggestion. I will update this part of the code to make multiple experiences and add a sampler in the next few days.


Update:
Also, if anyone wants to help update this, I'll be more than happy to review and test. I probably won't have time until Sunday to work on it.

@ethanyanjiali

Looks like the DeepSpeed team released a better implementation recently:
https://github.com/microsoft/DeepSpeedExamples/blob/d570b2cc8a8fd4207c9424744669437d4c68ec43/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py#L425-L426


vuthede commented Apr 16, 2023

Nice! I will look at that too.
By the way, thanks again. I used your code to study how the training works, and it helped me understand the flow of training ChatGPT!
