
Question about actor_criterion #9

Open
vuthede opened this issue Mar 28, 2023 · 3 comments

Comments


vuthede commented Mar 28, 2023

Hi @ethanyanjiali,
Thanks for your great repo. I enjoy reading your code.
In the fit function of the PPOTrainer class in trainers.py, when you calculate the loss for the actor:

    actor_loss = self.actor_criterion(curr_actor_log_probs,
                                      experience.actor_log_probs,
                                      experience.advantage,
                                      experience.action_mask)

it seems like curr_actor_log_probs and experience.actor_log_probs are the same here:

  • The curr_actor_log_probs is calculated by

        self.actor.forward_actor(
            experience.completion, experience.attention_mask, experience.num_actions)

  • The experience.actor_log_probs is calculated by the same function call inside self.make_experience, with the same input values (experience.completion, experience.attention_mask, experience.num_actions).

Could you please double-check that? Maybe I misunderstood a step? A small self-contained check of what I mean is sketched below.
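Here is a tiny self-contained check of what I mean (a toy linear "actor" and random inputs, not the repo's code): repeating the same forward pass on the same inputs before any optimizer step gives numerically identical log-probs, so the only difference between the two tensors is whether a gradient graph is attached.

    # Toy illustration, not the repo's code: stand-in actor and inputs.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    actor = nn.Linear(8, 4)        # stand-in for self.actor.forward_actor
    tokens = torch.randn(2, 8)     # stand-in for (completion, attention_mask, num_actions)

    with torch.no_grad():          # "make_experience" pass: values only, no gradients kept
        old_log_probs = torch.log_softmax(actor(tokens), dim=-1)

    # "fit" pass over the same inputs, this time with gradients attached
    curr_log_probs = torch.log_softmax(actor(tokens), dim=-1)

    print(old_log_probs.requires_grad, curr_log_probs.requires_grad)  # False True
    print(torch.allclose(curr_log_probs, old_log_probs))              # True
    print((curr_log_probs - old_log_probs).abs().max().item())        # 0.0, so exp(0) gives a ratio of exactly 1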

ethanyanjiali commented Mar 29, 2023

Good catch! This part is a simplified version of PPO: I only take one sample at a time, which makes it a strictly on-policy method. The ratio between the new and old policies is then 1, and the clamping in the policy loss isn't really useful. But I need to save memory when I make experiences, so I can't just keep the gradients from the first pass.
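For context, a generic sketch of the clipped surrogate (not necessarily the exact actor_criterion in this repo, and the action mask is omitted): when the current and old log-probs coincide, the ratio is exactly 1, so the min/clamp never changes the objective.

    # Generic PPO clipped policy loss (sketch; the repo's actor_criterion may differ in details).
    import torch

    def clipped_policy_loss(curr_log_probs, old_log_probs, advantage, eps=0.2):
        # ratio of new to old policy probabilities; exactly 1 when the two log-prob tensors coincide
        ratio = torch.exp(curr_log_probs - old_log_probs)
        surr1 = ratio * advantage
        surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
        # when ratio == 1, surr1 == surr2, so the min/clamp is a no-op
        return -torch.min(surr1, surr2).mean()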

In a full-fledged version of PPO (from my naive understanding), you first generate many samples (experiences), and then use those experiences to update the actor. Because the samples were generated before the model updates, you now have to deal with the off-policy issue, and that is when the trust-region clamping (the clipped surrogate objective) becomes useful. So this is more about improving sample efficiency.
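A rough outline of that collect-then-update loop (method names like make_experience, update_actor, update_critic and the buffer/sampler here are placeholders, not this repo's API): the data becomes off-policy after the first optimizer step, which is when the clipped ratio starts doing real work.

    # Illustrative outline only; method names are placeholders, not the repo's API.
    import random

    def train_iteration(trainer, prompts, ppo_epochs=4, batch_size=8):
        # 1) roll out experiences with the current (soon-to-be "old") policy
        buffer = [trainer.make_experience(p) for p in prompts]

        # 2) reuse them for several epochs of minibatch updates; after the first
        #    optimizer step the samples are off-policy, so the clipped ratio in
        #    the policy loss actually constrains the update
        for _ in range(ppo_epochs):
            random.shuffle(buffer)
            for start in range(0, len(buffer), batch_size):
                minibatch = buffer[start:start + batch_size]
                trainer.update_actor(minibatch)   # placeholder: recompute curr log-probs, apply clipped loss
                trainer.update_critic(minibatch)  # placeholder: value-function update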

Again, thanks for the suggestion. I will update this part of the code to make multiple experiences and add a sampler in the next few days.


Update:
Also, if anyone wants to help update this, I'll be more than happy to review and test. I probably won't have time until Sunday to work on it.

@ethanyanjiali

Looks like the DeepSpeed team released a better implementation recently:
https://github.com/microsoft/DeepSpeedExamples/blob/d570b2cc8a8fd4207c9424744669437d4c68ec43/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py#L425-L426


vuthede commented Apr 16, 2023

Nice! I will look at that too.
By the way, thanks again. I used your code to study how the training works, and it helped me understand the flow of training ChatGPT!
