Question About Gradient Synchronization During the Accelerate Process #3285
Comments
From what I've read and seen, it should be equivalent/fine. However, for gradient synchronization, we're explicitly avoiding the wrapped
Asking around internally to get a solid answer on this, because it's never been one I've looked into for an exact answer before.
The answer is that it's quite a bit more complex in the end. If we are not doing gradient accumulation, you need to do something like this:

    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        # Forward through the unwrapped module, bypassing DDP's own forward
        outputs = model.module.model(x)
        # Manually do the pre-backward bookkeeping that DDP's forward would normally handle
        model.reducer.prepare_for_backward([])
        model._clear_grad_buffer()
        loss = criterion(outputs, y.unsqueeze(1))
        loss.backward()
        optimizer.step()
        # Average the detached loss across processes for logging
        loss = gather(loss.detach()).mean()
        state.print(loss)

If we are, there are added checks we need to do, and it requires really digging into the hidden calls inside of

So the answer is no, it is not equivalent at all, because you're explicitly avoiding what DDP does under the hood. Is this in conjunction with gradient accumulation or no?
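To make the point above concrete, here is a toy, pure-Python sketch of why a forward pass through `.module` can skip synchronization. It is only an illustration under simplifying assumptions and not how PyTorch implements DDP: `ToyModule`, `ToyDDP`, `sync_armed`, and `backward_all` are all hypothetical names, and the "all-reduce" is just an average over two in-process replicas.

```python
# Toy model of the idea only; this is NOT PyTorch's actual DDP implementation.

class ToyModule:
    """Stand-in for the user's model: y = w * x."""
    def __init__(self, w):
        self.w = w
        self.grad = 0.0
        self.last_x = 0.0

    def __call__(self, x):
        self.last_x = x              # remember the input for the toy backward
        return self.w * x


class ToyDDP:
    """Hypothetical wrapper whose forward arms gradient averaging."""
    def __init__(self, module):
        self.module = module
        self.sync_armed = False

    def __call__(self, x):
        self.sync_armed = True       # bookkeeping that a direct .module call skips
        return self.module(x)


def backward_all(ranks):
    """Set grad = d(w*x)/dw = x on each rank; average only if every rank armed sync."""
    for r in ranks:
        r.module.grad = r.module.last_x
    if all(r.sync_armed for r in ranks):
        mean = sum(r.module.grad for r in ranks) / len(ranks)
        for r in ranks:
            r.module.grad = mean
    for r in ranks:
        r.sync_armed = False         # reset for the next step


def run(use_wrapper):
    ranks = [ToyDDP(ToyModule(1.0)) for _ in range(2)]
    inputs = [2.0, 4.0]              # each rank sees a different batch
    for r, x in zip(ranks, inputs):
        (r if use_wrapper else r.module)(x)
    backward_all(ranks)
    return [r.module.grad for r in ranks]


synced = run(use_wrapper=True)       # forward through the wrapper
unsynced = run(use_wrapper=False)    # forward through .module
```

In this toy, the wrapper path leaves both replicas with the same averaged gradient, while the `.module` path leaves each replica with its own local gradient, so the replicas would drift apart after the optimizer step. The manual `prepare_for_backward` call in the snippet above plays roughly the role that `sync_armed = True` plays here.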
Thank you for your patient response! I can simplify my current situation: the model I pass to However, during the training phase, I am trying to compute the loss using only the forward pass of one of the models. This forces me to use So, would a better solution be as follows:
Thank you again for your help.
Hello, I have some questions regarding gradient synchronization that I hope you can help clarify.
In distributed training, we use model = accelerator.prepare(model) to wrap the model.
According to the documentation, we should use the wrapped model from accelerator for forward propagation. However, due to some project constraints, I might not directly use loss = model(inputs) during the forward pass, but instead use loss = model.module(inputs).
I would like to know if this will affect gradient synchronization when using accelerator.backward(loss). Or, when updating parameters, is it essentially equivalent to using loss = model(inputs) even if I use loss = model.module(inputs)?
Thank you for your help.