Question About Gradient Synchronization During the Accelerate Process #3285

Closed
Klein-Lan opened this issue Dec 10, 2024 · 5 comments

@Klein-Lan

Hello, I have some questions regarding gradient synchronization that I hope you can help clarify.

In distributed training, we use model = accelerator.prepare(model) to wrap the model.

According to the documentation, we should use the wrapped model from accelerator for forward propagation. However, due to some project constraints, I might not directly use loss = model(inputs) during the forward pass, but instead use loss = model.module(inputs).

I would like to know if this will affect gradient synchronization when using accelerator.backward(loss). Or, when updating parameters, is it essentially equivalent to using loss = model(inputs) even if I use loss = model.module(inputs)?

Thank you for your help.
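
For readers unfamiliar with the term, here is a minimal sketch of what "gradient synchronization" means in data-parallel training. It is a plain-Python simulation, not PyTorch or Accelerate code: the two "ranks", the one-parameter model y = w * x, and the helper names `local_grad` and `train` are all assumptions made for illustration. When the per-rank gradients are averaged (which is what an all-reduce with mean does), every replica stays identical; when that step is skipped, the replicas drift apart.

```python
# Toy simulation of two data-parallel "ranks" (plain Python, no PyTorch).
# Model: y_hat = w * x, loss = (w * x - y) ** 2, so dloss/dw = 2 * x * (w * x - y).

def local_grad(w, x, y):
    """Gradient of the squared error for one sample on one rank."""
    return 2.0 * x * (w * x - y)


def train(sync, steps=3, lr=0.1):
    """Run a few SGD steps on two replicas, with or without gradient averaging."""
    weights = [0.0, 0.0]              # one replica of w per rank
    data = [(1.0, 2.0), (2.0, 2.0)]   # each rank sees a different (x, y) sample
    for _ in range(steps):
        grads = [local_grad(w, x, y) for w, (x, y) in zip(weights, data)]
        if sync:
            # Stand-in for all-reduce (mean): every rank gets the averaged gradient.
            mean = sum(grads) / len(grads)
            grads = [mean] * len(grads)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


synced = train(sync=True)
unsynced = train(sync=False)
print("with sync:   ", synced)
print("without sync:", unsynced)
```

With `sync=True` both entries of the returned list are equal; with `sync=False` they differ, which is the failure mode the question is worried about.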

@Klein-Lan Klein-Lan changed the title Question about gradient synchronization during the accelerate process Question About Gradient Synchronization During the Accelerate Process Dec 10, 2024
@muellerzr
Collaborator

From what I've read and seen, it should be equivalent/fine. However, for gradient synchronization, we're explicitly bypassing the DistributedDataParallel wrapper, which I believe should still result in the same slowdown/synchronization.

@muellerzr
Collaborator

Asking around internally to get a solid answer on this, because it's never been one I've looked into for an exact answer before.

@muellerzr
Collaborator

The answer is that it's fairly complex in the end. If we are not doing gradient accumulation, you need to do something like this:

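model.train()
for x, y in train_loader:
    optimizer.zero_grad()
    # Call the inner module directly, skipping the DDP wrapper's forward
    outputs = model.module.model(x)
    # Manually prepare DDP's reducer for the coming backward pass (internal API)
    model.reducer.prepare_for_backward([])
    # Private DDP method; availability may depend on the PyTorch version
    model._clear_grad_buffer()

    loss = criterion(outputs, y.unsqueeze(1))
    loss.backward()
    optimizer.step()
    # Average the detached loss across processes for logging
    # (gather and state are not defined here; presumably accelerate.utils.gather and a PartialState)
    loss = gather(loss.detach()).mean()
    state.print(loss)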

If we are doing gradient accumulation, there are additional checks we need to do, and it requires really digging into the hidden calls inside of DistributedDataParallel.

So the answer is no, it is not equivalent at all, because you're explicitly bypassing what DDP does under the hood. Is this in conjunction with gradient accumulation or not?
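
To make "bypassing what DDP does under the hood" concrete, here is a toy sketch in plain Python. `ToyWrapper` and `ToyModule` are hypothetical stand-ins, not the real DistributedDataParallel: the only point is that a wrapper which does bookkeeping inside its own forward call never gets the chance to do it when the caller reaches through `.module`. The real DDP internals are more involved than this single flag.

```python
class ToyModule:
    """Hypothetical inner model: just doubles its input."""

    def __call__(self, x):
        return 2 * x


class ToyWrapper:
    """Hypothetical DDP-style wrapper; NOT the real DistributedDataParallel."""

    def __init__(self, module):
        self.module = module
        self.armed_for_sync = False

    def __call__(self, x):
        out = self.module(x)
        # Bookkeeping that only happens when the wrapper's forward is used.
        self.armed_for_sync = True
        return out

    def backward(self):
        """Report whether this backward pass would have been synchronized."""
        synced = self.armed_for_sync
        self.armed_for_sync = False
        return synced


wrapped = ToyWrapper(ToyModule())

wrapped(3)                        # forward through the wrapper
via_wrapper = wrapped.backward()  # the wrapper prepared for this backward

wrapped.module(3)                 # forward that reaches past the wrapper
via_inner = wrapped.backward()    # the wrapper never saw the forward

print(via_wrapper, via_inner)
```

The manual `prepare_for_backward` call in the snippet above plays roughly the role that setting `armed_for_sync` plays in this toy: it does by hand what the skipped wrapper forward would otherwise have done.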

@Klein-Lan
Author

Klein-Lan commented Dec 11, 2024

Thank you for your patient response!

I can simplify my current situation: the model I pass to accelerator.prepare is actually composed of two smaller models (input -> model1 -> model2 -> output) concatenated together. So the model I pass to accelerator.prepare is a wrapper around these two smaller models.

However, during training, I need to compute a loss using the forward pass of only one of the models, which forces me to call model.module.model1 directly. As a result, I am concerned about potential issues with gradient synchronization. I am not currently using gradient accumulation, though I might in the future.

So, would a better solution be as follows:

model1 = accelerator.prepare(model1)
model2 = accelerator.prepare(model2)

intermediate_results = model1(input)
output = model2(intermediate_results)
loss1 = loss_fn(output, ground_truth)

another_output = model1(input)
loss2 = loss_fn(another_output, another_result)

loss = loss1 + loss2

Thank you again for your help.
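
The thread does not say whether preparing the two models separately is the better route. As a hedged alternative, the sketch below keeps a single combined model and has its forward return both the intermediate result and the final output, so that one call through the wrapper could serve both losses. It rests on an assumption not stated in the thread: that model1 is deterministic for a given input (for example, no dropout), so the second `model1(input)` call would reproduce the intermediate result. The `Combined` class, the lambda stand-ins, and `squared_error` are illustrative names in plain Python; in a real setup `Combined` would be an `nn.Module` passed to `accelerator.prepare`.

```python
class Combined:
    """Hypothetical container for input -> model1 -> model2 -> output."""

    def __init__(self, model1, model2):
        self.model1 = model1
        self.model2 = model2

    def __call__(self, x):
        intermediate = self.model1(x)
        output = self.model2(intermediate)
        # Return both, so the caller never needs to reach into model1 directly.
        return intermediate, output


def squared_error(pred, target):
    """Toy loss function standing in for loss_fn."""
    return (pred - target) ** 2


# Plain callables stand in for the two sub-models.
model = Combined(model1=lambda x: 3 * x, model2=lambda h: h + 1)

intermediate, output = model(2.0)          # a single forward call
loss1 = squared_error(output, 7.5)         # loss on the full pipeline
loss2 = squared_error(intermediate, 5.0)   # loss on model1's output only
loss = loss1 + loss2
print(loss)
```

Whether this fits depends on the project constraints mentioned in the original question; it is a sketch of one option, not a confirmed recommendation from the maintainers.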


github-actions bot commented Jan 9, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
