Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create _preprare_fsdp to pre- prepare fsdp model training #3213

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

eljandoubi
Copy link
Contributor

Fixes #3209

Who can review?

@eljandoubi eljandoubi changed the title creeate _preprare_fsdp to pre- prepare fsdp model training create _preprare_fsdp to pre- prepare fsdp model training Nov 7, 2024
@eljandoubi
Copy link
Contributor Author

@muellerzr have you any feedback?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Member

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this important issue.

I wonder on the overall approach to solving this issue. Right now, IIUC, the idea is to re-initialize the optimizer(s) using the FSDP-wrapped models. This could potentially be error prone and ideally should be well tested before we merge.

Would it be possible instead to check if this has happened and raise an error, then direct users towards not initializing the optimizer prematurely, and handle the initialization with accelerate so that it's in the right order? Maybe that can't work, but I think it's worth considering different solutions.

Also, in case this is really the same issue as this one reported in PEFT, it means that it used to work correctly in previous versions of accelerate/transformers. In that case, I wonder what changed that resulted in the issue.

Comment on lines +2094 to +2105
# Validate the presence of models and optimizers
if not models and not optimizers:
return args

# Flattening weights implies that the optimizers have already been processed.
if next(next(iter(models.values())).named_parameters())[0].endswith("_flat_param"):
return args

if len(models) != len(optimizers):
raise ValueError(
f"The number of models ({len(models)}) must be equal to the number of optimizers ({len(optimizers)})."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to move these checks to the very start of the method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenjaminBossan The method! Do you mean .prepare? What are the benefits of doing so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant _prepare_fsdp. It is a common pattern to perform all checks as early as possible, so following this makes the code easier to understand for readers. This is especially so if there are early returns.

The reason why it's common is so that we don't do any unnecessary work if the checks fail anyway. In this case, there is no need to determine models or optimizers if we're going to raise an error later. By skipping the unnecessary work, we ensure faster execution and prevent possibly unwanted side-effects (this might not be relevant here right now but code will change in the future and then it could be true).

# Clear parameter lists.
for opt in optimizers.values():
for group in opt.param_groups:
group["params"] = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to call group["params"].clear()? That would affect references.

@eljandoubi
Copy link
Contributor Author

@BenjaminBossan

  • You are right; the solution needs more thorough testing. I have conducted a small test in a multi-GPU environment, but it should be tested against other configurations and models.
  • Raising an error when the optimizer is initialized prematurely could be a solution, but two cases emerge:
    1. If Accelerate initializes the optimizer internally, either it creates a default optimizer with limited flexibility, or the user provides optimizer specifications. This would require external special treatment for FSDP, causing Accelerate to lose its unified interface for distributed training.
    2. If the user initializes the optimizer post-FSDP setup, Accelerate also loses generalization in this case.
  • For PEFT training, which involves mixing frozen and non-frozen parameters, use_orig_params=True must be used. This ensures non-flattened parameters, meaning no changes are needed in the optimizer (I assume).
  • The Transformers Trainer internally uses delay_optimizer_creation and creates the optimizer after FSDP wrapping.

Copy link
Member

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates and explaining your testing and reasoning further.

Raising an error when the optimizer is initialized prematurely could be a solution, but two cases emerge

These are valid concerns. At the end of the day, it's a trade off and we need to decide which cost we'd rather pay.

The Transformers Trainer internally uses delay_optimizer_creation and creates the optimizer after FSDP wrapping.

I wonder if this logic could be moved/copied to accelerate.

For PEFT training, which involves mixing frozen and non-frozen parameters, use_orig_params=True must be used

I don't think this is strictly necessary, we have examples in PEFT with use_orig_params=False. But I have to admit I don't know what exactly changes under the hood in FSDP when this parameter is set. Note also that in the linked PEFT issue, I tried setting use_orig_params=True and it didn't help.

Comment on lines +2094 to +2105
# Validate the presence of models and optimizers
if not models and not optimizers:
return args

# Flattening weights implies that the optimizers have already been processed.
if next(next(iter(models.values())).named_parameters())[0].endswith("_flat_param"):
return args

if len(models) != len(optimizers):
raise ValueError(
f"The number of models ({len(models)}) must be equal to the number of optimizers ({len(optimizers)})."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant _prepare_fsdp. It is a common pattern to perform all checks as early as possible, so following this makes the code easier to understand for readers. This is especially so if there are early returns.

The reason why it's common is so that we don't do any unnecessary work if the checks fail anyway. In this case, there is no need to determine models or optimizers if we're going to raise an error later. By skipping the unnecessary work, we ensure faster execution and prevent possibly unwanted side-effects (this might not be relevant here right now but code will change in the future and then it could be true).

@eljandoubi
Copy link
Contributor Author

eljandoubi commented Nov 26, 2024

  • The PyTorch FSDP documentation mentions that mixed frozen and non-frozen parameters are only supported when use_orig_params=True. I believe this is due to the flattening process — we cannot flatten parameters with gradients together with those without. My contribution does not handle the case where use_orig_params=True because, in that scenario, the parameters remain unchanged.
  • Another solution would be to adapt an approach similar to the DeepSpeed integration. This involves using a single model and optimizer, and after wrapping the model with model = FSDP(model), we create the optimizer as optimizer = optimizer_class(model.parameters(), **defaults). See Select the DeepSpeedCPUOptimizer based on the original optimizer class. #3255

@BenjaminBossan
Copy link
Member

BenjaminBossan commented Nov 28, 2024

The PyTorch FSDP documentation mentions that mixed frozen and non-frozen parameters are only supported when use_orig_params=True. I believe this is due to the flattening process — we cannot flatten parameters with gradients together with those without.

This is true but using the right FSDP auto wrap policy, trainable and frozen parameters should be prevented from being mixed.

Another solution would be to adapt an approach similar to the DeepSpeed integration.

If this is an option here, it sounds like the more robust solution to me. Let's wait for @muellerzr return to office and get his opinion on this.

@muellerzr
Copy link
Collaborator

This feels a bit overengineered and the commit to transformers broke a lot of users code with saving checkpoints. For now let's not move forward with this IMO, since otherwise this would imply that the prior change broke all FSDP code, which isn't the case.

@BenjaminBossan
Copy link
Member

@eljandoubi Could you please check if huggingface/transformers#35212 solved the issue (installing transformers 4.47.1 should contain the fix)?

@eljandoubi
Copy link
Contributor Author

@BenjaminBossan Since the trainer delays the creation of the optimizer, this resolves the issue for transformers. On the other hand, the core concept of accelerate is to pass the model, optimizer, dataloader, and other components simultaneously.

@BenjaminBossan
Copy link
Member

@eljandoubi Right. As another user noted, even with Trainer, it can still be problematic if users pass their own optimizer instance. I'm not sure if we can find a simpler solution, I think it's worth evaluating the possibilities. @muellerzr do you have a better idea for this situation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The optimizer is not receiving the FSDP model parameters.
4 participants