Refactoring FSDP. #1586

Merged: 5 commits, Dec 26, 2023

AdamLouly
Contributor

We are refactoring the FSDP implementation.

Fortunately, there aren't many changes required, as most of the modifications are in the Transformers Trainer. We don't need to duplicate those changes in the Optimum Trainer.

@JingyaHuang JingyaHuang added training gpu-test trigger GPU tests labels Dec 12, 2023
@JingyaHuang
Contributor

Hi @AdamLouly, thanks for the PR 🙏.

The changes look good to me. The CI for ORT Training failed. Within it, nightly_test_trainer.py passed, which means that the ORTTrainer API is working, whereas nightly_test_examples.py failed, so some training examples probably need to be updated as well.

Would you mind updating these training examples as well? Otherwise, we can get this PR in and open a new PR to update them.

cc. @prathikr

@JingyaHuang
Contributor

I think we can get this fix in and plan another PR for updating those ORT training examples. Hi @AdamLouly, could you rebase the branch? Other maintainers have fixed some of the failing CI jobs.

@JingyaHuang
Contributor

With the commits I submitted, the examples run well in the Docker container I built locally, but the CI failed with the following error:

Error: 2-26 13:55:32,610] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 280) of binary: /home/onnxruntimedev/miniconda3/bin/python

I will investigate further, but it's unrelated to the fix, so I will merge it anyway.

Thanks again for the contribution @AdamLouly !
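As a side note on the log line quoted above: a negative exit code in this kind of report conventionally means the worker process was terminated by a signal, and signal 11 is SIGSEGV (a segmentation fault). The snippet below is only a minimal sketch for decoding such a code with the standard library. The helper name `describe_exit_code` is hypothetical and is not part of PyTorch or Optimum.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a process exit code into a short human-readable description.

    Hypothetical helper for illustration only. It assumes the common
    convention that a negative exit code -N means "killed by signal N".
    """
    if exitcode < 0:
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            # The number does not map to a known signal on this platform.
            name = f"unknown signal {-exitcode}"
        return f"killed by {name}"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    # The CI log above reports exitcode: -11.
    print(describe_exit_code(-11))
```

A segmentation fault usually originates in native code rather than in the Python layer, which fits the observation that the same examples ran in a locally built container.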


@JingyaHuang JingyaHuang left a comment

LGTM

@JingyaHuang JingyaHuang merged commit 5017d06 into huggingface:main Dec 26, 2023
37 of 46 checks passed
echarlaix pushed a commit that referenced this pull request Jan 19, 2024
* refactor fsdp

* add trainer

* remove hidden layers

* update dockerfile

---------

Co-authored-by: Adam Louly <[email protected]@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: JingyaHuang <[email protected]>