Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix calculate_shard_storages to handle optimizer correctly #2652

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tiankongdeguiji
Copy link
Contributor

@tiankongdeguiji tiankongdeguiji commented Dec 23, 2024

The parameter no longer possesses the optimizer_class attribute; instead, it has been updated to optimizer_classes. However, the current implementation of the EmbeddingStorageEstimator still relies on the outdated optimizer_class attribute to determine whether the parameter includes an optimizer. As a result, the EmbeddingStorageEstimator fails to estimate the storage requirements for the optimizer, leading to CUDA out-of-memory errors during training.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 23, 2024
@tiankongdeguiji
Copy link
Contributor Author

Hi, @henrylhtsang @joshuadeng @PaulZhang12 @TroyGarden can you take a look?

@tiankongdeguiji
Copy link
Contributor Author

hi, @sarckk @dstaay-fb can you take a look?

@sarckk sarckk self-assigned this Dec 31, 2024
@facebook-github-bot
Copy link
Contributor

@sarckk has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@sarckk
Copy link
Member

sarckk commented Dec 31, 2024

@tiankongdeguiji thanks for the fix! looks good to me. I'll help you land the changes (EDIT: probably after 1st Jan)

@tiankongdeguiji
Copy link
Contributor Author

@tiankongdeguiji thanks for the fix! looks good to me. I'll help you land the changes

thx!

@tiankongdeguiji
Copy link
Contributor Author

hi, @sarckk can you help me land the changes? Are there any existing issues I should address?

@sarckk
Copy link
Member

sarckk commented Jan 7, 2025

hi, @sarckk can you help me land the changes? Are there any existing issues I should address?

Hi sorry for the delay. This increases HBM and DDR estimates across the board so I'm running extra checks to avoid any regressions internally. I will land it by this week

@tiankongdeguiji
Copy link
Contributor Author

hi, @sarckk can you help me land the changes? Are there any existing issues I should address?

Hi sorry for the delay. This increases HBM and DDR estimates across the board so I'm running extra checks to avoid any regressions internally. I will land it by this week

thx!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants