
Embedding grouper distinguish between prefetched vs non-prefetched table #1859

Closed

Conversation


@levythu levythu commented Apr 9, 2024

Summary:
If `prefetch_pipeline` is set as a fused parameter, the training pipeline will try to call `prefetch()` in a separate stream one batch ahead of time. Unfortunately, this process consumes a large amount of extra memory: at peak, roughly 8~9x the size of the input tensor.

Therefore, we want to make the input tensor to the `prefetch()` call as small as possible. To achieve that, we avoid grouping tables that require prefetch into the same TBE as tables that do not.
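The idea can be sketched with a minimal, hypothetical grouping function (the names and key structure below are illustrative, not TorchRec's actual grouping API): adding a "needs prefetch" flag to the grouping key guarantees that cache-offloaded tables never share a TBE with uncached ones, so the `prefetch()` input covers only the offloaded tables' batch.

```python
from collections import defaultdict

def group_tables(tables):
    """Group embedding tables into TBE groups.

    `tables` is a list of dicts such as
      {"name": "t0", "dtype": "fp16", "pooling": "sum", "needs_prefetch": True}
    The grouping key includes `needs_prefetch`, so tables that require
    prefetch land in separate groups from tables that do not, even when
    every other grouping attribute matches.
    Returns a dict mapping the grouping key to the table names in that group.
    """
    groups = defaultdict(list)
    for t in tables:
        key = (t["dtype"], t["pooling"], t["needs_prefetch"])
        groups[key].append(t["name"])
    return dict(groups)

tables = [
    {"name": "t0", "dtype": "fp16", "pooling": "sum", "needs_prefetch": True},
    {"name": "t1", "dtype": "fp16", "pooling": "sum", "needs_prefetch": False},
    {"name": "t2", "dtype": "fp16", "pooling": "sum", "needs_prefetch": True},
]
# t0 and t2 form one TBE group, t1 its own, despite identical dtype/pooling.
grouped = group_tables(tables)
```

Without the `needs_prefetch` component in the key, all three tables would be fused into a single TBE, and `prefetch()` would have to be fed the whole input batch rather than just the offloaded tables' slice.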

This diff does not change behavior for jobs without cached embedding offloading.

For embedding-offloaded jobs, this diff slightly decreases the performance of the TBE lookup, since it results in more TBEs (and subsequently more kernels in the forward and backward passes), but it greatly increases memory efficiency.

Differential Revision: D55901328

@facebook-github-bot added the CLA Signed label on Apr 9, 2024. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D55901328

