add mean pooling divisor to cuda stream #1863

Closed
wants to merge 1 commit

Conversation

zainhuda

Summary:
The initial mean pooling implementation did not synchronize with the appropriate CUDA stream in the train pipeline. We now record the divisor tensor on the CUDA stream held in the context.

The key insight: when a tensor is used on a stream other than the one it was created on, the caching memory allocator may reuse its memory unexpectedly.

We also split the callback function into two (create divisor, apply mean pooling) and changed the context to hold the divisor tensor instead of a callable, since recording non-tensor objects on a CUDA stream is non-trivial, whereas recording a tensor is directly supported. This introduces no performance regression relative to the original implementation and no loss of clarity.
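
To illustrate the stream-recording pattern referenced above, here is a minimal sketch, not the actual TorchRec change; the stream handles, tensor shapes, and variable names are assumptions for illustration only:

```python
import torch

# Minimal sketch of the record_stream pattern (illustrative only; not the
# actual TorchRec change; tensor shapes and names are made up).

pooled = torch.randn(8, 16, device="cuda")  # stand-in for pooled embeddings

side_stream = torch.cuda.Stream()
main_stream = torch.cuda.current_stream()

# Create the divisor on a side stream, e.g. ahead of time in the train pipeline.
with torch.cuda.stream(side_stream):
    divisor = torch.full((8, 1), 4.0, device="cuda")  # stand-in for lengths

# The consuming stream must wait for the producing stream's work to finish ...
main_stream.wait_stream(side_stream)
# ... and the tensor must be recorded on the consuming stream so the caching
# allocator does not reuse its memory while that stream still needs it.
divisor.record_stream(main_stream)

mean_pooled = pooled / divisor  # now safe to use on the main stream
```

Recording the divisor tensor itself is what makes this straightforward: `Tensor.record_stream` is a supported PyTorch API, while there is no analogous mechanism for arbitrary callables, which is the motivation given above for changing what the context holds.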

Differential Revision: D55945969

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Apr 10, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D55945969

zainhuda pushed a commit to zainhuda/torchrec that referenced this pull request Apr 10, 2024
Reviewed By: joshuadeng

Differential Revision: D55945969

Labels
CLA Signed, fb-exported
3 participants