
L2 residency control helper functions #197

Closed

Conversation

@sleeepyjack sleeepyjack (Collaborator) commented Aug 2, 2022

This PR adds helper functions that allow for pinning a region of global memory into the L2 cache (L2$).

Cherry-picked from #101 so it can be reviewed separately.

Also related: #101 (comment)
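
For illustration, a minimal usage sketch of the helper under review. The register_l2_persistence signature is taken from the diff below; the surrounding setup, the cuco:: namespace, and the header path are assumptions:

// Hypothetical usage sketch; namespace and header path are assumed.
#include <cuco/detail/cache_residency_control.cuh>
#include <thrust/device_vector.h>

int main()
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  thrust::device_vector<int> filter_storage(1'000'000);
  int* ptr = filter_storage.data().get();

  // Pin the storage into the persisting L2 carve-out: by default, 60% of
  // accesses are treated as persisting, using the full carve-out.
  cuco::register_l2_persistence(stream, ptr, ptr + filter_storage.size());

  // ... launch work on `stream` that benefits from L2 residency ...

  cudaStreamDestroy(stream);
}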

Comment on lines 37 to 39
template <typename Iterator>
void register_l2_persistence(
cudaStream_t& stream, Iterator begin, Iterator end, float hit_rate = 0.6f, float carve_out = 1.0f)
@sleeepyjack (Collaborator Author)

For now we use an iterator pair and assume it represents a contiguous memory segment. This is going to look a lot cleaner once we can use span.
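
A hedged sketch of what the span-based signature could look like (illustrative only, not part of this PR; cuda::std::span availability depends on the libcu++ version):

// Hypothetical future overload: a span replaces the iterator pair and
// makes the contiguity requirement explicit in the type.
template <typename T>
void register_l2_persistence(
  cudaStream_t& stream, cuda::std::span<T> window, float hit_rate = 0.6f, float carve_out = 1.0f);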

(Collaborator)

Can you include a description of why this is necessary vs. annotated_ptr?

(Member)

Could you place stream as the last parameter? Not necessary, but try to be consistent with other cuco APIs.

@sleeepyjack (Collaborator Author)

The most elegant solution to this problem would be to wrap the storage of the filter as an annotated_ptr. However, this first requires a solution to the annotated_ptr + cuda::atomic/cuda::atomic_ref problem, which is not available yet.

The next best approach would be to set the access property manually using cuda::apply_access_property, which I implemented in the previous version of the benchmark script:

#include <cooperative_groups.h>
#include <cuda/annotated_ptr>   // cuda::apply_access_property, cuda::access_property
#include <nvbench/nvbench.cuh>  // nvbench::int64_t

// Hint the L2 cache lines backing [ptr, ptr + size) as persisting.
template <typename T>
__global__ void pin_memory(T* ptr, nvbench::int64_t size)
{
#if defined(CUCO_HAS_CUDA_ANNOTATED_PTR)
  auto g = cooperative_groups::this_grid();
  for (nvbench::int64_t idx = g.thread_rank(); idx < size; idx += g.size())
    cuda::apply_access_property(ptr + idx, sizeof(T), cuda::access_property::persisting{});
#endif
}

// Revert the access property of the same range back to normal.
template <typename T>
__global__ void unpin_memory(T* ptr, nvbench::int64_t size)
{
#if defined(CUCO_HAS_CUDA_ANNOTATED_PTR)
  auto g = cooperative_groups::this_grid();
  for (nvbench::int64_t idx = g.thread_rank(); idx < size; idx += g.size())
    cuda::apply_access_property(ptr + idx, sizeof(T), cuda::access_property::normal{});
#endif
}

This, however, showed little to no performance impact, and I have not been able to get to the bottom of why. Another disadvantage of this method is that cuda::apply_access_property does not allow hybrid access policies, e.g., 60% cuda::access_property::persisting + 40% cuda::access_property::normal, which showed the best performance in the benchmarks.

tl;dr This is a workaround that is flexible enough for our purposes (can also be used with other cuco map types) but should be replaced once annotated_ptr works as expected.
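
For context, the hybrid policy described above maps onto CUDA's per-stream access-policy window, which is presumably what the helper wraps. A minimal sketch of that mechanism (the function and variable names are assumptions, not the PR's code):

#include <cuda_runtime.h>
#include <cstddef>

// Hedged sketch: declare a window over [ptr, ptr + num_bytes) in which
// hit_rate of the accesses are treated as persisting and the rest as normal.
void set_persistence_window(cudaStream_t stream, void* ptr, std::size_t num_bytes, float hit_rate)
{
  // Reserve the maximum persisting L2 carve-out on the current device.
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, prop.persistingL2CacheMaxSize);

  cudaStreamAttrValue attr{};
  attr.accessPolicyWindow.base_ptr  = ptr;
  attr.accessPolicyWindow.num_bytes = num_bytes;  // must not exceed prop.accessPolicyMaxWindowSize
  attr.accessPolicyWindow.hitRatio  = hit_rate;   // e.g. 0.6f -> 60% persisting, 40% normal
  attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
  attr.accessPolicyWindow.missProp  = cudaAccessPropertyNormal;
  cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}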

@sleeepyjack (Collaborator Author)

> Could you place stream as the last parameter? Not necessary, but try to be consistent with other cuco APIs.

Yes, I can do that for the sake of consistency. I initially put it in front since it's an in/out parameter, i.e., the stream's attributes are mutated.

However, this reminds me that I find the placement of the stream parameter in our API unfortunate. I would put the stream before the hashers and equality operators, since you are likely to set it explicitly more often than the others. It also has a shorter default value than the others.

@sleeepyjack sleeepyjack (Collaborator Author) Aug 2, 2022

Wait, I can't put the stream at the end since there are additional defaulted parameters.
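
A short illustration of the constraint (signature abbreviated and stream taken by value for clarity; not the actual code):

// C++ forbids a non-defaulted parameter after defaulted ones, so moving
// stream behind hit_rate and carve_out forces a default on it as well:
template <typename Iterator>
void register_l2_persistence(
  Iterator begin, Iterator end, float hit_rate = 0.6f, float carve_out = 1.0f,
  cudaStream_t stream = 0);

// ...and any caller that only wants to override the stream must then
// spell out the preceding defaults:
// register_l2_persistence(first, last, 0.6f, 1.0f, my_stream);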

@sicario001 sicario001 commented Nov 17, 2023

> [quotes @sleeepyjack's earlier reply above in full: the annotated_ptr blocker, the pin_memory/unpin_memory kernels, and the tl;dr]

Hey, apologies for bringing up this old PR. I am also testing the cuda::apply_access_property API (following https://leimao.github.io/blog/CUDA-L2-Persistent-Cache/) and, more generally, the prefetch.global.L2::evict_last PTX instruction it uses internally, but I am not able to see any effect from it. However, when I configure L2 persistence for the same scenario using the accessPolicyWindow method of the CUDA API, I do observe performance improvements. Did you figure out the reason for this, or find a possible resolution? Any help would be greatly appreciated.

// Overwrite the access policy attribute of CUDA Stream
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
// Remove any persistent lines in L2$
cudaCtxResetPersistingL2Cache();
@sleeepyjack (Collaborator Author)

This resets the cache globally. Maybe undesirable.
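
For reference, a hedged sketch of the per-stream alternative versus the global reset (based on the CUDA runtime API; stream assumed in scope):

// Per-stream: shrink the access-policy window to zero bytes so future
// accesses in this stream stop being marked persisting (lines already
// resident as persisting are not evicted by this call).
cudaStreamAttrValue attr{};
attr.accessPolicyWindow.num_bytes = 0;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

// Global: evict all persisting lines from L2 for the entire context.
cudaCtxResetPersistingL2Cache();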

@PointKernel PointKernel (Member) left a comment


Should we move the file to include/cuco since it's exposed to the public?

include/cuco/detail/cache_residency_control.cuh

@PointKernel PointKernel added the labels type: feature request (New feature request) and Needs Review (Awaiting reviews before merging) on Aug 2, 2022
@sleeepyjack sleeepyjack (Collaborator Author)

> Should we move the file to include/cuco since it's exposed to the public?

I also had this in mind (that's why I added the doxygen docs), and I would say yes. However, I don't know how broad the range of applications for this functionality really is.
