L2 residency control helper functions #197
Conversation
template <typename Iterator>
void register_l2_persistence(
  cudaStream_t& stream, Iterator begin, Iterator end, float hit_rate = 0.6f, float carve_out = 1.0f)
For now we use an iterator pair and assume it represents a contiguous memory segment. This is going to look a lot cleaner once we can use span.
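To illustrate the assumption: a short sketch of how an iterator pair maps to the byte window that would be registered. The helper name is hypothetical and not part of cuco; it is only valid under the same contiguity assumption made above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper (not cuco API): derive the byte extent of the window
// described by an iterator pair. Only valid if [begin, end) is contiguous.
template <typename Iterator>
std::size_t l2_window_bytes(Iterator begin, Iterator end)
{
  using value_type = typename std::iterator_traits<Iterator>::value_type;
  return static_cast<std::size_t>(std::distance(begin, end)) * sizeof(value_type);
}
```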
Can you include a description of why this is necessary vs. annotated_ptr?
Could you place stream as the last parameter? Not necessary, but try to be consistent with other cuco APIs.
The most elegant solution to this problem would be to wrap the storage of the filter in an annotated_ptr. However, this first requires a solution to the annotated_ptr + cuda::atomic/cuda::atomic_ref problem, which is not available yet.
The next best approach would be to set the access property manually using cuda::apply_access_property, which I implemented in the previous version of the benchmark script:
cuCollections/benchmarks/bloom_filter/bloom_filter_bench.cu
Lines 199 to 217 in a1ea293
template <typename T>
__global__ void pin_memory(T* ptr, nvbench::int64_t size)
{
#if defined(CUCO_HAS_CUDA_ANNOTATED_PTR)
  auto g = cooperative_groups::this_grid();
  for (int idx = g.thread_rank(); idx < size; idx += g.size())
    cuda::apply_access_property(ptr + idx, sizeof(T), cuda::access_property::persisting{});
#endif
}

template <typename T>
__global__ void unpin_memory(T* ptr, nvbench::int64_t size)
{
#if defined(CUCO_HAS_CUDA_ANNOTATED_PTR)
  auto g = cooperative_groups::this_grid();
  for (int idx = g.thread_rank(); idx < size; idx += g.size())
    cuda::apply_access_property(ptr + idx, sizeof(T), cuda::access_property::normal{});
#endif
}
However, this showed little to no performance impact, and I have not been able to get to the bottom of why. Another disadvantage of this method is that cuda::apply_access_property does not allow hybrid access policies, e.g., 60% cuda::access_property::persisting + 40% cuda::access_property::normal, which showed the best performance in the benchmarks.
tl;dr: This is a workaround that is flexible enough for our purposes (it can also be used with other cuco map types) but should be replaced once annotated_ptr works as expected.
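For context, the hybrid split described above is exactly what the stream-level accessPolicyWindow can express through its hitRatio field: the given fraction of accesses in the window receives hitProp, the rest receive missProp. A host-side sketch, assuming `d_filter` and `num_bytes` describe the filter storage (this is not code from this PR):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Sketch: request a 60% persisting / 40% streaming policy for the bytes in
// [d_filter, d_filter + num_bytes) on `stream`.
void set_hybrid_policy(cudaStream_t stream, void* d_filter, std::size_t num_bytes)
{
  cudaStreamAttrValue attr{};
  attr.accessPolicyWindow.base_ptr  = d_filter;
  attr.accessPolicyWindow.num_bytes = num_bytes;  // must not exceed accessPolicyMaxWindowSize
  attr.accessPolicyWindow.hitRatio  = 0.6f;       // fraction of accesses treated as persisting
  attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
  attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
  cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```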
> Could you place stream as the last parameter? Not necessary, but try to be consistent with other cuco APIs.
Yes, I can do that for the sake of consistency. I initially put it in front since it's an in/out parameter, i.e., the stream attributes are mutated.
However, this reminds me that I find the placement of the stream parameter in our API unfortunate. I would put the stream before the hashers and equality operators, since you are likely to set it explicitly more often than the others. It also has a shorter default value than the others.
Wait, I can't put the stream at the end since there are additional defaulted parameters.
Hey, apologies for bringing up this old PR. I am also testing the cuda::apply_access_property API (based on https://leimao.github.io/blog/CUDA-L2-Persistent-Cache/) and, more generally, the prefetch.global.L2::evict_last PTX instruction that it uses internally, but I am not able to see any effect from it. However, when I use the accessPolicyWindow method from the CUDA API to configure L2 persistence for the same scenario, I do observe performance improvements. Did you figure out the reason or any possible resolution for this issue? Any help would be greatly appreciated.
// Overwrite the access policy attribute of CUDA Stream
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
// Remove any persistent lines in L2$
cudaCtxResetPersistingL2Cache();
This resets the cache globally. Maybe undesirable.
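A narrower alternative worth considering: the CUDA programming guide notes that applying cudaAccessPropertyNormal to a window removes the persisting status of previously cached lines in that region, so the reset could be scoped to the registered range instead of the whole context. A sketch, under the assumption that `d_filter` and `num_bytes` describe that range:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Sketch: demote only this region's lines back to normal instead of calling
// cudaCtxResetPersistingL2Cache(), which resets persisting lines context-wide.
void release_l2_persistence(cudaStream_t stream, void* d_filter, std::size_t num_bytes)
{
  cudaStreamAttrValue attr{};
  attr.accessPolicyWindow.base_ptr  = d_filter;
  attr.accessPolicyWindow.num_bytes = num_bytes;
  attr.accessPolicyWindow.hitRatio  = 1.0f;
  attr.accessPolicyWindow.hitProp   = cudaAccessPropertyNormal;  // clears persisting status
  attr.accessPolicyWindow.missProp  = cudaAccessPropertyNormal;
  cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```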
A usage example can be found here (#101): https://github.com/NVIDIA/cuCollections/blob/6b61279f8870142e81875f8c7a1fbdea04d619b4/examples/bloom_filter/l2_residency_example.cu
Should we move the file to include/cuco since it's exposed to the public?
template <typename Iterator>
void register_l2_persistence(
  cudaStream_t& stream, Iterator begin, Iterator end, float hit_rate = 0.6f, float carve_out = 1.0f)
Could you place stream as the last parameter? Not necessary, but try to be consistent with other cuco APIs.
I also had this in mind (that's why I added the doxygen docs). I would say yes. However, I don't know how broad the range of applications for this functionality really is.
This PR adds helper functions that allow pinning a region of global memory in the L2 cache.
Cherry-picked from #101 so it can be reviewed separately.
Also related: #101 (comment)
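To make the discussion above concrete, here is one possible shape of the helper, assuming it forwards the iterator range to a stream-level accessPolicyWindow and uses carve_out to size the persisting L2 set-aside. This is a hedged sketch, not necessarily the PR's actual implementation:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <cuda_runtime.h>

template <typename Iterator>
void register_l2_persistence(
  cudaStream_t& stream, Iterator begin, Iterator end, float hit_rate = 0.6f, float carve_out = 1.0f)
{
  using value_type = typename std::iterator_traits<Iterator>::value_type;

  int device = 0;
  cudaGetDevice(&device);
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, device);

  // Carve out a fraction of the maximum persisting L2 size for this context.
  cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize,
                     static_cast<std::size_t>(carve_out * prop.persistingL2CacheMaxSize));

  // Assumes [begin, end) is contiguous; clamp the window to the device limit.
  cudaStreamAttrValue attr{};
  attr.accessPolicyWindow.base_ptr  = static_cast<void*>(&*begin);
  attr.accessPolicyWindow.num_bytes = std::min(
    static_cast<std::size_t>(std::distance(begin, end)) * sizeof(value_type),
    static_cast<std::size_t>(prop.accessPolicyMaxWindowSize));
  attr.accessPolicyWindow.hitRatio  = hit_rate;  // hybrid split, e.g. 0.6f
  attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
  attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
  cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```

Note that the stream parameter is taken by reference and mutated (the stream attributes change), which is why the original signature placed it first.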