L2 residency control helper functions #197
Conversation
template <typename Iterator>
void register_l2_persistence(
  cudaStream_t& stream, Iterator begin, Iterator end, float hit_rate = 0.6f, float carve_out = 1.0f)
For now we use an iterator pair and assume it represents a contiguous memory segment. This is going to look a lot cleaner once we can use span.
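To illustrate the assumption: a short sketch of how an iterator pair maps to the byte window that would be registered. The helper name is hypothetical and not part of cuco; it is only valid under the same contiguity assumption made above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper (not cuco API): derive the byte extent of the window
// described by an iterator pair. Only valid if [begin, end) is contiguous.
template <typename Iterator>
std::size_t l2_window_bytes(Iterator begin, Iterator end)
{
  using value_type = typename std::iterator_traits<Iterator>::value_type;
  return static_cast<std::size_t>(std::distance(begin, end)) * sizeof(value_type);
}
```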
Can you include a description of why this is necessary vs. annotated_ptr?
Could you place stream as the last parameter? Not necessary, but try to be consistent with other cuco APIs.
The most elegant solution to this problem would be to wrap the storage of the filter in an annotated_ptr. However, this first requires a solution to the annotated_ptr + cuda::atomic/cuda::atomic_ref problem, which is not available yet.
The next best approach would be to set the access property manually using cuda::apply_access_property, which I implemented in the previous version of the benchmark script:
cuCollections/benchmarks/bloom_filter/bloom_filter_bench.cu
Lines 199 to 217 in a1ea293
template <typename T>
__global__ void pin_memory(T* ptr, nvbench::int64_t size)
{
#if defined(CUCO_HAS_CUDA_ANNOTATED_PTR)
  auto g = cooperative_groups::this_grid();
  for (int idx = g.thread_rank(); idx < size; idx += g.size())
    cuda::apply_access_property(ptr + idx, sizeof(T), cuda::access_property::persisting{});
#endif
}

template <typename T>
__global__ void unpin_memory(T* ptr, nvbench::int64_t size)
{
#if defined(CUCO_HAS_CUDA_ANNOTATED_PTR)
  auto g = cooperative_groups::this_grid();
  for (int idx = g.thread_rank(); idx < size; idx += g.size())
    cuda::apply_access_property(ptr + idx, sizeof(T), cuda::access_property::normal{});
#endif
}
However, this showed little to no performance impact, and I have not been able to get to the bottom of why. Another disadvantage of this method is that cuda::apply_access_property does not allow hybrid access policies, e.g., 60% cuda::access_property::persisting + 40% cuda::access_property::normal, which showed the best performance in the benchmarks.
tl;dr: This is a workaround that is flexible enough for our purposes (it can also be used with other cuco map types) but should be replaced once annotated_ptr works as expected.
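For context, the hybrid split described above is exactly what the stream-level accessPolicyWindow can express through its hitRatio field: the given fraction of accesses in the window receives hitProp, the rest receive missProp. A host-side sketch, assuming `d_filter` and `num_bytes` describe the filter storage (this is not code from this PR):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Sketch: request a 60% persisting / 40% streaming policy for the bytes in
// [d_filter, d_filter + num_bytes) on `stream`.
void set_hybrid_policy(cudaStream_t stream, void* d_filter, std::size_t num_bytes)
{
  cudaStreamAttrValue attr{};
  attr.accessPolicyWindow.base_ptr  = d_filter;
  attr.accessPolicyWindow.num_bytes = num_bytes;  // must not exceed accessPolicyMaxWindowSize
  attr.accessPolicyWindow.hitRatio  = 0.6f;       // fraction of accesses treated as persisting
  attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
  attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
  cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```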
> Could you place stream as the last parameter? Not necessary, but try to be consistent with other cuco APIs.
Yes, I can do that for the sake of consistency. I initially put it in front since it's an in/out parameter, i.e., the stream attributes are mutated.
However, this reminds me that I find the placement of the stream parameter in our API unfortunate. I would put the stream before the hashers and equality operators, since you are likely to set it explicitly more often than the others. It also has a shorter default value than the others.
Wait, I can't put the stream at the end since there are additional defaulted parameters.
Hey, apologies for bringing up this old PR. I am also testing the cuda::apply_access_property API (based on https://leimao.github.io/blog/CUDA-L2-Persistent-Cache/) and, more generally, the prefetch.global.L2::evict_last PTX instruction that it uses internally, but I am not able to see any effect from it. However, when I use the accessPolicyWindow method from the CUDA API to configure L2 persistence for the same scenario, I do observe performance improvements. Did you figure out the reason or any possible resolution for this issue? Any help would be greatly appreciated.
// Overwrite the access policy attribute of CUDA Stream
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
// Remove any persistent lines in L2$
cudaCtxResetPersistingL2Cache();
This resets the cache globally. Maybe undesirable.
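A narrower alternative worth considering: the CUDA programming guide notes that applying cudaAccessPropertyNormal to a window removes the persisting status of previously cached lines in that region, so the reset could be scoped to the registered range instead of the whole context. A sketch, under the assumption that `d_filter` and `num_bytes` describe that range:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Sketch: demote only this region's lines back to normal instead of calling
// cudaCtxResetPersistingL2Cache(), which resets persisting lines context-wide.
void release_l2_persistence(cudaStream_t stream, void* d_filter, std::size_t num_bytes)
{
  cudaStreamAttrValue attr{};
  attr.accessPolicyWindow.base_ptr  = d_filter;
  attr.accessPolicyWindow.num_bytes = num_bytes;
  attr.accessPolicyWindow.hitRatio  = 1.0f;
  attr.accessPolicyWindow.hitProp   = cudaAccessPropertyNormal;  // clears persisting status
  attr.accessPolicyWindow.missProp  = cudaAccessPropertyNormal;
  cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```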
A usage example can be found here (#101): https://github.com/NVIDIA/cuCollections/blob/6b61279f8870142e81875f8c7a1fbdea04d619b4/examples/bloom_filter/l2_residency_example.cu
Should we move the file to include/cuco since it's exposed to the public?
template <typename Iterator>
void register_l2_persistence(
  cudaStream_t& stream, Iterator begin, Iterator end, float hit_rate = 0.6f, float carve_out = 1.0f)
Could you place stream as the last parameter? Not necessary, but try to be consistent with other cuco APIs.
I also had this in mind (that's why I added the doxygen docs). I would say yes. However, I don't know how broad the range of applications for this functionality really is.
This PR adds helper functions that allow pinning a region of global memory in the L2 cache.
Cherry-picked from #101 so it can be reviewed separately.
Also related: #101 (comment)
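To make the discussion above concrete, here is one possible shape of the helper, assuming it forwards the iterator range to a stream-level accessPolicyWindow and uses carve_out to size the persisting L2 set-aside. This is a hedged sketch, not necessarily the PR's actual implementation:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <cuda_runtime.h>

template <typename Iterator>
void register_l2_persistence(
  cudaStream_t& stream, Iterator begin, Iterator end, float hit_rate = 0.6f, float carve_out = 1.0f)
{
  using value_type = typename std::iterator_traits<Iterator>::value_type;

  int device = 0;
  cudaGetDevice(&device);
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, device);

  // Carve out a fraction of the maximum persisting L2 size for this context.
  cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize,
                     static_cast<std::size_t>(carve_out * prop.persistingL2CacheMaxSize));

  // Assumes [begin, end) is contiguous; clamp the window to the device limit.
  cudaStreamAttrValue attr{};
  attr.accessPolicyWindow.base_ptr  = static_cast<void*>(&*begin);
  attr.accessPolicyWindow.num_bytes = std::min(
    static_cast<std::size_t>(std::distance(begin, end)) * sizeof(value_type),
    static_cast<std::size_t>(prop.accessPolicyMaxWindowSize));
  attr.accessPolicyWindow.hitRatio  = hit_rate;  // hybrid split, e.g. 0.6f
  attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
  attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
  cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```

Note that the stream parameter is taken by reference and mutated (the stream attributes change), which is why the original signature placed it first.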