Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster strided-batched to batched wrapper #2592

Closed
THargreaves opened this issue Dec 15, 2024 · 4 comments · Fixed by #2608
Closed

Faster strided-batched to batched wrapper #2592

THargreaves opened this issue Dec 15, 2024 · 4 comments · Fixed by #2608
Labels
cuda libraries Stuff about CUDA library wrappers. enhancement New feature or request performance How fast can we go?

Comments

@THargreaves
Copy link
Contributor

THargreaves commented Dec 15, 2024

CUBLAS only implements strided-batched methods (acting on a M x N x B array) for a subset of operations. For other cases, batched operations are available that act on a vector of 2D CuArrays.

CUDA.jl offers a convenient wrapper that takes a strided-batched array and creates a vector of pointers to its individual matrices which can then be pass into the batched kernel. An example of this is CUDA.CUBLAS.getrf_strided_batched! which calls CUDA.CUBLAS.getrf_batched! under the hood.

The pointer conversion function is CUDA.CUBLAS.unsafe_strided_batch and is defined as

@inline function unsafe_strided_batch(strided::DenseCuArray{T}) where {T}
    batchsize = last(size(strided))
    stride = prod(size(strided)[1:end-1])
    ptrs = [pointer(strided, (i-1)*stride + 1) for i in 1:batchsize]
    return CuArray(ptrs)
end

This operation is very slow, especially in the setting of large batches of small matrices.

D = 4;
N = 1000000;
A = CUDA.rand(Float32, D, D, N);

println("Naive pointer conversion")
@benchmark CUDA.@sync CUDA.CUBLAS.unsafe_strided_batch($A)

println("Batched getrf")
@benchmark CUDA.CUBLAS.getrf_batched!($D, $Aptrs, $D, true, nothing)

gives

Naive pointer conversion

BenchmarkTools.Trial: 23 samples with 1 evaluation.
 Range (min … max):  203.631 ms … 269.869 ms  ┊ GC (min … max): 0.90% … 9.20%
 Time  (median):     219.725 ms               ┊ GC (median):    1.97%
 Time  (mean ± σ):   224.935 ms ±  19.125 ms  ┊ GC (mean ± σ):  4.69% ± 4.81%

  █▁ ▁          █              ▁    ▁                            
  ██▁█▁▁▁▁▁▁▁▁▁▁█▁▁▆▁▆▁▁▁▁▁▁▁▁▆█▁▆▁▁█▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▆ ▁
  204 ms           Histogram: frequency by time          270 ms <

 Memory estimate: 83.92 MiB, allocs estimate: 5000013.

Batched getrf

BenchmarkTools.Trial: 3695 samples with 8 evaluations.
 Range (min … max):    3.923 μs … 436.164 μs  ┊ GC (min … max): 0.00% … 36.24%
 Time  (median):     171.996 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   168.987 μs ±  32.501 μs  ┊ GC (mean ± σ):  0.66% ±  3.31%

  ▂                                                           █  
  █▄▄▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▁▄▃▃▄▃▄▄▄█ █
  3.92 μs       Histogram: log(frequency) by time        173 μs <

 Memory estimate: 672 bytes, allocs estimate: 26.

As you can see, the pointer creation time is roughly 1000 times slower than the operation itself.

I'm not exactly sure why this is but I assume it comes down to the repeated access of strided when creating the pointers. Since strided matrices have regular memory addresses, a faster way would be

first_address = UInt64(pointer(A));
println("Pointer creation time (no reference to A)")
@benchmark [CuPtr{Float32}($first_address + (i - 1) * $stride * sizeof(Float32)) for i in 1:($N)]

giving

Pointer creation time (no reference to A)

BenchmarkTools.Trial: 6672 samples with 1 evaluation.
 Range (min … max):  149.006 μs … 3.956 ms  ┊ GC (min … max):  0.00% … 24.57%
 Time  (median):     180.368 μs             ┊ GC (median):     0.00%
 Time  (mean ± σ):   746.292 μs ± 1.021 ms  ┊ GC (mean ± σ):  20.54% ± 22.02%

  █▄▅                        ▁▂▁       ▁▁        ▃▃       ▁▃▂ ▁
  ███▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅███▇▅▄▁▁▆███▇▆▃▁▃▁▆████▆▃▁▃▁▆███ █
  149 μs       Histogram: log(frequency) by time      3.08 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 3.

now at the same order of magnitude of speed as the kernel itself.

Better yet, the GPU can be used for this.

Aptrs_gpu = CuArray{CuPtr{Float32}}(undef, N);

function create_ptrs_kernel!(Aptrs_gpu, initial_ptr_address::UInt64, stride)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i in index:stride:length(Aptrs_gpu)
        Aptrs_gpu[i] = CuPtr{Float32}(initial_ptr_address + (i - 1) * stride * sizeof(Float32))
    end
    return nothing
end

println("Pointer creation time (GPU)")
@benchmark CUDA.@sync @cuda threads = $threads blocks = $blocks create_ptrs_kernel!(
    $Aptrs_gpu, UInt64(pointer($A)), $stride
)

at which point the pointer creation time is negligible:

Pointer creation time (GPU)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  9.023 μs …  44.357 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.795 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.836 μs ± 475.868 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▂▃▅█▅▅▇▂▁▂                       
  ▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▃▄▆██████████▇▆▆▄▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  9.02 μs         Histogram: frequency by time        10.4 μs <

 Memory estimate: 256 bytes, allocs estimate: 12.

I'd be happy to implement these changes but wanted to run it by the team first to make sure there weren't any potential issues with this approach.

@THargreaves THargreaves added the enhancement New feature or request label Dec 15, 2024
@maleadt maleadt added cuda libraries Stuff about CUDA library wrappers. performance How fast can we go? labels Dec 16, 2024
@maleadt
Copy link
Member

maleadt commented Dec 16, 2024

I'm not exactly sure why this is but I assume it comes down to the repeated access of strided when creating the pointers.

Deriving a pointer is a fairly costly operation:

CUDA.jl/src/memory.jl

Lines 530 to 598 in 4e9513b

function Base.convert(::Type{CuPtr{T}}, managed::Managed{M}) where {T,M}
# let null pointers pass through as-is
ptr = convert(CuPtr{T}, managed.mem)
if ptr == CU_NULL
return ptr
end
# accessing memory during stream capture: taint the memory so that we always synchronize
state = active_state()
if is_capturing(state.stream)
managed.captured = true
end
# accessing memory on another device: ensure the data is ready and accessible
if M == DeviceMemory && state.context != managed.mem.ctx
maybe_synchronize(managed)
source_device = managed.mem.dev
# enable peer-to-peer access
if maybe_enable_peer_access(state.device, source_device) != 1
throw(ArgumentError(
"""cannot take the GPU address of inaccessible device memory.
You are trying to use memory from GPU $(deviceid(source_device)) on GPU $(deviceid(state.device)).
P2P access between these devices is not possible; either switch to GPU $(deviceid(source_device))
by calling `CUDA.device!($(deviceid(source_device)))`, or copy the data to an array allocated on device $(deviceid(state.device))."""))
end
# set pool visibility
if stream_ordered(source_device)
pool = pool_create(source_device)
access!(pool, state.device, ACCESS_FLAGS_PROT_READWRITE)
end
end
# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
maybe_synchronize(managed)
managed.stream = state.stream
end
managed.dirty = true
return ptr
end
function Base.convert(::Type{Ptr{T}}, managed::Managed{M}) where {T,M}
# let null pointers pass through as-is
ptr = convert(Ptr{T}, managed.mem)
if ptr == C_NULL
return ptr
end
# accessing memory on the CPU: only allowed for host or unified allocations
if M == DeviceMemory
throw(ArgumentError(
"""cannot take the CPU address of GPU memory.
You are probably falling back to or otherwise calling CPU functionality
with GPU array inputs. This is not supported by regular device memory;
ensure this operation is supported by CUDA.jl, and if it isn't, try to
avoid it or rephrase it in terms of supported operations. Alternatively,
you can consider using GPU arrays backed by unified memory by
allocating using `cu(...; unified=true)`."""))
end
# make sure any work on the memory has finished.
maybe_synchronize(managed)
return ptr
end

As you can perform arithmetic with pointers, it's probably better to only call that function once, as you suggested, and inline the offset calculation (maybe by calling Base._memory_offset(x, i) to avoid duplicating logic).

Better yet, the GPU can be used for this.

Hah, fascinating. What is your batch size? I wonder if there's a threshold to be determined here, where a CPU-based calculation (necessitating a memory copy) is still faster than a kernel.

@THargreaves
Copy link
Contributor Author

Batch size here is 1000000, but intuition tells me it should never be that much slower to always do it on the GPU. Since you need the pointers on the GPU and the FLOPS required to compute a pointer is negligible, the operation is bound by the GPU memory bandwidth regardless of which way you do it.

E.g. for N = 128 and including a conversion to CuArray in the CPU benchmark the times are 4.5 μs on CPU, 5.7 μs on GPU.

I actually would have expected the GPU to be at least as fast as the CPU by my logic, so maybe there's something I'm not getting, but in practice, it seems reasonable to use the GPU regardless of batch size.

@maleadt
Copy link
Member

maleadt commented Dec 20, 2024

You're not considering the launch overhead of a kernel, which I normally think of as taking around 20us. You can process a lot of items during that time span (it's curious that on your system the launch overhead is closer to 5us). Then again, by doing the processing on the GPU, you don't have to memcpy the pointer values to the GPU (which I presume takes a similar time to queue as a kernel launch does), so in this specific case it's probably fine to always compute on the GPU.

Thinking some more though, we need to be careful that the "slow" code performed by pointer on the CPU is at least triggered once, because it's responsible for configuring the memory pool and exposing memory when using unified memory and/or multiple devices. Passing the source array should do that (the conversion to a device array when passing as an argument requires a pointer conversion), but this should be double checked.

@THargreaves
Copy link
Contributor Author

Thinking some more though, we need to be careful that the "slow" code performed by pointer on the CPU is at least triggered once, because it's responsible for configuring the memory pool and exposing memory when using unified memory and/or multiple devices. Passing the source array should do that (the conversion to a device array when passing as an argument requires a pointer conversion), but this should be double checked.

I don't really follow you here.

Either way, I've started a PR. Very happy for you to make any changes as you see fit.

@maleadt maleadt linked a pull request Jan 8, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda libraries Stuff about CUDA library wrappers. enhancement New feature or request performance How fast can we go?
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants