Add caching allocator interface #576
Conversation
Could you add some high-level design description to the PR?
As I mentioned on Slack, CUDA already has a caching allocator, so I'm not sure if for those back-ends this shouldn't boil down to basically batch-calling `unsafe_free!` at the end of each iteration, instead of actively caching arrays. Would be good to compare performance, if possible.
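For reference, a minimal sketch of what that batch-freeing alternative might look like (purely illustrative; the loop shape and the `temporaries` vector are assumptions, not the PR's API):

using CUDA

for iter in 1:100
    temporaries = CuArray[]                  # collect temporaries created this iteration
    x = CUDA.rand(Float32, 1024)
    push!(temporaries, x)
    # ... use `x` to compute loss / gradients ...
    foreach(CUDA.unsafe_free!, temporaries)  # batch-free at the end instead of caching
end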
src/host/allocations_cache.jl (outdated)
# Array was manually freed via `unsafe_free!`.
storage(tmp).freed && continue
Not necessarily only by calling `unsafe_free!`, but also by the GC, right?
In this particular case only by `unsafe_free!`, because cached arrays are stored either in the `free` or the `busy` arrays, thus preventing the GC from collecting them.
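For illustration, the invariant being described is roughly the following; the struct and field names here are paraphrased from the discussion, not guaranteed to match the PR:

# While an array sits in either vector, the cache holds a live reference,
# so the GC cannot collect it; only `unsafe_free!` can release its memory.
struct CacheBin{T}
    free::Vector{T}   # arrays available for reuse on a cache hit
    busy::Vector{T}   # arrays currently handed out inside the scope
end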
Yeah, I'm planning to add both a detailed PR description and documentation.
@maleadt, I've updated the PR. Let me know what you think.
src/host/allocations_cache.jl (outdated)
Given KernelAbstractions `backend`, return the corresponding `PerDeviceCacheAllocator` for it.
Each GPU backend must implement this.
"""
cache_allocator(::Backend) = error("Not implemented.")
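For context, a hedged sketch of how a backend might satisfy this interface; the zero-argument `PerDeviceCacheAllocator()` constructor is an assumption, not the PR's actual signature:

# Hypothetical AMDGPU.jl implementation (illustrative only):
const ROC_CACHE_ALLOCATOR = PerDeviceCacheAllocator()
cache_allocator(::ROCBackend) = ROC_CACHE_ALLOCATOR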
Instead of passing in a KA back-end, what about keying everything on the GPU array type? It seems more natural, as the KA back-end doesn't seem to be used anyway.
The reason for passing `backend` is that I was using it constantly when writing backend-agnostic code, since it is what enables KA kernel compilation (`kernel(kab)(args...)`) and transfer to GPU/CPU (`adapt(kab, host_array)`).
So this was my main "anchor" for backend-agnostic code.
Maybe we should also add an interface to obtain the KA backend from the array type:
`KA.get_backend(::Type{AMDGPU.ROCArray}) = ROCBackend()`
Currently we only have an interface for obtaining the backend from an actual array:
`KA.get_backend(::AMDGPU.ROCArray) = ROCBackend()`
This is cumbersome and leads to situations where you have to create 0-sized arrays just to obtain the KA backend.
Alternatively, when keying on the array type, I think I could live with a `device(::AbstractGPUArray)` interface method. Same here, I'd rather query on `::Type{AbstractGPUArray}`, not on the actual array.
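To make the contrast concrete, a sketch assuming AMDGPU.jl is loaded; the type-level method is the proposal under discussion, not existing API:

import KernelAbstractions as KA
using AMDGPU

# Today: the backend can only be queried from an array instance, so code that
# only has the type must materialize a 0-sized array first.
kab = KA.get_backend(ROCArray{Float32}(undef, 0))

# Proposed: query the backend directly from the array type.
KA.get_backend(::Type{<:ROCArray}) = ROCBackend()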
> Maybe we should also add an interface to obtain KA backend from the array type

That seems reasonable to me, cc @vchuravy. It's something that exists in Base as well for functions like `eltype`.
On the other hand, it may not be possible when the back-end encodes specific information like the device being used, but I don't know if that was ever the intent.
src/host/allocations_cache.jl (outdated)
`skip_free::Bool` is used together with `PerDeviceCacheAllocator.free_immediately`.
When `true`, arrays are bulk-freed instead of being stored in the cache.
In this case `alloc!` will avoid looking into the "free" part of `cache`
and execute `alloc_f` immediately, storing the allocation for future bulk-freeing.
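A loose sketch of the control flow this docstring describes; the `cache` fields and the exact signature are paraphrased, not the PR's code:

function alloc!(alloc_f, cache, key; skip_free::Bool = false)
    if skip_free
        x = alloc_f()                # skip the cache lookup entirely...
    else                             # ...otherwise reuse a free array if possible
        free = cache.free[key]
        x = isempty(free) ? alloc_f() : pop!(free)
    end
    push!(cache.busy[key], x)        # track for later release or bulk-freeing
    return x
end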
`skip_free` seems like an implementation detail?
The `free_immediately` flag also feels like a hack; have you benchmarked the difference between relying on CUDA's allocator vs. fully caching in GPUArrays? I'd be happy to not support this at all (i.e., have CUDA.jl also cache allocations in GPUArrays.jl instead of relying on the CUDA caching allocator) to simplify the API, especially if it would make sense performance-wise.
The difference is pretty minor, around 2% improvement with `free_immediately=false`.
Should we drop the flag?
Yeah, let's drop it. That way it also works for people who run with the caching allocator disabled (i.e., HPC folks relying on UCX/MPI).
Done. I've also uncommented the tests so that JLArrays runs them, but other backends will fail on them for now.
Force-pushed from 051cd6d to d6a74b0.
One difference I've found between Julia 1.10 and Julia 1.11:

# Julia 1.10:
julia> GPUArrays.AllocCache.@enable CuArray :loop begin
           x1 = CuArray(rand(Float32, 1))
       end
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.680597

julia> x1
ERROR: UndefVarError: `x1` not defined

# Julia 1.11:
julia> GPUArrays.AllocCache.@enable CuArray :loop begin
           x1 = CuArray(rand(Float32, 1))
       end
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.7703809

julia> x1
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.7703809

Not sure where that's coming from.
Since Julia's GC is not aware of GPU memory, in scenarios with lots of allocations we end up either in OOM situations or with excessively high memory usage, even though the program may require only a fraction of it.
To help with GPU memory utilization in a program with repeating blocks of code, we can wrap those regions in a scope that will use the caching allocator every time the program enters it during execution.
This is especially useful when training models, where you compute the loss, the gradients w.r.t. the loss, and perform an in-place parameter update of the model.
The caching allocator is defined by its name and is per-device (it will use the current TLS device).
Example
In the following example we apply the caching allocator at every iteration of the for-loop.
Every iteration requires 2 GiB of GPU memory; without the caching allocator the GC wouldn't be able to free arrays in time, resulting in higher memory usage.
With the caching allocator, memory usage stays at exactly 2 GiB.
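A minimal sketch of such a loop, using the `AllocCache.@enable` syntax shown later in this thread (the array size and the `:loop` name are illustrative):

using CUDA, GPUArrays

for i in 1:1000
    GPUArrays.AllocCache.@enable CuArray :loop begin
        # 512 * 2^20 Float32s = 2 GiB; reused from the cache on every iteration.
        x = CUDA.rand(Float32, 512 * 2^20)
        sum(x)
    end
end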
After the loop, we free all cached memory if there's any (e.g. CUDA.jl will bulk-free immediately after execution of the expression inside `@cache_scope`, because it has a performant allocator).

Backend differences
- CUDA.jl bulk-frees arrays immediately after `expr` execution, instead of caching the arrays (`free_immediately=true`).
- Other backends cache arrays (`free_immediately=false`) until the user invalidates the cache.

Performance impact
Executing the GaussianSplatting.jl benchmark (1k training iterations) on an RX 7900 XTX:
- without caching allocator: 59.656476 seconds
- with caching allocator: 46.365646 seconds

TODO
- PRs for other GPU backends