
Add caching allocator interface #576

Open · pxl-th wants to merge 13 commits into master from pxl-th/cache-alloc

Conversation

@pxl-th (Member) commented Dec 15, 2024

Since Julia's GC is not aware of GPU memory, in scenarios with lots of allocations we end up either in OOM situations or with excessively high memory usage, even though the program may require only a fraction of it.

To help with GPU memory utilization in a program with repeating blocks of code, we can wrap those regions in a scope that uses the caching allocator every time the program enters it during execution.

This is especially useful, for example, when training models, where you compute the loss, the gradients w.r.t. the loss, and perform an in-place parameter update of the model:

model = ...
for i in 1:1000
    GPUArrays.@cache_scope kab :loop begin
        loss, grads = ...
        update!(optimizer, model, grads)
    end
end

The caching allocator is identified by its name and is per-device (it uses the current TLS device).
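
For instance, caches with different names on the same device are completely independent. A minimal usage sketch (assuming the @cache_scope / invalidate_cache_allocator! API from this PR; KernelAbstractions.zeros is used purely for illustration):

kab = CUDABackend()
GPUArrays.@cache_scope kab :train begin
    a = KernelAbstractions.zeros(kab, Float32, 1024)
    sum(a)
end
GPUArrays.@cache_scope kab :eval begin
    b = KernelAbstractions.zeros(kab, Float32, 2048)
    sum(b)
end
# Each named cache is invalidated independently.
GPUArrays.invalidate_cache_allocator!(kab, :train)
GPUArrays.invalidate_cache_allocator!(kab, :eval)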

Example

In the following example we apply the caching allocator at every iteration of the for-loop.
Every iteration requires 2 GiB of GPU memory; without the caching allocator the GC wouldn't be able to free the arrays in time, resulting in higher memory usage.
With the caching allocator, memory usage stays at exactly 2 GiB.

After the loop, we free all cached memory, if there is any (e.g. CUDA.jl will bulk-free immediately after execution of the expression inside @cache_scope, because it has a performant allocator).

kab = CUDABackend()
n = 1024^3
CUDA.@sync for i in 1:1000
    GPUArrays.@cache_scope kab :loop begin
        sin.(CUDA.rand(Float32, n))
    end
end
GPUArrays.invalidate_cache_allocator!(kab, :loop)

Backend differences

  • Because CUDA has a more performant allocator, CUDA.jl bulk-frees arrays at the end of the expr execution instead of caching them (free_immediately=true).
  • AMDGPU.jl instead caches them (free_immediately=false) until the user invalidates the cache; a rough sketch of both strategies follows below.
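
For illustration only (the type, function and field names here are invented and do not match the actual implementation in this PR):

import GPUArrays  # provides the generic `unsafe_free!`

struct ScopeCache{T}
    free::Vector{T}  # arrays kept around for reuse in later scope entries
    busy::Vector{T}  # arrays handed out during the current scope
end

function free_scope!(cache::ScopeCache, free_immediately::Bool)
    if free_immediately
        # CUDA.jl-style: rely on the backend's own performant allocator and
        # bulk-free everything that was recorded during the scope.
        foreach(GPUArrays.unsafe_free!, cache.busy)
    else
        # AMDGPU.jl-style: keep the arrays cached for reuse the next time
        # the same named scope is entered.
        append!(cache.free, cache.busy)
    end
    empty!(cache.busy)
    return nothing
end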

Performance impact

Executing the GaussianSplatting.jl benchmark (1k training iterations) on an RX 7900 XTX:

|                        | Without caching allocator | With caching allocator |
|------------------------|---------------------------|------------------------|
| GPU memory utilization | (screenshot)              | (screenshot)           |
| Time                   | 59.656476 seconds         | 46.365646 seconds      |

TODO

  • Support for 1.10.
  • Support bulk-freeing instead of caching.
  • Add PR description.
  • Documentation.
  • Tests.

PRs for other GPU backends

@maleadt (Member) left a comment


Could you add some high-level design description to the PR?

As I mentioned on Slack, CUDA already has a caching allocator, so I'm not sure if for those back-ends this shouldn't boil down to basically batch-calling unsafe_free! at the end of each iteration, instead of actively caching arrays. Would be good to compare performance, if possible.

Comment on lines 37 to 38
# Array was manually freed via `unsafe_free!`.
storage(tmp).freed && continue
@maleadt (Member):

Not necessarily only by calling unsafe_free!, but also by the GC, right?

@pxl-th (Member Author) commented Dec 18, 2024:

In this particular case, only by unsafe_free!, because cached arrays are stored either in the free or the busy list, thus preventing the GC from collecting them.
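
Roughly, the bookkeeping works like the sketch below (illustrative only; the names do not match the PR's internals). Holding the array in either vector keeps a strong reference alive, so the GC never collects it and only an explicit unsafe_free! releases the memory:

function get_or_alloc!(free::Vector, busy::Vector, alloc_f)
    # Reuse a cached array if one is available, otherwise allocate a new one.
    arr = isempty(free) ? alloc_f() : pop!(free)
    # Keeping the reference in `busy` roots the array for the GC.
    push!(busy, arr)
    return arr
end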

@pxl-th (Member Author) commented Dec 16, 2024:

> Could you add some high-level design description to the PR?
>
> As I mentioned on Slack, CUDA already has a caching allocator, so I'm not sure if for those back-ends this shouldn't boil down to basically batch-calling unsafe_free! at the end of each iteration, instead of actively caching arrays. Would be good to compare performance, if possible.

Yeah, I'm planning to add both a detailed PR description and documentation, and to allow batch-freeing instead of caching the arrays (which can be just an option in the caching allocator).

@pxl-th (Member Author) commented Dec 18, 2024:

@maleadt, I've updated the PR.
I've also added tests, but they are not enabled right now, because no backend currently has the implementation merged (including JLArrays, because the tests use its released version).
However, they pass locally on my machines.

Let me know what you think.

@pxl-th pxl-th marked this pull request as ready for review December 18, 2024 16:59
src/host/allocations_cache.jl — outdated review thread, resolved
docs/src/interface.md — outdated review thread, resolved
Given KernelAbstractions `backend`, return corresponding `PerDeviceCacheAllocator` for it.
Each GPU backend must implement this.
"""
cache_allocator(::Backend) = error("Not implemented.")
@maleadt (Member):

Instead of passing in a KA back-end, what about keying everything on the GPU array type? It seems more natural, as the KA back-end doesn't seem to be used anyway.

@pxl-th (Member Author):

The reason for passing the backend is that I was using it constantly when writing backend-agnostic code, as it is what enables KA kernel compilation (kernel(kab)(args...)) and transfer to GPU/CPU (adapt(kab, host_array)).
So this was my main "anchor" for backend-agnostic code.

Maybe we should also add an interface to obtain the KA backend from the array type:
KA.get_backend(::Type{AMDGPU.ROCArray}) = ROCBackend()
because currently we only have an interface for obtaining the backend from an actual array:
KA.get_backend(::AMDGPU.ROCArray) = ROCBackend()
which is cumbersome and leads to situations where you have to create 0-sized arrays just to obtain the KA backend.

Alternatively, when keying on the array type, I think I could live with a device(::AbstractGPUArray) interface method.

Same here, I'd rather query on ::Type{AbstractGPUArray}, not on the actual array.
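
For concreteness, the type-level method could sit next to the existing instance-level one and remove the need for dummy arrays. Illustrative sketch only (AMDGPU is just an example; this method is not currently part of KA's interface):

using AMDGPU
import KernelAbstractions as KA

# Hypothetical type-level method:
KA.get_backend(::Type{<:ROCArray}) = ROCBackend()

# Today: a 0-sized dummy array is needed to obtain the backend from a type.
kab = KA.get_backend(ROCArray{Float32}(undef, 0))
# With the type-level method it would simply be:
kab = KA.get_backend(ROCArray)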

@maleadt (Member):

> Maybe we should also add an interface to obtain the KA backend from the array type

That seems reasonable to me, cc @vchuravy. It's something that exists in Base as well for functions like eltype.

On the other hand, it may not be possible when the back-end encodes specific information like the device being used, but I don't know if that was ever the intent.

src/host/allocations_cache.jl — outdated review thread, resolved
Comment on lines 38 to 41
`skip_free::Bool` is used together with `PerDeviceCacheAllocator.free_immediately`.
When `true` arrays are bulk-freed instead of stored in cache.
In this case `alloc!` will avoid looking into "free" part of `cache`
and execute `alloc_f` immediately, storing allocation for future bulk-freeing.
@maleadt (Member):

skip_free seems like an implementation detail?

The free_immediately flag also feels like a hack; have you benchmarked the difference between relying on CUDA's allocator vs. fully caching in GPUArrays? I'd be happy to not support this at all (i.e., have CUDA.jl also cache allocations in GPUArrays.jl instead of relying on the CUDA caching allocator) to simplify the API, especially if it would make sense performance-wise.

@pxl-th (Member Author) commented Jan 6, 2025:

The difference is pretty minor, around 2% improvement with free_immediately=false.
Should we drop the flag?

@maleadt (Member):

Yeah, let's drop it. That way it also works for people who run with the caching allocator disabled (i.e., HPC folks relying on UCX/MPI).

@pxl-th (Member Author):

Done. I've also uncommented the tests so that JLArrays runs them, but other backends will fail on them for now.

@pxl-th pxl-th force-pushed the pxl-th/cache-alloc branch from 051cd6d to d6a74b0 Compare January 6, 2025 21:00
@pxl-th pxl-th self-assigned this Jan 6, 2025
@pxl-th pxl-th requested a review from maleadt January 6, 2025 21:15
@pxl-th (Member Author) commented Jan 7, 2025:

One difference I've found between Julia 1.10 and Julia 1.11:

  • Julia 1.10:
julia> GPUArrays.AllocCache.@enable CuArray :loop begin
           x1 = CuArray(rand(Float32, 1))
       end
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.680597

julia> x1
ERROR: UndefVarError: `x1` not defined

  • Julia 1.11:
julia> GPUArrays.AllocCache.@enable CuArray :loop begin
           x1 = CuArray(rand(Float32, 1))
       end
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.7703809

julia> x1
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.7703809

Not sure where that difference is coming from.
