
Add caching allocator interface #576

Open · pxl-th wants to merge 13 commits into master from pxl-th/cache-alloc

Conversation

@pxl-th (Member) commented Dec 15, 2024

Since Julia's GC is not aware of GPU memory, in scenarios with lots of allocations we end up either in OOM situations or with excessively high memory usage, even though the program may require only a fraction of it.

To help with GPU memory utilization in a program with repeating blocks of code, we can wrap those regions in a scope that uses the caching allocator every time the program enters it during execution.

This is especially useful, for example, when training models, where you compute the loss, the gradients w.r.t. the loss, and perform an in-place parameter update of the model:

model = ...
for i in 1:1000
    GPUArrays.@cache_scope kab :loop begin
        loss, grads = ...
        update!(optimizer, model, grads)
    end
end

The caching allocator is identified by its name and is per-device (it uses the current TLS device).
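
For instance, caches with different names on the same device are completely independent. A minimal usage sketch (assuming the @cache_scope / invalidate_cache_allocator! API from this PR; KernelAbstractions.zeros is used purely for illustration):

kab = CUDABackend()
GPUArrays.@cache_scope kab :train begin
    a = KernelAbstractions.zeros(kab, Float32, 1024)
    sum(a)
end
GPUArrays.@cache_scope kab :eval begin
    b = KernelAbstractions.zeros(kab, Float32, 2048)
    sum(b)
end
# Each named cache is invalidated independently.
GPUArrays.invalidate_cache_allocator!(kab, :train)
GPUArrays.invalidate_cache_allocator!(kab, :eval)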

Example

In the following example we apply the caching allocator at every iteration of the for-loop.
Every iteration requires 2 GiB of GPU memory; without the caching allocator the GC wouldn't be able to free the arrays in time, resulting in higher memory usage.
With the caching allocator, memory usage stays at exactly 2 GiB.

After the loop, we free all cached memory, if there is any (e.g. CUDA.jl will bulk-free immediately after execution of the expression inside @cache_scope, because it has a performant allocator).

kab = CUDABackend()
n = 1024^3
CUDA.@sync for i in 1:1000
    GPUArrays.@cache_scope kab :loop begin
        sin.(CUDA.rand(Float32, n))
    end
end
GPUArrays.invalidate_cache_allocator!(kab, :loop)

Backend differences

  • Because CUDA has a more performant allocator, CUDA.jl bulk-frees arrays at the end of the expr execution instead of caching them (free_immediately=true).
  • AMDGPU.jl instead caches them (free_immediately=false) until the user invalidates the cache; a rough sketch of both strategies follows below.
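
For illustration only (the type, function and field names here are invented and do not match the actual implementation in this PR):

import GPUArrays  # provides the generic `unsafe_free!`

struct ScopeCache{T}
    free::Vector{T}  # arrays kept around for reuse in later scope entries
    busy::Vector{T}  # arrays handed out during the current scope
end

function free_scope!(cache::ScopeCache, free_immediately::Bool)
    if free_immediately
        # CUDA.jl-style: rely on the backend's own performant allocator and
        # bulk-free everything that was recorded during the scope.
        foreach(GPUArrays.unsafe_free!, cache.busy)
    else
        # AMDGPU.jl-style: keep the arrays cached for reuse the next time
        # the same named scope is entered.
        append!(cache.free, cache.busy)
    end
    empty!(cache.busy)
    return nothing
end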

Performance impact

Executing the GaussianSplatting.jl benchmark (1k training iterations) on an RX 7900 XTX:

|                        | Without caching allocator | With caching allocator |
|------------------------|---------------------------|------------------------|
| GPU memory utilization | (screenshot)              | (screenshot)           |
| Time                   | 59.656476 seconds         | 46.365646 seconds      |

TODO

  • Support for 1.10.
  • Support bulk-freeing instead of caching.
  • Add PR description.
  • Documentation.
  • Tests.

PRs for other GPU backends

@maleadt (Member) left a comment


Could you add some high-level design description to the PR?

As I mentioned on Slack, CUDA already has a caching allocator, so I'm not sure if for those back-ends this shouldn't boil down to basically batch-calling unsafe_free! at the end of each iteration, instead of actively caching arrays. Would be good to compare performance, if possible.

Comment on lines 37 to 38
# Array was manually freed via `unsafe_free!`.
storage(tmp).freed && continue
@maleadt (Member):

Not necessarily only by calling unsafe_free!, but also by the GC, right?

@pxl-th (Member Author) commented Dec 18, 2024:

In this particular case, only by unsafe_free!, because cached arrays are stored either in the free or the busy list, thus preventing the GC from collecting them.
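
Roughly, the bookkeeping works like the sketch below (illustrative only; the names do not match the PR's internals). Holding the array in either vector keeps a strong reference alive, so the GC never collects it and only an explicit unsafe_free! releases the memory:

function get_or_alloc!(free::Vector, busy::Vector, alloc_f)
    # Reuse a cached array if one is available, otherwise allocate a new one.
    arr = isempty(free) ? alloc_f() : pop!(free)
    # Keeping the reference in `busy` roots the array for the GC.
    push!(busy, arr)
    return arr
end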

@pxl-th (Member Author) commented Dec 16, 2024:

> Could you add some high-level design description to the PR?
>
> As I mentioned on Slack, CUDA already has a caching allocator, so I'm not sure if for those back-ends this shouldn't boil down to basically batch-calling unsafe_free! at the end of each iteration, instead of actively caching arrays. Would be good to compare performance, if possible.

Yeah, I'm planning to add both a detailed PR description and documentation, and to allow batch-freeing instead of caching the arrays (which can be just an option in the caching allocator).

@pxl-th (Member Author) commented Dec 18, 2024:

@maleadt, I've updated the PR.
I've also added tests, but they are not enabled right now, because no backend currently has the implementation merged (including JLArrays, because the tests use its released version).
However, they pass locally on my machines.

Let me know what you think.

@pxl-th pxl-th marked this pull request as ready for review December 18, 2024 16:59
src/host/allocations_cache.jl — outdated review thread, resolved
docs/src/interface.md — outdated review thread, resolved
Given KernelAbstractions `backend`, return corresponding `PerDeviceCacheAllocator` for it.
Each GPU backend must implement this.
"""
cache_allocator(::Backend) = error("Not implemented.")
@maleadt (Member):

Instead of passing in a KA back-end, what about keying everything on the GPU array type? It seems more natural, as the KA back-end doesn't seem to be used anyway.

@pxl-th (Member Author):

The reason for passing the backend is that I was using it constantly when writing backend-agnostic code, as it is what enables KA kernel compilation (kernel(kab)(args...)) and transfer to GPU/CPU (adapt(kab, host_array)).
So this was my main "anchor" for backend-agnostic code.

Maybe we should also add an interface to obtain the KA backend from the array type:
KA.get_backend(::Type{AMDGPU.ROCArray}) = ROCBackend()
because currently we only have an interface for obtaining the backend from an actual array:
KA.get_backend(::AMDGPU.ROCArray) = ROCBackend()
which is cumbersome and leads to situations where you have to create 0-sized arrays just to obtain the KA backend.

Alternatively, when keying on the array type, I think I could live with a device(::AbstractGPUArray) interface method.

Same here, I'd rather query on ::Type{AbstractGPUArray}, not on the actual array.
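
For concreteness, the type-level method could sit next to the existing instance-level one and remove the need for dummy arrays. Illustrative sketch only (AMDGPU is just an example; this method is not currently part of KA's interface):

using AMDGPU
import KernelAbstractions as KA

# Hypothetical type-level method:
KA.get_backend(::Type{<:ROCArray}) = ROCBackend()

# Today: a 0-sized dummy array is needed to obtain the backend from a type.
kab = KA.get_backend(ROCArray{Float32}(undef, 0))
# With the type-level method it would simply be:
kab = KA.get_backend(ROCArray)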

@maleadt (Member):

> Maybe we should also add an interface to obtain the KA backend from the array type

That seems reasonable to me, cc @vchuravy. It's something that exists in Base as well for functions like eltype.

On the other hand, it may not be possible when the back-end encodes specific information like the device being used, but I don't know if that was ever the intent.

src/host/allocations_cache.jl — outdated review thread, resolved
Comment on lines 38 to 41
`skip_free::Bool` is used together with `PerDeviceCacheAllocator.free_immediately`.
When `true` arrays are bulk-freed instead of stored in cache.
In this case `alloc!` will avoid looking into "free" part of `cache`
and execute `alloc_f` immediately, storing allocation for future bulk-freeing.
@maleadt (Member):

skip_free seems like an implementation detail?

The free_immediately flag also feels like a hack; have you benchmarked the difference between relying on CUDA's allocator vs. fully caching in GPUArrays? I'd be happy to not support this at all (i.e., have CUDA.jl also cache allocations in GPUArrays.jl instead of relying on the CUDA caching allocator) to simplify the API, especially if it would make sense performance-wise.

@pxl-th (Member Author) commented Jan 6, 2025:

The difference is pretty minor, around 2% improvement with free_immediately=false.
Should we drop the flag?

@maleadt (Member):

Yeah, let's drop it. That way it also works for people who run with the caching allocator disabled (i.e., HPC folks relying on UCX/MPI).

@pxl-th (Member Author):

Done. I've also uncommented the tests so that JLArrays runs them, but other backends will fail on them for now.

@pxl-th pxl-th force-pushed the pxl-th/cache-alloc branch from 051cd6d to d6a74b0 Compare January 6, 2025 21:00
@pxl-th pxl-th self-assigned this Jan 6, 2025
@pxl-th pxl-th requested a review from maleadt January 6, 2025 21:15
@pxl-th (Member Author) commented Jan 7, 2025:

One difference I've found between Julia 1.10 and Julia 1.11:

  • Julia 1.10:
julia> GPUArrays.AllocCache.@enable CuArray :loop begin
           x1 = CuArray(rand(Float32, 1))
       end
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.680597

julia> x1
ERROR: UndefVarError: `x1` not defined

  • Julia 1.11:
julia> GPUArrays.AllocCache.@enable CuArray :loop begin
           x1 = CuArray(rand(Float32, 1))
       end
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.7703809

julia> x1
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 0.7703809

Not sure where that difference is coming from.
