Reverse-mode VJPs with mixed scalar+vector GPU code #632
Could you elaborate what you mean by
What if we refactor
That could be worth a try. I didn't try any Buffer solutions here.
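For reference, one possible shape of a `Zygote.Buffer` approach, shown only as a CPU-side sketch since the scalar write below would hit the usual scalar-indexing restriction on a `CuArray`; the helper `assemble` and its arguments are made up for illustration:

```julia
using Zygote

# Hypothetical helper: build [sum(x)/m; -abs.(x)] without an array literal,
# using Zygote.Buffer so the writes stay differentiable.
function assemble(aKa_over_z, m)
    buf = Zygote.Buffer(aKa_over_z, length(aKa_over_z) + 1)
    buf[1] = sum(aKa_over_z) / m                      # scalar write
    buf[2:length(aKa_over_z)+1] = -abs.(aKa_over_z)   # vector write
    return copy(buf)                                  # copy turns the Buffer back into a plain array
end

Zygote.gradient(x -> sum(assemble(x, 32)), rand(Float32, 32))
```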
I mean

```julia
function f(ca, Z, t)
    a = ca[2:end]
    a_unit = a / sum(a)
    w_unit = Z*a_unit
    Ka_unit = Z'*w_unit
    z_unit = dot(abs.(Ka_unit), a_unit)
    aKa_over_z = a .* Ka_unit / z_unit
    [sum(aKa_over_z) / m; -abs.(aKa_over_z)] |> gpu
end
```

If we are making an array on the CPU and sending it to the GPU with every call to `f`, won't that be slow?
Yeah, that's not going to perform well. I guess CuArray could be made to support some available space before and after the array such that mutating operations like that could perform well, but that seems like opening another can of worms (how much space to reserve? should all GPU arrays have this, or do we need special constructors again? etc). Is it not an option to do this manually, by over-allocating and e.g. using a view for the current data?
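For what it's worth, a rough sketch of that manual over-allocation idea; the names and the capacity are made up, and the point is just that the live data is a view into a larger pre-allocated `CuArray`:

```julia
using CUDA

capacity = 1024                            # reserve more space than currently needed
storage  = CUDA.zeros(Float32, capacity)   # over-allocated backing buffer on the GPU
len      = 32
current  = view(storage, 1:len)            # the data actually in use

# "Growing" the array is then just widening the view into reserved space,
# instead of allocating a new array and copying:
current = view(storage, 1:len + 1)
```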
With SciML/SciMLSensitivity.jl#498 and JuliaGPU/GPUArrays.jl#379 together, the simple out-of-place code works:

```julia
using OrdinaryDiffEq
using DiffEqSensitivity
using LinearAlgebra
using Flux
using CUDA
using Zygote
using Random

rng = MersenneTwister(1234)

m = 32
n = 16
Z = randn(rng, Float32, (n,m)) |> gpu
𝒯 = 2.0
Δτ = 0.1
ca_init = [zeros(1); ones(m)] |> gpu

function f(ca, Z, t)
    a = ca[2:end]
    a_unit = a / sum(a)
    w_unit = Z*a_unit
    Ka_unit = Z'*w_unit
    z_unit = dot(abs.(Ka_unit), a_unit)
    aKa_over_z = a .* Ka_unit / z_unit
    [sum(aKa_over_z) / m; -abs.(aKa_over_z)]
end

function c(Z)
    prob = ODEProblem(f, ca_init, (0.,𝒯), Z, saveat=Δτ)
    sol = solve(prob, Tsit5(), sensealg=BacksolveAdjoint(), saveat=Δτ)
    # try this:
    return last(sol.u)[1]
    # or this:
    #return sol.u[20][1]
end

println("forward: ", c(Z))
println("backward: ", Zygote.gradient(c, Z))
```
The in-place issue has been isolated to EnzymeAD/Enzyme.jl#144.
The problem there is really just mutation in general + Zygote. The better solution then is probably to get the Enzyme+CUDA.jl stack working together, which I know @wsmoses has looked into.
All that's left is upstreamed. |
Very specific issue for a very specific case, but here's how it shows up. An example model:
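Presumably something like an in-place, GPU-resident version of the rhs that appears out-of-place elsewhere in this thread; a hedged sketch, reusing the names (`m`, `Z`, `dot`) from that example, with the scalar write being exactly the mixed scalar+vector pattern in the title:

```julia
# Hedged sketch only: an in-place rhs in the spirit of the out-of-place f elsewhere in the thread.
function f!(dca, ca, Z, t)
    a = ca[2:end]
    a_unit = a / sum(a)
    w_unit = Z * a_unit
    Ka_unit = Z' * w_unit
    z_unit = dot(abs.(Ka_unit), a_unit)
    aKa_over_z = a .* Ka_unit / z_unit
    dca[1] = sum(aKa_over_z) / m          # scalar part of the mixed scalar+vector update
    dca[2:end] .= -abs.(aKa_over_z)       # vector part
    return nothing
end
```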
The reason this fails at first is that ReverseDiff.jl does not work on GPUs. So we should first improve the automatic VJP choice to take that into account. Cool, but then we end up in a larger problem: in-place differentiation requires one of the scalarizing reverse-mode tape forms, and those won't work on GPUs no matter what. Enzyme could be a solution, but @wsmoses how close is it to automatically handling CUDA.jl kernels?
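For concreteness, the VJP choice in question is the `autojacvec` option of the adjoint sensealgs; a hedged sketch of picking it explicitly rather than relying on the automatic choice, with `prob` and `Δτ` as in the example elsewhere in this thread:

```julia
using OrdinaryDiffEq, DiffEqSensitivity

# Tape-based ReverseDiff VJP: fine on the CPU, cannot run on the GPU.
solve(prob, Tsit5(); sensealg = InterpolatingAdjoint(autojacvec = ReverseDiffVJP(true)), saveat = Δτ)

# Zygote VJP: works on the GPU, but requires the out-of-place form of f.
solve(prob, Tsit5(); sensealg = InterpolatingAdjoint(autojacvec = ZygoteVJP()), saveat = Δτ)
```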
So okay, we could do what we do with Neural ODEs on GPUs, namely make it out-of-place and use Zygote VJPs. Out-of-place form:
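Presumably the same out-of-place rhs as in the comment elsewhere in this thread, i.e. returning a fresh array instead of writing into a `du` buffer:

```julia
function f(ca, Z, t)
    a = ca[2:end]
    a_unit = a / sum(a)
    w_unit = Z*a_unit
    Ka_unit = Z'*w_unit
    z_unit = dot(abs.(Ka_unit), a_unit)
    aKa_over_z = a .* Ka_unit / z_unit
    [sum(aKa_over_z) / m; -abs.(aKa_over_z)]
end
```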
But this hits two issues. First of all, the result of

```julia
[sum(aKa_over_z) / m; -abs.(aKa_over_z)]
```

is surprisingly not on the GPU, which seems like a CUDA.jl issue: JuliaGPU/CUDA.jl#1162
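A minimal, hedged reproducer of that first problem; the values are arbitrary, and the point is only where the result of the array-literal concatenation lives:

```julia
using CUDA

x = CUDA.rand(Float32, 4)
y = [sum(x) / 4; -abs.(x)]   # vcat of a plain scalar with a CuArray
typeof(y)                    # reported above to come back as a CPU array, not a CuArray
```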
But secondly, if we try to put a `|> gpu` inside of the rhs function (which would be slow, but hopefully work?), then we hit FluxML/Zygote.jl#1080
So fully GPU algorithms work in this form, but algorithms which have some scalar values don't have a nice way of recreating a GPU-based array in a way that is differentiable, hence this issue.
@DhairyaLGandhi just an interesting thing to note.