Scalar indexing error from GPU matmul against Zygote.OneElement #1005

Closed
ChrisRackauckas opened this issue Jun 21, 2021 · 13 comments · Fixed by FluxML/Flux.jl#1704

@ChrisRackauckas
Member

MWE:

using Zygote, CUDA
CUDA.allowscalar(false)
W = CuArray(rand(4,4))
x = Zygote.OneElement(1f0,(1,),axes(rand(4)))
W' * x # Scalar indexing

From:

using DiffEqFlux, Flux, Optim, OrdinaryDiffEq, CUDA, DiffEqSensitivity, Plots
u0 = [1.1; 1.1] |> gpu
tspan = (0.0f0,25.0f0)
ann = FastChain(FastDense(2,16,tanh), FastDense(16,16,tanh), FastDense(16,1))
p1 = initial_params(ann)
p2 = Float32[0.5,-0.5]
p3 = [p1;p2]
θ = Float32[u0;p3]
function dudt_(u,p,t)
    x, y = u
    pend = cpu(p[end-1:end])
    @show typeof(p[1:length(p1)])
    @show typeof(gpu(u))
    @show cpu(ann(gpu(u),p[1:length(p1)]))[1]
    @show pend[1]*y + pend[2]*x
    [cpu(ann(gpu(u),p[1:length(p1)]))[1],pend[1]*y + pend[2]*x]
end
prob = ODEProblem{false}(dudt_,u0,tspan,p3)
function predict_adjoint(θ)
  gpu(Array(solve(prob,Tsit5(),u0=cpu(θ[1:2]),p=θ[3:end],saveat=0.0:1:25.0,sensealg=QuadratureAdjoint())))
end
loss_adjoint(θ) = sum(abs2,predict_adjoint(θ)[2,:].-1)
l = loss_adjoint(θ)
cb = function (θ,l)
  println(l)
  #display(plot(solve(remake(prob,p=Flux.data(p3),u0=Flux.data(u0)),Tsit5(),saveat=0.1),ylim=(0,6)))
  return false
end
loss1 = loss_adjoint(θ)
Zygote.gradient(loss_adjoint,θ)

SciML/DiffEqFlux.jl#571

@ChrisRackauckas
Member Author

@mcabbott

@mcabbott
Member

CUDA.allowscalar(false) means that you can't get a OneElement from the gradient of indexing W directly, as in W[1,1]. The key point of this example is that cpu(...)[1] does the indexing on a CPU array, but in the gradient that OneElement then gets mixed up with GPU objects.
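
For context, a quick way to see where the OneElement comes from (a check of my own, assuming a Zygote version that includes #962, where the pullback of scalar getindex returns a OneElement rather than a dense array):

using Zygote

# The cotangent of scalar indexing is a one-hot-like array, not a dense Vector:
dx, = Zygote.gradient(x -> x[1], rand(Float32, 4))
typeof(dx)  # Zygote.OneElement{Float32, 1, ...} on Zygote versions after #962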

Possible fixes are

  1. To start overloading *(::OneElement, ::AbstractMatrix) etc. These are obviously very simple, and even on the CPU could be made more efficient than generic_matmul. The difficulty is that the dispatch for * is a minefield of type ambiguities.
  2. To use Adapt.jl to translate OneElement to a CuArray, so that the gradient of cpu literally moves it to the GPU. That would pretty much restore the behaviour before #962 (RFC: more efficient ∇getindex).

Re 1. we should also wonder a bit what operations besides * might need overloading.
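
A minimal sketch of what such an overload could look like (my own illustration, not code from this issue or from the gist linked below; the ambiguity problem mentioned above is exactly what this naive version runs into for wrappers like Diagonal):

# Sketch: A * x for a one-hot vector x is just one column of A, scaled.
# This avoids generic_matmul, but adding enough methods like this without
# creating dispatch ambiguities (Diagonal, Adjoint, ...) is the hard part.
function Base.:*(A::AbstractMatrix, x::Zygote.OneElement{<:Number,1})
    i, = x.ind
    return A[:, i] .* x.val
end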

@ChrisRackauckas
Member Author

* and + I think would go pretty far for this.

@DhairyaLGandhi
Member

DhairyaLGandhi commented Jun 21, 2021

CUDA.jl now uses CUDA's allocator, which has much higher overhead than before. It's especially bad for small arrays, possibly hurting Flux more.

@mcabbott
Member

I think this is the actual MWE:

using Zygote, Flux, CUDA
CUDA.allowscalar(false)

Zygote.gradient(x -> cpu(2 .* gpu(x))[1], Float32[1,2,3]) == ([2,0,0],)  
# dot(x::Zygote.OneElement, y::CuArray)

Zygote.gradient(x -> cpu(gpu(x) * gpu(x))[1,2], Float32[1 2 3; 4 5 6; 7 8 9]) == ([2 6 8; 0 2 0; 0 3 0],) 
# generic_matmatmul!(C::CuArray, ..., A::Zygote.OneElement, ...)

And this is the simplest attempt at an Adapt solution, but it doesn't get called. Why?

using Adapt

a34 = Zygote.OneElement(3.4f0, (2,3), axes(rand(3,4)))
adapt(CuArray, a34) # isa OneElement

function Adapt.adapt(AT::Type{<: CUDA.AbstractGPUArray}, A::Zygote.OneElement{T2,N}) where {T2, N}
    B = fill!(similar(AT{T2}, axes(A)), zero(T2))
    CUDA.@allowscalar B[A.ind...] = A.val
    @show A.val
    B
end
adapt(CuArray, a34) # isa CuArray
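
One possible reason it isn't hit elsewhere (my reading, not something established in this thread): Adapt.jl's documented extension points are adapt_structure and adapt_storage; adapt(to, x) itself is just defined as adapt_structure(to, x), so a method added to adapt is only seen by direct calls and is bypassed by anything that goes through adapt_structure/adapt_storage. The same idea on the storage hook would be roughly:

using Adapt

# Sketch: extend adapt_storage, which is what adapt/adapt_structure fall back to,
# rather than adding a method to adapt itself.
function Adapt.adapt_storage(AT::Type{<:CUDA.AbstractGPUArray}, A::Zygote.OneElement{T2,N}) where {T2,N}
    B = fill!(similar(AT{T2}, axes(A)), zero(T2))
    CUDA.@allowscalar B[A.ind...] = A.val
    return B
end

adapt(CuArray, a34)  # isa CuArray, now via adapt_storage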

@mcabbott
Member

For the alternative plan, there are some attempts to overload * methods etc. here:

https://gist.github.com/mcabbott/4ea43bea49a25c198a20f55f590735c4

But, as promised, it gets tricky to avoid ambiguities.

@DhairyaLGandhi
Member

Why does OneElement need arithmetic overloads anyway? It shouldn't leak into user-facing code at all.

@mcabbott
Member

Simpler failure case with no indexing -- did cpu/gpu ever work inside gradients?

julia> using CUDA, Zygote, Flux
julia> CUDA.allowscalar(false)
julia> a = rand(Float32, 4, 4); ca = cu(rand(4, 4));

julia> gradient(x -> sum(abs, cpu(ca * gpu(a * x))), a)
ERROR: ArgumentError: cannot take the CPU address of a CuArray{Float32, 2}

julia> gradient(x -> sum(abs, collect(ca * cu(a * x))), a)
ERROR: ArgumentError: cannot take the CPU address of a CuArray{Float32, 2}

I see there's exactly one test of this in https://github.com/FluxML/Zygote.jl/blob/master/test/cuda.jl, from #929 I think. But it doesn't do the obvious test of the reverse order, which fails:

julia> gradient(x -> sum(cu(x)), [1 2 3.0])[1]
1×3 Matrix{Float32}:
 1.0  1.0  1.0

julia> gradient(x -> sum(cpu(x)), cu([1 2 3.0]))[1]
1×3 Fill{Float32}, with entries equal to 1.0

julia> gradient(x -> sum(abs, cpu(x)), cu([1 2 3.0]))[1]
1×3 Matrix{Float32}:
 1.0  1.0  1.0

@DhairyaLGandhi
Member

It did work just fine. We have movement tests in Flux as well.

@mcabbott
Member

It did work just fine.

Great, on which version exactly did you see gradient(x -> sum(abs, cpu(ca * gpu(a * x))), a) working? Then we can bisect.

We have movement tests in flux as well.

This would be https://github.com/FluxML/Flux.jl/blob/master/test/cuda/cuda.jl ? Which line exactly tests this? Why doesn't it catch gradient(x -> sum(abs, cpu(x)), cu([1 2 3.0]))[1] which seems the most obvious possible test?

@DhairyaLGandhi
Member

Surely we can add tests; besides, the test you suggest can be broken down into smaller chunks (specifically sum(cpu(x)), removing the abs).
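
For reference, a test along these lines (hypothetical, not quoting the existing test files; the CuArray assertion encodes the behaviour being argued for above, not what current versions return) might look like:

using Test, Zygote, Flux, CUDA
CUDA.allowscalar(false)

@testset "gradients through cpu/gpu movement" begin
    # gpu inside the loss, gradient w.r.t. a CPU array
    @test gradient(x -> sum(cu(x)), [1 2 3f0])[1] ≈ [1 1 1]
    # cpu inside the loss, gradient w.r.t. a GPU array -- the reverse order
    g = gradient(x -> sum(abs, cpu(x)), cu([1 2 3f0]))[1]
    @test g isa CuArray   # should not silently come back as a CPU array
end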

@mcabbott
Member

removing the abs

Indeed. You may even notice that I included that case above.

But again, on which version, exactly, did these work?

@DhairyaLGandhi
Member

I would start with before the broadcasting and accumulation changes, IIRC. But I'm going to have to check the versions to see where it's expected to work.
