Use contextual dispatch for device functions. #750

Merged: maleadt merged 3 commits into master from tb/contextual_dispatch on Mar 17, 2021

Conversation

maleadt (Member) commented Mar 5, 2021

Builds on JuliaGPU/GPUCompiler.jl#151, requires JuliaLang/julia#39697 on Julia 1.7 but should also work on 1.6. Supersedes JuliaGPU/CUDAnative.jl#334.
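
To illustrate the idea, here is a minimal sketch of what contextual dispatch makes possible, assuming the CUDA.@device_override macro shown later in this thread (the clamp01 function and the kernel are hypothetical, and the macro is discussed below as internal to CUDA.jl at this point, so this is illustrative only):

using CUDA

# Host definition: refuses to run on the CPU.
clamp01(x) = error("This function is not intended for use on the CPU")

# Device definition: substituted for the host method when the GPU
# compiler lowers calls to clamp01 inside a kernel.
CUDA.@device_override clamp01(x::Float32) = max(0f0, min(1f0, x))

function kernel(a)
    i = threadIdx().x
    a[i] = clamp01(a[i])
    return nothing
end

a = CuArray(Float32[-1, 0.5, 2])
@cuda threads=length(a) kernel(a)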

Fixes #60

julia> CUDA.saturate(1f0)
ERROR: This function is not intended for use on the CPU
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] saturate(x::Float32)
   @ CUDA ~/Julia/pkg/CUDA/src/device/intrinsics.jl:23
 [3] top-level scope
   @ REPL[10]:1

Fixes #42

julia> a = CuArray([Complex(1f0,2f0)])
1-element CuArray{ComplexF32, 1}:
 1.0f0 + 2.0f0im

julia> sincos.(a)
1-element CuArray{Tuple{ComplexF32, ComplexF32}, 1}:
 (3.1657786f0 + 1.9596009f0im, 2.032723f0 - 3.0518978f0im)

Fixes #659

julia> x = cu([true false; false true])
2×2 CuArray{Bool, 2}:
 1  0
 0  1

julia> argmax(x)
CartesianIndex(1, 1)

Fixes #140:

julia> using DualNumbers
[ Info: Precompiling DualNumbers [fa6b7ba4-c1ee-5f82-b5fc-ecf0adba8f74]

julia> p = CUDA.zeros(Dual128, 6)
6-element CuArray{Dual128, 1}:
 0.0 + 0.0ɛ
 0.0 + 0.0ɛ
 0.0 + 0.0ɛ
 0.0 + 0.0ɛ
 0.0 + 0.0ɛ
 0.0 + 0.0ɛ

julia> exp.(p)
6-element CuArray{Dual128, 1}:
 1.0 + 0.0ɛ
 1.0 + 0.0ɛ
 1.0 + 0.0ɛ
 1.0 + 0.0ɛ
 1.0 + 0.0ɛ
 1.0 + 0.0ɛ

Fixes #76:

julia> bar(x) = 1.0^x
bar (generic function with 1 method)

julia> bar.(CuArray([1]))
1-element CuArray{Float64, 1}:
 1.0

Fixes #71:

julia> function kernel_vpow(a, b)
           i = (blockIdx().x-1) * blockDim().x + threadIdx().x
           b[i] = a[i]^1.5
           return nothing
       end
kernel_vpow (generic function with 1 method)

julia> a = round.(rand(Float32, (3, 4)) * 100);

julia> d_a = CuArray(a);

julia> d_b = similar(d_a);

julia> @cuda kernel_vpow(d_a, d_b)
CUDA.HostKernel{kernel_vpow, Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}}}(CuContext(0x00000000025d73b0, instance a0f57c0b433bf9bd), CuModule(Ptr{Nothing} @0x000000000c4fc0b0, CuContext(0x00000000025d73b0, instance a0f57c0b433bf9bd)), CuFunction(Ptr{Nothing} @0x000000000c28c6f0, CuModule(Ptr{Nothing} @0x000000000c4fc0b0, CuContext(0x00000000025d73b0, instance a0f57c0b433bf9bd))))

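(Side note: the launch above uses the default configuration of a single thread, so only b[1] is written; to compute every element one would pass an explicit launch configuration, for example:)

julia> @cuda threads=length(d_a) kernel_vpow(d_a, d_b)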

Fixes #171:

julia> x = rand(Float32, 100);

julia> x2 = x |> cu;

julia> x2 .^ 4
100-element CuArray{Float32, 1}:
...

Should fix #169, but I can't reproduce the original issue.

Should fix #130.

maleadt added the labels "enhancement" (New feature or request) and "cuda kernels" (Stuff about writing CUDA kernels) on Mar 5, 2021
maleadt force-pushed the tb/contextual_dispatch branch from 708a1b4 to ea8c0dd on March 5, 2021 15:08
maleadt mentioned this pull request on Mar 5, 2021
maleadt force-pushed the tb/contextual_dispatch branch from ea8c0dd to 322d8c4 on March 10, 2021 11:21
codecov bot commented Mar 10, 2021

Codecov Report

Merging #750 (90b9366) into master (187ff96) will decrease coverage by 0.03%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #750      +/-   ##
==========================================
- Coverage   78.25%   78.21%   -0.04%     
==========================================
  Files         122      121       -1     
  Lines        7269     7211      -58     
==========================================
- Hits         5688     5640      -48     
+ Misses       1581     1571      -10     
Impacted Files                 Coverage Δ
examples/wmma/high-level.jl    11.11%  <ø>        (ø)
examples/wmma/low-level.jl     14.28%  <ø>        (ø)
src/CUDA.jl                    100.00% <ø>        (ø)
src/accumulate.jl              97.05%  <ø>        (-0.09%) ⬇️
src/array.jl                   90.13%  <ø>        (+0.40%) ⬆️
src/mapreduce.jl               100.00% <ø>        (+2.27%) ⬆️
src/sorting.jl                 22.85%  <ø>        (-0.88%) ⬇️
src/broadcast.jl               85.71%  <100.00%>  (-0.83%) ⬇️
src/compiler/execution.jl      90.41%  <100.00%>  (-0.13%) ⬇️
src/compiler/gpucompiler.jl    94.11%  <100.00%>  (+0.78%) ⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

A review comment on src/device/intrinsics.jl was marked as resolved (outdated).
vchuravy (Member) commented Mar 11, 2021

This seems to work at first glance:

module CUDAKernels

using CUDA

function __init__()
    # Don't apply the overrides while precompiling; only do so at run time.
    precompiling = ccall(:jl_generating_output, Cint, ()) != 0
    if !precompiling
        # Evaluate the method overrides queued up by CUDA.jl, then clear the queue.
        eval(CUDA.overrides)
        empty!(CUDA.overrides.args)
    end
end

f() = 1
CUDA.@device_override f() = 2

function kernel(A)
    A[1] = f()
    nothing
end

end # module

This is after adding the empty! call and the $GPUCompiler interpolation.

maleadt force-pushed the tb/contextual_dispatch branch 4 times, most recently from 21c0848 to fa18f3d, on March 17, 2021 08:18
maleadt force-pushed the tb/contextual_dispatch branch from fa18f3d to bcf1b82 on March 17, 2021 08:22
maleadt (Member, Author) commented Mar 17, 2021

@vchuravy I didn't include your suggestions; @device_override isn't intended to be reusable, and it isn't safe to call __init__ twice. Since it's just a thin layer over the functionality from GPUArrays.jl (which is intended to be reusable), you can just call that.

maleadt marked this pull request as ready for review on March 17, 2021 10:08
maleadt added the "needs tests" (Tests are requested) label on Mar 17, 2021
maleadt force-pushed the tb/contextual_dispatch branch from ea52da1 to 90b9366 on March 17, 2021 14:16
maleadt merged commit 1d40f02 into master on Mar 17, 2021
maleadt deleted the tb/contextual_dispatch branch on March 17, 2021 15:52
vchuravy (Member) commented

Can you add a warning to that effect? E.g. that @device_override is internal to CUDA.jl and not supposed to be used by external users?
On 1.7 it could be used by external users, right?

> and it isn't safe to call __init__ twice.

Yeah, no plans here to call __init__ twice; more that eval(CUDA.overrides) could be made robust.

maleadt (Member, Author) commented Mar 17, 2021

If you want to rely on this being available externally, we could try to make that a possibility, maybe with a convenient register_overrides() function to be called at run time.
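
Purely as an illustration of that idea (register_overrides does not exist at this point; the function name and the SomeKernelPackage module are hypothetical), a downstream package might then look like:

module SomeKernelPackage  # hypothetical downstream package

using CUDA

f() = 1                        # host definition
CUDA.@device_override f() = 2  # GPU-specific definition

function __init__()
    # Hypothetical run-time entry point that would apply the queued
    # overrides, instead of eval'ing CUDA.overrides directly.
    CUDA.register_overrides()
end

end # module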
