Use contextual dispatch for device functions. #750

Merged: maleadt merged 3 commits into master from tb/contextual_dispatch on Mar 17, 2021

Conversation

maleadt (Member) commented Mar 5, 2021

Builds on JuliaGPU/GPUCompiler.jl#151, requires JuliaLang/julia#39697 on Julia 1.7 but should also work on 1.6. Supersedes JuliaGPU/CUDAnative.jl#334.
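
To illustrate the idea, here is a minimal sketch of what contextual dispatch makes possible, assuming the CUDA.@device_override macro shown later in this thread (the clamp01 function and the kernel are hypothetical, and the macro is discussed below as internal to CUDA.jl at this point, so this is illustrative only):

using CUDA

# Host definition: refuses to run on the CPU.
clamp01(x) = error("This function is not intended for use on the CPU")

# Device definition: substituted for the host method when the GPU
# compiler lowers calls to clamp01 inside a kernel.
CUDA.@device_override clamp01(x::Float32) = max(0f0, min(1f0, x))

function kernel(a)
    i = threadIdx().x
    a[i] = clamp01(a[i])
    return nothing
end

a = CuArray(Float32[-1, 0.5, 2])
@cuda threads=length(a) kernel(a)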

Fixes #60

julia> CUDA.saturate(1f0)
ERROR: This function is not intended for use on the CPU
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] saturate(x::Float32)
   @ CUDA ~/Julia/pkg/CUDA/src/device/intrinsics.jl:23
 [3] top-level scope
   @ REPL[10]:1

Fixes #42

julia> a = CuArray([Complex(1f0,2f0)])
1-element CuArray{ComplexF32, 1}:
 1.0f0 + 2.0f0im

julia> sincos.(a)
1-element CuArray{Tuple{ComplexF32, ComplexF32}, 1}:
 (3.1657786f0 + 1.9596009f0im, 2.032723f0 - 3.0518978f0im)

Fixes #659

julia> x = cu([true false; false true])
2×2 CuArray{Bool, 2}:
 1  0
 0  1

julia> argmax(x)
CartesianIndex(1, 1)

Fixes #140:

julia> using DualNumbers
[ Info: Precompiling DualNumbers [fa6b7ba4-c1ee-5f82-b5fc-ecf0adba8f74]

julia> p = CUDA.zeros(Dual128, 6)
6-element CuArray{Dual128, 1}:
 0.0 + 0.0ɛ
 0.0 + 0.0ɛ
 0.0 + 0.0ɛ
 0.0 + 0.0ɛ
 0.0 + 0.0ɛ
 0.0 + 0.0ɛ

julia> exp.(p)
6-element CuArray{Dual128, 1}:
 1.0 + 0.0ɛ
 1.0 + 0.0ɛ
 1.0 + 0.0ɛ
 1.0 + 0.0ɛ
 1.0 + 0.0ɛ
 1.0 + 0.0ɛ

Fixes #76:

julia> bar(x) = 1.0^x
bar (generic function with 1 method)

julia> bar.(CuArray([1]))
1-element CuArray{Float64, 1}:
 1.0

Fixes #71:

julia> function kernel_vpow(a, b)
           i = (blockIdx().x-1) * blockDim().x + threadIdx().x
           b[i] = a[i]^1.5
           return nothing
       end
kernel_vpow (generic function with 1 method)

julia> a = round.(rand(Float32, (3, 4)) * 100);

julia> d_a = CuArray(a);

julia> d_b = similar(d_a);

julia> @cuda kernel_vpow(d_a, d_b)
CUDA.HostKernel{kernel_vpow, Tuple{CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}}}(CuContext(0x00000000025d73b0, instance a0f57c0b433bf9bd), CuModule(Ptr{Nothing} @0x000000000c4fc0b0, CuContext(0x00000000025d73b0, instance a0f57c0b433bf9bd)), CuFunction(Ptr{Nothing} @0x000000000c28c6f0, CuModule(Ptr{Nothing} @0x000000000c4fc0b0, CuContext(0x00000000025d73b0, instance a0f57c0b433bf9bd))))

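(Side note: the launch above uses the default configuration of a single thread, so only b[1] is written; to compute every element one would pass an explicit launch configuration, for example:)

julia> @cuda threads=length(d_a) kernel_vpow(d_a, d_b)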

Fixes #171:

julia> x = rand(Float32, 100);

julia> x2 = x |> cu;

julia> x2 .^ 4
100-element CuArray{Float32, 1}:
...

Should fix #169, but I can't reproduce the original issue.

Should fix #130.

maleadt added the labels "enhancement" (New feature or request) and "cuda kernels" (Stuff about writing CUDA kernels) on Mar 5, 2021
maleadt force-pushed the tb/contextual_dispatch branch from 708a1b4 to ea8c0dd on March 5, 2021 15:08
maleadt mentioned this pull request on Mar 5, 2021
maleadt force-pushed the tb/contextual_dispatch branch from ea8c0dd to 322d8c4 on March 10, 2021 11:21
codecov bot commented Mar 10, 2021

Codecov Report

Merging #750 (90b9366) into master (187ff96) will decrease coverage by 0.03%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #750      +/-   ##
==========================================
- Coverage   78.25%   78.21%   -0.04%     
==========================================
  Files         122      121       -1     
  Lines        7269     7211      -58     
==========================================
- Hits         5688     5640      -48     
+ Misses       1581     1571      -10     
Impacted Files                 Coverage Δ
examples/wmma/high-level.jl    11.11%  <ø>        (ø)
examples/wmma/low-level.jl     14.28%  <ø>        (ø)
src/CUDA.jl                    100.00% <ø>        (ø)
src/accumulate.jl              97.05%  <ø>        (-0.09%) ⬇️
src/array.jl                   90.13%  <ø>        (+0.40%) ⬆️
src/mapreduce.jl               100.00% <ø>        (+2.27%) ⬆️
src/sorting.jl                 22.85%  <ø>        (-0.88%) ⬇️
src/broadcast.jl               85.71%  <100.00%>  (-0.83%) ⬇️
src/compiler/execution.jl      90.41%  <100.00%>  (-0.13%) ⬇️
src/compiler/gpucompiler.jl    94.11%  <100.00%>  (+0.78%) ⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

A review comment on src/device/intrinsics.jl was marked as resolved (outdated).
vchuravy (Member) commented Mar 11, 2021

This seems to work at first glance:

module CUDAKernels

using CUDA

function __init__()
    # Don't apply the overrides while precompiling; only do so at run time.
    precompiling = ccall(:jl_generating_output, Cint, ()) != 0
    if !precompiling
        # Evaluate the method overrides queued up by CUDA.jl, then clear the queue.
        eval(CUDA.overrides)
        empty!(CUDA.overrides.args)
    end
end

f() = 1
CUDA.@device_override f() = 2

function kernel(A)
    A[1] = f()
    nothing
end

end # module

This is after adding the empty! call and the $GPUCompiler interpolation.

maleadt force-pushed the tb/contextual_dispatch branch 4 times, most recently from 21c0848 to fa18f3d, on March 17, 2021 08:18
maleadt force-pushed the tb/contextual_dispatch branch from fa18f3d to bcf1b82 on March 17, 2021 08:22
maleadt (Member, Author) commented Mar 17, 2021

@vchuravy I didn't include your suggestions; @device_override isn't intended to be reusable, and it isn't safe to call __init__ twice. Since it's just a thin layer over the functionality from GPUArrays.jl (which is intended to be reusable), you can just call that.

maleadt marked this pull request as ready for review on March 17, 2021 10:08
maleadt added the "needs tests" (Tests are requested) label on Mar 17, 2021
maleadt force-pushed the tb/contextual_dispatch branch from ea52da1 to 90b9366 on March 17, 2021 14:16
maleadt merged commit 1d40f02 into master on Mar 17, 2021
maleadt deleted the tb/contextual_dispatch branch on March 17, 2021 15:52
vchuravy (Member) commented

Can you add a warning to that effect? E.g. that @device_override is internal to CUDA.jl and not supposed to be used by external users?
On 1.7 it could be used by external users, right?

> and it isn't safe to call __init__ twice.

Yeah, no plans here to call __init__ twice; more that eval(CUDA.overrides) could be made robust.

maleadt (Member, Author) commented Mar 17, 2021

If you want to rely on this being available externally, we could try to make that a possibility, maybe with a convenient register_overrides() function to be called at run time.
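
Purely as an illustration of that idea (register_overrides does not exist at this point; the function name and the SomeKernelPackage module are hypothetical), a downstream package might then look like:

module SomeKernelPackage  # hypothetical downstream package

using CUDA

f() = 1                        # host definition
CUDA.@device_override f() = 2  # GPU-specific definition

function __init__()
    # Hypothetical run-time entry point that would apply the queued
    # overrides, instead of eval'ing CUDA.overrides directly.
    CUDA.register_overrides()
end

end # module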
