## Current Status

- `batchnorm` -- cuDNN
- `batchnorm` -- native fallback, similar to what is done for `groupnorm` (added in feat: improved fallback BN implementation LuxLib.jl#106)

### Native Julia Implementations

- `groupnorm` has custom kernels for the forward and backward pass. This is achieved by reshaping all the arrays to 4D and writing either plain loops or KernelAbstractions (KA) kernels. This gives us around a 2-3x performance boost over the standard broadcast case (verified against Flux as well). A sketch of the approach is shown after this list.
- `instancenorm` -- very similar to `groupnorm`.
- `layernorm`
  - We should change the default to what PyTorch does. That case is simple to optimize using LoopVectorization on the CPU and KernelAbstractions on the GPU; see the second sketch after this list.
  - The general broadcasting case is very hard to optimize; the best we could do is fuse it into a single GPU kernel, and even that is probably not worth the effort.
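Not LuxLib's actual kernels, but a minimal sketch of the 4D-reshape idea for `groupnorm`, assuming a WHCN layout; the names `groupnorm_stats!` / `groupnorm_stats` are made up for illustration, and the affine/activation step is left as a follow-up broadcast.

```julia
using KernelAbstractions

# One workitem per (group, batch) pair: reduce over the flattened spatial dimension
# and the channels belonging to that group to get the group's mean and variance.
@kernel function groupnorm_stats!(μ, σ², @Const(x))
    g, n = @index(Global, NTuple)
    T = eltype(μ)
    len = size(x, 1) * size(x, 2)
    m = zero(T)
    @inbounds for c in axes(x, 2), k in axes(x, 1)
        m += x[k, c, g, n]
    end
    m /= len
    v = zero(T)
    @inbounds for c in axes(x, 2), k in axes(x, 1)
        v += (x[k, c, g, n] - m)^2
    end
    @inbounds μ[g, n] = m
    @inbounds σ²[g, n] = v / len
end

# Reshape a WHCN array to (spatial, channels ÷ groups, groups, batch) and launch the
# kernel on whatever backend (CPU, CUDA, AMDGPU, ...) the array already lives on.
function groupnorm_stats(x::AbstractArray{T,4}, groups::Integer) where {T}
    W, H, C, N = size(x)
    xr = reshape(x, W * H, C ÷ groups, groups, N)
    μ, σ² = similar(x, groups, N), similar(x, groups, N)
    backend = get_backend(x)
    groupnorm_stats!(backend)(μ, σ², xr; ndrange = (groups, N))
    KernelAbstractions.synchronize(backend)
    return μ, σ²
end

# The normalization itself can then be a single fused broadcast (or a second kernel):
#   x̂ = (xr .- reshape(μ, 1, 1, groups, N)) ./ sqrt.(reshape(σ², 1, 1, groups, N) .+ ϵ)
```

The backward pass follows the same pattern, with a kernel that accumulates the gradients over each (group, batch) slice.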
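And a minimal CPU sketch of the PyTorch-style `layernorm` default (normalize each sample over its feature dimension, then apply a per-feature affine transform) using LoopVectorization. The name `layernorm_turbo!`, the features × batch matrix layout, and the argument order are illustrative assumptions, not LuxLib's API.

```julia
using LoopVectorization

# Normalize each column (one sample) over its features, then apply scale γ and bias β.
function layernorm_turbo!(y::AbstractMatrix{T}, x::AbstractMatrix{T},
                          γ::AbstractVector{T}, β::AbstractVector{T};
                          ϵ::T = T(1e-5)) where {T<:Real}
    F, N = size(x)
    for n in 1:N
        μ = zero(T)
        @turbo for f in 1:F               # vectorized mean reduction
            μ += x[f, n]
        end
        μ /= F
        σ² = zero(T)
        @turbo for f in 1:F               # vectorized variance reduction
            σ² += (x[f, n] - μ)^2
        end
        inv_σ = inv(sqrt(σ² / F + ϵ))
        @turbo for f in 1:F               # fused normalize + affine
            y[f, n] = γ[f] * (x[f, n] - μ) * inv_σ + β[f]
        end
    end
    return y
end

# Example usage:
#   x = randn(Float32, 512, 64); γ = ones(Float32, 512); β = zeros(Float32, 512)
#   layernorm_turbo!(similar(x), x, γ, β)
```

A KernelAbstractions version for the GPU would replace the outer loop over columns with the kernel's global index, much like the `groupnorm` sketch above.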
### Integration into vendor libraries

- MIOpen for AMDGPU `batchnorm`. With the new batchnorm kernels, we should first check whether this is even worth it, though I would need someone with access to a ROCm-capable GPU to test it.

The kernels seem quite performant: we are close to cuDNN in performance even with the naive way they are currently written. See the benchmarking sketch below for how such comparisons can be run.
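As a rough illustration only (not the benchmark behind the numbers above), a CPU-side comparison with BenchmarkTools could pit the naive broadcast path against the `layernorm_turbo!` sketch above; the sizes are arbitrary. A GPU comparison against the cuDNN path would follow the same pattern with CUDA.jl arrays and `CUDA.@sync` around the calls.

```julia
using BenchmarkTools, Statistics

# Naive broadcast baseline -- the "standard case" the custom kernels are compared against.
layernorm_bcast(x; ϵ = 1f-5) =
    (x .- mean(x; dims = 1)) ./ sqrt.(var(x; dims = 1, corrected = false) .+ ϵ)

x = randn(Float32, 512, 256)
γ = ones(Float32, 512); β = zeros(Float32, 512); y = similar(x)

@btime layernorm_bcast($x);               # broadcast fallback
@btime layernorm_turbo!($y, $x, $γ, $β);  # LoopVectorization sketch from above
```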