To greatly increase performance with minimal coding changes, it may be worthwhile to break up the computation of the kernels (e.g., computing atomic orbitals, derivatives of atomic orbitals, second derivatives, etc.) based on the shell type. Currently, all functions are limited by the hardware maximum of 255 registers per thread, which reduces the number of active threads: based on profiling, I have observed at most 8 warps running concurrently. Additionally, breaking up the kernels can reduce branch divergence and enable better compiler optimizations.
If you break up the kernels into separate instantiations such as `compute_atomic_orbitals<S>`, `compute_atomic_orbitals<P>`, etc., then the S-type kernel can use fewer registers, so more threads can be resident at a time. Further, using CUDA streams would let the shell-type kernels run concurrently. This approach could also eliminate the if-statements by using template specialization.
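A minimal sketch of what the specialization might look like. The enum, argument list, and kernel bodies are placeholders of my own; only the `compute_atomic_orbitals` name and the `S`/`P` template parameters come from the text above, and `if constexpr` assumes compiling with `--std=c++17` or later:

```cuda
#include <cuda_runtime.h>

// Hypothetical shell-type tags (not from the existing code).
enum ShellType { S, P };

// One kernel instantiation per shell type. The shell type is a compile-time
// constant, so each instantiation only contains its own evaluation code.
template <ShellType shell>
__global__ void compute_atomic_orbitals(const double* d_points,
                                        double* d_orbitals,
                                        int n_points) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_points) return;

    if constexpr (shell == S) {
        // S-type evaluation only: no P-type code is compiled into this
        // instantiation, so it uses fewer registers and more warps fit.
        d_orbitals[idx] = 0.0;  // placeholder for the S-type contraction
    } else {
        // P-type evaluation; the branch is resolved at compile time, so
        // there is no runtime divergence between shell types.
        d_orbitals[idx] = 0.0;  // placeholder for the P-type contraction
    }
}
```

The compile-time branch is the key point: the S-type instantiation's register count is no longer dragged up by the worst-case (highest angular momentum) code path.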
To implement this for `evaluate_scalar_quantity`, you'll either need to change it to take an array of function pointers (one entry for the S-type function, one for the P-type function, etc.) or, more simply but harder to understand, use templates. CUDA streams would be added here as well.
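As a hedged sketch of the function-pointer variant combined with streams (the launcher signatures, `ShellType` enum, and kernel body are assumptions on my part; only `evaluate_scalar_quantity` is from the existing code):

```cuda
#include <cuda_runtime.h>

// Hypothetical shell tags and kernel; placeholder body.
enum ShellType { S, P, NUM_SHELL_TYPES };

template <ShellType shell>
__global__ void evaluate_scalar_quantity_kernel(const double* d_in,
                                                double* d_out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    d_out[idx] = 0.0;  // placeholder: shell-specific contribution
}

// Host-side launcher type: each array entry wraps the <<<...>>> launch of
// one specialization, so evaluate_scalar_quantity stays shell-agnostic.
using ShellLauncher = void (*)(const double*, double*, int, cudaStream_t);

template <ShellType shell>
void launch(const double* d_in, double* d_out, int n, cudaStream_t stream) {
    dim3 block(256), grid((n + 255) / 256);
    evaluate_scalar_quantity_kernel<shell>
        <<<grid, block, 0, stream>>>(d_in, d_out, n);
}

void evaluate_scalar_quantity(const double* d_in, double* d_out, int n) {
    // One entry per shell type; indexing replaces the runtime if-statements.
    static const ShellLauncher launchers[NUM_SHELL_TYPES] = {
        launch<S>, launch<P>
    };

    cudaStream_t streams[NUM_SHELL_TYPES];
    for (int i = 0; i < NUM_SHELL_TYPES; ++i) cudaStreamCreate(&streams[i]);

    // Each shell type runs on its own stream, so the kernels can overlap.
    for (int i = 0; i < NUM_SHELL_TYPES; ++i)
        launchers[i](d_in, d_out, n, streams[i]);

    for (int i = 0; i < NUM_SHELL_TYPES; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```

Note that a host-side array of launch wrappers sidesteps the restrictions on taking the address of a `__global__` function directly, while still giving the "array of function pointers" dispatch described above.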