To greatly increase performance with minimal coding changes, it may be worthwhile to break up the computation of the kernels (e.g., computing atomic orbitals, derivatives of atomic orbitals, second derivatives, etc.) based on the shell type. Currently, all functions are limited by the hardware maximum of 255 registers per thread, which reduces the number of active threads: based on profiling, I have observed at most 8 warps running concurrently. Additionally, breaking up the kernels can reduce branch divergence and enable better compiler optimizations.
If you break up the kernels into separate instantiations such as `compute_atomic_orbitals<S>`, `compute_atomic_orbitals<P>`, etc., then the S-type kernel can use fewer registers, so more threads can be resident at a time. Further, using CUDA streams would let the shell-type kernels run concurrently. This approach could also eliminate the if-statements by using template specialization.
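A minimal sketch of what the specialization might look like. The enum, argument list, and kernel bodies are placeholders of my own; only the `compute_atomic_orbitals` name and the `S`/`P` template parameters come from the text above, and `if constexpr` assumes compiling with `--std=c++17` or later:

```cuda
#include <cuda_runtime.h>

// Hypothetical shell-type tags (not from the existing code).
enum ShellType { S, P };

// One kernel instantiation per shell type. The shell type is a compile-time
// constant, so each instantiation only contains its own evaluation code.
template <ShellType shell>
__global__ void compute_atomic_orbitals(const double* d_points,
                                        double* d_orbitals,
                                        int n_points) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_points) return;

    if constexpr (shell == S) {
        // S-type evaluation only: no P-type code is compiled into this
        // instantiation, so it uses fewer registers and more warps fit.
        d_orbitals[idx] = 0.0;  // placeholder for the S-type contraction
    } else {
        // P-type evaluation; the branch is resolved at compile time, so
        // there is no runtime divergence between shell types.
        d_orbitals[idx] = 0.0;  // placeholder for the P-type contraction
    }
}
```

The compile-time branch is the key point: the S-type instantiation's register count is no longer dragged up by the worst-case (highest angular momentum) code path.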
To implement this for `evaluate_scalar_quantity`, you'll either need to change it to take an array of function pointers (one entry for the S-type function, one for the P-type function, etc.) or, more simply but harder to understand, use templates. CUDA streams would be added here as well.
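As a hedged sketch of the function-pointer variant combined with streams (the launcher signatures, `ShellType` enum, and kernel body are assumptions on my part; only `evaluate_scalar_quantity` is from the existing code):

```cuda
#include <cuda_runtime.h>

// Hypothetical shell tags and kernel; placeholder body.
enum ShellType { S, P, NUM_SHELL_TYPES };

template <ShellType shell>
__global__ void evaluate_scalar_quantity_kernel(const double* d_in,
                                                double* d_out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    d_out[idx] = 0.0;  // placeholder: shell-specific contribution
}

// Host-side launcher type: each array entry wraps the <<<...>>> launch of
// one specialization, so evaluate_scalar_quantity stays shell-agnostic.
using ShellLauncher = void (*)(const double*, double*, int, cudaStream_t);

template <ShellType shell>
void launch(const double* d_in, double* d_out, int n, cudaStream_t stream) {
    dim3 block(256), grid((n + 255) / 256);
    evaluate_scalar_quantity_kernel<shell>
        <<<grid, block, 0, stream>>>(d_in, d_out, n);
}

void evaluate_scalar_quantity(const double* d_in, double* d_out, int n) {
    // One entry per shell type; indexing replaces the runtime if-statements.
    static const ShellLauncher launchers[NUM_SHELL_TYPES] = {
        launch<S>, launch<P>
    };

    cudaStream_t streams[NUM_SHELL_TYPES];
    for (int i = 0; i < NUM_SHELL_TYPES; ++i) cudaStreamCreate(&streams[i]);

    // Each shell type runs on its own stream, so the kernels can overlap.
    for (int i = 0; i < NUM_SHELL_TYPES; ++i)
        launchers[i](d_in, d_out, n, streams[i]);

    for (int i = 0; i < NUM_SHELL_TYPES; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```

Note that a host-side array of launch wrappers sidesteps the restrictions on taking the address of a `__global__` function directly, while still giving the "array of function pointers" dispatch described above.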