h100 timings for CCO‐Surface50

Eric Bylaska edited this page May 22, 2024 · 10 revisions

CCO-surface50 Timings - 5/21/2024

  • This example is FFT dominant.

Directory: /home/bylaska/PWDFT/QA/CCO-Cu_surface50

The table contains performance timings for the computational task on the given machine with varying numbers of CPU cores (ncpus). The timings are presented in seconds (cputime) and are broken down into different components:

  • non-local: Timings for non-local operations.
  • ffm: Timings for ffm operations.
  • fmf: Timings for fmf operations.
  • fft: Timings for FFT (Fast Fourier Transform) operations.
  • diagonalize: Timings for diagonalize operations.

In the nvidia binary, FFT operations are exclusively performed using the GPU, while BLAS3 operations are executed on the GPU. Note also that the GPUs become overloaded after reaching a threshold of ncpus=6.

machine  ncpus  cputime    non-local  ffm        fmf        fft        diagonalize
h100     8
h100     16
h100     24
h100     32
SYCL     12     4.995e+00  4.499e-01  4.335e-02  1.859e-02  4.185e+00  3.259e-03
SYCL     24     3.066e+00  4.563e-01  4.693e-02  1.231e-02  2.351e+00  3.308e-03
SYCL     48     2.300e+00  4.931e-01  5.110e-02  1.410e-02  1.581e+00  4.477e-03
h100     64
h100     200

The table presents the total and component times for different numbers of CPU cores (ncpus). The optimal timings for each component are indicated by bold values.
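
As a rough check on the "FFT dominant" note, the short Python sketch below computes the FFT share of cputime and the scaling relative to the smallest run, using only the three populated SYCL rows from the table. The function names (`fft_fraction`, `speedup`, `parallel_efficiency`, `summarize`) are hypothetical helpers written for this illustration and are not part of PWDFT. Measuring speedup and efficiency against the ncpus=12 row is an assumption made here, since no single-core baseline is listed.

```python
# Timings copied from the populated SYCL rows of the table (seconds).
# Tuple layout: (ncpus, cputime, non_local, ffm, fmf, fft, diagonalize)
ROWS = [
    (12, 4.995e+00, 4.499e-01, 4.335e-02, 1.859e-02, 4.185e+00, 3.259e-03),
    (24, 3.066e+00, 4.563e-01, 4.693e-02, 1.231e-02, 2.351e+00, 3.308e-03),
    (48, 2.300e+00, 4.931e-01, 5.110e-02, 1.410e-02, 1.581e+00, 4.477e-03),
]


def fft_fraction(row):
    """Share of the total cputime spent in FFT operations."""
    _, cputime, _, _, _, fft, _ = row
    return fft / cputime


def speedup(base_row, row):
    """Speedup in cputime relative to the baseline row."""
    return base_row[1] / row[1]


def parallel_efficiency(base_row, row):
    """Speedup divided by the increase in ncpus (assumed baseline: base_row)."""
    return speedup(base_row, row) * base_row[0] / row[0]


def summarize(rows):
    """Return (ncpus, fft share, speedup, efficiency) for each row."""
    base = rows[0]  # assumption: the smallest listed run is the baseline
    return [
        (r[0], fft_fraction(r), speedup(base, r), parallel_efficiency(base, r))
        for r in rows
    ]


if __name__ == "__main__":
    for ncpus, frac, s, eff in summarize(ROWS):
        print(f"ncpus={ncpus:3d}  fft share={frac:6.1%}  "
              f"speedup={s:4.2f}  efficiency={eff:6.1%}")
```

In all three rows the FFT column accounts for more than half of cputime, which matches the note at the top of the page. The h100 rows are omitted because the table lists no timings for them.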
