Don't run multi-GPU test if enough GPUs aren't present #209
Conversation
Do you mean the ringG algorithm would fail if the number of GPUs is < 3? One might think the ringG algorithm with non-blocking MPI Isend/Irecv would work even if there is only 1 MPI rank. I think it is also possible for multiple MPI ranks to share the same GPU.
I agree with having a hotfix before a more generic implementation, but I would find it confusing to rely on an unrelated property for the switch.
```cmake
if (DCA_HAVE_CUDA)
  EXECUTE_PROCESS(COMMAND bash -c "nvidia-smi -L | awk 'BEGIN { num_gpu=0;} /GPU/ { num_gpu++;} END { printf(\"%d\", num_gpu) }'"
```
The ring algorithm should be independent of the number of GPUs on the node. Rather, the technical issue is that we rely on CUDA-aware MPI. Can we have a check similar to hostname==summit for the moment, assuming that automatically detecting whether MPI is CUDA-aware is more complicated?
The cvd script will definitely have undefined behavior if there is not exactly 1 GPU per rank.
The Open MPI FAQ has an entry "Can I tell at compile time or runtime whether I have CUDA-aware support?". It suggests testing for MPIX_CUDA_AWARE_SUPPORT:
https://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-dev
Just FYI
Issue added to address the CUDA-aware MPI dependency.
This will prevent multi-GPU tests from running if multiple GPUs are not present.