Don't run multi-GPU test if enough GPUs aren't present #209
Conversation
Do you mean the ringG algorithm would fail if the number of GPUs is < 3? One might think the ringG algorithm with non-blocking MPI Isend/Irecv would work even if there is only 1 MPI rank. I think it is also possible for multiple MPI ranks to share the same GPU.
I agree with having a hotfix before a more generic implementation, but I would find it confusing to rely on an unrelated property for the switch.
```cmake
if (DCA_HAVE_CUDA)
  EXECUTE_PROCESS(COMMAND bash -c "nvidia-smi -L | awk 'BEGIN { num_gpu=0;} /GPU/ { num_gpu++;} END { printf(\"%d\", num_gpu) }'"
```
The ring algorithm should be independent of the number of GPUs on the node. Rather, the technical issue is that we rely on CUDA-aware MPI. Can we have a check similar to hostname==summit for the moment, assuming that automatically detecting whether MPI is CUDA-aware is more complicated?
The cvd script will definitely have undefined behavior if there is not exactly 1 GPU per rank.
The Open MPI FAQ has an entry "Can I tell at compile time or runtime whether I have CUDA-aware support?". It suggests testing for MPIX_CUDA_AWARE_SUPPORT:
https://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-dev
Just FYI
Issue added to address the CUDA-aware MPI dependency.
This will prevent multi-GPU tests from running if multiple GPUs are not present.