feature request: runTheMatrix.py should assign a different GPU to each job #47337
runTheMatrix.py creates and executes jobs without any kind of GPU assignment. On a machine with a single GPU, this is not an issue.
On a machine with more than one GPU, the result is that all jobs try to use all GPUs, which is quite inefficient.
A better approach would be to assign a different GPU to each job, for example in a round-robin fashion. If there are more concurrent jobs than GPUs, the GPUs will still be shared, but to a much lesser extent than now.

Comments
cms-bot internal usage

A new Issue was created by @fwyzard. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here.

assign Configuration/PyReleaseValidation

assign heterogeneous

New categories assigned: pdmv, upgrade, heterogeneous. @AdrianoDee, @DickyChant, @fwyzard, @makortel, @miquork, @Moanwar, @srimanob, @subirsarkar you have been requested to review this Pull request/Issue and eventually sign. Thanks.
Apart from running something like …

No, …

And a follow-up question: how do we count the GPUs available on a node? I imagine we should do this outside any …
I would think of building on top of our cmssw/HeterogeneousCore/CUDAUtilities/bin/cudaComputeCapabilities.cpp (lines 25 to 27 in 69407d5), and similarly for rocmComputeCapabilities.
We can reuse the same mechanism that SCRAM uses to detect the GPUs:

```bash
if cudaIsEnabled; then
  # there is at least one supported NVIDIA GPU
else
  # there are no supported NVIDIA GPUs
fi
```

and

```bash
if rocmIsEnabled; then
  # there is at least one supported AMD GPU
else
  # there are no supported AMD GPUs
fi
```

Then you can enumerate the GPUs that are available and supported with

```bash
cudaComputeCapabilities | grep -v unsupported
```

and

```bash
rocmComputeCapabilities | grep -v unsupported
```

To select which NVIDIA GPU to use, e.g.

```bash
CUDA_VISIBLE_DEVICES=0 cmsRun ...
```

and for AMD GPUs, e.g.

```bash
CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 cmsRun ...
```
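Putting these building blocks together, here is a minimal sketch (not existing CMSSW code) of how a wrapper script could collect the ids of the usable GPUs; it assumes the `<id> <capability> <name>` output format shown in the terminal session further below, and the variable names are illustrative only:

```bash
#!/bin/bash
# Sketch: collect the ids of the supported GPUs, if any.
nvidia_gpus=()
amd_gpus=()

if cudaIsEnabled; then
  # keep only the supported devices; the first field is the device id
  while read -r id _; do
    nvidia_gpus+=("$id")
  done < <(cudaComputeCapabilities | grep -v unsupported)
fi

if rocmIsEnabled; then
  while read -r id _; do
    amd_gpus+=("$id")
  done < <(rocmComputeCapabilities | grep -v unsupported)
fi

echo "usable NVIDIA GPUs: ${nvidia_gpus[*]:-none}"
echo "usable AMD GPUs: ${amd_gpus[*]:-none}"
```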
They are almost equivalent: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html .
On a machine that has only AMD GPUs, any of them works. On a machine that has both NVIDIA and AMD GPUs (rare in practice, but I use one for testing) you need to set both `CUDA_VISIBLE_DEVICES` and `HIP_VISIBLE_DEVICES`.
Note that:

```console
$ cudaComputeCapabilities
0    8.9    NVIDIA L40S
1    8.9    NVIDIA L4
$ CUDA_VISIBLE_DEVICES= cudaComputeCapabilities
cudaComputeCapabilities: no CUDA-capable device is detected
$ rocmComputeCapabilities
0    gfx1100    AMD Radeon PRO W7800
$ CUDA_VISIBLE_DEVICES= rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected
$ CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 rocmComputeCapabilities
0    gfx1100    AMD Radeon PRO W7800
$ CUDA_VISIBLE_DEVICES= ROCR_VISIBLE_DEVICES=0 rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected
$ CUDA_VISIBLE_DEVICES= GPU_DEVICE_ORDINAL=0 rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected
```
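In practice, the session above suggests that on such a mixed machine one should use `HIP_VISIBLE_DEVICES` (rather than `ROCR_VISIBLE_DEVICES` or `GPU_DEVICE_ORDINAL`) together with `CUDA_VISIBLE_DEVICES`. A minimal sketch of launching one job per GPU type; the configuration file name is a placeholder, and it assumes that an empty `HIP_VISIBLE_DEVICES` hides the AMD GPUs in the same way an empty `CUDA_VISIBLE_DEVICES` hides the NVIDIA ones:

```bash
# job pinned to the first NVIDIA GPU, with the AMD GPUs hidden
CUDA_VISIBLE_DEVICES=0 HIP_VISIBLE_DEVICES= cmsRun job_cfg.py &
# job pinned to the first AMD GPU, with the NVIDIA GPUs hidden
CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 cmsRun job_cfg.py &
wait
```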
A proposed solution in #47377.
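For illustration only (not the actual implementation in #47377), a minimal sketch of the round-robin idea: given the list of usable GPU ids collected as above, job i is assigned to GPU i mod N. The GPU ids and workflow configuration names are placeholders:

```bash
#!/bin/bash
# Hypothetical list of usable GPU ids, e.g. filled by parsing
# "cudaComputeCapabilities | grep -v unsupported" as sketched earlier.
gpus=(0 1)

# Placeholder job list standing in for the workflows runTheMatrix.py would run.
jobs=(wf1_cfg.py wf2_cfg.py wf3_cfg.py wf4_cfg.py)

for i in "${!jobs[@]}"; do
  # round-robin assignment: job i goes to GPU i mod N
  gpu=${gpus[$((i % ${#gpus[@]}))]}
  CUDA_VISIBLE_DEVICES=$gpu cmsRun "${jobs[$i]}" &
done
wait
```

With more concurrent jobs than GPUs, each GPU is then shared only by the jobs mapped to it, instead of every job opening every GPU.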