
feature request: runTheMatrix.py should assign a different GPU to each job #47337

Open · fwyzard opened this issue Feb 12, 2025 · 12 comments

@fwyzard (Contributor) commented Feb 12, 2025

runTheMatrix.py creates and executes jobs without any kind of GPU assignment.

On a machine with a single GPU, this is not an issue.

On a machine with more than one GPU, for example

$ rocmComputeCapabilities 
   0    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   1    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   2    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   3    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   4    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   5    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   6    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   7    gfx90a:sramecc+:xnack-    AMD Instinct MI250X

or

$ cudaComputeCapabilities 
   0     8.9    NVIDIA L4
   1     8.9    NVIDIA L4
   2     8.9    NVIDIA L4
   3     8.9    NVIDIA L4

the result is that all jobs try to use all GPUs, which is quite inefficient.

A better approach would be to assign a different GPU to each job, for example in a round-robin fashion.
If there are more concurrent jobs than GPUs, the GPUs will be shared - but to a much lesser extent than now.
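The round-robin assignment described above could be sketched roughly like this. This is a hypothetical helper, not actual runTheMatrix.py code (`gpu_env_for_job` is an invented name); it pins each job to one device via CUDA_VISIBLE_DEVICES:

```python
# Hypothetical sketch: assign GPUs to concurrent jobs round-robin by
# setting CUDA_VISIBLE_DEVICES in each job's environment.
import os

def gpu_env_for_job(job_index, gpu_ids):
    """Return the environment for one job, pinned to a single GPU.

    gpu_ids is the list of available (and supported) GPU indices; jobs
    beyond len(gpu_ids) wrap around, so GPUs are shared only when there
    are more concurrent jobs than devices.
    """
    env = dict(os.environ)
    if gpu_ids:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_ids[job_index % len(gpu_ids)])
    return env

# Example: 6 jobs on 4 GPUs -> devices 0, 1, 2, 3, 0, 1
print([gpu_env_for_job(i, [0, 1, 2, 3])["CUDA_VISIBLE_DEVICES"]
       for i in range(6)])
# -> ['0', '1', '2', '3', '0', '1']
```

Each job's launcher would then pass this environment to its cmsRun subprocess.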

@cmsbuild (Contributor) commented Feb 12, 2025

cms-bot internal usage

@cmsbuild (Contributor)

A new Issue was created by @fwyzard.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel (Contributor)

assign Configuration/PyReleaseValidation

@makortel (Contributor)

assign heterogeneous

@cmsbuild (Contributor)

New categories assigned: pdmv,upgrade,heterogeneous

@AdrianoDee,@DickyChant,@fwyzard,@makortel,@miquork,@Moanwar,@srimanob,@subirsarkar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@AdrianoDee (Contributor) commented Feb 12, 2025

Apart from running something like CUDA_VISIBLE_DEVICES=X cmsRun config.py do we have any mechanism in cmssw to tell a job to use a specific GPU?

@makortel (Contributor)

> Apart from running something like CUDA_VISIBLE_DEVICES=X cmsRun config.py do we have any mechanism in cmssw to tell a job to use a specific GPU?

No, CUDA_VISIBLE_DEVICES on NVIDIA (and one of HIP_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, ROCR_VISIBLE_DEVICES on AMD; it's not really clear to me which one would be the best choice (link)) would be the way to go (at least presently).

@AdrianoDee (Contributor)

And a follow-up question: how do we count the GPUs available on a node? I imagine we should do this outside any cmsRun, within runTheMatrix.py itself. What I've found with a quick StackOverflow search is something like nvidia-smi -L | wc -l (for NVIDIA; I didn't check for AMD).
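For reference, a rough Python equivalent of the `nvidia-smi -L | wc -l` approach mentioned above might look like this (hypothetical helper names, not existing runTheMatrix.py code):

```python
# Hypothetical sketch: count NVIDIA GPUs by counting the non-empty
# lines printed by `nvidia-smi -L` (one "GPU N: ..." line per device).
import subprocess

def count_gpu_lines(output):
    """Count non-empty lines in `nvidia-smi -L` output."""
    return sum(1 for line in output.splitlines() if line.strip())

def count_nvidia_gpus():
    try:
        out = subprocess.run(["nvidia-smi", "-L"], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return 0  # nvidia-smi missing or failed: assume no NVIDIA GPUs
    return count_gpu_lines(out)
```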

@makortel (Contributor) commented Feb 12, 2025

I would suggest building on top of our cudaComputeCapabilities program, which appends (unsupported) to the printout if a GPU is not supported by us (for whatever reason); that marker could be used by runTheMatrix.py to exclude such GPUs:

if (not isCudaDeviceSupported(i)) {
  std::cout << " (unsupported)";
}

(similarly for rocmComputeCapabilities for AMD)
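A sketch of how runTheMatrix.py might enumerate usable devices by parsing that printout (hypothetical helper names; the output format is assumed to match the examples in this issue, with "(unsupported)" appended to excluded devices):

```python
# Hypothetical sketch: list supported GPU indices by parsing the output
# of cudaComputeCapabilities (or rocmComputeCapabilities).
import subprocess

def parse_capabilities(output):
    """Return device indices from *ComputeCapabilities output,
    skipping lines flagged "(unsupported)"."""
    return [int(line.split()[0])
            for line in output.splitlines()
            if line.strip() and "unsupported" not in line]

def supported_gpus(command="cudaComputeCapabilities"):
    try:
        out = subprocess.run([command], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []  # tool missing or failed: treat as no usable GPUs
    return parse_capabilities(out)
```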

@fwyzard (Contributor, Author) commented Feb 12, 2025

> And follow-up question: how do we count the GPUs available on a node? I imagine we should do this outside any cmsRun and within runTheMatrix.py.

We can reuse the same mechanism that SCRAM uses to detect the GPUs:

if cudaIsEnabled; then
  # there is at least one supported NVIDIA GPU
else
  # there are no supported NVIDIA GPUs
fi

and

if rocmIsEnabled; then
  # there is at least one supported AMD GPU
else
  # there are no supported AMD GPUs
fi

Then you can enumerate the GPUs that are available and supported with

cudaComputeCapabilities | grep -v unsupported

and

rocmComputeCapabilities | grep -v unsupported

To select which NVIDIA GPU to use (e.g. 0) I've been using

CUDA_VISIBLE_DEVICES=0 cmsRun ...

and for AMD GPUs (e.g. 0):

CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 cmsRun ...
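From Python (as runTheMatrix.py would do it), the same AMD selection could be sketched as follows. This is a hypothetical helper (`amd_gpu_env` is an invented name), assuming the environment-variable behaviour shown above:

```python
# Hypothetical sketch: build the environment for a job pinned to one
# AMD GPU, mirroring `CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 cmsRun ...`.
import os
import subprocess

def amd_gpu_env(device, base=None):
    env = dict(base if base is not None else os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ""          # empty list: hide all GPUs
    env["HIP_VISIBLE_DEVICES"] = str(device)  # re-expose the chosen AMD GPU
    return env

# Usage (not executed here):
# subprocess.run(["cmsRun", "config.py"], env=amd_gpu_env(0))
```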

@fwyzard (Contributor, Author) commented Feb 12, 2025

> one of HIP_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, ROCR_VISIBLE_DEVICES on AMD; it's not really clear to me which one would be the best choice

They are almost equivalent: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html .

  • ROCR_VISIBLE_DEVICES is applied at the ROCm level
  • HIP_VISIBLE_DEVICES is applied at the HIP runtime level
  • GPU_DEVICE_ORDINAL is applied at the HIP runtime level (and also for OpenCL)
  • CUDA_VISIBLE_DEVICES is also applied at the HIP runtime level for AMD GPUs

On a machine that has only AMD GPUs any of them works.

On a machine that has both NVIDIA and AMD GPUs (rare in practice, but I use one for testing) you need to set both CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES.

CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 cmsRun ...

  • CUDA_VISIBLE_DEVICES= with an empty list hides both NVIDIA and AMD GPUs;
  • HIP_VISIBLE_DEVICES=0 overrides that for AMD GPUs and selects the first one.

Note that ROCR_VISIBLE_DEVICES and GPU_DEVICE_ORDINAL are used at a lower level than CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES, so they cannot override an empty CUDA_VISIBLE_DEVICES=:

$ cudaComputeCapabilities
   0     8.9    NVIDIA L40S
   1     8.9    NVIDIA L4

$ CUDA_VISIBLE_DEVICES= cudaComputeCapabilities
cudaComputeCapabilities: no CUDA-capable device is detected

$ rocmComputeCapabilities
   0     gfx1100    AMD Radeon PRO W7800

$ CUDA_VISIBLE_DEVICES= rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected

$ CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 rocmComputeCapabilities
   0     gfx1100    AMD Radeon PRO W7800

$ CUDA_VISIBLE_DEVICES= ROCR_VISIBLE_DEVICES=0 rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected

$ CUDA_VISIBLE_DEVICES= GPU_DEVICE_ORDINAL=0 rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected

@AdrianoDee (Contributor)

A proposed solution in #47377
