
feature request: runTheMatrix.py should assign a different GPU to each job #47337

Open · fwyzard opened this issue Feb 12, 2025 · 12 comments

@fwyzard (Contributor) commented Feb 12, 2025

runTheMatrix.py creates and executes jobs without any kind of GPU assignment.

On a machine with a single GPU, this is not an issue.

On a machine with more than one GPU, for example

$ rocmComputeCapabilities 
   0    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   1    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   2    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   3    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   4    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   5    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   6    gfx90a:sramecc+:xnack-    AMD Instinct MI250X
   7    gfx90a:sramecc+:xnack-    AMD Instinct MI250X

or

$ cudaComputeCapabilities 
   0     8.9    NVIDIA L4
   1     8.9    NVIDIA L4
   2     8.9    NVIDIA L4
   3     8.9    NVIDIA L4

the result is that all jobs try to use all GPUs, which is quite inefficient.

A better approach would be to assign a different GPU to each job, for example in a round-robin fashion.
If there are more concurrent jobs than GPUs, the GPUs will be shared - but to a much lesser extent than now.
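The round-robin assignment described above could be sketched roughly like this. This is a hypothetical helper, not actual runTheMatrix.py code (`gpu_env_for_job` is an invented name); it pins each job to one device via CUDA_VISIBLE_DEVICES:

```python
# Hypothetical sketch: assign GPUs to concurrent jobs round-robin by
# setting CUDA_VISIBLE_DEVICES in each job's environment.
import os

def gpu_env_for_job(job_index, gpu_ids):
    """Return the environment for one job, pinned to a single GPU.

    gpu_ids is the list of available (and supported) GPU indices; jobs
    beyond len(gpu_ids) wrap around, so GPUs are shared only when there
    are more concurrent jobs than devices.
    """
    env = dict(os.environ)
    if gpu_ids:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_ids[job_index % len(gpu_ids)])
    return env

# Example: 6 jobs on 4 GPUs -> devices 0, 1, 2, 3, 0, 1
print([gpu_env_for_job(i, [0, 1, 2, 3])["CUDA_VISIBLE_DEVICES"]
       for i in range(6)])
# -> ['0', '1', '2', '3', '0', '1']
```

Each job's launcher would then pass this environment to its cmsRun subprocess.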

@cmsbuild (Contributor) commented Feb 12, 2025

cms-bot internal usage

@cmsbuild (Contributor)

A new Issue was created by @fwyzard.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel (Contributor)

assign Configuration/PyReleaseValidation

@makortel (Contributor)

assign heterogeneous

@cmsbuild (Contributor)

New categories assigned: pdmv,upgrade,heterogeneous

@AdrianoDee,@DickyChant,@fwyzard,@makortel,@miquork,@Moanwar,@srimanob,@subirsarkar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@AdrianoDee (Contributor) commented Feb 12, 2025

Apart from running something like CUDA_VISIBLE_DEVICES=X cmsRun config.py do we have any mechanism in cmssw to tell a job to use a specific GPU?

@makortel (Contributor)

> Apart from running something like CUDA_VISIBLE_DEVICES=X cmsRun config.py do we have any mechanism in cmssw to tell a job to use a specific GPU?

No, CUDA_VISIBLE_DEVICES on NVIDIA (and one of HIP_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, ROCR_VISIBLE_DEVICES on AMD; it's not really clear to me which one would be the best choice (link)) would be the way to go (at least presently).

@AdrianoDee (Contributor)

And a follow-up question: how do we count the GPUs available on a node? I imagine we should do this outside any cmsRun, within runTheMatrix.py itself. What I've found with a quick StackOverflow search is something like nvidia-smi -L | wc -l (for NVIDIA; I didn't check for AMD).
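For reference, a rough Python equivalent of the `nvidia-smi -L | wc -l` approach mentioned above might look like this (hypothetical helper names, not existing runTheMatrix.py code):

```python
# Hypothetical sketch: count NVIDIA GPUs by counting the non-empty
# lines printed by `nvidia-smi -L` (one "GPU N: ..." line per device).
import subprocess

def count_gpu_lines(output):
    """Count non-empty lines in `nvidia-smi -L` output."""
    return sum(1 for line in output.splitlines() if line.strip())

def count_nvidia_gpus():
    try:
        out = subprocess.run(["nvidia-smi", "-L"], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return 0  # nvidia-smi missing or failed: assume no NVIDIA GPUs
    return count_gpu_lines(out)
```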

@makortel (Contributor) commented Feb 12, 2025

I would suggest building on top of our cudaComputeCapabilities program, which appends (unsupported) to the printout if a GPU is not supported by us (for whatever reason); that marker could be used by runTheMatrix.py to exclude such GPUs:

if (not isCudaDeviceSupported(i)) {
  std::cout << " (unsupported)";
}

(similarly for rocmComputeCapabilities for AMD)
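A sketch of how runTheMatrix.py might enumerate usable devices by parsing that printout (hypothetical helper names; the output format is assumed to match the examples in this issue, with "(unsupported)" appended to excluded devices):

```python
# Hypothetical sketch: list supported GPU indices by parsing the output
# of cudaComputeCapabilities (or rocmComputeCapabilities).
import subprocess

def parse_capabilities(output):
    """Return device indices from *ComputeCapabilities output,
    skipping lines flagged "(unsupported)"."""
    return [int(line.split()[0])
            for line in output.splitlines()
            if line.strip() and "unsupported" not in line]

def supported_gpus(command="cudaComputeCapabilities"):
    try:
        out = subprocess.run([command], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []  # tool missing or failed: treat as no usable GPUs
    return parse_capabilities(out)
```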

@fwyzard (Contributor, Author) commented Feb 12, 2025

> And follow-up question: how do we count the GPUs available on a node? I imagine we should do this outside any cmsRun and within runTheMatrix.py.

We can reuse the same mechanism that SCRAM uses to detect the GPUs:

if cudaIsEnabled; then
  # there is at least one supported NVIDIA GPU
else
  # there are no supported NVIDIA GPUs
fi

and

if rocmIsEnabled; then
  # there is at least one supported AMD GPU
else
  # there are no supported AMD GPUs
fi

Then you can enumerate the GPUs that are available and supported with

cudaComputeCapabilities | grep -v unsupported

and

rocmComputeCapabilities | grep -v unsupported

To select which NVIDIA GPU to use (e.g. 0) I've been using

CUDA_VISIBLE_DEVICES=0 cmsRun ...

and for AMD GPUs (e.g. 0):

CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 cmsRun ...
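From Python (as runTheMatrix.py would do it), the same AMD selection could be sketched as follows. This is a hypothetical helper (`amd_gpu_env` is an invented name), assuming the environment-variable behaviour shown above:

```python
# Hypothetical sketch: build the environment for a job pinned to one
# AMD GPU, mirroring `CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 cmsRun ...`.
import os
import subprocess

def amd_gpu_env(device, base=None):
    env = dict(base if base is not None else os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ""          # empty list: hide all GPUs
    env["HIP_VISIBLE_DEVICES"] = str(device)  # re-expose the chosen AMD GPU
    return env

# Usage (not executed here):
# subprocess.run(["cmsRun", "config.py"], env=amd_gpu_env(0))
```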

@fwyzard (Contributor, Author) commented Feb 12, 2025

> one of HIP_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL, ROCR_VISIBLE_DEVICES on AMD; it's not really clear to me which one would be the best choice

They are almost equivalent: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html .

  • ROCR_VISIBLE_DEVICES is applied at the ROCm level
  • HIP_VISIBLE_DEVICES is applied at the HIP runtime level
  • GPU_DEVICE_ORDINAL is applied at the HIP runtime level (and also for OpenCL)
  • CUDA_VISIBLE_DEVICES is also applied at the HIP runtime level for AMD GPUs

On a machine that has only AMD GPUs any of them works.

On a machine that has both NVIDIA and AMD GPUs (rare in practice, but I use one for testing) you need to set both CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES.

CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 cmsRun ...

  • CUDA_VISIBLE_DEVICES= with an empty list hides both NVIDIA and AMD GPUs;
  • HIP_VISIBLE_DEVICES=0 overrides that for AMD GPUs and selects the first one.

Note that ROCR_VISIBLE_DEVICES and GPU_DEVICE_ORDINAL are used at a lower level than CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES, so they cannot override an empty CUDA_VISIBLE_DEVICES=:

$ cudaComputeCapabilities
   0     8.9    NVIDIA L40S
   1     8.9    NVIDIA L4

$ CUDA_VISIBLE_DEVICES= cudaComputeCapabilities
cudaComputeCapabilities: no CUDA-capable device is detected

$ rocmComputeCapabilities
   0     gfx1100    AMD Radeon PRO W7800

$ CUDA_VISIBLE_DEVICES= rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected

$ CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 rocmComputeCapabilities
   0     gfx1100    AMD Radeon PRO W7800

$ CUDA_VISIBLE_DEVICES= ROCR_VISIBLE_DEVICES=0 rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected

$ CUDA_VISIBLE_DEVICES= GPU_DEVICE_ORDINAL=0 rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected

@AdrianoDee (Contributor)

A proposed solution in #47377
