GPU acceleration on a single machine #900

Closed
muffato opened this issue Jan 11, 2023 · 12 comments
@muffato
Contributor

muffato commented Jan 11, 2023

Hello,

This is the equivalent ticket to #887, but for the single_machine batch system.

    raise InsufficientSystemResources(requirer, 'accelerators', self.accelerator_identities, details=[
toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job LastzRepeatMaskJob is requesting [{'count': 1, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'}] accelerators, more than the maximum of [] accelerators that SingleMachineBatchSystem was configured with. The accelerator {'count': 1, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'} could not be provided. Scale is set to 1.0.

According to the release notes:

Upgrade to Toil 5.8.0, which allows GPU counts to be assigned to jobs (which can be passed from cactus via the --gpu option). Toil only currently supports this functionality in single machine mode.

which sounds like it should work?

I'm running this command:

cactus  $PWD/240_3_js $PWD/data/evolverMammals.txt $PWD/240_3.hal --gpu 1
@glennhickey
Collaborator

Yeah, it looks like Toil doesn't believe you're on a machine with a GPU. I think Toil relies on nvidia-smi, i.e. through toil.lib.accelerators.count_nvidia_gpus(), to make that determination. Does nvidia-smi work on your system?

Unfortunately, there's no real workaround on the Cactus end that I can think of: starting in this release, GPU jobs need to be assigned GPUs via Toil.
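A quick way to check what Toil is seeing, in case it helps (just an illustrative one-liner; it assumes Toil is importable in the same environment Cactus actually runs in):

python3 -c 'from toil.lib.accelerators import count_nvidia_gpus; print(count_nvidia_gpus())'

If that prints 0, it would explain the error above.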

@muffato
Contributor Author

muffato commented Jan 11, 2023

(I should have said I'm using Cactus 2.4.0)

Yes, nvidia-smi works on this machine:

Wed Jan 11 19:51:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                   On |
| N/A   31C    P0    48W / 500W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                   On |
| N/A   28C    P0    50W / 500W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                   On |
| N/A   29C    P0    48W / 500W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                   On |
| N/A   26C    P0    48W / 500W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    9   0   0  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   13   0   0  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@muffato
Contributor Author

muffato commented Jan 11, 2023

and:

$ nvidia-smi -q -x | head -n 7
<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v11.dtd">
<nvidia_smi_log>
        <timestamp>Wed Jan 11 19:55:17 2023</timestamp>
        <driver_version>510.108.03</driver_version>
        <cuda_version>11.6</cuda_version>
        <attached_gpus>4</attached_gpus>
(`attached_gpus` is the [key Toil looks for](https://github.com/DataBiosphere/toil/blob/083ad3b6d40f145e3ba2073fe66c7c0cf3c60ea7/src/toil/lib/accelerators.py#L53))
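For reference, that field can be checked directly (an illustrative one-liner, nothing Toil-specific):

nvidia-smi -q -x | grep -m 1 attached_gpus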

@muffato
Contributor Author

muffato commented Jan 11, 2023

Should I re-raise the issue on Toil?

@glennhickey
Collaborator

Yes, please do, as I'm 99% sure this is on the Toil end. And judging by the output of your nvidia-smi, it should be detecting your 4 GPUs without issue. I'll ping @adamnovak here too.

@glennhickey
Collaborator

I'd also be curious to know what happens when you run with --gpu but without an argument. Cactus should then use that Toil function to set all 4 GPUs.

@muffato
Contributor Author

muffato commented Jan 11, 2023

The argument to --gpu seems to be mandatory:

[2023-01-11T20:50:29+0000] [MainThread] [I] [toil.statsAndLogging] Cactus Commit: 47f9079cc31a5533ffb76f038480fdec1b6f7c4f
Traceback (most recent call last):
  File "/home/cactus/cactus_env/bin/cactus", line 8, in <module>
    sys.exit(main())
  File "/home/cactus/cactus_env/lib/python3.8/site-packages/cactus/progressive/cactus_progressive.py", line 372, in main
    config_wrapper.initGPU(options)
  File "/home/cactus/cactus_env/lib/python3.8/site-packages/cactus/shared/configWrapper.py", line 274, in initGPU
    lastz_gpu = get_gpu_count()
  File "/home/cactus/cactus_env/lib/python3.8/site-packages/cactus/shared/configWrapper.py", line 261, in get_gpu_count
    raise RuntimeError('Unable to automatically determine number of GPUs: Please set with --gpu N')
RuntimeError: Unable to automatically determine number of GPUs: Please set with --gpu N

@glennhickey
Collaborator

glennhickey commented Jan 11, 2023 via email

@muffato
Contributor Author

muffato commented Jan 11, 2023

I've found the problem! nvidia-smi is in /usr/bin on the host, but I run Cactus through Singularity, so the host's /usr/bin is not available inside the container.

Thanks for the help 🤝

@adamnovak
Collaborator

I think you need to run Cactus in a top-level Singularity container that has the GPUs available, along with the NVIDIA userspace binaries (like nvidia-smi) needed to access them. Can't you mount those in somehow?

@glennhickey
Collaborator

Ah, OK, running in Singularity is an important detail. It's still strange, though, because the GPU-enabled Cactus image (quay.io/comparative-genomics-toolkit/cactus:v2.4.0-gpu) contains /usr/bin/nvidia-smi. If you're running from the regular release image (quay.io/comparative-genomics-toolkit/cactus:v2.4.0), even if you could detect GPUs it wouldn't work, since SegAlign isn't included.
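To double-check which image is in use, something like this should show whether the binary is present (an illustrative command; it assumes the image is pulled via the docker:// URI):

singularity exec docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0-gpu ls -l /usr/bin/nvidia-smi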

@muffato
Contributor Author

muffato commented Jan 11, 2023

Sorry guys, I forgot the --nv flag 🤦🏼🤦🏼!
With it, Singularity mounts nvidia-smi (and the other NVIDIA libraries/executables) into the container, and Toil can successfully detect the GPUs.
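For anyone finding this later, the working invocation looks something like this (illustrative; it assumes the GPU-enabled image and reuses the paths from my original command):

singularity exec --nv docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0-gpu cactus $PWD/240_3_js $PWD/data/evolverMammals.txt $PWD/240_3.hal --gpu 1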
