GPU acceleration on a single machine #900

Closed
muffato opened this issue Jan 11, 2023 · 12 comments
@muffato
Contributor

muffato commented Jan 11, 2023

Hello,

This is the equivalent ticket to #887, but for the single_machine batch system.

    raise InsufficientSystemResources(requirer, 'accelerators', self.accelerator_identities, details=[
toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job LastzRepeatMaskJob is requesting [{'count': 1, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'}] accelerators, more than the maximum of [] accelerators that SingleMachineBatchSystem was configured with. The accelerator {'count': 1, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'} could not be provided. Scale is set to 1.0.

According to the release notes:

Upgrade to Toil 5.8.0, which allows GPU counts to be assigned to jobs (which can be passed from cactus via the --gpu option). Toil only currently supports this functionality in single machine mode.

which sounds like it should work?

I'm running this command:

cactus  $PWD/240_3_js $PWD/data/evolverMammals.txt $PWD/240_3.hal --gpu 1
@glennhickey
Collaborator

Yeah, it looks like Toil doesn't believe you're on a machine with a GPU. I think Toil relies on nvidia-smi, i.e. through toil.lib.accelerators.count_nvidia_gpus(), to make that determination. Does nvidia-smi work on your system?

Unfortunately, there's no real workaround on the Cactus end that I can think of: starting in this release, GPU jobs need to be assigned GPUs via Toil.
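A quick way to check what Toil is seeing, in case it helps (just an illustrative one-liner; it assumes Toil is importable in the same environment Cactus actually runs in):

python3 -c 'from toil.lib.accelerators import count_nvidia_gpus; print(count_nvidia_gpus())'

If that prints 0, it would explain the error above.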

@muffato
Contributor Author

muffato commented Jan 11, 2023

(I should have said I'm using Cactus 2.4.0)

Yes, nvidia-smi works on this machine:

Wed Jan 11 19:51:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                   On |
| N/A   31C    P0    48W / 500W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                   On |
| N/A   28C    P0    50W / 500W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                   On |
| N/A   29C    P0    48W / 500W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                   On |
| N/A   26C    P0    48W / 500W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    9   0   0  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   13   0   0  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@muffato
Contributor Author

muffato commented Jan 11, 2023

and:

$ nvidia-smi -q -x | head -n 7
<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v11.dtd">
<nvidia_smi_log>
        <timestamp>Wed Jan 11 19:55:17 2023</timestamp>
        <driver_version>510.108.03</driver_version>
        <cuda_version>11.6</cuda_version>
        <attached_gpus>4</attached_gpus>
(`attached_gpus` is the [key Toil looks for](https://github.com/DataBiosphere/toil/blob/083ad3b6d40f145e3ba2073fe66c7c0cf3c60ea7/src/toil/lib/accelerators.py#L53))
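For reference, that field can be checked directly (an illustrative one-liner, nothing Toil-specific):

nvidia-smi -q -x | grep -m 1 attached_gpus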

@muffato
Contributor Author

muffato commented Jan 11, 2023

Should I re-raise the issue on Toil?

@glennhickey
Collaborator

Yes, please do, as I'm 99% sure this is on the Toil end. And judging by the output of your nvidia-smi, it should be detecting your 4 GPUs without issue. I'll ping @adamnovak here too.

@glennhickey
Collaborator

I'd also be curious to know what happens when you run with --gpu but without an argument. Cactus should then use that Toil function to set all 4 GPUs.

@muffato
Contributor Author

muffato commented Jan 11, 2023

The argument to --gpu seems to be mandatory:

[2023-01-11T20:50:29+0000] [MainThread] [I] [toil.statsAndLogging] Cactus Commit: 47f9079cc31a5533ffb76f038480fdec1b6f7c4f
Traceback (most recent call last):
  File "/home/cactus/cactus_env/bin/cactus", line 8, in <module>
    sys.exit(main())
  File "/home/cactus/cactus_env/lib/python3.8/site-packages/cactus/progressive/cactus_progressive.py", line 372, in main
    config_wrapper.initGPU(options)
  File "/home/cactus/cactus_env/lib/python3.8/site-packages/cactus/shared/configWrapper.py", line 274, in initGPU
    lastz_gpu = get_gpu_count()
  File "/home/cactus/cactus_env/lib/python3.8/site-packages/cactus/shared/configWrapper.py", line 261, in get_gpu_count
    raise RuntimeError('Unable to automatically determine number of GPUs: Please set with --gpu N')
RuntimeError: Unable to automatically determine number of GPUs: Please set with --gpu N

@glennhickey
Collaborator

glennhickey commented Jan 11, 2023 via email

@muffato
Contributor Author

muffato commented Jan 11, 2023

I've found the problem! nvidia-smi is in /usr/bin on the host, but I run Cactus through Singularity, so the host's /usr/bin is not available inside the container.

Thanks for the help 🤝

@adamnovak
Collaborator

I think you need to run Cactus in a top-level Singularity container that has the GPUs available, along with the NVIDIA userspace binaries (like nvidia-smi) needed to access them. Can't you mount those in somehow?

@glennhickey
Collaborator

Ah, OK, running in Singularity is an important detail. It's still strange, though, because the GPU-enabled Cactus image (quay.io/comparative-genomics-toolkit/cactus:v2.4.0-gpu) contains /usr/bin/nvidia-smi. If you're running from the regular release image (quay.io/comparative-genomics-toolkit/cactus:v2.4.0), even if you could detect GPUs it wouldn't work, since SegAlign isn't included.
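To double-check which image is in use, something like this should show whether the binary is present (an illustrative command; it assumes the image is pulled via the docker:// URI):

singularity exec docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0-gpu ls -l /usr/bin/nvidia-smi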

@muffato
Contributor Author

muffato commented Jan 11, 2023

Sorry guys, I forgot the --nv flag 🤦🏼🤦🏼!
With it, Singularity mounts nvidia-smi (and the other NVIDIA libraries/executables) into the container, and Toil can successfully detect the GPUs.
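For anyone finding this later, the working invocation looks something like this (illustrative; it assumes the GPU-enabled image and reuses the paths from my original command):

singularity exec --nv docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0-gpu cactus $PWD/240_3_js $PWD/data/evolverMammals.txt $PWD/240_3.hal --gpu 1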
