
The accelerator {'count': 1, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'} could not be provided (Singularity container) #1390

Open
aaannaw opened this issue May 17, 2024 · 3 comments


@aaannaw

aaannaw commented May 17, 2024

Hello, professor
I used the GPU Docker image with Singularity to run Cactus: singularity run --nv ./cactus_v2.8.2-gpu.sif cactus ./jobStore ./Evolve.Bathyergidae.txt ./Evolve.Bathyergidae.hal --gpu 1 --workDir /data/groups/lzu_public/home/u220220932211/wn-cactus/tmp --lastzMemory 10G --maxDisk 5T
However, I got the following warning and error:

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

raise InsufficientSystemResources(requirer, 'accelerators', self.accelerator_identities, details=[
toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job 'run_lastz' kind-run_lastz/instance-tzo0sor0 v1 is requesting [{'count': 1, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'}] accelerators, more than the maximum of [] accelerators that SingleMachineBatchSystem was configured with. The accelerator {'count': 1, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'} could not be provided. Scale is set to 1.

The problem is similar to #900: it looks like Toil doesn't believe I am on a machine with a GPU. I added the --nv parameter but it does not work. Could you give me any suggestions?
Thank you very much!

Best wishes!
Na Wan

@glennhickey
Collaborator

What happens when you run

singularity run --nv ./cactus_v2.8.2-gpu.sif nvidia-smi
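
As a quick additional check (just a sketch, assuming a standard Singularity/Apptainer setup where --nv binds the host NVIDIA libraries under /.singularity.d/libs inside the container), you can also list what the container actually received from the host driver:

singularity exec --nv ./cactus_v2.8.2-gpu.sif ls /.singularity.d/libs

If that directory is empty or missing libnvidia-ml.so, the driver libraries are not being passed through from the host.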

@aaannaw
Author

aaannaw commented May 18, 2024

Hello, professor
Thanks for your response. When I run singularity run --nv ./cactus_v2.8.2-gpu.sif nvidia-smi, this error occurs:

==========
== CUDA ==
==========

CUDA Version 11.4.3

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

However, when I run nvidia-smi directly on the host, it looks normal:

Sat May 18 11:30:34 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:41:00.0 Off |                    0 |
| N/A   20C    P0    24W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:45:00.0 Off |                    0 |
| N/A   21C    P0    25W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   20C    P0    25W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:B9:00.0 Off |                    0 |
| N/A   20C    P0    23W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I tried adding the path of libnvidia-ml.so to the command: singularity run -B /usr/local/nvidia/lib --nv ./cactus_v2.8.2-gpu.sif cactus ./liutest2 ./Evolve.txt ./Evolve..hal --maxCore=20 --gpu 1 --workDir ./tmpdir --lastzMemory 10G --maxDisk 5T, and it looks like it runs normally.
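
For anyone hitting the same thing, a minimal sketch of how to find where the host driver library actually lives and bind that directory explicitly (the bind path below is only an example and will differ per system):

ldconfig -p | grep libnvidia-ml
singularity run --nv -B /usr/lib/x86_64-linux-gnu ./cactus_v2.8.2-gpu.sif nvidia-smi

Once nvidia-smi works inside the container with that bind, the same -B option can be added to the full cactus command.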

I have another question about the parameter "--targettime" and how to set it. Could you give me any suggestions?
Thank you very much.
Best wishes!
Na Wan

@glennhickey
Collaborator

--targettime doesn't do anything unless you're using the AWS autoscaler - just ignore it.
