SlurmBatchSystem "does not support any accelerators" when running on a Slurm GPU cluster #887

oneillkza opened this issue Jan 5, 2023 · 10 comments


@oneillkza

@glennhickey I've been trying out the latest code in #884 to enable requesting accelerators from Toil, but I'm now getting the following error:

  File "/projects/koneill_prj/conda/envs/cactus/lib/python3.10/site-packages/toil/batchSystems/abstractBatchSystem.py", line 347, in _check_accelerator_request
    raise InsufficientSystemResources(requirer, 'accelerators', [], details=[
toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job LastzRepeatMaskJob is requesting [{'count': 1, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'}] accelerators, more than the maximum of [] accelerators that SlurmBatchSystem was configured with. The batch system does not support any accelerators.

I'm not sure whether this is an upstream issue -- i.e. Toil just hasn't implemented support for GPU resources on Slurm yet. @adamnovak, is that the case? Or is this something I need to set somewhere?

(Note that we run Nextflow on this cluster pretty regularly, and it has no trouble requesting GPUs from the scheduler and then having individual jobs use the right ones based on $CUDA_VISIBLE_DEVICES.)

@adamnovak
Collaborator

Sorry, I haven't implemented support for GPUs in the Toil SlurmBatchSystem yet. We don't have a Slurm GPU setup at UCSC yet to try it with, although we should be getting one soon.

How does your Slurm cluster handle GPUs, @oneillkza? It looks like some (all?) clusters use a generic resource (GRES) named gpu to represent Nvidia CUDA-capable GPUs, so Toil could pass --gres=gpu:1 to ask for one of those. But there's also a --gpus option that Slurm can make available under some circumstances; should we pass that one instead?

It seems like Slurm also has AMD ROCm support, but it doesn't really give you a way (beyond the GRES "type", which can be an exact model name like "a100") to say whether you want the CUDA API or the ROCm API.
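For reference, a minimal sketch of the two request styles being discussed; which flags are accepted depends on the cluster's Slurm version and GRES configuration, and job.sh is just a placeholder script:

# GRES-style request: one generic resource of type "gpu"
sbatch --gres=gpu:1 job.sh

# newer --gpus style (only where the Slurm setup exposes it)
sbatch --gpus=1 job.sh

# optionally pin the GRES "type" to a specific model
sbatch --gres=gpu:a100:1 job.sh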

@oneillkza
Author

Thanks @adamnovak -- yep, we use --gres=gpu:1, and I believe the Slurm scheduler sets $CUDA_VISIBLE_DEVICES to the allocated GPUs, which the GPU-enabled software is expected to respect.

(Our cluster is a bunch of servers with Nvidia CUDA-capable cards, mainly 3090s, with eight GPUs per node.)
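To illustrate that behaviour, a quick check along these lines (assuming the cluster's GRES setup is what populates the variable) shows which devices Slurm exposes inside an allocation:

# request one GPU via GRES and print the devices visible to the job
srun --gres=gpu:1 bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L'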

@oneillkza
Author

@thiagogenez just noting that to run Cactus on a local Slurm cluster, this is also necessary (i.e. using the latest Cactus code, incorporating #844, as well as waiting for DataBiosphere/toil#4308).

@HFzzzzzzz

Hello, I am using a Slurm cluster to run Cactus, but after I module load cactus I get a "toil_worker: command not found" error. I don't know if you have encountered this -- how did you run it? Thank you so much for your guidance; I'm a newbie and this has been bugging me for ages.

@thiagogenez
Contributor

Hi @790634750, can you share the details of how you are calling Cactus in your Slurm environment and the errors you get, please? Then I can give you better answers. Cheers

@HFzzzzzzz

Hi @thiagogenez,
You can take a look at the issue I raised, #894. First I tried module load cactus to use the platform's Cactus, which gave the toil_worker error; then I tried compiling Cactus locally on the cluster, without success. Could it be that I don't have Slurm permissions and can't install some dependencies? I followed the steps below:
cd cactus
virtualenv -p python3 cactus_env
echo "export PATH=$(pwd)/bin:$PATH" >> cactus_env/bin/activate
echo "export PYTHONPATH=$(pwd)/lib:$PYTHONPATH" >> cactus_env/bin/activate
source cactus_env/bin/activate
python3 -m pip install -U setuptools pip
python3 -m pip install -U -r ./toil-requirement.txt
python3 -m pip install -U .
make
An error occurred
/home/apps/soft/anaconda3/2019.10/bin/h5c++: line 304: x86_64-conda_cos6-linux-gnu-c++: command not found
make[3]: *** [../objs/api/impl/halAlignmentInstance.o] Error 127
make[3]: Leaving directory `/home/zhouhf/cactus/submodules/hal/api'
make[2]: *** [api.libs] Error 2
make[2]: Leaving directory `/home/zhouhf/cactus/submodules/hal'
make[1]: *** [suball.hal] Error 2
make[1]: Leaving directory `/home/zhouhf/cactus'
make: *** [all] Error 2

I tried conda install -c anaconda gcc_linux-64, but the download fails.
I also tried conda install -c bioconda cactus, but that download fails too.
How should I run Cactus on a Slurm cluster?

@thiagogenez
Contributor

Hi @790634750
It seems you are getting errors during the Cactus compilation. The error you described means that g++ can't be found in your $PATH. I don't use conda, but I believe the solution in your case is to load the hdf5 and gcc modules provided by your cluster environment before running make.
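Concretely, that would look something like this (the module names are cluster-specific, so check module avail for the exact ones on your system):

# load a compiler toolchain and HDF5 before building, then rebuild
module avail gcc           # find the exact module names on your cluster
module avail hdf5
module load gcc hdf5
make clean && make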

The easiest way to run Cactus is to use containers. I strongly recommend using the provided Docker image.

If you have Singularity on your cluster (which I believe you might), you can use it to run the container. Example:

  • Step 1: load the Singularity module provided by your cluster
  • Step 2: download the container
# if you don't have a GPU available
singularity pull --name cactus.sif docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0

# if you have a GPU available
singularity pull --name cactus-gpu.sif docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0-gpu
  • Step 3: run the container
# if you don't have a GPU available
singularity run cactus.sif cactus --help

# if you have a GPU available
singularity run --nv cactus-gpu.sif cactus --help
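
As a rough sketch of what a real invocation could look like from there (the jobstore, seqFile and output paths below are placeholders; bind-mount whatever directory actually holds your data):

# bind the working directory into the container and run an alignment (illustrative paths)
singularity run --nv -B $(pwd):/data cactus-gpu.sif \
    cactus /data/jobstore /data/seqFile.txt /data/alignment.hal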

@HFzzzzzzz

Hi @thiagogenez,
Thank you very much for your answer, but our Slurm cluster does not currently have Singularity. It does have several versions of Cactus, though, and I can module load cactus, but when I run it that way I get a "toil_worker: command not found" error. How should I solve this?

@thiagogenez
Contributor

I believe there is a misconfiguration of the cactus module on your cluster. The toil_worker binary should be found inside the Cactus Python environment.
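
A quick sanity check you could run after loading the module (the exact entry-point name depends on the Toil version; recent Toil releases install it as _toil_worker, so both spellings are checked here):

# verify the Cactus and Toil entry points are on PATH after loading the module
module load cactus
command -v cactus
command -v _toil_worker || command -v toil_worker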

@glennhickey
Collaborator

To install the Cactus Python module, download the Cactus binaries here: https://github.com/ComparativeGenomicsToolkit/cactus/releases and install using the linked instructions: BIN_INSTALL.md

You should not need to apt install anything except maybe python3-dev (if one of the pip install commands gives an error). You definitely do not want to be following the "Installing Manually From Source" instructions unless you have a really good reason to be doing so.
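
For v2.4.0 (the release used in the Singularity examples above), the binary install roughly looks like the following; the tarball name is assumed from the usual cactus-bin-vX.Y.Z.tar.gz pattern on the releases page, so defer to BIN_INSTALL.md for the authoritative steps:

# download and unpack a binary release (filename assumed from the releases page)
wget https://github.com/ComparativeGenomicsToolkit/cactus/releases/download/v2.4.0/cactus-bin-v2.4.0.tar.gz
tar xzf cactus-bin-v2.4.0.tar.gz
cd cactus-bin-v2.4.0

# create an isolated Python environment and install Cactus into it
virtualenv -p python3 venv-cactus
echo "export PATH=$(pwd)/bin:\$PATH" >> venv-cactus/bin/activate
source venv-cactus/bin/activate
python3 -m pip install -U setuptools pip
python3 -m pip install -U .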
