SlurmBatchSystem "does not support any accelerators" when running on a Slurm GPU cluster #887

oneillkza opened this issue Jan 5, 2023 · 10 comments


@oneillkza

@glennhickey I've been trying out the latest code in #884 to enable requesting accelerators from Toil, but I'm now getting the following error:

  File "/projects/koneill_prj/conda/envs/cactus/lib/python3.10/site-packages/toil/batchSystems/abstractBatchSystem.py", line 347, in _check_accelerator_request
    raise InsufficientSystemResources(requirer, 'accelerators', [], details=[
toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job LastzRepeatMaskJob is requesting [{'count': 1, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'}] accelerators, more than the maximum of [] accelerators that SlurmBatchSystem was configured with. The batch system does not support any accelerators.

I'm not sure whether this is an upstream issue -- i.e. Toil just hasn't implemented support for GPU resources on Slurm yet. @adamnovak, is that the case? Or is this something I need to set somewhere?

(Note that we run Nextflow on this cluster pretty regularly, and it has no trouble requesting GPUs from the scheduler and then having individual jobs use the right ones based on $CUDA_VISIBLE_DEVICES.)

@adamnovak
Collaborator

Sorry, I haven't implemented support for GPUs in the Toil SlurmBatchSystem yet. We don't have a Slurm GPU setup at UCSC yet to try it with, although we should be getting one soon.

How does your Slurm cluster handle GPUs, @oneillkza? It looks like some (all?) clusters use a generic resource (GRES) named gpu to represent Nvidia CUDA-capable GPUs, so Toil could pass --gres=gpu:1 to ask for one of those. But there's also a --gpus option that Slurm can make available under some circumstances; should we pass that one instead?

It seems like Slurm also has AMD ROCm support, but it doesn't really give you a way (beyond the GRES "type", which can be an exact model name like "a100") to say whether you want the CUDA API or the ROCm API.
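For reference, a minimal sketch of the two request styles being discussed; which flags are accepted depends on the cluster's Slurm version and GRES configuration, and job.sh is just a placeholder script:

# GRES-style request: one generic resource of type "gpu"
sbatch --gres=gpu:1 job.sh

# newer --gpus style (only where the Slurm setup exposes it)
sbatch --gpus=1 job.sh

# optionally pin the GRES "type" to a specific model
sbatch --gres=gpu:a100:1 job.sh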

@oneillkza
Author

Thanks @adamnovak -- yep, we use --gres=gpu:1, and I believe the Slurm scheduler sets $CUDA_VISIBLE_DEVICES to the allocated GPUs, which the GPU-enabled software is expected to respect.

(Our cluster is a bunch of servers with Nvidia CUDA-capable cards, mainly 3090s, with eight GPUs per node.)
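To illustrate that behaviour, a quick check along these lines (assuming the cluster's GRES setup is what populates the variable) shows which devices Slurm exposes inside an allocation:

# request one GPU via GRES and print the devices visible to the job
srun --gres=gpu:1 bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L'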

@oneillkza
Author

@thiagogenez just noting that to run Cactus on a local Slurm cluster, this is also necessary (i.e. using the latest Cactus code, incorporating #844, as well as waiting for DataBiosphere/toil#4308).

@HFzzzzzzz

Hello, I am using a Slurm cluster to run Cactus, but after I module load cactus I get a "toil_worker: command not found" error. I don't know if you have encountered this -- how did you run it? Thank you so much for your guidance; I'm a newbie and this has been bugging me for ages.

@thiagogenez
Contributor

Hi @790634750, can you share the details of how you are calling Cactus in your Slurm environment and the errors you get, please? Then I can give you better answers. Cheers

@HFzzzzzzz

Hi @thiagogenez,
You can take a look at the issue I raised, #894. First I tried module load cactus to use the platform's Cactus, which gave the toil_worker error; then I tried compiling Cactus locally on the cluster, without success. Could it be that I don't have Slurm permissions and can't install some dependencies? I followed the steps below:
cd cactus
virtualenv -p python3 cactus_env
echo "export PATH=$(pwd)/bin:$PATH" >> cactus_env/bin/activate
echo "export PYTHONPATH=$(pwd)/lib:$PYTHONPATH" >> cactus_env/bin/activate
source cactus_env/bin/activate
python3 -m pip install -U setuptools pip
python3 -m pip install -U -r ./toil-requirement.txt
python3 -m pip install -U .
make
An error occurred
/home/apps/soft/anaconda3/2019.10/bin/h5c++: line 304: x86_64-conda_cos6-linux-gnu-c++: command not found
make[3]: *** [../objs/api/impl/halAlignmentInstance.o] Error 127
make[3]: Leaving directory `/home/zhouhf/cactus/submodules/hal/api'
make[2]: *** [api.libs] Error 2
make[2]: Leaving directory `/home/zhouhf/cactus/submodules/hal'
make[1]: *** [suball.hal] Error 2
make[1]: Leaving directory `/home/zhouhf/cactus'
make: *** [all] Error 2

I tried conda install -c anaconda gcc_linux-64, but the download fails.
I also tried conda install -c bioconda cactus, but that download fails too.
How should I run Cactus on a Slurm cluster?

@thiagogenez
Contributor

Hi @790634750
It seems you are getting errors during the Cactus compilation. The error you described means that g++ can't be found in your $PATH. I don't use conda, but I believe the solution in your case is to load the hdf5 and gcc modules provided by your cluster environment before running make.
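Concretely, that would look something like this (the module names are cluster-specific, so check module avail for the exact ones on your system):

# load a compiler toolchain and HDF5 before building, then rebuild
module avail gcc           # find the exact module names on your cluster
module avail hdf5
module load gcc hdf5
make clean && make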

The easiest way to run Cactus is to use containers. I strongly recommend using the provided Docker image.

If you have Singularity on your cluster (which I believe you might), you can use it to run the container. Example:

  • Step 1: load the Singularity module provided by your cluster
  • Step 2: download the container
# if you don't have a GPU available
singularity pull --name cactus.sif docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0

# if you have a GPU available
singularity pull --name cactus-gpu.sif docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0-gpu
  • Step 3: run the container
# if you don't have a GPU available
singularity run cactus.sif cactus --help

# if you have a GPU available
singularity run --nv cactus-gpu.sif cactus --help
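
As a rough sketch of what a real invocation could look like from there (the jobstore, seqFile and output paths below are placeholders; bind-mount whatever directory actually holds your data):

# bind the working directory into the container and run an alignment (illustrative paths)
singularity run --nv -B $(pwd):/data cactus-gpu.sif \
    cactus /data/jobstore /data/seqFile.txt /data/alignment.hal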

@HFzzzzzzz

Hi @thiagogenez,
Thank you very much for your answer, but our Slurm cluster does not currently have Singularity. It does have several versions of Cactus, though, and I can module load cactus, but when I run it that way I get a "toil_worker: command not found" error. How should I solve this?

@thiagogenez
Contributor

I believe there is a misconfiguration of the cactus module on your cluster. The toil_worker binary should be found inside the Cactus Python environment.
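
A quick sanity check you could run after loading the module (the exact entry-point name depends on the Toil version; recent Toil releases install it as _toil_worker, so both spellings are checked here):

# verify the Cactus and Toil entry points are on PATH after loading the module
module load cactus
command -v cactus
command -v _toil_worker || command -v toil_worker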

@glennhickey
Collaborator

To install the Cactus Python module, download the Cactus binaries here: https://github.com/ComparativeGenomicsToolkit/cactus/releases and install using the linked instructions: BIN_INSTALL.md

You should not need to apt install anything except maybe python3-dev (if one of the pip install commands gives an error). You definitely do not want to be following the "Installing Manually From Source" instructions unless you have a really good reason to be doing so.
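
For v2.4.0 (the release used in the Singularity examples above), the binary install roughly looks like the following; the tarball name is assumed from the usual cactus-bin-vX.Y.Z.tar.gz pattern on the releases page, so defer to BIN_INSTALL.md for the authoritative steps:

# download and unpack a binary release (filename assumed from the releases page)
wget https://github.com/ComparativeGenomicsToolkit/cactus/releases/download/v2.4.0/cactus-bin-v2.4.0.tar.gz
tar xzf cactus-bin-v2.4.0.tar.gz
cd cactus-bin-v2.4.0

# create an isolated Python environment and install Cactus into it
virtualenv -p python3 venv-cactus
echo "export PATH=$(pwd)/bin:\$PATH" >> venv-cactus/bin/activate
source venv-cactus/bin/activate
python3 -m pip install -U setuptools pip
python3 -m pip install -U .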
