Sync meeting on GPU support (2023‐11‐21)
Kenneth Hoste edited this page Dec 8, 2023
- https://github.com/EESSI/software-layer/issues/375
- open PRs
- PR #368
- PR #381
- combining these two PRs should be sufficient to get GPU support working, if the GPU driver is recent enough
- combine both PRs into one branch to experiment with
mkdir -p /tmp/$USER
cd /tmp/$USER
git clone https://github.com/EESSI/software-layer
cd software-layer
# fetch Alan's branches
git remote add alan https://github.com/ocaisa/software-layer
git fetch alan
# create 'gpu' branch, and merge PR branches into it
git checkout -b gpu
git merge alan/host_injections_cuda
git merge alan/cuda_install
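The note above says the combined PRs should be sufficient if the GPU driver is recent enough. A minimal sketch of how that could be checked up front, assuming `nvidia-smi` is available on the host and GNU `sort -V` is present. The function names `version_ge` and `driver_recent_enough` are hypothetical, and the minimum version is an input that has to come from NVIDIA's documentation for the CUDA release in question; these notes do not state one.

```shell
#!/usr/bin/env bash

# Succeed if version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort): the smaller version sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: compare the installed NVIDIA driver version
# against a caller-supplied minimum (not specified in these notes).
# Returns 2 if the driver version cannot be determined.
driver_recent_enough() {
    local min_version="$1"
    local driver_version
    command -v nvidia-smi >/dev/null 2>&1 || return 2
    driver_version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    [ -n "$driver_version" ] || return 2
    version_ge "$driver_version" "$min_version"
}
```

Possible usage, with `MIN_DRIVER_VERSION` set by hand: `driver_recent_enough "$MIN_DRIVER_VERSION" && echo "driver OK"`.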
- steps:
-
- start container in read-write mode + prepared to install CUDA + access GPU
./eessi_container.sh -m shell --access rw --nvidia all
-
- install CUDA 12.1.1 in /cvmfs/pilot.eessi-hpc.org/host_injections by running the install_cuda_host_injections.sh script:
source /cvmfs/pilot.eessi-hpc.org/versions/2023.06/init/bash
gpu_support/nvidia/install_cuda_host_injections.sh 12.1.1
-
- install CUDA/12.1.1 (runtime only) + CUDA samples in EESSI
- update eessi-2023.06-eb-4.8.2-2023a.yml to use --from-pr 19189 (unless it has been already)
- follow the steps at https://www.eessi.io/docs/adding_software/debugging_failed_builds/ and run
eb --easystack eessi-2023.06-eb-4.8.2-2023a.yml --robot
(note that this will not update the lmodrc file, which is done by EESSI-pilot-install-software.sh)
- OR
- first update the EESSI-pilot-install-software.sh script to hardcode use of the eessi-2023.06-eb-4.8.2-2023a.yml easystack file:
for easystack_file in eessi-2023.06-eb-4.8.2-2023a.yml; do
and then run ./install_software_layer.sh in the container
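The easystack step above only needs editing if the file does not yet reference PR 19189. A small sketch of that check, assuming the option appears in the easystack YAML as `from-pr: <number>`; that layout and the helper name `easystack_has_from_pr` are assumptions, not taken from these notes.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed if the easystack file already references
# the given EasyBuild PR number as a 'from-pr' option.
# Assumes the option is written as 'from-pr: <number>' in the YAML.
easystack_has_from_pr() {
    local easystack_file="$1" pr="$2"
    # The trailing group prevents 19189 from matching e.g. 191890.
    grep -Eq "from-pr:[[:space:]]*${pr}([^0-9]|\$)" "$easystack_file"
}
```

For example: `easystack_has_from_pr eessi-2023.06-eb-4.8.2-2023a.yml 19189 || echo "easystack file still needs --from-pr 19189"`.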
-
- TODO
- script to create file required by Lmod hook (cfr. lmodrc file) is still missing; for now this needs to be done manually (or tweak the module to bypass the check)
- should be separate PR to add scripts in
software-layer/scripts/
- bot/build.sh can be updated to also deploy scripts in EESSI repo
- this script should create symlinks for all libraries shipped with GPU driver, based on:
ldconfig -p | awk '{print $1 " " $NF}' > libs.txt
curl -O https://raw.githubusercontent.com/apptainer/apptainer/main/etc/nvliblist.conf
grep '.so$' nvliblist.conf | xargs -i grep {} libs.txt
- placeholder page in docs that we can point to from Lmod load hook: https://eessi.io/docs/gpu
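The TODO above describes a script that symlinks the libraries shipped with the GPU driver, starting from the `ldconfig` and `nvliblist.conf` pipeline. A rough sketch of what such a script could look like, under stated assumptions: the function name `link_driver_libs` and the target directory argument are hypothetical (the notes do not say where the symlinks should live), and the matching is stricter than the plain `grep` in the notes, accepting only the exact library name or a versioned variant of it.

```shell
#!/usr/bin/env bash

# Sketch: create symlinks for NVIDIA driver libraries found on the host.
#   $1: nvliblist.conf as shipped by Apptainer
#   $2: libs.txt with "name path" pairs, as produced from 'ldconfig -p'
#   $3: directory to create the symlinks in (hypothetical location)
link_driver_libs() {
    local nvliblist="$1" libs_txt="$2" target_dir="$3"
    local lib_name name lib_path
    mkdir -p "$target_dir" || return 1
    # Only consider entries that name a shared library (ending in .so).
    grep '\.so$' "$nvliblist" | while read -r lib_name; do
        # Match "libfoo.so" itself and versioned names such as "libfoo.so.1".
        awk -v lib="$lib_name" \
            '$1 == lib || index($1, lib ".") == 1 {print $1 " " $2}' "$libs_txt" |
        while read -r name lib_path; do
            # -f replaces an existing link, so if the same name is listed
            # more than once (e.g. 32-bit and 64-bit), the last entry wins.
            ln -sfn "$lib_path" "$target_dir/$name"
        done
    done
}
```

Possible usage after generating `libs.txt` and downloading `nvliblist.conf` as shown in the notes: `link_driver_libs nvliblist.conf libs.txt "$TARGET_DIR"`, where `TARGET_DIR` is whatever location the Lmod hook ends up checking.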
-
cd /scratch-shared/casparl
# Using /tmp results in "WARNING: 'nodev' mount option set on /tmp, it could be a source of failure during build process"
git clone https://github.com/EESSI/software-layer
cd software-layer
git remote add alan https://github.com/ocaisa/software-layer
git fetch alan
-
git checkout -b gpu --track alan/host_injections_cuda
# Creating a fresh branch from the main branch now gives a ton of conflicts. It's easier to start from this, then merge cuda_install into it
git merge alan/cuda_install
-
module purge
# Make sure we don't pick up on EasyBuild from the host later on
-
SINGULARITY_TMPDIR=/scratch-shared/casparl/singularity.tmpdir ./eessi_container.sh -m shell --access rw --nvidia all -g /scratch-shared/casparl/
# See if pointing SINGULARITY_TMPDIR and -g away from /tmp resolves the "/cvmfs/.../ is a read only file system" issue
- Follow steps at https://www.eessi.io/docs/adding_software/debugging_failed_builds/ to start prefix and source EESSI environment
- module load EasyBuild/4.8.2
-
gpu_support/nvidia/install_cuda_host_injections.sh 12.1.1
# Install CUDA 12.1.1 in /cvmfs/pilot.eessi-hpc.org/host_injections
export WORKDIR=$(mktemp --directory --tmpdir=/tmp -t eessi-debug.XXXXXXXXXX)
source configure_easybuild
eb CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb --robot --from-pr 19189
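The log above switched away from /tmp because of the 'nodev' mount option warning. A small sketch of how a candidate working directory could be checked for that option before starting the container, assuming `findmnt` from util-linux is available. The helper names `has_mount_option` and `mount_options_for` are hypothetical.

```shell
#!/usr/bin/env bash

# Succeed if the comma-separated mount option list $1 contains option $2
# as a whole entry (so "nodev" does not match "nodevx").
has_mount_option() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the mount options of the filesystem that contains path $1.
mount_options_for() {
    findmnt -n -o OPTIONS --target "$1"
}
```

For example: `has_mount_option "$(mount_options_for /scratch-shared/casparl)" nodev && echo "nodev is set, consider another directory"`.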