
Sync meeting on GPU support (2023‐11‐21)

Kenneth Hoste edited this page Dec 8, 2023 · 1 revision

EESSI GPU support sync meeting (2023-11-21)

  • https://github.com/EESSI/software-layer/issues/375
  • open PRs
    • PR #368
    • PR #381
    • combining these two PRs should be sufficient to get GPU support working, if the GPU driver is recent enough
    • combine both PRs into one branch to experiment with
      mkdir -p /tmp/$USER
      cd /tmp/$USER
      git clone https://github.com/EESSI/software-layer
      cd software-layer
      # fetch Alan's branches
      git remote add alan https://github.com/ocaisa/software-layer
      git fetch alan
      # create 'gpu' branch, and merge PR branches into it
      git checkout -b gpu
      git merge alan/host_injections_cuda
      git merge alan/cuda_install
    • steps:
        1. start container in read-write mode + prepared to install CUDA + access GPU
        ./eessi_container.sh -m shell --access rw --nvidia all
        
        2. install CUDA 12.1.1 in /cvmfs/pilot.eessi-hpc.org/host_injections by running the install_cuda_host_injections.sh script:
        source /cvmfs/pilot.eessi-hpc.org/versions/2023.06/init/bash
        gpu_support/nvidia/install_cuda_host_injections.sh 12.1.1
        
        3. install CUDA/12.1.1 (runtime only) + CUDA samples in EESSI
        • update eessi-2023.06-eb-4.8.2-2023a.yml to use --from-pr 19189 (unless that has already been done)
        • follow the steps at https://www.eessi.io/docs/adding_software/debugging_failed_builds/ and
          eb --easystack eessi-2023.06-eb-4.8.2-2023a.yml --robot
          (note that this will not update the lmodrc file, which is done by EESSI-pilot-install-software.sh) OR
        • first update EESSI-pilot-install-software.sh script to hardcode use of eessi-2023.06-eb-4.8.2-2023a.yml easystack file:
          for easystack_file in eessi-2023.06-eb-4.8.2-2023a.yml; do
          and then run ./install_software_layer.sh in container
    • TODO
      • script to create the file required by the Lmod hook (cfr. lmodrc file) is still missing; for now this needs to be done manually (or tweak the module to bypass the check)
        • should be separate PR to add scripts in software-layer/scripts/
        • bot/build.sh can be updated to also deploy scripts in EESSI repo
        • this script should create symlinks for all libraries shipped with GPU driver, based on:
          ldconfig -p | awk '{print $1 " " $NF}' > libs.txt
          curl -O https://raw.githubusercontent.com/apptainer/apptainer/main/etc/nvliblist.conf
          grep '\.so$' nvliblist.conf | xargs -I{} grep {} libs.txt
      • placeholder page in docs that we can point to from Lmod load hook: https://eessi.io/docs/gpu
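
The symlink script from the TODO list above does not exist yet. What follows is only a minimal sketch of how it could build on the ldconfig/nvliblist.conf pipeline. The function names (list_host_libs, link_driver_libs) and the target directory are hypothetical and not taken from the software-layer repository. It also assumes nvliblist.conf has already been downloaded as shown above. Unlike the plain grep in the notes, it matches library names by prefix, so libcuda.so picks up libcuda.so.1 but not unrelated libraries.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the target directory are assumptions,
# not part of the EESSI software-layer repository.

# Print "name path" pairs for every library in the host's ldconfig cache.
# The '/=>/' filter skips the header line that 'ldconfig -p' prints first.
list_host_libs() {
    ldconfig -p | awk '/=>/ {print $1 " " $NF}'
}

# link_driver_libs LIBS_FILE NVLIBLIST TARGET_DIR
# For every entry ending in ".so" in NVLIBLIST, symlink each library from
# LIBS_FILE whose name starts with that entry into TARGET_DIR.
link_driver_libs() {
    local libs_file="$1" nvliblist="$2" target_dir="$3"
    mkdir -p "$target_dir"
    grep '\.so$' "$nvliblist" | while read -r lib; do
        # prefix match: "libcuda.so" matches "libcuda.so.1"
        awk -v lib="$lib" 'index($1, lib) == 1 {print $1 " " $2}' "$libs_file" |
        while read -r name path; do
            ln -sf "$path" "$target_dir/$name"
        done
    done
}

# Example usage (TARGET_DIR is a placeholder to be chosen by the site):
#   list_host_libs > libs.txt
#   link_driver_libs libs.txt nvliblist.conf "$TARGET_DIR"
```

If the host provides both 32-bit and 64-bit variants of a library under the same name, the last one listed wins in this sketch. A real script would probably need to filter on architecture.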

Caspar's replication steps:

  • cd /scratch-shared/casparl # Using /tmp results in "WARNING: 'nodev' mount option set on /tmp, it could be a source of failure during build process"
  • git clone https://github.com/EESSI/software-layer
  • cd software-layer
  • git remote add alan https://github.com/ocaisa/software-layer
  • git fetch alan
  • git checkout -b gpu --track alan/host_injections_cuda # Creating a fresh branch from the main branch now gives a ton of conflicts. It's easier to start from this branch, then merge cuda_install into it
  • git merge alan/cuda_install
  • module purge # Make sure we don't pick up on EasyBuild from the host later on
  • SINGULARITY_TMPDIR=/scratch-shared/casparl/singularity.tmpdir ./eessi_container.sh -m shell --access rw --nvidia all -g /scratch-shared/casparl/ # See if pointing SINGULARITY_TMPDIR and -g away from /tmp resolves the "/cvmfs/.../ is a read only file system" issue
  • Follow steps at https://www.eessi.io/docs/adding_software/debugging_failed_builds/ to start prefix and source EESSI environment
  • module load EasyBuild/4.8.2
  • gpu_support/nvidia/install_cuda_host_injections.sh 12.1.1 # Install CUDA 12.1.1 in /cvmfs/pilot.eessi-hpc.org/host_injections
  • export WORKDIR=$(mktemp --directory --tmpdir=/tmp -t eessi-debug.XXXXXXXXXX)
  • source configure_easybuild
  • eb CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb --robot --from-pr 19189
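
Both sets of steps only work if the host GPU driver is recent enough for CUDA 12.1.1. A quick check before starting the container could look like the sketch below. The helper name driver_at_least is hypothetical, and the minimum version is left as a parameter because the exact requirement should be taken from NVIDIA's CUDA release notes. The value in the usage comment is an assumption, not something stated in the meeting.

```shell
#!/usr/bin/env bash
# Sketch only: driver_at_least is a hypothetical helper, not part of EESSI.

# driver_at_least MIN [CURRENT]
# Succeeds if CURRENT is at least MIN. If CURRENT is omitted, the driver
# version reported by nvidia-smi for the first GPU is used.
driver_at_least() {
    local min="$1"
    local current="${2:-$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)}"
    # 'sort -V' (GNU coreutils) orders version strings; if MIN sorts first
    # (or both are equal), CURRENT is new enough
    [ "$(printf '%s\n%s\n' "$min" "$current" | sort -V | head -n 1)" = "$min" ]
}

# Example usage (the minimum below is an assumed value; verify it first):
#   if driver_at_least 525.60.13; then
#       echo "driver looks recent enough"
#   else
#       echo "driver too old, or nvidia-smi not available"
#   fi
```

Passing the current version as a second argument makes it possible to try the comparison on a machine without a GPU.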