
problematic /etc/profile.d/zippy_efa.sh #6642

Open
ssyed85 opened this issue Jan 16, 2025 · 1 comment
Comments

@ssyed85 commented Jan 16, 2025

Required Info:

  • AWS ParallelCluster version: 3.12.0

Bug description and how to reproduce:
On top of Rocky 8, the pcluster build-image command produces an image that contains /etc/profile.d/zippy_efa.sh with the following contents:

PATH="/opt/amazon/efa/bin/:$PATH"
PATH="/opt/amazon/openmpi/bin/:$PATH"
MODULEPATH="/opt/amazon/modules/modulefiles:$MODULEPATH"

As a consequence of the second line, for example, the Open MPI binary path can supersede Intel MPI's binary path even if you have "module load intelmpi" appended at the end of your $HOME/.bashrc. If you then create a DCV session and log into it, you may end up with something like:

[rocky@ip-xx ~]$ module list
Currently Loaded Modulefiles:
 1) intelmpi/2021.13  
[rocky@ip-xx ~]$ which mpicc
/opt/amazon/openmpi/bin/mpicc
[rocky@ip-xx ~]$ echo $PATH
/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin/:/opt/intel/mpi/2021.13/opt/mpi/libfabric/bin:/opt/intel/mpi/2021.13/bin:/home/rocky/.local/bin:/home/rocky/bin:/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin/:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/bin:/opt/aws/bin:/opt/parallelcluster/pyenv/versions/3.9.20/envs/awsbatch_virtualenv/bin:/opt/parallelcluster/pyenv/versions/3.9.20/envs/awsbatch_virtualenv/bin

Could a less heavy-handed approach be used than prepending paths inside /etc/profile.d, especially when "module load openmpi" is already available?
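
For illustration, a minimal sketch of what a less intrusive /etc/profile.d/zippy_efa.sh could look like, assuming its only goal is to make the EFA and Open MPI binaries reachable without overriding MPI implementations the user has loaded via modules (this is my own suggestion, not something shipped by ParallelCluster):

# append rather than prepend, and skip directories already present in PATH,
# so a user-loaded MPI module (e.g. intelmpi) keeps precedence
case ":$PATH:" in
  *":/opt/amazon/efa/bin:"*) ;;
  *) PATH="$PATH:/opt/amazon/efa/bin" ;;
esac
case ":$PATH:" in
  *":/opt/amazon/openmpi/bin:"*) ;;
  *) PATH="$PATH:/opt/amazon/openmpi/bin" ;;
esac
MODULEPATH="/opt/amazon/modules/modulefiles:$MODULEPATH"
export PATH MODULEPATH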

Image configuration:

Region: us-east-2
Image:
  Name: Compute-Rocky8
  Tags:
    - Key: Version
      Value: 3.12.0
  RootVolume:
    Size: 34
Build:
  InstanceType: g6.4xlarge
  Components:
    # install compute node packages
    - Type: script
      Value: s3://cluster-reveal/images/scripts/computeNodeAMIInstaller.sh
  ParentImage: ami-02391db2758465a87
  SubnetId: subnet-0b90567b48c01804f
  SecurityGroupIds:
    - sg-015922ec139586a02
  UpdateOsPackages:
    Enabled: true
  Installation:
    NvidiaSoftware:
      Enabled: true
    LustreClient:
      Enabled: true

computeNodeAMIInstaller.sh looks like:

#!/bin/bash

dnf -y install firewalld dnf-plugin-versionlock
systemctl start firewalld
firewall-cmd --zone=public --add-port=0-65535/tcp --permanent
firewall-cmd --add-service={ldap,ldaps} --permanent
firewall-cmd --reload

dnf -y install epel-release
dnf -y group install --with-optional minimal-environment base
dnf -y group install scientific performance
dnf -y install libnsl python2 libatomic libgfortran libglvnd mesa-libGLU

dnf versionlock add kernel*
@ssyed85 ssyed85 added the 3.x label Jan 16, 2025
@shijin-aws

This is by design. We want Open MPI to be the default MPI library on the system. I am not surprised that /etc/profile.d/zippy_efa.sh conflicts with the module load in your .bashrc. The best suggestion I can offer is to put that module load explicitly in your job script.
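
As a rough sketch of that suggestion, assuming a Slurm scheduler and using the intelmpi/2021.13 module shown in the output above (the application name my_mpi_app and the resource directives are placeholders):

#!/bin/bash
#SBATCH --job-name=mpi-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# load Intel MPI explicitly inside the job script so its bin directory is
# prepended after /etc/profile.d/zippy_efa.sh has been sourced, restoring
# precedence over the Open MPI path
module purge
module load intelmpi/2021.13

which mpicc    # should now resolve to the Intel MPI wrapper
mpirun -n "$SLURM_NTASKS" ./my_mpi_app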
