[Package Request] - CUDA Compatibility #316

Closed
c200chromebook opened this issue Apr 5, 2023 · 11 comments

@c200chromebook

c200chromebook commented Apr 5, 2023

What package is missing from Amazon Linux 2023? Please describe and include package name.

CUDA currently can't be installed: cuda-repo-rhel7-11-4-local-11.4.0_470.42.01-1 requires libtirpc-0.2.4, but AL2023 only ships libtirpc 1.3.3-0, which provides libtirpc.so.3 and not libtirpc.so.1.

Tried building/installing tirpc 0.2.4 from source and it gripes: auth_time.c:46:10: fatal error: rpcsvc/nis.h: No such file or directory
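
The soname mismatch described above can be checked from the RPM metadata before attempting an install. This is a minimal sketch, assuming a POSIX shell; `soname_major` and `provides_soname` are helper names made up for this example, and it assumes `rpm -q --provides` lists shared libraries in its usual form (entries such as `libtirpc.so.3()(64bit)`):

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any package.

# Print the major version of a soname, e.g. "libtirpc.so.3" -> "3".
soname_major() {
  printf '%s\n' "${1##*.so.}" | cut -d. -f1
}

# Read "rpm -q --provides" output on stdin and succeed if it lists the soname.
# The trailing "(" keeps libtirpc.so.1 from matching libtirpc.so.10.
provides_soname() {
  grep -qF "$1("
}

# Example (not run here):
#   rpm -q --provides libtirpc | provides_soname libtirpc.so.1 \
#     || echo "libtirpc.so.1 is not provided by the installed libtirpc"
```

On AL2023 with libtirpc 1.3.3 the example check would be expected to report that libtirpc.so.1 is missing, which matches the dependency error in the report.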

Is this an update to existing package or new package request?

Downgrade, actually.

Is this package available in Amazon Linux 2? If it is available via external sources such as EPEL, please specify.

Yep, available. Believe it's installed by default.

Any additional information you'd like to include. (use-cases, etc)

Is CUDA intended to work on AL2023 yet? Our security guys want us to upgrade from AL2 if possible.

@ozbenh

ozbenh commented Apr 5, 2023

We are working on nVidia support. It will not be via the RHEL 7 packages however. We'll update you when we have something and will post some documentation/instructions. Soon hopefully :)

@alexreyes

Hey @ozbenh, are there any updates on this? I have also been having problems installing GPU drivers on Amazon Linux 2023.

@ozbenh

ozbenh commented May 10, 2023

The dependencies for the Fedora 35 RPMs should be resolved by now, so those should install. Still working with nVidia to get Amazon Linux specific builds.

@adrianmace

Just stumbled upon this hurdle in our environment too. It would be great to see this installed/included by default into the AL2023 images so we don't need to think when picking instance types.

I can confirm the Fedora 35 packages appear to install, but I didn't get as far as testing them.

enrico-usai added a commit to enrico-usai/aws-parallelcluster-cookbook that referenced this issue Jun 29, 2023
### Description of changes

Used new utility `os-resources.py` introduced as part of aws#2328 to create new resources for alinux2023, starting from redhat8 resources.

Relevant changes to the code:
* os_type --> Replaced rhel with alinux
* Tried to fix CloudWatch agent setup by changing `platform_url_component` to use the same value as rhel (not sure if it's correct)
* Copied network setup templates from redhat folders
* Added alinux2023 to pcluster_dcv_connect.sh script

Removed redhat_on_docker condition from:
* stunnel
* system_authentication
* efa

TODO:
* efa -> check efa_supported? condition and log messages
* lustre -> check version condition and log messages
* install_packages --> Removed postgresql packages
* Enable repository needed by hwloc-devel blas-devel libedit-devel and glibc-static packages

### Tests

* Added Alinux2023 to ec2 kitchen configuration files. Copied from rhel8 with minor changes:
  * AMI name prefix taken from the official EC2 Amazon Linux AMI
  * I called the suite `alinux-2023`, with a "-" in the name, so that `alinux2` is not a
    prefix of the new suite name and the two can be told apart in Inspec runs.
* Created new `pre_converge` hook to install libxcrypt-compat package, required to install Chef, leveraging the work done with aws#2342
* I had to pass a custom `provisioner/download_url` for cinc because the package for AL2023 is not available in the default path.
* The validated resources are:
  * nvidia_driver
  * arm_Pl
  * c_states
  * stunnel
  * build_tools
  * chrony
  * modules
  * munge
  * dns_domain (install only)
  * jwt_dependencies
  * nfs (install only)
  * raid (install only)
  * system_authentication (install only)
  * efs (install only)

TODO:
* Add Alinux2023 to GitHub actions
* Add new os to kitchen.docker.yml config file (search for `kernel_release` version from an EC2 instance)
* Fix Inspec and ChefSpec tests conditions

### References
* https://hub.docker.com/_/amazonlinux
* https://github.com/test-kitchen/kitchen-ec2/tree/main/lib/kitchen/driver/aws/standard_platform

Known issues/FE:
* amazonlinux/amazon-linux-2023#47
* amazonlinux/amazon-linux-2023#146
* amazonlinux/amazon-linux-2023#168
* amazonlinux/amazon-linux-2023#309
* amazonlinux/amazon-linux-2023#316

Signed-off-by: Enrico Usai <[email protected]>
@c200chromebook

Related: aws/containers-roadmap#2072

@c200chromebook

c200chromebook commented Jul 20, 2023

The following seems to work. I didn't include nccl as we don't use it. You may not need sudo/rsync/iproute/iputils/cifs-utils/shadow-utils/openssl-devel/etc for your application. Based heavily on https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/11.7.1/centos7/

FROM amazonlinux:2023

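# yum-utils provides yum-config-manager, used below to register NVIDIA's repo
RUN yum -y install wget yum-utils && yum -y clean all
# No AL2023-specific NVIDIA repo was available yet, so the Fedora 35 repo is used instead
RUN yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/fedora35/x86_64/cuda-fedora35.repo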

ENV NV_CUDA_CUDART_VERSION 11.7.60-1
ENV NV_LIBNPP_VERSION 11.7.3.21-1
ENV NV_LIBNPP_PACKAGE libnpp-11-7-${NV_LIBNPP_VERSION}
ENV NV_LIBCUBLAS_VERSION 11.10.1.25-1
ENV NV_NVTX_VERSION 11.7.50-1

ENV NV_CUDA_LIB_VERSION 11.7.0-1
ENV NV_CUDA_CUDART_DEV_VERSION ${NV_CUDA_CUDART_VERSION}
ENV NV_NVML_DEV_VERSION 11.7.50-1
ENV NV_LIBCUBLAS_DEV_VERSION ${NV_LIBCUBLAS_VERSION}

ENV NV_LIBNPP_DEV_VERSION 11.7.3.21-1
ENV NV_LIBNPP_DEV_PACKAGE libnpp-devel-11-7-${NV_LIBNPP_DEV_VERSION}

RUN dnf install -y make \
                   bzip2-devel \
                   findutils \
                   tar \
                   gzip \
                   zlib-devel \
                   openssl-devel \
                   shadow-utils \
                   libffi-devel \
                   openmpi \
                   openmpi-devel \
                   sudo \
                   rsync \
                   cifs-utils \
                   iproute \
                   iputils \
                   cuda-cudart-11-7-${NV_CUDA_CUDART_VERSION} \
                   cuda-compat-11-7 \
                   cuda-libraries-11-7-${NV_CUDA_LIB_VERSION} \
                   cuda-nvtx-11-7-${NV_NVTX_VERSION} \
                   ${NV_LIBNPP_PACKAGE} \
                   libcublas-11-7-${NV_LIBCUBLAS_VERSION} \
                   cuda-command-line-tools-11-7-${NV_CUDA_LIB_VERSION} \
                   cuda-libraries-devel-11-7-${NV_CUDA_LIB_VERSION} \
                   cuda-minimal-build-11-7-${NV_CUDA_LIB_VERSION} \
                   cuda-cudart-devel-11-7-${NV_CUDA_CUDART_DEV_VERSION} \
                   cuda-nvml-devel-11-7-${NV_NVML_DEV_VERSION} \
                   libcublas-devel-11-7-${NV_LIBCUBLAS_DEV_VERSION} \
                   ${NV_LIBNPP_DEV_PACKAGE} \
                   && dnf clean all

ENV NVIDIA_REQUIRE_CUDA=cuda>=11.0
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV MPICC /usr/lib64/openmpi/bin/mpicc
ENV CUDA_VERSION=11.7
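
To check that an image built from the Dockerfile above actually contains the compiler, something along these lines may help. It is a sketch rather than a tested recipe: `IMAGE_TAG` and the `DOCKER` override are illustrative names, and the nvcc path assumes the toolkit RPMs installed into `/usr/local/cuda-11.7`, which is NVIDIA's usual prefix but is not stated in the comment.

```shell
#!/bin/sh
# Sketch only: variable and function names are hypothetical.
DOCKER="${DOCKER:-docker}"                  # override to use another CLI or a stub
IMAGE_TAG="${IMAGE_TAG:-al2023-cuda:11.7}"  # illustrative tag

# Build the image from the Dockerfile in the given directory (default: current one).
build_image() {
  "$DOCKER" build -t "$IMAGE_TAG" "${1:-.}"
}

# Run nvcc inside the image; assumes the /usr/local/cuda-11.7 install prefix.
check_nvcc() {
  "$DOCKER" run --rm "$IMAGE_TAG" /usr/local/cuda-11.7/bin/nvcc --version
}

# Example (not run here):
#   build_image . && check_nvcc
```

Neither step needs a GPU, since nvcc only compiles; running CUDA code in the container still requires a host driver and the NVIDIA container runtime.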

@ozbenh

ozbenh commented Aug 11, 2023

We are working with nVidia on a better long term solution, including the container support. In the meantime, the Fedora 37 packages appear to work as well as the 35 ones after a quick smoke test.

@stewartsmith (Member)

Going to centralize everything on #12

stewartsmith closed this as not planned Oct 9, 2023
himani2411 pushed a commit to himani2411/aws-parallelcluster-cookbook that referenced this issue Apr 4, 2024
himani2411 pushed a commit to himani2411/aws-parallelcluster-cookbook that referenced this issue Apr 8, 2024
himani2411 pushed a commit to himani2411/aws-parallelcluster-cookbook that referenced this issue Apr 15, 2024
@sfloresk

sfloresk commented Apr 17, 2024

This seems to work on G5 instances (based on @bryantbiggs's code):

dnf -y install gcc kernel-modules-extra wget kernel-devel
wget https://us.download.nvidia.com/tesla/535.161.08/NVIDIA-Linux-x86_64-535.161.08.run
sh NVIDIA-Linux-x86_64-535.161.08.run -a -s --ui=none -m=kernel-open
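
The commands above hard-code the driver version in two places. A small sketch of deriving both the file name and the URL from one version string follows; `driver_runfile` and `driver_url` are hypothetical helper names, and the URL layout is inferred only from the single link in this comment, so other versions may not be published at the same path.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; URL pattern is an assumption.

# Runfile name for a given driver version.
driver_runfile() {
  printf 'NVIDIA-Linux-x86_64-%s.run\n' "$1"
}

# Download URL for a given driver version, following the link used above.
driver_url() {
  printf 'https://us.download.nvidia.com/tesla/%s/%s\n' "$1" "$(driver_runfile "$1")"
}

# Example (not run here), using the same installer flags as the comment:
#   v=535.161.08
#   wget "$(driver_url "$v")"
#   sh "$(driver_runfile "$v")" -a -s --ui=none -m=kernel-open
```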

"nvidia-smi" command should return the GPUs

If you need the nvidia container runtime (for example, ECS tasks), you need to also execute:

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Tested with an Ubuntu container:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
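
If the container test fails, one thing worth checking is whether `nvidia-ctk runtime configure` actually registered the runtime with Docker. A minimal sketch, assuming the configuration was written to `/etc/docker/daemon.json` (the usual location, but an assumption here); `has_nvidia_runtime` is a hypothetical helper and does a plain text match, not real JSON parsing:

```shell
#!/bin/sh
# Sketch only: helper name is hypothetical; default path is an assumption.

# Succeed if the given Docker daemon config mentions an "nvidia" entry.
# A missing or unreadable file counts as "not configured".
has_nvidia_runtime() {
  grep -q '"nvidia"' "${1:-/etc/docker/daemon.json}" 2>/dev/null
}

# Example (not run here):
#   has_nvidia_runtime || echo "nvidia runtime not found in daemon.json"
```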

himani2411 pushed a commit to himani2411/aws-parallelcluster-cookbook that referenced this issue May 1, 2024
himani2411 pushed a commit to himani2411/aws-parallelcluster-cookbook that referenced this issue May 8, 2024
hanwen-pcluste pushed a commit to hanwen-pcluste/aws-parallelcluster-cookbook that referenced this issue May 17, 2024
@limmike

limmike commented May 26, 2024

NVIDIA has added AL2023 support to its CUDA repo.

Install article: "How do I install NVIDIA GPU driver, CUDA toolkit and optionally NVIDIA Container Toolkit in Amazon Linux 2023 (AL2023)?"

Here is my install script:

sudo dnf install -y dkms kernel-devel kernel-modules-extra
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
sudo dnf clean expire-cache
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf install -y cuda-toolkit
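
After the script above finishes, `nvcc` may still not be found because the toolkit's bin directory is not on PATH. A minimal sketch of adding it once, assuming the toolkit is reachable under `/usr/local/cuda` (NVIDIA's usual prefix, not stated in the comment); `prepend_path` is a hypothetical helper:

```shell
#!/bin/sh
# Sketch only: helper name is hypothetical; /usr/local/cuda/bin is an assumed default.

# Prepend a directory to PATH unless it is already listed.
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present, leave PATH unchanged
    *) PATH="$1:$PATH" ;;
  esac
}

# Example (not run here):
#   prepend_path /usr/local/cuda/bin && export PATH
#   nvcc --version
#   nvidia-smi
```

The check makes the helper safe to call from a shell profile that may be sourced more than once.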

hanwen-pcluste pushed a commit to hanwen-pcluste/aws-parallelcluster-cookbook that referenced this issue Jun 3, 2024
hanwen-pcluste pushed a commit to hanwen-pcluste/aws-parallelcluster-cookbook that referenced this issue Jun 4, 2024
hanwen-pcluste pushed a commit to hanwen-pcluste/aws-parallelcluster-cookbook that referenced this issue Jun 5, 2024
[AL2023] Use systemd-resolved instead of dhclient on Alinux 2023

Signed-off-by: Hanwen <[email protected]>
hanwen-pcluste pushed a commit to hanwen-pcluste/aws-parallelcluster-cookbook that referenced this issue Jun 5, 2024
### Description of changes

Used new utility `os-resources.py` introduced as part of aws#2328 to create new resources for alinux2023, starting from redhat8 resources.

Relevant changes to the code:
* os_type --> Replaced rhel with alinux
* Tried to fix CloudWatch agent setup by changing `platform_url_component` to point to the same of rhel (not sure if it's correct)
* Copied network setup templates from redhat folders
* Added alinux2023 to pcluster_dcv_connect.sh script

Removed redhat_on_docker condition from:
* stunnel
* system_authentication
* efa

TODO:
* efa -> check efa_supported? condition and log messages
* lustre -> check version condition and log messages
* install_packages --> Removed postgresql packages
* Enable repository needed by hwloc-devel blas-devel libedit-devel and glibc-static packages

### Tests

* Added Alinux2023 to ec2 kitchen configuration files. Copied from rhel8 with minor changes:
  * AMI name prefix taken from the official EC2 Amazon Linux AMI
  * I called the suite `alinux-2023`, with a "-" in the name, so that alinux2 is not
    a prefix of the suite name and the two can be distinguished in Inspec runs.
* Created new `pre_converge` hook to install libxcrypt-compat package, required to install Chef, leveraging the work done with aws#2342
* I had to pass a custom `provisioner/download_url` for cinc because the package for AL2023 is not available in the default path.
* The validated resources are:
  * nvidia_driver
  * arm_pl
  * c_states
  * stunnel
  * build_tools
  * chrony
  * modules
  * munge
  * dns_domain (install only)
  * jwt_dependencies
  * nfs (install only)
  * raid (install only)
  * system_authentication (install only)
  * efs (install only)

TODO:
* Add Alinux2023 to GitHub actions
* Add new os to kitchen.docker.yml config file (search for `kernel_release` version from an EC2 instance)
* Fix Inspec and ChefSpec tests conditions

### References
* https://hub.docker.com/_/amazonlinux
* https://github.com/test-kitchen/kitchen-ec2/tree/main/lib/kitchen/driver/aws/standard_platform

Known issues/FE:
* amazonlinux/amazon-linux-2023#47
* amazonlinux/amazon-linux-2023#146
* amazonlinux/amazon-linux-2023#168
* amazonlinux/amazon-linux-2023#309
* amazonlinux/amazon-linux-2023#316

Signed-off-by: Enrico Usai <[email protected]>

[AL2023] Use systemd-resolved instead of dhclient on Alinux 2023

Signed-off-by: Hanwen <[email protected]>
hanwen-pcluste pushed a commit to hanwen-pcluste/aws-parallelcluster-cookbook that referenced this issue Jun 7, 2024
hanwen-pcluste pushed a commit to hanwen-pcluste/aws-parallelcluster-cookbook that referenced this issue Jun 7, 2024
hanwen-pcluste pushed a commit to hanwen-pcluste/aws-parallelcluster-cookbook that referenced this issue Jun 10, 2024
hanwen-pcluste pushed a commit to hanwen-pcluste/aws-parallelcluster-cookbook that referenced this issue Jun 11, 2024
hanwen-pcluste pushed a commit to himani2411/aws-parallelcluster-cookbook that referenced this issue Jun 11, 2024
hanwen-pcluste pushed a commit to himani2411/aws-parallelcluster-cookbook that referenced this issue Jun 11, 2024
hanwen-pcluste pushed a commit to himani2411/aws-parallelcluster-cookbook that referenced this issue Jun 11, 2024
hanwen-pcluste pushed a commit to aws/aws-parallelcluster-cookbook that referenced this issue Jun 12, 2024

becker929 commented Aug 30, 2024

sudo dnf -y module install nvidia-driver:latest-dkms

Got stopped here with this error

(robophone) [ec2-user@ip-172-31-46-213 roborx]$ sudo dnf -y module install nvidia-driver:latest-dkms
Amazon Linux 2023 repository                                                           35 kB/s | 3.6 kB     00:00
Amazon Linux 2023 Kernel Livepatch repository                                          26 kB/s | 2.9 kB     00:00
cuda-amzn2023-x86_64                                                                  3.3 MB/s | 323 kB     00:00
packages for the GitHub CLI                                                            75 kB/s | 3.0 kB     00:00
Modular dependency problem:

 Problem: nothing provides requested module(nvidia-driver:open-dkms)
All matches for argument 'nvidia-driver:latest-dkms' in module 'nvidia-driver:latest-dkms' are not active
Error: Problems in request:
broken groups or modules: nvidia-driver:latest-dkms
Modular dependency problems:

 Problem: nothing provides requested module(nvidia-driver:latest-dkms)

This seems to be because I am on ARM (uname -m returns aarch64), specifically a Graviton arm64 g5g instance. The page mentioned above has separate instructions for that case under "Method 2": https://repost.aws/articles/ARwfQMxiC-QMOgWykD9mco1w/how-do-i-install-nvidia-gpu-driver-cuda-toolkit-and-optionally-nvidia-container-toolkit-in-amazon-linux-2023-al2023

Following these instructions worked.

Install script:

sudo dnf install -y vulkan-devel libglvnd-devel elfutils-libelf-devel
wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux_sbsa.run
chmod +x ./cuda*.run
sudo ./cuda_*.run --driver --toolkit --tmpdir=/var/tmp --silent
sudo reboot
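
For anyone scripting this, here is a minimal sketch of the architecture check described above. The function name `pick_cuda_install_method` and the route labels are hypothetical, and the mapping is an assumption drawn from this comment: the dnf module route is taken to apply on x86_64 (implied above, not confirmed), and the runfile route on aarch64. The script only prints a suggestion and installs nothing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a machine architecture to an install route.
# Assumption: x86_64 can use the dnf module, aarch64 needs the runfile.
pick_cuda_install_method() {
  # Use the first argument if given, otherwise ask the running kernel.
  local arch="${1:-$(uname -m)}"
  case "$arch" in
    x86_64)  echo "dnf-module" ;;  # e.g. dnf module install nvidia-driver:latest-dkms
    aarch64) echo "runfile" ;;     # e.g. the linux_sbsa .run installer used above
    *)       echo "unsupported"; return 1 ;;
  esac
}

# Print the suggested route for this machine without installing anything.
echo "Suggested route for $(uname -m): $(pick_cuda_install_method || true)"
```

Passing an explicit architecture, for example `pick_cuda_install_method aarch64`, lets the check be tried from any machine.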
