Support AL2 kernel 5.10 GPU and INF #214

Merged 2 commits into aws:main from the al2-kernel-5.10-gpu-inf branch on Feb 28, 2024

Conversation

danehlim
Contributor

@danehlim danehlim commented Feb 27, 2024

Summary

Add new GPU and INF packer recipes using AL2 kernel 5.10.

Implementation details

  • Create al2kernel5dot10gpu.pkr.hcl and al2kernel5dot10inf.pkr.hcl
  • Modify the al2.pkr.hcl build sources to include al2kernel5dot10gpu and al2kernel5dot10inf
  • Add new targets al2kernel5dot10gpu and al2kernel5dot10inf to the Makefile and README.md
  • Run the AL2 kernel 5.10 install script when AMI_TYPE starts with "al2kernel5dot10"
  • Run the enable-GPU-support script when AMI_TYPE starts with "al2" and ends with "gpu"
  • Ensure the NVIDIA driver is compiled with the appropriate GCC version (gcc10) for the kernel 5.10 GPU AMI type
  • Run the enable-Inferentia-support script when AMI_TYPE starts with "al2" and ends with "inf" (or is "al2023neu")
  • Add a reboot after the kernel 5.10 install script runs for these new AMI variants, and ensure the enable-Inferentia-support and enable-GPU-support scripts run after the reboot
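The AMI_TYPE dispatch rules above can be sketched in shell. This is an illustrative sketch only, not the repository's actual build script; the flag variable names and example AMI_TYPE value are hypothetical:

```shell
#!/bin/sh
# Illustrative sketch of the AMI_TYPE checks described above; the flag
# variables are hypothetical and not part of the actual build scripts.
AMI_TYPE="al2kernel5dot10gpu"   # example value for demonstration

run_kernel510=false; run_gpu=false; run_inf=false

# Kernel 5.10 install script: AMI_TYPE starts with "al2kernel5dot10"
case $AMI_TYPE in al2kernel5dot10*) run_kernel510=true ;; esac

# Enable GPU support: AMI_TYPE starts with "al2" and ends with "gpu"
case $AMI_TYPE in al2*gpu) run_gpu=true ;; esac

# Enable Inferentia support: AMI_TYPE starts with "al2" and ends with
# "inf", or is exactly "al2023neu"
case $AMI_TYPE in al2*inf|al2023neu) run_inf=true ;; esac

echo "kernel510=$run_kernel510 gpu=$run_gpu inf=$run_inf"
# prints: kernel510=true gpu=true inf=false
```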

Testing

✅ Successfully built AMIs using the new AMI recipes locally.
✅ Launched EC2 instances using the built AMIs.
✅ Confirmed Linux kernel version 5.10 is enabled and running on the launched instances.
✅ Confirmed the kernel and kernel header packages installed in the RPM database are only for Linux kernel 5.10 on the launched instances.
✅ Ran the relevant GPU and INF functional tests against the built AMIs from the manual testing steps above and ensured all tests pass.
✅ For the built GPU kernel 5.10 AMI, ran the testing steps from #163 against it and made sure they were successful. This ensures post-Kepler NVIDIA open kernel module installation still behaves as expected.
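The RPM database check above can be sketched as a small shell loop. The package list here is a sample for illustration; in practice it would come from `rpm -qa 'kernel*'` on the launched instance:

```shell
#!/bin/sh
# Illustrative check that every installed kernel* package targets 5.10.
# The sample list below mirrors `rpm -qa 'kernel*'` output on the instance.
packages="kernel-5.10.209-198.858.amzn2.x86_64
kernel-headers-5.10.209-198.858.amzn2.x86_64"

all_510=true
for pkg in $packages; do
    case $pkg in
        *-5.10.*) ;;          # package built for kernel 5.10: OK
        *) all_510=false ;;   # any other kernel version fails the check
    esac
done
echo "all kernel packages are 5.10: $all_510"
# prints: all kernel packages are 5.10: true
```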

Additional sanity checks

On an EC2 instance launched using the built GPU kernel 5.10 AMI:
$ dkms status
nvidia, 535.129.03, 5.10.209-198.858.amzn2.x86_64, x86_64: installed
$ nvidia-smi
Tue Feb 27 03:14:43 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P0              24W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

On an EC2 instance launched using the built INF kernel 5.10 AMI:

$ dkms status
aws-neuronx, 2.15.9.0, 5.10.209-198.858.amzn2.x86_64, x86_64: installed

New tests cover the changes: N/A

Description for the changelog

Support AL2 kernel 5.10 GPU and INF

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@danehlim danehlim requested a review from a team as a code owner February 27, 2024 03:51
@danehlim danehlim marked this pull request as draft February 27, 2024 03:52
@danehlim danehlim marked this pull request as ready for review February 27, 2024 08:06
@danehlim danehlim force-pushed the al2-kernel-5.10-gpu-inf branch from 0ded325 to 58f6e8b on February 28, 2024 18:56
@danehlim danehlim force-pushed the al2-kernel-5.10-gpu-inf branch from 58f6e8b to cddb996 on February 28, 2024 21:54
Contributor

@mye956 mye956 left a comment

nit: Regarding the generate release notes script, should we also include al2kernel5dot10gpu in these checks to ensure we're passing in a CUDA + NVIDIA version?

if ! is_ami_excluded "al2gpu"; then

Since we're not building/publishing these new AMI types yet, perhaps we should add a TODO somewhere

@danehlim
Contributor Author

nit: Regarding the generate release notes script, should we also include al2kernel5dot10gpu in these checks to ensure we're passing in a CUDA + NVIDIA version?

if ! is_ami_excluded "al2gpu"; then

Since we're not building/publishing these new AMI types yet, perhaps we should add a TODO somewhere

Good callout! Once the ECS Agent team starts building and publishing these new AMI variants as part of the AMI release process, I will update the generate release notes script to factor in the new AMI variants. I will create an item to track this internally.
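For illustration only, a minimal sketch of what that follow-up might look like, with the existing is_ami_excluded helper stubbed out here (its real implementation lives in the release notes tooling):

```shell
#!/bin/sh
# Sketch only: is_ami_excluded is stubbed; the real helper lives in the
# generate release notes script.
is_ami_excluded() { false; }   # stub: treat no AMI type as excluded

included=""
# Cover both the existing GPU AMI type and the new kernel 5.10 variant.
for gpu_ami in al2gpu al2kernel5dot10gpu; do
    if ! is_ami_excluded "$gpu_ami"; then
        included="$included $gpu_ami"
    fi
done
echo "AMI types needing CUDA + NVIDIA versions:$included"
# prints: AMI types needing CUDA + NVIDIA versions: al2gpu al2kernel5dot10gpu
```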

@danehlim danehlim merged commit 3c0411b into aws:main Feb 28, 2024
2 checks passed