Support AL2 kernel 5.10 GPU and INF #214

Merged 2 commits into aws:main from the al2-kernel-5.10-gpu-inf branch on Feb 28, 2024

Conversation

danehlim
Contributor

@danehlim danehlim commented Feb 27, 2024

Summary

Add new GPU and INF packer recipes using AL2 kernel 5.10.

Implementation details

  • Create al2kernel5dot10gpu.pkr.hcl and al2kernel5dot10inf.pkr.hcl
  • Modify the al2.pkr.hcl build sources to include al2kernel5dot10gpu and al2kernel5dot10inf
  • Add new targets al2kernel5dot10gpu and al2kernel5dot10inf to the Makefile and README.md
  • Run the AL2 kernel 5.10 install script when AMI_TYPE starts with "al2kernel5dot10"
  • Run the enable-GPU-support script when AMI_TYPE starts with "al2" and ends with "gpu"
  • Ensure the NVIDIA driver is compiled with the appropriate GCC version (gcc10) for the kernel 5.10 GPU AMI type
  • Run the enable-Inferentia-support script when AMI_TYPE starts with "al2" and ends with "inf" (or is "al2023neu")
  • Add a reboot after the kernel 5.10 install script runs for these new AMI variants, and ensure the enable-Inferentia-support and enable-GPU-support scripts run after the reboot
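The AMI_TYPE dispatch rules above can be sketched in shell. This is an illustrative sketch only, not the repository's actual build script; the flag variable names and example AMI_TYPE value are hypothetical:

```shell
#!/bin/sh
# Illustrative sketch of the AMI_TYPE checks described above; the flag
# variables are hypothetical and not part of the actual build scripts.
AMI_TYPE="al2kernel5dot10gpu"   # example value for demonstration

run_kernel510=false; run_gpu=false; run_inf=false

# Kernel 5.10 install script: AMI_TYPE starts with "al2kernel5dot10"
case $AMI_TYPE in al2kernel5dot10*) run_kernel510=true ;; esac

# Enable GPU support: AMI_TYPE starts with "al2" and ends with "gpu"
case $AMI_TYPE in al2*gpu) run_gpu=true ;; esac

# Enable Inferentia support: AMI_TYPE starts with "al2" and ends with
# "inf", or is exactly "al2023neu"
case $AMI_TYPE in al2*inf|al2023neu) run_inf=true ;; esac

echo "kernel510=$run_kernel510 gpu=$run_gpu inf=$run_inf"
# prints: kernel510=true gpu=true inf=false
```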

Testing

✅ Successfully built AMIs using the new AMI recipes locally.
✅ Launched EC2 instances using the built AMIs.
✅ Confirmed Linux kernel version 5.10 is enabled and running on the launched instances.
✅ Confirmed the kernel and kernel header packages installed in the RPM database are only for Linux kernel 5.10 on the launched instances.
✅ Ran the relevant GPU and INF functional tests against the built AMIs from the manual testing steps above and ensured all tests pass.
✅ For the built GPU kernel 5.10 AMI, ran the testing steps from #163 against it and made sure they were successful. This ensures post-Kepler NVIDIA open kernel module installation still behaves as expected.
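The RPM database check above can be sketched as a small shell loop. The package list here is a sample for illustration; in practice it would come from `rpm -qa 'kernel*'` on the launched instance:

```shell
#!/bin/sh
# Illustrative check that every installed kernel* package targets 5.10.
# The sample list below mirrors `rpm -qa 'kernel*'` output on the instance.
packages="kernel-5.10.209-198.858.amzn2.x86_64
kernel-headers-5.10.209-198.858.amzn2.x86_64"

all_510=true
for pkg in $packages; do
    case $pkg in
        *-5.10.*) ;;          # package built for kernel 5.10: OK
        *) all_510=false ;;   # any other kernel version fails the check
    esac
done
echo "all kernel packages are 5.10: $all_510"
# prints: all kernel packages are 5.10: true
```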

Additional sanity checks

On an EC2 instance launched using the built GPU kernel 5.10 AMI:
$ dkms status
nvidia, 535.129.03, 5.10.209-198.858.amzn2.x86_64, x86_64: installed
$ nvidia-smi
Tue Feb 27 03:14:43 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P0              24W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

On an EC2 instance launched using the built INF kernel 5.10 AMI:

$ dkms status
aws-neuronx, 2.15.9.0, 5.10.209-198.858.amzn2.x86_64, x86_64: installed

New tests cover the changes: N/A

Description for the changelog

Support AL2 kernel 5.10 GPU and INF

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@danehlim danehlim requested a review from a team as a code owner February 27, 2024 03:51
@danehlim danehlim marked this pull request as draft February 27, 2024 03:52
@danehlim danehlim marked this pull request as ready for review February 27, 2024 08:06
@danehlim danehlim force-pushed the al2-kernel-5.10-gpu-inf branch from 0ded325 to 58f6e8b on February 28, 2024 18:56
@danehlim danehlim force-pushed the al2-kernel-5.10-gpu-inf branch from 58f6e8b to cddb996 on February 28, 2024 21:54
Contributor

@mye956 mye956 left a comment

nit: Regarding the generate release notes script, should we also include al2kernel5dot10gpu in these checks to ensure we're passing in a CUDA + NVIDIA version?

if ! is_ami_excluded "al2gpu"; then

Since we're not building/publishing these new AMI types yet, perhaps we should add a TODO somewhere

@danehlim
Contributor Author

nit: Regarding the generate release notes script, should we also include al2kernel5dot10gpu in these checks to ensure we're passing in a CUDA + NVIDIA version?

if ! is_ami_excluded "al2gpu"; then

Since we're not building/publishing these new AMI types yet, perhaps we should add a TODO somewhere

Good callout! Once the ECS Agent team starts building and publishing these new AMI variants as part of the AMI release process, I will update the generate release notes script to factor in the new AMI variants. I will create an item to track this internally.
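For illustration only, a minimal sketch of what that follow-up might look like, with the existing is_ami_excluded helper stubbed out here (its real implementation lives in the release notes tooling):

```shell
#!/bin/sh
# Sketch only: is_ami_excluded is stubbed; the real helper lives in the
# generate release notes script.
is_ami_excluded() { false; }   # stub: treat no AMI type as excluded

included=""
# Cover both the existing GPU AMI type and the new kernel 5.10 variant.
for gpu_ami in al2gpu al2kernel5dot10gpu; do
    if ! is_ami_excluded "$gpu_ami"; then
        included="$included $gpu_ami"
    fi
done
echo "AMI types needing CUDA + NVIDIA versions:$included"
# prints: AMI types needing CUDA + NVIDIA versions: al2gpu al2kernel5dot10gpu
```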

@danehlim danehlim merged commit 3c0411b into aws:main Feb 28, 2024
2 checks passed