Introducing AL2023 GPU AMIs #370

Open
wants to merge 3 commits into
base: main

Conversation

harishxr
Contributor

@harishxr harishxr commented Jan 15, 2025

Summary

Add new GPU packer recipes for AL2023

Implementation details

  • Create al2023gpu.pkr.hcl and enable-ecs-agent-gpu-support-al2023.sh.
  • Modify the al2023.pkr.hcl build sources to include al2023gpu.
  • Add a new al2023gpu target to Makefile and README.md.
  • Run the enable-GPU-support script when AMI_TYPE starts with "al2023" and ends with "gpu".
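
The AMI_TYPE condition in the last bullet can be read as a glob match. A minimal sketch, assuming a bash `case` pattern; the function name `is_al2023_gpu` is hypothetical and is not taken from the scripts in this PR, which may implement the check differently:

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the PR): succeeds when the AMI type
# starts with "al2023" and ends with "gpu", e.g. "al2023gpu".
is_al2023_gpu() {
    case "$1" in
        al2023*gpu) return 0 ;;
        *)          return 1 ;;
    esac
}

# Default value is an assumption for illustration only.
AMI_TYPE="${AMI_TYPE:-al2023gpu}"

if is_al2023_gpu "$AMI_TYPE"; then
    echo "GPU support script would run for AMI_TYPE=$AMI_TYPE"
else
    echo "GPU support script skipped for AMI_TYPE=$AMI_TYPE"
fi
```

A glob of this shape would also match any future variant named `al2023<something>gpu`, which may or may not be desirable.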

Testing

✅ Successfully built AMIs locally using the new AMI recipes.
✅ Launched EC2 instances using the built AMIs.
✅ Ran the relevant GPU functional tests against the built AMIs, following the manual testing steps, and confirmed that all tests pass.

Additional sanity checks
$ nvidia-smi
Thu Jan 16 00:24:49 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   25C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ /usr/local/cuda/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
$ nvidia-container-cli -V
cli-version: 1.17.3
lib-version: 1.17.3
build date: 2024-12-04T09:47+0000
build revision: 16f37fcafcbdaf67525135104d60d98d36688ba9
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
$ sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi
Unable to find image 'public.ecr.aws/amazonlinux/amazonlinux:2023' locally
2023: Pulling from amazonlinux/amazonlinux
889191eec1e0: Pull complete
Digest: sha256:115e4b0c86e75eb6c34049d8369c932cd40a5588a98a83187a9bd1d150cd2679
Status: Downloaded newer image for public.ecr.aws/amazonlinux/amazonlinux:2023
Thu Jan 16 00:29:02 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   22C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

New tests cover the changes: N/A

Description for the changelog

Introducing AL2023 GPU AMIs

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@harishxr harishxr force-pushed the main branch 4 times, most recently from ab41b33 to 71f5f4c Compare January 16, 2025 00:35
@harishxr harishxr marked this pull request as ready for review January 16, 2025 00:36
@harishxr harishxr requested a review from a team as a code owner January 16, 2025 00:36
@harishxr harishxr force-pushed the main branch 4 times, most recently from 46e0808 to 52ea689 Compare January 17, 2025 01:46
Contributor

@mye956 mye956 left a comment

This can be done as a follow-up to this change, but it is probably a good idea to also update our release notes script.

@giantcow

This will fix #319

@giantcow

Added a comment but re-sharing in this main comment thread:

If I try to build this AMI I get this error:

...
    amazon-ebs.al2023gpu:
    amazon-ebs.al2023gpu: Installed:
    amazon-ebs.al2023gpu:   nvidia-release-2023-1.amzn2023.noarch
    amazon-ebs.al2023gpu:
    amazon-ebs.al2023gpu: Complete!
==> amazon-ebs.al2023gpu: + sudo dnf install -y nvidia-driver nvidia-fabric-manager pciutils xorg-x11-server-Xorg nvidia-container-toolkit docker-runtime-nvidia
    amazon-ebs.al2023gpu: Amazon Linux 2023 repository                     68 MB/s |  30 MB     00:00
    amazon-ebs.al2023gpu: Amazon Linux 2023 NVIDIA repository             1.9 MB/s | 372 kB     00:00
    amazon-ebs.al2023gpu: Amazon Linux 2023 Kernel Livepatch repository   114 kB/s |  11 kB     00:00
    amazon-ebs.al2023gpu: Package pciutils-3.7.0-3.amzn2023.0.2.x86_64 is already installed.
    amazon-ebs.al2023gpu: No match for argument: docker-runtime-nvidia
==> amazon-ebs.al2023gpu: Error: Unable to find a match: docker-runtime-nvidia
==> amazon-ebs.al2023gpu: Provisioning step had errors: Running the cleanup provisioner, if present...
==> amazon-ebs.al2023gpu: Terminating the source AWS instance...
...
# <snip> in .../workspace/amazon-ecs-ami on git:e1d2909 x [1:09:50] 
$ git status
HEAD detached at harishxr/main
nothing to commit, working tree clean

I didn't include this in my PR: https://github.com/aws/amazon-ecs-ami/pull/362/files#diff-b24f8d8798c5a8a2228a9c31b03f65adb65fd114e9b013c84a651ba40d321326

@harishxr
Contributor Author

We are waiting on Amazon Linux to release the docker-runtime-nvidia package in their repos. Once the package is available, this PR will be merged.

The error you are running into occurs because that package is not yet available.
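
For anyone building locally in the meantime, a pre-flight check along these lines could make the failure clearer before the install step runs. This is only a sketch: the function name `package_available` is hypothetical and not part of this PR, and it assumes that `dnf list <package>` exits non-zero when no installed or available package matches.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of this PR): succeeds when dnf can find
# the named package among installed or available packages.
# Assumption: `dnf list <pkg>` exits non-zero when nothing matches.
package_available() {
    local pkg="$1"
    if dnf list "$pkg" >/dev/null 2>&1; then
        return 0
    fi
    echo "package '$pkg' was not found in the enabled repositories" >&2
    return 1
}

# Example usage on the build instance (left commented out so that
# sourcing this file has no side effects):
# package_available docker-runtime-nvidia || exit 1
```

Run before the `dnf install` line shown in the build log above, a check like this would stop the build with an explicit message instead of the generic "Unable to find a match" error.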
