feat: Add AL2023 GPU AMI #362

Open

giantcow wants to merge 3 commits into main

Conversation


@giantcow giantcow commented Jan 7, 2025

fixes #319

Summary

Adds build support to generate a GPU-enabled ECS AMI based on AL2023

Implementation details

I tried my best to preserve the existing behaviour of scripts/enable-ecs-agent-gpu-support.sh. At least for AL2023, I think this could be simplified a lot, but again I didn't want to cause too much churn.
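
For context, the overall flow of that script is roughly the following (a minimal sketch only; the package set, paths, and ordering here are illustrative, not the exact contents of this PR):

#!/usr/bin/env bash
# Sketch of GPU enablement on the AMI; package names are illustrative.
set -euo pipefail

# Install the NVIDIA driver stack and the container toolkit
dnf install -y nvidia-driver nvidia-fabric-manager nvidia-container-toolkit

# Register the "nvidia" runtime with Docker so --runtime=nvidia / --gpus work
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

# Let the ECS agent advertise GPUs to the scheduler
echo "ECS_ENABLE_GPU_SUPPORT=true" >> /etc/ecs/ecs.config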

Testing

Built and published an AMI to my account, launched an instance from it, and validated that nvidia-smi worked and the Docker NVIDIA runtime was enabled.

Per NVIDIA's documentation (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html#running-a-sample-workload-with-docker), this command works as expected:

[ec2-user@ip-10-0-141-175 ~]$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
de44b265507a: Pull complete
Digest: sha256:80dd3c3b9c6cecb9f1667e9290b3bc61b78c2678c02cbdae5f0fea92cc6734ab
Status: Downloaded newer image for ubuntu:latest
Tue Jan  7 07:49:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   21C    P8             11W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I also did some basic tests:

$ ssh -i <SNIP> ec2-user@<SNIP>
<SNIP>
   ,     #_
   ~\_  ####_
  ~~  \_#####\
  ~~     \###|
  ~~       \#/ ___   Amazon Linux 2023 (ECS Optimized)
   ~~       V~' '->
    ~~~         /
      ~~._.   _/
         _/ _/
       _/m/'

For documentation, visit http://aws.amazon.com/documentation/ecs
[ec2-user@ip-172-31-33-216 ~]$ nvidia-smi
Tue Jan  7 04:42:37 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   38C    P8             12W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[ec2-user@ip-172-31-33-216 ~]$ docker info | grep -i Runtimes
 Runtimes: io.containerd.runc.v2 nvidia runc

Description for the changelog

  • Feature: Added AMI for AL2023 with GPU support

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

nvidia-fabric-manager \
pciutils \
xorg-x11-server-Xorg \
oci-add-hooks \

Member

This shouldn't be needed; the NVIDIA container toolkit will add the hooks. Likewise for the next two libnvidia-container* deps: those should be handled by the container toolkit.
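
(For context, a standard container toolkit install already covers this. The sketch below assumes a plain toolkit setup rather than this PR's exact provisioning steps.)

# Installing the toolkit pulls in libnvidia-container as a dependency
dnf install -y nvidia-container-toolkit

# Writes the "nvidia" runtime entry into /etc/docker/daemon.json,
# so no oci-add-hooks configuration is needed
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker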

Member

You can see what we do for the EKS AL2023 NVIDIA AMI in awslabs/amazon-eks-ami#1924.

Note: we dual-ship both the proprietary NVIDIA driver and the open GPU kernel module, and load the correct one during instance provisioning based on the GPU card. Older cards will require the proprietary driver; newer ones will require the open GPU kernel module.
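
(Illustrative sketch of that selection step; the device list path and the flavor handling below are assumptions, not the actual EKS/ECS AMI logic, which lives in the PR linked above.)

#!/usr/bin/env bash
# Pick a driver flavor from the installed GPU at provisioning time (sketch only)
set -euo pipefail

# PCI device ID of the first NVIDIA (vendor 10de) card, e.g. "27b8" for an L4
device_id=$(lspci -n -d 10de: | grep -o '10de:[0-9a-f]*' | head -n1 | cut -d: -f2)

# Hypothetical list of device IDs supported by the open GPU kernel module
if grep -qix "${device_id}" /etc/nvidia/open-supported-gpus.txt; then
    driver_flavor="open"          # Turing and newer
else
    driver_flavor="proprietary"   # older cards
fi

echo "Selected NVIDIA driver flavor: ${driver_flavor}"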
