Summary

I am working on building an AL2023 AMI with GPU support in aws/amazon-ecs-ami#362 and ran into this crash while NVML is being initialized.
Description

I suspect that the NVML bindings are no longer compatible with the latest NVML/NVIDIA driver versions, given the import defined here:
amazon-ecs-agent/ecs-init/gpu/nvidia_gpu_manager.go
Line 21 in 41d593c
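For anyone who wants to poke at this outside of ecs-init, a minimal standalone program against the same vendored bindings should exercise the same nvml.Init() code path. This is my own sketch, not agent code; Init is the call that appears in the trace under Observed Behavior, and I'm assuming the GetDeviceCount/Shutdown signatures from the same vendored package:

package main

// My own minimal sketch (not agent code) that exercises the same vendored
// gpu-monitoring-tools NVML bindings ecs-init uses. nvml.Init() is the call
// that segfaults in the trace under Observed Behavior.
import (
	"fmt"
	"log"

	"github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
)

func main() {
	// Init() dlopen()s libnvidia-ml.so.1 and resolves the NVML entry points.
	if err := nvml.Init(); err != nil {
		log.Fatalf("nvml init failed: %v", err)
	}
	defer nvml.Shutdown()

	// Assumed signature from the same package: GetDeviceCount() (uint, error).
	count, err := nvml.GetDeviceCount()
	if err != nil {
		log.Fatalf("device count failed: %v", err)
	}
	fmt.Printf("found %d GPU(s)\n", count)
}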
Expected Behavior

The agent starts and runs.
Observed Behavior

Jan 07 06:34:11 <snip> systemd[1]: Starting ecs.service - Amazon Elastic Container Service - container agent...
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: level=info time=2025-01-07T06:34:11Z msg="pre-start"
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: level=info time=2025-01-07T06:34:11Z msg="Successfully created docker client with API version 1.25"
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: level=info time=2025-01-07T06:34:11Z msg="pre-start: setting up GPUs"
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: level=info time=2025-01-07T06:34:11Z msg="By using the GPU Optimized AMI, you agree to Nvidia’s End User License Agreement: https://www.nvidia.com/en-us/about-nvidia/eula-agreement/"
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: SIGSEGV: segmentation violation
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: PC=0x0 m=0 sigcode=1 addr=0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: signal arrived during cgo execution
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: goroutine 1 gp=0xc0000061c0 m=0 mp=0x1403d20 [syscall]:
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: runtime.cgocall(0xc293f0, 0xc000227910)
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: /usr/lib/golang/src/runtime/cgocall.go:157 +0x4b fp=0xc0002278e8 sp=0xc0002278b0 pc=0x40a68b
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml._Cfunc_nvmlInit_dl()
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: _cgo_gotypes.go:685 +0x47 fp=0xc000227910 sp=0xc0002278e8 pc=0xc1bda7
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml.init_()
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: /tmp/tmp.nOX9tXWSyS/src/github.com/aws/amazon-ecs-agent/ecs-init/vendor/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml/bindings.go:61 +0x13 fp=0xc000227930 sp=0xc000227910 pc=0xc1bf93
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml.Init(...)
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: /tmp/tmp.nOX9tXWSyS/src/github.com/aws/amazon-ecs-agent/ecs-init/vendor/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml/nvml.go:251
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: github.com/aws/amazon-ecs-agent/ecs-init/gpu.InitNVML()
...
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: /tmp/tmp.nOX9tXWSyS/src/github.com/aws/amazon-ecs-agent/ecs-init/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:63 +0x33 fp=0xc00017dfc8 sp=0xc00017dfb0 pc=0x6d6fd3
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: github.com/cihub/seelog.NewAsyncLoopLogger.gowrap1()
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: /tmp/tmp.nOX9tXWSyS/src/github.com/aws/amazon-ecs-agent/ecs-init/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:40 +0x25 fp=0xc00017dfe0 sp=0xc00017dfc8 pc=0x6d6dc5
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: runtime.goexit({})
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: /usr/lib/golang/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00017dfe8 sp=0xc00017dfe0 pc=0x4739a1
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: created by github.com/cihub/seelog.NewAsyncLoopLogger in goroutine 1
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: /tmp/tmp.nOX9tXWSyS/src/github.com/aws/amazon-ecs-agent/ecs-init/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:40 +0xcf
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rax 0x62bd230
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rbx 0xc000227910
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rcx 0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rdx 0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rdi 0x7fae5b685988
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rsi 0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rbp 0xc0002278a0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rsp 0x7ffcbfb06138
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r8 0x7fae5b686018
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r9 0xca
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r10 0x7fadfae00000
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r11 0x202
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r12 0xc000228000
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r13 0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r14 0xc0000061c0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r15 0x97
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rip 0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rflags 0x10202
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: cs 0x33
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: fs 0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: gs 0x0
Jan 07 06:34:11 <snip> systemd[1]: ecs.service: Control process exited, code=exited, status=2/INVALIDARGUMENT
Jan 07 06:34:11 <snip> amazon-ecs-init[11065]: level=info time=2025-01-07T06:34:11Z msg="post-stop"
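For what it's worth, PC=0x0 during cgo execution inside _Cfunc_nvmlInit_dl() looks like one of the dlopen/dlsym-resolved NVML entry points coming back NULL with this driver and then being called. If the fix ends up being a move to NVIDIA's maintained go-nvml bindings, the init path could look roughly like the sketch below. This is only a suggestion from my side, not current agent code; the function names loosely mirror the ecs-init gpu package, and go-nvml surfaces failures as return codes rather than jumping through an unresolved pointer:

package gpu

// Sketch only: what the init path could look like on top of
// github.com/NVIDIA/go-nvml. Not current agent code.
import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// InitNVML initializes NVML via go-nvml; failures come back as nvml.Return
// codes instead of a SIGSEGV inside a dlsym'd call.
func InitNVML() error {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		return fmt.Errorf("failed to initialize NVML: %s", nvml.ErrorString(ret))
	}
	return nil
}

// GetGPUDeviceIDs returns the UUIDs of all GPUs visible to the driver.
func GetGPUDeviceIDs() ([]string, error) {
	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		return nil, fmt.Errorf("failed to count GPU devices: %s", nvml.ErrorString(ret))
	}
	ids := make([]string, 0, count)
	for i := 0; i < count; i++ {
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			return nil, fmt.Errorf("failed to get device %d: %s", i, nvml.ErrorString(ret))
		}
		uuid, ret := dev.GetUUID()
		if ret != nvml.SUCCESS {
			return nil, fmt.Errorf("failed to get UUID for device %d: %s", i, nvml.ErrorString(ret))
		}
		ids = append(ids, uuid)
	}
	return ids, nil
}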
Environment Details

My AMI: unofficial-amzn2023-ami-ecs-gpu-hvm-2023.0.20241217-kernel-6.1-x86_64

ecs_agent_version: 1.89.2
source_image_name: al2023-ami-minimal-2023.6.20241212.0-kernel-6.1-x86_64
ecs_runtime_version: Docker version 25.0.6
[ec2-user@ip-10-0-141-175 ~]$ nvidia-smi | grep -i version
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
[ec2-user@ip-10-0-141-175 ~]$ docker info
Client:
 Version:    25.0.5
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.0.0+unknown
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 25.0.6
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8fc6bcff51318944179630522a095cc9dbf9f353
 runc version: 2c9f5602f0ba3d9da1c2596322dfc4e156844890
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.1.119-129.201.amzn2023.x86_64
 Operating System: Amazon Linux 2023.6.20241212
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 30.15GiB
 Name: ip-10-0-141-175.us-east-2.compute.internal
 ID: 7677907d-d247-4376-8ee3-72e9a3ecdbd5
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
Supporting Log Snippets

I have the output from ecs-logs-collector if the above isn't enough info.
[ 3.507809] systemd[1]: /usr/lib/systemd/system/nvidia-persistenced.service:7: PIDFile= references a path below legacy directory /var/run/, updating /var/run/nvidia-persistenced/nvidia-persistenced.pid → /run/nvidia-persistenced/nvidia-persistenced.pid; please update the unit file accordingly.
[ 3.508506] systemd[1]: /usr/lib/systemd/system/nvidia-fabricmanager.service:18: PIDFile= references a path below legacy directory /var/run/, updating /var/run/nvidia-fabricmanager/nv-fabricmanager.pid → /run/nvidia-fabricmanager/nv-fabricmanager.pid; please update the unit file accordingly.
[ 6.837826] nvidia: loading out-of-tree module taints kernel.
[ 6.838727] nvidia: module license 'NVIDIA' taints kernel.
[ 6.848116] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 6.947597] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[ 6.950067] nvidia 0000:31:00.0: enabling device (0000 -> 0002)
[ 6.956827] nvidia 0000:31:00.0: PCI INT A: no GSI
[ 7.005088] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 560.35.03 Fri Aug 16 21:39:15 UTC 2024
[ 7.061326] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 7.095624] nvidia-uvm: Loaded the UVM driver, major device number 240.
[ 7.128296] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 560.35.03 Fri Aug 16 21:21:48 UTC 2024
[ 7.181227] [drm] [nvidia-drm] [GPU ID 0x00003100] Loading driver
[ 7.181889] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:31:00.0 on minor 1