AL2023 + GPU crashing on initialization because of NVML bindings #4466

Open

giantcow opened this issue Jan 7, 2025 · 0 comments

giantcow commented Jan 7, 2025

Summary

I am building an AL2023 AMI with GPU support in aws/amazon-ecs-ami#362 and ran into this crash when NVML is initialized.

Description

I suspect the NVML bindings are no longer compatible with the latest NVML/NVIDIA driver versions. The import defined here

"github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"

references a deprecated repository (https://github.com/NVIDIA/gpu-monitoring-tools?tab=readme-ov-file), which has been replaced by https://github.com/NVIDIA/go-nvml.
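
For reference, here is a rough sketch of what initialization looks like with the replacement github.com/NVIDIA/go-nvml bindings. This is not the agent's code, just an illustration of the newer API it would presumably migrate to:

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	// Init loads libnvidia-ml.so.1 at runtime and initializes NVML.
	ret := nvml.Init()
	if ret != nvml.SUCCESS {
		log.Fatalf("unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("unable to get device count: %v", nvml.ErrorString(ret))
	}
	fmt.Printf("found %d GPU(s)\n", count)
}
```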

Expected Behavior

The agent runs

Observed Behavior

Jan 07 06:34:11 <snip> systemd[1]: Starting ecs.service - Amazon Elastic Container Service - container agent...
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: level=info time=2025-01-07T06:34:11Z msg="pre-start"
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: level=info time=2025-01-07T06:34:11Z msg="Successfully created docker client with API version 1.25"
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: level=info time=2025-01-07T06:34:11Z msg="pre-start: setting up GPUs"
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: level=info time=2025-01-07T06:34:11Z msg="By using the GPU Optimized AMI, you agree to Nvidia’s End User License Agreement: https://www.nvidia.com/en-us/about-nvidia/eula-agreement/"
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: SIGSEGV: segmentation violation
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: PC=0x0 m=0 sigcode=1 addr=0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: signal arrived during cgo execution
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: goroutine 1 gp=0xc0000061c0 m=0 mp=0x1403d20 [syscall]:
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: runtime.cgocall(0xc293f0, 0xc000227910)
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]:         /usr/lib/golang/src/runtime/cgocall.go:157 +0x4b fp=0xc0002278e8 sp=0xc0002278b0 pc=0x40a68b
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml._Cfunc_nvmlInit_dl()
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]:         _cgo_gotypes.go:685 +0x47 fp=0xc000227910 sp=0xc0002278e8 pc=0xc1bda7
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml.init_()
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]:         /tmp/tmp.nOX9tXWSyS/src/github.com/aws/amazon-ecs-agent/ecs-init/vendor/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml/bindings.go:61 +0x13 fp=0xc000227930 sp=0xc000227910 pc=0xc1bf93
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml.Init(...)
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]:         /tmp/tmp.nOX9tXWSyS/src/github.com/aws/amazon-ecs-agent/ecs-init/vendor/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml/nvml.go:251
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: github.com/aws/amazon-ecs-agent/ecs-init/gpu.InitNVML()

...

Jan 07 06:34:11 <snip> amazon-ecs-init[11051]:         /tmp/tmp.nOX9tXWSyS/src/github.com/aws/amazon-ecs-agent/ecs-init/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:63 +0x33 fp=0xc00017dfc8 sp=0xc00017dfb0 pc=0x6d6fd3
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: github.com/cihub/seelog.NewAsyncLoopLogger.gowrap1()
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]:         /tmp/tmp.nOX9tXWSyS/src/github.com/aws/amazon-ecs-agent/ecs-init/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:40 +0x25 fp=0xc00017dfe0 sp=0xc00017dfc8 pc=0x6d6dc5
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: runtime.goexit({})
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]:         /usr/lib/golang/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00017dfe8 sp=0xc00017dfe0 pc=0x4739a1
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: created by github.com/cihub/seelog.NewAsyncLoopLogger in goroutine 1
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]:         /tmp/tmp.nOX9tXWSyS/src/github.com/aws/amazon-ecs-agent/ecs-init/vendor/github.com/cihub/seelog/behavior_asynclooplogger.go:40 +0xcf
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rax    0x62bd230
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rbx    0xc000227910
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rcx    0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rdx    0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rdi    0x7fae5b685988
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rsi    0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rbp    0xc0002278a0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rsp    0x7ffcbfb06138
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r8     0x7fae5b686018
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r9     0xca
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r10    0x7fadfae00000
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r11    0x202
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r12    0xc000228000
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r13    0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r14    0xc0000061c0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: r15    0x97
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rip    0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: rflags 0x10202
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: cs     0x33
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: fs     0x0
Jan 07 06:34:11 <snip> amazon-ecs-init[11051]: gs     0x0
Jan 07 06:34:11 <snip> systemd[1]: ecs.service: Control process exited, code=exited, status=2/INVALIDARGUMENT
Jan 07 06:34:11 <snip> amazon-ecs-init[11065]: level=info time=2025-01-07T06:34:11Z msg="post-stop"
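
To help isolate this, a minimal program like the one below should exercise the same code path as the crash above, since it imports the same deprecated bindings that ecs-init vendors. This is my own sketch for reproduction, not agent code, and it assumes the fault really is in the bindings' dlopen-based nvmlInit rather than elsewhere in ecs-init:

```go
package main

import (
	"log"

	// The same deprecated bindings vendored by ecs-init.
	"github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
)

func main() {
	// Init dlopens libnvidia-ml.so.1 and calls nvmlInit via cgo;
	// this is the frame where ecs-init segfaults in the trace above.
	if err := nvml.Init(); err != nil {
		log.Fatalf("nvml.Init failed: %v", err)
	}
	defer nvml.Shutdown()
	log.Println("NVML initialized successfully")
}
```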

Environment Details

My AMI: unofficial-amzn2023-ami-ecs-gpu-hvm-2023.0.20241217-kernel-6.1-x86_64
ecs_agent_version: 1.89.2
source_image_name: al2023-ami-minimal-2023.6.20241212.0-kernel-6.1-x86_64
ecs_runtime_version: Docker version 25.0.6

[ec2-user@ip-10-0-141-175 ~]$ nvidia-smi | grep -i version
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
docker info output:
[ec2-user@ip-10-0-141-175 ~]$ docker info
Client:
Version:    25.0.5
Context:    default
Debug Mode: false
Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.0.0+unknown
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx

Server:
Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
Images: 0
Server Version: 25.0.6
Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 8fc6bcff51318944179630522a095cc9dbf9f353
runc version: 2c9f5602f0ba3d9da1c2596322dfc4e156844890
init version: de40ad0
Security Options:
  seccomp
  Profile: builtin
  cgroupns
Kernel Version: 6.1.119-129.201.amzn2023.x86_64
Operating System: Amazon Linux 2023.6.20241212
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 30.15GiB
Name: ip-10-0-141-175.us-east-2.compute.internal
ID: 7677907d-d247-4376-8ee3-72e9a3ecdbd5
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
  127.0.0.0/8
Live Restore Enabled: false

Supporting Log Snippets

I have the full output from ecs-logs-collector if the above isn't enough information. Kernel logs from boot show the NVIDIA driver loading:

[    3.507809] systemd[1]: /usr/lib/systemd/system/nvidia-persistenced.service:7: PIDFile= references a path below legacy directory /var/run/, updating /var/run/nvidia-persistenced/nvidia-persistenced.pid → /run/nvidia-persistenced/nvidia-persistenced.pid; please update the unit file accordingly.
[    3.508506] systemd[1]: /usr/lib/systemd/system/nvidia-fabricmanager.service:18: PIDFile= references a path below legacy directory /var/run/, updating /var/run/nvidia-fabricmanager/nv-fabricmanager.pid → /run/nvidia-fabricmanager/nv-fabricmanager.pid; please update the unit file accordingly.
[    6.837826] nvidia: loading out-of-tree module taints kernel.
[    6.838727] nvidia: module license 'NVIDIA' taints kernel.
[    6.848116] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    6.947597] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[    6.950067] nvidia 0000:31:00.0: enabling device (0000 -> 0002)
[    6.956827] nvidia 0000:31:00.0: PCI INT A: no GSI
[    7.005088] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  560.35.03  Fri Aug 16 21:39:15 UTC 2024
[    7.061326] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[    7.095624] nvidia-uvm: Loaded the UVM driver, major device number 240.
[    7.128296] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  560.35.03  Fri Aug 16 21:21:48 UTC 2024
[    7.181227] [drm] [nvidia-drm] [GPU ID 0x00003100] Loading driver
[    7.181889] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:31:00.0 on minor 1