Missing config file in /etc/ld.so.conf.d/ pointing to /lib64 causes ldconfig to cache the wrong glibc and breaks the k8s-device-plugin container image
#1182
Open · fangpenlin opened this issue on Feb 27, 2025 · 0 comments · May be fixed by #1183
I followed the guide described here to install the plugin Helm chart. Somehow a few containers show error messages like this:
symbol lookup error: /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/libc.so.6: undefined symbol: __tunable_is_initialized, version GLIBC_PRIVATE
This Kubernetes cluster is running on top of NixOS, by the way. I can reproduce it with a simple podman CLI run like this:
$ podman run -it --entrypoint=/bin/sh --rm --device nvidia.com/gpu=all --security-opt=label=disable nvcr.io/nvidia/k8s-device-plugin:v0.17.0
/bin/sh: symbol lookup error: /nix/store/nqb2ns2d1lahnd5ncwmn6k84qfd7vx2k-glibc-2.40-36/lib/libc.so.6: undefined symbol: __tunable_is_initialized, version GLIBC_PRIVATE
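As a quick sanity check (assuming the failure comes from the CDI injection rather than from the image itself, which the analysis below supports), the same command without the GPU device starts /bin/sh without any error:
$ podman run -it --entrypoint=/bin/sh --rm --security-opt=label=disable nvcr.io/nvidia/k8s-device-plugin:v0.17.0  # no CDI device requested, so no ldconfig hook runs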
After digging into the problem, the ldconfig cache OCI hook generated in the CDI config turned out to be the root cause. The nvidia-ctk command line tool creates a config file like nvcr-1167929244.conf in /etc/ld.so.conf.d for the mounted folders provided via the --folder argument. Before the hook runs, the ld cache inside the container still points at the container's own glibc; right after the hook runs, the libc entry in the container's namespace points to the one provided by my CDI config, since the other NVIDIA drivers and executables rely on it. I guess the glibc the plugin container was built against is different from the one shipped with my NixOS NVIDIA binaries, and as a result we see this error:
undefined symbol: __tunable_is_initialized, version GLIBC_PRIVATE
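The hook itself can be seen in the CDI spec generated on the host (the spec path here is an assumption; /var/run/cdi/nvidia.yaml is a common location):
$ grep -B 2 -A 8 update-ldcache /var/run/cdi/nvidia.yaml  # shows the createContainer hook and the --folder paths it feeds to ldconfig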
Solving this problem is fairly simple. One only needs to create a lib64.conf inside the /etc/ld.so.conf.d folder that points to /lib64 in the container in the first place.
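For example, the missing piece is just a one-line config file in the image (a sketch of the idea):
$ cat /etc/ld.so.conf.d/lib64.conf
/lib64
With that entry present, the ldconfig run triggered by the CDI hook keeps /lib64 in the search path, so the container can keep resolving libc.so.6 from its own glibc.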
I checked other common Docker images, such as ubuntu, and they don't have this issue because they ship with all the needed ld config files:
$ podman run -it ubuntu
root@3e192d26a2c8:/# ls -al /etc/ld.so.conf.d/
total 16
drwxr-xr-x 2 root root 4096 Jan 27 02:09 .
drwxr-xr-x 1 root root 4096 Feb 27 20:30 ..
-rw-r--r-- 1 root root 44 Aug 2 2022 libc.conf
-rw-r--r-- 1 root root 100 Mar 30 2024 x86_64-linux-gnu.conf
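For reference, those two Ubuntu files just list the image's own library directories (contents may vary slightly between releases):
$ cat /etc/ld.so.conf.d/libc.conf
# libc default configuration
/usr/local/lib
$ cat /etc/ld.so.conf.d/x86_64-linux-gnu.conf
# Multiarch support
/usr/local/lib/x86_64-linux-gnu
/lib/x86_64-linux-gnu
/usr/lib/x86_64-linux-gnu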
I think it makes sense to update the Dockerfile in this repo to include the ldconfig config file so that we can avoid problems like this. I am going to create a PR shortly to add the missing config file.
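A rough sketch of the kind of Dockerfile change this could be (the actual layout of the Dockerfile in this repo may differ, and the PR may implement it differently):
# keep the container's own /lib64 in the ldconfig search path
RUN echo "/lib64" > /etc/ld.so.conf.d/lib64.conf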
fangpenlin added a commit to fangpenlin/k8s-device-plugin that referenced this issue on Feb 27, 2025