The GPU Operator driver build fails on GCP when using Ubuntu 22.04. #564

Open

uniit opened this issue Aug 3, 2023 · 26 comments

uniit commented Aug 3, 2023

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version:
Linux version 5.19.0-1030-gcp (buildd@bos03-amd64-050) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #32~22.04.1-Ubuntu SMP Thu Jul 13 09:36:23 UTC 2023
  • GPU Operator Version: GPU Operator 23.3.2 Release

2. Issue or feature description

The GPU Operator cannot build the NVIDIA drivers on GCP with Ubuntu 22.04 because the GCP kernel is built with x86_64-linux-gnu-gcc-12, while the driver container compiles the kernel modules with GCC 11. This compiler mismatch causes the build to fail.

warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0
  You are using:           cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

Ubuntu 22.04 on AWS works well because its kernel is built with a compiler whose major version is similar to the one used for the generic kernels, so the GPU Operator can build the NVIDIA drivers successfully on AWS with Ubuntu 22.04:

Linux version 5.15.0-1034-aws (buildd@lcy02-amd64-114) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #38~20.04.1-Ubuntu SMP Wed Mar 29 19:48:16 UTC 2023

3. Steps to reproduce the issue

  1. Create GCP instance (N1 + NVIDIA T4) with Ubuntu 22.04.
  2. Install k3s:
curl -sfL https://get.k3s.io | sh - 

  3. Install the GPU Operator with the following Helm values (an example install command is sketched after the values):

USER-SUPPLIED VALUES:
driver:
  enabled: true
operator:
  cleanupCRD: true
  defaultRuntime: containerd
toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
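
For reference, a minimal sketch of how the values above could be applied with Helm (the release name, namespace, and values file name are assumptions, not from the original report):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# values.yaml holds the USER-SUPPLIED VALUES shown above
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  -f values.yaml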

4. Question:

Can I easily include gcc-12 in the driver image and change the build instructions to utilize it, either through an environment variable or by overriding the initial command?

Is there a plan to introduce support for Ubuntu 22.04 on GCP?

@cdesiniotis
Contributor

Hi @uniit can you provide complete logs from the driver container?

@uniit
Author

uniit commented Aug 14, 2023

Hi @uniit can you provide complete logs from the driver container?

Sure, please review it below:

k logs nvidia-driver-daemonset-d6jsz -n nvidia -f
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-525.105.17
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.105.17........

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.


WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 525.105.17 for Linux kernel version 5.19.0-1030-gcp

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.19.0-1030-gcp
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0
  You are using:           cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-acpi.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-dmabuf.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-pci.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-nano-timer.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-dma.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-cray.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-p2p.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-mmap.o] Error 1
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-i2c.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-pat.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-procfs.o] Error 1
make[1]: *** [Makefile:1857: /usr/src/nvidia-525.105.17/kernel] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

@shivamerla
Contributor

@uniit can you try driver version 525.125.06? It should use an updated CUDA base image that comes with the GCC version shown below. It looks like GCC 12 is the minimum version required to support this compiler flag.

ii gcc-12-base:amd64 12.1.0-2ubuntu1~22.04 amd64 GCC, the GNU Compiler Collection (base package)

@uniit
Author

uniit commented Aug 14, 2023

@shivamerla I've tried nvcr.io/nvidia/driver:525.125.06-ubuntu22.04. Looks like gcc-12 is still missing there. Errors are the same.


@shivamerla
Contributor

@uniit you are right, we install the build-essential meta package, which pulls in GCC 11.x by default and doesn't support the options the kernel was built with. I can think of the following options in this case.

  1. Build an image with pre-compiled modules for the -gcp kernel from the precompiled folder using the steps [here](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html#building-a-custom-driver-container-image)

  2. Install GCC 12.x and overwrite what is installed by the build-essential meta package [here](https://gitlab.com/nvidia/container-images/driver/-/blame/main/ubuntu22.04/install.sh#L10):

apt-get install gcc-12
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12
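
As an illustration of option 2, one way to apply those commands is to layer GCC 12 on top of the published driver image (a sketch only; the base tag and target image name are examples, not an official build):

cat > Dockerfile.gcc12 <<'EOF'
FROM nvcr.io/nvidia/driver:525.125.06-ubuntu22.04
# Install GCC 12 and make it the default gcc, replacing the GCC 11 pulled in by build-essential
RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc-12 && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12 && \
    rm -rf /var/lib/apt/lists/*
EOF
docker build -f Dockerfile.gcc12 -t my-registry/driver:525.125.06-ubuntu22.04-gcc12 .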

@Its-Alex

Its-Alex commented Mar 6, 2024

@shivamerla Any news on this?

Shouldn't this be managed directly within driver containers rather than by clients?

@xzzvsxd

xzzvsxd commented Mar 19, 2024

@uniit you are right, we install build-essential meta package which is pulling GCC 11.x version by default and doesn't support the options that the kernel is built with. Can think of below options in this case.

  1. Build an image with pre-compiled modules for -gcp kernel from precompiled folder using steps here
  2. Install 12.x GCC and overwrite what is installed with build-essential meta package here.
apt-get install gcc-12
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12

nice bro you are right!

@BHSDuncan

I'm cross-posting here a bit, but I'm having this same issue, although I'm deploying via Cloud Native Stack (one of the Ansible playbooks) on Ubuntu 22.04. I'm unsure whether upgrading my CNS version will solve this, given what's been said above and that the relevant install.sh/Dockerfile look about the same. Hopefully I'm missing something.

Any feedback/insight would be appreciated. Thank you.

@shivamerla
Contributor

@BHSDuncan @xzzvsxd we are looking to fix this soon, as this is affecting all customers using 6.x kernels.

@BHSDuncan

@BHSDuncan @xzzvsxd we are looking to fix this soon as this is affecting all customers using 6.x kernels.

Thank you for the update. Is there any timeline for when we can expect this fix?

Also, is it safe to assume that the only real workaround when using Cloud Native Stack (i.e. being unable to easily rebuild images) is to roll back to a 5.x kernel, e.g. 5.19.0?

@shivamerla
Contributor

Yes, you can check the GCC version used to build the kernel by referring to /proc/version on the host. With GCC 11.x the builds are working fine.
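
For example, the compiler used to build the running kernel is visible in /proc/version; on the GCP node from this report it is GCC 12 (output copied from the kernel line quoted earlier in this issue):

cat /proc/version
# Linux version 5.19.0-1030-gcp (buildd@bos03-amd64-050) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #32~22.04.1-Ubuntu SMP Thu Jul 13 09:36:23 UTC 2023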

@BHSDuncan

Yes, you can check the GCC version used by referring to /proc/version on the host. With GCC 11.x builds are working fine.

It's the driver daemonset pod that's failing, specifically the nvidia-driver-ctr container, and I can't change how it's built (i.e. I can't tell it which version of GCC to download/install since this was all installed via a Cloud Native Stack Ansible playbook).

Changing the version of GCC on the host doesn't fix anything. For example, I did try installing GCC 12 on a host, but the problem still happens when starting the driver daemonset pod.

@Its-Alex

Its-Alex commented Mar 31, 2024

Changing the version of GCC on the host doesn't fix anything. For example, I did try installing GCC 12 on a host, but the problem still happens when starting the driver daemonset pod.

@BHSDuncan You must change the version of GCC used to build the kernel on the host, so roll back to a 5.x kernel.

Changing the version of GCC within the container should also work, but judging from your previous messages you can't.

@BHSDuncan

Yeah, because I'm using CNS, I can't do much of anything other than rely on their images.

@wyike

wyike commented Apr 15, 2024

@uniit you are right, we install build-essential meta package which is pulling GCC 11.x version by default and doesn't support the options that the kernel is built with. Can think of below options in this case.

1. Build an image with pre-compiled modules for `-gcp` kernel from precompiled folder using steps [here](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html#building-a-custom-driver-container-image)

2. Install 12.x GCC and overwrite what is installed with `build-essential` meta package [here](https://gitlab.com/nvidia/container-images/driver/-/blame/main/ubuntu22.04/install.sh#L10).
apt-get install gcc-12
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12

Hi @shivamerla, I see gcc-12 is already installed here https://gitlab.com/nvidia/container-images/driver/-/blob/main/ubuntu22.04/Dockerfile#L97-101 and set as the default via update-alternatives. Just curious: is the published image built from this Dockerfile or not?

@wyike

wyike commented Apr 16, 2024

I built a driver image myself from the latest code (which includes the fix https://gitlab.com/nvidia/container-images/driver/-/commit/dd69782dc6a21aa92ded68fb9db58bd4b1a23a4a), and it works around the issue temporarily:

docker build -t mydriver \
  --build-arg DRIVER_VERSION="550.54.14" \
  --build-arg DRIVER_BRANCH="550" \
  --build-arg CUDA_VERSION=12.4.0 \
  --build-arg TARGETARCH=amd64 \
  ubuntu22.04

Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
  You are using:           cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-peermem.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-modeset.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-drm.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-uvm.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia.ko due to unavailability of vmlinux
Relinking NVIDIA driver kernel modules...
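
If it helps others following along, a custom image built this way would then be referenced through the operator's driver values (field names as in the gpu-operator Helm chart; the registry below is a placeholder for wherever the image was pushed):

driver:
  enabled: true
  repository: my-registry
  image: driver
  version: "550.54.14"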

@tariq1890
Contributor

@wyike Thanks for testing the CI pipeline build and confirming that it works. We will have this fix out when the next Data Center GPU Driver is released by the driver team. We (the gpu-operator team) build the driver containers off of the driver releases managed by the driver team.

See here to get more info on the Data Center drivers: https://docs.nvidia.com/datacenter/tesla/

@wyike

wyike commented Apr 19, 2024

Hi @tariq1890, sorry to ask a very basic question: why does the driver installer have to be built on top of the cuda-base image (https://gitlab.com/nvidia/container-images/driver/-/blob/main/ubuntu22.04/Dockerfile#L2)? Is anything in that image used by the installer when installing to the host?
I would like to ask this somewhere like a community Slack channel but couldn't find one. Could you help answer the question, or point me to somewhere public where I can ask general questions? Thanks a lot!

@dvaldivia

dvaldivia commented Apr 25, 2024

I managed to work around this by forcing Ubuntu to boot with kernel 5.15.0-79; that fixed the issue and I was able to get the driver installed via the daemonset.
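
For anyone who needs to do the same, the general approach is to make GRUB boot the older kernel by default (a sketch; the exact menu entry text varies by image, so list the entries first):

# List the GRUB menu entries to find the exact name of the 5.15 kernel entry
grep -E "menuentry '" /boot/grub/grub.cfg | cut -d"'" -f2

# Set that entry as the default, regenerate the GRUB config, and reboot
sudo sed -i "s|^GRUB_DEFAULT=.*|GRUB_DEFAULT='Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-79-generic'|" /etc/default/grub
sudo update-grub
sudo reboot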

@wyike

wyike commented May 8, 2024

With this fix, can the new driver installer still be used on previous kernel versions?

@dr3s

dr3s commented Jun 13, 2024

For those coming across this because of GKE auto-upgrades: we were able to pin the node group version to 1.27.11-gke.1062003, because 1.27.11-gke.1062004 introduced a new kernel that triggers this issue.
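
For anyone needing to do the same, node auto-upgrade can be switched off for the affected node pool so it stays on the pinned version (cluster, pool, and zone names below are placeholders):

gcloud container node-pools update my-node-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --no-enable-autoupgrade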

@jerryguowei

The latest AWS Ubuntu 22.04 image also has the same issue. Is there any update on when the new Docker images will be released?

@uniit
Author

uniit commented Nov 18, 2024

The GPU Operator works well on the latest Ubuntu 22.04 in GCP. I can't reproduce the issue described in the subject.

@jerryguowei

@uniit Would you mind giving details on which driver version you are using? And is the kernel also compiled with GCC 12?

Maybe run cat /proc/version to show the kernel info?

@tariq1890
Contributor

@jerryguowei The GCP Ubuntu kernels are compiled using GCC 12. Please use the latest driver containers published by us; they should work on your GCP Ubuntu nodes.

@jerryguowei

Thanks @tariq1890, I just tested the latest GPU Operator version, v24.9.0, and can confirm it works on Ubuntu 22.04 with the 6.x kernel (compiled with GCC 12) and with the previous older kernel (compiled with GCC 11).
