The GPU Operator driver build fails on GCP when using Ubuntu 22.04. #564

Open

uniit opened this issue Aug 3, 2023 · 26 comments

uniit commented Aug 3, 2023

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version:
Linux version 5.19.0-1030-gcp (buildd@bos03-amd64-050) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #32~22.04.1-Ubuntu SMP Thu Jul 13 09:36:23 UTC 2023
  • GPU Operator Version: GPU Operator 23.3.2 Release

2. Issue or feature description

The GPU Operator cannot build the NVIDIA drivers on GCP with Ubuntu 22.04 because the GCP kernel is built with x86_64-linux-gnu-gcc-12, while the driver container compiles the kernel modules with GCC 11. This compiler mismatch causes the build to fail.

warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0
  You are using:           cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

Ubuntu 22.04 on AWS works well because its kernel is built with a compiler whose major version is similar to the one used for the generic kernels, so the GPU Operator can build the NVIDIA drivers successfully on AWS with Ubuntu 22.04:

Linux version 5.15.0-1034-aws (buildd@lcy02-amd64-114) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #38~20.04.1-Ubuntu SMP Wed Mar 29 19:48:16 UTC 2023

3. Steps to reproduce the issue

  1. Create GCP instance (N1 + NVIDIA T4) with Ubuntu 22.04.
  2. Install k3s:
curl -sfL https://get.k3s.io | sh - 

  3. Install the GPU Operator with the following Helm values (an example install command is sketched after the values):

USER-SUPPLIED VALUES:
driver:
  enabled: true
operator:
  cleanupCRD: true
  defaultRuntime: containerd
toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
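
For reference, a minimal sketch of how the values above could be applied with Helm (the release name, namespace, and values file name are assumptions, not from the original report):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# values.yaml holds the USER-SUPPLIED VALUES shown above
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  -f values.yaml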

4. Question:

Can I easily include gcc-12 in the driver image and change the build instructions to utilize it, either through an environment variable or by overriding the initial command?

Is there a plan to introduce support for Ubuntu 22.04 on GCP?

@cdesiniotis
Contributor

Hi @uniit can you provide complete logs from the driver container?

@uniit
Author

uniit commented Aug 14, 2023

Hi @uniit can you provide complete logs from the driver container?

Sure, please review it below:

k logs nvidia-driver-daemonset-d6jsz -n nvidia -f
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-525.105.17
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.105.17........

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.


WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 525.105.17 for Linux kernel version 5.19.0-1030-gcp

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.19.0-1030-gcp
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0
  You are using:           cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-acpi.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-dmabuf.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-pci.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-nano-timer.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-dma.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-cray.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-p2p.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-mmap.o] Error 1
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-i2c.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-pat.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-procfs.o] Error 1
make[1]: *** [Makefile:1857: /usr/src/nvidia-525.105.17/kernel] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

@shivamerla
Contributor

@uniit can you try driver version 525.125.06? It should use an updated CUDA base image that comes with the GCC version shown below. It looks like GCC 12 is the minimum version required to support this compiler flag.

ii gcc-12-base:amd64 12.1.0-2ubuntu1~22.04 amd64 GCC, the GNU Compiler Collection (base package)

@uniit
Author

uniit commented Aug 14, 2023

@shivamerla I've tried nvcr.io/nvidia/driver:525.125.06-ubuntu22.04. Looks like gcc-12 is still missing there. Errors are the same.


@shivamerla
Contributor

@uniit you are right, we install the build-essential meta package, which pulls in GCC 11.x by default and doesn't support the options the kernel was built with. I can think of the following options in this case.

  1. Build an image with pre-compiled modules for the -gcp kernel from the precompiled folder using the steps [here](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html#building-a-custom-driver-container-image)

  2. Install GCC 12.x and overwrite what is installed by the build-essential meta package [here](https://gitlab.com/nvidia/container-images/driver/-/blame/main/ubuntu22.04/install.sh#L10):

apt-get install gcc-12
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12
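
As an illustration of option 2, one way to apply those commands is to layer GCC 12 on top of the published driver image (a sketch only; the base tag and target image name are examples, not an official build):

cat > Dockerfile.gcc12 <<'EOF'
FROM nvcr.io/nvidia/driver:525.125.06-ubuntu22.04
# Install GCC 12 and make it the default gcc, replacing the GCC 11 pulled in by build-essential
RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc-12 && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12 && \
    rm -rf /var/lib/apt/lists/*
EOF
docker build -f Dockerfile.gcc12 -t my-registry/driver:525.125.06-ubuntu22.04-gcc12 .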

@Its-Alex

Its-Alex commented Mar 6, 2024

@shivamerla Any news on this?

Shouldn't this be managed directly within driver containers rather than by clients?

@xzzvsxd

xzzvsxd commented Mar 19, 2024

@uniit you are right, we install build-essential meta package which is pulling GCC 11.x version by default and doesn't support the options that the kernel is built with. Can think of below options in this case.

  1. Build an image with pre-compiled modules for -gcp kernel from precompiled folder using steps here
  2. Install 12.x GCC and overwrite what is installed with build-essential meta package here.
apt-get install gcc-12
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12

nice bro you are right!

@BHSDuncan

I'm cross-posting here a bit, but I'm having this same issue, although I'm deploying via Cloud Native Stack (one of the Ansible playbooks) on Ubuntu 22.04. I'm unsure whether upgrading my CNS version will solve this, given what's been said above and that the relevant install.sh/Dockerfile look about the same. Hopefully I'm missing something.

Any feedback/insight would be appreciated. Thank you.

@shivamerla
Contributor

@BHSDuncan @xzzvsxd we are looking to fix this soon, as this is affecting all customers using 6.x kernels.

@BHSDuncan

@BHSDuncan @xzzvsxd we are looking to fix this soon as this is affecting all customers using 6.x kernels.

Thank you for the update. Is there any timeline for when we can expect this fix?

Also, is it safe to assume that the only real workaround when using Cloud Native Stack (i.e. being unable to easily rebuild images) is to roll back to a 5.x kernel, e.g. 5.19.0?

@shivamerla
Contributor

Yes, you can check the GCC version used to build the kernel by referring to /proc/version on the host. With GCC 11.x the builds are working fine.
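
For example, the compiler used to build the running kernel is visible in /proc/version; on the GCP node from this report it is GCC 12 (output copied from the kernel line quoted earlier in this issue):

cat /proc/version
# Linux version 5.19.0-1030-gcp (buildd@bos03-amd64-050) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #32~22.04.1-Ubuntu SMP Thu Jul 13 09:36:23 UTC 2023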

@BHSDuncan

Yes, you can check the GCC version used by referring to /proc/version on the host. With GCC 11.x builds are working fine.

It's the driver daemonset pod that's failing, specifically the nvidia-driver-ctr container, and I can't change how it's built (i.e. I can't tell it which version of GCC to download/install since this was all installed via a Cloud Native Stack Ansible playbook).

Changing the version of GCC on the host doesn't fix anything. For example, I did try installing GCC 12 on a host, but the problem still happens when starting the driver daemonset pod.

@Its-Alex

Its-Alex commented Mar 31, 2024

Changing the version of GCC on the host doesn't fix anything. For example, I did try installing GCC 12 on a host, but the problem still happens when starting the driver daemonset pod.

@BHSDuncan You must change the version of GCC used to build the kernel on the host, so roll back to a 5.x kernel.

Changing the version of GCC within the container should also work, but judging from your previous messages you can't.

@BHSDuncan

Yeah, because I'm using CNS, I can't do much of anything other than rely on their images.

@wyike

wyike commented Apr 15, 2024

@uniit you are right, we install build-essential meta package which is pulling GCC 11.x version by default and doesn't support the options that the kernel is built with. Can think of below options in this case.

1. Build an image with pre-compiled modules for `-gcp` kernel from precompiled folder using steps [here](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html#building-a-custom-driver-container-image)

2. Install 12.x GCC and overwrite what is installed with `build-essential` meta package [here](https://gitlab.com/nvidia/container-images/driver/-/blame/main/ubuntu22.04/install.sh#L10).
apt-get install gcc-12
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12

Hi @shivamerla, I see gcc-12 is already installed here https://gitlab.com/nvidia/container-images/driver/-/blob/main/ubuntu22.04/Dockerfile#L97-101 and set as the default via update-alternatives. Just curious: is the published image built from this Dockerfile or not?

@wyike

wyike commented Apr 16, 2024

I built a driver image myself from the latest code (which includes the fix https://gitlab.com/nvidia/container-images/driver/-/commit/dd69782dc6a21aa92ded68fb9db58bd4b1a23a4a), and it works around the issue temporarily:

docker build -t mydriver \
  --build-arg DRIVER_VERSION="550.54.14" \
  --build-arg DRIVER_BRANCH="550" \
  --build-arg CUDA_VERSION=12.4.0 \
  --build-arg TARGETARCH=amd64 \
  ubuntu22.04

Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
  You are using:           cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-peermem.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-modeset.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-drm.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-uvm.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia.ko due to unavailability of vmlinux
Relinking NVIDIA driver kernel modules...
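
If it helps others following along, a custom image built this way would then be referenced through the operator's driver values (field names as in the gpu-operator Helm chart; the registry below is a placeholder for wherever the image was pushed):

driver:
  enabled: true
  repository: my-registry
  image: driver
  version: "550.54.14"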

@tariq1890
Contributor

@wyike Thanks for testing the CI pipeline build and confirming that it works. We will have this fix out when the next Data Center GPU Driver is released by the driver team. We (the gpu-operator team) build the driver containers off of the driver releases managed by the driver team.

See here to get more info on the Data Center drivers: https://docs.nvidia.com/datacenter/tesla/

@wyike

wyike commented Apr 19, 2024

Hi @tariq1890, sorry to ask a very basic question: why does the driver installer have to be built on top of the cuda-base image (https://gitlab.com/nvidia/container-images/driver/-/blob/main/ubuntu22.04/Dockerfile#L2)? Is anything in that image used by the installer when installing to the host?
I would like to ask this somewhere like a community Slack channel but couldn't find one. Could you help answer the question, or point me to somewhere public where I can ask general questions? Thanks a lot!

@dvaldivia

dvaldivia commented Apr 25, 2024

I managed to work around this by forcing Ubuntu to boot with kernel 5.15.0-79; that fixed the issue and I was able to get the driver installed via the daemonset.
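
For anyone who needs to do the same, the general approach is to make GRUB boot the older kernel by default (a sketch; the exact menu entry text varies by image, so list the entries first):

# List the GRUB menu entries to find the exact name of the 5.15 kernel entry
grep -E "menuentry '" /boot/grub/grub.cfg | cut -d"'" -f2

# Set that entry as the default, regenerate the GRUB config, and reboot
sudo sed -i "s|^GRUB_DEFAULT=.*|GRUB_DEFAULT='Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-79-generic'|" /etc/default/grub
sudo update-grub
sudo reboot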

@wyike

wyike commented May 8, 2024

With this fix, can the new driver installer still be used on previous kernel versions?

@dr3s

dr3s commented Jun 13, 2024

For those coming across this because of GKE auto-upgrades: we were able to pin the node group version to 1.27.11-gke.1062003, because 1.27.11-gke.1062004 introduced a new kernel that triggers this issue.
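
For anyone needing to do the same, node auto-upgrade can be switched off for the affected node pool so it stays on the pinned version (cluster, pool, and zone names below are placeholders):

gcloud container node-pools update my-node-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --no-enable-autoupgrade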

@jerryguowei

The latest AWS Ubuntu 22.04 image also has the same issue. Is there any update on when the new Docker images will be released?

@uniit
Author

uniit commented Nov 18, 2024

The GPU Operator works well on the latest Ubuntu 22.04 in GCP. I can't reproduce the issue described in the subject.

@jerryguowei

@uniit Would you mind giving details on which driver version you are using? And is the kernel also compiled with GCC 12?

Maybe run cat /proc/version to show the kernel info?

@tariq1890
Contributor

@jerryguowei The GCP Ubuntu kernels are compiled using GCC 12. Please use the latest driver containers published by us; they should work on your GCP Ubuntu nodes.

@jerryguowei

Thanks @tariq1890, I just tested the latest GPU Operator version, v24.9.0, and can confirm it works on Ubuntu 22.04 with the 6.x kernel (compiled with GCC 12) and with the previous older kernel (compiled with GCC 11).
