The GPU Operator driver build fails on GCP when using Ubuntu 22.04. #564
Comments
Hi @uniit can you provide complete logs from the driver container? |
Sure, please review it below:
|
@uniit can you try with driver version
|
@shivamerla I've tried nvcr.io/nvidia/driver:525.125.06-ubuntu22.04. Looks like gcc-12 is still missing there. Errors are the same. |
@uniit you are right, we install
|
@shivamerla Any news on this? Shouldn't this be managed directly within the driver containers rather than by clients? |
nice bro you are right! |
I'm cross-posting here a bit, but I'm having this same issue, although I'm deploying via Cloud Native Stack (one of the Ansible playbooks) on Ubuntu 22.04. I'm unsure if upgrading my CNS version will solve this, based on what's been said above and that the relevant install.sh/Dockerfile look about the same. Hopefully I'm missing something. Any feedback/insight would be appreciated. Thank you. |
@BHSDuncan @xzzvsxd we are looking to fix this soon as this is affecting all customers using 6.x kernels. |
Thank you for the update. Is there any timeline for when we can expect this fix? Also, is it safe to assume that the only real workaround when using Cloud Native Stack (i.e. being unable to easily rebuild images) is to roll back to a 5.x kernel, e.g. 5.19.0? |
Yes, you can check the GCC version used by referring to |
It's the driver daemonset pod that's failing, specifically the Changing the version of GCC on the host doesn't fix anything. For example, I did try installing GCC 12 on a host, but the problem still happens when starting the driver daemonset pod. |
@BHSDuncan What matters is the version of GCC that was used to build the kernel on the host, so you would need to roll back to a 5.x kernel. Changing the version of GCC within the container should also work, but judging from your previous messages you can't do that. |
Yeah, because I'm using CNS, I can't do much of anything other than rely on their images. |
Hi @shivamerla, I see gcc-12 is already installed here https://gitlab.com/nvidia/container-images/driver/-/blob/main/ubuntu22.04/Dockerfile#L97-101 and set in the alternatives. Just curious: is the image built from this Dockerfile or not? |
I built a driver image myself from the latest code (with the fix https://gitlab.com/nvidia/container-images/driver/-/commit/dd69782dc6a21aa92ded68fb9db58bd4b1a23a4a), which works as a temporary workaround:
|
@wyike Thanks for testing the CI pipeline build and confirming that it works. We will have this fix out when the next Data Center GPU Driver is released by the driver team. We (the gpu-operator team) build the driver containers off of the driver releases managed by the driver team. See here for more info on the Data Center drivers: https://docs.nvidia.com/datacenter/tesla/ |
Hi @tariq1890, sorry to ask a very junior question: why does the driver installer have to be built on the cuda-base image (https://gitlab.com/nvidia/container-images/driver/-/blob/main/ubuntu22.04/Dockerfile#L2)? Is anything in this image used by the installer to install onto the host? |
I managed to work around this by forcing Ubuntu to boot with kernel |
With this fix, can the new driver installer still be used on previous kernel versions? |
For those coming across this because of GKE auto-upgrades: we were able to pin the node group version to 1.27.11-gke.1062003, because 1.27.11-gke.1062004 introduced a new kernel that triggers this issue. |
The latest AWS Ubuntu 22.04 image also has the same issue. Is there any update on when the new Docker images will be released? |
The GPU Operator works well on the latest Ubuntu 22.04 in GCP. I can't reproduce the issue described in the subject. |
@uniit Would you mind sharing which driver version you are using? Also, is the kernel compiled with gcc 12 as well? Maybe run |
@jerryguowei The GCP Ubuntu Kernels are compiled using gcc12. Please use the latest driver containers published by us. They should work on your GCP Ubuntu nodes |
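One common way to confirm which compiler built the running kernel is to read `/proc/version`, which on Linux includes the compiler banner. The sketch below is only an illustration and not necessarily the check the commenters above had in mind. The helper name `kernel_gcc_major` is hypothetical, and the regular expression assumes the two banner styles typically seen on Ubuntu kernels.

```python
import os
import re
from typing import Optional


def kernel_gcc_major(proc_version: str) -> Optional[int]:
    """Return the major version of the gcc named in a /proc/version banner.

    Handles two banner styles (an assumption, not an exhaustive list):
      "... (x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-...) 12.3.0, ..."
      "... (gcc version 9.3.0 (Ubuntu 9.3.0-...)) ..."
    Returns None when no gcc banner is recognised.
    """
    match = re.search(
        r"gcc(?:-\d+)?\s+(?:version\s+|\(.*?\)\s+)(\d+)\.\d+",
        proc_version,
    )
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # /proc/version only exists on Linux; skip quietly elsewhere.
    if os.path.exists("/proc/version"):
        with open("/proc/version", encoding="utf-8") as fh:
            banner = fh.read()
        print("kernel built with gcc major:", kernel_gcc_major(banner))
```

Run on the node itself (not inside the driver container) to see the compiler the host kernel was built with; a result of `None` means the banner did not match either assumed format.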
Thanks @tariq1890, I just tested the latest version of gpu-operator, v24.9.0, and can confirm it works on Ubuntu 22.04 with both the 6.x kernel (compiled with gcc 12) and the older kernel (compiled with gcc 11). |
1. Quick Debug Information
2. Issue or feature description
The GPU Operator cannot build NVIDIA drivers on GCP with Ubuntu 22.04 because the build process tries to use the x86_64-linux-gnu-gcc-12 compiler, which is not available in the driver container. This mismatch causes the build to fail.
Ubuntu 22.04 on AWS works well because its kernel uses the same major compiler version as the generic kernels, which allows the GPU Operator to build the NVIDIA drivers successfully there.
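The mismatch described above can be checked from inside a build environment by looking for a versioned compiler that matches the kernel's. This is a rough sketch under stated assumptions: the function names are hypothetical, the `gcc-<major>` and `x86_64-linux-gnu-gcc-<major>` binary names follow Ubuntu's packaging convention on x86_64, and finding a binary does not by itself guarantee that a driver build will succeed.

```python
import shutil
from typing import Optional


def find_matching_gcc(kernel_gcc_major: int) -> Optional[str]:
    """Return the path of a versioned gcc matching the kernel's compiler.

    Checks the arch-prefixed name first (as it appears in the issue),
    then the plain versioned name. Returns None if neither is on PATH.
    """
    candidates = (
        f"x86_64-linux-gnu-gcc-{kernel_gcc_major}",  # assumes an x86_64 host
        f"gcc-{kernel_gcc_major}",
    )
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


def describe(kernel_gcc_major: int) -> str:
    """Summarise whether a matching compiler is available here."""
    path = find_matching_gcc(kernel_gcc_major)
    if path is None:
        return (
            f"gcc-{kernel_gcc_major} not found on PATH; "
            "a kernel module build for this kernel may fail"
        )
    return f"matching compiler found at {path}"


if __name__ == "__main__":
    # 12 corresponds to the gcc-12 compiler named in this issue.
    print(describe(12))
```

As noted in the thread, this needs to run where the driver is actually built (the driver container), since installing gcc-12 on the host alone did not help.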
3. Steps to reproduce the issue
Install GPU Operator with the following helm values:
4. Question:
Can I easily include gcc-12 in the driver image and change the build instructions to utilize it, either through an environment variable or by overriding the initial command?
Is there a plan to introduce support for Ubuntu 22.04 on GCP?