Automatically provisioning X11 and Wayland devices of GPU inside container? #118

Open
ehfd opened this issue Nov 28, 2020 · 29 comments

Comments

@ehfd

ehfd commented Nov 28, 2020

Redirected from NVIDIA/k8s-device-plugin#206 to a more suitable repository.

1. Issue or feature description

In Docker and Kubernetes, provisioning an X server has required manual host setup using the host path directive /tmp/.X11-unix. This is tedious for sysadmins and at the same time a security threat, because users can spoof the host.

To mitigate this, there have been attempts (https://github.com/ehfd/docker-nvidia-glx-desktop which is based on https://github.com/ryought/glx-docker-headless-gpu) to run an X server and use GLX inside the container after the GPU has been provisioned using libnvidia-container.

As an alternative, the developers of VirtualGL (used widely in HPC to enable GPU-based rendering in VNC virtual display environments) have developed a feature that uses the EGL API to enable 3D GL rendering for applications such as Blender, Matlab, and Unity, which was previously only possible with GLX and thus an X server. As you know, nvidia-docker does not support GLX but introduced support for the EGL API just under two years ago.
See EGL config section of VirtualGL/virtualgl#113 (comment)

EGL is also required to start a Wayland compositor inside a container with the EGLStreams specification in NVIDIA GPUs, which is the way forward after X11 development has stopped.

These use cases require access to the devices /dev/dri/cardX corresponding to each GPU provisioned using libnvidia-container. However, it does not seem like libnvidia-container provisions this automatically. I would like to ask you whether this is possible, and how this can be configured.

2. Steps to reproduce the issue

Provision one GPU inside container nvidia/cudagl:11.0-devel-ubuntu20.04 or nvidia/opengl:1.2-glvnd-devel-ubuntu20.04 in Docker CE 19.03 (or using one nvidia.com/gpu: 1 with k8s-device-plugin v0.7.0 with default configurations in Kubernetes v1.18.6).

Do: ls /dev

Result: Inside the container you see /dev/nvidiaX, /dev/nvidia-modeset, /dev/nvidia-uvm, /dev/nvidia-uvm-tools, HOWEVER directory /dev/dri does not exist.
Wayland compositors are unlikely to start inside a container without DRM devices. VirtualGL likewise does not work through any devices other than /dev/dri/cardX.
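
For anyone reproducing this, the `ls /dev` check can be scripted. The following is only a minimal Python sketch: the function names and the `dev_root` parameter are illustrative additions (not part of any NVIDIA tool), and the list of expected entries is taken from the result described above.

```python
import os

# Entries named in the result above; "dri" is the one reported as missing.
EXPECTED = ["nvidia-modeset", "nvidia-uvm", "nvidia-uvm-tools", "dri"]


def missing_device_entries(dev_root="/dev", expected=EXPECTED):
    """Return the expected entries that do not exist under dev_root."""
    return [name for name in expected
            if not os.path.exists(os.path.join(dev_root, name))]


def dri_nodes(dev_root="/dev"):
    """List card*/renderD* entries under dev_root/dri, or [] if it is absent."""
    dri = os.path.join(dev_root, "dri")
    if not os.path.isdir(dri):
        return []
    return sorted(name for name in os.listdir(dri)
                  if name.startswith(("card", "renderD")))


if __name__ == "__main__":
    print("missing entries:", missing_device_entries())
    print("DRM nodes:", dri_nodes())
```

Run inside the container, an empty "DRM nodes" list together with "dri" in the missing entries corresponds to the situation reported here.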

3. Information to attach (optional if deemed irrelevant)

Other issues and repositories:
Example of VirtualGL EGL configuration that requires /dev/dri/cardX: https://github.com/ehfd/docker-nvidia-egl-desktop

Implementation of an unprivileged remote desktop bundling an X server with many hacks: https://github.com/ehfd/docker-nvidia-glx-desktop

@klueska
Contributor

klueska commented Dec 1, 2020

I have added this feature request to our backlog. At present we have a big backlog, so it's unclear exactly when we will be able to look at this in detail.

That said, it feels like it could be added as a new NVIDIA_DRIVER_CAPABILITY that tries to look for these devices if they exist and inject them. You would set this capability either in the container image or the command line via an environment variable (which would work in the k8s context as well).

@ehfd
Author

ehfd commented Dec 2, 2020

As the thumbs-up reactions show, this feature is in high demand, so it would be great if it could be implemented soon. Thank you.

@xkszltl

xkszltl commented Dec 15, 2020

If you get a chance to do that, maybe add /dev/gdrdrv for nvidia gdrcopy as well.

@ehfd
Author

ehfd commented Jan 23, 2021

https://github.com/mviereck/x11docker/wiki/Hardware-acceleration#share-nvidia-device-files-with-container

To use a custom base image, share all files matching /dev/nvidia*, /dev/nvhost* and /dev/nvmap with docker run option --device. Share /dev/dri and /dev/vga_arbiter, too. Add container user to groups video and render with --group-add video --group-add render.
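
As a rough illustration of the quoted instructions, the sketch below builds the corresponding `docker run` arguments in Python. The device patterns and groups are the ones quoted above; the function name, the `expand` parameter, and the image name `my-image` are placeholders introduced only for this example.

```python
import glob
import shlex

# Patterns and groups quoted from the x11docker wiki text above.
DEVICE_PATTERNS = ["/dev/nvidia*", "/dev/nvhost*", "/dev/nvmap",
                   "/dev/dri", "/dev/vga_arbiter"]
GROUPS = ["video", "render"]


def docker_device_args(patterns=DEVICE_PATTERNS, groups=GROUPS,
                       expand=glob.glob):
    """Build --device/--group-add arguments for paths that exist on the host."""
    args = []
    for pattern in patterns:
        # Patterns that match nothing on this host are simply skipped.
        for path in sorted(expand(pattern)):
            args += ["--device", path]
    for group in groups:
        args += ["--group-add", group]
    return args


if __name__ == "__main__":
    # "my-image" is a placeholder image name.
    cmd = ["docker", "run", "--rm", "-it"] + docker_device_args() + ["my-image"]
    print(shlex.join(cmd))
```

This shares every matching device on the host, which is exactly the manual step that automatic provisioning would replace.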

In addition to the initial feature request, these are all the devices that must be provisioned automatically for NVIDIA to officially support displays (e.g. X11, Wayland) in Docker. If the container toolkit can provision these devices automatically, the nvidia/opengl container (nvidia-docker) can properly support the NVIDIA version of XWayland (support for which NVIDIA devs are currently working into the Linux kernel) and thus support displays.

There are a lot of people waiting for Display support in Docker and Kubernetes, especially because NVIDIA is to support XWayland in the near future. Please implement this feature to streamline this.

@ehfd
Author

ehfd commented Jul 21, 2021

Any updates? @klueska
I was able to start up an unprivileged X server inside an OCI Docker container with nvidia-docker in https://github.com/ehfd/docker-nvidia-glx-desktop, but thinking ahead to Wayland support (since the 470 driver is out), we likely need this.

@ehfd
Author

ehfd commented Jun 15, 2022

Please use https://gitlab.com/arm-research/smarter/smarter-device-manager for /dev/dri/card* and /dev/dri/render* if you stumble upon this issue.

@ehfd
Author

ehfd commented Sep 12, 2022

EGL does not require /dev/dri for NVIDIA devices. VirtualGL has merged support for GLX over EGL without such devices.

@ehfd ehfd closed this as completed Sep 12, 2022
@ehfd
Author

ehfd commented Sep 28, 2022

Still likely needed for Wayland with GBM.

@ehfd ehfd reopened this Sep 28, 2022
@elezar
Member

elezar commented Sep 28, 2022

Thanks @ehfd. We are working on improving the injection of these devices in an upcoming release. Note that the current plan is to do so using the nvidia-container-runtime at an OCI runtime specification level instead of relying on the NVIDIA Container CLI.

Do you have samples containers / test cases that you would be able to provide to ensure that we meet the requirements?

@ehfd
Author

ehfd commented Sep 28, 2022

@elezar
https://github.com/ehfd/docker-nvidia-glx-desktop/blob/main/entrypoint.sh
https://github.com/ehfd/docker-nvidia-egl-desktop/blob/main/entrypoint.sh

These two repositories involve a series of hacks to make NVIDIA GPUs work reliably inside an unprivileged container with a properly accelerated GUI.

docker-nvidia-glx-desktop must install the userspace driver components at startup, mostly following your examples, after reading the driver version from /proc/driver/nvidia/version, because the libraries aren't injected into the container.

In the current state, the same userspace driver installation must be done for Wayland by reading /proc/driver/nvidia/version as well. This is undesirable.
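
To make the workaround concrete, here is a hedged Python sketch of the version lookup. It assumes that the file contains a line starting with "NVRM version:" that carries the dotted driver version (for example 525.78.01); the regular expression and function name are illustrative and are not taken from either entrypoint.sh.

```python
import re

# Assumption: the "NVRM version:" line contains the dotted driver version
# as a whitespace-separated token, e.g. "... Kernel Module  525.78.01  ...".
_NVRM_VERSION = re.compile(
    r"^NVRM version:.*?\s(\d+(?:\.\d+)+)(?:\s|$)", re.MULTILINE)


def driver_version(text):
    """Return the driver version found in the given text, or None."""
    match = _NVRM_VERSION.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    try:
        with open("/proc/driver/nvidia/version") as handle:
            print("driver version:", driver_version(handle.read()))
    except OSError as error:
        print("could not read the version file:", error)
```

The version string is what selects the matching .run userspace package, which is the step that proper library injection would make unnecessary.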

Also, in docker-nvidia-egl-desktop, where the userspace drivers aren't installed at startup, an annoying situation arises: Vulkan requires the display capability to be included in NVIDIA_DRIVER_CAPABILITIES, because nvidia_icd.json requires libGLX_nvidia.so.0 and probably other libraries, even when not using Xorg with the NVIDIA driver.

Vulkan should be possible only with the graphics capability as intended, but it requires display as well. NVIDIA/nvidia-container-toolkit#140 Thank god it does work without major modifications to libnvidia-container.
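
The dependency can be checked from inside a container with a small Python sketch like the one below. It assumes the usual Vulkan ICD manifest layout, in which the library is named under `ICD.library_path`, and the manifest path /etc/vulkan/icd.d/nvidia_icd.json; the helper names are made up for this example.

```python
import ctypes
import json
import os

# Assumed manifest location inside the container.
MANIFEST = "/etc/vulkan/icd.d/nvidia_icd.json"


def icd_library(manifest_text):
    """Return the library named by ICD.library_path in a Vulkan ICD manifest."""
    return json.loads(manifest_text)["ICD"]["library_path"]


def can_load(library):
    """Report whether the dynamic loader can open the given library."""
    try:
        ctypes.CDLL(library)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    if os.path.exists(MANIFEST):
        with open(MANIFEST) as handle:
            library = icd_library(handle.read())
        print(library, "loadable:", can_load(library))
    else:
        print("no manifest at", MANIFEST)
```

If the manifest is present but the library cannot be loaded, the Vulkan loader has nothing to hand to the application, which matches the behaviour described when only graphics is set.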

Also, the display capability currently does not enable starting an Xorg server with the NVIDIA driver, because the required libraries are not injected into the container. Hence the hacks applied by these two containers are required.

Please also consider injecting the necessary libraries for NVFBC with the video capability as well, even if the SDK must be installed inside the container.

We really hope that NVIDIA_DRIVER_CAPABILITIES starts working properly and that the hacks that my containers applied won't be needed anymore. These can all likely be done at OCI runtime specification level.

Note that we currently use https://gitlab.com/arm-research/smarter/smarter-device-manager for provisioning /dev/dri devices, but there is no methodology to push just the devices for the GPU allocated to the container.

Thanks a lot!

@elezar
Member

elezar commented Sep 28, 2022

Thanks for all the information. I will comb through it while working on the feature. Hopefully we can improve things significantly!

@Zubnix

Zubnix commented Oct 18, 2022

Hi @elezar @ehfd,

I'm writing a remote Wayland compositor and am currently busy integrating it with k8s, and I can independently confirm everything @ehfd has stated so far, as I've hit all of these issues in the last couple of weeks. Being able to access /dev/dri/renderDevice12x and /dev/dri/cardx while limiting, and preferably eliminating, the startup actions and driver dependencies of a container is an absolute must.

@elezar I'm happy to assist and answer any questions you might have to help move this forward!

@elezar
Member

elezar commented Oct 18, 2022

Thanks @Zubnix. We have started work on injecting the /dev/dri/cardx devices as part of https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/219

I think in all cases having a list of specific devices, libraries, and environment variables that are required in a container for things to work as expected would be quite useful. We will be sure to update this issue as soon as there is something out for testing and early feedback.

@ehfd
Author

ehfd commented Oct 18, 2022

@Zubnix Hi! I've been interested in Greenfield for a long time. Nice to meet you here! I also agree that eliminating the driver dependencies of a container is very important. Thanks for your feedback!
Btw, do you have any interest in using WebTransport over WebSockets in your project?

@Zubnix

Zubnix commented Oct 18, 2022

Hi @ehfd, I've written my answer here so as not to hijack this thread :)

@ehfd
Author

ehfd commented Jan 16, 2023

@elezar Hi! I saw that the /dev/dri component got merged.
https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/commit/f7021d84b555b00857640681136b9b9b08fd067f

I believe that should make Wayland fundamentally work in Kubernetes.

Would it be possible to pass the below library components for enhanced X11/Wayland support?
https://download.nvidia.com/XFree86/Linux-x86_64/525.78.01/README/installedcomponents.html

@elezar
Member

elezar commented Jan 18, 2023

Thanks @ehfd I will have a look at the link you suggested.

@ehfd
Author

ehfd commented Jan 18, 2023

@elezar
Specifically, I feel the below are necessary for a full X11/Wayland + OpenGL EGL/GLX + Vulkan stack without downloading the driver from the container.

Anything marked with AND should be injected in either of the cases. And as you know well, the generic symlinks to the .so.525.78.01 files should be passed as well.

And I believe that, for practical use, everything in graphics should be injected anyways if display is specified without graphics. Else I feel that it won't work.

Configuration .json files should be added to the container like the base images do now.

(should be injected to display)
'/usr/lib/xorg/modules/drivers/nvidia_drv.so'
'/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01'
'/usr/bin/nvidia-xconfig'
'/usr/bin/nvidia-settings' + /usr/lib/libnvidia-gtk2.so.525.78.01 and on some platforms /usr/lib/libnvidia-gtk3.so.525.78.01

(should be injected to graphics AND display, probably already injected)
'/usr/lib/libGL.so.1', '/usr/lib/libEGL.so.1', '/usr/lib/libGLESv1_CM.so.525.78.01', '/usr/lib/libGLESv2.so.525.78.01', '/usr/lib/libEGL_nvidia.so.0'

(should be injected to graphics AND display)
'/usr/lib/libOpenGL.so.0', '/usr/lib/libGLX.so.0', and '/usr/lib/libGLdispatch.so.0', '/usr/lib/libnvidia-tls.so.525.78.01'

(currently injected to display only, must be injected for graphics too in order to use Vulkan)
'/usr/lib/libGLX_nvidia.so.0' and the configuration /etc/vulkan/icd.d/nvidia_icd.json

(should be injected to display AND egl, else eglinfo segfaults)
'/usr/lib/libnvidia-egl-wayland.so.1' and the config '/usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json'
'/usr/lib/libnvidia-egl-gbm.so.1' and the config '/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json'

(should be injected to video AND display)
/usr/lib/libnvidia-fbc.so.525.78.01

(should be injected to graphics AND video)
/usr/lib/libnvoptix.so.1

(should be injected to compute as there is a CUDA and CUVID dependency)
/usr/lib/libnvidia-opticalflow.so.525.78.01

(should be injected to video, not currently injected)
/usr/lib/vdpau/libvdpau_nvidia.so.525.78.01

(should be injected to video)
/usr/lib/libnvidia-encode.so.525.78.01

(should be injected to both compute AND video)
/usr/lib/libnvcuvid.so.525.78.01

(should be injected to compute if not already there)
Two OpenCL libraries (/usr/lib/libOpenCL.so.1.0.0, /usr/lib/libnvidia-opencl.so.525.78.01); the former is a vendor-independent Installable Client Driver (ICD) loader, and the latter is the NVIDIA Vendor ICD. A config file /etc/OpenCL/vendors/nvidia.icd is also installed, to advertise the NVIDIA Vendor ICD to the ICD Loader.

(should be injected to utility)
/usr/lib/libnvidia-ml.so.525.78.01

(should be injected to ngx)
/usr/lib/libnvidia-ngx.so.525.78.01
/usr/bin/nvidia-ngx-updater
/usr/lib/nvidia/wine/nvngx.dll
/usr/lib/nvidia/wine/_nvngx.dll

Various libraries that are used internally by other driver components. These include /usr/lib/libnvidia-cfg.so.525.78.01, /usr/lib/libnvidia-compiler.so.525.78.01, /usr/lib/libnvidia-eglcore.so.525.78.01, /usr/lib/libnvidia-glcore.so.525.78.01, /usr/lib/libnvidia-glsi.so.525.78.01, /usr/lib/libnvidia-glvkspirv.so.525.78.01, /usr/lib/libnvidia-rtcore.so.525.78.01, and /usr/lib/libnvidia-allocator.so.525.78.01.

@ehfd
Author

ehfd commented Nov 25, 2023

As of libnvidia-container 1.14.3-1:

/usr/lib/xorg/modules/drivers/nvidia_drv.so
/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01

libnvidia-egl-gbm.so.1
libnvidia-egl-wayland.so.1

libnvidia-vulkan-producer.so

gbm/nvidia-drm_gbm.so

These important libraries are still not provisioned.
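
A quick way to re-check this list inside a container is sketched below in Python. The search directories are an assumption (distributions place these files in different locations), file names are matched by prefix so that versioned names such as libglxserver_nvidia.so.525.78.01 are also found, and the function name is illustrative.

```python
import os

# File names taken from the list above, without directories or versions.
WANTED = [
    "nvidia_drv.so",
    "libglxserver_nvidia.so",
    "libnvidia-egl-gbm.so.1",
    "libnvidia-egl-wayland.so.1",
    "libnvidia-vulkan-producer.so",
    "nvidia-drm_gbm.so",
]

# Assumption: typical library roots; adjust for the distribution in use.
SEARCH_DIRS = ["/usr/lib", "/usr/lib64"]


def missing_libraries(wanted=WANTED, search_dirs=SEARCH_DIRS):
    """Return the wanted names with no prefix match under the search dirs."""
    found = set()
    for root in search_dirs:
        for _directory, _subdirs, files in os.walk(root):
            for name in files:
                for library in wanted:
                    if name.startswith(library):
                        found.add(library)
    return [library for library in wanted if library not in found]


if __name__ == "__main__":
    print("not found:", missing_libraries())
```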

@elezar

@ehfd ehfd changed the title Automatically provisioning /dev/dri devices of GPU inside container? Automatically provisioning X11 and Wayland devices of GPU inside container? Dec 21, 2023
@ehfd
Author

ehfd commented Dec 21, 2023

@klueska @elezar A reminder for you guys... The below are the only libraries left until I can finally close this three-year-old issue and both X11 and Wayland work inside a container.

This is likely about 30 minutes of work for you guys.

Things mostly work now, but only after downloading .run userspace driver library files inside the container.

/usr/lib/xorg/modules/drivers/nvidia_drv.so
/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01

libnvidia-egl-gbm.so.1
libnvidia-egl-wayland.so.1

libnvidia-vulkan-producer.so

gbm/nvidia-drm_gbm.so

If you can't include some of these into the container toolkit, please tell us why.

@elezar
Member

elezar commented Jan 8, 2024

@ehfd thanks for the reminder here.

Some of the libraries are already handled by the NVIDIA Container Toolkit, with the caveat that their detection may be distribution-dependent at the moment. The main thing to change here is where we search for the libraries. There is no technical reason why we haven't done this; the delay is largely caused by resource constraints.

Note that in theory, if you mount these missing libraries from the host it should not be required to use the .run file to install the user space libraries in the container.

If you have capacity to contribute the changes, I would be happy to review these. Note that I would recommend making these against the NVIDIA Container Toolkit where we already inject some of the libraries that you mentioned.

@ehfd
Author

ehfd commented Jan 8, 2024

Thank you @elezar
I will assess this within the NVIDIA GitLab repositories and possibly contribute code to inject these packages.
Thanks!

@ehfd
Author

ehfd commented Jan 10, 2024

@elezar

The core issue seems to be that the logic in https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/blob/main/internal/discover/graphics.go is somehow not invoked with Docker. Perhaps this has something to do with the Docker runner not being based on CDI?

@elezar
Member

elezar commented Jan 10, 2024

To trigger the logic as linked you need to:

  1. Use the nvidia runtime
  2. Ensure that NVIDIA_DRIVER_CAPABILITIES includes graphics or display.

To configure the nvidia runtime for docker follow the steps described here.

Then we can run a container:

docker run --rm -ti --runtime=nvidia --gpus=all -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu 

This does not require CDI support explicitly.
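
The second condition can be expressed as a small check. The Python sketch below is not the toolkit's own code: the function name is hypothetical, and treating the value `all` as covering graphics and display is an assumption based on the example command above.

```python
# Capabilities that the two steps above name as triggers.
GRAPHICS_TRIGGERS = {"graphics", "display"}


def enables_graphics_discovery(capabilities):
    """Check a comma-separated NVIDIA_DRIVER_CAPABILITIES value.

    Returns True when the value includes graphics or display, or is "all"
    (assumed here to include both).
    """
    caps = {cap.strip() for cap in capabilities.split(",") if cap.strip()}
    return "all" in caps or bool(caps & GRAPHICS_TRIGGERS)


if __name__ == "__main__":
    for value in ["compute,utility", "graphics,utility", "all"]:
        print(value, "->", enables_graphics_discovery(value))
```

A value such as compute,utility would therefore not trigger the linked logic even when the nvidia runtime is in use.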

@ehfd
Author

ehfd commented Mar 28, 2024

Most of the above issues were probably because the PPA for graphics drivers did not install:

libnvidia-egl-gbm1
libnvidia-egl-wayland1

@ehfd
Author

ehfd commented May 10, 2024

@elezar I have a contribution.

NVIDIA/nvidia-container-toolkit#490

@ehfd
Author

ehfd commented May 10, 2024

NVIDIA/nvidia-container-toolkit#490 (comment)

More detailed situation and requirements to close this issue conclusively.

@ehfd
Author

ehfd commented Jun 24, 2024

PR to fix Wayland: NVIDIA/nvidia-container-toolkit#548 - Merged.

New issue for X11: NVIDIA/nvidia-container-toolkit#563
