GPU: Check that an instance with a CDI GPU device can be started even after its host abruptly crashes and reboots #393

Draft · wants to merge 1 commit into base: main
62 changes: 61 additions & 1 deletion tests/gpu-container
Member:
Do we need to use a VM for this? Isn't one of the points of CDI that we support nesting (such as Docker) inside a container? Could we use a nested LXD inside an LXD container to test this crash support instead?

Contributor Author:
Yes, true: a nested LXD inside an LXD container should work. I'll test that approach.
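
A minimal sketch of that nested variant (untested; it assumes the same ${IMAGE}, ${LXD_SNAP_CHANNEL} and waitInstanceBooted helpers as the existing test, and that the outer container can itself be given the GPU via CDI):

# Outer container stands in for the host; security.nesting is required for the inner LXD.
lxc launch "${IMAGE}" outer -c security.nesting=true
lxc config device add outer gpu0 gpu id="nvidia.com/gpu=0"
waitInstanceBooted outer

lxc exec outer -- snap install lxd --channel="${LXD_SNAP_CHANNEL}"
lxc exec outer -- lxd init --auto

# Inner container gets the GPU via CDI, same as on the host.
lxc exec outer -- lxc init "${IMAGE}" inner
lxc exec outer -- lxc config device add inner gpu0 gpu id="nvidia.com/gpu=0"
lxc exec outer -- lxc start inner
lxc exec outer -- lxc exec inner -- nvidia-smi

# Force-stopping the outer container simulates the abrupt host crash.
lxc stop -f outer
lxc start outer
waitInstanceBooted outer

# The inner container should still start cleanly after the "crash".
lxc exec outer -- lxc start inner
lxc exec outer -- lxc exec inner -- nvidia-smi

lxc delete -f outer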

@@ -125,9 +125,69 @@ lxc config device add c1 gpu0 gpu id="nvidia.com/gpu=0"
lxc start c1
[ "$(lxc exec c1 -- ls /dev/dri/ | grep -c '^card[0-9]')" = "1" ] || false
lxc exec c1 -- nvidia-smi
lxc delete -f c1

# Check that CDI device files are cleanly removed even if the host machine is abruptly rebooted
echo "==> Testing that CDI device files are cleanly removed after abrupt reboot"
lxc init "${IMAGE}" v1 --vm
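# If the extension is available, expose the host's image store to the VM over devlxd
# so the nested LXD below can reuse ${IMAGE} rather than downloading it again.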
if hasNeededAPIExtension devlxd_images_vm; then
lxc config set v1 security.devlxd.images=true
fi

lxc config device add v1 gpu0 gpu pci="${first_card_pci_slot}"
lxc start v1
echo "==> Waiting for the VM agent to be ready"
waitInstanceBooted v1

echo "==> Installing NVIDIA drivers inside the VM"
lxc exec v1 -- apt-get update
lxc exec v1 --env DEBIAN_FRONTEND=noninteractive -- apt-get install -y ubuntu-drivers-common
lxc exec v1 --env DEBIAN_FRONTEND=noninteractive -- ubuntu-drivers autoinstall

echo "==> Rebooting the VM to load NVIDIA drivers"
lxc restart v1

waitInstanceBooted v1

echo "==> Verifying NVIDIA driver installation in the VM"
lxc exec v1 -- nvidia-smi

echo "==> Installing LXD inside the VM"
lxc exec v1 -- snap install lxd --channel="${LXD_SNAP_CHANNEL}"

echo "==> Initializing LXD inside the VM"
lxc exec v1 -- lxd init --auto

echo "==> Launching a container inside the VM"
lxc exec v1 -- lxc init "${IMAGE}" c1

echo "==> Adding GPU to the container inside the VM using CDI"
lxc exec v1 -- lxc config device add c1 gpu0 gpu id="nvidia.com/gpu=0"
lxc exec v1 -- lxc start c1
# Wait for the container to be ready
sleep 20
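# (Possible alternative, untested: poll instead of a fixed sleep, e.g.
#   timeout 60 bash -c 'until lxc exec v1 -- lxc exec c1 -- true; do sleep 2; done'
# which avoids relying on an arbitrary 20s delay.)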

echo "==> Verifying GPU access inside the container"
lxc exec v1 -- lxc exec c1 -- nvidia-smi

echo "==> Simulating abrupt reboot by force-stopping the VM"
lxc stop v1 -f

echo "==> Starting the VM again"
lxc start v1

waitInstanceBooted v1

echo "==> Starting the container inside the VM after reboot"
lxc exec v1 -- lxc start c1

echo "==> Verifying GPU access inside the container after VM reboot"
lxc exec v1 -- lxc exec c1 -- nvidia-smi

echo "==> Cleaning up the VM"
lxc delete v1 -f

echo "==> Cleaning up"
lxc delete -f c1
lxc profile device remove default root
lxc profile device remove default eth0
lxc storage delete default