Device: Remove existing GPU CDI device files before new device files are added (#14842)

If a host machine is not shut down cleanly (e.g. it loses power),
instances with a CDI GPU device attached fail to start once the host is
back up, even when a start command is issued manually.

LXD returns the following error:

`Failed to start device "nvidia-gpu": Failed to create device
"/var/snap/lxd/common/lxd/devices/oel-ogrp623/cdi.unix.nvidia--gpu.dev-nvidia0"
for "/dev/nvidia0": file exists`

To solve that, we must remove any leftover device files before adding
new CDI device files to the instance's GPU device directory. These old
files are still present after a host crash because the GPU device stop
hook is never called.

Fixes #14843
tomponline authored Jan 24, 2025
2 parents 90d42c3 + cdf564a commit 3d9323c
Showing 1 changed file with 22 additions and 0 deletions.
22 changes: 22 additions & 0 deletions lxd/device/gpu_physical.go
@@ -3,6 +3,7 @@ package device
import (
	"encoding/json"
	"fmt"
	"io/fs"
	"net/http"
	"os"
	"path/filepath"
@@ -159,6 +160,27 @@ func (d *gpuPhysical) startCDIDevices(configDevices cdi.ConfigDevices, runConf *
	return fmt.Errorf("Failed to parse minor number %q when starting CDI device: %w", conf["minor"], err)
}

// Check if there are any remaining CDI devices in the instance devices directory.
// If there are, we need to remove them. These can be present in the case where the device stop hook was not called
// (e.g. due to an abrupt host shutdown).
err = filepath.WalkDir(d.inst.DevicesPath(), func(path string, e fs.DirEntry, _ error) error {
	if e.IsDir() {
		return nil
	}

	if strings.HasPrefix(e.Name(), cdi.CDIUnixPrefix+"."+d.name) {
		err := os.Remove(path)
		if err != nil {
			return err
		}
	}

	return nil
})
if err != nil {
	return err
}

// Here putting a `cdi.CDIUnixPrefix` prefix with 'd.name' as a device name will create a directory entry like:
// <lxd_var_path>/devices/<instance_name>/<cdi.CDIUnixPrefix>.<gpu_device_name>.<path_encoded_relative_dest_path>
// 'unixDeviceSetupCharNum' is already checking for dupe entries so we have no validation to do here.