Random "No device found" error inside CVM #52

Open
hiroki-chen opened this issue Apr 9, 2024 · 1 comment

hiroki-chen commented Apr 9, 2024

I've recently been doing kernel programming and a lot of debugging. Sometimes, after reloading the KVM and/or VFIO modules and logging back into the CVM, running nvidia-smi conf-compute -f after nvidia-persistenced fails with the following error:

$ sudo ./NvidiaCC/guest_nv_init.sh
[sudo] password for h100: 
nvidia-smi conf-compute -f
No devices were found
ERROR: nvidia-smi conf-compute -f

The above script just runs nvidia-persistenced and enables confidential computing mode. The dmesg output shows that the RM initialization timed out:

[   36.077455] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d08c22901a000 >= 3d08c20857cf00
[   36.077466] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[   36.077473] NVRM kgspBootstrapRiscvOSEarly_GH100: Timeout waiting for GSP target mask release. This error may be caused by several reasons: Bootrom may have failed, GSP init code may have failed or ACR failed to release target mask. RM does not have access to information on which of those conditions happened.
[   36.077507] NVRM kfspDumpDebugState_GH100: FSP microcode v4.76
[   36.077509] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x0
[   36.077511] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x0
[   36.077513] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x0
[   36.077515] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0x0
[   36.077518] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0x65
[   36.077529] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[   38.591844] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x65:1660)
[   38.593638] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   38.715488] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[   38.715518] NVRM osInitNvMapping: *** Cannot attach gpu
[   38.715520] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[   38.715546] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[   38.717604] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   38.913733] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[   38.913771] NVRM osInitNvMapping: *** Cannot attach gpu
[   38.913774] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[   38.913806] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[   38.917068] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   39.032908] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[   39.032932] NVRM osInitNvMapping: *** Cannot attach gpu
[   39.032935] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[   39.032962] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[   39.035083] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   39.099929] nvidia-uvm: Loaded the UVM driver, major device number 237.
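
For reference, guest_nv_init.sh is essentially just the following (a minimal sketch; the exact options are specific to my setup):

#!/bin/bash
# Sketch of guest_nv_init.sh, run with sudo inside the CVM
set -e
nvidia-persistenced                 # start the persistence daemon
nvidia-smi conf-compute -f          # the conf-compute step that fails with "No devices were found"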

I added vfio-pci.disable_idle_d3=1 to the kernel boot command line to prevent VFIO from putting the idle GPU into the D3 power state.
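
Concretely, I set it like this on the host (a sketch; assumes a GRUB-based Ubuntu-style setup, adjust for your bootloader):

# /etc/default/grub on the host
GRUB_CMDLINE_LINUX_DEFAULT="... vfio-pci.disable_idle_d3=1"

# regenerate the GRUB config and reboot
sudo update-grub
sudo reboot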

I also tried unbinding and rebinding the GPU to vfio-pci (roughly the sequence sketched below), but that didn't work either. However, the issue mysteriously resolves itself after several reboots.
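
The unbind/rebind sequence I tried on the host was roughly the following (the BDF is a placeholder for the H100's host-side PCI address):

# placeholder BDF; substitute the GPU's real host address
BDF=0000:xx:00.0

# detach the device from its current driver, then hand it to vfio-pci
echo "$BDF"   | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind
echo vfio-pci | sudo tee /sys/bus/pci/devices/$BDF/driver_override
echo "$BDF"   | sudo tee /sys/bus/pci/drivers/vfio-pci/bind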

rnertney (Collaborator) commented

Please ensure you are following the flows in our deployment guides.

It looks like nvidia-persistenced is being run without --uvm-persistence-mode, which is required.
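
For example (other daemon options depend on your environment):

# stop any instance started without the flag, then restart with UVM persistence enabled
sudo pkill -f nvidia-persistenced
sudo nvidia-persistenced --uvm-persistence-mode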

vfio-pci.disable_idle_d3=1 should no longer be required with the GA of our CC software stack.
