I've recently been doing kernel programming and a lot of debugging. Sometimes, after I reload the KVM and/or VFIO modules, log back into the CVM, and run nvidia-persistenced followed by nvidia-smi conf-compute -f, I get this error:
$ sudo ./NvidiaCC/guest_nv_init.sh
[sudo] password for h100:
nvidia-smi conf-compute -f
No devices were found
ERROR: nvidia-smi conf-compute -f
The script above just runs nvidia-persistenced and enables confidential computing mode. The dmesg output shows that RM initialization timed out:
[ 36.077455] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d08c22901a000 >= 3d08c20857cf00
[ 36.077466] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[ 36.077473] NVRM kgspBootstrapRiscvOSEarly_GH100: Timeout waiting for GSP target mask release. This error may be caused by several reasons: Bootrom may have failed, GSP init code may have failed or ACR failed to release target mask. RM does not have access to information on which of those conditions happened.
[ 36.077507] NVRM kfspDumpDebugState_GH100: FSP microcode v4.76
[ 36.077509] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x0
[ 36.077511] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x0
[ 36.077513] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x0
[ 36.077515] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0x0
[ 36.077518] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0x65
[ 36.077529] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 38.591844] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x65:1660)
[ 38.593638] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 38.715488] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[ 38.715518] NVRM osInitNvMapping: *** Cannot attach gpu
[ 38.715520] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[ 38.715546] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[ 38.717604] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 38.913733] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[ 38.913771] NVRM osInitNvMapping: *** Cannot attach gpu
[ 38.913774] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[ 38.913806] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[ 38.917068] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 39.032908] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[ 39.032932] NVRM osInitNvMapping: *** Cannot attach gpu
[ 39.032935] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[ 39.032962] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[ 39.035083] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 39.099929] nvidia-uvm: Loaded the UVM driver, major device number 237.
I added vfio-pci.disable_idle_d3=1 to the kernel boot command line to prevent VFIO from putting the idle GPU into the D3 low-power state.
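To confirm that the parameter actually took effect, its value can be read back from sysfs on the machine where vfio-pci is loaded. This is a minimal sketch, assuming the usual `/sys/module/vfio_pci/parameters/disable_idle_d3` location; the helper name `idle_d3_disabled` is hypothetical and only for illustration.

```python
from pathlib import Path
from typing import Optional

# Usual sysfs location of the vfio-pci module parameter (assumption: the
# module is loaded or built in, so the parameters directory exists).
PARAM = Path("/sys/module/vfio_pci/parameters/disable_idle_d3")


def idle_d3_disabled(param_path: Path = PARAM) -> Optional[bool]:
    """Return True/False for the parameter, or None if it is not exposed."""
    try:
        value = param_path.read_text().strip()
    except OSError:
        # Module not loaded, or the file is not readable.
        return None
    # Kernel boolean parameters read back as "Y" or "N".
    return value in ("Y", "1")


if __name__ == "__main__":
    print("disable_idle_d3:", idle_d3_disabled())
```

A result of `None` would suggest that vfio-pci is not loaded at all, which is worth ruling out after reloading modules.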
I also tried unbinding and rebinding the GPU to vfio-pci, but that didn't help either. However, the issue goes away after several reboots, for no reason I can identify.
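Because the failure is intermittent, it may help to compare dmesg captures from good and bad boots. The sketch below tallies the `RmInitAdapter failed!` lines by PCI address and status triple. The function name `summarize_rm_failures` and the regular expression are assumptions based only on the log format shown above, not on any NVIDIA tooling.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
#   NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x65:1660)
FAIL_RE = re.compile(
    r"NVRM: GPU (?P<bdf>[0-9a-fA-F:.]+): RmInitAdapter failed! "
    r"\((?P<codes>0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:\d+)\)"
)


def summarize_rm_failures(dmesg_text: str) -> Counter:
    """Count RmInitAdapter failures per (PCI address, status triple)."""
    counts: Counter = Counter()
    for line in dmesg_text.splitlines():
        match = FAIL_RE.search(line)
        if match:
            counts[(match.group("bdf"), match.group("codes"))] += 1
    return counts


if __name__ == "__main__":
    # Reads a dmesg capture from stdin and prints one line per distinct failure.
    for (bdf, codes), n in summarize_rm_failures(sys.stdin.read()).items():
        print(f"{bdf} {codes} x{n}")
```

If the block is saved as, say, `rm_failures.py` (a hypothetical file name), it could be run as `sudo dmesg | python3 rm_failures.py`. For the log above, this separates the initial GSP bootstrap timeout (0x62:0x65:1660) from the repeated attach failures that follow it (0x22:0x56:631).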