Random "No device found" error inside CVM #52

Open
hiroki-chen opened this issue Apr 9, 2024 · 1 comment

hiroki-chen commented Apr 9, 2024

I've recently been doing kernel programming and a lot of debugging. Sometimes, after reloading the KVM and/or VFIO modules and logging back into the CVM, running nvidia-smi conf-compute -f after nvidia-persistenced fails with the following error:

$ sudo ./NvidiaCC/guest_nv_init.sh
[sudo] password for h100: 
nvidia-smi conf-compute -f
No devices were found
ERROR: nvidia-smi conf-compute -f

The above script just runs nvidia-persistenced and enables confidential computing mode. The dmesg output shows that the RM initialization timed out:

[   36.077455] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d08c22901a000 >= 3d08c20857cf00
[   36.077466] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[   36.077473] NVRM kgspBootstrapRiscvOSEarly_GH100: Timeout waiting for GSP target mask release. This error may be caused by several reasons: Bootrom may have failed, GSP init code may have failed or ACR failed to release target mask. RM does not have access to information on which of those conditions happened.
[   36.077507] NVRM kfspDumpDebugState_GH100: FSP microcode v4.76
[   36.077509] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x0
[   36.077511] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x0
[   36.077513] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x0
[   36.077515] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0x0
[   36.077518] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0x65
[   36.077529] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[   38.591844] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x65:1660)
[   38.593638] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   38.715488] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[   38.715518] NVRM osInitNvMapping: *** Cannot attach gpu
[   38.715520] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[   38.715546] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[   38.717604] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   38.913733] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[   38.913771] NVRM osInitNvMapping: *** Cannot attach gpu
[   38.913774] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[   38.913806] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[   38.917068] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   39.032908] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[   39.032932] NVRM osInitNvMapping: *** Cannot attach gpu
[   39.032935] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[   39.032962] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[   39.035083] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   39.099929] nvidia-uvm: Loaded the UVM driver, major device number 237.
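
For reference, guest_nv_init.sh is essentially just the following (a minimal sketch; the exact options are specific to my setup):

#!/bin/bash
# Sketch of guest_nv_init.sh, run with sudo inside the CVM
set -e
nvidia-persistenced                 # start the persistence daemon
nvidia-smi conf-compute -f          # the conf-compute step that fails with "No devices were found"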

I added vfio-pci.disable_idle_d3=1 to the kernel boot command line to prevent VFIO from putting the idle GPU into the D3 power state.
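
Concretely, I set it like this on the host (a sketch; assumes a GRUB-based Ubuntu-style setup, adjust for your bootloader):

# /etc/default/grub on the host
GRUB_CMDLINE_LINUX_DEFAULT="... vfio-pci.disable_idle_d3=1"

# regenerate the GRUB config and reboot
sudo update-grub
sudo reboot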

I also tried unbinding and rebinding the GPU to vfio-pci (roughly the sequence sketched below), but that didn't work either. However, the issue mysteriously resolves itself after several reboots.
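
The unbind/rebind sequence I tried on the host was roughly the following (the BDF is a placeholder for the H100's host-side PCI address):

# placeholder BDF; substitute the GPU's real host address
BDF=0000:xx:00.0

# detach the device from its current driver, then hand it to vfio-pci
echo "$BDF"   | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind
echo vfio-pci | sudo tee /sys/bus/pci/devices/$BDF/driver_override
echo "$BDF"   | sudo tee /sys/bus/pci/drivers/vfio-pci/bind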

rnertney (Collaborator) commented

Please ensure you are following the flows in our deployment guides.

It looks like nvidia-persistenced is being run without --uvm-persistence-mode, which is required.
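
For example (other daemon options depend on your environment):

# stop any instance started without the flag, then restart with UVM persistence enabled
sudo pkill -f nvidia-persistenced
sudo nvidia-persistenced --uvm-persistence-mode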

vfio-pci.disable_idle_d3=1 should no longer be required with the GA of our CC software stack.
