Cannot turn off GPU confidential computing #58
Comments
I have also encountered this issue. |
You should power off the VM before changing the CC state. |
@Tan-YiFan Yes, I have powered off the VM, but I still can't change the CC state. Should I reflash the firmware of the GPU? |
Did you try other commands of nvidia_gpu_tools.py? If a GPU is in CC mode (…) |
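For reference, the usual CC-mode commands in NVIDIA's gpu-admin-tools look roughly like the lines below (a sketch; xx:00.0 is a placeholder for the GPU's BDF, and the flag names assume a current checkout of the repository):
# Query the GPU's current confidential-computing mode
sudo python3 ./nvidia_gpu_tools.py --gpu-bdf=xx:00.0 --query-cc-mode
# Turn CC off and reset the GPU so the new mode takes effect
sudo python3 ./nvidia_gpu_tools.py --gpu-bdf=xx:00.0 --set-cc-mode=off --reset-after-cc-mode-switch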
@Tan-YiFan
However, when I ran the command with "--recover-broken-gpu", I got this error:
I guessed it was related to modprobe and rebooting the host machine, but I have uninstalled the nvidia driver, and when I ran "lspci -d 10de: -k", I got:
Should I still blacklist the nvidia driver, or could it be related to nouveau? |
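One common way to keep host drivers off the GPU while switching CC mode is to blacklist both nouveau and nvidia and rebuild the initramfs (a sketch; the file names under /etc/modprobe.d/ are only conventions, and the initramfs command depends on the distribution):
# Keep nouveau and nvidia from binding to the GPU on the host
echo "blacklist nouveau" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo "blacklist nvidia" | sudo tee /etc/modprobe.d/blacklist-nvidia.conf
sudo update-initramfs -u   # Debian/Ubuntu; use "sudo dracut -f" on RHEL-like hosts
# After a reboot, confirm which driver (if any) is bound to the GPU
lspci -d 10de: -k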
When I checked the kernel messages on the host machine, I got the following:
|
Please try …. Please also confirm that the VBIOS version meets the requirement. |
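If the NVIDIA driver can see the GPU, the VBIOS version can be read with nvidia-smi (a sketch; either form works on a healthy driver stack):
# Print only the VBIOS version
nvidia-smi --query-gpu=vbios_version --format=csv,noheader
# Or find it in the full query output
nvidia-smi -q | grep -i "VBIOS Version"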
@Tan-YiFan I have reinstalled the nvidia driver on the host, but I still get the following message on the host machine. How can I query the VBIOS version?
When I checked the kernel messages on the host machine, I got the following:
The one piece of good news is that when I run the nvidia_gpu_tools.py command with --recover-broken-gpu, no "broken GPU" message appears. Is it possible that the GPU's firmware is locked or corrupted? |
You can try resetting the GPU if it is stuck in a deadlock or some unrecoverable state:
echo 1 | sudo tee /sys/bus/pci/devices/0000:xx:00.0/reset
(Use tee rather than a plain redirection, since sudo does not apply to the shell's output redirection.) Hope my late response helps. |
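If that function-level reset is refused, removing and rescanning the PCI device is a common fallback (a sketch; 0000:xx:00.0 is a placeholder, and the GPU must not be assigned to a running guest at the time):
# Drop the device from the PCI tree, then let the kernel rediscover it
echo 1 | sudo tee /sys/bus/pci/devices/0000:xx:00.0/remove
echo 1 | sudo tee /sys/bus/pci/rescan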
@hedj17 Please post a full log of nvidia_gpu_tools.py with … |
I have solved this problem by reflashing the firmware of the GPU. The issue was that the firmware version was too old. |
Please contact the vendor of your server (e.g., Dell or SuperMicro, rather than Nvidia) for a firmware update. |
When I am launching a CVM, I get the following warning:
qemu-system-x86_64: -device vfio-pci,host=99:00.0,bus=pci.1: warning: vfio_register_ram_discard_listener: possibly running out of DMA mappings. E.g., try increasing the 'block-size' of virtio-mem devices. Maximum possible DMA mappings: 1048576, Maximum possible memslots: 32764
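That warning is QEMU's generic hint about vfio DMA-mapping limits when virtio-mem is in use; if the VM does use virtio-mem, the block size can be raised on the device, roughly like this (a sketch with placeholder IDs and sizes, not taken from the reporter's command line):
# A larger block-size means fewer discardable blocks and fewer vfio DMA mappings
-object memory-backend-ram,id=vmem0,size=64G \
-device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=32G,block-size=128M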
I also enabled UVM persistence mode via nvidia-persistenced by changing /usr/lib/systemd/system/nvidia-persistenced.service to:
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose
but when I run nvidia-smi, I get an error:
No devices were found
When I run dmesg, I get:
When I try to turn off H100 confidential computing in order to reinitialize it, I get this error:
How can I fix these problems?
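Two checks that may help here (a sketch, assuming the nvidia-persistenced unit edit above): make sure systemd actually picked up the edited unit, and look for NVRM errors in the kernel log, which is usually where "No devices were found" is explained:
# Reload the edited unit and restart the persistence daemon
sudo systemctl daemon-reload
sudo systemctl restart nvidia-persistenced
systemctl status nvidia-persistenced --no-pager
# "No devices were found" generally means driver init failed; NVRM lines in dmesg show why
sudo dmesg | grep -iE "NVRM|nvidia"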