Running out of DMA mapping when launching VM #46
Comments
Can you share your scripts for launching the CVM here?
Thanks for the quick reply. I am using the ...
I do not have access to TDX-enabled servers, so my analysis is based entirely on the related source code. Which stage did you reach after executing this boot script? Did you get any boot log from the guest kernel? The message is produced by QEMU code; hacking into this function (add ...) might help narrow down where things go wrong.
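The comment is truncated here. The warning quoted later in this thread is emitted by vfio_register_ram_discard_listener() in QEMU's hw/vfio/common.c, so a minimal sketch of such a hack, assuming that function and the usual field names (they vary across QEMU versions), could be a debug print added right after the mapping granularity is computed:

```c
/* Sketch only: hw/vfio/common.c, inside vfio_register_ram_discard_listener().
 * Assumes the upstream field names vrdl->granularity and
 * container->dma_max_mappings; adjust to whatever your QEMU tree uses. */
warn_report("vfio granularity for %s: 0x%" PRIx64 ", dma_max_mappings: %u",
            memory_region_name(section->mr), vrdl->granularity,
            container->dma_max_mappings);
```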
It seems like, without proper DMA mappings, drivers cannot communicate with GPUs associated with TDX enclaves.
Could you provide some error logs (such as dmesg from the guest or the host) if they exist?
Every time I launch the TDX enclave, my host produces messages like these:
Then the guest VM produces messages like these. Several of them are about DMA, and the last few appear while I am trying to install the GPU driver. I suspect that, because of the DMA mapping errors, my driver installation is not successful.
Also, in my ...
If I encountered this problem, I would try: ...
I actually booted a non-TDX VM and turned off NVIDIA CC for the H100; in that case I was able to install the driver successfully. I also tried a TDX VM without NVIDIA CC for the H100, and the driver installation was still unsuccessful.
Could you give the result of lsmod?
Below is the lsmod output.
Hi Tan-YiFan, would you mind letting me know your hardware system configuration? We could not make CC work with Intel chips, and we are thinking of shifting to AMD chips. Thanks.
The CPU should support SEV-SNP; the AMD 7003 and 9004 series would be fine. Several previous issues report successfully running H100 CC on AMD servers; you can refer to them.
@Tan-YiFan ...
The suggested motherboard manufacturers are Supermicro and ASRockRack. For further information, you can refer to https://docs.nvidia.com/confidential-computing-deployment-guide.pdf
Can you please provide ...? Did you use the instruction guides for GPU-specific CVMs outlined here? The supported CPUs are those with Intel TDX or AMD SEV-SNP.
Have you solved this problem?
I recently ran into this problem and I might have an answer. The issue is that the NVIDIA H100 requires several huge DMA areas, but QEMU's 'legacy' VFIO routine maps these areas at the lowest granularity the memory backend supports, which under TDX is a 4 KiB page, so a massive number of DMA mappings (and memory slots) is needed.

I see you are not using iommufd. If you can use iommufd, you should definitely do so; it should not have this problem. You can check the QEMU command in NVIDIA's official TDX confidential computing guide, which uses iommufd.

If, like me, you have to use the legacy routine for whatever reason, a nasty workaround that works on my side is to hardcode VFIO's DMA mapping granularity to a higher value (it must be a power of 2). To do that, for the latest available patched QEMU from Intel's TDX tree (https://github.com/intel/tdx-linux/tree/device-passthrough), change line 417 of hw/vfio/common.c (in function ...
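The comment is cut off at the function name. For reference, a sketch of the described workaround, assuming the granularity assignment in vfio_register_ram_discard_listener() in hw/vfio/common.c looks like it does upstream (the exact line number differs between trees, and 2 MiB is just one possible power-of-two choice):

```c
/* hw/vfio/common.c, vfio_register_ram_discard_listener() -- sketch only.
 * Upstream derives the mapping granularity from the smallest page size the
 * container supports, which is 4 KiB under TDX:
 *
 *     vrdl->granularity = MAX(1ULL << ctz64(container->pgsizes),
 *                             ram_discard_manager_get_min_granularity(rdm,
 *                                                                     section->mr));
 *
 * Workaround: force a larger power-of-two granularity (here 2 MiB, from
 * "qemu/units.h") so far fewer DMA mappings are needed for the same amount
 * of guest memory.
 */
vrdl->granularity = MAX(2 * MiB,
                        ram_discard_manager_get_min_granularity(rdm,
                                                                section->mr));
```

Keeping the MAX() against the RamDiscardManager's minimum granularity avoids going below what the virtio-mem backend itself requires.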
When launching a CVM, I got the following error:
qemu-system-x86_64: -device vfio-pci,host=b0:00.0,bus=pci.1: warning: vfio_register_ram_discard_listener: possibly running out of DMA mappings. E.g., try increasing the 'block-size' of virtio-mem devies. Maximum possible DMA mappings: 65535, Maximum possible memslots: 32764
But I set the VM memory to 126 GB. What configuration do I need to change to fix this error?
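For a rough sense of why the limit is exceeded (my own arithmetic based on the 4 KiB granularity discussed above, not numbers from the thread): roughly 126 GiB of guest memory split into 4 KiB mappings is about 33 million potential DMA mappings, vastly more than the 65,535 the warning reports, whereas at a 2 MiB granularity the same memory needs only about 64,512 mappings, which just fits under the limit.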