nodeinstaller: align default_memory and overhead to actual usage #1196
This PR corrects the calculation of the `RuntimeClass` overhead.

First, a bit of background on Kubernetes resource accounting. The total memory footprint of a pod is the sum of the individual container requests/limits and the pod overhead, taken from the PodSpec or from the RuntimeClass. This total is accounted towards the node's memory commitment, so we should aim to get it right in order to avoid overcommitment and idle resources.
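For reference, this is roughly how such an overhead is declared on a RuntimeClass; the object name and handler below are placeholders rather than what this PR ships, only the `overhead.podFixed.memory` value matches the change:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-example        # placeholder name
handler: kata-example       # placeholder Kata handler configured in the CRI runtime
overhead:
  podFixed:
    memory: 320Mi           # accounted on top of the container requests/limits
```

A pod selects this class via `runtimeClassName: kata-example`, and the scheduler and kubelet then reserve the container requests plus the fixed 320Mi on the node.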
We have three sources that add to the total memory used by a pod:

- the static `default_memory` assigned to the guest VM,
- the memory added to the VM for the containers' own requests/limits, and
- the per-pod overhead on the host.

Ignoring the pod memory requests for now, this implies immediately that the RuntimeClass overhead is the sum of `default_memory` and the host overhead.
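In symbols (the $R_i$ shorthand for the per-container requests/limits is mine, not from the PR): what Kubernetes charges for the pod has to match what the pod actually consumes on the node, so

$$
\sum_i R_i + \text{overhead} \;=\; \sum_i R_i + \texttt{default\_memory} + \text{host overhead}
\quad\Longrightarrow\quad
\text{overhead} = \texttt{default\_memory} + \text{host overhead}.
$$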
Empirically (see below), we observe that not all memory allotted to a VM is actually usable by the containers, because there are daemons running in the VM and the Linux kernel reserves some memory for itself. This guest overhead needs to be accounted for in the `default_memory` setting. It follows that `default_memory` needs to be (at least) the sum of the memory used by daemons and the memory that is not available to userland.

Rounding up the numbers and optimizing for <2GiB containers, we arrive at <56MiB for daemons, <200MiB for the kernel and <50MiB for host overhead. Thus:
- `default_memory = 256Mi`
- `overhead = 320Mi`
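Spelled out with the numbers above (reading the step from 306Mi to 320Mi as rounding up, which the PR text does not state explicitly):

$$
\texttt{default\_memory} \;\ge\; 56\,\text{Mi} + 200\,\text{Mi} = 256\,\text{Mi}
$$

$$
\text{overhead} \;\ge\; 256\,\text{Mi} + 50\,\text{Mi} = 306\,\text{Mi} \;\approx\; 320\,\text{Mi}
$$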
In order to properly account for the resources in k8s, we need to configure Kata to use the pod cgroups. This is accomplished by setting `sandbox_cgroup_only = true`. The concepts are explained in more detail in https://github.com/kata-containers/kata-containers/blob/f9bbe4e4396cde04ce8162aead1d12b4f3e7bc96/docs/design/host-cgroups.md.
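A minimal sketch of where these knobs live in the Kata configuration, assuming the stock `configuration.toml` layout (the exact file and section names depend on the hypervisor and are not taken from this PR):

```toml
# Illustrative excerpt of a Kata configuration.toml.
[hypervisor.clh]
# Static memory assigned to each guest VM at boot, in MiB.
default_memory = 256

[runtime]
# Run the VMM and other host-side sandbox processes inside the pod cgroup,
# so that k8s accounting matches what the host actually charges.
sandbox_cgroup_only = true
```

With `sandbox_cgroup_only = true`, the host-side overhead is constrained by the pod cgroup, which the kubelet sizes from the container limits plus the RuntimeClass overhead.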
## Memory usage and overhead survey

I launched pods with varying memory limits and recorded the overhead both inside and outside the VM. Most notable findings:
- Memory used inside the guest (`free -m` used) grows a little with the VM size, but still fits into 70MiB even for large workloads.
- The overhead inside the VM (`Overhead VM`) scales linearly with the VM size.
- The overhead on the host (`Overhead host`) is constant and on the order of 50MiB in the worst case.

Detailed measurements were collected on two platforms: AKS-CLH-SNP and K3s-QEMU-TDX.