nodeinstaller: align default_memory and overhead to actual usage #1196

Draft
wants to merge 1 commit into main

Conversation

@burgerdev (Contributor) commented on Jan 29, 2025:

This PR corrects the calculation of the RuntimeClass overhead.

First, a bit of background on Kubernetes resource accounting. The total memory footprint of a pod is the sum of the individual container requests/limits plus the overhead from the PodSpec or from the RuntimeClass. This total counts against the node's memory commitment, so we should aim to get it right in order to avoid both overcommitting and idling resources.
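
For illustration, here is a minimal, hypothetical pod sketch of how this adds up. The names and image are placeholders, and the 320Mi overhead is the value proposed further below:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: overhead-demo              # placeholder name, not part of this PR
spec:
  runtimeClassName: contrast-cc    # assumed Kata-based runtime class
  containers:
    - name: app
      image: registry.example/app:latest   # placeholder image
      resources:
        limits:
          memory: 1Gi
# Assuming a RuntimeClass overhead of 320Mi, the scheduler charges
# 1Gi + 320Mi against the node's allocatable memory for this pod.
```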

Ignoring the pod memory requests for now, we have three sources that add to the total memory used by a pod:

  • Size of the VM, which is the sum of
    • the Kata runtime setting default_memory
    • the individual container limits
  • VM-related processes on the host (i.e. Kata shim, qemu/CLH)

This implies immediately that the RuntimeClass overhead is the sum of default_memory and the host overhead.

Empirically (see below), we observe that not all memory allotted to a VM is actually usable by the containers, because there are daemons running in the VM and the Linux kernel reserves some memory for itself. This guest overhead needs to be accounted for in the default_memory setting. It follows that default_memory needs to be (at least) the sum of the memory used by daemons and the memory that is not available to userland.

Rounding up the numbers and optimizing for containers below 2GiB, we arrive at <56MiB for daemons, <200MiB for the kernel and <50MiB for host overhead. Thus (see the RuntimeClass sketch below):

  • default_memory = 256Mi (daemons plus kernel reservation)
  • overhead = 320Mi (default_memory plus host overhead, rounded up)
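
For illustration, these values would translate into a RuntimeClass roughly like the following sketch. The name and handler are placeholders, not taken from this PR:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: contrast-cc       # placeholder name
handler: contrast-cc      # placeholder handler
overhead:
  podFixed:
    memory: 320Mi         # default_memory plus host overhead, as derived above
```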

In order to properly account for the resources in k8s, we need to configure Kata to use the pod cgroups. This is accomplished by setting sandbox_cgroup_only = true. The concepts are explained in more detail at https://github.com/kata-containers/kata-containers/blob/f9bbe4e4396cde04ce8162aead1d12b4f3e7bc96/docs/design/host-cgroups.md.
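
For reference, a minimal sketch of how these settings could look in the Kata configuration.toml. The section header depends on the hypervisor in use; the values are the ones proposed above:

```toml
[hypervisor.clh]              # or [hypervisor.qemu], depending on the platform
default_memory = 256          # MiB; covers guest kernel reservation and in-guest daemons

[runtime]
sandbox_cgroup_only = true    # charge all sandbox resources to the pod cgroup
```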


Memory usage and overhead survey

I launched pods with varying memory limits and recorded the overhead both inside and outside the VM. Most notable findings:

  • The footprint of guest components unrelated to the workload (free -m used) grows slightly with the VM size, but still fits into 70MiB even for large workloads.
  • The portion of the VM unavailable to the workload (Overhead VM, i.e. the theoretical VM memory minus free -m total) scales linearly with VM size.
  • The overhead outside the VM (Overhead host, i.e. the pod cgroup usage minus the theoretical VM memory) is constant and on the order of 50MiB in the worst case.

AKS-CLH-SNP

| default_memory [MiB] | Container limit [MiB] | Theoretical VM mem [MiB] | free -m total [MiB] | free -m used [MiB] | cgroup memory.current [B] | Overhead VM [MiB] | Overhead host [MiB] |
|---|---|---|---|---|---|---|---|
| 256 | 50 | 306 | 197 | 27 | 335077376 | 109 | 14 |
| 256 | 150 | 406 | 295 | 26 | 432263168 | 111 | 6 |
| 256 | 500 | 756 | 640 | 30 | 801509376 | 116 | 8 |
| 256 | 1024 | 1280 | 1090 | 32 | 1353519104 | 190 | 11 |
| 256 | 2048 | 2304 | 1965 | 31 | 2431700992 | 339 | 15 |
| 256 | 4096 | 4352 | 3716 | 47 | 4587950080 | 636 | 23 |

K3s-QEMU-TDX

| default_memory [MiB] | Container limit [MiB] | Theoretical VM mem [MiB] | free -m total [MiB] | free -m used [MiB] | cgroup memory.current [B] | Overhead VM [MiB] | Overhead host [MiB] |
|---|---|---|---|---|---|---|---|
| 256 | 50 | 306 | 207 | 49 | 355393536 | 99 | 33 |
| 256 | 150 | 406 | 305 | 49 | 466104320 | 101 | 39 |
| 256 | 500 | 756 | 650 | 52 | 841797632 | 106 | 47 |
| 256 | 1024 | 1280 | 1100 | 53 | 1387155456 | 180 | 43 |
| 256 | 2048 | 2304 | 1976 | 62 | 2462515200 | 328 | 44 |
| 256 | 4096 | 4352 | 3727 | 74 | ? | 625 | ? |

@burgerdev added the "bug fix" (Fixing a user facing bug) label on Jan 29, 2025
@katexochen (Member) commented:

> I launched pods with varying memory limits and recorded the overhead both inside and outside the VM. Most notable findings:

Over what period did you conduct your observation?

@burgerdev (Contributor, Author) commented:

> > I launched pods with varying memory limits and recorded the overhead both inside and outside the VM. Most notable findings:
>
> Over what period did you conduct your observation?

I started the pod with a minimal container and took the measurements once that container was ready. I did not watch the memory of an individual pod over time.
