nodeinstaller: align default_memory and overhead to actual usage #1196

Draft
wants to merge 1 commit into main

Conversation

@burgerdev (Contributor) commented on Jan 29, 2025:

This PR corrects the calculation of the RuntimeClass overhead.

First, a bit of background on Kubernetes resource accounting. The total memory footprint of a pod is the sum of the individual container requests/limits plus the overhead from the PodSpec or from the RuntimeClass. This total counts against the node's memory commitment, so we should aim to get it right in order to avoid both overcommitting and idling resources.
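
For illustration, here is a minimal, hypothetical pod sketch of how this adds up. The names and image are placeholders, and the 320Mi overhead is the value proposed further below:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: overhead-demo              # placeholder name, not part of this PR
spec:
  runtimeClassName: contrast-cc    # assumed Kata-based runtime class
  containers:
    - name: app
      image: registry.example/app:latest   # placeholder image
      resources:
        limits:
          memory: 1Gi
# Assuming a RuntimeClass overhead of 320Mi, the scheduler charges
# 1Gi + 320Mi against the node's allocatable memory for this pod.
```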

Ignoring the pod memory requests for now, we have three sources that add to the total memory used by a pod:

  • Size of the VM, which is the sum of
    • the Kata runtime setting default_memory
    • the individual container limits
  • VM-related processes on the host (i.e. Kata shim, qemu/CLH)

This implies immediately that the RuntimeClass overhead is the sum of default_memory and the host overhead.

Empirically (see below), we observe that not all memory allotted to a VM is actually usable by the containers, because there are daemons running in the VM and the Linux kernel reserves some memory for itself. This guest overhead needs to be accounted for in the default_memory setting. It follows that default_memory needs to be (at least) the sum of the memory used by daemons and the memory that is not available to userland.

Rounding up the numbers and optimizing for containers below 2GiB, we arrive at <56MiB for daemons, <200MiB for the kernel and <50MiB for host overhead. Thus (see the RuntimeClass sketch below):

  • default_memory = 256Mi (daemons plus kernel reservation)
  • overhead = 320Mi (default_memory plus host overhead, rounded up)
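
For illustration, these values would translate into a RuntimeClass roughly like the following sketch. The name and handler are placeholders, not taken from this PR:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: contrast-cc       # placeholder name
handler: contrast-cc      # placeholder handler
overhead:
  podFixed:
    memory: 320Mi         # default_memory plus host overhead, as derived above
```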

In order to properly account for the resources in k8s, we need to configure Kata to use the pod cgroups. This is accomplished by setting sandbox_cgroup_only = true. The concepts are explained in more detail at https://github.com/kata-containers/kata-containers/blob/f9bbe4e4396cde04ce8162aead1d12b4f3e7bc96/docs/design/host-cgroups.md.
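
For reference, a minimal sketch of how these settings could look in the Kata configuration.toml. The section header depends on the hypervisor in use; the values are the ones proposed above:

```toml
[hypervisor.clh]              # or [hypervisor.qemu], depending on the platform
default_memory = 256          # MiB; covers guest kernel reservation and in-guest daemons

[runtime]
sandbox_cgroup_only = true    # charge all sandbox resources to the pod cgroup
```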


Memory usage and overhead survey

I launched pods with varying memory limits and recorded the overhead both inside and outside the VM. Most notable findings:

  • The footprint of guest components unrelated to the workload (free -m used) grows slightly with the VM size, but still fits into 70MiB even for large workloads.
  • The portion of the VM unavailable to the workload (Overhead VM, i.e. the theoretical VM memory minus free -m total) scales linearly with VM size.
  • The overhead outside the VM (Overhead host, i.e. the pod cgroup usage minus the theoretical VM memory) is constant and on the order of 50MiB in the worst case.

AKS-CLH-SNP

| default_memory [MiB] | Container limit [MiB] | Theoretical VM mem [MiB] | free -m total [MiB] | free -m used [MiB] | cgroup memory.current [B] | Overhead VM [MiB] | Overhead host [MiB] |
|---|---|---|---|---|---|---|---|
| 256 | 50 | 306 | 197 | 27 | 335077376 | 109 | 14 |
| 256 | 150 | 406 | 295 | 26 | 432263168 | 111 | 6 |
| 256 | 500 | 756 | 640 | 30 | 801509376 | 116 | 8 |
| 256 | 1024 | 1280 | 1090 | 32 | 1353519104 | 190 | 11 |
| 256 | 2048 | 2304 | 1965 | 31 | 2431700992 | 339 | 15 |
| 256 | 4096 | 4352 | 3716 | 47 | 4587950080 | 636 | 23 |

K3s-QEMU-TDX

| default_memory [MiB] | Container limit [MiB] | Theoretical VM mem [MiB] | free -m total [MiB] | free -m used [MiB] | cgroup memory.current [B] | Overhead VM [MiB] | Overhead host [MiB] |
|---|---|---|---|---|---|---|---|
| 256 | 50 | 306 | 207 | 49 | 355393536 | 99 | 33 |
| 256 | 150 | 406 | 305 | 49 | 466104320 | 101 | 39 |
| 256 | 500 | 756 | 650 | 52 | 841797632 | 106 | 47 |
| 256 | 1024 | 1280 | 1100 | 53 | 1387155456 | 180 | 43 |
| 256 | 2048 | 2304 | 1976 | 62 | 2462515200 | 328 | 44 |
| 256 | 4096 | 4352 | 3727 | 74 | ? | 625 | ? |

@burgerdev added the "bug fix" (Fixing a user facing bug) label on Jan 29, 2025
@katexochen (Member) commented:

> I launched pods with varying memory limits and recorded the overhead both inside and outside the VM. Most notable findings:

Over what period did you conduct your observation?

@burgerdev (Contributor, Author) commented:

> > I launched pods with varying memory limits and recorded the overhead both inside and outside the VM. Most notable findings:
>
> Over what period did you conduct your observation?

I started the pod with a minimal container and took the measurements once that container was ready. I did not watch the memory of an individual pod over time.
