High cpu usage when running ListContainerStats #135
Comments
This issue makes it impossible to run K8S >=1.24 and is a blocker for users relying on the Dockershim. Mirantis promised to take over responsibility for the Dockershim (https://www.mirantis.com/blog/the-future-of-dockershim-is-cri-dockerd/) and it is simply not working out. So please, fix this issue urgently, as people otherwise have to stay on 1.22/23 or migrate to containerd.
Hi @phoenix-bjoern. I don't work on cri-dockerd beyond providing issue triage, but I do want to address your statement, as I don't want others to be misled. Mirantis is maintaining cri-dockerd in collaboration with Docker Inc. and the container community as a community-driven open source project. As I am sure you can infer, the two sets of named contributors are most interested in supporting their own products. While we are all engineers very interested in the container and Kubernetes space and we want to solve all the issues we come across, the reality is that there is limited time in the day and that the companies who write our paychecks get to dictate our priorities.

The performance regression in this issue appears to take place in some Kubernetes distros due to an upstream change in Kubernetes. It does not affect all users, or indeed, the most prolific users of cri-dockerd. As such I expect that improvements will come from the Rancher folks (who work regularly on this component), from another member of the community who takes it upon themselves to solve this, or from the Mirantis Kubernetes Engine team, in order of timeliness. Again, that is not to say that there is no desire to investigate this; it is just to say that when allocating limited engineering resources, it will take some time to justify working on an issue that does not affect any of the Kubernetes deployments/distributions the regular contributors work on or are familiar with.

Bottom line up front: Mirantis maintains this repository for the community and its own products. That means that Mirantis is committed to delivering its own changes under an open source license, and that Mirantis is curating the issue tracker and providing code review/maintainer feedback as free cycles permit. It does not mean that Mirantis has pledged to fix every issue reported on a fixed timeline, or that you have a support/contractual relationship with Mirantis. Please keep in mind this is not an official statement by any means, and that I am not part of any of the teams mentioned above.
This appears to be almost exactly the same issue as #127. We'll see if we can find some kind of workaround.

The basic comment there is that KEP 2371 is not in a good state, the more or less official response from upstream is "either dramatically reduce the stats you collect, or deploy full-bore cAdvisor yourself", and trying to have a bare CRI inject (or deploy, anyway) a full cAdvisor stack to work around that doesn't leave a lot of alternatives here. We're happy to work with Rancher (and Mirantis will have to do something ourselves in Mirantis Kubernetes Engine, but that's honestly likely to be deploying cAdvisor sets), but there isn't a lot we can do technically other than writing a temporary layer that we'll have to throw away when/if 2371 is ever ready.

The memoizing/caching patch in the original comment looks reasonable, and implementing it is something we need to have an actual discussion about to see how much effort RKE wants to put into this (the best workaround would be to just deploy full-scale cAdvisor as part of k3s/RKE and stop relying on the kubelet at all, considering the stutter-stop state of the KEP). Memoizing/caching carries a very real possibility of getting results which do not reflect reality depending on the Prometheus scrape interval, especially for spiky workloads (Kubeflow, for example).
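To make the memoizing/caching tradeoff discussed above concrete, here is a minimal Go sketch of a TTL cache in front of a per-container stats fetch. Everything in it (the `ContainerStats` struct, the `statsFetcher` type, the 15-second TTL) is a hypothetical stand-in for illustration, not cri-dockerd's actual code; the point is that any returned sample can be up to one TTL stale relative to a Prometheus scrape.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// ContainerStats is a stand-in for the per-container stats type; the real
// CRI type is richer.
type ContainerStats struct {
	ID        string
	CPUNanos  uint64
	Timestamp time.Time
}

// statsFetcher is whatever actually talks to the Docker daemon; it is an
// assumption here, not cri-dockerd's real function.
type statsFetcher func(ctx context.Context, id string) (*ContainerStats, error)

type cachedEntry struct {
	stats     *ContainerStats
	fetchedAt time.Time
}

// statsCache memoizes per-container stats for a fixed TTL. Results can be up
// to `ttl` stale, which is exactly the staleness risk called out above for
// Prometheus scrapes and spiky workloads.
type statsCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetch   statsFetcher
	entries map[string]cachedEntry
}

func newStatsCache(ttl time.Duration, fetch statsFetcher) *statsCache {
	return &statsCache{ttl: ttl, fetch: fetch, entries: map[string]cachedEntry{}}
}

// Get returns the cached sample if it is fresh enough, otherwise refetches.
func (c *statsCache) Get(ctx context.Context, id string) (*ContainerStats, error) {
	c.mu.Lock()
	e, ok := c.entries[id]
	c.mu.Unlock()
	if ok && time.Since(e.fetchedAt) < c.ttl {
		return e.stats, nil
	}
	s, err := c.fetch(ctx, id)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.entries[id] = cachedEntry{stats: s, fetchedAt: time.Now()}
	c.mu.Unlock()
	return s, nil
}

func main() {
	fetch := func(ctx context.Context, id string) (*ContainerStats, error) {
		// Placeholder: a real fetcher would call the Docker API here.
		return &ContainerStats{ID: id, Timestamp: time.Now()}, nil
	}
	cache := newStatsCache(15*time.Second, fetch)
	_, _ = cache.Get(context.Background(), "example-container")
}
```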
An update from an affected Ubuntu 20.04 cluster: the high load went back to normal after upgrading the kernel from 5.4.0-132 to 5.4.0-135. Not sure if the containerd fix (https://launchpad.net/ubuntu/+source/linux/5.4.0-135.152) has something to do with it.
@evol262 Since kubernetes/kubernetes#106080 (comment) was before the KEP status moved to beta, I assumed it was only a workaround for the time being. But looking at the status now, it does seem like this won't get implemented anytime soon because it will still be in alpha for 1.25. I am okay with making the changes around caching.

As for the workaround of deploying cAdvisor with k3s/RKE, I would like to implement and test this out because it would unblock a lot of users. So cri-dockerd will be updated to fetch metrics from cAdvisor and aggregate the results (as kubelet did before the CRI refactoring)? If so, I am a little confused. (Also, my original understanding from the KEP was that all container runtimes will be responsible for full stats and not cAdvisor. Would we be going against the upstream expectation here?)
@phoenix-bjoern Thanks much for the update! I don't see a huge improvement on my setup, but this is not surprising; I have seen varying results from different users. It comes down to how fast metrics are returned, which depends on a lot of factors, but the main issue in general seems to be how long (and how often) cri-dockerd takes to get metrics using the Docker client.
No, we are not going to implement some subset of Prometheus. cri-dockerd would not be updated to fetch metrics from cAdvisor. Instead, the Prometheus deployment in k3s/RKE would point at "full" cAdvisor. This is the recommendation from upstream.

No, we would not be going against upstream. Stats collection from "pure" CRI will be implemented when the KEP is ready (or at least graduated from beta). The understanding from the KEP (that container runtimes will be responsible for full stats and not cAdvisor) was right -- it's more that the KEP kept getting pulled back, and cAdvisor was removed anyway, leaving it in kind of a no-man's land.

cri-dockerd is (still) dockershim and doesn't poll any more often (or take longer) than dockershim did. It's being polled far more often by the kubelet now. I'll (still) look into whether there's some kind of maintainable workaround until the KEP is ready, but "full" cAdvisor and polling that from Prometheus in platform solutions is still the recommendation, both upstream and here.
@evol262 Yeah, I agree with the Prometheus deployment pointing to full cAdvisor. Though the CPU spike we're seeing currently is without any monitoring deployed in the cluster - it's only from internal autoscaling pipelines (I checked only metrics-server but would assume it to be the same elsewhere). Yep, I see cri-dockerd is still dockershim and the stats codebase is also pretty similar, without any big changes.
Has anyone tested if they still have high CPU usage after #233 was added? |
@nwneisen I haven't experienced this issue again (currently using Rancher 2.8.0, RKE1, K8S 1.26). @kinarashah released several fixes for it, but I don't remember which one exactly resolved the issue.
I think that Rancher ended up using a forked kubelet to resolve this issue (see the rancher issue here rancher/rancher#38816 and PR here rancher/kubernetes#74). There is an ongoing discussion in rancher/rke#3409 to see if there is still a need for this forked kubelet as #233 is now merged. So from what I understand, #233 still needs to be tested in a Rancher/RKE context without the forked kubelet. |
Seeing CPU spike to 200%+ for `dockerd` on an RKE1 1.24 cluster. 1.24 is the same as 1.23 with `cri-dockerd` enabled, so I am using 1.23 to observe the difference between enabling and disabling `cri-dockerd`. rancher/rancher#38816 has more context.

The issue is easily observed without running any extra user workloads as well, but a simple way to reproduce is:

- `kubectl get --raw /api/v1/nodes/$NODE_NAME/proxy/stats/summary`
- enable `enable_cri_dockerd` as `true` and monitor CPU usage

I haven't been able to reproduce it on `overlay2`, but users have helped test it out. `ContainerStatsOneShot` was introduced in v0.2.6 and I've run all tests on v0.2.6 to make sure it's accounted for. Would appreciate any help in resolving this; things I have tried so far:

Disabling `cri-dockerd` on <1.24 clusters shows minimal CPU usage; the difference is evident on a single node too (CPUs: 4, Memory: 15.61GiB):

- w/o cri-dockerd: <10%; while deploying new pods it goes up to ~50%, then drops back down
- w/ cri-dockerd: 250%+, without deploying any new pods

There's a big difference in the CPU usage between Docker storage drivers. Based on strace for `overlay`, it seems to be iterating/looping over content in /var/lib/rancher-data/local-catalogs/, repeatedly calling `lgetxattr()` and `newfstatat()`, which seemed to come from the calculation of `WriteableLayer.UsedBytes`. Tried removing `InspectContainerWithSize`, but I am not sure what the bottleneck is in the `/containers/"+containerID+"/json?size=1` calls (see the sketch below):

- w/o InspectContainerWithSize: <20%; adding ~250 new pods, goes up to ~30%
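As a hedged illustration of what dropping the size calculation looks like at the Docker SDK level (this is not cri-dockerd's actual code): the Go client's `ContainerInspectWithRaw` takes a `getSize` flag, and passing `false` avoids the `?size=1` disk-usage walk, at the cost of `SizeRw`/`SizeRootFs` (and therefore `WriteableLayer.UsedBytes`) not being populated. The container ID below is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	containerID := "example-container" // hypothetical ID for illustration

	// getSize=true triggers the expensive rw-layer size walk behind
	// /containers/{id}/json?size=1; getSize=false skips it, so SizeRw and
	// SizeRootFs stay nil and the writable-layer usage cannot be reported.
	info, _, err := cli.ContainerInspectWithRaw(context.Background(), containerID, false)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(info.State.Status, info.SizeRw, info.SizeRootFs)
}
```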
Looks like this is because earlier upstream k8s relied on cAdvisor stats: https://github.com/kubernetes/kubernetes/blob/v1.22.16/pkg/kubelet/kubelet.go#L710. Is there a way to keep using the cAdvisor provider as a workaround? Changing the code to force the use of cAdvisor stats doesn't work with the new runtime changes; it fails at https://github.com/kubernetes/kubernetes/blob/v1.24.8/pkg/kubelet/stats/cadvisor_stats_provider.go#L84

Updating the code to cache stats (I believe the changes would need more refinement to manage container add/remove and to force an update when a container ID is not found; these are experimental changes only). The numbers are still high enough without deploying any user pods, and are bound to get worse:

- Cache per container ID with refreshPeriod=15s, and `ListContainerStats` only reads values from the cached containerStats: no major improvement and a CPU spike is observed every time, https://github.com/Mirantis/cri-dockerd/compare/master...kinarashah:cri-dockerd:opt?expand=1
  - w/ container caching: 250%+
- Goroutines were introduced to resolve the metrics-server deadline-exceeded issues; using the cache + removing goroutines to see if there was an impact: https://github.com/Mirantis/cri-dockerd/compare/master...kinarashah:cri-dockerd:opt2?expand=1
  - w/ container caching + no goroutine: <150%
- Cache the entire result `[]*runtimeapi.ContainerStats` with the same refreshPeriod, and list simply returns the stats list: https://github.com/Mirantis/cri-dockerd/compare/master...kinarashah:cri-dockerd:opt3?expand=1
  - w/ stat list caching + no goroutine: <150%
  - If we go down this path, I think it would make sense to introduce `ListPodSandboxStats` and switch to strictly collecting stats from CRI.
- Combining changes from 3 and 4:
  - container caching + no goroutine + w/o InspectContainerWithSize: <10% (same as w/o cri-dockerd)

Anything else I could try out? Thank you!
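For reference, the `ContainerStatsOneShot` call mentioned above corresponds to the one-shot (non-streaming) stats endpoint added in Docker API 1.41. A minimal sketch with the Docker Go SDK, assuming a reachable daemon, a placeholder container ID, and an SDK version contemporary with this thread (where `types.StatsJSON` is still available):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	containerID := "example-container" // placeholder for illustration

	// One-shot stats: the daemon returns a single sample instead of a stream,
	// so the caller does not hold a streaming connection open per container
	// on every poll.
	resp, err := cli.ContainerStatsOneShot(context.Background(), containerID)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var s types.StatsJSON
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		log.Fatal(err)
	}
	fmt.Println("cpu total usage:", s.CPUStats.CPUUsage.TotalUsage)
}
```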