Improve container and imagefs stats #233
LGTM! Thank you!
97c3d86 to 5af63d2
I pushed a few more commits over the weekend based on my testing and debugging.
I changed to use the home-made cache. (Even when using this home-made cache, there might be heavy lock contention for
The From the analysis of |
5af63d2 to e02e041
core/imagefs_linux.go
Outdated
    info, err := ds.getDockerInfo()
    if err != nil {
        logrus.Error(err, "Failed to get docker info")
        return nil, err
    }

    bytes, inodes, err := dirSize(filepath.Join(info.DockerRootDir, "image"))
This additional work looks promising to me (need to spend more time looking at specifics later in the week when I find time); it's worth noting that this will conflict with/obsolete #221.
This will also have similar problems when the containerd snapshotter backend is used with dockerd (not that anyone is likely yet using cri-dockerd against Moby 24.0 in production with this feature) as I mentioned in #221 (comment).
cc @surik
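For readers following the diff above, here is a minimal sketch of what a directory walk behind a call like `dirSize` could look like. This is an assumption for illustration, not the PR's actual implementation: the body of `dirSize` and the `makeSampleTree` helper are hypothetical, and the entry count is only a rough stand-in for inode usage (hard links would be counted more than once).

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// dirSize is a hypothetical sketch: it walks root and returns the total size
// of regular files plus the number of entries visited (root included).
func dirSize(root string) (bytes uint64, inodes uint64, err error) {
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		inodes++
		if info.Mode().IsRegular() {
			bytes += uint64(info.Size())
		}
		return nil
	})
	return bytes, inodes, err
}

// makeSampleTree builds a small throwaway tree for the demo (hypothetical helper).
func makeSampleTree() (root string, cleanup func(), err error) {
	root, err = os.MkdirTemp("", "imagefs-demo")
	if err != nil {
		return "", nil, err
	}
	cleanup = func() { os.RemoveAll(root) }
	layerDir := filepath.Join(root, "image", "layerdb")
	if err = os.MkdirAll(layerDir, 0o755); err != nil {
		cleanup()
		return "", nil, err
	}
	if err = os.WriteFile(filepath.Join(layerDir, "a"), []byte("hello"), 0o644); err != nil {
		cleanup()
		return "", nil, err
	}
	if err = os.WriteFile(filepath.Join(root, "image", "b"), []byte("abc"), 0o644); err != nil {
		cleanup()
		return "", nil, err
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := makeSampleTree()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()
	bytes, inodes, err := dirSize(root)
	fmt.Println(bytes, inodes, err)
}
```

A walk like this only sees the directory it is pointed at, which is why the choice of root matters for the numbers reported to kubelet.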
e02e041 to 7955ae3
Sorry for the messy commits, I just squashed them. Highlighted recent changes:
7955ae3 to 37ca72f
Looks like store.NewObjectCache is unused with these changes -- can we please add removing it to this PR?
Done. Thanks for pointing it out.
@neersighted
b90c208 to 4bbac8b
Some new updates:
New commits have been rebased onto the master branch and squashed.
a328876 to 42f5668
- Use pipeline pattern for controlling the concurrency of ListContainerStats
- decouple computing container RW layer size and cache it.
- for linux, calculate imageFS using statfs and docker images api
- cache imageFS stats
- pick up kubernetes#104287

Signed-off-by: Xinfeng Liu <[email protected]>

remove store/object_cache.go

Signed-off-by: Xinfeng Liu <[email protected]>

decouple container RW layer size calculation

Signed-off-by: Xinfeng Liu <[email protected]>

use docker images api for imagefs usedBytes

Signed-off-by: Xinfeng Liu <[email protected]>
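To make the first item of the commit message concrete, here is a minimal sketch of bounding concurrency with a fixed pool of workers fed from a channel. It is an illustration under assumptions, not the code in this PR: `statsFunc`, `workerCount`, and `listStats` are hypothetical names, and the "half of the CPUs" rule is taken from the PR description.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// statsFunc is a hypothetical stand-in for the per-container stats call.
type statsFunc func(id string) (string, error)

// workerCount applies the "at most half of the CPUs" rule, never below one.
func workerCount(numCPU int) int {
	if n := numCPU / 2; n > 1 {
		return n
	}
	return 1
}

// listStats fans container IDs out to a fixed number of workers and returns
// the successful results in input order; failed containers are skipped.
func listStats(ids []string, workers int, get statsFunc) []string {
	type result struct {
		idx  int
		stat string
		err  error
	}
	jobs := make(chan int)
	results := make(chan result)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				s, err := get(ids[i])
				results <- result{i, s, err}
			}
		}()
	}
	// Stage 1: feed the jobs, then signal that no more work is coming.
	go func() {
		for i := range ids {
			jobs <- i
		}
		close(jobs)
	}()
	// Close results once every worker has drained the job channel.
	go func() {
		wg.Wait()
		close(results)
	}()

	ordered := make([]string, len(ids))
	ok := make([]bool, len(ids))
	for r := range results {
		if r.err == nil {
			ordered[r.idx], ok[r.idx] = r.stat, true
		}
	}
	out := make([]string, 0, len(ids))
	for i, s := range ordered {
		if ok[i] {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	ids := []string{"c1", "c2", "c3", "c4", "c5"}
	stats := listStats(ids, workerCount(runtime.NumCPU()), func(id string) (string, error) {
		return "stats:" + id, nil
	})
	fmt.Println(stats)
}
```

With this shape, the number of in-flight stats calls is capped by `workers` regardless of how many containers the node runs.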
42f5668 to 21b8469
Signed-off-by: Xinfeng Liu <[email protected]>
@nwneisen I pushed a new commit per your recent feedback. Please let me know if there's anything else I need to do.
@xinfengliu The changes look good. I will go ahead and merge this in. Thank you for the PR.
after update to
We have seen high CPU usage in dockerd and cri-dockerd in some cases. The profiling data shows that the high CPU usage was caused by calculating the container RW layer size. There is also an issue report in #135.

This PR does:

- Limit the concurrency of ListContainerStats. Currently the concurrency equals the number of k8s containers, which can cause a thundering-herd load on big nodes running a large number of pods. This PR uses the pipeline pattern to use at most half of the number of CPUs for ListContainerStats. Please see later comments in this PR for updates.
- Cache container stats with a dynamic TTL based on the time taken to get the stats. Use another cache implementation to overcome the shortcomings of k8s.io/client-go/tools/cache listed below:
  - InspectContainerWithSize could take minutes to finish. If multiple requests come in, multiple InspectContainerWithSize calls run at once, adding high load to the system.
  - lastTimeTaken is added to the cache TTL to make the TTL dynamic. Normally the operation takes some milliseconds or a few seconds, so adding lastTimeTaken does not change the TTL much. Some resource-consuming operations may take minutes to finish. In that case it does not make sense to keep a low TTL, since the stats are already not real-time, and a longer TTL helps reduce the load.
- Cache imageFS stats.
- For linux, calculate imageFS using the docker root directory (obsoletes "Fix evaluation of usage for ImageFsInfo method" #221). ImageFs means the filesystem that container runtimes use to store container images and container writable layers. Counting only the image directory itself is wrong and does not help kubelet make correct decisions for garbage collection or eviction.
- Rename image_linux.go and image_windows.go to imagefs_linux.go and imagefs_windows.go to better reflect the purpose of the files.
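The caching idea in the description can be sketched as follows. This is a hedged illustration, not the cache added by this PR: `ttlCache`, `entry`, `newTTLCache`, and `defaultBaseTTL` are hypothetical names. The sketch assumes two properties taken from the description: concurrent requests for the same key should trigger only one expensive load, and the TTL grows by the time the last load took (`lastTimeTaken`).

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// defaultBaseTTL is an assumed value chosen only for the demo.
const defaultBaseTTL = 10 * time.Second

// entry holds one cached value; its mutex also serializes refreshes of that key.
type entry struct {
	mu      sync.Mutex
	value   int64
	expires time.Time
}

// ttlCache is a hypothetical cache whose TTL is baseTTL plus the last load time.
type ttlCache struct {
	mu      sync.Mutex
	baseTTL time.Duration
	entries map[string]*entry
}

func newTTLCache(baseTTL time.Duration) *ttlCache {
	return &ttlCache{baseTTL: baseTTL, entries: map[string]*entry{}}
}

// get returns the cached value for key and calls load only when the entry is
// missing or expired. Callers racing on the same key wait on the entry lock
// and then reuse the fresh value instead of starting their own load.
func (c *ttlCache) get(key string, load func() (int64, error)) (int64, error) {
	c.mu.Lock()
	e, ok := c.entries[key]
	if !ok {
		e = &entry{}
		c.entries[key] = e
	}
	c.mu.Unlock()

	e.mu.Lock()
	defer e.mu.Unlock()
	if time.Now().Before(e.expires) {
		return e.value, nil
	}
	start := time.Now()
	v, err := load()
	if err != nil {
		return 0, err // errors are not cached; the next caller retries
	}
	lastTimeTaken := time.Since(start)
	e.value = v
	// Slow loads earn a longer TTL, so expensive work is repeated less often.
	e.expires = time.Now().Add(c.baseTTL + lastTimeTaken)
	return v, nil
}

func main() {
	c := newTTLCache(defaultBaseTTL)
	calls := 0
	slowLoad := func() (int64, error) {
		calls++
		time.Sleep(50 * time.Millisecond) // pretend the size calculation is slow
		return 4096, nil
	}
	for i := 0; i < 3; i++ {
		size, err := c.get("container-1", slowLoad)
		fmt.Println(size, err, "loads so far:", calls)
	}
}
```

The per-key lock keeps a slow load for one container from blocking lookups for other containers, which is one way to limit lock contention in a design like this.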