Add slabinfo checks #70

Open
wants to merge 2 commits into
base: dev

Conversation

treydock
Contributor

@treydock treydock commented Nov 1, 2018

We've used these checks to detect kernel memory leaks that cause a node to consume all physical memory yet go undetected by conventional means like top, ps or free. On our affected system we use them to look for kmalloc entries in slabinfo that are on track to consume large amounts of memory.

* || check_slabinfo_num_objs -m 'kmalloc-*' 5000000
* || check_slabinfo_active_objs -m 'kmalloc-*' 5000000
* || check_slabinfo_active_slabs -m 'kmalloc-*' 5000000
* || check_slabinfo_num_slabs -m 'kmalloc-*' 5000000
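
For readers unfamiliar with the fields these checks refer to, here is a rough Python sketch of the idea. It is not the NHC implementation (NHC checks are written in bash); it assumes the `slabinfo - version: 2.1` column layout, and the sample data, the function names `parse_slabinfo` and `over_threshold`, and the strict "greater than" comparison are all assumptions made for illustration.

```python
from fnmatch import fnmatch

# Illustrative sample in the slabinfo 2.1 layout; the numbers are made up.
SAMPLE = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-64  6000000 6000128  64  64 1 : tunables 0 0 0 : slabdata 93752 93752 0
kmalloc-32    12000   12800  32 128 1 : tunables 0 0 0 : slabdata   100   100 0
dentry       250000  252000 192  21 1 : tunables 0 0 0 : slabdata 12000 12000 0
"""


def parse_slabinfo(text):
    """Return {cache_name: {field: int}} for the four fields the checks use."""
    caches = {}
    for line in text.splitlines():
        # Skip the version line, the column header, and blank lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        f = line.split()
        caches[f[0]] = {
            "active_objs": int(f[1]),
            "num_objs": int(f[2]),
            "active_slabs": int(f[13]),
            "num_slabs": int(f[14]),
        }
    return caches


def over_threshold(caches, pattern, field, limit):
    """Names of caches matching the glob whose field exceeds the limit."""
    return sorted(
        name for name, stats in caches.items()
        if fnmatch(name, pattern) and stats[field] > limit
    )


if __name__ == "__main__":
    # On a real node, read /proc/slabinfo instead (usually requires root).
    print(over_threshold(parse_slabinfo(SAMPLE), "kmalloc-*", "active_objs", 5000000))
```

With the made-up sample, only `kmalloc-64` crosses the 5000000 limit; `dentry` is ignored because it does not match the `kmalloc-*` glob.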

@treydock
Contributor Author

Just in case someone comes across this: the checks we're now using basically mark a node offline if any kmalloc item in slabinfo is at >5GB (5 * 1024^3 bytes), based on the active object count and the size of each object:

* || check_slabinfo_active_objs -m 'kmalloc-8' 671088640
* || check_slabinfo_active_objs -m 'kmalloc-16' 335544320
* || check_slabinfo_active_objs -m 'kmalloc-32' 167772160
* || check_slabinfo_active_objs -m 'kmalloc-64' 83886080
* || check_slabinfo_active_objs -m 'kmalloc-96' 55924053
* || check_slabinfo_active_objs -m 'kmalloc-128' 41943040
* || check_slabinfo_active_objs -m 'kmalloc-192' 27962026
* || check_slabinfo_active_objs -m 'kmalloc-256' 20971520
* || check_slabinfo_active_objs -m 'kmalloc-512' 10485760
* || check_slabinfo_active_objs -m 'kmalloc-1024' 5242880
* || check_slabinfo_active_objs -m 'kmalloc-2048' 2621440
* || check_slabinfo_active_objs -m 'kmalloc-4096' 1310720
* || check_slabinfo_active_objs -m 'kmalloc-8192' 655360
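
The thresholds in the list above work out to 5 * 1024^3 bytes divided by the object size, rounded down. A small Python sketch that reproduces them follows; the names `LIMIT_BYTES`, `KMALLOC_SIZES` and `active_objs_threshold` are made up for this example, and the script is only a convenience for regenerating the lines with a different memory limit, not part of NHC.

```python
# 5 GiB budget per kmalloc cache, expressed in bytes.
LIMIT_BYTES = 5 * 1024**3

# Object sizes (bytes) of the kmalloc caches listed above.
KMALLOC_SIZES = [8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192]


def active_objs_threshold(obj_size, limit_bytes=LIMIT_BYTES):
    """Largest whole number of objects of obj_size that fits in limit_bytes."""
    return limit_bytes // obj_size


if __name__ == "__main__":
    for size in KMALLOC_SIZES:
        print(f" * || check_slabinfo_active_objs -m 'kmalloc-{size}' "
              f"{active_objs_threshold(size)}")
```

Integer division matters for the sizes that are not powers of two (96 and 192), where the exact quotient is fractional.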

@mej mej self-assigned this Dec 6, 2018
@mej mej added the enhancement label Dec 6, 2018
@mej mej added this to the 1.4.4 Release milestone Dec 6, 2018
@mej
Owner

mej commented Dec 6, 2018

I'd like to get a 1.4.3 release out the door soon(ish), and since the current dev branch has had quite a lengthy life in production at LANL (and likely elsewhere), I feel safer sticking with it for the release and targeting new development at 1.4.4. (Don't worry; 1.4.4 won't take nearly as long to get out the door as 1.4.3 did, now that we've gotten the legal hurdles dealt with!)

Does that make sense?

@treydock
Contributor Author

treydock commented Dec 7, 2018

Ya, 1.4.4 sounds fine. We have the check deployed locally, and that gives us a bit of time to run it in production and refine it if needed.

@mej mej modified the milestones: 1.4.4 Release, 1.4.4 Release (new), 1.5 Release Apr 17, 2021