Add leniency to disk thresholds
Disk thresholds expressed as a fraction of disk usage do not scale well
with modern disks: on one hand, a 90% full partition that stores logs is
generally an issue and should be reported, but on the other hand, when a
huge volume is available for storing backups (e.g. 10TB), the 90% usage
limit does not really make sense, as we do not want to waste 1TB of disk
space.

Introduce two new parameters to tune disk usage thresholds:
  - `--disk-warning-leniency` (default: 500G)
  - `--disk-critical-leniency` (default: 250G)
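
As a rough illustration of how a value such as `500G` can be turned into a number, here is a minimal sketch. `parse_size` and `SIZE_SUFFIXES` are hypothetical names used for this example only (the helper added by this commit is `human_size_to_number`), and the sketch assumes suffixes are powers of 1024 and stops at `E`.

```ruby
# Hypothetical helper, for illustration only: converts a human-readable
# size such as "500G" into bytes, assuming binary (1024-based) suffixes.
SIZE_SUFFIXES = %w[k M G T P E].freeze

def parse_size(value)
  # \A and \z anchor the whole string, not just a single line.
  match = /\A(\d+)([kMGTPE])?\z/i.match(value)
  raise ArgumentError, %(Malformed size "#{value}") unless match

  number = match[1].to_i
  suffix = match[2]
  return number if suffix.nil?

  # "k" is 1024**1, "M" is 1024**2, and so on.
  exponent = SIZE_SUFFIXES.index { |s| s.casecmp?(suffix) } + 1
  number * (1024**exponent)
end

# df -k reports sizes in kB, so a limit in bytes is divided by 1024.
warning_leniency_kb = parse_size('500G') / 1024
puts warning_leniency_kb
```

Dividing by 1024 mirrors what the commit does when it stores the limits as `warning_leniency_kb` and `critical_leniency_kb`.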

When the fraction of disk space used reaches a warning / critical
threshold, check the available space against these "leniency" values,
and only report the warning / critical status if the available space is
lower than this limit.
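
The decision described above can be sketched as follows. This is a simplified illustration, not the tool's actual code: `disk_state` and `LIMITS` are hypothetical names, and the limits are hard-coded to the defaults, with the leniency values expressed in kB.

```ruby
# Illustrative only: default thresholds, leniency values converted to kB.
LIMITS = {
  critical: 0.95,
  warning: 0.9,
  critical_leniency_kb: 250 * (1024**2), # 250G expressed in kB
  warning_leniency_kb: 500 * (1024**2),  # 500G expressed in kB
}.freeze

# Returns :critical, :warning or :ok. A state is only reported when the
# usage fraction is above the threshold AND the free space is below the
# matching leniency value.
def disk_state(used_fraction, available_kb, limits = LIMITS)
  if used_fraction > limits[:critical] && available_kb < limits[:critical_leniency_kb]
    :critical
  elsif used_fraction > limits[:warning] && available_kb < limits[:warning_leniency_kb]
    :warning
  else
    :ok
  end
end

puts disk_state(0.92, 858_993_459)   # 10TB volume, 92% used
puts disk_state(0.92, 8 * (1024**2)) # 100G partition, 92% used
puts disk_state(0.96, 429_496_729)   # 10TB volume, 96% used
```

With these inputs, the 10TB volume at 92% stays `ok` because about 800G is still free, while the 100G partition at the same fraction is a `warning`. The 10TB volume at 96% is above the critical fraction but still has more than 250G free, so it falls through to `warning`.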

The default values have been chosen to be high enough to have an effect
only for disks larger than 5TB.
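
The 5TB figure follows from dividing each leniency value by the fraction of the disk left free at the matching threshold. A minimal sketch of that arithmetic, where `break_even_size_gb` is a hypothetical name and thresholds are passed as strings so that `Rational` avoids floating-point rounding:

```ruby
# Illustrative only: disk size (in G) above which the leniency value starts
# to matter, i.e. where free space at the threshold equals the leniency.
def break_even_size_gb(leniency_gb, threshold)
  free_fraction = 1 - Rational(threshold)
  (leniency_gb / free_fraction).to_f
end

puts break_even_size_gb(500, '0.9')  # warning defaults
puts break_even_size_gb(250, '0.95') # critical defaults
```

Both defaults work out to 5000G, which is where the "larger than 5TB" statement comes from.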

According to IEEE Std 1003.1-2017, a POSIX-compliant `df(1)` must
support the `-k` flag to report sizes in 1024-byte units instead of the
historical default of 512-byte units (still the default on FreeBSD, but
not on Linux).  We use this flag on all systems to make sure the output
is in 1024-byte units regardless of the operating system.  Existing unit
tests are updated accordingly.
smortex committed Jan 22, 2024
1 parent 6fce7e6 commit eb6c0dd
Showing 2 changed files with 61 additions and 24 deletions.
34 changes: 27 additions & 7 deletions lib/riemann/tools/health.rb
@@ -15,6 +15,8 @@ class Health
       opt :cpu_critical, 'CPU critical threshold (fraction of total jiffies)', default: 0.95
       opt :disk_warning, 'Disk warning threshold (fraction of space used)', default: 0.9
       opt :disk_critical, 'Disk critical threshold (fraction of space used)', default: 0.95
+      opt :disk_warning_leniency, 'Disk warning threshold (amount of free space)', short: :none, default: '500G'
+      opt :disk_critical_leniency, 'Disk critical threshold (amount of free space)', short: :none, default: '250G'
       opt :disk_ignorefs, 'A list of filesystem types to ignore',
           default: %w[anon_inodefs autofs cd9660 devfs devtmpfs fdescfs iso9660 linprocfs linsysfs nfs overlay procfs squashfs tmpfs]
       opt :load_warning, 'Load warning threshold (load average / core)', default: 3.0
@@ -32,7 +34,7 @@ class Health
       def initialize
         @limits = {
           cpu: { critical: opts[:cpu_critical], warning: opts[:cpu_warning] },
-          disk: { critical: opts[:disk_critical], warning: opts[:disk_warning] },
+          disk: { critical: opts[:disk_critical], warning: opts[:disk_warning], critical_leniency_kb: human_size_to_number(opts[:disk_critical_leniency]) / 1024, warning_leniency_kb: human_size_to_number(opts[:disk_warning_leniency]) / 1024 },
           load: { critical: opts[:load_critical], warning: opts[:load_warning] },
           memory: { critical: opts[:memory_critical], warning: opts[:memory_warning] },
           uptime: { critical: opts[:uptime_critical], warning: opts[:uptime_warning] },
@@ -374,14 +376,14 @@ def darwin_memory
       def df
         case @ostype
         when 'darwin', 'freebsd', 'openbsd'
-          `df -P -t no#{opts[:disk_ignorefs].join(',')}`
+          `df -Pk -t no#{opts[:disk_ignorefs].join(',')}`
         when 'sunos'
-          `df -P` # Is there a good way to exlude iso9660 here?
+          `df -Pk` # Is there a good way to exlude iso9660 here?
         else
           if @supports_exclude_type
-            `df -P #{opts[:disk_ignorefs].map { |fstype| "--exclude-type=#{fstype}" }.join(' ')}`
+            `df -Pk #{opts[:disk_ignorefs].map { |fstype| "--exclude-type=#{fstype}" }.join(' ')}`
           else
-            `df -P`
+            `df -Pk`
           end
         end
       end
@@ -397,9 +399,9 @@ def disk

           x = used.to_f / total_without_reservation

-          if x > @limits[:disk][:critical]
+          if x > @limits[:disk][:critical] && available < @limits[:disk][:critical_leniency_kb]
             alert "disk #{f[5]}", :critical, x, "#{f[4]} used"
-          elsif x > @limits[:disk][:warning]
+          elsif x > @limits[:disk][:warning] && available < @limits[:disk][:warning_leniency_kb]
             alert "disk #{f[5]}", :warning, x, "#{f[4]} used"
           else
             alert "disk #{f[5]}", :ok, x, "#{f[4]} used"
@@ -468,6 +470,24 @@ def uptime_to_human(value)
         ].compact.join(' ')
       end

+      def human_size_to_number(value)
+        case value
+        when /^\d+$/ then value.to_i
+        when /^\d+k$/i then value.to_i * 1024
+        when /^\d+M$/i then value.to_i * (1024**2)
+        when /^\d+G$/i then value.to_i * (1024**3)
+        when /^\d+T$/i then value.to_i * (1024**4)
+        when /^\d+P$/i then value.to_i * (1024**5)
+        when /^\d+E$/i then value.to_i * (1024**6)
+        when /^\d+Z$/i then value.to_i * (1024**7)
+        when /^\d+Y$/i then value.to_i * (1024**8)
+        when /^\d+R$/i then value.to_i * (1024**9)
+        when /^\d+Q$/i then value.to_i * (1024**10)
+        else
+          raise %(Malformed size "#{value}", syntax is [0-9]+[kMGTPEZYRQ]?)
+        end
+      end
+
       def tick
         invalidate_cache

51 changes: 34 additions & 17 deletions spec/riemann/tools/health_spec.rb
@@ -3,21 +3,38 @@
 require 'riemann/tools/health'

 RSpec.describe Riemann::Tools::Health do
+  describe('#human_size_to_number') do
+    subject { described_class.new.human_size_to_number(input) }
+
+    {
+      '512' => 512,
+      '1k' => 1024,
+      '2K' => 2048,
+      '42m' => 44_040_192,
+    }.each do |input, expected_output|
+      context %(when passed #{input.inspect}) do
+        let(:input) { input }
+
+        it { is_expected.to eq(expected_output) }
+      end
+    end
+  end
+
   describe('#disks') do
     before do
       allow(subject).to receive(:df).and_return(<<~OUTPUT)
-        Filesystem 512-blocks Used Avail Capacity Mounted on
-        zroot/ROOT/13.1 643127648 46210936 596916712 7% /
-        zroot/var/audit 596916888 176 596916712 0% /var/audit
-        zroot/var/mail 596919416 2704 596916712 0% /var/mail
-        zroot/tmp 596999464 82752 596916712 0% /tmp
-        zroot 596916888 176 596916712 0% /zroot
-        zroot/var/crash 596916888 176 596916712 0% /var/crash
-        zroot/usr/src 596916888 176 596916712 0% /usr/src
-        zroot/usr/home 891927992 295011280 596916712 33% /usr/home
-        zroot/var/tmp 596916952 240 596916712 0% /var/tmp
-        zroot/var/log 596928976 12264 596916712 0% /var/log
-        192.168.42.5:/volume1/tank/Medias 7491362496 2989541992 4501820504 40% /usr/home/romain/Medias
+        Filesystem 1024-blocks Used Avail Capacity Mounted on
+        zroot/ROOT/13.1 321563824 23105468 298458356 7% /
+        zroot/var/audit 298458444 88 298458356 0% /var/audit
+        zroot/var/mail 298459708 1352 298458356 0% /var/mail
+        zroot/tmp 298499732 41376 298458356 0% /tmp
+        zroot 298458444 88 298458356 0% /zroot
+        zroot/var/crash 298458444 88 298458356 0% /var/crash
+        zroot/usr/src 298458444 88 298458356 0% /usr/src
+        zroot/usr/home 445963996 147505640 298458356 33% /usr/home
+        zroot/var/tmp 298458476 120 298458356 0% /var/tmp
+        zroot/var/log 298464488 6132 298458356 0% /var/log
+        192.168.42.5:/volume1/tank/Medias 3745681248 1494770996 2250910252 40% /usr/home/romain/Medias
       OUTPUT
     end

@@ -85,10 +102,10 @@
     context 'with swap devices' do
       before do
         allow(subject).to receive(:`).with('swapinfo').and_return(<<~OUTPUT)
-          Device 512-blocks Used Avail Capacity
-          /dev/da0p2 4194304 2695808 1498496 64%
-          /dev/ggate0 2048 0 2048 0%
-          Total 4196352 2695808 1500544 64%
+          Device 1024-blocks Used Avail Capacity
+          /dev/da0p2 2097152 1347904 749248 64%
+          /dev/ggate0 1024 0 1024 0%
+          Total 2098176 1347904 750272 64%
         OUTPUT
       end

@@ -102,7 +119,7 @@
     context 'without swap devices' do
       before do
         allow(subject).to receive(:`).with('swapinfo').and_return(<<~OUTPUT)
-          Device 512-blocks Used Avail Capacity
+          Device 1024-blocks Used Avail Capacity
         OUTPUT
       end
