Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handler-slack-multichannel.rb: several improvements #82

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
84 changes: 53 additions & 31 deletions bin/handler-slack-multichannel.rb
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,10 @@ def webhook_urls
get_setting('webhook_urls')
end

def webhook_url
get_setting('webhook_url')
end

def default_channels
return get_setting('channels')['default']
rescue
Expand Down Expand Up @@ -160,19 +164,22 @@ def custom_field
def compile_channel_list
channels = []

# Some slack channels come defined as strings (when used as a single channel handler), but we need them as arrays
# Force channels as arrays enclosing them in []'s and using flatten to deal with properly defined arrays
# This way we ensure we always return an array as the rest of the code expects
if check_configured_channels
channels = check_configured_channels
channels = [check_configured_channels].flatten
puts "using check configured channels: #{channels.join('.').chomp(',')}"
elsif client_configured_channels
channels = client_configured_channels
channels = [client_configured_channels].flatten
puts "using client configured channels: #{channels.join('.').chomp(',')}"
else
channels = default_channels
channels = [default_channels].flatten
puts "using check default channels: #{default_channels.join(',').chomp(',')}"
end

if compulsory_channels
channels |= compulsory_channels
channels |= [compulsory_channels].flatten
puts "adding compulsory channels: #{compulsory_channels.join(',').chomp(',')}"
end

Expand Down Expand Up @@ -244,7 +251,7 @@ def build_description
end

def post_data(notice, channel)
slack_webhook_url = webhook_urls[channel]
slack_webhook_url = webhook_url
uri = URI(slack_webhook_url)

http = if defined?(slack_proxy_addr).nil?
Expand All @@ -256,31 +263,44 @@ def post_data(notice, channel)
http.use_ssl = true

begin
req = Net::HTTP::Post.new("#{uri.path}?#{uri.query}")
if payload_template.nil?
text = slack_surround ? slack_surround + notice + slack_surround : notice
req.body = payload(text, channel).to_json
# Implement a retry strategy, with 5 tries max and a timeout of 10 seconds each
tries ||= 5
Timeout.timeout(10) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are gonna use timeout we should expose options to specify http level timeouts otherwise they will be obscured by a generic timeout message. See sensu-plugins/sensu-plugins-http#123 for additional context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for configurable options, we probably would like to have not only the timeout but also the number of retries, and the sleeping time in between retries, as configurable settings. Something like I proposed on https://github.com/sensu-plugins/sensu-plugins-slack/pull/83/files

On the other hand, I am not sure we do need to go as deep as specifying dns, open and read timeouts. Could we not just rescue these errors from Net::?

begin
req = Net::HTTP::Post.new("#{uri.path}?#{uri.query}")
if payload_template.nil?
text = slack_surround ? slack_surround + notice + slack_surround : notice
kali-brandwatch marked this conversation as resolved.
Show resolved Hide resolved
req.body = payload(text, channel).to_json
else
req.body = notice
end
puts "request: #{req}"
kali-brandwatch marked this conversation as resolved.
Show resolved Hide resolved

response = http.request(req)

puts "response: #{response}"

rescue Net::HTTPServerException => error
if (tries -= 1) > 0
sleep 5
puts "retrying incident #{incident_key}... #{tries} left"
retry
else
# raise error for sensu-server to catch and log
puts 'slack api failed (retries) ' + incident_key + ' : ' + error.response.code + ' ' + error.response.message + ': sending to channel "' + channel + '" the message: ' + notice
kali-brandwatch marked this conversation as resolved.
Show resolved Hide resolved
exit 1
end
end
end
rescue Timeout::Error
if (tries -= 1) > 0
puts "retrying... #{tries} left"
retry
else
req.body = notice
# raise error for sensu-server to catch and log
puts 'slack api failed (timeout) ' + incident_key + ' : sending to channel "' + channel + '" the message: ' + notice
kali-brandwatch marked this conversation as resolved.
Show resolved Hide resolved
exit 1
end
puts "request: #{req}"

response = http.request(req)

puts "response: #{response}"

verify_response(response)
rescue => error
puts "error making http request: #{error}"
end
end

def verify_response(response)
case response
when Net::HTTPSuccess
true
else
raise response.error!
end
end

Expand All @@ -301,7 +321,7 @@ def payload(notice, channel)
{
icon_url: slack_icon_url ? slack_icon_url : 'https://raw.githubusercontent.com/sensu/sensu-logo/master/sensu1_flat%20white%20bg_png.png',
attachments: [{
title: "#{@event['client']['name']} - #{translate_status}",
title: "#{@event['check']['name']} - #{translate_status}",
kali-brandwatch marked this conversation as resolved.
Show resolved Hide resolved
text: [slack_message_prefix, notice].compact.join(' '),
color: color,
fields: client_fields
Expand All @@ -321,7 +341,8 @@ def color
2 => '#FF0000',
3 => '#6600CC'
}
color.fetch(check_status.to_i)
# When the sensu check status is out of bounds, use color for UNK
color.fetch(check_status.to_i, '#6600CC')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given the overlap we should break this out into a library method that can be shared to avoid duplication: https://github.com/sensu-plugins/sensu-plugins-slack/blob/4.0.0/bin/handler-slack.rb#L206-L228

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See following comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be resolved by fac988a, please review

end

def check_status
Expand All @@ -335,6 +356,7 @@ def translate_status
2 => :CRITICAL,
3 => :UNKNOWN
}
status[check_status.to_i]
# When the sensu check status is out of bounds, don't hide the status but show it for detailed info
status.fetch(check_status.to_i, "OUT OF BOUNDS: #{check_status.to_i}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that this is the right approach, I think we should just fall back to an unknown. See previous comment about merging the functionality via a library method.

Copy link
Contributor Author

@kali-brandwatch kali-brandwatch Apr 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not entirely agree with having OUT OF BOUNDS messages declared as UNKNOWN. An UNK in nagios-style means the check code has pruposely set this exit status and reflects the purpose of the programmer to explicitly tell the operator there is no way to know the state of the service checked.
An OOB from nagios standards means that the check script has exited somewhere in a nasty way. Typical cases are uncaught bad curl exits, missing files, missing permissions (on check script or files to read) and a myriad more.
The very reason to have OOB is to let the operator know that the check itself is failing, rather than working in an expected manner that is not able to monitor a service or resource.
I can see how it would make sense to avoid duplicity here, although you could argue likewise about the whole having a def translate_status and a def color which are based on the same input. We could return both in a hash and access the key we want specifically in the code.
In any case I strongly believe having OOB explicitly defined and the exit code returned to the operator for visibility is crucial to understand what is going on with the service monitored.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your point about translate_status and color duplication, I did not initially write that bit. I added in the error handling to return a status (and corresponding color) which is understood in the sensu ecosystem in #62. More often than not changes to existing plugins are made within the confines of how it already works. I did not use that slack handler at the time so I did not really have anywhere to test it. I made the smallest and safest change that solved the problem someone was facing.

For the moment I think we should stick to the status[1] of unknown for non [0-2] codes because there is lack of precedence in the ecosystem. In ~5 years of consuming, contributing, and maintaining sensu I have yet to see anyone follow this convention you are proposing. While we mirror certain aspects of nagios I don't want to hold ourselves to all of their conventions as it ties our hands. I am not opposed to it in general but I think we should put out an RFC and get community feedback before declaring a new convention to be followed. I am fine with adding some additional output to tell the operator more in the meantime.

[1]: leaving the status code intact

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that it probably make sense to open this discussion to the whole community. I still advocate for my proposed convention but I will move the conversation to there for openness and greater feedback.

Copy link
Contributor Author

@kali-brandwatch kali-brandwatch Apr 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, added here: sensu-plugins/community#127

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the de-duplication of translate_status and color, I will check if this would mean lots of further changes. If it doesn't I will go ahead with it, otherwise I will try to keep it simple for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be resolved bu fac988a, please review

end
end