
[Livestatus] Unexpected error after an arbiter reload #67

Open
fullmetalucard opened this issue Nov 9, 2015 · 18 comments

@fullmetalucard

Hi,

Since a few days, we encounter a new problem related to livestatus.
We're running a 2.4.1 version under Debian 8 with the following architecture:

  • 1 shinken master (with no poller running on it)
  • 4 pollers
  • 4 realms
  • Thruk 2.0 connected to 5 livestatus (master and 4 realms).

The fact is that when we launch an arbiter reload, the broker goes mad because of the Livestatus module. The Thruk interface becomes unusable although Livestatus still seems to be up.
Here is an example of the traceback in brokerd.log:

[1447026298] ERROR: [broker-1] [Livestatus] Unexpected error during process of request 'GET services\nColumns: accept_passive_checks acknowledged action_url action_url_expanded active_checks_enabled check_command check_interval check_options check_period check_type checks_enabled comments current_attempt current_notification_number description event_handler event_handler_enabled custom_variable_names custom_variable_values execution_time first_notification_delay flap_detection_enabled groups has_been_checked high_flap_threshold host_acknowledged host_action_url_expanded host_active_checks_enabled host_address host_alias host_checks_enabled host_check_type host_latency host_plugin_output host_perf_data host_current_attempt host_check_command host_comments host_groups host_has_been_checked host_icon_image_expanded host_icon_image_alt host_is_executing host_is_flapping host_name host_notes_url_expanded host_notifications_enabled host_scheduled_downtime_depth host_state host_accept_passive_checks host_last_state_change icon_image icon_image_alt icon_image_expanded is_executing is_flapping last_check last_notification last_state_change latency long_plugin_output low_flap_threshold max_check_attempts next_check notes notes_expanded notes_url notes_url_expanded notification_interval notification_period notifications_enabled obsess_over_service percent_state_change perf_data plugin_output process_performance_data retry_interval scheduled_downtime_depth state state_type modified_attributes_list last_time_critical last_time_ok last_time_unknown last_time_warning display_name host_display_name host_custom_variable_names host_custom_variable_values in_check_period in_notification_period host_parents\nFilter: host_has_been_checked = 0\nFilter: host_has_been_checked = 1\nFilter: host_state = 0\nAnd: 2\nOr: 2\nFilter: host_scheduled_downtime_depth = 0\nFilter: host_acknowledged = 0\nAnd: 2\nFilter: has_been_checked = 1\nFilter: state = 1\nAnd: 2\nFilter: has_been_checked = 
1\nFilter: state = 3\nAnd: 2\nFilter: has_been_checked = 1\nFilter: state = 2\nAnd: 2\nOr: 3\nFilter: scheduled_downtime_depth = 0\nFilter: acknowledged = 0\nAnd: 2\nAnd: 4\nOutputFormat: json\nResponseHeader: fixed16\n\n' : 115536
[1447026298] ERROR: [broker-1] [Livestatus] Back trace of this exception: Traceback (most recent call last):
  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 74, in handle_request
    return self.handle_request_and_fail(data)
  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 135, in handle_request_and_fail
    output, keepalive = query.process_query()
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 283, in process_query
    return self.response.respond()
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 147, in respond
    responselength = 1 + self.get_response_len() # 1 for the final '\n'
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 142, in get_response_len
    if isinstance(rsp, LiveStatusListResponse)
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 83, in total_len
    for generated_data in value:
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 278, in make_live_data_generator
    for value in self.make_live_data_generator2(result, columns, aliases):
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 224, in make_live_data_generator2
    item = next(result)
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 46, in gen_filtered
    for val in values:
  File "/var/lib/shinken/modules/livestatus/livestatus_regenerator.py", line 125, in itersorted
    yield self.items[id]
KeyError: 115536

The only workaround we found consists in restarting the broker each time we want to reload the arbiter (and this workaround leads to severe memory leaks).
So as not to replace one problem with another, we searched and found that our issue could be related to issue #47.

We tried to manually do the GET requests when everything goes fine and livestatus answers correctly:

echo -e "GET hosts\n\n" | netcat localhost 50000

(This also works for queries about contacts, services, etc.)
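For anyone wanting to script the same check (for example to poll each Livestatus endpoint after a reload), the netcat one-liner can be reproduced with a small client. This is only a sketch: the plain-TCP transport and port 50000 come from the example above, while the helper names are hypothetical:

```python
import socket


def build_query(table, columns=None, filters=None):
    """Build a raw Livestatus request: one header per line,
    terminated by a blank line (as in the netcat example)."""
    lines = ["GET %s" % table]
    if columns:
        lines.append("Columns: %s" % " ".join(columns))
    for f in (filters or []):
        lines.append("Filter: %s" % f)
    return ("\n".join(lines) + "\n\n").encode("utf-8")


def query_livestatus(host, port, request, timeout=5.0):
    """Send a request to a Livestatus TCP socket and return the raw answer."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(request)
        sock.shutdown(socket.SHUT_WR)  # signal end of request, like netcat does
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        sock.close()


# Equivalent of: echo -e "GET hosts\n\n" | netcat localhost 50000
# query_livestatus("localhost", 50000, build_query("hosts"))
```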

Another thing we noticed is that it seems to occur when Livestatus is queried frequently by Thruk, because we never have these errors during the night or at weekends. So it might be related to the number of users/operators connected to Thruk.

Any help would be appreciated,

Regards

@fullmetalucard
Author

Hello, we still have this annoying problem. It gets worse as the infrastructure monitors more and more hosts.

Regards,

@olivierHa
Member

I don't know about this specific issue, but why does restarting the broker lead to memory issues?
I am currently restarting the broker(s) each time I restart the arbiter, and I don't see any memory leaks so far.

Could you be more specific?


@fullmetalucard
Author

Hi, it's clearly a bug related to big infrastructures.
We noticed this occurs only on brokers that manage a lot of checks.

To complete the information about our workaround, we made an alias shinken_reload that does this:

  • config check
  • arbiter reload
  • wait 120 seconds
  • broker restart

With this workaround, the platform seems more stable.
Our guess is that by default the arbiter is not fully ready to dispatch orders right after the config check; we have to wait so that it can communicate successfully with the broker.
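For reference, the four steps above can be sketched as a small wrapper (replacing the shell alias). The init-script paths and the check command are assumptions that depend on the distribution; only the 120-second delay comes from the description above:

```python
import subprocess
import time

# Hypothetical paths/commands: adjust to your installation.
CONFIG_CHECK = ["/etc/init.d/shinken-arbiter", "check"]
ARBITER_RELOAD = ["/etc/init.d/shinken-arbiter", "reload"]
BROKER_RESTART = ["/etc/init.d/shinken-broker", "restart"]
WAIT_SECONDS = 120  # give the arbiter time to become ready to dispatch


def shinken_reload():
    """Config check, arbiter reload, wait, then broker restart."""
    subprocess.check_call(CONFIG_CHECK)    # abort early on config errors
    subprocess.check_call(ARBITER_RELOAD)
    time.sleep(WAIT_SECONDS)               # let the arbiter settle
    subprocess.check_call(BROKER_RESTART)
```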

I should also mention that our Shinken master isn't especially heavily loaded (load average = 2 on 8 PPC processors).
Stats: more than 600 hosts / 7000 checks.

Regards

@vesl

vesl commented Feb 8, 2016

Hi,

I have EXACTLY the same architecture and the same problem.
I'm now seeing it on a second site (not PPC, but 20 realms).

What should I do to debug this (Python debugger or something else)?

Anyway, thanks: Shinken is the best solution.

Some logs from when the bug occurs.

Here:

  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 74, in handle_request
    return self.handle_request_and_fail(data)
  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 135, in handle_request_and_fail
    output, keepalive = query.process_query()
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 283, in process_query
    return self.response.respond()
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 147, in respond
    responselength = 1 + self.get_response_len() # 1 for the final '\n'
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 142, in get_response_len
    if isinstance(rsp, LiveStatusListResponse)
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 83, in total_len
    for generated_data in value:
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 278, in make_live_data_generator
    for value in self.make_live_data_generator2(result, columns, aliases):
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 224, in make_live_data_generator2
    item = next(result)
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 46, in gen_filtered
    for val in values:
  File "/var/lib/shinken/modules/livestatus/livestatus_regenerator.py", line 125, in itersorted
    yield self.items[id]

And here:

[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934161] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934161] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934161] INFO: [broker-1] Connection OK to the scheduler scheduler-1
[1454934164] ERROR: [broker-1] [Livestatus Query] Error: 'Contactgroups' object has no attribute '__itersorted__'
[1454934165] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934165] ERROR: [broker-1] [Livestatus Query] Received a line of input which i can't handle: 'quit'
[1454934165] ERROR: [broker-1] [Livestatus Query] Received a line of input which i can't handle: 'exit'
[1454934165] WARNING: [broker-1] [Livestatus Query Metainfo] Received a line of input which i can't handle: 'quit'
[1454934165] WARNING: [broker-1] [Livestatus Query Metainfo] Received a line of input which i can't handle: 'exit'
[1454934173] ERROR: [broker-1] [Livestatus Query] Error: 'Contacts' object has no attribute '__itersorted__'
[1454934173] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934177] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934177] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934177] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934183] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934183] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934184] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934186] INFO: [broker-1] The module None is asking me to get all initial data from the scheduler 0
[1454934186] INFO: [broker-1] The module npcdmod is asking me to get all initial data from the scheduler 0
[1454934190] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934190] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934192] INFO: [broker-1] Connection OK to the scheduler scheduler-1
[1454934194] ERROR: [broker-1] [Livestatus Query] Error: 'Contactgroups' object has no attribute '__itersorted__'
[1454934195] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934198] ERROR: [broker-1] [Livestatus Query] Error: 'Contacts' object has no attribute '__itersorted__'
[1454934198] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934198] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934201] ERROR: [broker-1] [Livestatus Query] Error: 'Contactgroups' object has no attribute '__itersorted__'
[1454934204] ERROR: [broker-1] [Livestatus Query] Error: 'Contacts' object has no attribute '__itersorted__'

@fullmetalucard
Author

Hi, I knew I was not the only one ;)
I was also sure that it wasn't related to the PPC architecture. We still have this annoying problem.
Our workaround works only randomly. Our guess is that when Livestatus is constantly accessed (by Thruk sessions) during a reload, it goes mad.
We are also wondering why development of Livestatus seems to have been abandoned: no commits since last September and still a lot of open issues. Is it for commercial reasons?
Many people use Thruk/Livestatus and are not familiar with, or ready to migrate, their interface to WebUI.
@olivierHa: you reported a memory issue too (#63). As far as I know, the broker launches the Livestatus module, so the problem you reported could be linked. I insist on the fact that this is clearly related to big infrastructures with many realms and hosts, because we don't see this problem on smaller infrastructures with the same OS and Shinken versions.

The only thing I'm sure of is that something has to be done about Livestatus.

Thanks in advance for your help and patience.

Regards,

@webladen

webladen commented Feb 9, 2016

Hi,

I have a fresh install of Shinken (2.4.2) with WebUI2 and Livestatus (used by check_mk).
I have had the same problem since I activated Livestatus (steadily increasing RAM consumption).
Yesterday I disabled Livestatus in the broker configuration (but kept WebUI2 activated).

The graph below shows the problem.
[graph: memory consumption]

I need Livestatus for check_mk and NagVis.

How can I fix it?

@webladen

Does anybody have an idea?
The problem is still there and it is a real blocker for us.

[graph160217]

@yadutaf

yadutaf commented Mar 11, 2016

I've seen a similar behavior on one of our Shinken instances. It is reliably triggered by /etc/init.d/shinken-arbiter reload.

For now, I've applied an ugly patch to work around the symptoms:

diff -u /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py.old /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py
--- /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py.old  2016-03-09 17:39:57.874430134 +0000
+++ /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py  2016-03-09 17:39:12.920622557 +0000
@@ -503,7 +503,7 @@
         # Clean hosts from hosts and hostgroups
         for h in to_del_h:
             safe_print("Deleting", h.get_name())
-            del self.hosts[h.id]
+            #del self.hosts[h.id]

         # Now clean all hostgroups too
         for hg in self.hostgroups:
@@ -514,7 +514,7 @@

         for s in to_del_srv:
             safe_print("Deleting", s.get_full_name())
-            del self.services[s.id]
+            #del self.services[s.id]

         # Now clean service groups
         for sg in self.servicegroups:

This is by no means a fix, so I'm not submitting a PR. I'm also checking the installation itself.

@vesl

vesl commented Mar 20, 2016

Could you tell me which version of CherryPy you are on?
pip list | grep Cherry

@fullmetalucard
Author

Hi, we're on CherryPy 3.8.0.

@diogouchoas

Any updates?
This issue is getting very annoying now that we have passed 10k monitored services.

@vesl

vesl commented Jul 16, 2016

Upgrade to 2.4.3, man.

@diogouchoas

diogouchoas commented Jul 18, 2016

We're already on 2.4.3, man.
The issue still exists.

@floppy84

Hi all,
I've got the same issue. Is there a fix for it? I am on 2.4.3 too.

@krpt

krpt commented Sep 22, 2016

Same here, I would be grateful for a fix.

@tandrez

tandrez commented Jan 25, 2017

Hello,

I have the same issue on a professional project and it's very annoying for our customer. We're monitoring about 5K hosts and 20K services!
The workaround rarely works.
Is there any hope of a fix?

Thanks in advance for your help!

@oemunoz

oemunoz commented May 21, 2017

+1 Hello all, same here.

@Caez83

Caez83 commented Jun 15, 2017

I've had the same bug for a month.
