
[Livestatus] Unexpected error after an arbiter reload #67

Open
fullmetalucard opened this issue Nov 9, 2015 · 18 comments

@fullmetalucard

Hi,

Since a few days, we encounter a new problem related to livestatus.
We're running a 2.4.1 version under Debian 8 with the following architecture:

  • 1 shinken master (with no poller running on it)
  • 4 pollers
  • 4 realms
  • Thruk 2.0 connected to 5 livestatus (master and 4 realms).

The fact is that when we launch an arbiter reload, the broker goes mad because of the Livestatus module. The Thruk interface becomes unusable although Livestatus still seems to be up.
Here is an example of the traceback in brokerd.log:

[1447026298] ERROR: [broker-1] [Livestatus] Unexpected error during process of request 'GET services\nColumns: accept_passive_checks acknowledged action_url action_url_expanded active_checks_enabled check_command check_interval check_options check_period check_type checks_enabled comments current_attempt current_notification_number description event_handler event_handler_enabled custom_variable_names custom_variable_values execution_time first_notification_delay flap_detection_enabled groups has_been_checked high_flap_threshold host_acknowledged host_action_url_expanded host_active_checks_enabled host_address host_alias host_checks_enabled host_check_type host_latency host_plugin_output host_perf_data host_current_attempt host_check_command host_comments host_groups host_has_been_checked host_icon_image_expanded host_icon_image_alt host_is_executing host_is_flapping host_name host_notes_url_expanded host_notifications_enabled host_scheduled_downtime_depth host_state host_accept_passive_checks host_last_state_change icon_image icon_image_alt icon_image_expanded is_executing is_flapping last_check last_notification last_state_change latency long_plugin_output low_flap_threshold max_check_attempts next_check notes notes_expanded notes_url notes_url_expanded notification_interval notification_period notifications_enabled obsess_over_service percent_state_change perf_data plugin_output process_performance_data retry_interval scheduled_downtime_depth state state_type modified_attributes_list last_time_critical last_time_ok last_time_unknown last_time_warning display_name host_display_name host_custom_variable_names host_custom_variable_values in_check_period in_notification_period host_parents\nFilter: host_has_been_checked = 0\nFilter: host_has_been_checked = 1\nFilter: host_state = 0\nAnd: 2\nOr: 2\nFilter: host_scheduled_downtime_depth = 0\nFilter: host_acknowledged = 0\nAnd: 2\nFilter: has_been_checked = 1\nFilter: state = 1\nAnd: 2\nFilter: has_been_checked = 
1\nFilter: state = 3\nAnd: 2\nFilter: has_been_checked = 1\nFilter: state = 2\nAnd: 2\nOr: 3\nFilter: scheduled_downtime_depth = 0\nFilter: acknowledged = 0\nAnd: 2\nAnd: 4\nOutputFormat: json\nResponseHeader: fixed16\n\n' : 115536
[1447026298] ERROR: [broker-1] [Livestatus] Back trace of this exception: Traceback (most recent call last):
  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 74, in handle_request
    return self.handle_request_and_fail(data)
  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 135, in handle_request_and_fail
    output, keepalive = query.process_query()
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 283, in process_query
    return self.response.respond()
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 147, in respond
    responselength = 1 + self.get_response_len() # 1 for the final '\n'
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 142, in get_response_len
    if isinstance(rsp, LiveStatusListResponse)
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 83, in total_len
    for generated_data in value:
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 278, in make_live_data_generator
    for value in self.make_live_data_generator2(result, columns, aliases):
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 224, in make_live_data_generator2
    item = next(result)
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 46, in gen_filtered
    for val in values:
  File "/var/lib/shinken/modules/livestatus/livestatus_regenerator.py", line 125, in itersorted
    yield self.items[id]
KeyError: 115536

The only workaround we found consists in restarting the broker each time we want to reload the arbiter (and this workaround leads to severe memory leaks).
So as not to replace one problem with another, we searched and found that our issue could be related to issue #47.

We tried to manually do the GET requests when everything goes fine and livestatus answers correctly:

echo -e "GET hosts\n\n" | netcat localhost 50000

(This also works for queries about contacts, services, etc.)
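For anyone wanting to script the same check (for example to poll each Livestatus endpoint after a reload), the netcat one-liner can be reproduced with a small client. This is only a sketch: the plain-TCP transport and port 50000 come from the example above, while the helper names are hypothetical:

```python
import socket


def build_query(table, columns=None, filters=None):
    """Build a raw Livestatus request: one header per line,
    terminated by a blank line (as in the netcat example)."""
    lines = ["GET %s" % table]
    if columns:
        lines.append("Columns: %s" % " ".join(columns))
    for f in (filters or []):
        lines.append("Filter: %s" % f)
    return ("\n".join(lines) + "\n\n").encode("utf-8")


def query_livestatus(host, port, request, timeout=5.0):
    """Send a request to a Livestatus TCP socket and return the raw answer."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(request)
        sock.shutdown(socket.SHUT_WR)  # signal end of request, like netcat does
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        sock.close()


# Equivalent of: echo -e "GET hosts\n\n" | netcat localhost 50000
# query_livestatus("localhost", 50000, build_query("hosts"))
```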

Another thing we noticed is that it seems to occur when Livestatus is queried frequently by Thruk, because we never have these errors during the night or at weekends. So it might be related to the number of users/operators connected to Thruk.

Any help would be appreciated,

Regards

@fullmetalucard
Author

Hello, we still have this annoying problem. It gets worse as the infrastructure monitors more and more hosts.

Regards,

@olivierHa
Member

I don't know about this specific issue, but why does restarting the broker lead to memory issues?
I am currently restarting the broker(s) each time I restart the arbiter, and I don't see any memory leaks so far.

Could you be more specific?


@fullmetalucard
Author

Hi, it's clearly a bug related to big infrastructures.
We noticed this occurs only on brokers that manage a lot of checks.

To complete the information about our workaround, we made an alias shinken_reload that does this:

  • config check
  • arbiter reload
  • wait 120 seconds
  • broker restart

With this workaround, the platform seems more stable.
Our guess is that by default the arbiter is not fully ready to dispatch orders right after the config check; we have to wait so that it can communicate successfully with the broker.
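For reference, the four steps above can be sketched as a small wrapper (replacing the shell alias). The init-script paths and the check command are assumptions that depend on the distribution; only the 120-second delay comes from the description above:

```python
import subprocess
import time

# Hypothetical paths/commands: adjust to your installation.
CONFIG_CHECK = ["/etc/init.d/shinken-arbiter", "check"]
ARBITER_RELOAD = ["/etc/init.d/shinken-arbiter", "reload"]
BROKER_RESTART = ["/etc/init.d/shinken-broker", "restart"]
WAIT_SECONDS = 120  # give the arbiter time to become ready to dispatch


def shinken_reload():
    """Config check, arbiter reload, wait, then broker restart."""
    subprocess.check_call(CONFIG_CHECK)    # abort early on config errors
    subprocess.check_call(ARBITER_RELOAD)
    time.sleep(WAIT_SECONDS)               # let the arbiter settle
    subprocess.check_call(BROKER_RESTART)
```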

I should also mention that our Shinken master isn't especially heavily loaded (load average = 2 on 8 PPC processors).
Stats: more than 600 hosts / 7000 checks.

Regards

@vesl

vesl commented Feb 8, 2016

Hi,

I have EXACTLY the same architecture and the same problem.
I'm now seeing it on a second site (not PPC, but 20 realms).

What should I do to debug this (Python debugger or something else)?

Anyway, thanks: Shinken is the best solution.

Some logs from when the bug occurs.

Here:

  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 74, in handle_request
    return self.handle_request_and_fail(data)
  File "/var/lib/shinken/modules/livestatus/livestatus_obj.py", line 135, in handle_request_and_fail
    output, keepalive = query.process_query()
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 283, in process_query
    return self.response.respond()
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 147, in respond
    responselength = 1 + self.get_response_len() # 1 for the final '\n'
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 142, in get_response_len
    if isinstance(rsp, LiveStatusListResponse)
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 83, in total_len
    for generated_data in value:
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 278, in make_live_data_generator
    for value in self.make_live_data_generator2(result, columns, aliases):
  File "/var/lib/shinken/modules/livestatus/livestatus_response.py", line 224, in make_live_data_generator2
    item = next(result)
  File "/var/lib/shinken/modules/livestatus/livestatus_query.py", line 46, in gen_filtered
    for val in values:
  File "/var/lib/shinken/modules/livestatus/livestatus_regenerator.py", line 125, in itersorted
    yield self.items[id]

And here:

[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934157] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934161] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934161] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934161] INFO: [broker-1] Connection OK to the scheduler scheduler-1
[1454934164] ERROR: [broker-1] [Livestatus Query] Error: 'Contactgroups' object has no attribute '__itersorted__'
[1454934165] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934165] ERROR: [broker-1] [Livestatus Query] Received a line of input which i can't handle: 'quit'
[1454934165] ERROR: [broker-1] [Livestatus Query] Received a line of input which i can't handle: 'exit'
[1454934165] WARNING: [broker-1] [Livestatus Query Metainfo] Received a line of input which i can't handle: 'quit'
[1454934165] WARNING: [broker-1] [Livestatus Query Metainfo] Received a line of input which i can't handle: 'exit'
[1454934173] ERROR: [broker-1] [Livestatus Query] Error: 'Contacts' object has no attribute '__itersorted__'
[1454934173] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934177] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934177] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934177] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934183] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934183] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934184] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934186] INFO: [broker-1] The module None is asking me to get all initial data from the scheduler 0
[1454934186] INFO: [broker-1] The module npcdmod is asking me to get all initial data from the scheduler 0
[1454934190] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934190] ERROR: [broker-1] [Livestatus Query] Error: 'Hosts' object has no attribute '__itersorted__'
[1454934192] INFO: [broker-1] Connection OK to the scheduler scheduler-1
[1454934194] ERROR: [broker-1] [Livestatus Query] Error: 'Contactgroups' object has no attribute '__itersorted__'
[1454934195] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934198] ERROR: [broker-1] [Livestatus Query] Error: 'Contacts' object has no attribute '__itersorted__'
[1454934198] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934198] ERROR: [broker-1] [Livestatus Query] Error: 'Services' object has no attribute '__itersorted__'
[1454934201] ERROR: [broker-1] [Livestatus Query] Error: 'Contactgroups' object has no attribute '__itersorted__'
[1454934204] ERROR: [broker-1] [Livestatus Query] Error: 'Contacts' object has no attribute '__itersorted__'

@fullmetalucard
Author

Hi, I knew I was not the only one ;)
I was also sure that it wasn't related to the PPC architecture. We still have this annoying problem.
Our workaround works only randomly. Our guess is that when Livestatus is constantly accessed (by Thruk sessions) during a reload, it goes mad.
We are also wondering why development of Livestatus seems to have been abandoned: no commits since last September and still a lot of open issues. Is it for commercial reasons?
Many people use Thruk/Livestatus and are not familiar with, or ready to migrate, their interface to WebUI.
@olivierHa: you reported a memory issue too (#63). As far as I know, the broker launches the Livestatus module, so the problem you reported could be linked. I insist on the fact that this is clearly related to big infrastructures with many realms and hosts, because we don't see this problem on smaller infrastructures with the same OS and Shinken versions.

The only thing I'm sure of is that something has to be done about Livestatus.

Thanks in advance for your help and patience.

Regards,

@webladen

webladen commented Feb 9, 2016

Hi,

I have a fresh install of Shinken (2.4.2) with WebUI2 and Livestatus (used by check_mk).
I have had the same problem since I activated Livestatus (steadily increasing RAM consumption).
Yesterday I disabled Livestatus in the broker configuration (but kept WebUI2 activated).

The graph below shows the problem.
[graph: memory consumption]

I need Livestatus for check_mk and NagVis.

How can I fix it?

@webladen

Does anybody have an idea?
The problem is still there and it is a real blocker for us.

[graph160217]

@yadutaf

yadutaf commented Mar 11, 2016

I've seen a similar behavior on one of our Shinken instances. It is reliably triggered by /etc/init.d/shinken-arbiter reload.

For now, I've applied an ugly patch to work around the symptoms:

diff -u /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py.old /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py
--- /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py.old  2016-03-09 17:39:57.874430134 +0000
+++ /usr/local/lib/python2.7/dist-packages/shinken/misc/regenerator.py  2016-03-09 17:39:12.920622557 +0000
@@ -503,7 +503,7 @@
         # Clean hosts from hosts and hostgroups
         for h in to_del_h:
             safe_print("Deleting", h.get_name())
-            del self.hosts[h.id]
+            #del self.hosts[h.id]

         # Now clean all hostgroups too
         for hg in self.hostgroups:
@@ -514,7 +514,7 @@

         for s in to_del_srv:
             safe_print("Deleting", s.get_full_name())
-            del self.services[s.id]
+            #del self.services[s.id]

         # Now clean service groups
         for sg in self.servicegroups:

This is by no means a fix, so I'm not submitting a PR. I'm also checking the installation itself.

@vesl

vesl commented Mar 20, 2016

Could you tell me which version of CherryPy you are on?
pip list | grep Cherry

@fullmetalucard
Author

Hi, we're on CherryPy 3.8.0.

@diogouchoas

Any updates?
This issue is getting very annoying now that we have passed 10k monitored services.

@vesl

vesl commented Jul 16, 2016

Upgrade to 2.4.3, man.

@diogouchoas

diogouchoas commented Jul 18, 2016

We're already on 2.4.3, man.
The issue still exists.

@floppy84

Hi all,
I've got the same issue. Is there a fix for it? I am on 2.4.3 too.

@krpt

krpt commented Sep 22, 2016

Same here, I would be grateful for a fix.

@tandrez

tandrez commented Jan 25, 2017

Hello,

I have the same issue on a professional project and it's very annoying for our customer. We're monitoring about 5K hosts and 20K services!
The workaround rarely works.
Is there any hope of a fix?

Thanks in advance for your help!

@oemunoz

oemunoz commented May 21, 2017

+1 Hello all, same here.

@Caez83

Caez83 commented Jun 15, 2017

I've had the same bug for a month.
