Thruk fails when livestatus Backend unhealthy

267 views
Skip to first unread message

Fabrice Le Dorze

unread,
Mar 22, 2015, 7:07:10 AM3/22/15
to th...@googlegroups.com
Hello
Yesterday, we had a incident with our multiple backends Thruk. (naemon 1.0.0)
Our Monitoring team could log but then got an 'internal server error" on left menu. They were totally blind, no action was possible.
After investigation, I save their life by deleting an unhealthy backenf from thruk config.

We use local states so that (I thought) Thruk is not stalled or slow when a backend is unavailable or unhealthy.

In the beginning, we just used a tcp_ping check on the livestatus port (TCP 822 for us) to check backends availiblity.
But it revealed unsufficient : few weeks ago, one backend was still responding on port TCP 822 but was unable answer to a socat request.
Naemon was still considering it as available.
The result was timeouts in Thruk. Hopefully, our Monitoring team was able to disable this backend but clicking on its button.
But this time, no way.

To fix it, we have developped a plugin to check more precisely the health of backends :
1- a tcp_ping to check that backend can answer on TCP 822 in reasonable delays
2- a 'GET hosts' livestatus request with socat with a timout, a threshold for response time and a threshold for number of hosts in socat answer.
But I fails again to prevent Thruk fails.

We checked the Thruk server health and logs, see extract below.
The consumed memory, usually around 40%, rised to over 97%.
Naemon sees sometimes 2 backends DOWN :
'Thruk DESD'
'Veon2 ADE', with sometimes flapping state 

As expected, we see messages like "b9cf1 -> disabled by local state check (Thruk DESD)" : this backend is properly disabled when Naemon sees it down.
But in thruk log, no such messages for "Veon2 ADE". Indeed, there are no messages concerning this backend at all.
In fact, there are no messages for 3 backends.

That's trouble me : in normal conditions, all 9 backends are working fine.
But why can I see messages in Thruk log for only 6 of them ?
Why the local state for "Veon2 ADE" is not apparently taken in account by Thruk ?
Is there a effect of flapping state ?
Does it explain the failing Thruk interface ?

We absolutely need to insure that the whole system does not collapse when only one backend is unhealty.

The Naemon logs gave :
...
[2015/03/21 19:11:42] HOST ALERT: Thruk DESD;DOWN;SOFT;5;(Host check timed out after 30.04 seconds)
[2015/03/21 19:13:36] HOST ALERT: Thruk DESD;DOWN;SOFT;6;Socat slow to get result (>20)
[2015/03/21 19:13:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;1;host number 0 <10. Strange.
[2015/03/21 19:15:42] HOST ALERT: Thruk DESD;DOWN;SOFT;7;(Host check timed out after 30.04 seconds)
[2015/03/21 19:15:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;2;host number 0 <10. Strange.
[2015/03/21 19:17:25] HOST ALERT: Thruk DESD;UP;SOFT;8;TCP OK - 0,025 second response time on 147.98.136.50 port 822 , 2729 hosts in 11s
[2015/03/21 19:17:49] HOST ALERT: Veon2 ADE;DOWN;SOFT;3;host number 0 <10. Strange.
[2015/03/21 19:19:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;4;host number 0 <10. Strange.
[2015/03/21 19:20:42] HOST ALERT: Thruk DESD;DOWN;SOFT;1;(Host check timed out after 30.03 seconds)
[2015/03/21 19:21:50] HOST ALERT: Veon2 ADE;DOWN;SOFT;5;host number 0 <10. Strange.
[2015/03/21 19:22:24] HOST ALERT: Thruk DESD;UP;SOFT;2;TCP OK - 0,025 second response time on 147.98.136.50 port 822 , 2729 hosts in 9s
[2015/03/21 19:23:41] HOST ALERT: Thruk DESD;DOWN;SOFT;1;Socat slow to get result (>20)
[2015/03/21 19:23:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;6;host number 0 <10. Strange.
[2015/03/21 19:25:11] HOST ALERT: Eon CMDG;DOWN;SOFT;1;(Host check did not exit properly)
[2015/03/21 1l9:25:27] HOST ALERT: Thruk DESD;UP;SOFT;2;TCP OK - 0,029 second response time on 147.98.136.50 port 822 , 2729 hosts in 12s
[2015/03/21 19:25:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;7;host number 0 <10. Strange.
[2015/03/21 19:26:12] HOST ALERT: Eon IUC;DOWN;SOFT;1;(Host check did not exit properly)
[2015/03/21 19:26:14] HOST ALERT: Thruk DESD;DOWN;SOFT;1;(Host check did not exit properly)
[2015/03/21 19:26:32] HOST ALERT: Eon BAG;DOWN;SOFT;1;(Host check did not exit properly)
[2015/03/21 19:26:56] HOST ALERT: Eon SLV;DOWN;SOFT;1;(Host check did not exit properly)
[2015/03/21 19:27:14] HOST ALERT: Eon CMDG;UP;SOFT;2;TCP OK - 0,021 second response time on 91.151.62.65 port 822 , 87 hosts in 1s
[2015/03/21 19:27:51] HOST ALERT: Veon2 ADE;DOWN;SOFT;8;host number 0 <10. Strange.
[2015/03/21 19:28:12] HOST ALERT: Eon CMDG;DOWN;SOFT;1;(Return code of 254 is out of bounds)
[2015/03/21 19:28:16] HOST ALERT: Eon IUC;UP;SOFT;2;TCP OK - 1,031 second response time on 10.10.11.238 port 822 , 226 hosts in 2s
[2015/03/21 19:28:36] HOST ALERT: Eon BAG;UP;SOFT;2;TCP OK - 0,029 second response time on 10.168.4.200 port 822 , 110 hosts in 3s
[2015/03/21 19:28:44] HOST ALERT: Thruk DESD;DOWN;SOFT;2;(Host check timed out after 30.04 seconds)
[2015/03/21 19:28:57] HOST ALERT: Eon SLV;UP;SOFT;2;TCP OK - 0,055 second response time on 10.120.35.20 port 822 , 29 hosts in 0s
[2015/03/21 19:29:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;9;host number 0 <10. Strange.
[2015/03/21 19:30:12] HOST ALERT: Eon CMDG;DOWN;SOFT;2;(Host check did not exit properly)
[2015/03/21 19:30:15] HOST ALERT: Thruk DESD;DOWN;SOFT;3;(Host check did not exit properly)
[2015/03/21 19:31:49] HOST ALERT: Veon2 ADE;DOWN;HARD;10;host number 0 <10. Strange.
[2015/03/21 19:32:14] HOST ALERT: Eon CMDG;UP;SOFT;3;TCP OK - 0,024 second response time on 91.151.62.65 port 822 , 87 hosts in 1s
[2015/03/21 19:32:27] HOST ALERT: Thruk DESD;UP;SOFT;4;TCP OK - 0,025 second response time on 147.98.136.50 port 822 , 2729 hosts in 11s
[2015/03/21 19:33:43] HOST ALERT: Thruk DESD;DOWN;SOFT;1;Socat slow to get result (>20)
[2015/03/21 19:35:45] HOST ALERT: Thruk DESD;DOWN;SOFT;2;(Host check timed out after 30.01 seconds)
....


The Thruk logs shows :
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Action.AddDefaults] checking auth
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Action.AddDefaults] user not authenticated yet
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Action.AddDefaults] user authenticated as: rsmsuper
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Action.AddDefaults] _set_enabled_backends() by cookie
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 898fc -> enabled by local state check (Eon BAG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 4b7de -> enabled by local state check (Eon IUC)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 4b7de -> enabled by local state check (Eon IUC)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 1de35 -> enabled by local state check (Eon TOR)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 1de35 -> enabled by local state check (Eon TOR)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 1de35 -> enabled by local state check (Eon TOR)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 1de35 -> enabled by local state check (Eon TOR)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 1de35 -> enabled by local state check (Eon TOR)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] b9cf1 -> disabled by local state check (Thruk DESD)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] b9cf1 -> disabled by local state check (Thruk DESD)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] b9cf1 -> disabled by local state check (Thruk DESD)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Action.AddDefaults] disabled_backends: $VAR1 = {
          '1c326' => 0,
          '1de35' => 0,
          '328c5' => 0,
          '4b7de' => 0,
          '898fc' => 0,
          'adc60' => 0,
          'b9938' => 0,
          'b9cf1' => 1,
          'd8487' => 0,
          'ea363' => 0
        };



The Apache logs gave :
[Sat Mar 21 19:14:10 2015] [warn] [client 172.27.0.81] mod_fcgid: read data timeout in 120 seconds, referer: http://hypervision0/naemon/cgi-bin/status.cgi?newname=&svc_s0_servicestatustypes=20&svc_s0_val_pre=&svc_s0_val_pre=&hst_s0_hostprops=5515306&svc_s0_value_sel=5&svc_s0_value_sel=5&hst_s0_servicestatustypes=31&hidesearch=2&svc_s0_op=!~&svc_s0_op=!~&svc_s0_serviceprops=4466698&svc_s0_hoststatustypes=3&hst_s0_hoststatustypes=4&style=combined&svc_s0_hostprops=5507114&hidetop=1&hst_s0_serviceprops=0&svc_s0_value=DSI_&svc_s0_value=Service%20Check%20Timed%20Out%20On%20Worker%7CReturn%20code%20of%20255%20is%20out%20of%20bounds&title=Vue%20Supervision&title=Vue%20Supervision&svc_s0_type=host&svc_s0_type=plugin%20output
[Sat Mar 21 19:14:10 2015] [error] [client 172.27.0.81] Premature end of script headers: fcgid_env.sh, referer: http://hypervision0/naemon/cgi-bin/status.cgi?newname=&svc_s0_servicestatustypes=20&svc_s0_val_pre=&svc_s0_val_pre=&hst_s0_hostprops=5515306&svc_s0_value_sel=5&svc_s0_value_sel=5&hst_s0_servicestatustypes=31&hidesearch=2&svc_s0_op=!~&svc_s0_op=!~&svc_s0_serviceprops=4466698&svc_s0_hoststatustypes=3&hst_s0_hoststatustypes=4&style=combined&svc_s0_hostprops=5507114&hidetop=1&hst_s0_serviceprops=0&svc_s0_value=DSI_&svc_s0_value=Service%20Check%20Timed%20Out%20On%20Worker%7CReturn%20code%20of%20255%20is%20out%20of%20bounds&title=Vue%20Supervision&title=Vue%20Supervision&svc_s0_type=host&svc_s0_type=plugin%20output
[Sat Mar 21 19:14:17 2015] [warn] mod_fcgid: process 22857 graceful kill fail, sending SIGKILL

Fabrice Le Dorze

unread,
Mar 25, 2015, 4:12:55 PM3/25/15
to th...@googlegroups.com
Hello,
I followed my investigations.
The difference between backends for which there are messages in thruk log like
[2015/03/21 19:12:29][hypervision0][DEBUG]
[Thruk.Backend.Manager] b9cf1 -> disabled by local state check (Thruk DESD)

and those for which there are no messages is that the former are defined in Naemon config with their IP, and the latter with their DNS name.
I just fixed it and .... it worked : the messages appeared in thruk.log and I could simulate a DOWN status so that Thruk disabled the backend.
Very strange.

The only remaining problem is that when a backend is disabled by local state, its button stay in green in backend panel, even if the onmouse text is :
ERROR: peer check via local instance(s) returned state:DOWN
and its hosts and services status dissapear.

I try to figure where to fix it, unsucessfully for the moment.
Any idea ?
Thx
Reply all
Reply to author
Forward
0 new messages