Hello
Yesterday, we had a incident with our multiple backends Thruk. (naemon 1.0.0)
Our Monitoring team could log but then got an 'internal server error" on left menu. They were totally blind, no action was possible.
After investigation, I save their life by deleting an unhealthy backenf from thruk config.
We use local states so that (I thought) Thruk is not stalled or slow when a backend is unavailable or unhealthy.
In the beginning, we just used a tcp_ping check on the livestatus port (TCP 822 for us) to check backends availiblity.
But it revealed unsufficient : few weeks ago, one backend was still responding on port TCP 822 but was unable answer to a socat request.
Naemon was still considering it as available.
The result was timeouts in Thruk. Hopefully, our Monitoring team was able to disable this backend but clicking on its button.
But this time, no way.
To fix it, we have developped a plugin to check more precisely the health of backends :
1- a tcp_ping to check that backend can answer on TCP 822 in reasonable delays
2- a 'GET hosts' livestatus request with socat with a timout, a threshold for response time and a threshold for number of hosts in socat answer.
But I fails again to prevent Thruk fails.
We checked the Thruk server health and logs, see extract below.
The consumed memory, usually around 40%, rised to over 97%.
Naemon sees sometimes 2 backends DOWN :
'Thruk DESD'
'Veon2 ADE', with sometimes flapping state
As expected, we see messages like "b9cf1 -> disabled by local state check (Thruk DESD)" : this backend is properly disabled when Naemon sees it down.
But in thruk log, no such messages for "Veon2 ADE". Indeed, there are no messages concerning this backend at all.
In fact, there are no messages for 3 backends.
That's trouble me : in normal conditions, all 9 backends are working fine.
But why can I see messages in Thruk log for only 6 of them ?
Why the local state for "Veon2 ADE" is not apparently taken in account by Thruk ?
Is there a effect of flapping state ?
Does it explain the failing Thruk interface ?
We absolutely need to insure that the whole system does not collapse when only one backend is unhealty.
The Naemon logs gave :
...
[2015/03/21 19:11:42] HOST ALERT: Thruk DESD;DOWN;SOFT;5;(Host check timed out after 30.04 seconds)
[2015/03/21 19:13:36] HOST ALERT: Thruk DESD;DOWN;SOFT;6;Socat slow to get result (>20)
[2015/03/21 19:13:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;1;host number 0 <10. Strange.
[2015/03/21 19:15:42] HOST ALERT: Thruk DESD;DOWN;SOFT;7;(Host check timed out after 30.04 seconds)
[2015/03/21 19:15:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;2;host number 0 <10. Strange.
[2015/03/21 19:17:25] HOST ALERT: Thruk DESD;UP;SOFT;8;TCP OK - 0,025 second response time on 147.98.136.50 port 822 , 2729 hosts in 11s
[2015/03/21 19:17:49] HOST ALERT: Veon2 ADE;DOWN;SOFT;3;host number 0 <10. Strange.
[2015/03/21 19:19:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;4;host number 0 <10. Strange.
[2015/03/21 19:20:42] HOST ALERT: Thruk DESD;DOWN;SOFT;1;(Host check timed out after 30.03 seconds)
[2015/03/21 19:21:50] HOST ALERT: Veon2 ADE;DOWN;SOFT;5;host number 0 <10. Strange.
[2015/03/21 19:22:24] HOST ALERT: Thruk DESD;UP;SOFT;2;TCP OK - 0,025 second response time on 147.98.136.50 port 822 , 2729 hosts in 9s
[2015/03/21 19:23:41] HOST ALERT: Thruk DESD;DOWN;SOFT;1;Socat slow to get result (>20)
[2015/03/21 19:23:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;6;host number 0 <10. Strange.
[2015/03/21 19:25:11] HOST ALERT: Eon CMDG;DOWN;SOFT;1;(Host check did not exit properly)
[2015/03/21 1l9:25:27] HOST ALERT: Thruk DESD;UP;SOFT;2;TCP OK - 0,029 second response time on 147.98.136.50 port 822 , 2729 hosts in 12s
[2015/03/21 19:25:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;7;host number 0 <10. Strange.
[2015/03/21 19:26:12] HOST ALERT: Eon IUC;DOWN;SOFT;1;(Host check did not exit properly)
[2015/03/21 19:26:14] HOST ALERT: Thruk DESD;DOWN;SOFT;1;(Host check did not exit properly)
[2015/03/21 19:26:32] HOST ALERT: Eon BAG;DOWN;SOFT;1;(Host check did not exit properly)
[2015/03/21 19:26:56] HOST ALERT: Eon SLV;DOWN;SOFT;1;(Host check did not exit properly)
[2015/03/21 19:27:14] HOST ALERT: Eon CMDG;UP;SOFT;2;TCP OK - 0,021 second response time on 91.151.62.65 port 822 , 87 hosts in 1s
[2015/03/21 19:27:51] HOST ALERT: Veon2 ADE;DOWN;SOFT;8;host number 0 <10. Strange.
[2015/03/21 19:28:12] HOST ALERT: Eon CMDG;DOWN;SOFT;1;(Return code of 254 is out of bounds)
[2015/03/21 19:28:16] HOST ALERT: Eon IUC;UP;SOFT;2;TCP OK - 1,031 second response time on 10.10.11.238 port 822 , 226 hosts in 2s
[2015/03/21 19:28:36] HOST ALERT: Eon BAG;UP;SOFT;2;TCP OK - 0,029 second response time on 10.168.4.200 port 822 , 110 hosts in 3s
[2015/03/21 19:28:44] HOST ALERT: Thruk DESD;DOWN;SOFT;2;(Host check timed out after 30.04 seconds)
[2015/03/21 19:28:57] HOST ALERT: Eon SLV;UP;SOFT;2;TCP OK - 0,055 second response time on 10.120.35.20 port 822 , 29 hosts in 0s
[2015/03/21 19:29:48] HOST ALERT: Veon2 ADE;DOWN;SOFT;9;host number 0 <10. Strange.
[2015/03/21 19:30:12] HOST ALERT: Eon CMDG;DOWN;SOFT;2;(Host check did not exit properly)
[2015/03/21 19:30:15] HOST ALERT: Thruk DESD;DOWN;SOFT;3;(Host check did not exit properly)
[2015/03/21 19:31:49] HOST ALERT: Veon2 ADE;DOWN;HARD;10;host number 0 <10. Strange.
[2015/03/21 19:32:14] HOST ALERT: Eon CMDG;UP;SOFT;3;TCP OK - 0,024 second response time on 91.151.62.65 port 822 , 87 hosts in 1s
[2015/03/21 19:32:27] HOST ALERT: Thruk DESD;UP;SOFT;4;TCP OK - 0,025 second response time on 147.98.136.50 port 822 , 2729 hosts in 11s
[2015/03/21 19:33:43] HOST ALERT: Thruk DESD;DOWN;SOFT;1;Socat slow to get result (>20)
[2015/03/21 19:35:45] HOST ALERT: Thruk DESD;DOWN;SOFT;2;(Host check timed out after 30.01 seconds)
....
The Thruk logs shows :
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Action.AddDefaults] checking auth
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Action.AddDefaults] user not authenticated yet
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Action.AddDefaults] user authenticated as: rsmsuper
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Action.AddDefaults] _set_enabled_backends() by cookie
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 898fc -> enabled by local state check (Eon BAG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] adc60 -> enabled by local state check (Eon CMDG)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 4b7de -> enabled by local state check (Eon IUC)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 4b7de -> enabled by local state check (Eon IUC)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] ea363 -> enabled by local state check (Eon SLV)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 1de35 -> enabled by local state check (Eon TOR)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 1de35 -> enabled by local state check (Eon TOR)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 1de35 -> enabled by local state check (Eon TOR)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 1de35 -> enabled by local state check (Eon TOR)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] 1de35 -> enabled by local state check (Eon TOR)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] b9cf1 -> disabled by local state check (Thruk DESD)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] b9cf1 -> disabled by local state check (Thruk DESD)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Backend.Manager] b9cf1 -> disabled by local state check (Thruk DESD)
[2015/03/21 19:12:29][hypervision0][DEBUG][Thruk.Action.AddDefaults] disabled_backends: $VAR1 = {
'1c326' => 0,
'1de35' => 0,
'328c5' => 0,
'4b7de' => 0,
'898fc' => 0,
'adc60' => 0,
'b9938' => 0,
'b9cf1' => 1,
'd8487' => 0,
'ea363' => 0
};
The Apache logs gave :
[Sat
Mar 21 19:14:10 2015] [warn] [client 172.27.0.81] mod_fcgid: read data
timeout in 120 seconds, referer:
http://hypervision0/naemon/cgi-bin/status.cgi?newname=&svc_s0_servicestatustypes=20&svc_s0_val_pre=&svc_s0_val_pre=&hst_s0_hostprops=5515306&svc_s0_value_sel=5&svc_s0_value_sel=5&hst_s0_servicestatustypes=31&hidesearch=2&svc_s0_op=!~&svc_s0_op=!~&svc_s0_serviceprops=4466698&svc_s0_hoststatustypes=3&hst_s0_hoststatustypes=4&style=combined&svc_s0_hostprops=5507114&hidetop=1&hst_s0_serviceprops=0&svc_s0_value=DSI_&svc_s0_value=Service%20Check%20Timed%20Out%20On%20Worker%7CReturn%20code%20of%20255%20is%20out%20of%20bounds&title=Vue%20Supervision&title=Vue%20Supervision&svc_s0_type=host&svc_s0_type=plugin%20output[Sat
Mar 21 19:14:10 2015] [error] [client 172.27.0.81] Premature end of
script headers: fcgid_env.sh, referer:
http://hypervision0/naemon/cgi-bin/status.cgi?newname=&svc_s0_servicestatustypes=20&svc_s0_val_pre=&svc_s0_val_pre=&hst_s0_hostprops=5515306&svc_s0_value_sel=5&svc_s0_value_sel=5&hst_s0_servicestatustypes=31&hidesearch=2&svc_s0_op=!~&svc_s0_op=!~&svc_s0_serviceprops=4466698&svc_s0_hoststatustypes=3&hst_s0_hoststatustypes=4&style=combined&svc_s0_hostprops=5507114&hidetop=1&hst_s0_serviceprops=0&svc_s0_value=DSI_&svc_s0_value=Service%20Check%20Timed%20Out%20On%20Worker%7CReturn%20code%20of%20255%20is%20out%20of%20bounds&title=Vue%20Supervision&title=Vue%20Supervision&svc_s0_type=host&svc_s0_type=plugin%20output[Sat Mar 21 19:14:17 2015] [warn] mod_fcgid: process 22857 graceful kill fail, sending SIGKILL