Thruk is slow when a Livestatus backend is unreachable


Fabrice Le Dorze

Nov 30, 2014, 8:38:22 AM11/30/14
to th...@googlegroups.com
I'm using Thruk with 8 Livestatus backends.
When all backends are reachable, it takes around 6 seconds to display all problems.
When one backend is unreachable, it takes more than 20 seconds.
If I deactivate this backend from the GUI, the display takes 6 seconds again.

I tried to use check_local_states=1 and state_host in the backends section. To do that, I have:
- defined host checks for all backends in the local backend (actually Naemon);
- linked them to the remote backends by using their host names as 'state_host' for the corresponding backend in the Thruk config.
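For anyone reading along, a peer definition using state_host might look roughly like this in thruk_local.conf (a sketch only: the peer name, address, and host name below are placeholders, and the exact option placement should be checked against the Thruk backend documentation):

```apache
# Hypothetical sketch -- names and addresses are placeholders.
check_local_states = 1

<Component Thruk::Backend>
    <peer>
        name    = satellite1
        type    = livestatus
        <options>
            peer = 172.27.0.91:6557
        </options>
        # name of the host object in the local core that checks this backend
        state_host = satellite1-host
    </peer>
</Component>
```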

The unreachable backend has its host check DOWN in Naemon.
But it has no effect on display performance: it still takes more than 20 seconds to render the all-problems page, compared to around 6 seconds when all backends are reachable or when the unreachable backend has been disabled.

Any help?
As we really plan to use Thruk at an industrial scale to centralize host and service status from many Nagios-like satellites, display performance is really critical. The unreachability of one single backend cannot be allowed to impact the performance of the whole system.

Thx

Sven Nierlein

Nov 30, 2014, 9:51:24 AM11/30/14
to th...@googlegroups.com
Hi,

could you increase the loglevel to get some details about the local states? It seems like they are not used at all.

Besides that, since you use Naemon already, you could have a look at the "shadownaemon" tool, which shadows a remote
core locally and drastically reduces page load times. I wrote it (or rather am still writing it, since the project is not finished yet)
for a customer where we put over 200 remote instances into a single Thruk display. Response times dropped
from ~50 seconds to less than 5, depending on the kind of page of course.
However, this tool is still in a kind of beta stage right now, and depending on that project's requirements and budget, I cannot
predict anything, and there is not much documentation right now.
Basically all you need to do is to add this to your thruk_local.conf and adjust the paths:

shadow_naemon_dir=/tmp/remotecache
shadow_naemon_bin=/home/sven/git/naemon/naemon-core/naemon/shadownaemon
shadow_naemon_ls=/home/sven/git/naemon/naemon-livestatus/src/.libs/livestatus.so

It works best with the latest git versions of Thruk and Naemon since it's still work in progress.

Sven

Fabrice Le Dorze

Nov 30, 2014, 11:14:05 AM11/30/14
to th...@googlegroups.com, sv...@nierlein.de

Hi Sven
Good point.
Thanks to log inspection, I could figure out that Thruk was using a config file from an old Thruk install.
I will keep an eye on your development for huge numbers of remote instances.
When do you plan to publish a stable release? In the Debian packages?
Thank you.

Sven Nierlein

Nov 30, 2014, 11:44:21 AM11/30/14
to th...@googlegroups.com
On 30/11/14 17:14, Fabrice Le Dorze wrote:
> When do you plan to put in stable release ? in Debian packages ?


There will definitely be a release coming this year.

Fabrice Le Dorze

Dec 13, 2014, 4:54:09 AM12/13/14
to th...@googlegroups.com
Hi Sven.
In the meantime, do you advise using the 'connection_pool_size' option to improve performance?
Thx

Sven Nierlein

Dec 14, 2014, 9:27:41 AM12/14/14
to th...@googlegroups.com
On 13/12/14 10:54, Fabrice Le Dorze wrote:
> Hi Sven.
> In waiting, do you advise to use the 'connection_pool_size' option to improve performance ?
> Thx

The connection pool has been enabled automatically for a couple of versions now. Although it makes
Thruk connect to the backends in parallel, it will still be as slow as the slowest backend.

Sven

Fabrice Le Dorze

Jan 24, 2015, 3:49:55 PM1/24/15
to th...@googlegroups.com
Hi Sven.
I have just upgraded to Naemon 0.9.1 and gave shadownaemon a try.
The first two shadownaemon processes start fine, but the others fail:
[2015/01/24 21:47:20][s-hypervision0][ERROR][Thruk.Utils.Livecache] shadownaemon Veon3 GOP for peer b9938 (172.27.0.93:822) crashed, restarting...
[2015/01/24 21:47:20][s-hypervision0][ERROR][Thruk.Utils.Livecache] shadownaemon Eon SLV for peer ea363 (10.120.35.20:822) crashed, restarting...
[2015/01/24 21:47:20][s-hypervision0][ERROR][Thruk.Utils.Livecache] shadownaemon Veon1 RMS for peer d8487 (172.27.0.91:822) crashed, restarting...
[2015/01/24 21:47:20][s-hypervision0][ERROR][Thruk.Utils.Livecache] shadownaemon Eon TOR for peer 1de35 (172.16.0.101:822) crashed, restarting...
[2015/01/24 21:47:21][s-hypervision0][ERROR][Thruk.Utils.Livecache] shadownaemon Veon2 ADE for peer 1c326 (172.27.0.92:822) crashed, restarting...
[2015/01/24 21:47:21][s-hypervision0][ERROR][Thruk.Utils.Livecache] shadownaemon Eon IUC for peer 4b7de (10.10.11.238:822) crashed, restarting...

What is wrong ?

Fabrice Le Dorze

Mar 1, 2015, 6:31:25 AM3/1/15
to th...@googlegroups.com
I just upgraded to Naemon 1.0.0.
Shadownaemon gives the same result as before.
I don't see what I did wrong:
use_shadow_naemon=1
shadow_naemon_dir=/var/cache/naemon/
shadow_naemon_bin=/usr/bin/shadownaemon
shadow_naemon_ls=/usr/lib/naemon/naemon-livestatus/livestatus.so

Sven Nierlein

Mar 1, 2015, 7:01:07 AM3/1/15
to th...@googlegroups.com
Hi,

Atm there is not much documentation about shadownaemon since it is still an experimental feature.
Is the shadow_naemon_dir created? Does it contain folders for your backends? Do they contain logs
with errors?
You could also enable debug logging, which should print the complete command line used to start
the shadownaemon. Then try to start it manually to see if it works.
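(For reference, and as an assumption on my part worth checking against the Thruk documentation for your version, debug logging can apparently be switched on via the verbose setting in thruk_local.conf:)

```apache
# assumed setting -- verify against your Thruk version's documentation
verbose = 1
```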

Sven


On 01/03/15 12:31, Fabrice Le Dorze wrote:
> I just upgraded to naemon 1.0.0.
> Shadownaemon gives same result as below.
> I don't see what I did wrong :
> use_shadow_naemon=1
> shadow_naemon_dir=/var/cache/naemon/
> shadow_naemon_bin=/usr/bin/shadownaemon
> shadow_naemon_ls=/usr/lib/naemon/naemon-livestatus/livestatus.so

Fabrice Le Dorze

Mar 1, 2015, 8:45:23 AM3/1/15
to th...@googlegroups.com, sv...@nierlein.de

Hi again
No problem, I'm trying it on our lab server. I'm very interested in this functionality as we plan to have more and more backends, some of them with a lot of hosts/services.
So I will run any tests I can to help with debugging.

I suspected a memory shortage.
I have 9 backends plus the local Naemon, but only 2 shadownaemons start.
The memory use of one shadownaemon is around 60 MB. Around 600 MB remain free for the 7 remaining backends, which would need about 420 MB. Am I wrong?

I just cleaned up the shadow_naemon_dir, removing all backend ID dirs, and restarted.
All of them are properly created.

The 2 working ones have a shadownaemon.log like this:
[1425215585] livestatus: Naemon Livestatus 1.0.0-naemon Socket: '/var/cache/naemon//adc60/live'
[1425215585] livestatus: Finished initialization. Further log messages go to /var/cache/naemon/adc60/tmp/livestatus.log
[1425215585] Event broker module '/usr/lib/naemon/naemon-livestatus/livestatus.so' initialized successfully.
[1425215586] TIMEPERIOD TRANSITION: 24x7;-1;0
[1425215586] started caching 91.151.62.65:822 to /var/cache/naemon//adc60/live

The non-working ones have a shadownaemon.log like this:
[1425215585] query failed: 400
query:
---
GET services
---
[1425215585] Error: Could not find a service matching host name 'ADE_01UTA02' and description 'WAN' (config file '/var/cache/naemon/1c326/tmp/objects.cfg', starting on line 473)
[1425215585] Error: Could not expand members specified in servicegroup 'ADE_WAN' (config file '/var/cache/naemon/1c326/tmp/objects.cfg', starting at line 473)
[1425215591] query failed: 400
query:
---

even though the configuration files in /var/cache/naemon/tmp seem to be OK.

The Thruk debug log shows the regular restart attempts for the failed shadownaemons:

[2015/03/01 14:21:59][s-hypervision0][ERROR][Thruk.Utils.Livecache] shadownaemon Veon1 RMS for peer d8487 (172.27.0.91:822) crashed, restarting...
[2015/03/01 14:21:59][s-hypervision0][DEBUG][Thruk.Utils.Livecache] /usr/bin/shadownaemon -d -i 172.27.0.91:822 -o /var/cache/naemon//d8487 -l /usr/lib/naemon/naemon-livestatus/livestatus.so >> /var/cache/naemon//d8487/tmp/shadownaemon.log 2>&1

I tried it manually:
/usr/bin/shadownaemon -v -d -i 172.27.0.91:822 -o /var/cache/naemon//d8487 -l /usr/lib/naemon/naemon-livestatus/livestatus.so

which gives no output, and no new shadownaemon process is created.

Sven Nierlein

Mar 1, 2015, 9:44:27 AM3/1/15
to th...@googlegroups.com
Yes, each shadownaemon process requires some memory to run. 50-100 MB per process is a good value to calculate with.
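(A quick back-of-the-envelope sketch of that sizing rule, in plain Python, using the figures from this thread:)

```python
def shadownaemon_memory_mb(n_backends, mb_per_process=100):
    """Estimate total memory needed for one shadownaemon per backend.

    50-100 MB per process is the rule of thumb from this thread;
    the default uses the pessimistic end of that range.
    """
    return n_backends * mb_per_process

print(shadownaemon_memory_mb(9))      # 900 MB worst case for 9 backends
print(shadownaemon_memory_mb(7, 60))  # 420 MB for 7 backends at 60 MB each
```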

The "crash" looks like the autogenerated config is wrong somehow. Each shadownaemon pulls in all hosts and services
from the remote site via livestatus and writes out a dummy Naemon object config. It then tries to start with those
objects. It seems like this fails in some cases here.
Could you investigate and verify whether the hosts, services, and servicegroups mentioned in the error message are
created correctly?
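One way to check that directly against the remote backend (a sketch of my own, not part of shadownaemon; the example address and service names are taken from the error messages earlier in this thread) is to send a Livestatus query over the TCP socket and see whether the service comes back:

```python
import socket

def build_query(host_name, description):
    """Build an LQL query asking for one specific service."""
    return ("GET services\n"
            f"Filter: host_name = {host_name}\n"
            f"Filter: description = {description}\n"
            "Columns: host_name description\n\n")

def service_exists(address, port, host_name, description, timeout=5.0):
    """True if the remote livestatus socket returns at least one matching row."""
    with socket.create_connection((address, port), timeout=timeout) as sock:
        sock.sendall(build_query(host_name, description).encode())
        sock.shutdown(socket.SHUT_WR)  # livestatus answers once the query input ends
        data = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return bool(data.strip())

# e.g.: service_exists("172.27.0.92", 822, "ADE_01UTA02", "WAN")
```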



Fabrice Le Dorze Free

Mar 2, 2015, 9:44:19 AM3/2/15
to sv...@nierlein.de, th...@googlegroups.com
Hi again
I'm debugging the temporary config files with naemon -v /var/cache/naemon/<backend_id>/tmp/naemon.cfg
It reveals something wrong with the servicegroups.
Example: on the original backend (an old EyesOfNetwork with the Lilac configurator), I have:

define servicegroup {
        servicegroup_name       RMS_MYSQL
        alias   RMS / All Mysql services
}


and the membership is configured in service :
define service {
        host_name       RMS_veon2
        service_description     MYSQL_REPLICATION
        initial_state   o
        is_volatile     0
        max_check_attempts      3
        normal_check_interval   10
        retry_interval  4
        active_checks_enabled   1
        passive_checks_enabled  0
        check_period    _always
        parallelize_check       1
        obsess_over_service     0
        check_freshness 0
        event_handler_enabled   0
        flap_detection_enabled  0
        process_perf_data       1
        retain_status_information       1
        retain_nonstatus_information    1
        notification_interval   12
        notification_period     RMS_notification-24x7
        notifications_enabled   0
        failure_prediction_enabled      0
        action_url      http://veon1.inf.rms.loc/pnp4nagios/index.php/graph?host=$HOSTNAME$&srv=$SERVICEDESC$' class='tips' rel='/pnp4nagios/index.php/popup?host=$HOSTNAME$&srv=$SERVICEDESC$
        display_name    MYSQL_REPLICATION
        check_command   GENERIC_DIRECT_MYSQL-REP!!-s 172.27.4.92 -u replication -p ReP4Mi5qL@rMs!-w 500 -c 1000
        notes_url       https://wiki.rms.loc/MysqlDepannage
        notification_options    w,c,r
        flap_detection_options  o,w,u,c
        stalking_options        o,w,c
        contact_groups  notifiers
        servicegroups   RMS,RMS_MYSQL
}


But on the shadownaemon side, I have:
define servicegroup {
        servicegroup_name       RMS_MYSQL
        alias                   RMS / All Mysql services
        members                 RMS_veon2,MYSQL_REPLICATION
}



Which of course generates errors:
Error: Could not find a service matching host name 'RMS_veon2' and description 'MYSQL_REPLICATION' (config file '/var/cache/naemon/d8487/tmp/objects.cfg', starting on line 709)
Error: Could not expand members specified in servicegroup 'RMS_MYSQL' (config file '/var/cache/naemon/d8487/tmp/objects.cfg', starting at line 709)

Sven Nierlein

Apr 26, 2015, 9:13:36 AM4/26/15
to fabrice...@free.fr, th...@googlegroups.com


Hi,

I cannot see any obvious mistakes. What's wrong with that configuration?

Cheers,
Sven

Fabrice Le Dorze

Apr 26, 2015, 3:58:13 PM4/26/15
to Sven Nierlein, th...@googlegroups.com
Hi Sven
You are right.
Indeed, for some backends there are no services generated in the
shadownaemon files, causing their shadownaemon to crash, and I believe I
know why: the livestatus version. These failing backends are old
EyesOfNetwork servers with livestatus 1.1.*. I just upgraded one of them
to livestatus 1.2.4 and it is working fine.

Do you confirm that shadownaemon requires at least a 1.2.* version of
livestatus?

I was able to test the responsiveness of Thruk with shadownaemons:
around 5 seconds for every kind of search, instead of 15-20 seconds
without them. Well done.
