Thruk for large environments


nuriel....@clustervision.com

Jul 30, 2013, 3:17:44 AM
to th...@googlegroups.com
Hi Sven,

First I would like to compliment Thruk; it is a great initiative and a great product, and it uses Catalyst!

I am running Thruk on five remote instances and one central instance which lets me view and control the remote instances.
I am using Nagios with the MK Livestatus broker.

Some of the remote instances are monitoring up to 3500 services altogether with a total of 500 hosts.
I am using a VPN to have all the instances "see" each other (I realize this might add some overhead, but not significant).
The sites are all located in Europe.


I have a few questions, I hope you have the time to have a look.


1. I made a quick attempt to deploy Thruk using plackup and the Starman server, hoping it would do two things:
 a. increase speed
 b. allow me to load (at least) some Perl libraries on the command line when starting Starman (this should help speed things up a little).

Would it be possible to help me deploy Thruk as a standalone server, and not via FCGI through Apache as is done now?


2. I am wondering about the need to use the live broker. I have not gone deep into the code of Thruk (yet), but I wonder about the following:
  a. extracting data from status.dat seems to be very fast
  b. writing commands into nagios.cmd seems to be as fast as it can get
Why, then, is there use of the live broker?
For example (this has been tested): read data from status.dat and return it as JSON to the remote requesting end.
Reading out status.dat (which at my largest site is 7MB in size) takes about 0.40 milliseconds.
Sending this over HTTP would also be very quick.

Why, then, does it take Thruk a significant amount of time to communicate with remote back ends? Is it because it needs to load all its Thruk/Perl libraries and run "1001" checks before handling a request? (I am just asking.)

3. Could the interface be more of an Ajax type? I know this would require more JS work, as one would need to parse the table and update data for each Ajax call that returns, but this should not be complicated to do.

4. If any of the above makes sense, I would also be willing to help code (I have some basic proficiency with Catalyst / Perl / jQuery ).

5. Could this be a future plan: CatalystX::SimpleLogin ?

Thanks again for a great product!

Nuriel Shem-Tov

Sven Nierlein

Jul 30, 2013, 4:13:20 AM
to th...@googlegroups.com
Hi Nuriel,

thanks :-) See my answers inline:

On 7/30/13 9:17, nuriel....@clustervision.com wrote:
> 1. I had a quick attempt to deploy Thruk using Plackup and Starman server, with the hopes it would do two things:
> a. increase speed
> b. allow me to load (at least) some Perl libraries on the command line when loading Starman (this should help speed up things a little).
>
> Would it be possible to help me deploy Thruk as a standalone server and not fcgi via Apache as it is done now?

Catalyst uses Plack already, so this should be possible somehow. What do you expect from that? I don't think it's faster; once
the Thruk process is running, it shouldn't make any difference. There is also a 'thruk' command line utility already. If you use the
packages, it is installed in /usr/bin/thruk, or scripts/thruk for source installations.



> 2. I am wondering about the need to use the live broker. I have not gone deep into the code of Thruk (yet), but I wonder about the following:
> a. extracting data from status.dat seems to be very fast
> b. writing commands into nagios.cmd seem to be as fast as they can get
> Why then, is there use of live broker?
> For example, (this has been tested): Read data from status.dat and return it as JSON to the remote-requesting end.
> To read out status.dat (which in my largest site is 7MB in size) takes about 0.40 milliseconds.
> Sending this over HTTP would also be very quick.
>
> Why then, does it take Thruk a significant amount of time to communicate with remote back ends? Is it because it needs to load all its Thruk/Perl libraries and run a "1001" checks before handling a request? (I am just asking)

Livestatus is definitely much faster than parsing status.dat all the time. That's one reason why Thruk is faster than the original CGIs when it comes to larger installations. Next, status.dat always lags behind real time because it is only written every x seconds, whereas Livestatus is a broker module and has access to the real live data. This
also makes it possible to see which host/service is actually running its check. You would have to parse the whole status.dat for every request, whereas Livestatus directly reads the data for the requested hosts and services (which reduces network traffic a lot).
In addition, Livestatus offers a sane way to access logfiles, status data and the command file remotely through one interface. Otherwise you would have to implement all the
filtering and statistics logic yourself.
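To make the contrast concrete, here is a minimal sketch of how a client might talk to Livestatus. The query language lines ("GET", "Columns:", "Filter:") follow the Livestatus Query Language, but the socket path and the specific columns below are assumptions for illustration, not taken from this thread:

```python
import socket

def build_lql(table, columns=None, filters=None):
    """Build a Livestatus Query Language (LQL) request string."""
    lines = ["GET %s" % table]
    if columns:
        lines.append("Columns: " + " ".join(columns))
    for f in (filters or []):
        lines.append("Filter: " + f)
    # OutputFormat json lets the client parse the reply with any JSON library
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"

def query_livestatus(sock_path, query):
    """Send an LQL query over the Livestatus unix socket and return the raw reply."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(sock_path)
        s.sendall(query.encode("utf-8"))
        s.shutdown(socket.SHUT_WR)  # signal end of request
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode("utf-8")
    finally:
        s.close()

# Example: fetch state and output for all non-OK services -- only the matching
# rows cross the wire, instead of the whole 7MB status.dat
q = build_lql("services",
              columns=["host_name", "description", "state", "plugin_output"],
              filters=["state != 0"])
print(q)
```

The point of the sketch is the filtering: Livestatus evaluates "Filter:" lines server-side against live memory, so the client never parses a status file at all.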



> 3. Could the interface be more of a ajax type? I know this would require more JS time - as one would need to parse the table and update data for each ajax call that returns, but this should not be complicated to do.
Thruk uses Ajax where it makes sense, for example in the search field. What do you expect from using more Ajax?


> 5. Could this be a future plan: CatalystX::SimpleLogin ?
Again, what would you expect from that? Apache does a great job at authentication and offers a wide range of authentication modules. Doing that in Perl, you
would have to implement basic auth, MySQL, LDAP(S), Kerberos, etc., and there are even companies which have written their own Apache SSO authentication modules.
Besides that, Thruk supports a basic auth popup or a form based login which still makes use of all available Apache auth modules; I guess this covers almost
any use case.

Sven

nuriel....@clustervision.com

Jul 30, 2013, 5:47:02 AM
to th...@googlegroups.com
Hi Sven,

Thanks for the prompt reply.

I am replying to your email and also added a few questions I have, I hope you don't mind.



- About running Thruk as a standalone server:

What I would expect is a speed gain (don't we all :-)).
The question about running Thruk standalone comes up because, in my experience, Catalyst applications run faster on plackup/Starman, especially when pre-loading some heavy libraries ( -MSome::Class etc... ).
Another benefit is that one could run Thruk without Apache. This would suit the "central" Thruk instance (where I don't even have Nagios installed, so I don't really need a web server; alternatively I could use nginx if I really need a lighter web server, SSL termination, etc.).
A disadvantage of a standalone setup would be that Thruk needs to handle its own authentication (a large coding overhead), but on the other hand it could add some nice RBAC and ACL functionality.




- Status.dat vs live broker:

I have no experience with Livestatus as a developer. But I understand what you say: if it also provides the flexibility, methods and all the helpful stuff for a developer, I would probably benefit from having a look at it myself (as soon as I have more time, and as long as it offers the same or better performance compared to the current method I am using to parse status.dat).

I honestly have no idea how long it would take MK Livestatus to produce this data, and I am not sure whether it would consist of the same data, but it seems to me that status.dat has all the live status data. (Apologies if I am making a wrong assumption; I still haven't had a profound look at MK Livestatus.)

I am posting the results to serve as an example only; again, I have no idea how long MK Livestatus would take.
For a 7.0M file (500 hosts, 3500 services), I get the following results (hardware specified at the bottom of this email):

ls -lh /usr/share/nagios/var/spool/status.dat
-rw-r--r-- 1 nagios nagios 7.0M Jul 30 10:02 /usr/share/nagios/var/spool/status.dat
----------------------------------------------------------------------------------------------------------------------
Read Nagios file: 1.12017 wallclock secs ( 0.98 usr +  0.11 sys =  1.09 CPU)
Process Nagios file:: 0.906096 wallclock secs ( 0.89 usr +  0.01 sys =  0.90 CPU)
Evaluate status data: 0.015445 wallclock secs ( 0.01 usr +  0.00 sys =  0.01 CPU)
Evaluate status data: 0.189962 wallclock secs ( 0.18 usr +  0.01 sys =  0.19 CPU)
-----------------------------------------------------------------------------------------------------------------------

I parse the data into a Perl hash, end result looks somewhat like this:

panasas02 => {
    services => {
      Check_SSH => {
        active_checks_enabled => 1,
        check_command => "check_ssh!22",
        check_execution_time => "0.021",
        check_interval => "5.000000",
        current_attempt => 1,
        current_state => 0,
        is_flapping => 0,
        last_check => "1375175023",
        last_update => "1375175295",
        max_attempts => 3,
        percent_state_change => "0.00",
        plugin_output => "SSH OK - OpenSSH_5.1p1 FreeBSD-20080901 (protocol 2.0)",
        problem_has_been_acknowledged => 0,
        service_description => "Check SSH"
      }
    },
    status => {
      active_checks_enabled => 1,
      check_command => "check-host-alive",
      check_execution_time => "0.010",
      check_interval => "2.000000",
      current_attempt => 1,
      current_state => 0,
      is_flapping => 0,
      last_check => "1375175244",
      last_update => "1375175295",
      max_attempts => 3,
      percent_state_change => "0.00",
      plugin_output => "OK - 10.133.246.244: rta 0.270ms, lost 0%",
      problem_has_been_acknowledged => 0,
      service_description => "HOST"
    }
  },

The reason I am spending a significant amount of time pondering performance is that I am experiencing slow performance and am trying to battle it.
The hardware used on most backends is Dell 720 servers with dual Intel(R) Xeon(R) E5-2609 CPUs @ 2.40GHz and 32GB memory; the storage is probably among the fastest on the market (Panasas, multiple shelves), so I don't think the hardware is a bottleneck concern (and system load is generally low).

I am trying to think about what will happen when I add five more sites. Performance currently degrades significantly with each remote site added. By the end of the year I will probably have 15 sites altogether, some of which will be very large installations.





- Logcache MySQL - local, remote, or both? Timeout values might be non-realistic...:

Yesterday I took the time to run the thruk command-line tool and use MySQL for the logcache.
Each site took about 60-70 minutes to run this process (about a month's worth of data per site).
This was done locally on the remote sites. Today I wanted to let the "central" Thruk do the same.
I am wondering whether there is a benefit in having the "central" instance do this for its remote back-ends, because the timeout value specified in lib/Thruk/Backend/Provider/HTTP.pm is 100 seconds, which is out of range: if the command-line process takes 60-70 minutes to run locally, there is no chance of log-caching a remote site.
I did set the value to 4200 seconds, but the command-line utility seems to kill itself after approx. 5 minutes.

Help on this issue is most appreciated and welcome, unless you think it is sufficient that I've updated the cache locally on the remote instances?




- Nagios latency? From time to time - commands being ignored.

It happens from time to time, especially on the larger remote sites, that when I add a comment or execute a command, I get "command successful" from Thruk, but I never see it being processed by Nagios on the remote site (external command interval = 15s).
Is this a Nagios latency related issue? Is Nagios ignoring external commands? (Or are the commands submitted via the live broker?)
I don't understand what might be causing this problem. It does not happen when I issue the command from a command-line utility I made for Nagios, which writes directly to nagios.cmd.



- Call backs for remote command execution?

Would a callback be an interesting option for running remote commands?
For example: send a command or set of commands to a remote backend and wait for a 200 OK code (command received, format okay, auth okay, etc.). The call ends here; it does not wait for the backend to process the command.
Afterwards, a callback from the remote site to the "initiator" takes place, containing confirmation that the command or set of commands has been processed successfully by Nagios (not hard to determine), or reporting any command-execution errors.
On the JavaScript side, an additional timer can run Ajax calls to check whether the callback has returned, and then display "Command executed successfully".
At the moment, it seems the interface waits for quite a long while before returning "command successful", and this is done blocking, not async.
A callback could battle this, or such commands could run async without occupying the rest of the interface. This is what I normally do when building a user control interface: let the commands run async in the "background" (with some Ajax spinner) and keep the interface accessible. (Just an idea.)



I hope you don't mind the suggestions, and I hope you can help me out with the performance issues I am experiencing.

I appreciate your time, 

Thanks in advance,
Nuriel Shem-Tov

Sven Nierlein

Jul 30, 2013, 7:34:20 AM
to th...@googlegroups.com
On 7/30/13 11:47, nuriel....@clustervision.com wrote:
> - About running Thruk as a standalone server:
>
> What I would expect is speed gain (don't we all :-))
> The question about running Thruk as a standalone comes because it was to my experience that Catalyst applications run faster on plackup/starman especially when pre-loading some heavy libraries ( -MSome::Class etc... ).

Sounds like Starman could reduce startup time, but that's usually just the first request after an Apache restart, and there is a thruk init script, so your users won't notice it. But I would be happy if you shared your results.


> Another benefit is that one could run Thruk without Apache. This scenario could happen on the "central" Thruk instance (where I don't even have Nagios installed, so I don't really need any web-server. Alternatively could use nginx if really need a lighter webserver, ssl termination etc.).
> A disadvantage of a standalone would be that Thruk needs to handle its own authentication - large coding overhead, but on the other hand could add some nice RBAC and ACL functionality.

There are guides for getting Thruk running with lighttpd, so it should be possible with nginx as well. A standalone installation can be done with the development server, but that's not recommended; it is definitely slower than an fcgid setup with Apache, because in such a setup Apache serves the static content.
Look at it the other way round: you don't need Apache for Nagios at all, but you save a lot when using a web server with Thruk.
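For anyone who wants to experiment with a Starman deployment anyway, the invocation would look roughly like the sketch below. The .psgi path, port, and worker count are assumptions; check where your Thruk package actually installs its PSGI entry point before trying this:

```shell
# Sketch only: path and options are assumptions, not from this thread.
# Static content would still be best served by a real web server in front.
plackup -s Starman \
        --workers 8 \
        --listen :5000 \
        -E deployment \
        /usr/share/thruk/script/thruk.psgi
```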


> - Status.dat vs live broker:
>
> I have no experience with live status as a developer. But I can understand what you say, if it also provides the flexibility, methods and all the helpful stuff for a developer, I would probably also benefit from having a look at it myself...(as long as I have more time and it can offer the same, or better performance comparing to the current method I am using to parse status.dat).
>
> I honestly have no idea how long it would take mklive status to parse this data, and I am not sure if it would consist of the same data, but it seems to me that status.dat has all the live status data. (apologies if I am making a wrong assumption, I still haven't had a profound look at mklive).

Livestatus does not parse the data at all; that's why it is so fast. It runs in the same process as Nagios itself and therefore directly accesses runtime data from the Nagios memory. It's like turning the Nagios process into a database. But Thruk supports different backend types, and there are already some besides Livestatus. It would be cool to have a status.dat backend type, which would make it possible to use Thruk without Livestatus too, although I think this only makes sense for smaller setups.


> I am trying to think what will happen when I will add five more sites. The current trend of performance is degrading significantly for each remote site added. At the end of the year I will probably have 15 sites altogether, some will be very large installations.

Sounds like you are not using the connection pool, which accesses all backends in parallel. Listing multiple backends should then only be as slow as the slowest backend. That way
we have customers with more than 60 backends connected to one Thruk instance. Try to find out where the performance is lost; you can enable debug mode by exporting THRUK_DEBUG=1.
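A minimal thruk_local.conf sketch for the pooling and debugging mentioned above might look like this. The setting name is taken from my reading of the Thruk configuration documentation, not from this thread, so verify it against your installed version:

```
# thruk_local.conf -- sketch, verify setting names against your Thruk version
connection_pool_size = 5     # number of worker threads querying backends in parallel
```

Debug output can then be enabled per-shell with `export THRUK_DEBUG=1` before starting Thruk, as Sven describes.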



> - Logcache MySQL - local, remote, or both? Timeout values might be non-realistic...:
>
> Yesterday I took time to run the thruk command-line and use MySQL for the logcache.
> Each site took about 60-70 minutes to run this process (about a month worth of data per site).
> This was done locally on the remote sites. Today I wanted to let the "central" Thruk do the same.
> I am wondering if there will be a benefit by having the "central" do this for its remote back-ends? Because, the timeout value specified in lib/Thruk/Backend/Provider/HTTP.pm states 100 seconds, which is out of range - if the command-line process takes 60-70 minutes to run locally there is no chance of logcaching a remote site?).
> I did set the value to 4200 seconds, but the command-line utility seems to kill itself after approx. 5 minutes.
>
> Help on this issue is most appreciated and welcome, unless you think it is sufficient that I've updated the cache locally on the remote instances?

In larger setups we use the logcache for every Thruk instance, especially for remote instances. It makes sense wherever you access logfiles through Thruk, e.g. for reports.
That sounds strange; the import is usually split into 24h packages to avoid exactly that. When importing from the command line, you can make that time frame smaller,
e.g.: thruk -a logcacheimport=3600


> - Nagios latency? From time to time - commands being ignored.
>
> It happens, from time to time, especially on the larger remote sites, that I add a comment, or execute a command, I get command successful from Thruk but I never see it being processed by Nagios on the remote site. (external command interval=15s).
> Is this a Nagios latency related issue? Is Nagios ignoring external commands? (or the commands are done via the live broker?)
> I don't understand what might be causing this problem. This does not happen if I issue the command from a command-line utility I made for Nagios which directly to the nagios.cmd).

It is generally a good idea to keep the external command check interval at a small value, 1s or -1. Commands are submitted via Livestatus, but Nagios will only read them during the external command check. Thruk waits up to 10 seconds to see whether the command is processed, but it cannot guarantee what Nagios does with the command.
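For reference, the external commands discussed here are single lines of the form `[<epoch>] COMMAND;arg1;arg2;...` appended to the nagios.cmd named pipe, which is what Livestatus does on Thruk's behalf. A minimal sketch (the pipe path and the specific command used are illustrative assumptions):

```python
import time

def format_external_command(command, *args):
    """Format one Nagios external command line: '[<epoch>] CMD;arg1;arg2'."""
    ts = int(time.time())
    parts = [command] + [str(a) for a in args]
    return "[%d] %s\n" % (ts, ";".join(parts))

def submit_command(cmd_file, line):
    """Append one command line to the nagios.cmd named pipe."""
    # The command file is a FIFO; Nagios drains it during its
    # external command check, which is why a 15s check interval
    # delays (or appears to drop) submitted commands.
    with open(cmd_file, "w") as fh:
        fh.write(line)

# Example: schedule a forced service check right now (host/service names assumed)
line = format_external_command("SCHEDULE_FORCED_SVC_CHECK",
                               "panasas02", "Check_SSH", int(time.time()))
```

Nagios never acknowledges these writes, which is exactly why Thruk can only poll for up to 10 seconds to see whether the command took effect.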



> - Call backs for remote command execution?
>
> Would a callback be an interesting option for running remote commands?
> For example: sending a command or set of commands to a remote backend, wait to get a 200 OK code (command received, format okay, auth okay etc etc)... The call ends here - the call does not wait for the backend to process the command.
> As an alternative, a callback from the remote site to the "initiator" will take place, which will contain: confirmation that the command or set of commands have been processed successfully by Nagios (not hard to determine) or report any command-execution errors.
> When the callback returns this data, there can be an additional timer set on the javascript to run ajax calls to check for call backs return and the like, and then display the "Command executed successfully".
> At the moment, it seems the interface is waiting for quite a long while and then returns "command successful", and this is not done async, but blocking.
> Call back could battle this, or running such command in async not occupying the rest of the interface. Just an idea (this is what I normally do if I built a user control interface - let the commands run async in the "background" (or some ajax spinner) and let the interface remain accessible. (Just an idea).

As mentioned above, this shouldn't be a problem if you lower the external command check interval. But you can also disable the "waiting feature" in your thruk_local.conf, or lower the time Thruk waits.
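A sketch of what that thruk_local.conf change might look like; the setting names below are my recollection of the Thruk configuration documentation, not confirmed in this thread, so check them against your version's docs:

```
# thruk_local.conf -- sketch, verify setting names against your Thruk version
use_wait_feature = 0    # do not block waiting for command confirmation
wait_timeout     = 10   # seconds to wait when the feature is enabled
```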

Sven

nuriel....@clustervision.com

Jul 30, 2013, 8:21:25 AM
to th...@googlegroups.com
Hi Sven,

I will share results if I get it working with Starman.
I made an attempt about a week ago, but didn't have much time to troubleshoot, so I gave up on it.
I will look for documentation about running Thruk with lighttpd and even try it with nginx; it would be nice to be able to adapt.
If I do manage to get it running with Starman, I will share the results.
In any case, you certainly have a point about the static content being served much faster by a web server.
The same is true for applications I've made with Catalyst: I let the web server serve the static content (when present).

I did miss the connection pool. I have now enabled it for the number of backends I am using.
I will report if I notice improvements in performance.
Will also try the debug flag to get more information.

Regarding the logcache update: it doesn't time out when I run it locally (even on large installations, which took up to 70 minutes to process 30 days).
I did notice it updates data in 24h packages, but on the large installations, when running the tool remotely, it didn't even start doing anything beyond saying "running update for site XXXX...". And then it dies ("Killed").
It worked well, though, for smaller remote installations (up to 50 hosts with 300 service checks).
I will try playing around with the time frame as you suggested.

Regarding command execution time: I have now set the external command check value to 1 second, and I will report whether this clears the problem where Nagios ignores (not delays) some commands sent to it.


Thanks again for your prompt reply and insights,
Nuriel

nuriel....@clustervision.com

Jul 31, 2013, 8:31:29 AM
to th...@googlegroups.com
Hi Sven,

Things seem to be working much better now:
- I get better speed now when executing remote commands.
- I played around with logcacheimport=[val] and it keeps the large installations from timing out (a large import takes longer to perform, but at least it doesn't break anymore).
- I had a look at and tested MK Livestatus: great stuff! Very fast, concise and to the point.
- I still haven't had time to deploy with Starman.

Thanks again!

Nuriel.
