Hi Sven,
Thanks for the prompt reply.
I am replying inline and have also added a few questions of my own; I hope you don't mind.
- About running Thruk as a standalone server:
What I would expect is a speed gain (don't we all :-)).
The question about running Thruk standalone comes from my experience that Catalyst applications run faster under plackup/starman, especially when pre-loading heavy libraries (-MSome::Class etc.).
Another benefit is that one could run Thruk without Apache. This scenario applies to the "central" Thruk instance (where I don't even have Nagios installed, so I don't really need a full web server; alternatively, nginx could be used if a lighter web server, SSL termination, etc. is needed).
A disadvantage of running standalone is that Thruk would need to handle its own authentication - a large coding overhead, but on the other hand it could add some nice RBAC and ACL functionality.
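For what it's worth, the kind of invocation I have in mind is something like the following (the .psgi path and worker count are my assumptions, not taken from the Thruk docs):

```shell
# assumed path and options - starman supports --workers, --listen and --preload-app
starman --workers 4 --preload-app --listen :3000 /usr/share/thruk/thruk.psgi
```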
- Status.dat vs live broker:
I have no experience with Livestatus as a developer, but I understand your point: if it also provides the flexibility, methods and other helpful things for a developer, I would probably benefit from having a look at it myself (provided I find the time and it offers the same or better performance than my current method of parsing status.dat).
I honestly have no idea how long mk-livestatus would take to retrieve this data, and I am not sure it would return exactly the same data, but it seems to me that status.dat contains all the live status data (apologies if I am making a wrong assumption; I still haven't had a thorough look at mk-livestatus).
I am posting my results below only as an example - again, I have no idea how long mk-livestatus would take.
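For comparison, I understand the equivalent data would come out of Livestatus with a query roughly like this (column names taken from what I've seen of the mk-livestatus services table; I haven't verified the exact set):

```
GET services
Columns: host_name description state plugin_output last_check
Filter: host_name = panasas02
```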
For a 7.0M file (500 hosts, 3500 services), I get the following results (hardware is specified at the bottom of this email):
ls -lh /usr/share/nagios/var/spool/status.dat
-rw-r--r-- 1 nagios nagios 7.0M Jul 30 10:02 /usr/share/nagios/var/spool/status.dat
----------------------------------------------------------------------
Read Nagios file: 1.12017 wallclock secs ( 0.98 usr + 0.11 sys = 1.09 CPU)
Process Nagios file: 0.906096 wallclock secs ( 0.89 usr + 0.01 sys = 0.90 CPU)
Evaluate status data: 0.015445 wallclock secs ( 0.01 usr + 0.00 sys = 0.01 CPU)
Evaluate status data: 0.189962 wallclock secs ( 0.18 usr + 0.01 sys = 0.19 CPU)
----------------------------------------------------------------------
I parse the data into a Perl hash; the end result looks somewhat like this:
panasas02 => {
    services => {
        Check_SSH => {
            active_checks_enabled => 1,
            check_command => "check_ssh!22",
            check_execution_time => "0.021",
            check_interval => "5.000000",
            current_attempt => 1,
            current_state => 0,
            is_flapping => 0,
            last_check => "1375175023",
            last_update => "1375175295",
            max_attempts => 3,
            percent_state_change => "0.00",
            plugin_output => "SSH OK - OpenSSH_5.1p1 FreeBSD-20080901 (protocol 2.0)",
            problem_has_been_acknowledged => 0,
            service_description => "Check SSH"
        }
    },
    status => {
        active_checks_enabled => 1,
        check_command => "check-host-alive",
        check_execution_time => "0.010",
        check_interval => "2.000000",
        current_attempt => 1,
        current_state => 0,
        is_flapping => 0,
        last_check => "1375175244",
        last_update => "1375175295",
        max_attempts => 3,
        percent_state_change => "0.00",
        problem_has_been_acknowledged => 0,
        service_description => "HOST"
    }
},
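To make the layout above concrete, here is a rough sketch of the parsing step (in Python just for illustration - my actual parser is Perl); block and field names follow the standard status.dat format, and error handling is omitted:

```python
import re

def parse_status(text):
    """Turn status.dat 'hoststatus {...}' / 'servicestatus {...}' blocks
    into a nested dict keyed by host name, mirroring the hash above."""
    hosts = {}
    for block in re.finditer(r'(\w+)\s*\{(.*?)\}', text, re.S):
        kind, body = block.group(1), block.group(2)
        # each body line is 'key=value'; split only on the first '='
        fields = dict(
            line.strip().split('=', 1)
            for line in body.strip().splitlines() if '=' in line
        )
        if kind == 'hoststatus':
            hosts.setdefault(fields['host_name'], {'services': {}})['status'] = fields
        elif kind == 'servicestatus':
            host = hosts.setdefault(fields['host_name'], {'services': {}})
            host['services'][fields['service_description']] = fields
    return hosts
```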
The reason I am spending so much time pondering performance is that I am experiencing slowness and trying to combat it.
The hardware on most backends is Dell 720 servers with dual Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz and 32GB of memory; the storage is probably among the fastest on the market (Panasas, multiple shelves), so I don't think hardware is the bottleneck (and system load is generally low).
I am also trying to think ahead to when I add five more sites: performance currently degrades significantly with each remote site added, and by the end of the year I will probably have 15 sites altogether, some of them very large installations.
- Logcache MySQL - local, remote, or both? Timeout values might be unrealistic:
Yesterday I took the time to run the Thruk command line and populate the MySQL logcache.
Each site took about 60-70 minutes for this process (about a month's worth of data per site).
This was done locally on the remote sites. Today I wanted to let the "central" Thruk do the same.
I am wondering whether there is any benefit in having the "central" instance do this for its remote backends. The timeout value specified in lib/Thruk/Backend/Provider/HTTP.pm is 100 seconds, which seems out of range: if the command-line process takes 60-70 minutes to run locally, there is no chance of log-caching a remote site within that limit?
I did set the value to 4200 seconds, but the command-line utility still seems to kill itself after approximately 5 minutes.
Help on this issue would be most appreciated and welcome, unless you think it is sufficient that I've updated the cache locally on the remote instances?
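For reference, what I configured and ran yesterday looked roughly like this (connection string and database name are mine; please correct me if the syntax is off):

```
# thruk_local.conf - assumed syntax, please double-check
logcache = mysql://thruk:password@localhost:3306/thruk_logs
```

followed locally on each remote site by "thruk -a logcacheimport".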
- Nagios latency? Commands being ignored from time to time:
It happens from time to time, especially on the larger remote sites, that I add a comment or execute a command, get "command successful" from Thruk, but never see it processed by Nagios on the remote site (external command check interval = 15s).
Is this a Nagios latency issue? Is Nagios ignoring external commands? (Or are the commands sent via the live broker?)
I don't understand what might be causing this problem. It does not happen when I issue the command from a command-line utility I wrote for Nagios, which writes directly to nagios.cmd.
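For comparison, my utility essentially does the following (sketched here in Python; the pipe path is an assumption on my side and should match the command_file setting in nagios.cfg):

```python
import time

def format_ext_command(name, *args):
    """Build a Nagios external command line in the standard
    '[timestamp] COMMAND;arg1;arg2;...' format."""
    return "[%d] %s\n" % (int(time.time()), ";".join((name,) + args))

def submit(cmd_line, pipe="/usr/share/nagios/var/rw/nagios.cmd"):
    # the pipe must exist and Nagios must be reading it, or this blocks
    with open(pipe, "w") as fh:
        fh.write(cmd_line)
```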
- Call backs for remote command execution?
Would callbacks be an interesting option for running remote commands?
For example: send a command or set of commands to a remote backend and wait only for a 200 OK (command received, format okay, auth okay, etc.). The call ends there - it does not wait for the backend to process the command.
Afterwards, a callback from the remote site to the "initiator" would deliver either confirmation that the command(s) have been processed successfully by Nagios (not hard to determine) or any command-execution errors.
On the client side, a JavaScript timer could then poll via AJAX for the callback result and display "Command executed successfully" once it arrives.
At the moment the interface seems to wait for quite a long while before returning "command successful", and this is blocking rather than asynchronous.
Callbacks could combat this, or such commands could simply run asynchronously without occupying the rest of the interface. This is what I normally do when building a user control interface: let commands run async in the "background" (with an AJAX spinner) and keep the interface accessible. Just an idea.
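A minimal sketch of the submit/complete/poll idea above (all names are invented for illustration; nothing like this exists in Thruk today, and the real thing would sit behind HTTP endpoints rather than an in-memory dict):

```python
import uuid

_pending = {}

def submit(command):
    """Phase 1: accept the command and return a token immediately
    (this is the '200 OK, command received' response)."""
    token = str(uuid.uuid4())
    _pending[token] = {"command": command, "state": "queued"}
    return token

def complete(token, ok=True, error=None):
    """Phase 2: the remote site's callback reports the outcome."""
    _pending[token]["state"] = "done" if ok else "failed"
    _pending[token]["error"] = error

def check(token):
    """What the AJAX poller on the client side would hit."""
    return _pending[token]["state"]
```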
I hope you don't mind the suggestions, and I hope you can help me out with the performance issues I am experiencing.
I appreciate your time,
Thanks in advance,
Nuriel Shem-Tov