Nagios Check Download Speed


Lorette Hanscom

Jan 24, 2024, 6:00:38 PM
to bitpitherwi

I'm trying to add my ESXi servers (I have 3 of them) to my Nagios monitoring. I'm using the Perl SDK and script as documented in various places on the internet; the latest script is here: =nagios/op5plugins.git;a=blob_plain;f=check_esx3.pl;hb=HEAD




I also notice that logging in to the server via vSphere is reasonably slow, also about 3-4 seconds before it starts loading. This could be unrelated. The ESXi server isn't under huge load, though it has iSCSI LUNs mounted and maybe about 6-7 active VMs. I've checked (and increased) the resource allocation for the host and also checked esxtop (with no findings) while running the check command.

This delay is an issue because the Perl processes spawned by Nagios run at 100% CPU while they're trying to connect, and it's happening on all 3 of my ESXi servers. So as Nagios issues more and more checks, the monitoring server's CPU and load averages go through the roof as all of the processes wait for responses. This only serves to exacerbate the delay and causes all of the checks to time out.

A friend, also running Nagios with the same monitoring script and the same ESXi update, can run the same check and it completes in less than a second, whereas for me it takes up to 10 seconds (as you can see in the dprof output).

I have been actively diagnosing this since posting. I have found that my friend's Nagios server is an x64 machine, so I stood up a new Ubuntu 10.04 x64 VM (on a different host, mind you). After installing everything needed for the ESXi checks (lots of CPAN modules were required), I can time the checks on that new install, and a CPU check completes in around 2 seconds.

NRPE is an agent that allows you to remotely execute Nagios plugins on Linux/Unix machines and monitor remote machine metrics (disk usage, CPU load, etc.). NRPE can also communicate with Windows agent add-ons like NSClient++, so you can check metrics on remote Windows machines.
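To give a concrete idea of how the agent exposes a check, here is a command definition as it would appear in the agent's nrpe.cfg (the path and thresholds below follow the NRPE sample config and are illustrative, not from the original post):

```
# nrpe.cfg on the monitored host: the Nagios server asks for "check_load"
# by name; the agent runs the mapped plugin locally and returns the result
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
```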

NSCA allows you to integrate passive alerts and checks from remote machines and applications with Nagios. This agent is useful for processing security alerts as well as deploying redundant and distributed Nagios setups.

I have created a custom plugin for NRPE in Python 3. The script runs as expected when I run it from the command line as root and when I run it as the nagios user. (I had a problem before when the nagios user didn't have permissions, but I fixed this in the sudoers file.) I have an Arduino Uno taking input from a wind sensor and outputting the speed as a float to the serial monitor. The Python script on my Raspberry Pi (OS similar to Debian 9) reads the serial output, determines whether the speed is in a specific range, and then exits with the respective code. Here's the Python script:

For some reason it always exits with a '1' (warning) code and never enters the if statement when run from the Nagios server. When I run the program from the command line, it works perfectly.
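The script itself did not survive in the thread. As a point of comparison, a minimal sketch of such a plugin might look like the following; the device path, baud rate, thresholds, and the pyserial usage are all assumptions, not taken from the original script:

```python
#!/usr/bin/env python3
"""Sketch of an NRPE wind-speed plugin (illustrative, not the poster's script)."""
import sys

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify(speed, warn=10.0, crit=20.0):
    """Map a wind speed reading to a Nagios status code and message."""
    if speed >= crit:
        return CRITICAL, f"CRITICAL - wind speed {speed:.1f}"
    if speed >= warn:
        return WARNING, f"WARNING - wind speed {speed:.1f}"
    return OK, f"OK - wind speed {speed:.1f}"

def main():
    try:
        import serial  # pyserial; assumed device path below
        with serial.Serial("/dev/ttyACM0", 9600, timeout=5) as port:
            line = port.readline().decode("ascii", errors="replace").strip()
        speed = float(line)
    except Exception as exc:
        # Any failure (no permission on the device, no data, unparsable
        # float) is reported as UNKNOWN rather than a misleading WARNING.
        print(f"UNKNOWN - could not read sensor: {exc}")
        sys.exit(UNKNOWN)
    code, message = classify(speed)
    print(message)
    sys.exit(code)

if __name__ == "__main__":
    main()
```

A common cause of the "works on the command line, fails from NRPE" symptom described above is the nagios user lacking read access to the serial device (e.g. not being in the dialout group); with the structure above that would surface as an explicit UNKNOWN instead of a silent exit 1.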

Although writing a Nagios plugin varies from language to language in the specifics, the basic way it works is the same: Nagios runs a check on a host/service and then uses the check's return code plus its output as the result (OK/Warning/Critical).

Finally, the flag package allows you to specify command-line flags, which lets you create reusable checks (e.g. run the same check on 2 machines, but with different warning/critical thresholds on each).
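The flag package mentioned above is Go's; to stay consistent with the Python plugin discussed earlier in this thread, here is the same idea sketched with Python's argparse (flag names and defaults are illustrative):

```python
#!/usr/bin/env python3
"""Sketch: command-line thresholds make one plugin reusable across hosts."""
import argparse
import sys

def parse_args(argv=None):
    # -w/-c mirror the conventional Nagios plugin threshold flags
    parser = argparse.ArgumentParser(description="example threshold flags")
    parser.add_argument("-w", "--warning", type=float, default=80.0,
                        help="warning threshold")
    parser.add_argument("-c", "--critical", type=float, default=90.0,
                        help="critical threshold")
    return parser.parse_args(argv)

def status_for(value, warn, crit):
    """Return the Nagios exit code for a measured value."""
    if value >= crit:
        return 2  # CRITICAL
    if value >= warn:
        return 1  # WARNING
    return 0      # OK

if __name__ == "__main__":
    args = parse_args()
    # the measured value would come from the actual check; 42.0 is a placeholder
    sys.exit(status_for(42.0, args.warning, args.critical))
```

The same script can then be defined twice in the Nagios config, once per machine, with different -w/-c values.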

  1. Use aggregated status updates. Enabling aggregated status updates (with the aggregate_status_updates option) will greatly reduce the load on your monitoring host because it won't be constantly trying to update the status log. This is especially recommended if you are monitoring a large number of services. The main trade-off with using aggregated status updates is that changes in the states of hosts and services will not be reflected immediately in the status file. This may or may not be a big concern for you.
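In the main configuration file this is a one-line switch; a sketch for nagios.cfg (the 15-second interval is an illustrative value, not a recommendation from the original text):

```
# nagios.cfg
aggregate_status_updates=1
# with aggregation on, the status file is rewritten periodically
# instead of on every state change
status_update_interval=15
```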

  2. Use a ramdisk for holding status data. If you're using the standard status log and you're not using aggregated status updates, consider putting the directory where the status log is stored on a ramdisk. This will speed things up quite a bit (in both the core program and the CGIs) because it saves a lot of interrupts and disk thrashing.

  3. Check service latencies to determine best value for maximum concurrent checks. Nagios can restrict the number of maximum concurrently executing service checks to the value you specify with the max_concurrent_checks option. This is good because it gives you some control over how much load Nagios will impose on your monitoring host, but it can also slow things down. If you are seeing high latency values (> 10 or 15 seconds) for the majority of your service checks (via the extinfo CGI), you are probably starving Nagios of the checks it needs. That's not Nagios's fault - it's yours. Under ideal conditions, all service checks would have a latency of 0, meaning they were executed at the exact time that they were scheduled to be executed. However, it is normal for some checks to have small latency values. I would recommend taking the minimum number of maximum concurrent checks reported when running Nagios with the -s command line argument and doubling it. Keep increasing it until the average check latency for your services is fairly low. More information on service check scheduling can be found here.
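This is a single directive in the main config; the value below is illustrative and should be tuned against the latencies you actually observe:

```
# nagios.cfg -- cap on simultaneously executing service checks
max_concurrent_checks=60
```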

  4. Use passive checks when possible. The overhead needed to process the results of passive service checks is much lower than that of "normal" active checks, so make use of that piece of info if you're monitoring a slew of services. It should be noted that passive service checks are only really useful if you have some external application doing some type of monitoring or reporting, so if you're having Nagios do all the work, this won't help things.
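For reference, a passive result is submitted as a single timestamped line written to Nagios's external command file (the host and service names here are placeholders):

```
# written to the external command file, e.g. /usr/local/nagios/var/rw/nagios.cmd
# format: [time] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
[1706130000] PROCESS_SERVICE_CHECK_RESULT;esx1;CPU Load;0;OK - load is fine
```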

  5. Avoid using interpreted plugins. One thing that will significantly reduce the load on your monitoring host is the use of compiled (C/C++, etc.) plugins rather than interpreted script (Perl, etc) plugins. While Perl scripts and such are easy to write and work well, the fact that they are compiled/interpreted at every execution instance can significantly increase the load on your monitoring host if you have a lot of service checks. If you want to use Perl plugins, consider compiling them into true executables using perlcc(1) (a utility which is part of the standard Perl distribution) or compiling Nagios with an embedded Perl interpreter (see below).

  6. Use the embedded Perl interpreter. If you're using a lot of Perl scripts for service checks, etc., you will probably find that compiling an embedded Perl interpreter into the Nagios binary will speed things up. In order to compile in the embedded Perl interpreter, you'll need to supply the --enable-embedded-perl option to the configure script before you compile Nagios. Also, if you use the --with-perlcache option, the compiled version of all Perl scripts processed by the embedded interpreter will be cached for later reuse.

  7. Optimize host check commands. If you're checking host states using the check_ping plugin you'll find that host checks will be performed much faster if you break up the checks. Instead of specifying a max_attempts value of 1 in the host definition and having the check_ping plugin send 10 ICMP packets to the host, it would be much faster to set the max_attempts value to 10 and only send out 1 ICMP packet each time. This is due to the fact that Nagios can often determine the status of a host after executing the plugin once, so you want to make the first check as fast as possible. This method does have its pitfalls in some situations (i.e. hosts that are slow to respond may be assumed to be down), but you'll see faster host checks if you use it. Another option would be to use a faster plugin (i.e. check_fping) as the host_check_command instead of check_ping.
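A sketch of what this looks like in the object configuration; the host name, address, and thresholds are placeholders, and -p 1 limits check_ping to a single ICMP packet per execution:

```
# one packet per plugin run; Nagios retries via max_attempts instead
define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
        }

define host{
        host_name       esx1
        address         192.168.1.10
        check_command   check-host-alive
        max_attempts    10
        }
```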

  8. Don't schedule regular host checks. Do NOT schedule regular checks of hosts unless absolutely necessary. There are not many reasons to do this, as host checks are performed on-demand as needed. To disable regular checks of a host, set the check_interval directive in the host definition to 0. If you do need to have regularly scheduled host checks, try to use a longer check interval and make sure your host checks are optimized (see above).
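A host definition with regular checks disabled might look like this (host name and address are placeholders):

```
define host{
        host_name       esx1
        address         192.168.1.10
        check_command   check-host-alive
        max_attempts    10
        check_interval  0       ; 0 = no regularly scheduled checks, on-demand only
        }
```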

  9. Don't use aggressive host checking. Unless you're having problems with Nagios recognizing host recoveries, I would recommend not enabling the use_aggressive_host_checking option. With this option turned off, host checks will execute much faster, resulting in speedier processing of service check results. However, host recoveries can be missed under certain circumstances when this is turned off. For example, if a host recovers and all of the services associated with that host stay in non-OK states (and don't "wobble" between different non-OK states), Nagios may miss the fact that the host has recovered. A few people may need to enable this option, but the majority don't and I would recommend not using it unless you find it necessary...
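The corresponding main-config directive:

```
# nagios.cfg -- 0 disables aggressive host checking (the recommended default)
use_aggressive_host_checking=0
```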

  10. Increase external command check interval. If you're processing a lot of external commands (i.e. passive checks in a distributed setup), you'll probably want to set the command_check_interval variable to -1. This will cause Nagios to check for external commands as often as possible. This is important because most systems have small pipe buffer sizes (i.e. 4KB). If Nagios doesn't read the data from the pipe fast enough, applications that write to the external command file (i.e. the NSCA daemon) will block and wait until there is enough free space in the pipe to write their data.
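In the main config this is again a single directive:

```
# nagios.cfg -- -1 means check the external command pipe as often as possible
command_check_interval=-1
```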

  11. Optimize hardware for maximum performance. Your system configuration and your hardware setup are going to directly affect how your operating system performs, so they'll affect how Nagios performs. The most common hardware optimization you can make is with your hard drives. CPU and memory speed are obviously factors that affect performance, but disk access is going to be your biggest bottleneck. Don't store plugins, the status log, etc. on slow drives (i.e. old IDE drives or NFS mounts). If you've got them, use UltraSCSI drives or fast IDE drives. An important note for IDE/Linux users is that many Linux installations do not attempt to optimize disk access. If you don't change the disk access parameters (by using a utility like hdparm), you'll lose out on a lot of the speedy features of the new IDE drives.
