If you manage a network of any size, you want to be notified of problems before your customers or your bosses find out, but you don't want to be tied to a console checking for the availability of hosts and services. This is where Nagios shines. If you put in the time it takes to install and customize Nagios for your environment, you'll be rewarded with a superb monitoring and notification solution that happens to be free. In this PET, I will guide you through the installation and configuration of Nagios, and I will provide examples of customizations you can add using plugins you can write yourself.
I will use Red Hat Enterprise Linux AS 4 in these examples, but they can be adapted for any Linux distribution. The following packages are required for the HTTPD services that will drive Nagios's web interface:
Apache
httpd
httpd-suexec
apr-util
Optional (for secure sockets layer, HTTPS interface)
mod_ssl
If you selected the default package set during installation, these are already installed. If you opted not to make Apache available during the Red Hat install, you can grab the packages from RHN using up2date or by downloading them manually.
The following package provides the Nagios framework, which is all we need for basic functionality. Nagios's checks are accomplished entirely through plugins, which are available in a separate package. From here on, I will suggest getting prebuilt packages from Dag Wieers's collection, and occasionally source tarballs from CPAN. To make it easier on yourself, add Dag's repository if you use YUM.
Nagios
nagios-2.2-1.el4.rf.i386.rpm http://dag.wieers.com/packages/nagios/
The following are needed for Nagios to actually perform checks:
Nagios Plugins
nagios-plugins-1.4.1-1.2.el4.rf.i386.rpm http://dag.wieers.com/packages/nagios-plugins/
fping-2.4-1.b2.2.el4.rf.i386.rpm http://dag.wieers.com/packages/fping/
perl-Crypt-DES-2.03-3.2.el4.rf.i386.rpm http://dag.wieers.com/packages/perl-Crypt-DES/
perl-Net-SNMP-5.0.1-1.2.el4.rf.noarch.rpm http://dag.wieers.com/packages/perl-Net-SNMP/
perl-IO-Socket-INET6-2.51-1.2.el4.rf.noarch.rpm http://dag.wieers.com/packages/perl-IO-Socket-INET6/
Digest-HMAC-1.01.tar.gz http://search.cpan.org/~gaas/Digest-HMAC-1.01/lib/Digest/HMAC.pm
Digest-SHA1-2.11.tar.gz http://search.cpan.org/~gaas/Digest-SHA1-2.11/SHA1.pm
We can begin installation of the packages by first installing Nagios:
rpm -ivh nagios-2.2-1.el4.rf.i386.rpm
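Now we begin satisfying the nagios-plugins dependencies. The steps below also build Socket6-0.19 from source, so download that CPAN tarball alongside the two Digest tarballs before you start: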
rpm -ivh fping-2.4-1.b2.2.el4.rf.i386.rpm
rpm -ivh perl-Crypt-DES-2.03-3.2.el4.rf.i386.rpm
mkdir /tmp/perltmp
cp *gz /tmp/perltmp
cd /tmp/perltmp
find . -name "*gz" -exec tar xvzf {} \;
cd Digest-SHA1-2.11
perl Makefile.PL
make test
make install
cd ../Digest-HMAC-1.01
perl Makefile.PL
make test
make install
cd ../Socket6-0.19
perl Makefile.PL
make test
make install
These next two Dag Perl packages expect SHA1, HMAC and Socket6 to be installed as RPMs. Since we built them from source instead, we have to tell rpm not to check dependencies.
rpm -ivh --nodeps perl-Net-SNMP-5.0.1-1.2.el4.rf.noarch.rpm
rpm -ivh --nodeps perl-IO-Socket-INET6-2.51-1.2.el4.rf.noarch.rpm
rpm -ivh nagios-plugins-1.4.1-1.2.el4.rf.i386.rpm
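Because the last packages were installed with --nodeps, rpm will not tell us if one of the hand-built Perl modules is actually missing. The following is a minimal sketch of a sanity check, not part of the original procedure. The function name check_perl_modules and the PERL override variable are my own inventions for illustration; the only real tool it relies on is perl's -M switch, which fails if a module cannot be loaded.

```shell
#!/bin/sh
# Report whether each named Perl module can be loaded.
# PERL may be set to pick a different interpreter (hypothetical convenience).
check_perl_modules() {
    perl_bin="${PERL:-perl}"
    missing=0
    for mod in "$@"; do
        # "perl -MModule -e 1" exits non-zero when the module cannot be loaded
        if "$perl_bin" "-M$mod" -e 1 2>/dev/null; then
            echo "ok $mod"
        else
            echo "MISSING $mod"
            missing=1
        fi
    done
    return "$missing"
}

# Example call for the modules used in this article:
#   check_perl_modules Digest::SHA1 Digest::HMAC Socket6 Net::SNMP IO::Socket::INET6
```

The function returns non-zero if any module is missing, so it can be used in a larger install script.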
Nagios has two methods for arranging its configuration files. One way relies on a single file where you specify hosts, groups, services etc. The other allows you to split these files up by purpose for ease of administration. The single file method can become unwieldy as you add machines and services to monitor. Here, we'll assume the multiple definition file method.
Let's become familiar with the default file locations used by the Dag-provided packages:
Main Nagios Configs
/etc/nagios
Plugins and CGIs
/usr/lib/nagios
Nagios Web Files
/usr/share/nagios
Here, we see the example config files in /etc/nagios:
[radar@test2 ~]$ ls -lh /etc/nagios
total 160K
-rw-rw-r-- 1 root root 30K Apr 8 08:28 bigger.cfg
-rw-rw-r-- 1 root root 9.4K Apr 8 08:28 cgi.cfg
-rw-rw-r-- 1 root root 4.8K Apr 8 08:28 checkcommands.cfg
-rw-r--r-- 1 root root 16K Aug 5 2005 command-plugins.cfg
-rw-rw-r-- 1 root root 14K Apr 8 08:28 minimal.cfg
-rw-rw-r-- 1 root root 4.2K Apr 8 08:28 misccommands.cfg
-rw-rw-r-- 1 root root 30K Apr 8 08:28 nagios.cfg
-rw-rw---- 1 root root 1.3K Apr 8 08:28 resource.cfg
The first file we're interested in is nagios.cfg, the main config file. This file specifies, among other things, the object config (definition) files. Those are what we are most interested in at this point. We want to open /etc/nagios/nagios.cfg in an editor and comment out the line that contains minimal.cfg. Then we'll uncomment the lines containing the object config files that we'll need to create, and populate with our definitions. Let's go ahead and do that, then.
# You can split other types of object definitions across several
# config files if you wish (as done here), or keep them all in a
# single config file.
#cfg_file=/etc/nagios/minimal.cfg
Here, I have commented out minimal.cfg
cfg_file=/etc/nagios/contactgroups.cfg
cfg_file=/etc/nagios/contacts.cfg
#cfg_file=/etc/nagios/dependencies.cfg
#cfg_file=/etc/nagios/escalations.cfg
cfg_file=/etc/nagios/hostgroups.cfg
cfg_file=/etc/nagios/hosts.cfg
cfg_file=/etc/nagios/services.cfg
cfg_file=/etc/nagios/timeperiods.cfg
And here I have uncommented the object config files we will work with first, to get basic functionality. We will now create these and populate them with some hosts, services, groups, etc.
While we're at it, we want to enable service commands in the CGIs, and enable flap detection.
Still in nagios.cfg, change:
check_external_commands=0
to:
check_external_commands=1
and change:
enable_flap_detection=0
to:
enable_flap_detection=1
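If you prefer to script these nagios.cfg edits rather than make them in an editor, something like the following sketch could do it. It assumes the stock file contains the cfg_file lines and the two settings exactly as shown above, with no trailing whitespace; the function name split_nagios_cfg is hypothetical. Try it on a copy of the file first.

```shell
#!/bin/sh
# Switch a nagios.cfg from minimal.cfg to the split object files and
# turn on external commands and flap detection.
# Usage: split_nagios_cfg /path/to/copy/of/nagios.cfg
split_nagios_cfg() {
    cfg="$1"
    tmp="$cfg.new.$$"
    sed \
        -e 's|^cfg_file=\(.*/minimal\.cfg\)$|#cfg_file=\1|' \
        -e 's|^#\(cfg_file=.*/contactgroups\.cfg\)$|\1|' \
        -e 's|^#\(cfg_file=.*/contacts\.cfg\)$|\1|' \
        -e 's|^#\(cfg_file=.*/hostgroups\.cfg\)$|\1|' \
        -e 's|^#\(cfg_file=.*/hosts\.cfg\)$|\1|' \
        -e 's|^#\(cfg_file=.*/services\.cfg\)$|\1|' \
        -e 's|^#\(cfg_file=.*/timeperiods\.cfg\)$|\1|' \
        -e 's|^check_external_commands=0$|check_external_commands=1|' \
        -e 's|^enable_flap_detection=0$|enable_flap_detection=1|' \
        "$cfg" >"$tmp" &&
        cat "$tmp" >"$cfg" &&   # overwrite in place so ownership and mode are kept
        rm -f "$tmp"
}
```

The dependencies.cfg and escalations.cfg lines are deliberately left commented out, matching the listing above.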
Open minimal.cfg, copy the timeperiod definition, paste it into a new file called timeperiods.cfg, and save it.
define timeperiod{
timeperiod_name 24x7
alias 24 Hours A Day, 7 Days A Week
sunday 00:00-24:00
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
saturday 00:00-24:00
}
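The copy-and-paste step can also be scripted. The sketch below is one possible approach, assuming each definition in minimal.cfg opens with "define <type>{" on its own line and closes with a line beginning with "}", as in the listing above. The function name extract_defs is made up for this example.

```shell
#!/bin/sh
# Print every "define <type>{ ... }" block of one object type from a config file.
# Usage: extract_defs timeperiod /etc/nagios/minimal.cfg > /etc/nagios/timeperiods.cfg
extract_defs() {
    awk -v type="$1" '
        # A block starts at "define <type>{" (optional space before the brace)
        $0 ~ ("^[ \t]*define[ \t]+" type "[ \t]*[{]") { inblock = 1 }
        inblock { print }
        # The block ends at the first line that begins with a closing brace
        inblock && /^[ \t]*[}]/ { inblock = 0; print "" }
    ' "$2"
}
```

Because the type must be followed directly by the brace, asking for host does not also pull in hostgroup blocks.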
Do the same for the contact definition and contact group definition. For hosts, copy the generic-host definition, along with the localhost definition and paste into hosts.cfg.
define host{
name generic-host ; The name of this host template
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
# Since this is a simple configuration file, we only monitor one host - the
# local host (this machine).
define host{
use generic-host ; Name of host template to use
host_name localhost
alias localhost
address 127.0.0.1
check_command check-host-alive
max_check_attempts 10
notification_interval 120
notification_period 24x7
notification_options d,r
contact_groups admins
}
define host{
use generic-host ; Name of host template to use
host_name testbox
alias Testbox
address 192.168.0.4
check_command check-host-alive
max_check_attempts 10
notification_interval 120
notification_period 24x7
notification_options d,r
contact_groups admins
}
I have added a networked host to check. Copy the hostgroup definition from minimal.cfg and paste into the new hostgroups.cfg.
define hostgroup{
hostgroup_name test
alias Test Servers
members localhost,testbox
}
I added our testbox to this group. We will need to copy the services definitions from minimal.cfg and paste them all into the new services.cfg file. Now we verify our work using nagios:
[radar@test2 nagios]$ sudo nagios -v /etc/nagios/nagios.cfg
Nagios 2.2
Copyright (c) 1999-2006 Ethan Galstad ( http://www.nagios.org)
Last Modified: 04-07-2006
License: GPL
Reading configuration data...
Running pre-flight check on configuration data...
Checking services...
Checked 5 services.
Checking hosts...
Warning: Host 'testbox' has no services associated with it!
Checked 2 hosts.
Checking host groups...
Checked 1 host groups.
Checking service groups...
Checked 0 service groups.
Checking contacts...
Checked 1 contacts.
Checking contact groups...
Checked 1 contact groups.
Checking service escalations...
Checked 0 service escalations.
Checking service dependencies...
Checked 0 service dependencies.
Checking host escalations...
Checked 0 host escalations.
Checking host dependencies...
Checked 0 host dependencies.
Checking commands...
Checked 22 commands.
Checking time periods...
Checked 1 time periods.
Checking extended host info definitions...
Checked 0 extended host info definitions.
Checking extended service info definitions...
Checked 0 extended service info definitions.
Checking for circular paths between hosts...
Checking for circular host and service dependencies...
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 1
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
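When the pre-flight check is run from a script, the long listing matters less than the two totals at the end. Here is a small sketch that reads the output of nagios -v on standard input and reduces it to a single line, failing if any errors were reported. The function name preflight_summary is hypothetical, and the parsing assumes the "Total Warnings:" and "Total Errors:" lines look like the ones shown above.

```shell
#!/bin/sh
# Read "nagios -v" output on stdin; print the totals and fail if any errors.
# Usage: sudo nagios -v /etc/nagios/nagios.cfg | preflight_summary
preflight_summary() {
    awk -F': *' '
        /^Total Warnings:/ { warnings = $2 }
        /^Total Errors:/   { errors = $2 }
        END {
            # Neither line seen: the check probably did not run at all
            if (warnings == "" || errors == "") {
                print "no summary found"
                exit 2
            }
            printf "warnings=%d errors=%d\n", warnings, errors
            exit (errors > 0 ? 1 : 0)
        }
    '
}
```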
If we had made a mistake, Nagios would do its best to point us toward the problem. All looks good, so we have a basic functioning setup. I will address the warning about testbox having no services in a bit. We will now set up Apache for authentication.
Look at /etc/httpd/conf.d/nagios.conf to see how authentication files are set:
AuthName "Nagios Access"
AuthType Basic
AuthUserFile /etc/nagios/htpasswd.users
Require valid-user
So we need to add nagiosadmin, who's defined as a contact, in htpasswd.users:
sudo /usr/bin/htpasswd -c /etc/nagios/htpasswd.users nagiosadmin
Make sure this file is readable by the apache user, if not already:
sudo chmod 644 /etc/nagios/htpasswd.users
Now edit cgi.cfg, uncommenting the lines containing allowed actions for the nagiosadmin user. Then set httpd and nagios to start at boot in runlevels 3 and 5:
[radar@test2 ~]$ sudo /sbin/chkconfig --level 35 httpd on
[radar@test2 ~]$ sudo /sbin/chkconfig --level 35 nagios on
Unfortunately, before we proceed, we have to disable SELinux. I know of no policy that allows Nagios to function with SELinux-enabled Apache. If you know of a solution, please see the contact info at the end of this PET and get in touch. The easiest way to disable SELinux is to go to Applications, System Settings, Security Level and select the SELinux tab. Uncheck "Enabled (Modification Requires Reboot)", then click OK and reboot.
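On a machine without a desktop, the same change can be made from the command line. The sketch below assumes the SELinux mode is stored as a SELINUX= line in /etc/selinux/config, which is where I would expect it on RHEL 4; verify the path on your system. The function name set_selinux_mode is invented for this example.

```shell
#!/bin/sh
# Set the SELINUX= mode in an SELinux config file.
# Usage (as root, then reboot): set_selinux_mode disabled /etc/selinux/config
set_selinux_mode() {
    mode="$1"
    file="$2"
    case "$mode" in
        enforcing|permissive|disabled) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    tmp="$file.new.$$"
    # Only the SELINUX= line is touched; SELINUXTYPE= is left alone
    sed "s/^SELINUX=.*/SELINUX=$mode/" "$file" >"$tmp" &&
        cat "$tmp" >"$file" &&
        rm -f "$tmp"
}
```

As with the graphical tool, the change only takes effect after a reboot.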
When the machine is up, we can point the browser to https://machine/nagios. We'll see right away in the control panel that there's an issue with the total processes check. By looking at /etc/nagios/services.cfg for check_local_procs we see the check definition:
check_local_procs!250!400
So let's look at our checkcommands.cfg file to see how that's defined:
$USER1$/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$
Right away, we see there's a mismatch. The default service definition supplies only 2 arguments (delimited by the '!'), yet the command definition is looking for 3. Let's see what that -s is for:
cd /usr/lib/nagios/plugins
./check_procs -h | less
The help tells us that the -s is optional:
Optional Filters:
-s, --state=STATUSFLAGS
So we'll remove that from the command definition for now. The original definition:
define command{
command_name check_local_procs
command_line $USER1$/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$
}
becomes:
define command{
command_name check_local_procs
command_line $USER1$/check_procs -w $ARG1$ -c $ARG2$
}
We've removed the optional ps status flag.
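This kind of mismatch between a service's check_command and the command_line it maps to is easy to miss by eye. As a rough sketch, the two helper functions below (both names are hypothetical) count the '!'-delimited arguments a service supplies and find the highest $ARGn$ macro a command_line refers to, so the two numbers can be compared.

```shell
#!/bin/sh
# Count the arguments a service passes: the "!"-separated fields after the command name.
supplied_args() {
    printf '%s\n' "$1" | awk -F'!' '{ print NF - 1 }'
}

# Find the highest $ARGn$ macro referenced by a command_line (0 if none).
expected_args() {
    printf '%s\n' "$1" | awk '{
        max = 0
        line = $0
        while (match(line, /\$ARG[0-9]+\$/)) {
            # Skip the leading "$ARG" (4 chars) and drop the trailing "$"
            n = substr(line, RSTART + 4, RLENGTH - 5) + 0
            if (n > max) max = n
            line = substr(line, RSTART + RLENGTH)
        }
        print max
    }'
}
```

For the stock definitions above, supplied_args 'check_local_procs!250!400' prints 2, while expected_args on the original command_line prints 3.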
Restart nagios:
[radar@test2 plugins]$ sudo /sbin/service nagios restart
Running configuration check...done
Stopping network monitor: nagios
Waiting for nagios to exit . done.
Starting network monitor: nagios
Now all is green! We have basic Nagios functionality and can start adding our customizations.
Remember that when we verified Nagios's configuration, we got a warning about our testbox host not having any services associated with it. What this means is that, besides the obvious, Nagios will not do any host alive checks against it. Nagios tries to spread out the checks in an efficient manner and will normally only check a host's alive state when a service is failing. Once we establish a service for testbox, Nagios will count the host as alive whenever that service succeeds. You can set up a service just to ping the box, but we'll set up a custom command using one of the provided plugins.
I have started apache on our testbox, and will use the check_http plugin to define a command, and then from that, define a service to run against testbox. We can test the plugin directly so we know what to expect:
/usr/lib/nagios/plugins/check_http -h
This gives us the usage. Now a real check:
[radar@test2 www]$ /usr/lib/nagios/plugins/check_http -H testbox -u /error/noindex.html
HTTP OK HTTP/1.1 200 OK - 4177 bytes in 0.007 seconds |time=0.006624s;;;0.000000 size=4177B;;;0
This returns the default new-install page. We can use that to set up a service to test whether Apache is up on testbox. Create a new config file in /etc/nagios called custom_cmds.cfg and place the following in it:
define command{
command_name check_apache
command_line $USER1$/check_http -H $ARG1$ -S -u $ARG2$
}
Now open services.cfg in an editor and define a service to use this command definition:
define service{
use generic-service ; Name of service template to use
host_name testbox
service_description Check Apache
is_volatile 0
check_period 24x7
max_check_attempts 4
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_apache!testbox!/error/noindex.html
}
We have to tell Nagios that this new command file exists by adding its path to nagios.cfg:
cfg_file=/etc/nagios/custom_cmds.cfg
I added that under the existing command definition entries. Now we can use this file to add custom command definitions. We need to verify that we didn't make any mistakes by running nagios -v again:
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
Good. We can restart nagios:
sudo /sbin/service nagios restart
We see that the new service is there, but it's pending. We can force it by rescheduling the next check and accepting the default time, which is immediate. We now can see that the service is working.
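Since we enabled check_external_commands earlier, a forced check can also be requested without the web interface by writing a SCHEDULE_FORCED_SVC_CHECK line to Nagios's external command file. The sketch below only builds that line; the function name forced_check_cmd is hypothetical, and the command file path in the example is an assumption, so use the command_file value from your own nagios.cfg.

```shell
#!/bin/sh
# Build a SCHEDULE_FORCED_SVC_CHECK external command line.
# Arguments: host name, service description, Unix timestamp (defaults to now).
forced_check_cmd() {
    now="${3:-$(date +%s)}"
    printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' "$now" "$1" "$2" "$now"
}

# Example (path is an assumption; check command_file in nagios.cfg):
#   forced_check_cmd testbox "Check Apache" > /var/log/nagios/rw/nagios.cmd
```

The service description must match the service_description in services.cfg exactly, including spaces.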
Pretty easy, but we may also want to write our own plugin and make a service check from that. Let's emulate the functionality of the check_http plugin, for illustration purposes, using available tools and wrap it up in a bash script.
To use this example, curl needs to be installed. It is by default on RHEL.
Nagios expects plugins to return a code telling what the status of the check is. The following details what the codes are:
0 = OK
1 = WARNING
2 = CRITICAL
3 = UNKNOWN
The warning and critical exit codes are ideal for setting thresholds, such as CPU usage and load averages. But since our service is either on or off, we can use critical, ok, and unknown (for bad parameters passed).
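To make the threshold case concrete, here is a minimal sketch of how a plugin might map a measured value onto those exit codes. It is an illustration only; the function name check_threshold and its argument order are my own, and it handles non-negative integers and nothing else.

```shell
#!/bin/sh
# Compare an integer value against warning and critical thresholds and
# print a Nagios-style status line. Returns the matching plugin exit code.
check_threshold() {
    value="$1"; warn="$2"; crit="$3"
    # Anything that is not a plain non-negative integer is a usage problem
    for n in "$value" "$warn" "$crit"; do
        case "$n" in
            ''|*[!0-9]*) echo "UNKNOWN - bad parameter '$n'"; return 3 ;;
        esac
    done
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value"; return 1
    fi
    echo "OK - value is $value"; return 0
}
```

A real plugin would end with something like check_threshold "$measured" "$1" "$2"; exit $?, so that Nagios sees the return value as the plugin's exit code.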
The following script, testweb.sh, takes its arguments and passes them to the curl command. We'll use it to get functionality similar to the check_http plugin.
#!/bin/bash
#
# testweb.sh
#
#
BADCALL="Wrong combination of parameters $@"
printuse ()
{
cat <<End-of-usage
Usage: ./testweb.sh -h [hostname] [-H|-S]
./testweb.sh -h [hostname] [-H|-S] -p [port]
Example: ./testweb.sh -h www.redhat.com -S
./testweb.sh -h 192.168.0.10 -H -p 7778
End-of-usage
}
# Rudimentary check for proper number and combination of parameters:
# exactly 3 or 5 arguments, -h first, and -S or -H third
if { [ "$#" -ne 3 ] && [ "$#" -ne 5 ]; } || [ "$1" != "-h" ] || \
{ [ "$3" != "-S" ] && [ "$3" != "-H" ]; }
then
echo "$BADCALL"
printuse
exit 3
# With 5 arguments, the fourth must be -p and the fifth a numeric port
elif [ "$#" -eq 5 ] && { [ "$4" != "-p" ] || ! echo "$5" | grep -Eq '^[0-9]+$'; }
then
echo "$BADCALL"
printuse
exit 3
fi
# Set the URL prefix based on parameter 3
if [ "$3" == "-S" ]
then
PRE=https://
else
PRE=http://
fi
# Build URL
HOST="$2"
if [ "$#" -eq 5 ]
then
PORT=":$5"
URL="$PRE$HOST$PORT"
else
URL="$PRE$HOST"
fi
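curl -k -s -I -w "%{size_header} bytes in %{time_total} seconds\n\n" "$URL" >"/tmp/$HOST.header.txt"
RC="$?"
case "$RC" in
"7")
MSG=`cat "/tmp/$HOST.header.txt"`
rm -f "/tmp/$HOST.header.txt"
echo "CRITICAL - Failed to connect => $MSG"
exit 2
;;
"0")
STAT=`grep seconds "/tmp/$HOST.header.txt"`
SRV=`grep Server "/tmp/$HOST.header.txt" | awk '{print $2}'`
echo "OK - $SRV => $STAT"
rm -f "/tmp/$HOST.header.txt"
exit 0
;;
*)
# Any other curl error (DNS failure, timeout, ...) must not fall through as OK
rm -f "/tmp/$HOST.header.txt"
echo "CRITICAL - curl exited with code $RC for $URL"
exit 2
;;
esac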
And we save this in /usr/lib/nagios/plugins as testweb.sh and make it executable:
chmod 755 /usr/lib/nagios/plugins/testweb.sh
Let's see how to use the plugin:
[radar@test2 nagios]$ /usr/lib/nagios/plugins/testweb.sh -h testbox -S
OK - Apache/2.0.52 => 199 bytes in 0.354 seconds
[radar@test2 nagios]$ /usr/lib/nagios/plugins/testweb.sh -h testbox -H
OK - Apache/2.0.52 => 199 bytes in 0.008 seconds
SSL seems considerably slower, as can be expected.
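When testing a plugin by hand like this, the output line is only half the story; Nagios acts on the exit code. The following sketch wraps any plugin command line and prints both. The function names status_name and run_plugin are hypothetical, and the mapping simply follows the table of return codes given earlier.

```shell
#!/bin/sh
# Translate a plugin exit code into the status name from the table above.
status_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "OUT OF RANGE" ;;
    esac
}

# Run any plugin command line, then report its output and status.
run_plugin() {
    rc=0
    output=$("$@") || rc=$?
    echo "output: $output"
    echo "status: $rc ($(status_name "$rc"))"
}

# Example:
#   run_plugin /usr/lib/nagios/plugins/testweb.sh -h testbox -S
```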
We can use this now to define a new service. Let's edit /etc/nagios/custom_cmds.cfg and add a command.
define command{
command_name check_apache_also
command_line $USER1$/testweb.sh -h $ARG1$ -S
}
Now we edit services.cfg and define the service:
define service{
use generic-service ; Name of service template to use
host_name testbox
service_description Check Apache Also
is_volatile 0
check_period 24x7
max_check_attempts 4
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_apache_also!testbox
}
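The two service definitions we have written differ only in their description and check_command. If you expect to add many such checks, a small generator can save typing. This is only a sketch with a hypothetical function name, make_service_def, and it hard-codes the same intervals and options used in the stanzas above.

```shell
#!/bin/sh
# Print a service stanza in the same shape as the ones above.
# Arguments: host name, service description, check_command string.
make_service_def() {
    cat <<EOF
define service{
use generic-service
host_name $1
service_description $2
is_volatile 0
check_period 24x7
max_check_attempts 4
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command $3
}
EOF
}

# Example:
#   make_service_def testbox "Check Apache Also" 'check_apache_also!testbox' >> /etc/nagios/services.cfg
```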
And we verify our changes with nagios:
[radar@test2 nagios]$ sudo nagios -v /etc/nagios/nagios.cfg
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
Now restart nagios:
[radar@test2 nagios]$ sudo /sbin/service nagios restart
The service will show pending, so force its schedule as before. And we see it works!
It took a little configuration, but it's quite easy to have a functioning Nagios install with reliable checks. There is quite a bit more to Nagios, much of which you'll want to get working. Things like service groups, notifications, dependencies and escalations will further refine the way Nagios works for you. Nagios is well documented: you can view the help files right from within a working install, or go over to Nagios's project site.