Chapter 3. Monitoring with Nagios

Contents

3.1. Features of Nagios
3.2. Installing Nagios
3.3. Nagios Configuration Files
3.4. Configuring Nagios
3.5. Troubleshooting
3.6. For More Information

Nagios is a stable, scalable and extensible enterprise-class network and system monitoring tool which allows administrators to monitor network services and host resources such as HTTP, SMTP, POP3, disk usage and processor load. Nagios was originally designed to run under Linux, but it can also be used on several UNIX operating systems. This chapter covers the installation and parts of the configuration of Nagios (http://www.nagios.org/).

3.1. Features of Nagios

The most important features of Nagios are:

  • Monitoring of network services (SMTP, POP3, HTTP, NNTP, etc.).

  • Monitoring of host resources (processor load, disk usage, etc.).

  • Simple plug-in design that allows administrators to develop further service checks.

  • Support for redundant Nagios servers.

3.2. Installing Nagios

Install Nagios either with zypper or using YaST.

For further information on how to install packages see:

  • Section “Using Zypper” (Chapter 8, Managing Software with Command Line Tools, ↑Reference)

  • Section “Installing and Removing Packages or Patterns” (Chapter 4, Installing or Removing Software, ↑Reference)

Both methods install the packages nagios and nagios-www. The latter RPM package contains a Web interface for Nagios which allows you, for example, to view the service status and the problem history. The Web interface is optional; Nagios also works without it.

Nagios has a modular design and therefore uses external check plug-ins to verify whether a service is available. It is recommended to install the nagios-plugins RPM package, which contains ready-made check plug-ins. However, it is also possible to write your own custom check plug-ins.
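Custom check plug-ins follow a simple convention: they print one line of status text and report the result through their exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The following is a minimal sketch of such a plug-in, not a ready-made check. The function name check_file_exists and the idea of testing for a file are assumptions chosen purely for illustration.

```shell
#!/bin/bash
# Sketch of a custom check plug-in (hypothetical example, not shipped with Nagios).
# Plug-in convention: exit 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

check_file_exists() {
  local file="$1"
  if [ -z "$file" ]; then
    echo "UNKNOWN: no file name given"
    return 3
  fi
  if [ -e "$file" ]; then
    echo "OK: $file exists"
    return 0
  fi
  echo "CRITICAL: $file is missing"
  return 2
}

# When called with an argument, behave like a plug-in:
# print the status line and exit with the matching code.
if [ "$#" -gt 0 ]; then
  check_file_exists "$1"
  exit $?
fi
```

Saved as an executable file in the plug-in directory, a script like this could then be referenced from a command definition in the same way as the ready-made plug-ins.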

3.3. Nagios Configuration Files

Nagios organizes the configuration files as follows:

/etc/nagios/nagios.cfg

Main configuration file of Nagios, containing a number of directives which define how Nagios operates. See http://nagios.sourceforge.net/docs/3_0/configmain.html for complete documentation.

/etc/nagios/resource.cfg

Contains the path to all Nagios plug-ins (default: /usr/lib/nagios/plugins).

/etc/nagios/command.cfg

Defines the programs used to determine the availability of services and the commands used to send e-mail notifications.

/etc/nagios/cgi.cfg

Contains options regarding the Nagios Web interface.

/etc/nagios/objects/

A directory containing object definition files. See Section 3.3.1, “Object Definition Files” for more details.

3.3.1. Object Definition Files

In addition to these configuration files, Nagios uses flexible and highly customizable files called object definition files. They are important because they define the following objects:

  • Hosts

  • Services

  • Contacts

The flexibility lies in the fact that objects are easy to extend. Imagine you are responsible for a host with only one service running. If you later install another service on the same host and want to monitor it as well, you can add another service object and assign it to the existing host object with little effort.

Right after the installation, Nagios provides default templates for object definition files in /etc/nagios/objects. The following examples show how hosts, services and contacts are added:

Example 3.1. A Host Object Definition

define host {
 name                   SRV1             
 host_name              SRV1
 address                192.168.0.1
 use                    generic-host     
 check_period           24x7            
 check_interval         5           
 retry_interval         1              
 max_check_attempts     10            
 notification_period    workhours     
 notification_interval  120
 notification_options   d,u,r
}

The host_name option defines a name that identifies the host to be monitored, and address is the IP address of this host. The use statement tells Nagios to inherit further configuration values from the generic-host template.

check_period defines the time period during which the host is checked, in this case 24x7 (around the clock). check_interval makes Nagios check the host every 5 minutes, and retry_interval tells Nagios to schedule host check retries at 1 minute intervals. Nagios repeats a check several times when it does not pass; max_check_attempts defines how many attempts Nagios makes.

All directives beginning with notification control how Nagios behaves when a monitored host fails. In the host definition above, Nagios notifies the administrators only during working hours; this can be adjusted with notification_period. According to notification_interval, notifications are resent every two hours (120 minutes). notification_options controls in which states Nagios notifies the administrator: d stands for a down state, u for unreachable and r for recoveries. The flag n, which is not used in this example, disables notifications altogether.
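As a rough worked example of how these intervals interact, the sketch below estimates how long a host must stay down before Nagios has used up its check attempts. It assumes that one time unit equals one minute (the default interval_length of 60 seconds) and that the first failed check counts as attempt 1. The function name is hypothetical.

```shell
#!/bin/bash
# Rough estimate only: minutes between the first failed check and the last retry.
# Assumes one time unit = one minute and that the first failure is attempt 1.
minutes_until_last_attempt() {
  local retry_interval="$1" max_check_attempts="$2"
  echo $(( (max_check_attempts - 1) * retry_interval ))
}

# Values from Example 3.1: retry_interval 1, max_check_attempts 10
minutes_until_last_attempt 1 10
```

With the values from Example 3.1 the function prints 9, so under these assumptions the last attempt takes place about nine minutes after the first failed check.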

Example 3.2. A Service Object Definition

define service {
 use                    generic-service
 host_name              SRV1
 service_description    PING
 contact_groups         router-admins
 check_command          check_ping!100.0,20%!500.0,60%
} 

The first configuration directive, use, tells Nagios to inherit from the generic-service template. host_name assigns the service to a host object; the host itself is defined in the host object definition. service_description sets a description, which in the example above is simply PING. contact_groups refers to a group of people who are contacted when the service fails. This group and its members are defined in a contact group object definition (see Example 3.3). check_command sets the program that checks whether the service is available.
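In check_command, exclamation marks separate the command name from its arguments, which Nagios passes to the command definition as $ARG1$, $ARG2$ and so on. For check_ping, the two arguments are the warning and the critical threshold, each given as a round-trip time in milliseconds and a packet loss in percent. The sketch below only illustrates this splitting; the function name split_check_command is made up for this example.

```shell
#!/bin/bash
# Illustration only: show how a check_command value breaks down into
# the command name and the $ARGn$ macros.
split_check_command() {
  local IFS='!'
  local parts i
  read -r -a parts <<< "$1"
  echo "command=${parts[0]}"
  for (( i = 1; i < ${#parts[@]}; i++ )); do
    echo "ARG$i=${parts[$i]}"
  done
}

split_check_command 'check_ping!100.0,20%!500.0,60%'
```

For the value from Example 3.2, this lists check_ping as the command, 100.0,20% as the first argument (warning) and 500.0,60% as the second argument (critical).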

Example 3.3. A Contact and Contactgroup Definition

define contact {
 contact_name           admins                
 use                    generic-contact        
 alias                  Nagios Admin
 email                  nagios@localhost        
}
                
define contactgroup {
 contactgroup_name      router-admins
 alias                  Administrators
 members                admins
}

The example above shows a contact definition and the contact group it belongs to. The contact definition contains the name and e-mail address of the person who is contacted when a service fails. Usually this is the responsible administrator. use inherits configuration values from the generic-contact definition.

An overview of all Nagios objects and further information about them can be found at: http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html.

3.4. Configuring Nagios

This section explains step by step how to configure Nagios to monitor remote services and remote host resources.

3.4.1. Monitoring Remote Services with Nagios

This section explains how to monitor remote services with Nagios. Proceed as follows to monitor a remote service:

Procedure 3.1. Monitoring a Remote HTTP Service with Nagios

  1. Create a directory inside /etc/nagios/objects using mkdir. You can use any desired name for it.

  2. Open /etc/nagios/nagios.cfg and set cfg_dir (configuration directory) to the directory you have created in the first step.

  3. Change to the configuration directory created in the first step and create the following files: hosts.cfg, services.cfg and contacts.cfg.

  4. Insert a host object in hosts.cfg:

    define host {
     name                   host.name.com
     host_name              host.name.com
     address                192.168.0.1
     use                    generic-host
     check_period           24x7
     check_interval         5
     retry_interval         1
     max_check_attempts     10
     contact_groups         admins
     notification_interval  60
     notification_options   d,u,r
    }                        
                            
  5. Insert a service object in services.cfg:

    define service {
     use                    generic-service
     host_name              host.name.com
     service_description    HTTP
     contact_groups         admins
     check_command          check_http
    }
                    
  6. Insert a contact and contactgroup object in contacts.cfg:

    define contact {
     contact_name           max-mustermann
     use                    generic-contact
     alias                  Webserver Administrator
     email                  mmustermann@localhost
    }
    
    define contactgroup {
     contactgroup_name      admins
     alias                  Administrators
     members                max-mustermann
    }
                        
  7. Execute rcnagios restart to (re)start Nagios.

  8. Execute cat /var/log/nagios/nagios.log and verify whether the following content appears:

     [1242115343] Nagios 3.0.6 starting... (PID=10915)
     [1242115343] Local time is Tue May 12 10:02:23 CEST 2009
     [1242115343] LOG VERSION: 2.0
     [1242115343] Finished daemonizing... (New PID=10916)
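The number in square brackets at the start of each log line is a UNIX timestamp (seconds since 1970-01-01 UTC). A small helper such as the following can make the log easier to read. It is a sketch that assumes GNU date (for the -d @SECONDS syntax); the function name is hypothetical.

```shell
#!/bin/bash
# Convert the leading [epoch] of Nagios log lines into a readable date.
# Assumes GNU date, which accepts "-d @SECONDS".
readable_nagios_log() {
  local line
  local re='^\[([0-9]+)\] (.*)$'
  while IFS= read -r line; do
    if [[ "$line" =~ $re ]]; then
      echo "$(date -d "@${BASH_REMATCH[1]}" '+%Y-%m-%d %H:%M:%S') ${BASH_REMATCH[2]}"
    else
      # Lines without a leading timestamp are passed through unchanged.
      echo "$line"
    fi
  done
}

# Example with the first line of the log output shown above
echo "[1242115343] Nagios 3.0.6 starting... (PID=10915)" | readable_nagios_log
```

To convert the whole log, the function can read the file directly, for example readable_nagios_log < /var/log/nagios/nagios.log. The times are printed in the local time zone of the machine.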

If you need to monitor a different remote service, adjust check_command in Step 5. To obtain a full list of all available check programs, execute ls /usr/lib/nagios/plugins/check_*.

See Section 3.5, “Troubleshooting” if an error occurs.
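Steps 1 to 3 of Procedure 3.1 can also be scripted. The following is a minimal sketch: the directory name myhosts and the function name are placeholders, and the function only prints the cfg_dir line to add instead of editing the main configuration file itself.

```shell
#!/bin/bash
# Sketch of steps 1 to 3: create an object directory with empty object files.
# "myhosts" is a placeholder name; the base directory can be overridden,
# which is useful for trying the function out in a scratch directory.
create_object_dir() {
  local name="$1"
  local base="${2:-/etc/nagios/objects}"
  local dir="$base/$name"

  mkdir -p "$dir" || return 1
  touch "$dir/hosts.cfg" "$dir/services.cfg" "$dir/contacts.cfg" || return 1

  # Line to add to the main Nagios configuration file (step 2)
  echo "cfg_dir=$dir"
}

# Typical call, run as root:
#   create_object_dir myhosts
```

The object definitions from steps 4 to 6 still need to be written into the three files afterwards.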

3.4.2. Monitoring Remote Host-Resources with Nagios

This section explains how to monitor remote host resources with Nagios.

Proceed as follows on the Nagios server:

Procedure 3.2. Monitoring a Remote Host Resource with Nagios (Server)

  1. Install nagios-nsca (for example, zypper in nagios-nsca).

  2. Set the following options in /etc/nagios/nagios.cfg:

    check_external_commands=1
    accept_passive_service_checks=1
    accept_passive_host_checks=1
    command_file=/var/spool/nagios/nagios.cmd

  3. Set the command_file option in /etc/nagios/nsca.conf to the same file defined in /etc/nagios/nagios.cfg.

  4. Add another host and service object:

    define host {
     name                            foobar
     host_name                       foobar
     address                         10.10.4.234
     use                             generic-host
     check_period                    24x7
     check_interval                  0
     retry_interval                  1
     max_check_attempts              1
     active_checks_enabled           0
     passive_checks_enabled          1
     contact_groups                  router-admins
     notification_interval           60
     notification_options            d,u,r
    }

    define service {
     use                             generic-service
     host_name                       foobar
     service_description             diskcheck
     active_checks_enabled           0
     passive_checks_enabled          1
     contact_groups                  router-admins
     check_command                   check_ping
    }

  5. Execute rcnagios restart and rcnsca restart.
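Passive results reach Nagios as external commands written to the file named in command_file. A passive service result uses the PROCESS_SERVICE_CHECK_RESULT command, followed by the host name, the service description, the return code and the plug-in output. The sketch below only formats such a line for illustration (the function name is hypothetical); in the setup described here, the NSCA daemon writes these lines for you.

```shell
#!/bin/bash
# Format a passive service check result as a Nagios external command line.
# Arguments: host name, service description, return code (0-3), plug-in output.
# The optional fifth argument overrides the timestamp (handy for experiments).
passive_result_line() {
  local host="$1" service="$2" code="$3" output="$4"
  local now="${5:-$(date +%s)}"
  printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
    "$now" "$host" "$service" "$code" "$output"
}

passive_result_line foobar diskcheck 0 "OK: test ok"
```

The host name and service description in such a line must match the host_name and service_description of the objects defined in step 4, otherwise Nagios cannot assign the result.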

Proceed as follows on the client you want to monitor:

Procedure 3.3. Monitoring a Remote Host Resource with Nagios (Client)

  1. Install nagios-nsca-client on the host you want to monitor.

  2. Write your test scripts (for example a script that checks the disk usage) like this:

    #!/bin/bash

    NAGIOS_SERVER=10.10.4.166
    THIS_HOST=foobar

    #
    # Write your own test algorithm here and set STATUS to
    # 0 (success), 1 (warning) or 2 (failure)
    #
    STATUS=0

    # Choose the message that matches the test result
    case "$STATUS" in
      0) MESSAGE="OK: test ok" ;;
      1) MESSAGE="Warning: test warning" ;;
      *) STATUS=2; MESSAGE="CRITICAL: test critical" ;;
    esac

    # Send the result to the Nagios server
    echo "$THIS_HOST;diskcheck;$STATUS;$MESSAGE" \
      | send_nsca -H "$NAGIOS_SERVER" -p 5667 -c /etc/nagios/send_nsca.cfg -d ";"

  3. Insert a new cron entry with crontab -e. A typical cron entry could look like this:

    */5 * * * * /directory/to/check/program/check_diskusage
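The placeholder test in the client script could, for example, be filled with a disk usage check. The sketch below is one possible approach and not part of the procedure itself: the thresholds of 80 and 90 percent, the function names and the use of df -P on / are all assumptions to adapt to your system.

```shell
#!/bin/bash
# Hypothetical disk usage test for a passive check script.

# Print the usage of a file system in percent (without the % sign).
disk_usage_percent() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a usage value to a Nagios return code:
# 0 = OK, 1 = WARNING, 2 = CRITICAL (thresholds are example values).
usage_to_status() {
  local usage="$1" warn="${2:-80}" crit="${3:-90}"
  if [ "$usage" -ge "$crit" ]; then
    echo 2
  elif [ "$usage" -ge "$warn" ]; then
    echo 1
  else
    echo 0
  fi
}

usage=$(disk_usage_percent /)
status=$(usage_to_status "$usage")
echo "disk usage on /: ${usage}% (status $status)"
```

The resulting status value belongs in the third field of the line that is piped to send_nsca, followed by a matching message text.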
                            
                        

3.5. Troubleshooting

Error: ABC 'XYZ' specified in ... '...' is not defined anywhere!

Make sure that you have defined all necessary objects correctly. Be careful with the spelling.

(Return code of 127 is out of bounds - plugin may be missing)

Make sure that you have installed nagios-plugins.

E-mail notification does not work

Make sure that you have installed and correctly configured a mail server like postfix or exim. To verify whether your mail server works, execute echo "Mail Server Test!" | mail foo@bar.com, which sends an e-mail to foo@bar.com. If this e-mail arrives, your mail server is working correctly. Otherwise, check the log files of the mail server.
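The return code 127 mentioned in the second entry above is the status a shell reports when it cannot find the command it was asked to run, which is why it points to a missing plug-in. The following sketch reproduces this outside of Nagios; the function name is hypothetical and the plug-in name is deliberately nonexistent.

```shell
#!/bin/bash
# A shell returns 127 when the requested command cannot be found.
run_and_report() {
  local rc=0
  "$@" >/dev/null 2>&1 || rc=$?
  echo "$rc"
}

run_and_report /usr/lib/nagios/plugins/check_does_not_exist   # prints 127
run_and_report true                                           # prints 0
```

If a check command from your configuration yields 127 when run this way, check that the plug-in file exists in the plug-in directory and that the name is spelled correctly in the command definition.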

3.6. For More Information