Nagios - Quick Guide

Nagios - Overview

The DevOps lifecycle is a continuous loop of several stages, and continuous monitoring is the last stage of this loop. In this chapter, let us learn in detail what continuous monitoring is and how Nagios helps with it.

What is Continuous Monitoring

Continuous monitoring starts once deployment to the production servers is complete. From then on, this stage is responsible for monitoring everything that happens. This stage is crucial for business productivity.

Continuous monitoring offers the following benefits −

  1. It detects server and network problems.

  2. It helps find the root cause of failures.

  3. It helps reduce maintenance costs.

  4. It helps troubleshoot performance issues.

  5. It helps in updating the infrastructure before it becomes outdated.

  6. It can trigger automatic fixes when problems are detected.

  7. It helps ensure that servers, services, applications, and the network are always up and running.

  8. It monitors the complete infrastructure continuously.

What is Nagios

Nagios is an open-source continuous monitoring tool that monitors networks, applications, and servers. It can find problems in the infrastructure, help repair them, and catch future issues before they affect end users. It gives a complete view of the status and performance of your IT infrastructure.

Why Nagios

Nagios offers the following features, which make it popular with a large user community −

  1. It can monitor database servers such as SQL Server, Oracle, MySQL, and PostgreSQL.

  2. It gives application-level information (Apache, Postfix, LDAP, Citrix, etc.).

  3. It is under active development.

  4. It has excellent support from a large, active community.

  5. It can monitor systems running many different operating systems.

  6. It can ping hosts to see whether they are reachable.

Benefits of Nagios

Nagios offers the following benefits for the users −

  1. It removes the need for manual periodic testing.

  2. It detects split-second failures while they are still intermittent.

  3. It reduces maintenance costs without sacrificing performance.

  4. It notifies management promptly about failures and breakdowns.

Nagios - Architecture

This chapter talks in detail about Nagios architecture.

Nagios Architecture

The following points about the Nagios architecture are worth noting −

  1. Nagios has a server-agent architecture.

  2. The Nagios server is installed on the monitoring host, and plugins are installed on the remote hosts/servers that are to be monitored.

  3. Nagios sends a signal through its process scheduler to run the plugins on the local/remote hosts/servers.

  4. The plugins collect data (CPU usage, memory usage, etc.) and send it back to the scheduler.

  5. The process scheduler then sends notifications to the admin(s) and updates the Nagios GUI.

[Figure: Nagios architecture]

The following figure shows the Nagios server-agent architecture in detail −

[Figure: Nagios server-agent architecture]
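The scheduler-and-plugin flow above can be illustrated with a short Python sketch. This is a simplified model written for illustration, not Nagios source code. It assumes that each check is a plugin command given as an argument list, that the plugin prints its status on the first line of output, and that its exit code follows the usual plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The names run_check and run_cycle are hypothetical.

```python
import subprocess
import sys

# Plugin exit codes mapped to Nagios state names.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv, timeout=10):
    """Run one plugin and return (state, first line of its output)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "CRITICAL", "plugin timed out"
    except OSError as exc:  # e.g. the plugin binary does not exist
        return "UNKNOWN", f"cannot run plugin: {exc}"
    lines = proc.stdout.splitlines()
    first_line = lines[0] if lines else ""
    # Exit codes outside 0-3 are treated as UNKNOWN.
    return STATES.get(proc.returncode, "UNKNOWN"), first_line


def run_cycle(checks, notify):
    """Run every check once; call notify(name, state, text) for non-OK results."""
    results = {}
    for name, argv in checks.items():
        state, text = run_check(argv)
        results[name] = state
        if state != "OK":
            notify(name, state, text)
    return results


if __name__ == "__main__":
    # Two stand-in "plugins" built from the Python interpreter itself.
    demo_checks = {
        "always_ok": [sys.executable, "-c", "print('OK - all good')"],
        "always_warn": [sys.executable, "-c",
                        "import sys; print('WARNING - load high'); sys.exit(1)"],
    }
    print(run_cycle(demo_checks, lambda name, state, text: print("ALERT:", name, state, text)))
```

Real Nagios also spreads checks over time, retries failures, and decides when to notify. This loop only shows how plugin exit codes drive the resulting state.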

Nagios - Products

Nagios comes as several products, discussed in detail below −

Nagios XI

It provides monitoring for all IT infrastructure components, such as applications, services, networks, and operating systems. It gives a complete view of your infrastructure and business processes. The GUI is easily customizable, giving users flexibility. The standard edition of this tool costs $1995 and the enterprise edition costs $3495.

Nagios Core

It is the core engine for monitoring IT infrastructure, and the Nagios XI product is fundamentally based on it. Whenever there is a failure in the infrastructure, it sends an alert/notification to the admin, who can then act quickly to resolve the issue. This tool is completely free.

Nagios Log Server

It makes searching log data simple and easy. It keeps all the log data in one location with a high-availability setup. It can easily send alerts if any issue is found in the log data. It can scale to thousands of servers, giving more power, speed, storage, and reliability to your log analysis platform. The price of this tool depends on the number of instances: 1 instance $3995, 2 instances $4995, 3 instances $5995, 4 instances $6995, 10 instances $14995.

Nagios Fusion

This product provides a centralized view of the complete monitoring system. With Nagios Fusion, you can set up separate monitoring servers for separate geographies. It can be easily integrated with Nagios XI and Nagios Core to give complete visibility of the infrastructure. This tool costs $2495.

Nagios Network Analyser

It gives the admin complete information about the network infrastructure, including potential threats on the network, so that the admin can act quickly. It shares very detailed data about the network after in-depth network analysis. This tool costs $1995.

Nagios - Installation

In this chapter, the steps to set up Nagios on Ubuntu are discussed in detail.

Before you install Nagios, some packages such as Apache, PHP, and build tools must be present on your Ubuntu system. Hence, let us install them first.

Step 1 − Run the following command to install the prerequisite packages −

sudo apt-get install wget build-essential apache2 php libapache2-mod-php php-gd libgd-dev sendmail unzip

Step 2 − Next, create a nagios user and a nagcmd group, add the nagios user to nagcmd, and add the Apache user www-data to the nagios and nagcmd groups.

sudo useradd nagios
sudo groupadd nagcmd
sudo usermod -a -G nagcmd nagios
sudo usermod -a -G nagios,nagcmd www-data

Step 3 − Download the Nagios Core source package (this guide uses version 4.4.3).

wget https://assets.nagios.com/downloads/nagioscore/releases/nagios-4.4.3.tar.gz

Step 4 − Extract the tarball file.

tar -xzf nagios-4.4.3.tar.gz
cd nagios-4.4.3/

Step 5 − Run the following command to configure the Nagios source code for compilation.

./configure --with-nagios-group=nagios --with-command-group=nagcmd

Step 6 − Run the following command to compile Nagios.

make all

Step 7 − Run the command shown below to install all the Nagios files.

sudo make install

Step 8 − Run the following commands to install the external command directory, the init script, the sample configuration files, and the Apache configuration file for Nagios.

sudo make install-commandmode
sudo make install-init
sudo make install-config
sudo /usr/bin/install -c -m 644 sample-config/httpd.conf /etc/apache2/sites-available/nagios.conf

Step 9 − Now copy the event handlers directory into the Nagios libexec directory and make the nagios user its owner.

sudo cp -R contrib/eventhandlers/ /usr/local/nagios/libexec/
sudo chown -R nagios:nagios /usr/local/nagios/libexec/eventhandlers

Step 10 − Download and extract Nagios plugins.

cd
wget https://nagios-plugins.org/download/nagios-plugins-2.2.1.tar.gz
tar -xzf nagios-plugins*.tar.gz
cd nagios-plugins-2.2.1/

Step 11 − Configure, compile, and install the Nagios plugins using the commands below.

./configure --with-nagios-user=nagios --with-nagios-group=nagios --with-openssl
make
sudo make install

Step 12 − Now edit the main Nagios configuration file and uncomment the line cfg_dir=/usr/local/nagios/etc/servers (line 51 of the sample file).

sudo gedit /usr/local/nagios/etc/nagios.cfg

Step 13 − Now create the servers directory referenced by that cfg_dir line.

sudo mkdir -p /usr/local/nagios/etc/servers

Step 14 − Edit the contacts configuration file, for example to set the email address of the nagiosadmin contact.

sudo gedit /usr/local/nagios/etc/objects/contacts.cfg

Step 15 − Now enable the required Apache modules, create the nagiosadmin web user (htpasswd prompts for its password), and enable the Nagios Apache site.

sudo a2enmod rewrite
sudo a2enmod cgi
sudo htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin
sudo ln -s /etc/apache2/sites-available/nagios.conf /etc/apache2/sites-enabled/

Step 16 − Now restart Apache and start Nagios. Then create an init script for Nagios from the skeleton template.

sudo service apache2 restart
sudo service nagios start
cd /etc/init.d/
sudo cp /etc/init.d/skeleton /etc/init.d/nagios

Step 17 − Edit the new init script and set the following variables.

sudo gedit /etc/init.d/nagios
DESC="Nagios"
NAME=nagios
DAEMON=/usr/local/nagios/bin/$NAME
DAEMON_ARGS="-d /usr/local/nagios/etc/nagios.cfg"
PIDFILE=/usr/local/nagios/var/$NAME.lock

Step 18 − Make the Nagios file executable and start Nagios.

sudo chmod +x /etc/init.d/nagios
sudo service apache2 restart
sudo service nagios start

Step 19 − Now open the URL http://localhost/nagios in your browser and log in with the username nagiosadmin and the password you set in Step 15. The Nagios login screen is shown in the screenshot below −

[Figure: Nagios login screen]

If you have followed all the steps correctly, the Nagios web interface will show up, and you will see the Nagios dashboard as shown below −

[Figure: Nagios dashboard]

Nagios - Configuration

In the previous chapter, we saw how to install Nagios. In this chapter, let us understand its configuration in detail.

The configuration files of Nagios are located in /usr/local/nagios/etc. These files are shown in the screenshot given below −

[Figure: configuration files in /usr/local/nagios/etc]

Let us now look at the purpose of each file −

nagios.cfg

This is the main configuration file of Nagios Core. It specifies the location of the Nagios log file, the host and service status update interval, and the locations of the lock file and the status.dat file. It also defines the user and group under which Nagios runs. It contains the paths to all the individual object configuration files, such as those for commands, contacts, and templates.
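Because nagios.cfg is a plain list of name=value directives, with lines starting with # treated as comments, it is easy to inspect programmatically. For example, you can confirm that the cfg_dir line from the installation chapter is no longer commented out. The Python sketch below is a minimal reader for that simple form only. parse_main_config is a hypothetical helper, not part of Nagios.

```python
from collections import defaultdict


def parse_main_config(text):
    """Parse nagios.cfg-style 'name=value' lines into {name: [values]}.

    Values are collected in lists because directives such as cfg_file
    and cfg_dir may appear more than once.
    """
    directives = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            directives[name.strip()].append(value.strip())
    return dict(directives)


if __name__ == "__main__":
    sample = """
# Object configuration files
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_dir=/usr/local/nagios/etc/servers
nagios_user=nagios
"""
    config = parse_main_config(sample)
    print(config["cfg_file"])
    print("cfg_dir enabled:", "cfg_dir" in config)
```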

cgi.cfg

By default, the CGI configuration file of Nagios is named cgi.cfg. It tells the CGIs where to find the main configuration file. The CGIs will read the main and host config files for any other data they might need. It contains all the user and group information and their rights and permissions. It also has the path for all frontend files of Nagios.

resource.cfg

You can define $USERx$ macros in this file, which can in turn be used in command definitions in your host config file(s). $USERx$ macros are useful for storing sensitive information such as usernames, passwords, etc.

They are also handy for specifying the path to plugins and event handlers - if you decide to move the plugins or event handlers to a different directory in the future, you can just update one or two $USERx$ macros, instead of modifying a lot of command definitions. Resource files may also be used to store configuration directives for external data sources like MySQL.
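To make the macro mechanism concrete, here is a simplified Python sketch of how $USERx$ values could be read from resource.cfg and substituted into a command line. It assumes resource.cfg lines of the form $USER1$=/usr/local/nagios/libexec, which is where this guide's source install puts the plugins. The function names are hypothetical, and this is not how Nagios itself is implemented.

```python
import re

# Matches $USER1$, $USER2$, ... style macros.
USER_MACRO = re.compile(r"\$(USER\d+)\$")


def parse_resource_file(text):
    """Read '$USERn$=value' lines into {'USERn': value}, skipping comments."""
    macros = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        match = USER_MACRO.fullmatch(name.strip())
        if sep and match:
            macros[match.group(1)] = value.strip()
    return macros


def expand_user_macros(command_line, macros):
    """Replace each $USERn$ with its value; unknown macros are left untouched."""
    return USER_MACRO.sub(lambda m: macros.get(m.group(1), m.group(0)), command_line)


if __name__ == "__main__":
    resource = "# Path to the plugins\n$USER1$=/usr/local/nagios/libexec\n"
    macros = parse_resource_file(resource)
    print(expand_user_macros("$USER1$/check_disk -w 20% -c 10% -p /", macros))
```

Moving the plugins then only means changing the one $USER1$ line, as the paragraph above describes.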


The configuration files inside the objects directory are used to define commands, contacts, hosts, services, etc.

commands.cfg

This config file provides some example command definitions that you can reference in host, service, and contact definitions. These commands are used to check and monitor hosts and services. You can also run the underlying plugins locally on a Linux console and see their output directly.

Example

define command {
   command_name check_local_disk
   command_line $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
}

define command {
   command_name check_local_load
   command_line $USER1$/check_load -w $ARG1$ -c $ARG2$
}

define command {
   command_name check_local_procs
   command_line $USER1$/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$
}
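Object definitions like the ones above follow a simple pattern: define, an object type, and a brace-delimited list of "directive value" lines. The Python sketch below reads that pattern into dictionaries, which can help when auditing many definitions. It is an illustration under stated assumptions, not a full Nagios parser. It assumes no "}" appears inside values, and it treats everything after ";" as a comment, so escaped semicolons are not handled. parse_objects is a hypothetical name.

```python
import re

# One "define <type> { ... }" block; assumes no '}' appears inside values.
DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def parse_objects(text):
    """Return a list of (object_type, {directive: value}) pairs."""
    objects = []
    for obj_type, body in DEFINE_RE.findall(text):
        attrs = {}
        for raw_line in body.splitlines():
            line = raw_line.split(";", 1)[0].strip()  # drop ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)  # directive name, then the rest
            attrs[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        objects.append((obj_type, attrs))
    return objects


if __name__ == "__main__":
    sample = """
define command {
   command_name check_local_load
   command_line $USER1$/check_load -w $ARG1$ -c $ARG2$
}
"""
    for obj_type, attrs in parse_objects(sample):
        print(obj_type, attrs["command_name"], "->", attrs["command_line"])
```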

contacts.cfg

This file contains Nagios contact and contact group information. By default, one contact, nagiosadmin (alias Nagios Admin), is already defined.

Example

define contact {
   contact_name nagiosadmin
   use generic-contact
   alias Nagios Admin
   email nagios@localhost ; change this to your own email address
}

define contactgroup {
   contactgroup_name admins
   alias Nagios Administrators
   members nagiosadmin
}

templates.cfg

This config file provides some example object definition templates that are referenced by host, service, contact, etc. definitions in other config files.

timeperiods.cfg

This config file provides some example timeperiod definitions that you can reference in host, service, contact, and dependency definitions.
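A timeperiod definition lists one or more time ranges for each day, for example a line such as monday 09:00-17:00 inside a workhours timeperiod. The Python sketch below uses hypothetical helper names and checks whether a given moment falls inside such weekday ranges. It deliberately leaves out features that real timeperiod definitions support, such as specific calendar dates and exclusions. The end of each range is treated as exclusive in this sketch.

```python
from datetime import datetime

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes after midnight ('24:00' gives 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def parse_ranges(spec):
    """Turn '09:00-12:00,13:00-17:00' into [(540, 720), (780, 1020)]."""
    ranges = []
    for part in spec.split(","):
        start, end = part.strip().split("-")
        ranges.append((to_minutes(start), to_minutes(end)))
    return ranges


def in_timeperiod(weekly, when):
    """Return True if datetime 'when' falls in the ranges for its weekday.

    'weekly' maps weekday names to range strings; days missing from the
    mapping are not covered at all.
    """
    spec = weekly.get(DAYS[when.weekday()])
    if spec is None:
        return False
    minute = when.hour * 60 + when.minute
    return any(start <= minute < end for start, end in parse_ranges(spec))


if __name__ == "__main__":
    workhours = {day: "09:00-17:00" for day in DAYS[:5]}  # Monday to Friday
    print(in_timeperiod(workhours, datetime(2024, 1, 1, 10, 30)))  # a Monday morning: True
```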

Nagios - Features

Nagios is a monitoring tool with a multitude of features, as listed below −

  1. Open source (Nagios Core), and hence free to use.

  2. A powerful monitoring engine that can scale to manage thousands of hosts and servers.

  3. A comprehensive web dashboard that gives visibility into all network components and monitoring data.

  4. Multi-tenant capabilities, where multiple users can access the Nagios dashboard.

  5. An extensible architecture that integrates easily with third-party applications through multiple APIs.

  6. A very large and active community, with over 1 million users across the globe.

  7. A fast alerting system that sends alerts to admins as soon as an issue is identified.

  8. A wide range of available plugins, plus support for custom-coded plugins.

  9. Good logging and database support that records everything happening on the network.

  10. A proactive planning feature that helps you know when it is time to upgrade the infrastructure.

Nagios - Applications

Nagios can be applied to a wide range of tasks, listed here −

  1. Monitor host resources such as disk space, system logs, etc.

  2. Monitor network services such as HTTP, FTP, SMTP, and SSH.

  3. Monitor log files continuously to identify infrastructure issues.

  4. Monitor Windows/Linux/Unix/web applications and their state.

  5. Monitor services remotely with the Nagios Remote Plugin Executor (NRPE).

  6. Run service checks in parallel.

  7. Use SSH or SSL tunnels for remote monitoring.

  8. Send alerts/notifications about any infrastructure issue via email, SMS, or pager.

  9. Recommend when to upgrade the IT infrastructure.

Nagios - Hosts and Services

Nagios is one of the most popular tools for monitoring the hosts and services running in your IT infrastructure. Host and service configurations are the building blocks of Nagios Core.

  1. A host is a device such as a computer; it can be physical or virtual.

  2. A service is something Nagios checks on a host, such as its response to a ping.

You can create a host file inside the Nagios servers directory and put the host and service definitions in it. For example −

sudo gedit /usr/local/nagios/etc/servers/ubuntu_host.cfg

define host {
   use linux-server
   host_name ubuntu_host
   alias Ubuntu Host
   address 192.168.1.10
   register 1
}
define service {
   host_name ubuntu_host
   service_description PING
   check_command check_ping!100.0,20%!500.0,60%
   max_check_attempts 2
   check_interval 2
   retry_interval 2
   check_period 24x7
   check_freshness 1
   contact_groups admins
   notification_interval 2
   notification_period 24x7
   notifications_enabled 1
   register 1
}

The above definitions add a host called ubuntu_host and define a PING service that will run on this host. When you restart Nagios, it will start monitoring this host and run the specified service.
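When many hosts follow the same pattern, typing these definitions by hand gets repetitive, so they are often generated. Below is a minimal Python sketch that assumes the linux-server and generic-service templates from the sample templates.cfg are available. The service template is an assumption here, since the example above sets every service directive explicitly instead. render_object and render_host_with_ping are hypothetical helpers, and their output would be saved into a file in the servers directory.

```python
def render_object(obj_type, attrs):
    """Render one Nagios object definition with aligned directive names."""
    width = max(len(key) for key in attrs)
    lines = [f"define {obj_type} {{"]
    for key, value in attrs.items():
        lines.append(f"   {key.ljust(width)} {value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


def render_host_with_ping(host_name, address):
    """Return a host definition plus a PING service for that host."""
    host = render_object("host", {
        "use": "linux-server",      # template from the sample templates.cfg
        "host_name": host_name,
        "alias": host_name,
        "address": address,
    })
    service = render_object("service", {
        "use": "generic-service",   # assumed template supplying the other directives
        "host_name": host_name,
        "service_description": "PING",
        "check_command": "check_ping!100.0,20%!500.0,60%",
    })
    return host + service


if __name__ == "__main__":
    for name, ip in [("web01", "192.168.1.21"), ("web02", "192.168.1.22")]:
        print(render_host_with_ping(name, ip))
```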

Nagios offers many more services, which can be used to monitor pretty much anything on a running host.

Nagios - Commands

A command definition defines a command. Commands include service checks, service notifications, service event handlers, host checks, host notifications, and host event handlers. Nagios command definitions are placed in the commands.cfg file.

The format of a command definition is as follows −

define command {
   command_name command_name
   command_line command_line
}

Command name − This directive identifies the command. Contact, host, and service definitions reference the command by this name.

Command line − This directive is used to define what is executed by Nagios when the command is used for service or host checks, notifications, or event handlers.

Example

define command{
   command_name check_ssh
   command_line /usr/lib/nagios/plugins/check_ssh '$HOSTADDRESS$'
}

This command executes the plugin /usr/lib/nagios/plugins/check_ssh with one parameter: '$HOSTADDRESS$'.

A very short host definition that would use this check command could be similar to the one shown here −

define host{
   host_name host_tutorial
   address 10.0.0.1
   check_command check_ssh
}

Command definitions tell Nagios how to perform host/service checks. They also define how to generate notifications when an issue is identified and how to handle events. There are several commands for performing checks, such as commands that check whether SSH is working properly, whether a database is up and running, or whether a host is alive.

Some commands tell users what issues are present in the infrastructure. You can create your own custom commands or use any third-party command in Nagios; they are treated the same way as commands from the Nagios Plugins project, with no distinction between them.

You can also pass arguments to a command, which gives more flexibility in performing checks. This is how you define a command with parameters −

define command {
   command_name check-host-alive-limits
   command_line $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}

A host definition that uses the above command, passing the warning and critical thresholds as arguments separated by ! −

define host {
   host_name system2
   address 10.0.15.1
   check_command check-host-alive-limits!1000.0,70%!5000.0,100%
}
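The check_command value above packs the command name and its arguments into one string separated by "!". The Python sketch below shows how such a string maps onto $ARG1$, $ARG2$, and so on, together with $HOSTADDRESS$. It is a simplified illustration, not Nagios's macro engine. It does not handle escaped exclamation marks or other macros, leaves any $ARGn$ without a value as-is, and expand_command is a hypothetical name.

```python
def expand_command(command_line, check_command, host_address):
    """Expand $HOSTADDRESS$ and $ARGn$ macros for one check.

    check_command has the form 'command_name!arg1!arg2...': the text before
    the first '!' names the command, and each later field becomes $ARG1$,
    $ARG2$, and so on.
    """
    name, *args = check_command.split("!")
    expanded = command_line.replace("$HOSTADDRESS$", host_address)
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", value)
    return name, expanded


if __name__ == "__main__":
    name, line = expand_command(
        "check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5",  # $USERx$ prefix omitted
        "check-host-alive-limits!1000.0,70%!5000.0,100%",
        "10.0.15.1",
    )
    print(name)  # check-host-alive-limits
    print(line)  # check_ping -H 10.0.15.1 -w 1000.0,70% -c 5000.0,100% -p 5
```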

You can run external commands in Nagios by writing them to the external command file, which the Nagios daemon processes periodically.

With external commands, you can change many things while Nagios is running: you can temporarily disable some checks, force some checks to run immediately, temporarily disable notifications, and so on. The following is the syntax of an external command as it must be written to the command file, where time is a Unix timestamp −

[time] command_id;command_arguments

You can also check out the list of all external commands that can be used in Nagios here − https://assets.nagios.com/downloads/nagioscore/docs/externalcmds/
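To make the syntax concrete, the Python sketch below builds external command lines. DISABLE_HOST_NOTIFICATIONS and SCHEDULE_FORCED_SVC_CHECK are command IDs from the list linked above. ubuntu_host and PING refer to the host and service defined in the previous chapter, and format_external_command is a hypothetical helper. Writing the resulting lines to the command file (the path set by the command_file directive in nagios.cfg) is left out here.

```python
import time


def format_external_command(command_id, *arguments, timestamp=None):
    """Build one '[time] COMMAND_ID;arg1;arg2...' line for the command file.

    'time' is a Unix timestamp; the current time is used if none is given.
    """
    if timestamp is None:
        timestamp = time.time()
    fields = [command_id, *(str(arg) for arg in arguments)]
    return f"[{int(timestamp)}] {';'.join(fields)}\n"


if __name__ == "__main__":
    now = int(time.time())
    # Temporarily silence notifications for one host.
    print(format_external_command("DISABLE_HOST_NOTIFICATIONS", "ubuntu_host"), end="")
    # Force an immediate check of the PING service on that host.
    print(format_external_command("SCHEDULE_FORCED_SVC_CHECK", "ubuntu_host", "PING", now), end="")
```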

Nagios - Checks and States

Once hosts and services are configured in Nagios, checks are used to see whether they are working as expected. Let us see an example of performing checks on a host −

Suppose you have put your host definitions in the host1.cfg file in the /usr/local/nagios/etc/objects directory.

cd /usr/local/nagios/etc/objects
sudo gedit host1.cfg

This is how your host definitions look currently −

define host {
   host_name host1
   address 10.0.0.1
}

Now let us add the check_interval directive. It sets how often regularly scheduled checks of the host are performed; by default, the value is in minutes. With the definition below, the host will be checked every 3 minutes.

define host {
   host_name host1
   address 10.0.0.1
   check_interval 3
}

Nagios performs two types of checks on hosts and services −

  1. Active Checks

  2. Passive Checks

Active Checks

Active checks are initiated by the Nagios process and run on a regular schedule. The check logic inside the Nagios process starts each active check. To monitor hosts and services running on remote machines, Nagios executes plugins and tells them what information to collect. The plugin then runs on the remote machine (for example, through NRPE), collects the required information, and sends it back to the Nagios daemon. Depending on the status received for the hosts and services, appropriate action is taken.

The figure shown below shows an active check −

[Figure: active check]

Active checks are executed at regular intervals, as defined by check_interval and retry_interval.
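Both directives count in time units whose length is set by interval_length in nagios.cfg (60 seconds by default), so check_interval 3 means 180 seconds. The small Python sketch below shows that arithmetic. It assumes the usual behavior that retry_interval applies while a problem is being re-checked in a soft state (soft and hard states are described later in this chapter) and check_interval applies otherwise. next_check_delay is a hypothetical name.

```python
def next_check_delay(check_interval, retry_interval, in_soft_problem, interval_length=60):
    """Return the number of seconds until the next scheduled active check.

    retry_interval is used while a problem is still in a soft state and being
    re-checked; check_interval is used otherwise. Both count in units of
    interval_length seconds (60 by default).
    """
    units = retry_interval if in_soft_problem else check_interval
    return units * interval_length


if __name__ == "__main__":
    print(next_check_delay(3, 1, in_soft_problem=False))  # 180
    print(next_check_delay(3, 1, in_soft_problem=True))   # 60
```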

Passive Checks

Passive checks are performed by external processes, and the results are given back to Nagios for processing.

Passive checks work as explained here −

An external application checks the status of hosts/services and writes the results to the external command file. When the Nagios daemon reads the external command file, it places all the passive check results it finds in a queue for later processing. When these queued results are periodically processed, notifications or alerts are sent depending on the information in each check result.
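On the external side, a passive check can be as small as writing one line to the command file. The Python sketch below uses the PROCESS_SERVICE_CHECK_RESULT external command. It assumes that passive checks are enabled for the service, that the service (here a hypothetical Backup service on ubuntu_host) is defined in Nagios, and that command_file is the path set by the command_file directive in nagios.cfg. submit_passive_result is a hypothetical helper.

```python
import os
import tempfile
import time


def submit_passive_result(command_file, host, service, return_code, output, timestamp=None):
    """Write one PROCESS_SERVICE_CHECK_RESULT external command.

    return_code uses the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    Returns the line that was written.
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    if timestamp is None:
        timestamp = int(time.time())
    line = (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")
    # The real command file is a named pipe read by Nagios; opening it for
    # writing blocks until Nagios has it open for reading.
    with open(command_file, "w") as handle:
        handle.write(line)
    return line


if __name__ == "__main__":
    # Demo against a regular temporary file instead of the real pipe.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "nagios.cmd")
        submit_passive_result(path, "ubuntu_host", "Backup", 0, "OK - backup finished")
        with open(path) as handle:
            print(handle.read(), end="")
```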

The figure shown below shows a passive check −

[Figure: passive check]

Thus, the difference between active and passive checks is that active checks are run by Nagios, whereas passive checks are run by external applications.

Passive checks are useful when Nagios cannot actively check hosts/services on a regular basis.

Nagios stores the status of the hosts and services it monitors to determine whether they are working properly. In many cases, failures happen randomly and are only temporary; hence Nagios uses state types to track the current status of a host or service.

There are two types of states −

  1. Soft state

  2. Hard state

Soft state

A soft state is used when a check returns a status that differs from the previous one but has not yet been confirmed, for example when a host or service is down for only a very short time. The host or service is checked again and again until the status is confirmed (up to max_check_attempts times).

Hard state

A hard state is used when the check has been attempted max_check_attempts times and the status of the host or service is still not OK. Nagios sends notifications for hard state changes and executes event handlers to handle hard states.
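The soft/hard logic can be modeled as a small state machine. The Python sketch below is a simplification under stated assumptions. A non-OK result starts a soft problem at attempt 1, and each further non-OK result during the soft phase increments the attempt counter. The state becomes hard once the counter reaches max_check_attempts. Recovering from a hard problem counts as a hard change, but recovering from a soft problem does not. It ignores details such as flapping and dependencies. CheckState and record are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    """Simplified soft/hard state tracking for one host or service."""
    max_check_attempts: int
    state: str = "OK"          # OK, WARNING, CRITICAL or UNKNOWN
    state_type: str = "HARD"   # SOFT or HARD
    attempt: int = 1

    def record(self, result):
        """Apply one check result; return True if it causes a hard state change."""
        previous_state, previous_type = self.state, self.state_type
        if result == "OK":
            self.state, self.state_type, self.attempt = "OK", "HARD", 1
            # Recovering from a hard problem is a hard change; a soft recovery is not.
            return previous_state != "OK" and previous_type == "HARD"
        self.state = result
        if previous_state != "OK" and previous_type == "HARD":
            # Already a confirmed problem: only a change of problem state counts.
            return result != previous_state
        # New or still-soft problem: count this attempt.
        self.attempt = 1 if previous_state == "OK" else self.attempt + 1
        if self.attempt >= self.max_check_attempts:
            self.state_type = "HARD"
            return True
        self.state_type = "SOFT"
        return False


if __name__ == "__main__":
    service = CheckState(max_check_attempts=3)
    for result in ["CRITICAL", "CRITICAL", "CRITICAL", "OK"]:
        hard_change = service.record(result)
        print(result, service.state_type, service.attempt, hard_change)
```

The points where record returns True are the hard state changes, which are where Nagios would normally send notifications.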

The following figure shows soft states and hard states.

[Figure: soft and hard states]

Nagios - Ports and Protocols

This chapter gives an overview of the ports and protocols used by Nagios.

Protocols

The default protocols used by Nagios are as follows −

  1. http(s), ports 80 and 443 − The Nagios product interfaces are web-based. Nagios agents can use HTTP to move data.

  2. snmp, ports 161 and 162 − SNMP is an important part of network monitoring. Port 161 is used to send requests to nodes, and port 162 is used to receive traps (messages sent by the nodes).

  3. ssh, port 22 − Nagios is built to run natively on CentOS or RHEL Linux. Administrators can log in to the Nagios server through SSH whenever needed to perform checks.

Ports

The default ports used by common Nagios plugins are as follows −

  1. check_nt (NSClient++) 12489

  2. NRPE 5666

  3. NSCA 5667

  4. NCPA 5693

  5. MSSQL 1433

  6. MySQL 3306

  7. PostgreSQL 5432

  8. MongoDB 27017, 27018

  9. OracleDB 1521

  10. Email (SMTP) 25, 465, 587

  11. WMI 135, 445 / additional dynamically assigned ports in the 1024-1034 range
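The simplest check against any of these ports is whether a TCP connection succeeds, and how quickly. The standard plugins include check_tcp for this. The Python function below is only a rough, hypothetical analogue of that idea, not the real plugin. It measures how long a TCP connection takes, for example to port 5666 on an NRPE client, and maps the result to plugin exit codes using assumed warning and critical thresholds in seconds.

```python
import socket
import time

OK, WARNING, CRITICAL = 0, 1, 2


def check_tcp(host, port, warning=1.0, critical=5.0, timeout=10.0):
    """Connect to host:port and return (exit_code, one-line status text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:  # refused, unreachable, timed out, ...
        return CRITICAL, f"TCP CRITICAL - cannot connect to {host}:{port}: {exc}"
    if elapsed >= critical:
        code, label = CRITICAL, "CRITICAL"
    elif elapsed >= warning:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, f"TCP {label} - {elapsed:.3f} second response time on port {port}"


if __name__ == "__main__":
    # Hypothetical target: an NRPE agent on the local machine.
    print(check_tcp("127.0.0.1", 5666, timeout=2))
```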

Nagios - Add-ons/Plugins

Plugins help Nagios monitor databases, operating systems, applications, network equipment, and protocols. Plugins are compiled executables or scripts (Perl or non-Perl) that extend Nagios functionality to monitor servers and hosts. Nagios executes a plugin to check the status of a service or host. Nagios can be compiled with support for an embedded Perl interpreter to execute Perl plugins; without it, Nagios executes both Perl and non-Perl plugins by forking and running them as external commands.

Types of Nagios Plugins

The following types of plugins are available for Nagios −

Official Nagios Plugins − There are 50 official Nagios Plugins. Official Nagios plugins are developed and maintained by the official Nagios Plugins Team.

Community Plugins − There are over 3,000 third party Nagios plugins that have been developed by hundreds of Nagios community members.

Custom Plugins − You can also write your own Custom Plugins. There are certain guidelines that must be followed to write Custom Plugins.

Guidelines for Writing Custom Nagios Plugins

When writing a custom Nagios plugin, follow the guidelines given below −

  1. Provide a "-V" (or "--version") command-line option that prints the plugin version.

  2. Print only one line of text.

  3. On errors, print diagnostics and only a short part of the help message (for example, the usage line).

  4. Network plugins use DEFAULT_SOCKET_TIMEOUT as their default timeout.

  5. Use "-v" or "--verbose" for the verbosity level.

  6. Use "-t" or "--timeout" for the plugin timeout.

  7. Use "-w" or "--warning" for the warning threshold.

  8. Use "-c" or "--critical" for the critical threshold.

  9. Use "-H" or "--hostname" for the name of the host to check.

Many Nagios plugins run and perform checks at the same time. For all of them to work together smoothly, every plugin reports its result with a standard exit code. The table below lists the exit codes and their meanings −

Exit Code | Status   | Description
0         | OK       | Working fine
1         | WARNING  | Working fine, but needs attention
2         | CRITICAL | Not working correctly
3         | UNKNOWN  | The plugin is unable to determine the status of the host/service

Nagios plugins are configured through command-line options. The following are a few important options accepted by Nagios plugins −

Sr.No | Option         | Description
1     | -h, --help     | Provides help
2     | -V, --version  | Prints the exact version of the plugin
3     | -v, --verbose  | Makes the plugin give more detailed information on what it is doing
4     | -t, --timeout  | Sets the timeout (in seconds); after this time, the plugin reports CRITICAL status
5     | -w, --warning  | Sets the plugin-specific limits for the WARNING status
6     | -c, --critical | Sets the plugin-specific limits for the CRITICAL status
7     | -H, --hostname | Gives the hostname, IP address, or Unix socket to communicate with
8     | -4, --use-ipv4 | Uses IPv4 for network connectivity
9     | -6, --use-ipv6 | Uses IPv6 for network connectivity
10    | -p, --port     | Gives the TCP or UDP port to connect to
11    | -s, --send     | Gives the string that will be sent to the server
12    | -e, --expect   | Gives the string that should be sent back from the server
13    | -q, --quit     | Gives the string to send to the server to close the connection
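Putting these conventions together, a custom plugin in Python might be structured as below. This is a sketch under stated assumptions. It checks the local 1-minute load average via os.getloadavg (Unix only), and VERSION is a made-up value. The thresholds are simple "greater than or equal" limits rather than the full Nagios range syntax, and -t is parsed but not enforced. A usage example would be `./check_load_sketch.py -w 2 -c 4`, where the file name is hypothetical.

```python
import argparse
import os
import sys

VERSION = "0.1"  # hypothetical version string
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


class PluginArgumentParser(argparse.ArgumentParser):
    def error(self, message):
        # argparse normally exits with status 2, which Nagios would read as
        # CRITICAL; report usage errors as UNKNOWN instead.
        self.print_usage(sys.stderr)
        self.exit(UNKNOWN, f"UNKNOWN - {message}\n")


def build_parser():
    parser = PluginArgumentParser(description="Check the 1-minute load average.")
    parser.add_argument("-V", "--version", action="version", version=f"%(prog)s {VERSION}")
    parser.add_argument("-v", "--verbose", action="count", default=0)
    # Parsed for convention only; this sketch does not enforce the timeout.
    parser.add_argument("-t", "--timeout", type=int, default=10)
    parser.add_argument("-w", "--warning", type=float, required=True)
    parser.add_argument("-c", "--critical", type=float, required=True)
    return parser


def main(argv=None, load_source=None):
    args = build_parser().parse_args(argv)
    try:
        load1 = (load_source or os.getloadavg)()[0]
    except (OSError, AttributeError):  # not available on this platform
        print("UNKNOWN - load average not available")
        return UNKNOWN
    if load1 >= args.critical:
        status, code = "CRITICAL", CRITICAL
    elif load1 >= args.warning:
        status, code = "WARNING", WARNING
    else:
        status, code = "OK", OK
    message = f"LOAD {status} - load1={load1:.2f}"
    if args.verbose:
        message += f" (warning={args.warning}, critical={args.critical})"
    print(message)  # exactly one line of output
    return code


if __name__ == "__main__":
    sys.exit(main())
```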

The Nagios plugins package provides many checks for hosts and services to monitor the infrastructure. Let us try out a few checks using Nagios plugins.

SMTP is a protocol used for sending emails. The standard Nagios plugins include a plugin for performing SMTP checks. The command definition for SMTP is −

define command {
   command_name check_smtp
   command_line $USER1$/check_smtp -H $HOSTADDRESS$
}

Let us use Nagios plugins to monitor MySQL. Nagios offers two plugins for this. The first plugin checks whether the MySQL connection is working, and the second runs an SQL query and checks its result against warning and critical thresholds.

The commands definitions for both are as follows −

define command {
   command_name check_mysql
   command_line $USER1$/check_mysql -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -d $ARG3$ -S -w 10 -c 30
}

define command {
   command_name check_mysql_query
   command_line $USER1$/check_mysql_query -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -d $ARG3$ -q '$ARG4$' -w $ARG5$ -c $ARG6$
}

Note − Username, password, and database name must be passed as arguments to both commands.

Nagios offers a plugin to check the disk space on mounted partitions. The command definition is as follows −

define command {
   command_name check_partition
   command_line $USER1$/check_disk -p $ARG1$ -w $ARG2$ -c $ARG3$
}
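With percentage thresholds such as 20% and 10%, check_disk raises WARNING or CRITICAL when the free space on the partition drops below the given percentage. The Python sketch below approximates that rule with shutil.disk_usage. The function names are hypothetical, and the percentage computed here may differ slightly from check_disk's own calculation (for example, because of blocks reserved for root).

```python
import shutil

# Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def classify_free_percent(free_pct, warn_pct, crit_pct):
    """Alert when the free percentage falls below a threshold."""
    if free_pct < crit_pct:
        return CRITICAL
    if free_pct < warn_pct:
        return WARNING
    return OK


def check_partition(path, warn_pct=20.0, crit_pct=10.0):
    """Return (status, free percentage) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_pct = 100.0 * usage.free / usage.total
    return classify_free_percent(free_pct, warn_pct, crit_pct), free_pct
```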

Most checks can be done with the standard Nagios plugins. However, some applications require special checks to monitor them; in that case you can use third-party Nagios plugins, which provide more sophisticated checks for the application. When you use a third-party plugin from Nagios Exchange or download one from another website, it is important to understand its security and licensing implications.

Nagios - NRPE

NRPE (Nagios Remote Plugin Executor) is the Nagios agent daemon that runs checks on remote machines. It allows you to run Nagios plugins on other machines remotely, so you can monitor remote machine metrics such as disk usage and CPU load. Metrics of remote Windows machines can also be checked through Windows agent add-ons.
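On the monitored machine, NRPE only runs commands that are defined in its configuration file (nrpe.cfg) with lines of the form `command[name]=command line`; the Nagios server asks for a check by name. The following Python sketch models that lookup-and-run step. It is a simplified illustration, not NRPE's implementation: the parse_nrpe_commands and run_named_check names are hypothetical, and the real daemon adds network transport, access control (allowed_hosts) and its own timeout handling.

```python
import shlex
import subprocess


def parse_nrpe_commands(text):
    """Collect command[name]=command_line entries from nrpe.cfg-style text."""
    commands = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        if line.startswith("command[") and "]=" in line:
            name, cmdline = line[len("command["):].split("]=", 1)
            commands[name] = cmdline.strip()
    return commands


def run_named_check(commands, name, timeout=10):
    """Run only a pre-defined command; unknown names yield UNKNOWN (3)."""
    if name not in commands:
        return 3, f"command '{name}' is not defined"
    proc = subprocess.run(shlex.split(commands[name]),
                          capture_output=True, text=True, timeout=timeout)
    # The plugin's exit code and first output line become the check result
    return proc.returncode, proc.stdout.strip()
```

Restricting execution to named, pre-defined command lines is what keeps the server from running arbitrary commands on the client.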


Let us see, step by step, how to install and configure NRPE on the client machine that needs to be monitored.

Step 1 − Run the following command to install NRPE on the remote Linux machine to be monitored.

sudo apt-get install nagios-nrpe-server nagios-plugins

Step 2 − Now, on the Nagios server, create a host file inside the servers directory and put all the necessary definitions for the host in it.

sudo gedit /usr/local/nagios/etc/servers/ubuntu_host.cfg
# Ubuntu Host configuration file

define host {
   use linux-server
   host_name ubuntu_host
   alias Ubuntu Host
   address 192.168.1.10
   register 1
}

define service {
   host_name ubuntu_host
   service_description PING
   check_command check_ping!100.0,20%!500.0,60%
   max_check_attempts 2
   check_interval 2
   retry_interval 2
   check_period 24x7
   check_freshness 1
   contact_groups admins
   notification_interval 2
   notification_period 24x7
   notifications_enabled 1
   register 1
}

define service {
   host_name ubuntu_host
   service_description Check Users
   check_command check_local_users!20!50
   max_check_attempts 2
   check_interval 2
   retry_interval 2
   check_period 24x7
   check_freshness 1
   contact_groups admins
   notification_interval 2
   notification_period 24x7
   notifications_enabled 1
   register 1
}

define service {
   host_name ubuntu_host
   service_description Local Disk
   check_command check_local_disk!20%!10%!/
   max_check_attempts 2
   check_interval 2
   retry_interval 2
   check_period 24x7
   check_freshness 1
   contact_groups admins
   notification_interval 2
   notification_period 24x7
   notifications_enabled 1
   register 1
}

define service {
   host_name ubuntu_host
   service_description Check SSH
   check_command check_ssh
   max_check_attempts 2
   check_interval 2
   retry_interval 2
   check_period 24x7
   check_freshness 1
   contact_groups admins
   notification_interval 2
   notification_period 24x7
   notifications_enabled 1
   register 1
}

define service {
   host_name ubuntu_host
   service_description Total Process
   check_command check_local_procs!250!400!RSZDT
   max_check_attempts 2
   check_interval 2
   retry_interval 2
   check_period 24x7
   check_freshness 1
   contact_groups admins
   notification_interval 2
   notification_period 24x7
   notifications_enabled 1
   register 1
}
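Before running the full verification, it can help to catch simple mistakes such as a service pointing at a host_name that is not defined. The Python sketch below parses `define ... { ... }` blocks like the ones above and reports such references. It is a hypothetical helper and much weaker than `nagios -v`: it ignores templates (`use`), other configuration files and the meaning of most directives.

```python
import re

# Opening line of an object definition, e.g. "define service {"
DEFINE_RE = re.compile(r"^define\s+(\w+)\s*\{\s*$")


def parse_objects(text):
    """Return (object_type, {directive: value}) pairs from object definitions."""
    objects = []
    kind, attrs = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):  # blank lines and comments
            continue
        match = DEFINE_RE.match(line)
        if match:
            kind, attrs = match.group(1), {}
        elif line == "}" and attrs is not None:
            objects.append((kind, attrs))
            kind, attrs = None, None
        elif attrs is not None:
            parts = line.split(None, 1)  # directive name, then the rest of the line
            attrs[parts[0]] = parts[1] if len(parts) > 1 else ""
    return objects


def undefined_host_refs(objects):
    """Host names used by services but not defined as hosts in the same text."""
    hosts = {attrs.get("host_name") for kind, attrs in objects if kind == "host"}
    refs = set()
    for kind, attrs in objects:
        if kind == "service":
            # host_name may hold a comma-separated list of hosts
            refs.update(n.strip() for n in attrs.get("host_name", "").split(",") if n.strip())
    return sorted(refs - hosts)
```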

Step 3 − Run the following command to verify the configuration file.

sudo /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Step 4 − If there are no errors, restart NRPE on the monitored machine, and Apache and Nagios on the Nagios server.

sudo service nagios-nrpe-server restart
sudo service apache2 restart
sudo service nagios restart
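Steps 3 and 4 can be combined so that Apache and Nagios are only restarted when the pre-flight check reports no errors. The Python sketch below assumes the paths used above and that the `nagios -v` summary contains lines such as `Total Warnings: 0` and `Total Errors: 0`; the function names are hypothetical. NRPE is still restarted separately on the monitored machine.

```python
import re
import subprocess

# Paths taken from the verification command in Step 3
NAGIOS_BIN = "/usr/local/nagios/bin/nagios"
NAGIOS_CFG = "/usr/local/nagios/etc/nagios.cfg"
# Assumed summary lines, e.g. "Total Warnings: 0" and "Total Errors:   0"
SUMMARY_RE = re.compile(r"Total (Warnings|Errors):\s*(\d+)")


def parse_preflight(output):
    """Return e.g. {'Warnings': 0, 'Errors': 0} from `nagios -v` output."""
    return {kind: int(count) for kind, count in SUMMARY_RE.findall(output)}


def verify_then_restart():
    """Run the pre-flight check; restart Apache and Nagios only if it found no errors."""
    result = subprocess.run(["sudo", NAGIOS_BIN, "-v", NAGIOS_CFG],
                            capture_output=True, text=True)
    counts = parse_preflight(result.stdout)
    # Treat a non-zero exit status or a missing summary as a failure
    if result.returncode != 0 or counts.get("Errors", 1) != 0:
        print(result.stdout)
        return False
    for service in ("apache2", "nagios"):
        subprocess.run(["sudo", "service", service, "restart"], check=True)
    return True
```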

Step 5 − Open your browser and go to the Nagios web interface. You can see that the host to be monitored has been added to Nagios Core. In the same way, you can add more hosts for Nagios to monitor.


Nagios - V Shell

V-Shell is a lightweight web interface for Nagios Core, written in PHP. It is easy to install and use, and serves as an alternative to the standard Nagios web interface. The VShell frontend is built with AngularJS, so its design is responsive and modern. It provides a Quicksearch feature and a RESTful API powered by CodeIgniter.

Nagios VShell is compatible with Nagios XI and Nagios Core 3.x. It requires PHP 5.3 or higher, php-cli, and Apache to be installed on the system. Let us see how to install Nagios VShell.
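Before running the VShell installer, you can confirm the PHP requirement. `php -v` prints a first line such as `PHP 7.4.3 (cli) ...`; the Python sketch below parses that line and compares it with 5.3. The helper names are hypothetical.

```python
import re
import shutil
import subprocess

MIN_PHP = (5, 3)  # minimum version stated above


def parse_php_version(output):
    """Return (major, minor) from `php -v` output, or None if not found."""
    match = re.search(r"\bPHP (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def php_meets_requirement(output, minimum=MIN_PHP):
    """True if the reported version is at least `minimum`."""
    version = parse_php_version(output)
    return version is not None and version >= minimum


def local_php_ok():
    """Check the php-cli binary on this machine, if it is installed."""
    if shutil.which("php") is None:  # php-cli not installed
        return False
    out = subprocess.run(["php", "-v"], capture_output=True, text=True).stdout
    return php_meets_requirement(out)
```

Comparing `(major, minor)` tuples avoids the string-comparison trap where "10.0" would sort before "5.3".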

Step 1 − Go to the /tmp directory and download the VShell tar file.

cd /tmp
wget http://assets.nagios.com/downloads/exchange/nagiosvshell/vshell.tar.gz

Step 2 − Extract the tar file.

tar zxf vshell.tar.gz

Step 3 − Go to the vshell directory and give the install.php file executable permission. Finally, run the install script.

cd vshell
chmod +x install.php
./install.php

Step 4 − Now go to https://192.168.56.101/vshell in your browser and log in as nagiosadmin; VShell will appear.


Nagios - Case Study

In this chapter, let us look at case studies of two organizations that have successfully implemented Nagios.

Bitnetix with Nagios

Bitnetix is an IT consulting organization working in networking, data centers, monitoring and Voice over IP. Through their offerings, they make small businesses look big. Their solutions help businesses manage customer relationships better by increasing engagement and improving customer satisfaction. They describe themselves as being in the business of communication, so delivering the right message to their customers at the right time is very important to them.

Bitnetix was working with a customer in the email marketing business. They monitored dynamically allocated AWS servers that were responsible for delivering thousands of emails to customers. They had been using Nagios Core but wanted to move to the new Nagios XI and integrate it with Chef, with zero downtime. Migrating the live status configuration on Nagios Core to the appropriate checks in Nagios XI was challenging, but they were able to set up the Nagios XI configuration files with Chef integration. They moved all the customers from Nagios Core to Nagios XI with zero downtime. Nagios XI was also integrated with PagerDuty to send instant notifications.

EverWatch.global with Nagios

EverWatch.global is an IT management and consulting organization that helps non-profit and small and medium-sized organizations. It is headquartered in Rochester, New York. They have won numerous awards for their work with Nagios.

EverWatch.global was working with an e-commerce retail client with a billion-dollar annual revenue. They were responsible for keeping the website up and running at all times, monitoring the cart and checkout functionality, and notifying the necessary staff in case of defacement. The challenge was that the client's servers were located 500 miles from its headquarters in New York. To monitor production, staging, quality assurance and development on the same platform, the configuration had to be identical and work for both areas.

With the help of Nagios, they created SSH firewall rules for the equipment and the Network Operations Center. They were also able to perform checks for defacement occurrences and reduce false positives. By configuring event handlers in Nagios, they drastically decreased the number of notifications. Nagios helped them raise their client's website uptime from 85% to 98% annually, which was a huge success.
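To put the uptime figures in perspective, assuming a 365-day year (8,760 hours), the annual downtime implied by each figure works out as follows:

```latex
\text{downtime} = (1 - \text{uptime}) \times 8760\ \text{h}
\qquad
(1 - 0.85) \times 8760 = 1314\ \text{h},
\qquad
(1 - 0.98) \times 8760 = 175.2\ \text{h}
```

so the improvement corresponds to roughly 1,139 fewer hours of downtime per year.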

“In real dollar terms, the company was able to achieve almost $125,000,000 in additional sales as a result.” Eric Loyd, CEO, EverWatch Global.