Agent-based Checks

If the remote services are not directly accessible through the network, a local agent installation that exposes the results to check queries can come in handy.

Prior to installing and configuring an agent service, evaluate the possible options based on these requirements:

  • Security (authentication, TLS certificates, secure connection handling, etc.)
  • Connection direction
    • Master/satellite can execute commands directly or
    • Agent sends back passive/external check results
  • Availability on specific OS types and versions
    • Packages available
  • Configuration and initial setup
  • Updates and maintenance, compatibility

Available agent types:

Icinga Agent

For the most common setups on Linux/Unix and Windows, we recommend setting up the Icinga agent in a distributed environment.

Icinga 2 Distributed Master with Agents

Key benefits:

  • Directly integrated into the distributed monitoring stack of Icinga
  • Works on Linux/Unix and Windows
  • Secure communication with TLS
  • Connection can be established from both sides. Once connected, command execution and check results are exchanged.
    • Master/satellite connects to agent
    • Agent connects to parent satellite/master
  • Same configuration language and binaries
  • Troubleshooting docs and community best practices

Follow the setup and configuration instructions here.

On Windows hosts, the Icinga agent can query a local NSClient++ service for additional checks in case there are no plugins available. The NSCP installer is bundled with Icinga and can be installed with the setup wizard.

Icinga 2 Windows Setup

SSH

Tip

This is the recommended way for systems where the Icinga agent is not available, be it because of specific hardware architectures, old systems, or a policy that forbids installing additional software.

This method uses the SSH service on the remote host to execute an arbitrary plugin command line. The output and exit code are returned and used by the core.

The check_by_ssh plugin takes care of this. It is available in the Monitoring Plugins package. For your convenience, the Icinga template library provides the by_ssh CheckCommand already.

SSH: Preparations

Create an SSH key pair for the Icinga daemon user. If the user has no login shell, temporarily enable one. When asked for a passphrase, leave it empty and press Enter.

sudo su - icinga

ssh-keygen -b 4096 -t rsa -C "icinga@$(hostname) user for check_by_ssh" -f $HOME/.ssh/id_rsa

On the remote agent, create the icinga user and generate a temporary password.

useradd -m icinga
passwd icinga

Copy the public key from the Icinga server to the remote agent, e.g. with ssh-copy-id or manually into /home/icinga/.ssh/authorized_keys. This will ask for the password once.

sudo su - icinga

ssh-copy-id -i $HOME/.ssh/id_rsa icinga@ssh-agent1.localdomain

After the SSH key is copied, test the connection at least once and accept the host key verification. If you forget this step, checks will become UNKNOWN later.

ssh -i $HOME/.ssh/id_rsa icinga@ssh-agent1.localdomain

After the SSH key login works, disable the previously enabled logins.

  • Remote agent user's password with passwd -l icinga
  • Local icinga user terminal

Also, ensure that the permissions are correct for the .ssh directory as otherwise logins will fail.

  • .ssh directory: 700
  • .ssh/id_rsa.pub public key file: 644
  • .ssh/id_rsa private key file: 600
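
The modes above could be applied with a small helper such as the following. This is a minimal sketch: the function name `fix_ssh_permissions` is hypothetical, and it assumes the key files carry the default `id_rsa` names used in this section.

```shell
#!/bin/bash
# Apply the permissions listed above to an .ssh directory.
# fix_ssh_permissions is an illustrative helper name, not part of Icinga.
fix_ssh_permissions() {
    local ssh_dir=${1:-$HOME/.ssh}

    # Directory: only the owner may enter or list it.
    chmod 700 "$ssh_dir" || return 1

    # Public key: world-readable is fine.
    if [ -f "$ssh_dir/id_rsa.pub" ]; then
        chmod 644 "$ssh_dir/id_rsa.pub" || return 1
    fi

    # Private key: readable and writable by the owner only.
    if [ -f "$ssh_dir/id_rsa" ]; then
        chmod 600 "$ssh_dir/id_rsa" || return 1
    fi
}
```

Run it as the icinga user, for example `fix_ssh_permissions /home/icinga/.ssh`, where the path is an assumption based on the home directory created earlier.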

SSH: Configuration

First, create a host object which has SSH configured and enabled. Mark this e.g. with the custom variable agent_type to later use this for service apply rule matches. Best practice is to store that in a specific template, either in the static configuration or inside the Director.

template Host "ssh-agent" {
  check_command = "hostalive"

  vars.agent_type = "ssh"
  vars.os_type = "linux"
}

object Host "ssh-agent1.localdomain" {
  import "ssh-agent"

  address = "192.168.56.115"
}

Example for monitoring the remote users:

apply Service "users" {
  check_command = "by_ssh"

  vars.by_ssh_command = [ "/usr/lib/nagios/plugins/check_users" ]

  // Follows the same principle as with command arguments, e.g. for ordering
  vars.by_ssh_arguments = {
    "-w" = {
      value = "$users_wgreater$" // Can reference an existing custom variable defined on the host or service, evaluated at runtime
    }
    "-c" = {
      value = "$users_cgreater$"
    }
  }

  vars.users_wgreater = 3
  vars.users_cgreater = 5

  assign where host.vars.os_type == "linux" && host.vars.agent_type == "ssh"
}
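
Before relying on the apply rule, it can help to run the equivalent check by hand as the icinga user. The following is a sketch only: the helper name `run_users_by_ssh`, the plugin directory on the Icinga server, and the remote login `icinga` are assumptions taken from the examples in this section.

```shell
#!/bin/bash
# Sketch: run check_users on a remote host through check_by_ssh.
# Assumption: plugins live in the same directory on both hosts.
PLUGIN_DIR=${PLUGIN_DIR:-/usr/lib/nagios/plugins}

run_users_by_ssh() {
    local host=$1 warn=$2 crit=$3

    # With DRY_RUN set, the command line is printed instead of executed.
    ${DRY_RUN:+echo} "$PLUGIN_DIR/check_by_ssh" \
        -H "$host" -l icinga \
        -C "$PLUGIN_DIR/check_users -w $warn -c $crit"
}
```

For example, `run_users_by_ssh ssh-agent1.localdomain 3 5` runs the check, and `DRY_RUN=1 run_users_by_ssh ssh-agent1.localdomain 3 5` only prints the command. The exit code follows the usual plugin convention: 0 is OK, 1 is WARNING, 2 is CRITICAL, and 3 is UNKNOWN.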

A more advanced example with better arguments is shown in this blogpost.

SNMP

The SNMP daemon runs on the remote system and answers SNMP queries sent by plugin scripts. The Monitoring Plugins package provides the check_snmp plugin binary, but there are plenty of existing plugins for specific use cases already around, for example monitoring Cisco routers.

The following example uses the SNMP ITL CheckCommand and sets the snmp_oid custom variable. A service is created for all hosts which have the snmp_community custom variable.

template Host "snmp-agent" {
  check_command = "hostalive"

  vars.agent_type = "snmp"

  vars.snmp_community = "public-icinga"
}

object Host "snmp-agent1.localdomain" {
  import "snmp-agent"
}

apply Service "uptime" {
  import "generic-service"

  check_command = "snmp"
  vars.snmp_oid = "1.3.6.1.2.1.1.3.0"
  vars.snmp_miblist = "DISMAN-EVENT-MIB"

  assign where host.vars.agent_type == "snmp" && host.vars.snmp_community != ""
}

If no snmp_miblist is specified, the plugin will default to ALL. As the number of available MIB files on the system increases, so will the load generated by this plugin if no MIB is specified. As such, it is recommended to always specify at least one MIB.
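
To verify the OID and MIB selection outside of Icinga, the same query can be issued with check_snmp directly. This is a sketch under assumptions: the helper name `run_snmp_uptime` and the plugin directory are illustrative, while the OID, MIB, and community string are taken from the example above.

```shell
#!/bin/bash
# Sketch: query the uptime OID with an explicit MIB instead of the ALL default.
# Assumption: check_snmp is installed in the directory below.
PLUGIN_DIR=${PLUGIN_DIR:-/usr/lib/nagios/plugins}

run_snmp_uptime() {
    local host=$1 community=$2

    # With DRY_RUN set, the command line is printed instead of executed.
    ${DRY_RUN:+echo} "$PLUGIN_DIR/check_snmp" \
        -H "$host" -C "$community" \
        -o 1.3.6.1.2.1.1.3.0 -m DISMAN-EVENT-MIB
}
```

A call such as `run_snmp_uptime snmp-agent1.localdomain public-icinga` should return the uptime value if the SNMP daemon accepts the community string.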

Additional SNMP plugins are available in the Manubulon SNMP Plugins collection.

For network monitoring, community members recommend nwc_health, for example.

SNMP Traps and Passive Check Results

SNMP Traps can be received and filtered by using SNMPTT and specific trap handlers passing the check results to Icinga 2.

Following the SNMPTT Format documentation and the Icinga external command syntax found here we can create generic services that can accommodate any number of hosts for a given scenario.

Simple SNMP Traps

A simple example might be monitoring host reboots indicated by an SNMP agent reset. Building the event to auto reset after dispatching a notification is important. Set up the manual check parameters to reset the event from an initial unhandled state or from a missed reset event.

Add a directive in snmptt.conf:

EVENT coldStart .1.3.6.1.6.3.1.1.5.1 "Status Events" Normal
FORMAT Device reinitialized (coldStart)
EXEC echo "[$@] PROCESS_SERVICE_CHECK_RESULT;$A;Coldstart;2;The snmp agent has reinitialized." >> /var/run/icinga2/cmd/icinga2.cmd
SDESC
A coldStart trap signifies that the SNMPv2 entity, acting
in an agent role, is reinitializing itself and that its
configuration may have been altered.
EDESC

  1. Define the EVENT as per your need.
  2. Construct the EXEC statement with the service name matching your template applied to your n hosts. The host address inferred by SNMPTT will be the correlating factor. You can have SNMPTT provide host names or IP addresses to match your Icinga convention.

Note

Replace the deprecated command pipe EXEC statement with a curl call to the REST API action process-check-result.
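
A helper along the following lines could take over from the command pipe write. It is a minimal sketch, not a definitive implementation: the API URL, the credentials, and the function names are placeholders, and host and service names are assumed to contain no quotes or backslashes because they are embedded in the filter expression.

```shell
#!/bin/bash
# Sketch: submit a passive check result through the REST API.
# Assumptions: URL and credentials below are placeholders for your setup.
ICINGA_API_URL=${ICINGA_API_URL:-https://localhost:5665}
ICINGA_API_USER=${ICINGA_API_USER:-snmptt}
ICINGA_API_PASSWORD=${ICINGA_API_PASSWORD:-change-me}

# Escape backslashes and double quotes for use inside a JSON string.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}
    s=${s//\"/\\\"}
    printf '%s' "$s"
}

# Print the JSON request body for host, service, state and plugin output.
build_check_result() {
    local host=$1 service=$2 state=$3 output
    output=$(json_escape "$4")
    printf '{"type":"Service","filter":"host.name==\\"%s\\" && service.name==\\"%s\\"","exit_status":%d,"plugin_output":"%s"}\n' \
        "$host" "$service" "$state" "$output"
}

# POST the body to the process-check-result action.
# -k skips TLS verification; prefer --cacert with your CA file in production.
submit_check_result() {
    build_check_result "$@" | curl -k -s -S \
        -u "$ICINGA_API_USER:$ICINGA_API_PASSWORD" \
        -H 'Accept: application/json' \
        -X POST "$ICINGA_API_URL/v1/actions/process-check-result" \
        -d @-
}
```

If these functions are saved in a script that ends with `submit_check_result "$@"`, the EXEC line can call that script with `"$A" "Coldstart" 2 "The snmp agent has reinitialized."` as arguments. The API user needs the `actions/process-check-result` permission.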

Add an EventCommand configuration object for the passive service auto reset event.

object EventCommand "coldstart-reset-event" {
  command = [ ConfigDir + "/conf.d/custom/scripts/coldstart_reset_event.sh" ]

  arguments = {
    "-i" = "$service.state_id$"
    "-n" = "$host.name$"
    "-s" = "$service.name$"
  }
}

Create the coldstart_reset_event.sh shell script to pass the expanded variable data in. The $service.state_id$ is important in order to prevent an endless loop of event firing after the service has been reset.

#!/bin/bash

SERVICE_STATE_ID=""
HOST_NAME=""
SERVICE_NAME=""

show_help()
{
cat <<-EOF
	Usage: ${0##*/} [-h] -i SERVICE_STATE_ID -n HOST_NAME -s SERVICE_NAME
	Writes a coldstart reset event to the Icinga command pipe.

	  -h                  Display this help and exit.
	  -i SERVICE_STATE_ID The associated service state id.
	  -n HOST_NAME        The associated host name.
	  -s SERVICE_NAME     The associated service name.
EOF
}

while getopts "hi:n:s:" opt; do
    case "$opt" in
      h)
          show_help
          exit 0
          ;;
      i)
          SERVICE_STATE_ID=$OPTARG
          ;;
      n)
          HOST_NAME=$OPTARG
          ;;
      s)
          SERVICE_NAME=$OPTARG
          ;;
      '?')
          # Unknown option: print usage to stderr and signal failure.
          show_help >&2
          exit 1
          ;;
      esac
done

if [ -z "$SERVICE_STATE_ID" ]; then
    show_help
    printf "\n  Error: -i required.\n"
    exit 1
fi

if [ -z "$HOST_NAME" ]; then
    show_help
    printf "\n  Error: -n required.\n"
    exit 1
fi

if [ -z "$SERVICE_NAME" ]; then
    show_help
    printf "\n  Error: -s required.\n"
    exit 1
fi

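# Only reset when the service is in a non-OK state. Without this guard,
# the reset result would fire the event command again in an endless loop.
if [ "$SERVICE_STATE_ID" -gt 0 ]; then
    echo "[$(date +%s)] PROCESS_SERVICE_CHECK_RESULT;$HOST_NAME;$SERVICE_NAME;0;Auto-reset ($(date +"%m-%d-%Y %T"))." >> /var/run/icinga2/cmd/icinga2.cmd
fi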

Note

Replace the deprecated command pipe write in the script with a curl call to the REST API action process-check-result.

Finally, create the Service and assign it:

apply Service "Coldstart" {
  import "generic-service-custom"

  check_command         = "dummy"
  event_command         = "coldstart-reset-event"

  enable_notifications  = 1
  enable_active_checks  = 0
  enable_passive_checks = 1
  enable_flapping       = 0
  volatile              = 1
  enable_perfdata       = 0

  vars.dummy_state      = 0
  vars.dummy_text       = "Manual reset."

  vars.sla              = "24x7"

  assign where (host.vars.os == "Linux" || host.vars.os == "Windows")
}

Complex SNMP Traps

A more complex example might be passing dynamic data from a trap's varbind list for a backup scenario where the backup software dispatches status updates. By utilizing active and passive checks, the older freshness concept can be leveraged.

By defining the active check as a hard failed state, a missed backup can be reported. As long as passive updates keep arriving within the check interval, the active check is bypassed.

Add a directive in snmptt.conf:

EVENT enterpriseSpecific <YOUR OID> "Status Events" Normal
FORMAT Enterprise specific trap
EXEC echo "[$@] PROCESS_SERVICE_CHECK_RESULT;$A;$1;$2;$3" >> /var/run/icinga2/cmd/icinga2.cmd
SDESC
An enterprise specific trap.
The varbinds in order denote the Icinga service name, state and text.
EDESC

  1. Define the EVENT as per your need using your actual OID.
  2. The service name, state and text are extracted from the first three varbinds. This has the advantage of accommodating an unlimited set of use cases.

Note

Replace the deprecated command pipe EXEC statement with a curl call to the REST API action process-check-result.

Create a Service for the specific use case associated with the host. If the host matches and the first varbind value is Backup, SNMPTT will submit the corresponding passive update with the state and text from the second and third varbind:

object Service "Backup" {
  import "generic-service-custom"

  host_name             = "host.domain.com"
  check_command         = "dummy"

  enable_notifications  = 1
  enable_active_checks  = 1
  enable_passive_checks = 1
  enable_flapping       = 0
  volatile              = 1
  max_check_attempts    = 1
  check_interval        = 87000
  enable_perfdata       = 0

  vars.sla              = "24x7"
  vars.dummy_state      = 2
  vars.dummy_text       = "No passive check result received."
}

The check_interval of 87000 seconds is 24 hours and 10 minutes. Assuming a daily backup, the active dummy check therefore only sets the service to CRITICAL when no passive result has arrived about ten minutes after the expected time.

Agents sending Check Results via REST API

Whenever the remote agent cannot run the Icinga agent, or a backup script should just send its current state after finishing, you can use the REST API as a secure transport and send passive external check results.

Use the process-check-result API action to send the external passive check result. You can either use curl or implement the HTTP requests in your preferred programming language. Examples for API clients are available in this chapter.

Feeding check results from remote hosts requires the host/service objects configured on the master/satellite instance.

NSClient++ on Windows

NSClient++ works on both Windows and Linux platforms and is well known for its magnificent Windows support. There are alternatives like the WMI interface, but using NSClient++ will allow you to run local scripts similar to check plugins fetching the required output and performance counters.

Tip

Best practice is to use the Icinga agent as a secure execution bridge (check_nt and check_nrpe are considered insecure) and query the NSClient++ service locally.

You can use the check_nt plugin from the Monitoring Plugins project to query NSClient++. Icinga 2 provides the nscp check command for this:

Example:

object Service "disk" {
  import "generic-service"

  host_name = "remote-windows-host"

  check_command = "nscp"

  vars.nscp_variable = "USEDDISKSPACE"
  vars.nscp_params = "c"
  vars.nscp_warn = 70
  vars.nscp_crit = 80
}

For details on the NSClient++ configuration please refer to the official documentation.

WMI on Windows

The most popular plugin is check_wmi_plus.

Check WMI Plus uses Windows Management Instrumentation (WMI) to check for common services (CPU, disk, services, eventlog…) on Windows machines. It requires the open source WMI client for Linux.

Community examples: