64 KiB
Monitoring Basics
This part of the Icinga 2 documentation provides an overview of all the basic
monitoring concepts you need to know to run Icinga 2.
Keep in mind these examples are made with a linux server in mind, if you are
using Windows you will need to change the services accordingly. See the ITL reference
for further information.
Hosts and Services
Icinga 2 can be used to monitor the availability of hosts and services. Hosts and services can be virtually anything which can be checked in some way:
- Network services (HTTP, SMTP, SNMP, SSH, etc.)
- Printers
- Switches / routers
- Temperature sensors
- Other local or network-accessible services
Host objects provide a mechanism to group services that are running on the same physical device.
Here is an example of a host object which defines two child services:
object Host "my-server1" {
address = "10.0.0.1"
check_command = "hostalive"
}
object Service "ping4" {
host_name = "my-server1"
check_command = "ping4"
}
object Service "http" {
host_name = "my-server1"
check_command = "http"
}
The example creates two services ping4
and http
which belong to the
host my-server1
.
It also specifies that the host should perform its own check using the hostalive
check command.
The address
attribute is used by check commands to determine which network
address is associated with the host object.
Details on troubleshooting check problems can be found here.
Host States
Hosts can be in any of the following states:
Name | Description |
---|---|
UP | The host is available. |
DOWN | The host is unavailable. |
Service States
Services can be in any of the following states:
Name | Description |
---|---|
OK | The service is working properly. |
WARNING | The service is experiencing some problems but is still considered to be in working condition. |
CRITICAL | The service is in a critical state. |
UNKNOWN | The check could not determine the service's state. |
Hard and Soft States
When detecting a problem with a host/service Icinga re-checks the object a number of
times (based on the max_check_attempts
and retry_interval
settings) before sending
notifications. This ensures that no unnecessary notifications are sent for
transient failures. During this time the object is in a SOFT
state.
After all re-checks have been executed and the object is still in a non-OK
state the host/service switches to a HARD
state and notifications are sent.
Name | Description |
---|---|
HARD | The host/service's state hasn't recently changed. |
SOFT | The host/service has recently changed state and is being re-checked. |
Host and Service Checks
Hosts and services determine their state by running checks in a regular interval.
object Host "router" {
check_command = "hostalive"
address = "10.0.0.1"
}
The hostalive
command is one of several built-in check commands. It sends ICMP
echo requests to the IP address specified in the address
attribute to determine
whether a host is online.
A number of other built-in check commands are also available. In addition to these commands the next few chapters will explain in detail how to set up your own check commands.
Templates
Templates may be used to apply a set of identical attributes to more than one object:
template Service "generic-service" {
max_check_attempts = 3
check_interval = 5m
retry_interval = 1m
enable_perfdata = true
}
apply Service "ping4" {
import "generic-service"
check_command = "ping4"
assign where host.address
}
apply Service "ping6" {
import "generic-service"
check_command = "ping6"
assign where host.address6
}
In this example the ping4
and ping6
services inherit properties from the
template generic-service
.
Objects as well as templates themselves can import an arbitrary number of other templates. Attributes inherited from a template can be overridden in the object if necessary.
You can also import existing non-template objects. Note that templates and objects share the same namespace, i.e. you can't define a template that has the same name like an object.
Custom Attributes
In addition to built-in attributes you can define your own attributes:
object Host "localhost" {
vars.ssh_port = 2222
}
Valid values for custom attributes include:
- Strings and numbers
- Arrays and dictionaries
- Functions
Functions as Custom Attributes
Icinga 2 lets you specify functions for custom attributes. The special case here is that whenever Icinga 2 needs the value for such a custom attribute it runs the function and uses whatever value the function returns:
object CheckCommand "random-value" {
import "plugin-check-command"
command = [ PluginDir + "/check_dummy", "0", "$text$" ]
vars.text = {{ Math.random() * 100 }}
}
This example uses the abbreviated lambda syntax.
These functions have access to a number of variables:
Variable | Description |
---|---|
user | The User object (for notifications). |
service | The Service object (for service checks/notifications/event handlers). |
host | The Host object. |
command | The command object (e.g. a CheckCommand object for checks). |
Here's an example:
vars.text = {{ host.check_interval }}
In addition to these variables the macro
function can be used to retrieve the
value of arbitrary macro expressions:
vars.text = {{
if (macro("$address$") == "127.0.0.1") {
log("Running a check for localhost!")
}
return "Some text"
}}
Acessing object attributes at runtime inside these functions is described in the advanced topics chapter.
Runtime Macros
Macros can be used to access other objects' attributes at runtime. For example they are used in command definitions to figure out which IP address a check should be run against:
object CheckCommand "my-ping" {
import "plugin-check-command"
command = [ PluginDir + "/check_ping", "-H", "$ping_address$" ]
arguments = {
"-w" = "$ping_wrta$,$ping_wpl$%"
"-c" = "$ping_crta$,$ping_cpl$%"
"-p" = "$ping_packets$"
}
vars.ping_address = "$address$"
vars.ping_wrta = 100
vars.ping_wpl = 5
vars.ping_crta = 250
vars.ping_cpl = 10
vars.ping_packets = 5
}
object Host "router" {
check_command = "my-ping"
address = "10.0.0.1"
}
In this example we are using the $address$
macro to refer to the host's address
attribute.
We can also directly refer to custom attributes, e.g. by using $ping_wrta$
. Icinga
automatically tries to find the closest match for the attribute you specified. The
exact rules for this are explained in the next section.
Evaluation Order
When executing commands Icinga 2 checks the following objects in this order to look up macros and their respective values:
- User object (only for notifications)
- Service object
- Host object
- Command object
- Global custom attributes in the
Vars
constant
This execution order allows you to define default values for custom attributes in your command objects.
Here's how you can override the custom attribute ping_packets
from the previous
example:
object Service "ping" {
host_name = "localhost"
check_command = "my-ping"
vars.ping_packets = 10 // Overrides the default value of 5 given in the command
}
If a custom attribute isn't defined anywhere an empty value is used and a warning is written to the Icinga 2 log.
You can also directly refer to a specific attribute - thereby ignoring these evaluation rules - by specifying the full attribute name:
$service.vars.ping_wrta$
This retrieves the value of the ping_wrta
custom attribute for the service. This
returns an empty value if the service does not have such a custom attribute no matter
whether another object such as the host has this attribute.
Host Runtime Macros
The following host custom attributes are available in all commands that are executed for hosts or services:
Name | Description |
---|---|
host.name | The name of the host object. |
host.display_name | The value of the display_name attribute. |
host.state | The host's current state. Can be one of UNREACHABLE , UP and DOWN . |
host.state_id | The host's current state. Can be one of 0 (up), 1 (down) and 2 (unreachable). |
host.state_type | The host's current state type. Can be one of SOFT and HARD . |
host.check_attempt | The current check attempt number. |
host.max_check_attempts | The maximum number of checks which are executed before changing to a hard state. |
host.last_state | The host's previous state. Can be one of UNREACHABLE , UP and DOWN . |
host.last_state_id | The host's previous state. Can be one of 0 (up), 1 (down) and 2 (unreachable). |
host.last_state_type | The host's previous state type. Can be one of SOFT and HARD . |
host.last_state_change | The last state change's timestamp. |
host.downtime_depth | The number of active downtimes. |
host.duration_sec | The time since the last state change. |
host.latency | The host's check latency. |
host.execution_time | The host's check execution time. |
host.output | The last check's output. |
host.perfdata | The last check's performance data. |
host.last_check | The timestamp when the last check was executed. |
host.check_source | The monitoring instance that performed the last check. |
host.num_services | Number of services associated with the host. |
host.num_services_ok | Number of services associated with the host which are in an OK state. |
host.num_services_warning | Number of services associated with the host which are in a WARNING state. |
host.num_services_unknown | Number of services associated with the host which are in an UNKNOWN state. |
host.num_services_critical | Number of services associated with the host which are in a CRITICAL state. |
Service Runtime Macros
The following service macros are available in all commands that are executed for services:
Name | Description |
---|---|
service.name | The short name of the service object. |
service.display_name | The value of the display_name attribute. |
service.check_command | The short name of the command along with any arguments to be used for the check. |
service.state | The service's current state. Can be one of OK , WARNING , CRITICAL and UNKNOWN . |
service.state_id | The service's current state. Can be one of 0 (ok), 1 (warning), 2 (critical) and 3 (unknown). |
service.state_type | The service's current state type. Can be one of SOFT and HARD . |
service.check_attempt | The current check attempt number. |
service.max_check_attempts | The maximum number of checks which are executed before changing to a hard state. |
service.last_state | The service's previous state. Can be one of OK , WARNING , CRITICAL and UNKNOWN . |
service.last_state_id | The service's previous state. Can be one of 0 (ok), 1 (warning), 2 (critical) and 3 (unknown). |
service.last_state_type | The service's previous state type. Can be one of SOFT and HARD . |
service.last_state_change | The last state change's timestamp. |
service.downtime_depth | The number of active downtimes. |
service.duration_sec | The time since the last state change. |
service.latency | The service's check latency. |
service.execution_time | The service's check execution time. |
service.output | The last check's output. |
service.perfdata | The last check's performance data. |
service.last_check | The timestamp when the last check was executed. |
service.check_source | The monitoring instance that performed the last check. |
Command Runtime Macros
The following custom attributes are available in all commands:
Name | Description |
---|---|
command.name | The name of the command object. |
User Runtime Macros
The following custom attributes are available in all commands that are executed for users:
Name | Description |
---|---|
user.name | The name of the user object. |
user.display_name | The value of the display_name attribute. |
Notification Runtime Macros
Name | Description |
---|---|
notification.type | The type of the notification. |
notification.author | The author of the notification comment, if existing. |
notification.comment | The comment of the notification, if existing. |
Global Runtime Macros
The following macros are available in all executed commands:
Name | Description |
---|---|
icinga.timet | Current UNIX timestamp. |
icinga.long_date_time | Current date and time including timezone information. Example: 2014-01-03 11:23:08 +0000 |
icinga.short_date_time | Current date and time. Example: 2014-01-03 11:23:08 |
icinga.date | Current date. Example: 2014-01-03 |
icinga.time | Current time including timezone information. Example: 11:23:08 +0000 |
icinga.uptime | Current uptime of the Icinga 2 process. |
The following macros provide global statistics:
Name | Description |
---|---|
icinga.num_services_ok | Current number of services in state 'OK'. |
icinga.num_services_warning | Current number of services in state 'Warning'. |
icinga.num_services_critical | Current number of services in state 'Critical'. |
icinga.num_services_unknown | Current number of services in state 'Unknown'. |
icinga.num_services_pending | Current number of pending services. |
icinga.num_services_unreachable | Current number of unreachable services. |
icinga.num_services_flapping | Current number of flapping services. |
icinga.num_services_in_downtime | Current number of services in downtime. |
icinga.num_services_acknowledged | Current number of acknowledged service problems. |
icinga.num_hosts_up | Current number of hosts in state 'Up'. |
icinga.num_hosts_down | Current number of hosts in state 'Down'. |
icinga.num_hosts_unreachable | Current number of unreachable hosts. |
icinga.num_hosts_flapping | Current number of flapping hosts. |
icinga.num_hosts_in_downtime | Current number of hosts in downtime. |
icinga.num_hosts_acknowledged | Current number of acknowledged host problems. |
Apply Rules
Instead of assigning each object (Service,
Notification, Dependency,
ScheduledDowntime)
based on attribute identifiers for example host_name
objects can be applied.
Before you start using the apply rules keep the following in mind:
- Define the best match.
- A set of unique custom attributes for these hosts/services?
- Or group memberships, e.g. a host being a member of a hostgroup, applying services to it?
- A generic pattern match on the host/service name?
- Multiple expressions combined with
&&
or||
operators
- All expressions must return a boolean value (an empty string is equal to
false
e.g.)
Note
You can set/override object attributes in apply rules using the respectively available objects in that scope (host and/or service objects).
Custom attributes can also store nested dictionaries and arrays. That way you can use them for not only matching for their existance or values in apply expressions, but also assign ("inherit") their values into the generated objected from apply rules.
- Apply services to hosts
- Apply notifications to hosts and services
- Apply dependencies to hosts and services
- Apply scheduled downtimes to hosts and services
A more advanced example is using apply with for loops on arrays or dictionaries for example provided by custom atttributes or groups.
Tip
Building configuration in that dynamic way requires detailed information of the generated objects. Use the
object list
CLI command after successful configuration validation.
Apply Rules Expressions
You can use simple or advanced combinations of apply rule expressions. Each
expression must evaluate into the boolean true
value. An empty string
will be for instance interpreted as false
. In a similar fashion undefined
attributes will return false
.
Returns false
:
assign where host.vars.attribute_does_not_exist
Multiple assign where
condition rows are evaluated as OR
condition.
You can combine multiple expressions for matching only a subset of objects. In some cases,
you want to be able to add more than one assign/ignore where expression which matches
a specific condition. To achieve this you can use the logical and
and or
operators.
Match all *mysql*
patterns in the host name and (&&
) custom attribute prod_mysql_db
matches the db-*
pattern. All hosts with the custom attribute test_server
set to true
should be ignored, or any host name ending with *internal
pattern.
object HostGroup "mysql-server" {
display_name = "MySQL Server"
assign where match("*mysql*", host.name) && match("db-*", host.vars.prod_mysql_db)
ignore where host.vars.test_server == true
ignore where match("*internal", host.name)
}
Similar example for advanced notification apply rule filters: If the service
attribute notes
contains the has gold support 24x7
string AND
one of the
two condition passes: Either the customer
host custom attribute is set to customer-xy
OR
the host custom attribute always_notify
is set to true
.
The notification is ignored for services whose host name ends with *internal
OR
the priority
custom attribute is less than 2
.
template Notification "cust-xy-notification" {
users = [ "noc-xy", "mgmt-xy" ]
command = "mail-service-notification"
}
apply Notification "notify-cust-xy-mysql" to Service {
import "cust-xy-notification"
assign where match("*has gold support 24x7*", service.notes) && (host.vars.customer == "customer-xy" || host.vars.always_notify == true
ignore where match("*internal", host.name) || (service.vars.priority < 2 && host.vars.is_clustered == true)
}
Apply Services to Hosts
The sample configuration already includes a detailed example in hosts.conf and services.conf for this use case.
The example for ssh
applies a service object to all hosts with the address
attribute being defined and the custom attribute os
set to the string Linux
in vars
.
apply Service "ssh" {
import "generic-service"
check_command = "ssh"
assign where host.address && host.vars.os == "Linux"
}
Other detailed scenario examples are used in their respective chapters, for example apply services with custom command arguments.
Apply Notifications to Hosts and Services
Notifications are applied to specific targets (Host
or Service
) and work in a similar
manner:
apply Notification "mail-noc" to Service {
import "mail-service-notification"
user_groups = [ "noc" ]
assign where host.vars.notification.mail
}
In this example the mail-noc
notification will be created as object for all services having the
notification.mail
custom attribute defined. The notification command is set to mail-service-notification
and all members of the user group noc
will get notified.
Apply Dependencies to Hosts and Services
Detailed examples can be found in the dependencies chapter.
Apply Recurring Downtimes to Hosts and Services
The sample confituration includes an example in downtimes.conf.
Detailed examples can be found in the recurring downtimes chapter.
Using Apply For Rules
Next to the standard way of using apply rules there is the requirement of generating apply rules objects based on set (array or dictionary).
The sample configuration already includes a detailed example in hosts.conf and services.conf for this use case.
Take the following example: A host provides the snmp oids for different service check types. This could look like the following example:
object Host "router-v6" {
check_command = "hostalive"
address6 = "::1"
vars.oids["if01"] = "1.1.1.1.1"
vars.oids["temp"] = "1.1.1.1.2"
vars.oids["bgp"] = "1.1.1.1.5"
}
Now we want to create service checks for if01
and temp
but not bgp
.
Furthermore we want to pass the snmp oid stored as dictionary value to the
custom attribute called vars.snmp_oid
- this is the command argument required
by the snmp check command.
The service's display_name
should be set to the identifier inside the dictionary.
apply Service for (identifier => oid in host.vars.oids) {
check_command = "snmp"
display_name = identifier
vars.snmp_oid = oid
ignore where identifier == "bgp" //don't generate service for bgp checks
}
Icinga 2 evalatues the apply for
rule for all objects with the custom attribute
oids
set. It then iterates over all list items inside the for
loop and evaluates the
assign/ignore where
expressions. You can access the loop variable
in these expressions, e.g. for ignoring certain values.
In this example we'd ignore the bgp
identifier and avoid generating an unwanted service.
We could extend the configuration by also matching the oid
value on certain regex/wildcard
patterns for example.
Note
You don't need an
assign where
expression only checking for existance of the custom attribute.
That way you'll save duplicated apply rules by combining them into one
generic apply for
rule generating the object name with or without a prefix.
Apply For and Custom Attribute Override
Imagine a different more advanced example: You are monitoring your switch (hosts) with many interfaces (services). The following requirements/problems apply:
- Each interface service check should be named with a prefix and a running number
- Each interface has its own vlan tag
- Some interfaces have QoS enabled
- Additional attributes such as
display_name
ornotes,
notes_urland
action_url` must be dynamically generated
By defining the interfaces
dictionary with three example interfaces on the core-switch
host object, you'll make sure to pass the storage required by the for loop in the service apply
rule.
object Host "core-switch" {
import "generic-host"
address = "127.0.0.1"
vars.interfaces["0"] = {
port = 1
vlan = "internal"
address = "127.0.0.2"
qos = "enabled"
}
vars.interfaces["1"] = {
port = 2
vlan = "mgmt"
address = "127.0.1.2"
}
vars.interfaces["2"] = {
port = 3
vlan = "remote"
address = "127.0.2.2"
}
}
You can also omit the "if-"
string, then all generated service names are directly
taken from the if_name
variable value.
The config dictionary contains all key-value pairs for the specific interface in one
loop cycle, like port
, vlan
, address
and qos
for the 0
interface.
By defining a default value for the custom attribute qos
in the vars
dictionary
before adding the config
dictionary we'll ensure that this attribute is always defined.
After vars
is fully populated, all object attributes can be set. For strings, you can use
string concatention with the +
operator.
You can also specifiy the check command that way.
apply Service "if-" for (if_name => config in host.vars.interfaces) {
import "generic-service"
check_command = "ping4"
vars.qos = "disabled"
vars += config
display_name = "if-" + if_name + "-" + vars.vlan
notes = "Interface check for Port " + string(vars.port) + " in VLAN " + vars.vlan + " on Address " + vars.address + " QoS " + vars.qos
notes_url = "http://foreman.company.com/hosts/" + host.name
action_url = "http://snmp.checker.company.com/" + host.name + "if-" + if_name
}
Note that numbers must be explicitely casted to string when adding to strings. This can be achieved by wrapping them into the string() function.
Tip
Building configuration in that dynamic way requires detailed information of the generated objects. Use the
object list
CLI command after successful configuration validation.
Use Object Attributes in Apply Rules
Since apply rules are evaluated after the generic objects, you can reference existing host and/or service object attributes as values for any object attribute specified in that apply rule.
object Host "opennebula-host" {
import "generic-host"
address = "10.1.1.2"
vars.hosting["xyz"] = {
http_uri = "/shop"
customer_name = "Customer xyz"
customer_id = "7568"
support_contract = "gold"
}
vars.hosting["abc"] = {
http_uri = "/shop"
customer_name = "Customer xyz"
customer_id = "7568"
support_contract = "silver"
}
}
apply Service for (customer => config in host.vars.hosting) {
import "generic-service"
check_command = "ping4"
vars.qos = "disabled"
vars += config
vars.http_uri = "/" + vars.customer + "/" + config.http_uri
display_name = "Shop Check for " + vars.customer_name + "-" + vars.customer_id
notes = "Support contract: " + vars.support_contract + " for Customer " + vars.customer_name + " (" + vars.customer_id + ")."
notes_url = "http://foreman.company.com/hosts/" + host.name
action_url = "http://snmp.checker.company.com/" + host.name + "/" + vars.customer_id
}
Groups
A group is a collection of similar objects. Groups are primarily used as a visualization aid in web interfaces.
Group membership is defined at the respective object itself. If
you have a hostgroup name windows
for example, and want to assign
specific hosts to this group for later viewing the group on your
alert dashboard, first create a HostGroup object:
object HostGroup "windows" {
display_name = "Windows Servers"
}
Then add your hosts to this group:
template Host "windows-server" {
groups += [ "windows" ]
}
object Host "mssql-srv1" {
import "windows-server"
vars.mssql_port = 1433
}
object Host "mssql-srv2" {
import "windows-server"
vars.mssql_port = 1433
}
This can be done for service and user groups the same way:
object UserGroup "windows-mssql-admins" {
display_name = "Windows MSSQL Admins"
}
template User "generic-windows-mssql-users" {
groups += [ "windows-mssql-admins" ]
}
object User "win-mssql-noc" {
import "generic-windows-mssql-users"
email = "noc@example.com"
}
object User "win-mssql-ops" {
import "generic-windows-mssql-users"
email = "ops@example.com"
}
Group Membership Assign
Instead of manually assigning each object to a group you can also assign objects to a group based on their attributes:
object HostGroup "prod-mssql" {
display_name = "Production MSSQL Servers"
assign where host.vars.mssql_port && host.vars.prod_mysql_db
ignore where host.vars.test_server == true
ignore where match("*internal", host.name)
}
In this example all hosts with the vars
attribute mssql_port
will be added as members to the host group mssql
. However, all *internal
hosts or with the test_server
attribute set to true
are not added to this
group.
Details on the assign where
syntax can be found in the
Language Reference
Notifications
Notifications for service and host problems are an integral part of your monitoring setup.
When a host or service is in a downtime, a problem has been acknowledged or the dependency logic determined that the host/service is unreachable, no notifications are sent. You can configure additional type and state filters refining the notifications being actually sent.
There are many ways of sending notifications, e.g. by e-mail, XMPP, IRC, Twitter, etc. On its own Icinga 2 does not know how to send notifications. Instead it relies on external mechanisms such as shell scripts to notify users.
A notification specification requires one or more users (and/or user groups)
who will be notified in case of problems. These users must have all custom
attributes defined which will be used in the NotificationCommand
on execution.
The user icingaadmin
in the example below will get notified only on WARNING
and
CRITICAL
states and problem
and recovery
notification types.
object User "icingaadmin" {
display_name = "Icinga 2 Admin"
enable_notifications = true
states = [ OK, Warning, Critical ]
types = [ Problem, Recovery ]
email = "icinga@localhost"
}
If you don't set the states
and types
configuration attributes for the User
object, notifications for all states and types will be sent.
Details on troubleshooting notification problems can be found here.
Note
Make sure that the notification feature is enabled in order to execute notification commands.
You should choose which information you (and your notified users) are interested in case of emergency, and also which information does not provide any value to you and your environment.
An example notification command is explained here.
You can add all shared attributes to a Notification
template which is inherited
to the defined notifications. That way you'll save duplicated attributes in each
Notification
object. Attributes can be overridden locally.
template Notification "generic-notification" {
interval = 15m
command = "mail-service-notification"
states = [ Warning, Critical, Unknown ]
types = [ Problem, Acknowledgement, Recovery, Custom, FlappingStart,
FlappingEnd, DowntimeStart, DowntimeEnd, DowntimeRemoved ]
period = "24x7"
}
The time period 24x7
is included as example configuration with Icinga 2.
Use the apply
keyword to create Notification
objects for your services:
apply Notification "notify-cust-xy-mysql" to Service {
import "generic-notification"
users = [ "noc-xy", "mgmt-xy" ]
assign where match("*has gold support 24x7*", service.notes) && (host.vars.customer == "customer-xy" || host.vars.always_notify == true
ignore where match("*internal", host.name) || (service.vars.priority < 2 && host.vars.is_clustered == true)
}
Instead of assigning users to notifications, you can also add the user_groups
attribute with a list of user groups to the Notification
object. Icinga 2 will
send notifications to all group members.
Note
Only users who have been notified of a problem before (
Warning
,Critical
,Unknown
states for services,Down
for hosts) will receiveRecovery
notifications.
Notification Escalations
When a problem notification is sent and a problem still exists at the time of re-notification you may want to escalate the problem to the next support level. A different approach is to configure the default notification by email, and escalate the problem via SMS if not already solved.
You can define notification start and end times as additional configuration
attributes making the Notification
object a so-called notification escalation
.
Using templates you can share the basic notification attributes such as users or the
interval
(and override them for the escalation then).
Using the example from above, you can define additional users being escalated for SMS notifications between start and end time.
object User "icinga-oncall-2nd-level" {
display_name = "Icinga 2nd Level"
vars.mobile = "+1 555 424642"
}
object User "icinga-oncall-1st-level" {
display_name = "Icinga 1st Level"
vars.mobile = "+1 555 424642"
}
Define an additional NotificationCommand for SMS notifications.
Note
The example is not complete as there are many different SMS providers. Please note that sending SMS notifications will require an SMS provider or local hardware with a SIM card active.
object NotificationCommand "sms-notification" {
command = [
PluginDir + "/send_sms_notification",
"$mobile$",
"..."
}
The two new notification escalations are added onto the local host
and its service ping4
using the generic-notification
template.
The user icinga-oncall-2nd-level
will get notified by SMS (sms-notification
command) after 30m
until 1h
.
Note
The
interval
was set to 15m in thegeneric-notification
template example. Lower that value in your escalations by using a secondary template or by overriding the attribute directly in thenotifications
array position forescalation-sms-2nd-level
.
If the problem does not get resolved nor acknowledged preventing further notifications
the escalation-sms-1st-level
user will be escalated 1h
after the initial problem was
notified, but only for one hour (2h
as end
key for the times
dictionary).
apply Notification "mail" to Service {
import "generic-notification"
command = "mail-notification"
users = [ "icingaadmin" ]
assign where service.name == "ping4"
}
apply Notification "escalation-sms-2nd-level" to Service {
import "generic-notification"
command = "sms-notification"
users = [ "icinga-oncall-2nd-level" ]
times = {
begin = 30m
end = 1h
}
assign where service.name == "ping4"
}
apply Notification "escalation-sms-1st-level" to Service {
import "generic-notification"
command = "sms-notification"
users = [ "icinga-oncall-1st-level" ]
times = {
begin = 1h
end = 2h
}
assign where service.name == "ping4"
}
Notification Delay
Sometimes the problem in question should not be notified when the notification is due
(the object reaching the HARD
state) but a defined time duration afterwards. In Icinga 2
you can use the times
dictionary and set begin = 15m
as key and value if you want to
postpone the notification window for 15 minutes. Leave out the end
key - if not set,
Icinga 2 will not check against any end time for this notification. Make sure to
specify a relatively low notification interval
to get notified soon enough again.
apply Notification "mail" to Service {
import "generic-notification"
command = "mail-notification"
users = [ "icingaadmin" ]
interval = 5m
times.begin = 15m // delay notification window
assign where service.name == "ping4"
}
Disable Re-notifications
If you prefer to be notified only once, you can disable re-notifications by setting the
interval
attribute to 0
.
apply Notification "notify-once" to Service {
import "generic-notification"
command = "mail-notification"
users = [ "icingaadmin" ]
interval = 0 // disable re-notification
assign where service.name == "ping4"
}
Notification Filters by State and Type
If there are no notification state and type filter attributes defined at the Notification
or User
object Icinga 2 assumes that all states and types are being notified.
Available state and type filters for notifications are:
template Notification "generic-notification" {
states = [ Warning, Critical, Unknown ]
types = [ Problem, Acknowledgement, Recovery, Custom, FlappingStart,
FlappingEnd, DowntimeStart, DowntimeEnd, DowntimeRemoved ]
}
If you are familiar with Icinga 1.x notification_options
please note that they have been split
into type and state to allow more fine granular filtering for example on downtimes and flapping.
You can filter for acknowledgements and custom notifications too.
Commands
Icinga 2 uses three different command object types to specify how checks should be performed, notifications should be sent, and events should be handled.
Check Commands
CheckCommand objects define the command line how a check is called.
CheckCommand objects are referenced by
Host and Service objects
using the check_command
attribute.
Note
Make sure that the checker feature is enabled in order to execute checks.
Integrate the Plugin with a CheckCommand Definition
CheckCommand objects require the ITL template
plugin-check-command
to support native plugin based check methods.
Unless you have done so already, download your check plugin and put it
into the PluginDir directory. The following example uses the
check_disk
plugin contained in the Monitoring Plugins package.
The plugin path and all command arguments are made a list of double-quoted string arguments for proper shell escaping.
Call the check_disk
plugin with the --help
parameter to see
all available options. Our example defines warning (-w
) and
critical (-c
) thresholds for the disk usage. Without any
partition defined (-p
) it will check all local partitions.
icinga@icinga2 $ /usr/lib/nagios/plugins/check_disk --help
...
This plugin checks the amount of used disk space on a mounted file system
and generates an alert if free space is less than one of the threshold values
Usage:
check_disk -w limit -c limit [-W limit] [-K limit] {-p path | -x device}
[-C] [-E] [-e] [-f] [-g group ] [-k] [-l] [-M] [-m] [-R path ] [-r path ]
[-t timeout] [-u unit] [-v] [-X type] [-N type]
...
Note
Don't execute plugins as
root
and always use the absolute path to the plugin! Trust us.
Next step is to understand how command parameters are being passed from a host or service object, and add a CheckCommand definition based on these required parameters and/or default values.
Passing Check Command Parameters from Host or Service
Check command parameters are defined as custom attributes which can be accessed as runtime macros by the executed check command.
Define the default check command custom attribute disk_wfree
and disk_cfree
(freely definable naming schema) and their default threshold values. You can
then use these custom attributes as runtime macros for command arguments
on the command line.
Tip
Use a common command type as prefix for your command arguments to increase readability.
disk_wfree
helps understanding the context better than justwfree
as argument.
The default custom attributes can be overridden by the custom attributes
defined in the service using the check command my-disk
. The custom attributes
can also be inherited from a parent template using additive inheritance (+=
).
object CheckCommand "my-disk" {
import "plugin-check-command"
command = [ PluginDir + "/check_disk" ]
arguments = {
"-w" = {
value = "$disk_wfree$"
description = "Exit with WARNING status if less than INTEGER units of disk are free or Exit with WARNING status if less than PERCENT of disk space is free"
required = true
}
"-c" = {
value = "$disk_cfree$"
description = "Exit with CRITICAL status if less than INTEGER units of disk are free or Exit with CRITCAL status if less than PERCENT of disk space is free"
required = true
}
"-W" = {
value = "$disk_inode_wfree$"
description = "Exit with WARNING status if less than PERCENT of inode space is free"
}
"-K" = {
value = "$disk_inode_cfree$"
description = "Exit with CRITICAL status if less than PERCENT of inode space is free"
}
"-p" = {
value = "$disk_partitions$"
description = "Path or partition (may be repeated)"
repeat_key = true
order = 1
}
"-x" = {
value = "$disk_partitions_excluded$"
description = "Ignore device (only works if -p unspecified)"
}
}
vars.disk_wfree = "20%"
vars.disk_cfree = "10%"
}
Note
A proper example for the
check_disk
plugin is already shipped with Icinga 2 ready to use with the plugin check commands.
The host localhost
with the applied service basic-partitions
checks a basic set of disk partitions
with modified custom attributes (warning thresholds at 10%
, critical thresholds at 5%
free disk space).
The custom attribute disk_partition
can either hold a single string or an array of
string values for passing multiple partitions to the check_disk
check plugin.
object Host "my-server" {
import "generic-host"
address = "127.0.0.1"
address6 = "::1"
vars.local_disks["basic-partitions"] = {
disk_partitions = [ "/", "/tmp", "/var", "/home" ]
}
}
apply Service for (disk => config in host.vars.local_disks) {
import "generic-service"
check_command = "my-disk"
vars += config
vars.disk_wfree = "10%"
vars.disk_cfree = "5%"
}
More details on using arrays in custom attributes can be found in this chapter.
Command Arguments
By defining a check command line using the command
attribute Icinga 2
will resolve all macros in the static string or array. Sometimes it is
required to extend the arguments list based on a met condition evaluated
at command execution. Or making arguments optional - only set if the
macro value can be resolved by Icinga 2.
object CheckCommand "check_http" {
import "plugin-check-command"
command = [ PluginDir + "/check_http" ]
arguments = {
"-H" = "$http_vhost$"
"-I" = "$http_address$"
"-u" = "$http_uri$"
"-p" = "$http_port$"
"-S" = {
set_if = "$http_ssl$"
}
"--sni" = {
set_if = "$http_sni$"
}
"-a" = {
value = "$http_auth_pair$"
description = "Username:password on sites with basic authentication"
}
"--no-body" = {
set_if = "$http_ignore_body$"
}
"-r" = "$http_expect_body_regex$"
"-w" = "$http_warn_time$"
"-c" = "$http_critical_time$"
"-e" = "$http_expect$"
}
vars.http_address = "$address$"
vars.http_ssl = false
vars.http_sni = false
}
The example shows the check_http
check command defining the most common
arguments. Each of them is optional by default and will be omitted if
the value is not set. For example if the service calling the check command
does not have vars.http_port
set, it won't get added to the command
line.
If the vars.http_ssl
custom attribute is set in the service, host or command
object definition, Icinga 2 will add the -S
argument based on the set_if
numeric value to the command line. String values are not supported.
If the macro value cannot be resolved, Icinga 2 will not add the defined argument to the final command argument array. Empty strings for macro values won't omit the argument.
That way you can use the check_http
command definition for both, with and
without SSL enabled checks saving you duplicated command definitions.
Details on all available options can be found in the CheckCommand object definition.
Environment Variables
The env
command object attribute specifies a list of environment variables with values calculated
from either runtime macros or custom attributes which should be exported as environment variables
prior to executing the command.
This is useful for example for hiding sensitive information on the command line output when passing credentials to database checks:
object CheckCommand "mysql-health" {
import "plugin-check-command"
command = [
PluginDir + "/check_mysql"
]
arguments = {
"-H" = "$mysql_address$"
"-d" = "$mysql_database$"
}
vars.mysql_address = "$address$"
vars.mysql_database = "icinga"
vars.mysql_user = "icinga_check"
vars.mysql_pass = "password"
env.MYSQLUSER = "$mysql_user$"
env.MYSQLPASS = "$mysql_pass$"
}
Notification Commands
NotificationCommand objects define how notifications are delivered to external interfaces (E-Mail, XMPP, IRC, Twitter, etc).
NotificationCommand objects are referenced by
Notification objects using the command
attribute.
NotificationCommand
objects require the ITL template
plugin-notification-command
to support native plugin-based notifications.
Note
Make sure that the notification feature is enabled in order to execute notification commands.
Below is an example using runtime macros from Icinga 2 (such as $service.output$
for
the current check output) sending an email to the user(s) associated with the
notification itself ($user.email$
).
If you want to specify default values for some of the custom attribute definitions,
you can add a vars
dictionary as shown for the CheckCommand
object.
object NotificationCommand "mail-service-notification" {
import "plugin-notification-command"
command = [ SysconfDir + "/icinga2/scripts/mail-notification.sh" ]
env = {
NOTIFICATIONTYPE = "$notification.type$"
SERVICEDESC = "$service.name$"
HOSTALIAS = "$host.display_name$"
HOSTADDRESS = "$address$"
SERVICESTATE = "$service.state$"
LONGDATETIME = "$icinga.long_date_time$"
SERVICEOUTPUT = "$service.output$"
NOTIFICATIONAUTHORNAME = "$notification.author$"
NOTIFICATIONCOMMENT = "$notification.comment$"
HOSTDISPLAYNAME = "$host.display_name$"
SERVICEDISPLAYNAME = "$service.display_name$"
USEREMAIL = "$user.email$"
}
}
The command attribute in the mail-service-notification
command refers to the following
shell script. The macros specified in the env
array are exported
as environment variables and can be used in the notification script:
#!/usr/bin/env bash
template=$(cat <<TEMPLATE
***** Icinga *****
Notification Type: $NOTIFICATIONTYPE
Service: $SERVICEDESC
Host: $HOSTALIAS
Address: $HOSTADDRESS
State: $SERVICESTATE
Date/Time: $LONGDATETIME
Additional Info: $SERVICEOUTPUT
Comment: [$NOTIFICATIONAUTHORNAME] $NOTIFICATIONCOMMENT
TEMPLATE
)
/usr/bin/printf "%b" $template | mail -s "$NOTIFICATIONTYPE - $HOSTDISPLAYNAME - $SERVICEDISPLAYNAME is $SERVICESTATE" $USEREMAIL
Note
This example is for
exim
only. Requires changes forsendmail
and other MTAs.
While it's possible to specify the entire notification command right
in the NotificationCommand object it is generally advisable to create a
shell script in the /etc/icinga2/scripts
directory and have the
NotificationCommand object refer to that.
Event Commands
Unlike notifications event commands for hosts/services are called on every check execution if one of these conditions match:
- The host/service is in a soft state
- The host/service state changes into a hard state
- The host/service state recovers from a soft or hard state to OK/Up
EventCommand objects are referenced by
Host and Service objects
using the event_command
attribute.
Therefore the EventCommand
object should define a command line
evaluating the current service state and other service runtime attributes
available through runtime vars. Runtime macros such as $service.state_type$
and $service.state$
will be processed by Icinga 2 helping on fine-granular
events being triggered.
Common use case scenarios are a failing HTTP check requiring an immediate restart via event command, or if an application is locked and requires a restart upon detection.
EventCommand
objects require the ITL template plugin-event-command
to support native plugin based checks.
Use Event Commands to Restart Service Daemon
The following example will triggert a restart of the httpd
daemon
via ssh when the http
service check fails. If the service state is
OK
, it will not trigger any event action.
Requirements:
- ssh connection
- icinga user with public key authentication
- icinga user with sudo permissions for restarting the httpd daemon.
Example on Debian:
# ls /home/icinga/.ssh/
authorized_keys
# visudo
icinga ALL=(ALL) NOPASSWD: /etc/init.d/apache2 restart
Define a generic EventCommand object event_by_ssh
which can be used for all event commands triggered using ssh:
/* pass event commands through ssh */
object EventCommand "event_by_ssh" {
import "plugin-event-command"
command = [ PluginDir + "/check_by_ssh" ]
arguments = {
"-H" = "$event_by_ssh_address$"
"-p" = "$event_by_ssh_port$"
"-C" = "$event_by_ssh_command$"
"-l" = "$event_by_ssh_logname$"
"-i" = "$event_by_ssh_identity$"
"-q" = {
set_if = "$event_by_ssh_quiet$"
}
"-w" = "$event_by_ssh_warn$"
"-c" = "$event_by_ssh_crit$"
"-t" = "$event_by_ssh_timeout$"
}
vars.event_by_ssh_address = "$address$"
vars.event_by_ssh_quiet = false
}
The actual event command only passes the event_by_ssh_command
attribute.
The event_by_ssh_service
custom attribute takes care of passing the correct
daemon name, while test $service.state_id$ -gt 0
makes sure that the daemon
is only restarted when the service is not in an OK
state.
object EventCommand "event_by_ssh_restart_service" {
import "event_by_ssh"
//only restart the daemon if state > 0 (not-ok)
//requires sudo permissions for the icinga user
vars.event_by_ssh_command = "test $service.state_id$ -gt 0 && sudo /etc/init.d/$event_by_ssh_service$ restart"
}
Now set the event_command
attribute to event_by_ssh_restart_service
and tell it
which service should be restarted using the event_by_ssh_service
attribute.
object Service "http" {
import "generic-service"
host_name = "remote-http-host"
check_command = "http"
event_command = "event_by_ssh_restart_service"
vars.event_by_ssh_service = "$host.vars.httpd_name$"
//vars.event_by_ssh_logname = "icinga"
//vars.event_by_ssh_identity = "/home/icinga/.ssh/id_rsa.pub"
}
Each host with this service then must define the httpd_name
custom attribute
(for example generated from your cmdb):
object Host "remote-http-host" {
import "generic-host"
address = "192.168.1.100"
vars.httpd_name = "apache2"
}
You can testdrive this example by manually stopping the httpd
daemon
on your remote-http-host
. Enable the debuglog
feature and tail the
/var/log/icinga2/debug.log
file.
Remote Host Terminal:
# date; service apache2 status
Mon Sep 15 18:57:39 CEST 2014
Apache2 is running (pid 23651).
# date; service apache2 stop
Mon Sep 15 18:57:47 CEST 2014
[ ok ] Stopping web server: apache2 ... waiting .
Icinga 2 Host Terminal:
[2014-09-15 18:58:32 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_http' '-I' '192.168.1.100': PID 32622
[2014-09-15 18:58:32 +0200] notice/Process: PID 32622 ('/usr/lib64/nagios/plugins/check_http' '-I' '192.168.1.100') terminated with exit code 2
[2014-09-15 18:58:32 +0200] notice/Checkable: State Change: Checkable remote-http-host!http soft state change from OK to CRITICAL detected.
[2014-09-15 18:58:32 +0200] notice/Checkable: Executing event handler 'event_by_ssh_restart_service' for service 'remote-http-host!http'
[2014-09-15 18:58:32 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_by_ssh' '-C' 'test 2 -gt 0 && sudo /etc/init.d/apache2 restart' '-H' '192.168.1.100': PID 32623
[2014-09-15 18:58:33 +0200] notice/Process: PID 32623 ('/usr/lib64/nagios/plugins/check_by_ssh' '-C' 'test 2 -gt 0 && sudo /etc/init.d/apache2 restart' '-H' '192.168.1.100') terminated with exit code 0
Remote Host Terminal:
# date; service apache2 status
Mon Sep 15 18:58:44 CEST 2014
Apache2 is running (pid 24908).
Dependencies
Icinga 2 uses host and service Dependency objects for determing their network reachability.
A service can depend on a host, and vice versa. A service has an implicit dependency (parent) to its host. A host to host dependency acts implicitly as host parent relation. When dependencies are calculated, not only the immediate parent is taken into account but all parents are inherited.
The parent_host_name
and parent_service_name
attributes are mandatory for
service dependencies, parent_host_name
is required for host dependencies.
Apply rules will allow you to
determine these attributes in a more
dynamic fashion if required.
parent_host_name = "core-router"
parent_service_name = "uplink-port"
Notifications are suppressed by default if a host or service becomes unreachable.
You can control that option by defining the disable_notifications
attribute.
disable_notifications = false
If the dependency should be triggered in the parent object's soft state, you
need to set ignore_soft_states
to false
.
The dependency state filter must be defined based on the parent object being
either a host (Up
, Down
) or a service (OK
, Warning
, Critical
, Unknown
).
The following example will make the dependency fail and trigger it if the parent object is not in one of these states:
states = [ OK, Critical, Unknown ]
Rephrased: If the parent service object changes into the Warning
state, this
dependency will fail and render all child objects (hosts or services) unreachable.
You can determine the child's reachability by querying the is_reachable
attribute
in for example DB IDO.
Implicit Dependencies for Services on Host
Icinga 2 automatically adds an implicit dependency for services on their host. That way
service notifications are suppressed when a host is DOWN
or UNREACHABLE
. This dependency
does not overwrite other dependencies and implicitely sets disable_notifications = true
and
states = [ Up ]
for all service objects.
Service checks are still executed. If you want to prevent them from happening, you can
apply the following dependency to all services setting their host as parent_host_name
and disabling the checks. assign where true
matches on all Service
objects.
apply Dependency "disable-host-service-checks" to Service {
disable_checks = true
assign where true
}
Dependencies for Network Reachability
A common scenario is the Icinga 2 server behind a router. Checking internet
access by pinging the Google DNS server google-dns
is a common method, but
will fail in case the dsl-router
host is down. Therefore the example below
defines a host dependency which acts implicitly as parent relation too.
Furthermore the host may be reachable but ping probes are dropped by the
router's firewall. In case the dsl-router
's ping4
service check fails, all
further checks for the ping4
service on host google-dns
service should
be suppressed. This is achieved by setting the disable_checks
attribute to true
.
object Host "dsl-router" {
import "generic-host"
address = "192.168.1.1"
}
object Host "google-dns" {
import "generic-host"
address = "8.8.8.8"
}
apply Service "ping4" {
import "generic-service"
check_command = "ping4"
assign where host.address
}
apply Dependency "internet" to Host {
parent_host_name = "dsl-router"
disable_checks = true
disable_notifications = true
assign where host.name != "dsl-router"
}
apply Dependency "internet" to Service {
parent_host_name = "dsl-router"
parent_service_name = "ping4"
disable_checks = true
assign where host.name != "dsl-router"
}
Apply Dependencies based on Custom Attributes
You can use apply rules to set parent or
child attributes e.g. parent_host_name
to other object's
attributes.
A common example are virtual machines hosted on a master. The object name of that master is auto-generated from your CMDB or VMWare inventory into the host's custom attributes (or a generic template for your cloud).
Define your master host object:
/* your master */
object Host "master.example.com" {
import "generic-host"
}
Add a generic template defining all common host attributes:
/* generic template for your virtual machines */
template Host "generic-vm" {
import "generic-host"
}
Add a template for all hosts on your example.com cloud setting
custom attribute vm_parent
to master.example.com
:
template Host "generic-vm-example.com" {
import "generic-vm"
vars.vm_parent = "master.example.com"
}
Define your guest hosts:
object Host "www.example1.com" {
import "generic-vm-master.example.com"
}
object Host "www.example2.com" {
import "generic-vm-master.example.com"
}
Apply the host dependency to all child hosts importing the
generic-vm
template and set the parent_host_name
to the previously defined custom attribute host.vars.vm_parent
.
apply Dependency "vm-host-to-parent-master" to Host {
parent_host_name = host.vars.vm_parent
assign where "generic-vm" in host.templates
}
You can extend this example, and make your services depend on the
master.example.com
host too. Their local scope allows you to use
host.vars.vm_parent
similar to the example above.
apply Dependency "vm-service-to-parent-master" to Service {
parent_host_name = host.vars.vm_parent
assign where "generic-vm" in host.templates
}
That way you don't need to wait for your guest hosts becoming unreachable when the master host goes down. Instead the services will detect their reachability immediately when executing checks.
Note
This method with setting locally scoped variables only works in apply rules, but not in object definitions.
Dependencies for Agent Checks
Another classic example are agent based checks. You would define a health check for the agent daemon responding to your requests, and make all other services querying that daemon depend on that health check.
The following configuration defines two nrpe based service checks nrpe-load
and nrpe-disk
applied to the nrpe-server
. The health check is defined as
nrpe-health
service.
apply Service "nrpe-health" {
import "generic-service"
check_command = "nrpe"
assign where match("nrpe-*", host.name)
}
apply Service "nrpe-load" {
import "generic-service"
check_command = "nrpe"
vars.nrpe_command = "check_load"
assign where match("nrpe-*", host.name)
}
apply Service "nrpe-disk" {
import "generic-service"
check_command = "nrpe"
vars.nrpe_command = "check_disk"
assign where match("nrpe-*", host.name)
}
object Host "nrpe-server" {
import "generic-host"
address = "192.168.1.5"
}
apply Dependency "disable-nrpe-checks" to Service {
parent_service_name = "nrpe-health"
states = [ OK ]
disable_checks = true
disable_notifications = true
assign where service.check_command == "nrpe"
ignore where service.name == "nrpe-health"
}
The disable-nrpe-checks
dependency is applied to all services
on the nrpe-service
host using the nrpe
check_command attribute
but not the nrpe-health
service itself.