Monitoring Basics

This part of the Icinga 2 documentation provides an overview of all the basic monitoring concepts you need to know to run Icinga 2.

Hosts and Services

Icinga 2 can be used to monitor the availability of hosts and services. Hosts and services can be virtually anything which can be checked in some way:

  • Network services (HTTP, SMTP, SNMP, SSH, etc.)
  • Printers
  • Switches / routers
  • Temperature sensors
  • Other local or network-accessible services

Host objects provide a mechanism to group services that are running on the same physical device.

Here is an example of a host object which defines two child services:

object Host "my-server1" {
  address = "10.0.0.1"
  check_command = "hostalive"
}

object Service "ping4" {
  host_name = "my-server1"
  check_command = "ping4"
}

object Service "http" {
  host_name = "my-server1"
  check_command = "http"
}

The example creates two services ping4 and http which belong to the host my-server1.

Note

When using apply rules, a service apply definition will implicitly create a relationship to each matched host by setting the host_name attribute.

The example also specifies that the host should perform its own check using the hostalive check command.

The address attribute is used by check commands to determine which network address is associated with the host object.

Details on troubleshooting check problems can be found here.

Host States

Hosts can be in any of the following states:

Name | Description
-----|------------
UP   | The host is available.
DOWN | The host is unavailable.

Service States

Services can be in any of the following states:

Name     | Description
---------|------------
OK       | The service is working properly.
WARNING  | The service is experiencing some problems but is still considered to be in working condition.
CRITICAL | The service is in a critical state.
UNKNOWN  | The check could not determine the service's state.

Hard and Soft States

When Icinga 2 detects a problem with a host or service, it re-checks the object a number of times (based on the max_check_attempts and retry_interval settings) before sending notifications. This ensures that no unnecessary notifications are sent for transient failures. During this time the object is in a SOFT state.

After all re-checks have been executed and the object is still in a non-OK state, the host or service switches to a HARD state and notifications are sent.

Name | Description
-----|------------
HARD | The host/service's state hasn't recently changed.
SOFT | The host/service has recently changed state and is being re-checked.
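
As a minimal sketch of how these settings fit together, the following service template sets the attributes mentioned above. The template name patient-service and the chosen values are assumptions for illustration only.

```
template Service "patient-service" {
  // interval between regular checks
  check_interval = 5m

  // interval between re-checks while the object is in a SOFT state
  retry_interval = 1m

  // number of check attempts before a non-OK state becomes HARD
  max_check_attempts = 3
}
```

With these values a failing service would be re-checked at one-minute intervals, and notifications would only be considered once the attempts are used up and the state has become HARD.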

Host and Service Checks

Hosts and services determine their state from the check result that a check execution returns to the Icinga 2 application. By default, the generic-host example template defines hostalive as the host check. If your host does not respond to ping, consider using a different check command, for instance the http check command, or, if no suitable check is available, the dummy check command.

object Host "uncheckable-host" {
  check_command = "dummy"
  // exit state 0 (OK), which maps to UP for hosts and matches the text below
  vars.dummy_state = 0
  vars.dummy_text = "Pretending to be OK."
}

Service checks could also use a dummy check, but the common strategy is to integrate an existing plugin as check command and reference that in your Service object definition.

Configuration Best Practice

The Getting Started chapter already introduced various aspects of the Icinga 2 configuration language. If you are ready to configure additional hosts, services, notifications, dependencies, etc., you should think about the requirements first and then decide on a suitable strategy.

There are many ways of creating Icinga 2 configuration objects:

  • Manually with your preferred editor, for example vi(m), nano, notepad, etc.
  • Generated by a configuration management tool such as Puppet, Chef, Ansible, etc.
  • A configuration addon for Icinga 2
  • A custom exporter script from your CMDB or inventory tool
  • Your own method.

In order to find the best strategy for your own configuration, ask yourself the following questions:

  • Do your hosts share a common group of services (for example Linux hosts with disk, load, etc. checks)?
  • Does only a small set of users receive notifications and escalations for all hosts/services?

If you can answer at least one of these questions with yes, look into the apply rules logic instead of defining objects on a per host and service basis.

  • Are you required to define specific configuration for each host/service?
  • Does your configuration generation tool already know about the host-service relationship?

If so, define the objects individually and set object-specific attributes such as host_name accordingly.

Finding the best files and directory tree for your configuration is up to you. Make sure that the icinga2.conf configuration file includes them, and then think about:

  • A tree based on locations, hostgroups, or specific host attributes, with sub-levels of directories.
  • Flat hosts.conf, services.conf, etc. files for rule-based configuration.
  • Generated configuration with one file per host and a global configuration for groups, users, etc.
  • One big file generated from an external application (probably a bad idea for maintaining changes).
  • Your own layout.

Whichever strategy you choose, you should additionally check the following:

  • Are there any specific attributes describing the host/service you could set as vars custom attributes? You can later use them for applying assign/ignore rules, or export them into external interfaces.
  • Put hosts into hostgroups, services into servicegroups and use these attributes for your apply rules.
  • Use templates to store generic attributes for your objects and apply rules making your configuration more readable. Details can be found in the using templates chapter.
  • Apply rules may overlap. Keep a central place (for example, services.conf or notifications.conf) storing the configuration instead of defining apply rules deep in your configuration tree.
  • Every plugin used as check, notification or event command requires a Command definition. Further details can be looked up in the check commands chapter.
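
The points above can be combined in a short sketch. The host name web-01, its address, the role custom attribute and the webservers group are hypothetical; generic-host and generic-service are the templates from the sample configuration, and the http check command is assumed to be available.

```
object Host "web-01" {
  import "generic-host"
  address = "192.0.2.10"

  // custom attribute describing the host, usable in assign/ignore rules
  vars.role = "webserver"
}

object HostGroup "webservers" {
  display_name = "Web Servers"

  // group membership derived from the custom attribute
  assign where host.vars.role == "webserver"
}

// kept in a central place such as services.conf
apply Service "http" {
  import "generic-service"
  check_command = "http"

  // one rule covers every member of the host group
  assign where "webservers" in host.groups
}
```

Adding another web server then only requires a new Host object with vars.role set.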

If you happen to have further questions, do not hesitate to join the community support channels and ask community members for their experience and best practices.

Object Inheritance Using Templates

Templates may be used to apply a set of identical attributes to more than one object:

template Service "generic-service" {
  max_check_attempts = 3
  check_interval = 5m
  retry_interval = 1m
  enable_perfdata = true
}

template Service "ipv6-service" {
  notes = "IPv6 critical != IPv4 broken."
}

apply Service "ping4" {
  import "generic-service"

  check_command = "ping4"

  assign where host.address
}

apply Service "ping6" {
  import "generic-service"
  import "ipv6-service"

  check_command = "ping6"

  assign where host.address6
}

In this example the ping4 and ping6 services inherit properties from the template generic-service. The ping6 service additionally imports the ipv6-service template with the notes attribute.

Objects as well as templates themselves can import an arbitrary number of templates. Attributes inherited from a template can be overridden in the object if necessary.
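
As a brief sketch of overriding, an object can import generic-service and still set its own value for one of the inherited attributes. The service name backup-status and the use of the dummy check command are assumptions made for illustration.

```
object Service "backup-status" {
  import "generic-service"

  host_name = "my-server1"
  check_command = "dummy"

  // overrides check_interval = 5m inherited from generic-service;
  // max_check_attempts, retry_interval and enable_perfdata are kept
  check_interval = 30m
}
```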

Apply Objects Based on Rules

Instead of attaching each object (Service, Notification, Dependency, ScheduledDowntime) to its parent through attribute identifiers such as host_name, objects can be applied based on rules.

Before you start using the apply rules keep the following in mind:

  • Define the best match.
  • All expressions must return a boolean value (an empty string, for example, is equal to false).

Note

You can set/override object attributes in apply rules using the respectively available objects in that scope (host and/or service objects).

Custom attributes can also store nested dictionaries and arrays. That way you can use them not only for matching on their existence or values in apply expressions, but also to assign ("inherit") their values to the objects generated from apply rules.

A more advanced example is using apply with for loops on arrays or dictionaries, for example those provided by custom attributes or groups.

Tip

Building configuration in that dynamic way requires detailed information of the generated objects. Use the object list cli command after successful configuration validation.

Apply Rules Expressions

You can use simple or advanced combinations of apply rule expressions. Each expression must evaluate to the boolean value true for the rule to match. An empty string, for instance, is interpreted as false. In a similar fashion, undefined attributes evaluate to false.

Returns false:

assign where host.vars.attribute_does_not_exist

Multiple assign where condition rows are evaluated as OR condition.
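
To illustrate, here is a sketch with two assign where rows; the FreeBSD value of the os custom attribute is a hypothetical example.

```
apply Service "ssh" {
  import "generic-service"
  check_command = "ssh"

  // the service is applied if either row matches
  assign where host.vars.os == "Linux"
  assign where host.vars.os == "FreeBSD"
}
```

The two rows behave like the single expression assign where host.vars.os == "Linux" || host.vars.os == "FreeBSD".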

You can combine multiple expressions for matching only a subset of objects. In some cases, you want to be able to add more than one assign/ignore where expression which matches a specific condition. To achieve this you can use the logical and (&&) and or (||) operators.

The following example matches all hosts whose name matches the *mysql* pattern and (&&) whose custom attribute prod_mysql_db matches the db-* pattern. All hosts with the custom attribute test_server set to true are ignored, as is any host whose name matches the *internal pattern.

object HostGroup "mysql-server" {
  display_name = "MySQL Server"

  assign where match("*mysql*", host.name) && match("db-*", host.vars.prod_mysql_db)
  ignore where host.vars.test_server == true
  ignore where match("*internal", host.name)
}

A similar example shows advanced filters for a notification apply rule: the service attribute notes must contain the string has gold support 24x7 AND one of two conditions must hold: either the host custom attribute customer is set to customer-xy OR the host custom attribute always_notify is set to true.

The notification is ignored for services whose host name ends with internal, OR whose priority custom attribute is less than 2 while the host's is_clustered attribute is set to true.

template Notification "cust-xy-notification" {
  users = [ "noc-xy", "mgmt-xy" ]
  command = "mail-service-notification"
}

apply Notification "notify-cust-xy-mysql" to Service {
  import "cust-xy-notification"

  assign where match("*has gold support 24x7*", service.notes) && (host.vars.customer == "customer-xy" || host.vars.always_notify == true)
  ignore where match("*internal", host.name) || (service.vars.priority < 2 && host.vars.is_clustered == true)
}

Apply Services to Hosts

The sample configuration already ships a detailed example in hosts.conf and services.conf for this use case.

The example for ssh applies a service object to all hosts with the address attribute being defined and the custom attribute os set to the string Linux in vars.

apply Service "ssh" {
  import "generic-service"

  check_command = "ssh"

  assign where host.address && host.vars.os == "Linux"
}

Other detailed scenario examples are used in their respective chapters, for example apply services with custom command arguments.

Apply Notifications to Hosts and Services

Notifications are applied to specific targets (Host or Service) and work in a similar manner:

apply Notification "mail-noc" to Service {
  import "mail-service-notification"

  user_groups = [ "noc" ]

  assign where host.vars.notification.mail
}

In this example the mail-noc notification will be created as an object for all services whose host has the notification.mail custom attribute defined. The imported mail-service-notification template provides the notification command, and all members of the user group noc will get notified.
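
A host that this rule would pick up could look like the following sketch. The host name app-server1, its address and the contents of the mail dictionary are assumptions; only the presence of a non-empty notification.mail custom attribute matters for the assign where expression.

```
object Host "app-server1" {
  import "generic-host"
  address = "192.0.2.20"

  // a non-empty dictionary evaluates to true in "assign where"
  vars.notification["mail"] = {
    groups = [ "noc" ]
  }
}
```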

Apply Dependencies to Hosts and Services

Detailed examples can be found in the dependencies chapter.

Apply Recurring Downtimes to Hosts and Services

The sample configuration ships an example in downtimes.conf.

Detailed examples can be found in the recurring downtimes chapter.

Using Apply For Rules

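Next to the standard way of using apply rules, there is the apply for variant: it generates objects by looping over an array or dictionary, for example one stored in a host custom attribute.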

The sample configuration already ships a detailed example in hosts.conf and services.conf for this use case.

Imagine a different example: You are monitoring your switch (hosts) with many interfaces (services). The following requirements/problems apply:

  • Each interface service check should be named with a prefix and a running number
  • Each interface has its own vlan tag
  • Some interfaces have QoS enabled
  • Additional attributes such as display_name, notes, notes_url and action_url must be dynamically generated

By defining the interfaces dictionary with three example interfaces on the core-switch host object, you provide the data that the for loop in the service apply rule iterates over.

object Host "core-switch" {
  import "generic-host"
  address = "127.0.0.1"

  vars.interfaces["0"] = {
    port = 1
    vlan = "internal"
    address = "127.0.0.2"
    qos = "enabled"
  }
  vars.interfaces["1"] = {
    port = 2
    vlan = "mgmt"
    address = "127.0.1.2"
  }
  vars.interfaces["2"] = {
    port = 3
    vlan = "remote"
    address = "127.0.2.2"
  }
}

The service apply rule below loops over the host.vars.interfaces dictionary: in each loop cycle the key is available as if_name and the value as config, and the generated service name is the "if-" prefix followed by if_name. You can also omit the "if-" string, then all generated service names are directly taken from the if_name variable value.

The config dictionary contains all key-value pairs for the specific interface in one loop cycle, like port, vlan, address and qos for the 0 interface.

By defining a default value for the custom attribute qos in the vars dictionary before adding the config dictionary, we ensure that this attribute is always defined.

After vars is fully populated, all object attributes can be set. For strings, you can use string concatenation with the + operator.

You can also specify the check command that way.

apply Service "if-" for (if_name => config in host.vars.interfaces) {
  import "generic-service"
  check_command = "ping4"

  vars.qos = "disabled"
  vars += config

  display_name = "if-" + if_name + "-" + vars.vlan

  notes = "Interface check for Port " + string(vars.port) + " in VLAN " + vars.vlan + " on Address " + vars.address + " QoS " + vars.qos
  notes_url = "http://foreman.company.com/hosts/" + host.name
  action_url = "http://snmp.checker.company.com/" + host.name + "/if-" + if_name

  assign where host.vars.interfaces
}

Note that numbers must be explicitly cast to strings when they are added to strings. This can be achieved by wrapping them in the string() function.

Tip

Building configuration in that dynamic way requires detailed information of the generated objects. Use the object list cli command after successful configuration validation.

Use Object Attributes in Apply Rules

Since apply rules are evaluated after the generic objects, you can reference existing host and/or service object attributes as values for any object attribute specified in that apply rule.

object Host "opennebula-host" {
  import "generic-host"
  address = "10.1.1.2"

  vars.hosting["xyz"] = {
    http_uri = "/shop"
    customer_name = "Customer xyz"
    customer_id = "7568"
    support_contract = "gold"
  }
  vars.hosting["abc"] = {
    http_uri = "/shop"
    customer_name = "Customer xyz"
    customer_id = "7568"
    support_contract = "silver"
  }
}

apply Service for (customer => config in host.vars.hosting) {
  import "generic-service"
  check_command = "ping4"

  vars.qos = "disabled"

  vars += config

  // customer is the loop key ("xyz" or "abc"); config.http_uri already starts with "/"
  vars.http_uri = "/" + customer + config.http_uri

  display_name = "Shop Check for " + vars.customer_name + "-" + vars.customer_id

  notes = "Support contract: " + vars.support_contract + " for Customer " + vars.customer_name + " (" + vars.customer_id + ")."

  notes_url = "http://foreman.company.com/hosts/" + host.name
  action_url = "http://snmp.checker.company.com/" + host.name + "/" + vars.customer_id

  assign where host.vars.hosting
}

Groups

Groups are used for combining hosts, services, and users into accessible configuration attributes and views in external (web) interfaces.

Group membership is defined at the respective object itself. If you have a hostgroup named windows, for example, and want to assign specific hosts to this group so that you can later view the group on your alert dashboard, first create the hostgroup:

object HostGroup "windows" {
  display_name = "Windows Servers"
}

Then add your hosts to this hostgroup:

template Host "windows-server" {
  groups += [ "windows" ]
}

object Host "mssql-srv1" {
  import "windows-server"

  vars.mssql_port = 1433
}

object Host "mssql-srv2" {
  import "windows-server"

  vars.mssql_port = 1433
}

This can be done for service and user groups the same way. Additionally the user groups are associated as attributes in Notification objects.

object UserGroup "windows-mssql-admins" {
  display_name = "Windows MSSQL Admins"
}

template User "generic-windows-mssql-users" {
  groups += [ "windows-mssql-admins" ]
}

object User "win-mssql-noc" {
  import "generic-windows-mssql-users"

  email = "noc@example.com"
}

object User "win-mssql-ops" {
  import "generic-windows-mssql-users"

  email = "ops@example.com"
}
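
As a hedged sketch of how such a user group is referenced from a Notification object: the notification name mssql-admins-mail is hypothetical, and mail-host-notification is assumed to be an existing NotificationCommand (the sample configuration ships one with that name).

```
apply Notification "mssql-admins-mail" to Host {
  command = "mail-host-notification"

  // every member of the user group is notified
  user_groups = [ "windows-mssql-admins" ]

  // all hosts in the "windows" host group defined above
  assign where "windows" in host.groups
}
```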

Group Membership Assign

If there is a certain number of hosts, services, or users matching a pattern, it's reasonable to assign the group object to these members. Details on the assign where syntax can be found in the Apply Rules Expressions section above.

object HostGroup "prod-mssql" {
  display_name = "Production MSSQL Servers"
  assign where host.vars.mssql_port && host.vars.prod_mysql_db
  ignore where host.vars.test_server == true
  ignore where match("*internal", host.name)
}

In this example, adapted from the one above, all hosts with both the mssql_port and prod_mysql_db custom attributes set will be added as members to the host group prod-mssql. All hosts whose name matches the *internal pattern, or which have the test_server attribute set to true, will be ignored.

Notifications

Notifications for service and host problems are an integral part of your monitoring setup.

When a host or service is in a downtime, a problem has been acknowledged or the dependency logic determined that the host/service is unreachable, no notifications are sent. You can configure additional type and state filters refining the notifications being actually sent.

There are many ways of sending notifications, e.g. by e-mail, XMPP, IRC, Twitter, etc. On its own Icinga 2 does not know how to send notifications. Instead it relies on external mechanisms such as shell scripts to notify users.

A notification specification requires one or more users (and/or user groups) who will be notified in case of problems. These users must have all custom attributes defined which will be used in the NotificationCommand on execution.

The user icingaadmin in the example below will get notified only for the OK, WARNING and CRITICAL states and only for the problem and recovery notification types.

object User "icingaadmin" {
  display_name = "Icinga 2 Admin"
  enable_notifications = true
  states = [ OK, Warning, Critical ]
  types = [ Problem, Recovery ]
  email = "icinga@localhost"
}

If you don't set the states and types configuration attributes for the User object, notifications for all states and types will be sent.

Details on troubleshooting notification problems can be found here.

Note

Make sure that the notification feature is enabled on your master instance in order to execute notification commands.

You should decide which information you (and your notified users) are interested in during an emergency, and also which information does not provide any value to you and your environment.

An example notification command is explained here.

You can add all shared attributes to a Notification template which the defined notifications then import. That way you avoid duplicating attributes in each Notification object. Attributes can be overridden locally.

template Notification "generic-notification" {
  interval = 15m

  command = "mail-service-notification"

  states = [ Warning, Critical, Unknown ]
  types = [ Problem, Acknowledgement, Recovery, Custom, FlappingStart,
            FlappingEnd, DowntimeStart, DowntimeEnd, DowntimeRemoved ]

  period = "24x7"
}

The time period 24x7 is shipped as example configuration with Icinga 2.

Use the apply keyword to create Notification objects for your services:

apply Notification "notify-cust-xy-mysql" to Service {
  import "generic-notification"

  users = [ "noc-xy", "mgmt-xy" ]

  assign where match("*has gold support 24x7*", service.notes) && (host.vars.customer == "customer-xy" || host.vars.always_notify == true)
  ignore where match("*internal", host.name) || (service.vars.priority < 2 && host.vars.is_clustered == true)
}

Instead of assigning users to notifications, you can also add the user_groups attribute with a list of user groups to the Notification object. Icinga 2 will send notifications to all group members.

Notification Escalations

When a problem notification is sent and a problem still exists at the time of re-notification you may want to escalate the problem to the next support level. A different approach is to configure the default notification by email, and escalate the problem via SMS if not already solved.

You can define notification start and end times as additional configuration attributes making the Notification object a so-called notification escalation. Using templates you can share the basic notification attributes such as users or the interval (and override them for the escalation then).

Using the example from above, you can define additional users being escalated for SMS notifications between start and end time.

object User "icinga-oncall-2nd-level" {
  display_name = "Icinga 2nd Level"

  vars.mobile = "+1 555 424642"
}

object User "icinga-oncall-1st-level" {
  display_name = "Icinga 1st Level"

  vars.mobile = "+1 555 424642"
}

Define an additional NotificationCommand for SMS notifications.

Note

The example is not complete as there are many different SMS providers. Please note that sending SMS notifications will require an SMS provider or local hardware with a SIM card active.

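object NotificationCommand "sms-notification" {
  // ITL template for plugin-based notification commands, analogous to
  // plugin-check-command for CheckCommand objects
  import "plugin-notification-command"

  command = [
    PluginDir + "/send_sms_notification",
    "$mobile$",
    "..."
  ]
}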

The two new notification escalations are added onto the local host and its service ping4 using the generic-notification template. The user icinga-oncall-2nd-level will get notified by SMS (sms-notification command) after 30m until 1h.

Note

The interval was set to 15m in the generic-notification template example. Lower that value in your escalations by using a secondary template or by overriding the attribute directly in the escalation-sms-2nd-level notification.

If the problem is neither resolved nor acknowledged (an acknowledgement prevents further notifications), the escalation-sms-1st-level notification will alert its user 1h after the initial problem notification, but only for one hour (the end key of the times dictionary is set to 2h).

apply Notification "mail" to Service {
  import "generic-notification"

  command = "mail-notification"
  users = [ "icingaadmin" ]

  assign where service.name == "ping4"
}

apply Notification "escalation-sms-2nd-level" to Service {
  import "generic-notification"

  command = "sms-notification"
  users = [ "icinga-oncall-2nd-level" ]

  times = {
    begin = 30m
    end = 1h
  }

  assign where service.name == "ping4"
}

apply Notification "escalation-sms-1st-level" to Service {
  import "generic-notification"

  command = "sms-notification"
  users = [ "icinga-oncall-1st-level" ]

  times = {
    begin = 1h
    end = 2h
  }

  assign where service.name == "ping4"
}

Notification Delay

Sometimes the problem in question should not be notified when the notification is due (the object reaching the HARD state) but a defined time duration afterwards. In Icinga 2 you can use the times dictionary and set begin = 15m as key and value if you want to postpone the notification window for 15 minutes. Leave out the end key - if not set, Icinga 2 will not check against any end time for this notification. Make sure to specify a relatively low notification interval to get notified soon enough again.

apply Notification "mail" to Service {
  import "generic-notification"

  command = "mail-notification"
  users = [ "icingaadmin" ]

  interval = 5m

  times.begin = 15m // delay notification window

  assign where service.name == "ping4"
}

Disable Re-notifications

If you prefer to be notified only once, you can disable re-notifications by setting the interval attribute to 0.

apply Notification "notify-once" to Service {
  import "generic-notification"

  command = "mail-notification"
  users = [ "icingaadmin" ]

  interval = 0 // disable re-notification

  assign where service.name == "ping4"
}

Notification Filters by State and Type

If there are no notification state and type filter attributes defined at the Notification or User object Icinga 2 assumes that all states and types are being notified.

Available state and type filters for notifications are:

template Notification "generic-notification" {

  states = [ Warning, Critical, Unknown ]
  types = [ Problem, Acknowledgement, Recovery, Custom, FlappingStart,
            FlappingEnd, DowntimeStart, DowntimeEnd, DowntimeRemoved ]
}

If you are familiar with Icinga 1.x notification_options, please note that they have been split into type and state to allow more fine-grained filtering, for example on downtimes and flapping. You can filter for acknowledgements and custom notifications too.
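
For host notifications the state filter uses host states instead of service states. A minimal sketch, assuming a hypothetical template name generic-host-notification:

```
template Notification "generic-host-notification" {
  // host states instead of Warning/Critical/Unknown
  states = [ Up, Down ]

  // only problem and recovery notifications pass the type filter
  types = [ Problem, Recovery ]
}
```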

Time Periods

Time periods define time ranges in Icinga 2 during which event actions are triggered, for example whether a service check is executed or not, based on the check_period attribute, or whether a notification is sent to users or not, filtered by the period and notification_period configuration attributes for Notification and User objects.

Note

If you are familiar with Icinga 1.x: these time period definitions are called legacy timeperiods in Icinga 2.

An Icinga 2 legacy timeperiod requires the ITL provided template legacy-timeperiod.

The TimePeriod attribute ranges may contain multiple directives, including weekdays, days of the month, and calendar dates. These types may overlap/override other types in your ranges dictionary.

The descending order of precedence is as follows:

  • Calendar date (2008-01-01)
  • Specific month date (January 1st)
  • Generic month date (Day 15)
  • Offset weekday of specific month (2nd Tuesday in December)
  • Offset weekday (3rd Monday)
  • Normal weekday (Tuesday)

If you don't set any check_period or notification_period attribute on your configuration objects Icinga 2 assumes 24x7 as time period as shown below.

object TimePeriod "24x7" {
  import "legacy-timeperiod"

  display_name = "Icinga 2 24x7 TimePeriod"
  ranges = {
    "monday"    = "00:00-24:00"
    "tuesday"   = "00:00-24:00"
    "wednesday" = "00:00-24:00"
    "thursday"  = "00:00-24:00"
    "friday"    = "00:00-24:00"
    "saturday"  = "00:00-24:00"
    "sunday"    = "00:00-24:00"
  }
}

If your operation staff should only be notified during workhours create a new timeperiod named workhours defining a work day from 09:00 to 17:00.

object TimePeriod "workhours" {
  import "legacy-timeperiod"

  display_name = "Icinga 2 8x5 TimePeriod"
  ranges = {
    "monday"    = "09:00-17:00"
    "tuesday"   = "09:00-17:00"
    "wednesday" = "09:00-17:00"
    "thursday"  = "09:00-17:00"
    "friday"    = "09:00-17:00"
  }
}

Use the period attribute to assign time periods to Notification and Dependency objects:

object Notification "mail" {
  import "generic-notification"

  host_name = "localhost"

  command = "mail-notification"
  users = [ "icingaadmin" ]
  period = "workhours"
}
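
The check_period attribute mentioned at the beginning of this section is set in the same manner on a Host or Service object. A minimal sketch, in which the service name office-printer and the use of the ping4 check command are assumptions:

```
object Service "office-printer" {
  import "generic-service"

  host_name = "localhost"
  check_command = "ping4"

  // checks are only executed inside the workhours time period
  check_period = "workhours"
}
```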

Commands

Icinga 2 uses three different command object types to specify how checks should be performed, notifications should be sent, and events should be handled.

Environment Variables for Commands

Please check Runtime Custom Attributes as Environment Variables.

Check Commands

CheckCommand objects define the command line used to call a check.

CheckCommand objects are referenced by Host and Service objects using the check_command attribute.

Note

Make sure that the checker feature is enabled in order to execute checks.

Integrate the Plugin with a CheckCommand Definition

CheckCommand objects require the ITL template plugin-check-command to support native plugin based check methods.

Unless you have done so already, download your check plugin and put it into the PluginDir directory. The following example uses the check_disk plugin shipped with the Monitoring Plugins package.

The plugin path and all command arguments are written as a list of double-quoted strings for proper shell escaping.

Call the check_disk plugin with the --help parameter to see all available options. Our example defines warning (-w) and critical (-c) thresholds for the disk usage. Without any partition defined (-p) it will check all local partitions.

icinga@icinga2 $ /usr/lib/nagios/plugins/check_disk --help
...
This plugin checks the amount of used disk space on a mounted file system
and generates an alert if free space is less than one of the threshold values


Usage:
 check_disk -w limit -c limit [-W limit] [-K limit] {-p path | -x device}
[-C] [-E] [-e] [-f] [-g group ] [-k] [-l] [-M] [-m] [-R path ] [-r path ]
[-t timeout] [-u unit] [-v] [-X type] [-N type]
...

Note

Don't execute plugins as root and always use the absolute path to the plugin! Trust us.

The next step is to understand how command parameters are passed from a host or service object, and to add a CheckCommand definition based on these required parameters and/or default values.

Passing Check Command Parameters from Host or Service

Unlike in Icinga 1.x, check command parameters are defined as custom attributes which can be accessed as runtime macros by the executed check command.

Define the default check command custom attributes disk_wfree and disk_cfree (the naming schema is freely definable) and their default threshold values. You can then use these custom attributes as runtime macros for command arguments on the command line.

Tip

Use a common command type as prefix for your command arguments to increase readability. disk_wfree helps understanding the context better than just wfree as argument.

The default custom attributes can be overridden by the custom attributes defined in the service using the check command my-disk. The custom attributes can also be inherited from a parent template using additive inheritance (+=).

object CheckCommand "my-disk" {
  import "plugin-check-command"

  command = [ PluginDir + "/check_disk" ]

  arguments = {
    "-w" = "$disk_wfree$%"
    "-c" = "$disk_cfree$%"
  }

  vars.disk_wfree = 20
  vars.disk_cfree = 10
}

The host localhost with the service my-disk checks all disks with modified custom attributes (warning thresholds at 10%, critical thresholds at 5% free disk space).

object Host "localhost" {
  import "generic-host"

  address = "127.0.0.1"
  address6 = "::1"
}

object Service "my-disk" {
  import "generic-service"

  host_name = "localhost"
  check_command = "my-disk"

  vars.disk_wfree = 10
  vars.disk_cfree = 5
}
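
The additive inheritance mentioned above could be sketched as follows. The template name disk-thresholds, the service name my-disk-data and the disk_partition attribute are hypothetical; the my-disk command shown earlier does not use disk_partition, it only demonstrates that += adds to the inherited vars dictionary rather than replacing it.

```
template Service "disk-thresholds" {
  import "generic-service"

  vars.disk_wfree = 15
  vars.disk_cfree = 8
}

object Service "my-disk-data" {
  import "disk-thresholds"

  host_name = "localhost"
  check_command = "my-disk"

  // += merges into the inherited dictionary: disk_wfree and
  // disk_cfree from the template are kept, disk_partition is added
  vars += {
    disk_partition = "/data"
  }
}
```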

Command Arguments

When a check command line is defined using the command attribute, Icinga 2 resolves all macros in the static string or array. Sometimes it is required to extend the arguments list based on a condition evaluated at command execution, or to make arguments optional so that they are only set if Icinga 2 can resolve the macro value.

object CheckCommand "check_http" {
  import "plugin-check-command"

  command = [ PluginDir + "/check_http" ]

  arguments = {
    "-H" = "$http_vhost$"
    "-I" = "$http_address$"
    "-u" = "$http_uri$"
    "-p" = "$http_port$"
    "-S" = {
      set_if = "$http_ssl$"
    }
    "--sni" = {
      set_if = "$http_sni$"
    }
    "-a" = {
      value = "$http_auth_pair$"
      description = "Username:password on sites with basic authentication"
    }
    "--no-body" = {
      set_if = "$http_ignore_body$"
    }
    "-r" = "$http_expect_body_regex$"
    "-w" = "$http_warn_time$"
    "-c" = "$http_critical_time$"
    "-e" = "$http_expect$"
  }

  vars.http_address = "$address$"
  vars.http_ssl = false
  vars.http_sni = false
}

The example shows the check_http check command defining the most common arguments. Each of them is optional by default and will be omitted if the value is not set. For example if the service calling the check command does not have vars.http_port set, it won't get added to the command line.

If the vars.http_ssl custom attribute is set in the service, host or command object definition, Icinga 2 will add the -S argument based on the set_if numeric value to the command line. String values are not supported.

That way you can use the same check_http command definition for checks with and without SSL enabled, which saves you from duplicating command definitions.

Details on all available options can be found in the CheckCommand object definition.
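The two rules described above (an argument is omitted when its macro cannot be resolved, and a set_if key is only added when its condition is true) can be modelled in a few lines of Python. This is only an illustrative sketch, not Icinga 2's implementation; the helpers `resolve` and `build_args`, the regular expression and the sample values are hypothetical.

```python
import re

MACRO = re.compile(r"\$([A-Za-z0-9_.]+)\$")

def resolve(text, macros):
    """Return text with $name$ macros replaced, or None if any macro is unset."""
    missing = [name for name in MACRO.findall(text) if name not in macros]
    if missing:
        return None
    return MACRO.sub(lambda m: str(macros[m.group(1)]), text)

def build_args(arguments, macros):
    """Hypothetical model of the optional-argument rules described above."""
    argv = []
    for key, spec in arguments.items():
        if isinstance(spec, str):
            spec = {"value": spec}
        if "set_if" in spec:
            # set_if: add the key only if the macro has a true/non-zero value
            name = spec["set_if"].strip("$")
            if not macros.get(name):
                continue
        if "value" in spec:
            value = resolve(spec["value"], macros)
            if value is None:
                continue  # macro not set: the whole argument is omitted
            argv += [key, value]
        else:
            argv.append(key)
    return argv

arguments = {
    "-I": "$http_address$",
    "-p": "$http_port$",
    "-S": {"set_if": "$http_ssl$"},
}

plain = build_args(arguments, {"http_address": "10.0.0.1", "http_ssl": False})
tls = build_args(arguments, {"http_address": "10.0.0.1", "http_port": 8443, "http_ssl": True})
print(plain)  # ['-I', '10.0.0.1']
print(tls)    # ['-I', '10.0.0.1', '-p', '8443', '-S']
```

The first call leaves out both -p (no http_port value) and -S (http_ssl is false), which mirrors how one command definition can serve checks with and without SSL.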

Apply Services with Custom Command Arguments

Imagine the following scenario: The my-host1 host is reachable using the default port 22, while the my-host2 host requires a different port on 2222. Both hosts are in the hostgroup my-linux-servers.

object HostGroup "my-linux-servers" {
  display_name = "Linux Servers"
  assign where host.vars.os == "Linux"
}

/* this one has port 22 opened */
object Host "my-host1" {
  import "generic-host"
  address = "129.168.1.50"
  vars.os = "Linux"
}

/* this one listens on a different ssh port */
object Host "my-host2" {
  import "generic-host"
  address = "129.168.2.50"
  vars.os = "Linux"
  vars.custom_ssh_port = 2222
}

All hosts in the my-linux-servers hostgroup should get the my-ssh service applied based on an apply rule. The optional ssh_port command argument should be inherited from the host the service is applied to. If not set, the check command my-ssh will omit the argument. The host argument is special: skip_key tells Icinga 2 to ignore the key, and directly put the value onto the command line. The order attribute specifies that this argument is the first one (-1 is smaller than the other defaults).

object CheckCommand "my-ssh" {
  import "plugin-check-command"

  command = [ PluginDir + "/check_ssh" ]

  arguments = {
    "-p" = "$ssh_port$"
    "host" = {
      value = "$ssh_address$"
      skip_key = true
      order = -1
    }
  }

  vars.ssh_address = "$address$"
}

/* apply ssh service */
apply Service "my-ssh" {
  import "generic-service"
  check_command = "my-ssh"

  //set the command argument for ssh port with a custom host attribute, if set
  vars.ssh_port = "$host.vars.custom_ssh_port$"

  assign where "my-linux-servers" in host.groups
}

The my-host1 host will get the my-ssh service checking on the default port:

[2014-05-26 21:52:23 +0200] notice/Process: Running command '/usr/lib/nagios/plugins/check_ssh', '129.168.1.50': PID 27281

The my-host2 host passes its custom_ssh_port value on to the service, so a different command line is executed:

[2014-05-26 21:51:32 +0200] notice/Process: Running command '/usr/lib/nagios/plugins/check_ssh', '-p', '2222', '129.168.2.50': PID 26956

Notification Commands

NotificationCommand objects define how notifications are delivered to external interfaces (E-Mail, XMPP, IRC, Twitter, etc).

NotificationCommand objects are referenced by Notification objects using the command attribute.

NotificationCommand objects require the ITL template plugin-notification-command to support native plugin-based notifications.

Note

Make sure that the notification feature is enabled on your master instance in order to execute notification commands.

Below is an example using runtime macros from Icinga 2 (such as $service.output$ for the current check output) sending an email to the user(s) associated with the notification itself ($user.email$).

If you want to specify default values for some of the custom attribute definitions, you can add a vars dictionary as shown for the CheckCommand object.

object NotificationCommand "mail-service-notification" {
  import "plugin-notification-command"

  command = [ SysconfDir + "/icinga2/scripts/mail-notification.sh" ]

  env = {
    NOTIFICATIONTYPE = "$notification.type$"
    SERVICEDESC = "$service.name$"
    HOSTALIAS = "$host.display_name$"
    HOSTADDRESS = "$address$"
    SERVICESTATE = "$service.state$"
    LONGDATETIME = "$icinga.long_date_time$"
    SERVICEOUTPUT = "$service.output$"
    NOTIFICATIONAUTHORNAME = "$notification.author$"
    NOTIFICATIONCOMMENT = "$notification.comment$"
    HOSTDISPLAYNAME = "$host.display_name$"
    SERVICEDISPLAYNAME = "$service.display_name$"
    USEREMAIL = "$user.email$"
  }
}

The command attribute in the mail-service-notification command refers to the following shell script. The macros specified in the env dictionary are exported as environment variables and can be used in the notification script:

#!/usr/bin/env bash
template=$(cat <<TEMPLATE
***** Icinga  *****

Notification Type: $NOTIFICATIONTYPE

Service: $SERVICEDESC
Host: $HOSTALIAS
Address: $HOSTADDRESS
State: $SERVICESTATE

Date/Time: $LONGDATETIME

Additional Info: $SERVICEOUTPUT

Comment: [$NOTIFICATIONAUTHORNAME] $NOTIFICATIONCOMMENT
TEMPLATE
)

# Quote $template so that printf receives the whole message as one argument
/usr/bin/printf "%b" "$template" | mail -s "$NOTIFICATIONTYPE - $HOSTDISPLAYNAME - $SERVICEDISPLAYNAME is $SERVICESTATE" $USEREMAIL

Note

This example is for exim only. Requires changes for sendmail and other MTAs.

While it's possible to specify the entire notification command right in the NotificationCommand object, it is generally advisable to create a shell script in the /etc/icinga2/scripts directory and have the NotificationCommand object refer to that.

Event Commands

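Unlike notifications, event commands for hosts/services are called on every check execution if one of these conditions matches:

  • The host/service is in a soft state
  • The host/service state changes into a hard state
  • The host/service state recovers from a soft or hard state to OK/Up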

EventCommand objects are referenced by Host and Service objects using the event_command attribute.

Therefore the EventCommand object should define a command line evaluating the current service state and other service runtime attributes available through runtime vars. Runtime macros such as $service.state_type$ and $service.state$ are processed by Icinga 2, which allows events to be triggered in a fine-grained way.

Common use case scenarios are a failing HTTP check requiring an immediate restart via event command, or if an application is locked and requires a restart upon detection.

EventCommand objects require the ITL template plugin-event-command to support native plugin based checks.

Use Event Commands to Restart Service Daemon

The following example will trigger a restart of the httpd daemon via ssh when the http service check fails. If the service state is OK, it will not trigger any event action.

Requirements:

  • ssh connection
  • icinga user with public key authentication
  • icinga user with sudo permissions for restarting the httpd daemon.

Example on Debian:

# ls /home/icinga/.ssh/
authorized_keys

# visudo
icinga  ALL=(ALL) NOPASSWD: /etc/init.d/apache2 restart

Define a generic EventCommand object event_by_ssh which can be used for all event commands triggered using ssh:

/* pass event commands through ssh */
object EventCommand "event_by_ssh" {
  import "plugin-event-command"

  command = [ PluginDir + "/check_by_ssh" ]

  arguments = {
    "-H" = "$event_by_ssh_address$"
    "-p" = "$event_by_ssh_port$"
    "-C" = "$event_by_ssh_command$"
    "-l" = "$event_by_ssh_logname$"
    "-i" = "$event_by_ssh_identity$"
    "-q" = {
      set_if = "$event_by_ssh_quiet$"
    }
    "-w" = "$event_by_ssh_warn$"
    "-c" = "$event_by_ssh_crit$"
    "-t" = "$event_by_ssh_timeout$"
  }

  vars.event_by_ssh_address = "$address$"
  vars.event_by_ssh_quiet = false
}

The actual event command only passes the event_by_ssh_command attribute. The event_by_ssh_service custom attribute takes care of passing the correct daemon name, while test $service.state_id$ -gt 0 makes sure that the daemon is only restarted when the service is in a not-OK state.

object EventCommand "event_by_ssh_restart_service" {
  import "event_by_ssh"

  //only restart the daemon if state > 0 (not-ok)
  //requires sudo permissions for the icinga user
  vars.event_by_ssh_command = "test $service.state_id$ -gt 0 && sudo /etc/init.d/$event_by_ssh_service$ restart"
}

Now set the event_command attribute to event_by_ssh_restart_service and tell it which service should be restarted using the event_by_ssh_service attribute.

object Service "http" {
  import "generic-service"
  host_name = "remote-http-host"
  check_command = "http"

  event_command = "event_by_ssh_restart_service"
  vars.event_by_ssh_service = "$host.vars.httpd_name$"

  //vars.event_by_ssh_logname = "icinga"
  //vars.event_by_ssh_identity = "/home/icinga/.ssh/id_rsa"
}

Each host with this service must then define the httpd_name custom attribute (for example generated from your CMDB):

object Host "remote-http-host" {
  import "generic-host"
  address = "192.168.1.100"

  vars.httpd_name = "apache2"
}

You can test-drive this example by manually stopping the httpd daemon on your remote-http-host. Enable the debuglog feature and tail the /var/log/icinga2/debug.log file.

Remote Host Terminal:

# date; service apache2 status
Mon Sep 15 18:57:39 CEST 2014
Apache2 is running (pid 23651).
# date; service apache2 stop
Mon Sep 15 18:57:47 CEST 2014
[ ok ] Stopping web server: apache2 ... waiting .

Icinga 2 Host Terminal:

[2014-09-15 18:58:32 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_http' '-I' '192.168.1.100': PID 32622
[2014-09-15 18:58:32 +0200] notice/Process: PID 32622 ('/usr/lib64/nagios/plugins/check_http' '-I' '192.168.1.100') terminated with exit code 2
[2014-09-15 18:58:32 +0200] notice/Checkable: State Change: Checkable remote-http-host!http soft state change from OK to CRITICAL detected.
[2014-09-15 18:58:32 +0200] notice/Checkable: Executing event handler 'event_by_ssh_restart_service' for service 'remote-http-host!http'
[2014-09-15 18:58:32 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_by_ssh' '-C' 'test 2 -gt 0 && sudo /etc/init.d/apache2 restart' '-H' '192.168.1.100': PID 32623
[2014-09-15 18:58:33 +0200] notice/Process: PID 32623 ('/usr/lib64/nagios/plugins/check_by_ssh' '-C' 'test 2 -gt 0 && sudo /etc/init.d/apache2 restart' '-H' '192.168.1.100') terminated with exit code 0

Remote Host Terminal:

# date; service apache2 status
Mon Sep 15 18:58:44 CEST 2014
Apache2 is running (pid 24908).

Dependencies

Icinga 2 uses host and service Dependency objects for determining their network reachability. The parent_host_name and parent_service_name attributes are mandatory for service dependencies; parent_host_name is required for host dependencies.

A service can depend on a host, and vice versa. A service has an implicit dependency (parent) to its host. A host to host dependency acts implicitly as host parent relation. When dependencies are calculated, not only the immediate parent is taken into account but all parents are inherited.

Notifications are suppressed if a host or service becomes unreachable.
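The inheritance of parents described above can be pictured as a walk over the dependency graph. The following is a minimal sketch under simplifying assumptions (a dependency fails whenever its parent is not in an OK/UP state; the object names and the data layout are hypothetical), not the actual Icinga 2 algorithm.

```python
def is_reachable(obj, parents, state_ok, _seen=None):
    """Return True if every direct and inherited parent of obj is in an OK/UP state."""
    seen = _seen if _seen is not None else set()
    for parent in parents.get(obj, []):
        if parent in seen:
            continue  # guard against dependency loops
        seen.add(parent)
        if not state_ok[parent] or not is_reachable(parent, parents, state_ok, seen):
            return False
    return True

# Hypothetical chain: the service depends on its host, the host depends on a router
parents = {
    "web-server!http": ["web-server"],
    "web-server": ["router"],
}
state_ok = {"web-server": True, "router": False}

print(is_reachable("web-server!http", parents, state_ok))  # False: the router is down
print(is_reachable("router", parents, state_ok))           # True: it has no parents
```

Although the service's immediate parent is fine, the failed router further up the chain makes the service unreachable, so its notifications would be suppressed.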

Implicit Dependencies for Services on Host

Icinga 2 automatically adds an implicit dependency for services on their host. That way service notifications are suppressed when a host is DOWN or UNREACHABLE. This dependency does not overwrite other dependencies and implicitly sets disable_notifications = true and states = [ Up ] for all service objects.

Service checks are still executed. If you want to prevent them from happening, you can apply the following dependency to all services setting their host as parent_host_name and disabling the checks. assign where true matches on all Service objects.

apply Dependency "disable-host-service-checks" to Service {
  disable_checks = true
  assign where true
}

Dependencies for Network Reachability

A common scenario is the Icinga 2 server behind a router. Checking internet access by pinging the Google DNS server google-dns is a common method, but will fail in case the dsl-router host is down. Therefore the example below defines a host dependency which acts implicitly as parent relation too.

Furthermore the host may be reachable but ping probes are dropped by the router's firewall. In case the dsl-router ping4 service check fails, all further checks for the ping4 service on the google-dns host should be suppressed. This is achieved by setting the disable_checks attribute to true.

object Host "dsl-router" {
  address = "192.168.1.1"
}

object Host "google-dns" {
  address = "8.8.8.8"
}

apply Service "ping4" {
  import "generic-service"

  check_command = "ping4"

  assign where host.address
}

apply Dependency "internet" to Host {
  parent_host_name = "dsl-router"
  disable_checks = true
  disable_notifications = true

  assign where host.name != "dsl-router"
}

apply Dependency "internet" to Service {
  parent_host_name = "dsl-router"
  parent_service_name = "ping4"
  disable_checks = true

  assign where host.name != "dsl-router"
}

Dependencies for Agent Checks

Another classic example is agent-based checks. You would define a health check for the agent daemon responding to your requests, and make all other services querying that daemon depend on that health check.

The following configuration defines two nrpe based service checks nrpe-load and nrpe-disk applied to the nrpe-server. The health check is defined as nrpe-health service.

apply Service "nrpe-health" {
  import "generic-service"
  check_command = "nrpe"
  assign where match("nrpe-*", host.name)
}

apply Service "nrpe-load" {
  import "generic-service"
  check_command = "nrpe"
  vars.nrpe_command = "check_load"
  assign where match("nrpe-*", host.name)
}

apply Service "nrpe-disk" {
  import "generic-service"
  check_command = "nrpe"
  vars.nrpe_command = "check_disk"
  assign where match("nrpe-*", host.name)
}

object Host "nrpe-server" {
  import "generic-host"
  address = "192.168.1.5"
}

apply Dependency "disable-nrpe-checks" to Service {
  parent_service_name = "nrpe-health"

  states = [ OK ]
  disable_checks = true
  disable_notifications = true
  assign where service.check_command == "nrpe"
  ignore where service.name == "nrpe-health"
}

The disable-nrpe-checks dependency is applied to all services on the nrpe-server host using the nrpe check_command attribute but not the nrpe-health service itself.

Downtimes

Downtimes can be scheduled for planned server maintenance or any other targeted service outage you are aware of in advance.

Downtimes will suppress any notifications, and may trigger other downtimes too. If the downtime was set by accident, or the duration exceeds the maintenance, you can manually cancel the downtime. Planned downtimes will also be taken into account for SLA reporting tools calculating the SLAs based on the state and downtime history.

Multiple downtimes for a single object may overlap. This is useful when you want to extend a maintenance window that takes longer than expected. If there are multiple downtimes triggered for one object, the overall downtime depth will be greater than 1.

If the downtime was scheduled after the problem changed to a critical hard state triggering a problem notification, and the service recovers during the downtime window, the recovery notification won't be suppressed.

Fixed and Flexible Downtimes

A fixed downtime will be activated at the defined start time, and removed at the end time. During this time window the service state will change to NOT-OK and then actually trigger the downtime. Notifications are suppressed and the downtime depth is incremented.

Common scenarios are a planned distribution upgrade on your Linux servers, or database updates in your warehouse. The customer knows about a fixed downtime window between 23:00 and 24:00. After 24:00 all problems should be alerted again. The solution is simple: schedule a fixed downtime starting at 23:00 and ending at 24:00.

Unlike a fixed downtime, a flexible downtime will be triggered by the state change in the time span defined by start and end time, and then last for the specified duration in minutes.

Imagine the following scenario: Your service is frequently polled by users trying to grab free deleted domains for immediate registration. Between 07:30 and 08:00 the impact will hit for 15 minutes and generate a network outage visible to the monitoring. The service is still alive, but answers Icinga 2 service checks too slowly. For that reason, you may want to schedule a downtime between 07:30 and 08:00 with a duration of 15 minutes. The downtime will then last from its trigger time until the duration is over. After that, the downtime is removed (may happen before or after the actual end time!).
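The timing of a flexible downtime can be worked through with a small calculation. The sketch below only models the description above; the function name `flexible_downtime_window`, the example date and the inclusive treatment of the start and end boundaries are assumptions.

```python
from datetime import datetime, timedelta

def flexible_downtime_window(start, end, duration, problem_time):
    """Return (trigger, removal) for a flexible downtime, or None if it never triggers."""
    if not (start <= problem_time <= end):
        return None  # state change outside the scheduled time span
    return problem_time, problem_time + duration

start = datetime(2014, 9, 15, 7, 30)
end = datetime(2014, 9, 15, 8, 0)
duration = timedelta(minutes=15)

early = flexible_downtime_window(start, end, duration, datetime(2014, 9, 15, 7, 35))
late = flexible_downtime_window(start, end, duration, datetime(2014, 9, 15, 7, 55))
print(early[1])  # 2014-09-15 07:50:00, before the end time
print(late[1])   # 2014-09-15 08:10:00, after the end time
```

A problem at 07:35 ends the downtime ten minutes before the scheduled end, while a problem at 07:55 keeps it active until 08:10.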

Scheduling a downtime

This can either happen through a web interface or by sending an external command to the external command pipe provided by the ExternalCommandListener configuration.

Fixed downtimes require a start and end time (a duration will be ignored). Flexible downtimes need a start and end time for the time span, and a duration independent from that time span.

Triggered Downtimes

This is optional when scheduling a downtime. If there is already a downtime scheduled for a future maintenance, the current downtime can be triggered by that downtime. This is useful if you have scheduled a host downtime and are now scheduling a child host's downtime which should be triggered by the parent downtime on NOT-OK state change.

Recurring Downtimes

ScheduledDowntime objects can be used to set up recurring downtimes for services.

Example:

apply ScheduledDowntime "backup-downtime" to Service {
  author = "icingaadmin"
  comment = "Scheduled downtime for backup"

  ranges = {
    monday = "02:00-03:00"
    tuesday = "02:00-03:00"
    wednesday = "02:00-03:00"
    thursday = "02:00-03:00"
    friday = "02:00-03:00"
    saturday = "02:00-03:00"
    sunday = "02:00-03:00"
  }

  assign where "backup" in service.groups
}

Comments

Comments can be added at runtime and are persistent over restarts. You can add useful information for others on repeating incidents (for example "last time syslog at 100% cpu on 17.10.2013 due to stale nfs mount") which is primarily accessible using web interfaces.

Adding and deleting comment actions are possible through the external command pipe provided with the ExternalCommandListener configuration. The caller must pass the comment id in case of manipulating an existing comment.

Acknowledgements

If a problem is alerted and notified you may signal the other notification recipients that you are aware of the problem and will handle it.

By sending an acknowledgement to Icinga 2 (using the external command pipe provided with ExternalCommandListener configuration) all future notifications are suppressed, a new comment is added with the provided description and a notification with the type NotificationFilterAcknowledgement is sent to all notified users.

Expiring Acknowledgements

Once a problem is acknowledged it may disappear from your handled problems dashboard and no-one ever looks at it again since it will suppress notifications too.

This fire-and-forget action is quite common. If you're sure that a current problem should be resolved in the future at a defined time, you can define an expiration time when acknowledging the problem.

Icinga 2 will clear the acknowledgement when expired and start to re-notify if the problem persists.

Custom Attributes

Using Custom Attributes for Apply Rules

Custom attributes are not only used at runtime in command definitions to pass command arguments, but are also a smart way to define patterns and groups for applying objects for dynamic config generation.

There are several ways of using custom attributes with apply rules:

  • As simple attribute literal (number, string, boolean) for expression conditions (assign where, ignore where)
  • As array or dictionary attribute with nested values (e.g. dictionaries in dictionaries) in apply for rules.

Features like DB IDO, Livestatus or StatusData dump such array or dictionary values as an encoded JSON string, and set is_json or cv_is_json respectively to 1.

If arrays are used in runtime macros (for example $host.groups$) all entries are separated using the ; character. If an entry contains a semi-colon itself, it is escaped like this: entry1;ent\;ry2;entry3.
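The escaping rule for array macros can be reproduced in a single line of Python. The helper name `join_array_macro` is hypothetical; the sketch only mirrors the format described above.

```python
def join_array_macro(entries):
    """Join entries with ';', escaping any ';' inside an entry with a backslash."""
    return ";".join(entry.replace(";", "\\;") for entry in entries)

print(join_array_macro(["entry1", "ent;ry2", "entry3"]))  # entry1;ent\;ry2;entry3
```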

Using Custom Attributes at Runtime

Custom attributes may be used in command definitions to dynamically change how the command is executed.

Additionally there are Icinga 2 features such as the PerfDataWriter type which use custom runtime attributes to format their output.

Tip

Custom attributes are identified by the 'vars' dictionary attribute as short name. Accessing the different attribute keys is possible using the '.' accessor.

Custom attributes in command definitions or performance data templates are evaluated at runtime when executing a command. These custom attributes cannot be used elsewhere (e.g. in other configuration attributes).

Custom attribute values must be either a string, a number or a boolean value. Arrays and dictionaries cannot be used.

Here is an example of a command definition which uses user-defined custom attributes:

object CheckCommand "my-ping" {
  import "plugin-check-command"

  command = [
    PluginDir + "/check_ping", "-4"
  ]

  arguments = {
    "-H" = "$ping_address$"
    "-w" = "$ping_wrta$,$ping_wpl$%"
    "-c" = "$ping_crta$,$ping_cpl$%"
    "-p" = "$ping_packets$"
    "-t" = "$ping_timeout$"
  }

  vars.ping_address = "$address$"
  vars.ping_wrta = 100
  vars.ping_wpl = 5
  vars.ping_crta = 200
  vars.ping_cpl = 15
  vars.ping_packets = 5
  vars.ping_timeout = 0
}

Custom attribute names used at runtime must be enclosed in two $ signs, e.g. $address$. To use the $ sign as a literal character, escape it with an additional dollar sign ($$). This example also makes use of the arguments dictionary to pass command arguments to the command line. The -4 flag must be added as an additional array element of the command attribute.

Runtime Custom Attributes Evaluation Order

When executing commands Icinga 2 checks the following objects in this order to look up custom attributes and their respective values:

  1. User object (only for notifications)
  2. Service object
  3. Host object
  4. Command object
  5. Global custom attributes in the vars constant

This evaluation order allows you to define default values for custom attributes in your command objects. The my-ping command shown above uses this to set default values for some of the latency thresholds and timeouts.

When using the my-ping command you can override some or all of the custom attributes in the service definition like this:

object Service "ping" {
  host_name = "localhost"
  check_command = "my-ping"

  vars.ping_packets = 10 // Overrides the default value of 5 given in the command
}

If a custom attribute isn't defined anywhere an empty value is used and a warning is emitted to the Icinga 2 log.
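The lookup order can be modelled as a search through a list of dictionaries where the first match wins. This is a sketch of the behaviour described in this section, not Icinga 2 code; the function name and the plain dictionaries standing in for the vars of each object are hypothetical.

```python
import logging

def lookup_custom_attribute(name, user=None, service=None, host=None,
                            command=None, global_vars=None):
    """Search the vars dictionaries in the documented order; the first match wins."""
    for scope in (user, service, host, command, global_vars):
        if scope is not None and name in scope:
            return scope[name]
    logging.warning("custom attribute '%s' is not defined", name)
    return ""  # undefined attributes resolve to an empty value

command_vars = {"ping_packets": 5, "ping_wrta": 100}  # defaults from my-ping
service_vars = {"ping_packets": 10}                   # override in the service

packets = lookup_custom_attribute("ping_packets", service=service_vars, command=command_vars)
wrta = lookup_custom_attribute("ping_wrta", service=service_vars, command=command_vars)
missing = lookup_custom_attribute("ping_unknown", service=service_vars, command=command_vars)
print(packets, wrta, repr(missing))  # 10 100 ''
```

The service value of 10 shadows the command default of 5, while ping_wrta falls through to the command object because nothing closer to the front of the list defines it.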

Best Practice

By convention every host should have an address attribute. Hosts which have an IPv6 address should also have an address6 attribute.

Runtime Custom Attributes as Environment Variables

The env command object attribute specifies a list of environment variables with values calculated from either runtime macros or custom attributes which should be exported as environment variables prior to executing the command.

This is useful for example for hiding sensitive information on the command line output when passing credentials to database checks:

object CheckCommand "mysql-health" {
  import "plugin-check-command"

  command = [
    PluginDir + "/check_mysql"
  ]

  arguments = {
    "-H" = "$mysql_address$"
    "-d" = "$mysql_database$"
  }

  vars.mysql_address = "$address$"
  vars.mysql_database = "icinga"
  vars.mysql_user = "icinga_check"
  vars.mysql_pass = "password"

  env.MYSQLUSER = "$mysql_user$"
  env.MYSQLPASS = "$mysql_pass$"
}
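To illustrate why environment variables keep credentials out of the visible command line, the following sketch starts a child process with extra environment variables. It is not how Icinga 2 or check_mysql work internally; `run_plugin` is a hypothetical helper and a child Python process stands in for a plugin that reads its credentials from the environment.

```python
import os
import subprocess
import sys

def run_plugin(argv, env_vars):
    """Run argv with extra environment variables; they never appear in the argument list."""
    env = {**os.environ, **env_vars}
    result = subprocess.run(argv, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Stand-in for a plugin: a child Python process that reads the variable
child = [sys.executable, "-c", "import os; print(os.environ['MYSQLUSER'])"]
output = run_plugin(child, {"MYSQLUSER": "icinga_check", "MYSQLPASS": "password"})
print(output)  # icinga_check
```

The argument list in `child` contains no credentials, so they do not show up in process listings or in logged command lines.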

Multiple Host Addresses using Custom Attributes

The following example defines a Host with three different interface addresses defined as custom attributes in the vars dictionary. The if-eth0 and if-eth1 services will import these values into the address custom attribute. This attribute is available through the generic $address$ runtime macro.

object Host "multi-ip" {
  check_command = "dummy"
  vars.address_lo = "127.0.0.1"
  vars.address_eth0 = "10.0.0.10"
  vars.address_eth1 = "192.168.1.10"
}

apply Service "if-eth0" {
  import "generic-service"

  vars.address = "$host.vars.address_eth0$"
  check_command = "my-generic-interface-check"

  assign where host.vars.address_eth0 != ""
}

apply Service "if-eth1" {
  import "generic-service"

  vars.address = "$host.vars.address_eth1$"
  check_command = "my-generic-interface-check"

  assign where host.vars.address_eth1 != ""
}

object CheckCommand "my-generic-interface-check" {
  import "plugin-check-command"

  command = "echo \"This would be the service $service.description$ using the address value: $address$\""
}

The CheckCommand object is just an example to help you with testing and understanding the different custom attributes and runtime macros.

Modified Attributes

Icinga 2 allows you to modify object attributes at runtime so that they differ from the attributes defined in the local configuration. These modified attributes are stored as a bit-shifted value and made available in backends. Icinga 2 stores modified attributes in its state file and restores them on restart.

Modified Attributes can be reset using external commands.

Runtime Macros

Next to custom attributes there are additional runtime macros made available by Icinga 2. These runtime macros reflect the current object state and may change over time while custom attributes are configured statically (but can be modified at runtime using external commands).

Runtime Macro Evaluation Order

Custom attributes can be accessed at runtime using their identifier omitting the vars. prefix. There are special cases when those custom attributes are not set and Icinga 2 provides a fallback to existing object attributes, for example host.address.

In the following example the $address$ macro will be resolved with the value of vars.address.

object Host "localhost" {
  import "generic-host"
  check_command = "my-host-macro-test"
  address = "127.0.0.1"
  vars.address = "127.2.2.2"
}

object CheckCommand "my-host-macro-test" {
  command = "echo \"address: $address$ host.address: $host.address$ host.vars.address: $host.vars.address$\""
}

The check command output will look like

"address: 127.2.2.2 host.address: 127.0.0.1 host.vars.address: 127.2.2.2"

If you alter the host object and remove the vars.address line, Icinga 2 will fail to look up $address$ in the custom attributes dictionary and then look for the host object's attribute.

The check command output will change to

"address: 127.0.0.1 host.address: 127.0.0.1 host.vars.address: "

The same example can be defined for services overriding the address field based on a specific host custom attribute.

object Host "localhost" {
  import "generic-host"
  address = "127.0.0.1"
  vars.macro_address = "127.3.3.3"
}

apply Service "my-macro-test" to Host {
  import "generic-service"
  check_command = "my-service-macro-test"
  vars.address = "$host.vars.macro_address$"

  assign where host.address
}

object CheckCommand "my-service-macro-test" {
  command = "echo \"address: $address$ host.address: $host.address$ host.vars.macro_address: $host.vars.macro_address$ service.vars.address: $service.vars.address$\""
}

When the service check is executed the output looks like

"address: 127.3.3.3 host.address: 127.0.0.1 host.vars.macro_address: 127.3.3.3 service.vars.address: 127.3.3.3"

That way you can easily override existing macros being accessed by their short name like $address$ and refrain from defining multiple check commands (one for $address$ and one for $host.vars.macro_address$).
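The outputs of both examples follow from one rule: a short macro name is looked up in the custom attributes first and only then in the host's object attributes. The sketch below models that rule with plain dictionaries; the function name and its three-scope signature are assumptions made for illustration.

```python
def resolve_short_macro(name, service_vars, host_vars, host_attrs):
    """Resolve a short macro such as 'address': custom attributes first, then host attributes."""
    for scope in (service_vars, host_vars, host_attrs):
        if name in scope:
            return scope[name]
    return ""

host_attrs = {"address": "127.0.0.1"}

# Host sets vars.address: the custom attribute shadows the object attribute
with_vars = resolve_short_macro("address", {}, {"address": "127.2.2.2"}, host_attrs)
# vars.address removed: fall back to host.address
fallback = resolve_short_macro("address", {}, {}, host_attrs)
# Service sets vars.address from the host's macro_address value
overridden = resolve_short_macro(
    "address", {"address": "127.3.3.3"}, {"macro_address": "127.3.3.3"}, host_attrs
)
print(with_vars, fallback, overridden)  # 127.2.2.2 127.0.0.1 127.3.3.3
```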

Host Runtime Macros

The following host macros are available in all commands that are executed for hosts or services:

Name Description
host.name The name of the host object.
host.display_name The value of the display_name attribute.
host.state The host's current state. Can be one of UNREACHABLE, UP and DOWN.
host.state_id The host's current state. Can be one of 0 (up), 1 (down) and 2 (unreachable).
host.state_type The host's current state type. Can be one of SOFT and HARD.
host.check_attempt The current check attempt number.
host.max_check_attempts The maximum number of checks which are executed before changing to a hard state.
host.last_state The host's previous state. Can be one of UNREACHABLE, UP and DOWN.
host.last_state_id The host's previous state. Can be one of 0 (up), 1 (down) and 2 (unreachable).
host.last_state_type The host's previous state type. Can be one of SOFT and HARD.
host.last_state_change The last state change's timestamp.
host.duration_sec The time since the last state change.
host.latency The host's check latency.
host.execution_time The host's check execution time.
host.output The last check's output.
host.perfdata The last check's performance data.
host.last_check The timestamp when the last check was executed.
host.num_services Number of services associated with the host.
host.num_services_ok Number of services associated with the host which are in an OK state.
host.num_services_warning Number of services associated with the host which are in a WARNING state.
host.num_services_unknown Number of services associated with the host which are in an UNKNOWN state.
host.num_services_critical Number of services associated with the host which are in a CRITICAL state.

Service Runtime Macros

The following service macros are available in all commands that are executed for services:

Name Description
service.name The short name of the service object.
service.display_name The value of the display_name attribute.
service.check_command The short name of the command along with any arguments to be used for the check.
service.state The service's current state. Can be one of OK, WARNING, CRITICAL and UNKNOWN.
service.state_id The service's current state. Can be one of 0 (ok), 1 (warning), 2 (critical) and 3 (unknown).
service.state_type The service's current state type. Can be one of SOFT and HARD.
service.check_attempt The current check attempt number.
service.max_check_attempts The maximum number of checks which are executed before changing to a hard state.
service.last_state The service's previous state. Can be one of OK, WARNING, CRITICAL and UNKNOWN.
service.last_state_id The service's previous state. Can be one of 0 (ok), 1 (warning), 2 (critical) and 3 (unknown).
service.last_state_type The service's previous state type. Can be one of SOFT and HARD.
service.last_state_change The last state change's timestamp.
service.duration_sec The time since the last state change.
service.latency The service's check latency.
service.execution_time The service's check execution time.
service.output The last check's output.
service.perfdata The last check's performance data.
service.last_check The timestamp when the last check was executed.

Command Runtime Macros

The following macros are available in all commands:

Name Description
command.name The name of the command object.

User Runtime Macros

The following macros are available in all commands that are executed for users:

Name Description
user.name The name of the user object.
user.display_name The value of the display_name attribute.

Notification Runtime Macros

Name Description
notification.type The type of the notification.
notification.author The author of the notification comment, if existing.
notification.comment The comment of the notification, if existing.

Global Runtime Macros

The following macros are available in all executed commands:

Name Description
icinga.timet Current UNIX timestamp.
icinga.long_date_time Current date and time including timezone information. Example: 2014-01-03 11:23:08 +0000
icinga.short_date_time Current date and time. Example: 2014-01-03 11:23:08
icinga.date Current date. Example: 2014-01-03
icinga.time Current time including timezone information. Example: 11:23:08 +0000
icinga.uptime Current uptime of the Icinga 2 process.

The following macros provide global statistics:

Name Description
icinga.num_services_ok Current number of services in state 'OK'.
icinga.num_services_warning Current number of services in state 'Warning'.
icinga.num_services_critical Current number of services in state 'Critical'.
icinga.num_services_unknown Current number of services in state 'Unknown'.
icinga.num_services_pending Current number of pending services.
icinga.num_services_unreachable Current number of unreachable services.
icinga.num_services_flapping Current number of flapping services.
icinga.num_services_in_downtime Current number of services in downtime.
icinga.num_services_acknowledged Current number of acknowledged service problems.
icinga.num_hosts_up Current number of hosts in state 'Up'.
icinga.num_hosts_down Current number of hosts in state 'Down'.
icinga.num_hosts_unreachable Current number of unreachable hosts.
icinga.num_hosts_flapping Current number of flapping hosts.
icinga.num_hosts_in_downtime Current number of hosts in downtime.
icinga.num_hosts_acknowledged Current number of acknowledged host problems.
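Runtime macros are written as `$name$` inside command and template strings. The following minimal sketch illustrates that substitution step only; the function name `resolve_macros` and the sample values are hypothetical, and leaving unknown macros untouched is a choice made for this sketch, not a statement about how Icinga 2 resolves them.

```python
import re
import time


def resolve_macros(template: str, values: dict) -> str:
    """Replace each $name$ placeholder with values[name].

    Unknown macros are left as-is (an assumption of this sketch).
    """
    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))

    # Macro names here consist of letters, digits, underscores and dots.
    return re.sub(r"\$([A-Za-z0-9_.]+)\$", substitute, template)


# Sample values are made up for illustration.
values = {
    "icinga.timet": int(time.time()),
    "icinga.num_services_ok": 42,
}
print(resolve_macros("ts=$icinga.timet$ ok=$icinga.num_services_ok$", values))
```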

Check Result Freshness

In Icinga 2, freshness checking for active checks is enabled by default. A check result is considered stale if no new check result arrives within the period defined by the check_interval attribute.

threshold = last check execution time + check interval

Passive check freshness is calculated from the check_interval attribute if set.

threshold = last check result time + check interval

If the freshness check fails, i.e. no check result arrived before the threshold was reached, a new check is executed using the command defined by the check_command attribute.
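The two threshold formulas above can be sketched as follows. This is only an illustration of the arithmetic: the function names are hypothetical, timestamps are assumed to be UNIX timestamps in seconds, and whether the real comparison is strict is an assumption.

```python
def freshness_threshold(last_check: float, check_interval: float) -> float:
    """threshold = last check (execution or result) time + check interval."""
    return last_check + check_interval


def is_stale(now: float, last_check: float, check_interval: float) -> bool:
    """A result counts as stale once 'now' has passed the threshold.

    The strict '>' comparison is an assumption of this sketch.
    """
    return now > freshness_threshold(last_check, check_interval)


# Last check at t=1000 with a 300 second interval: stale after t=1300.
print(freshness_threshold(1000, 300))
print(is_stale(1301, 1000, 300))
```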

Check Flapping

The flapping algorithm used in Icinga 2 does not store the past states but calculates a single flapping value based on counters and half-life values. Icinga 2 compares this value with a single flapping threshold configuration attribute named flapping_threshold.

Flapping detection can be enabled or disabled using the enable_flapping attribute.

Volatile Services

By default all services remain in a non-volatile state. When a problem occurs, the SOFT state applies, and once the check counter reaches max_check_attempts, a HARD state transition happens. Notifications are only triggered by HARD state changes and are then re-sent at the interval defined by the interval attribute.

It may be reasonable to have a volatile service which stays in a HARD state type if the service stays in a NOT-OK state. That way each service recheck will automatically trigger a notification unless the service is acknowledged or in a scheduled downtime.

External Commands

Icinga 2 provides an external command pipe for processing commands triggering specific actions (for example rescheduling a service check through the web interface).

In order to enable the ExternalCommandListener configuration use the following command and restart Icinga 2 afterwards:

# icinga2 feature enable command

Icinga 2 creates the command pipe file as /var/run/icinga2/cmd/icinga2.cmd using the default configuration.

Web interfaces and other Icinga addons are able to send commands to Icinga 2 through the external command pipe, for example for rescheduling a forced service check:

# /bin/echo "[`date +%s`] SCHEDULE_FORCED_SVC_CHECK;localhost;ping4;`date +%s`" >> /var/run/icinga2/cmd/icinga2.cmd

# tail -f /var/log/messages

Oct 17 15:01:25 icinga-server icinga2: Executing external command: [1382014885] SCHEDULE_FORCED_SVC_CHECK;localhost;ping4;1382014885
Oct 17 15:01:25 icinga-server icinga2: Rescheduling next check for service 'ping4'
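The command line written to the pipe has the form `[timestamp] NAME;arg1;arg2;...`, as the shell example and log output above show. A minimal sketch of building and sending such a line is given below; the helper names are hypothetical, and the pipe path is the default mentioned above.

```python
import time


def format_external_command(name, *args, timestamp=None):
    """Build '[timestamp] NAME;arg1;arg2;...' for the external command pipe."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return "[{}] {}".format(ts, ";".join([name, *map(str, args)]))


def send_external_command(line, pipe_path="/var/run/icinga2/cmd/icinga2.cmd"):
    """Write one command line to the command pipe.

    Opening the pipe blocks until Icinga 2 has it open for reading.
    """
    with open(pipe_path, "w") as pipe:
        pipe.write(line + "\n")


now = int(time.time())
print(format_external_command("SCHEDULE_FORCED_SVC_CHECK", "localhost", "ping4", now))
```

Calling `send_external_command()` with the formatted line is the equivalent of the `echo` command shown earlier and requires the command feature to be enabled.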

External Command List

A list of currently supported external commands can be found here.

Detailed information on the commands and their required parameters can be found in the Icinga 1.x documentation.

Logging

Icinga 2 supports three different types of logging:

  • File logging
  • Syslog (on *NIX-based operating systems)
  • Console logging (STDOUT on tty)

You can enable or disable additional loggers using the icinga2 feature enable and icinga2 feature disable commands:

Feature Description
debuglog Debug log (path: /var/log/icinga2/debug.log, severity: debug or higher)
mainlog Main log (path: /var/log/icinga2/icinga2.log, severity: information or higher)
syslog Syslog (severity: warning or higher)

By default the mainlog feature is enabled. When running Icinga 2 on a terminal, log messages with severity information or higher are written to the console.

Performance Data

When a host or service check is executed, plugins should provide so-called performance data. In addition, check performance data can be fetched using Icinga 2 runtime macros such as the check latency or the current service state (or additional custom attributes).

The performance data can be passed to external applications which aggregate and store them in their backends. These tools usually generate graphs for historical reporting and trending.

Well-known addons processing Icinga performance data are PNP4Nagios, inGraph and Graphite.

Writing Performance Data Files

PNP4Nagios, inGraph and Graphios use performance data collector daemons to fetch the current performance files for their backend updates.

Therefore the Icinga 2 PerfdataWriter object allows you to define the output template format for hosts and services, backed by Icinga 2 runtime macros.

host_format_template = "DATATYPE::HOSTPERFDATA\tTIMET::$icinga.timet$\tHOSTNAME::$host.name$\tHOSTPERFDATA::$host.perfdata$\tHOSTCHECKCOMMAND::$host.checkcommand$\tHOSTSTATE::$host.state$\tHOSTSTATETYPE::$host.statetype$"
service_format_template = "DATATYPE::SERVICEPERFDATA\tTIMET::$icinga.timet$\tHOSTNAME::$host.name$\tSERVICEDESC::$service.name$\tSERVICEPERFDATA::$service.perfdata$\tSERVICECHECKCOMMAND::$service.checkcommand$\tHOSTSTATE::$host.state$\tHOSTSTATETYPE::$host.statetype$\tSERVICESTATE::$service.state$\tSERVICESTATETYPE::$service.statetype$"

The default templates are already provided with the Icinga 2 feature configuration which can be enabled using

# icinga2 feature enable perfdata

By default all performance data files are rotated every 15 seconds into the /var/spool/icinga2/perfdata/ directory as host-perfdata.<timestamp> and service-perfdata.<timestamp>. External collectors need to parse the rotated performance data files and then remove the processed files.
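With the templates shown above, each line of a rotated file is a tab-separated list of `KEY::VALUE` fields. A collector could split such a line as sketched below; the function name and the sample line are hypothetical and only follow the default template layout.

```python
def parse_perfdata_line(line: str) -> dict:
    """Split one tab-separated line of KEY::VALUE fields into a dict."""
    record = {}
    for field in line.rstrip("\n").split("\t"):
        # Split at the first '::' only, so values may contain '::' themselves.
        key, separator, value = field.partition("::")
        if separator:
            record[key] = value
    return record


# Made-up sample line in the layout of service_format_template.
sample = (
    "DATATYPE::SERVICEPERFDATA\tTIMET::1382115688\tHOSTNAME::localhost"
    "\tSERVICEDESC::ping4\tSERVICEPERFDATA::rta=0.1ms"
)
print(parse_perfdata_line(sample)["SERVICEDESC"])
```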

Graphite Carbon Cache Writer

While there are some Graphite collector scripts and daemons like Graphios available for Icinga 1.x, it is more reasonable to process the check and plugin performance data directly in memory in Icinga 2. Once new metrics are available, Icinga 2 writes them directly to the TCP socket of the configured Graphite Carbon daemon.

You can enable the feature using

# icinga2 feature enable graphite

By default the GraphiteWriter object expects the Graphite Carbon Cache to listen at 127.0.0.1 on port 2003.

The current naming schema is

icinga.<hostname>.<metricname>
icinga.<hostname>.<servicename>.<metricname>

You can customize the metric prefix name by using the host_name_template and service_name_template configuration attributes.

The example below uses runtime macros and a global constant named GraphiteEnv. The constant name is freely definable and should be put in the constants.conf file.

const GraphiteEnv = "icinga.env1"

object GraphiteWriter "graphite" {
  host_name_template = GraphiteEnv + ".$host.name$"
  service_name_template = GraphiteEnv + ".$host.name$.$service.name$"
}

To make sure Icinga 2 writes a valid label into Graphite some characters are replaced with _ in the target name:

\/.-  (and space)

The resulting name in Graphite might look like:

www-01 / http-cert / response time
icinga.www_01.http_cert.response_time
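The replacement rule can be sketched as below. This is an illustration of the character mapping described above, not the GraphiteWriter implementation; the function names are hypothetical, and the prefix is assumed to be passed through unescaped because it may legitimately contain dots.

```python
import re

# Backslash, slash, dot, hyphen and space are replaced with an underscore.
_UNSAFE = re.compile(r"[\\/.\- ]")


def escape_metric_part(part: str) -> str:
    """Escape one name component (host, service or metric name)."""
    return _UNSAFE.sub("_", part)


def graphite_name(prefix: str, *parts: str) -> str:
    """Join the unescaped prefix and the escaped components with dots."""
    return ".".join([prefix, *(escape_metric_part(p) for p in parts)])


print(graphite_name("icinga", "www-01", "http-cert", "response time"))
# prints: icinga.www_01.http_cert.response_time
```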

In addition to the performance data retrieved from the check plugin, Icinga 2 sends internal check statistic data to Graphite:

metric description
current_attempt current check attempt
max_check_attempts maximum check attempts until the hard state is reached
reachable checked object is reachable
downtime_depth number of downtimes this object is in
execution_time check execution time
latency check latency
state current state of the checked object
state_type 0=SOFT, 1=HARD state

The following example illustrates how to configure the storage-schemas for Graphite Carbon Cache. Please make sure that the order is correct because the first match wins.

[icinga_internals]
pattern = ^icinga\..*\.(max_check_attempts|reachable|current_attempt|execution_time|latency|state|state_type)
retentions = 5m:7d

[icinga_default]
# intervals like PNP4Nagios uses them per default
pattern = ^icinga\.
retentions = 1m:2d,5m:10d,30m:90d,360m:4y

Status Data

Icinga 1.x writes object configuration data and status data at a cyclic interval to its objects.cache and status.dat files. Icinga 2 provides the StatusDataWriter object which dumps all configuration objects and status updates at a regular interval.

# icinga2 feature enable statusdata

Icinga 1.x Classic UI requires this data set as part of its backend.

Note

If you are not using any web interface or addon which uses these files you can safely disable this feature.

Compat Logging

The Icinga 1.x log format is called the Compat Log in Icinga 2 and is provided by the CompatLogger object.

These logs are not only used for informational representation in external web interfaces parsing the logs, but also to generate SLA reports and trends in Icinga 1.x Classic UI. Furthermore the Livestatus feature uses these logs for answering queries to historical tables.

The CompatLogger object can be enabled with

# icinga2 feature enable compatlog

By default, the Icinga 1.x log file called icinga.log is located in /var/log/icinga2/compat. Rotated log files are moved into /var/log/icinga2/compat/archives.

The format cannot be changed without breaking compatibility to existing log parsers.

# tail -f /var/log/icinga2/compat/icinga.log

[1382115688] LOG ROTATION: HOURLY
[1382115688] LOG VERSION: 2.0
[1382115688] HOST STATE: CURRENT;localhost;UP;HARD;1;
[1382115688] SERVICE STATE: CURRENT;localhost;disk;WARNING;HARD;1;
[1382115688] SERVICE STATE: CURRENT;localhost;http;OK;HARD;1;
[1382115688] SERVICE STATE: CURRENT;localhost;load;OK;HARD;1;
[1382115688] SERVICE STATE: CURRENT;localhost;ping4;OK;HARD;1;
[1382115688] SERVICE STATE: CURRENT;localhost;ping6;OK;HARD;1;
[1382115688] SERVICE STATE: CURRENT;localhost;processes;WARNING;HARD;1;
[1382115688] SERVICE STATE: CURRENT;localhost;ssh;OK;HARD;1;
[1382115688] SERVICE STATE: CURRENT;localhost;users;OK;HARD;1;
[1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;disk;1382115705
[1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;http;1382115705
[1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;load;1382115705
[1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;ping4;1382115705
[1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;ping6;1382115705
[1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;processes;1382115705
[1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;ssh;1382115705
[1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;users;1382115705
[1382115731] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;ping6;2;critical test|
[1382115731] SERVICE ALERT: localhost;ping6;CRITICAL;SOFT;2;critical test
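Each line in the excerpt above has the shape `[timestamp] TYPE: field;field;...`. A log parser could split it as sketched here; the names `CompatLogEntry` and `parse_compat_line` are hypothetical, and the sketch only covers the line shape visible in this excerpt.

```python
import re
from typing import NamedTuple, Optional


class CompatLogEntry(NamedTuple):
    timestamp: int   # UNIX timestamp from the leading [..] block
    kind: str        # e.g. "SERVICE ALERT" or "EXTERNAL COMMAND"
    data: list       # remaining fields, split on ';'


_LINE = re.compile(r"^\[(\d+)\] ([^:]+): (.*)$")


def parse_compat_line(line: str) -> Optional[CompatLogEntry]:
    """Parse one compat log line; return None if it does not match."""
    match = _LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    timestamp, kind, rest = match.groups()
    return CompatLogEntry(int(timestamp), kind, rest.split(";"))


entry = parse_compat_line(
    "[1382115731] SERVICE ALERT: localhost;ping6;CRITICAL;SOFT;2;critical test"
)
print(entry.kind, entry.data)
```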

DB IDO

The IDO (Icinga Data Output) modules for Icinga 2 take care of exporting all configuration and status information into a database. The IDO database is used by a number of projects including Icinga Web 1.x and 2.

Details on the installation can be found in the Getting Started chapter. Details on the configuration can be found in the IdoMysqlConnection and IdoPgsqlConnection object configuration documentation. The DB IDO feature supports High Availability in the Icinga 2 cluster.

The following example query checks the health of the current Icinga 2 instance, which writes its status to the DB IDO backend table icinga_programstatus every 10 seconds. By default the query looks 60 seconds into the past, which is a reasonable amount of time; adjust it to your requirements. If the condition is not met, the query returns an empty result.

Tip

Use check plugins to monitor the backend.

Replace the default string with your instance name, if different.

Example for MySQL:

# mysql -u root -p icinga -e "SELECT status_update_time FROM icinga_programstatus ps
  JOIN icinga_instances i ON ps.instance_id=i.instance_id
  WHERE (UNIX_TIMESTAMP(ps.status_update_time) > UNIX_TIMESTAMP(NOW())-60)
  AND i.instance_name='default';"

+---------------------+
| status_update_time  |
+---------------------+
| 2014-05-29 14:29:56 |
+---------------------+

Example for PostgreSQL:

# export PGPASSWORD=icinga; psql -U icinga -d icinga -c "SELECT ps.status_update_time FROM icinga_programstatus AS ps
  JOIN icinga_instances AS i ON ps.instance_id=i.instance_id
  WHERE ((SELECT extract(epoch from status_update_time) FROM icinga_programstatus) > (SELECT extract(epoch from now())-60))
  AND i.instance_name='default'";

status_update_time
------------------------
 2014-05-29 15:11:38+02
(1 row)

A detailed list of the available table attributes can be found in the DB IDO Schema documentation.

Livestatus

The MK Livestatus project implements a query protocol that lets users query their Icinga instance for status information. It can also be used to send commands.

Details on the installation can be found in the Getting Started chapter.

Livestatus Sockets

Unlike the Icinga 1.x addon, Icinga 2 supports two socket types:

  • Unix socket (default)
  • TCP socket

Details on the configuration can be found in the LivestatusListener object configuration.

Livestatus GET Queries

Note

All Livestatus queries require an additional empty line as query end identifier. The unixcat tool is available either as part of the MK Livestatus project or as a separate binary.

There also is a Perl module available in CPAN for accessing the Livestatus socket programmatically: Monitoring::Livestatus

Example using the unix socket:

# echo -e "GET services\n" | unixcat /var/run/icinga2/cmd/livestatus

Example using the tcp socket listening on port 6558:

# echo -e 'GET services\n' | netcat 127.0.0.1 6558

# cat > servicegroups <<EOF
GET servicegroups

EOF

# (cat servicegroups; sleep 1) | netcat 127.0.0.1 6558
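The same query can be sent programmatically. The sketch below builds a query that ends with the required empty line and sends it over the TCP socket; the function names are hypothetical, the address and port are taken from the example above, and closing the write side to signal the end of the request is an assumption of this sketch.

```python
import socket


def build_query(table, headers=()):
    """Return a GET query terminated by the mandatory empty line."""
    lines = ["GET " + table, *headers]
    return "\n".join(lines) + "\n\n"


def query_livestatus(query, host="127.0.0.1", port=6558):
    """Send one query over TCP and return the raw response text."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(query.encode("utf-8"))
        # Signal that no further request data follows.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8")


print(repr(build_query("servicegroups")))
```

Calling `query_livestatus(build_query("servicegroups"))` requires a running LivestatusListener on the given address.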

Livestatus COMMAND Queries

A list of available external commands and their parameters can be found here.

$ echo -e 'COMMAND <externalcommandstring>' | netcat 127.0.0.1 6558

Livestatus Filters

and, or, negate

Operator | Negate | Description
---------|--------|--------------------------
=        | !=     | Equality
~        | !~     | Regex match
=~       | !=~    | Equality ignoring case
~~       | !~~    | Regex ignoring case
<        |        | Less than
>        |        | Greater than
<=       |        | Less than or equal
>=       |        | Greater than or equal

Livestatus Stats

Schema: "Stats: aggregatefunction aggregateattribute"

Aggregate Function Description
sum  
min  
max  
avg sum / count
std standard deviation
suminv sum (1 / value)
avginv suminv / count
count default for any stats query if no aggregate function is defined

Example:

GET hosts
Filter: has_been_checked = 1
Filter: check_type = 0
Stats: sum execution_time
Stats: sum latency
Stats: sum percent_state_change
Stats: min execution_time
Stats: min latency
Stats: min percent_state_change
Stats: max execution_time
Stats: max latency
Stats: max percent_state_change
OutputFormat: json
ResponseHeader: fixed16
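The aggregate functions from the table above can be read as the following calculations over the matching rows. This is a sketch of the arithmetic only, with a hypothetical `aggregate` helper; `std` is left out because the table does not say which variant of the standard deviation is meant.

```python
def aggregate(function, values):
    """Apply one Livestatus-style aggregate function to a list of numbers."""
    values = list(values)
    count = len(values)
    if function == "count":
        return count
    if function == "sum":
        return sum(values)
    if function == "min":
        return min(values)
    if function == "max":
        return max(values)
    if function == "avg":
        return sum(values) / count           # sum / count
    if function == "suminv":
        return sum(1.0 / v for v in values)  # sum (1 / value)
    if function == "avginv":
        return sum(1.0 / v for v in values) / count  # suminv / count
    raise ValueError("unsupported aggregate function: " + function)


# Made-up execution times in seconds.
execution_times = [1.0, 2.0, 4.0]
for name in ("sum", "min", "max", "avg", "suminv", "avginv"):
    print(name, aggregate(name, execution_times))
```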

Livestatus Output

  • CSV

CSV Output uses two levels of array separators: The members array separator is a comma (1st level) while extra info and host|service relation separator is a pipe (2nd level).

Separators can be set using ASCII codes like:

Separators: 10 59 44 124

  • JSON

Default separators.
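The numbers in the Separators header shown above are ASCII codes. The short sketch below decodes them into the characters they stand for; the helper name is hypothetical, and which role each position plays is not spelled out in this section.

```python
def decode_separators(header_value: str) -> list:
    """Turn a space-separated list of ASCII codes into characters."""
    return [chr(int(code)) for code in header_value.split()]


print(decode_separators("10 59 44 124"))
# prints: ['\n', ';', ',', '|']
```

The comma and the pipe correspond to the first and second level array separators described for the CSV output.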

Livestatus Error Codes

Code Description
200 OK
404 Table does not exist
452 Exception on query
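When a query sets `ResponseHeader: fixed16`, the status code is delivered in a 16-byte header ahead of the body. The sketch below assumes the layout described for the MK Livestatus protocol (three digits of status code, a space, an 11-character length field, a newline); that layout is not stated in this document and is an assumption here, as is the function name.

```python
def parse_fixed16_header(header: bytes):
    """Return (status_code, body_length) from a 16-byte response header.

    Assumed layout: bytes 0-2 status code, byte 3 space,
    bytes 4-14 body length padded with spaces, byte 15 newline.
    """
    if len(header) != 16 or header[15:16] != b"\n":
        raise ValueError("expected a 16-byte header ending in a newline")
    status = int(header[0:3].decode("ascii"))
    length = int(header[4:15].decode("ascii"))
    return status, length


# Build a sample header in the assumed layout and parse it again.
sample_header = "{:3d} {:11d}\n".format(200, 13).encode("ascii")
print(parse_fixed16_header(sample_header))
```

A client would read exactly 16 bytes first, check the status against the table above, and then read the announced number of body bytes.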

Livestatus Tables

Table Join Description
hosts   host config and status attributes, services counter
hostgroups   hostgroup config, status attributes and host/service counters
services hosts service config and status attributes
servicegroups   servicegroup config, status attributes and service counters
contacts   contact config and status attributes
contactgroups   contact config, members
commands   command name and line
status   programstatus, config and stats
comments services status attributes
downtimes services status attributes
timeperiods   name and is inside flag
endpoints   config and status attributes
log services, hosts, contacts, commands parses compatlog and shows log attributes
statehist hosts, services parses compatlog and aggregates state change attributes

The commands table is populated with CheckCommand, EventCommand and NotificationCommand objects.

A detailed list of the available table attributes can be found in the Livestatus Schema documentation.

Check Result Files

Icinga 1.x writes its check result files to a temporary spool directory where they are processed at a regular interval. While this is extremely inefficient in terms of performance, it has proven useful for passing passive check results directly into Icinga 1.x, skipping the external command pipe.

Several clustered/distributed environments and check-aggregation addons use that method. In order to support step-by-step migration of these environments, Icinga 2 ships the CheckResultReader object.

There is no feature configuration available for it; the object must be defined on demand in your Icinga 2 objects configuration.

object CheckResultReader "reader" {
  spool_dir = "/data/check-results"
}