# <a id="advanced-topics"></a> Advanced Topics
## <a id="downtimes"></a> Downtimes
Downtimes can be scheduled for planned server maintenance or
any other targeted service outage you are aware of in advance.
Downtimes suppress notifications and may trigger other
downtimes too. If a downtime was set by accident, or its duration
exceeds the actual maintenance, you can cancel it manually.
Planned downtimes are also taken into account by SLA reporting
tools that calculate SLAs from the state and downtime history.
> **Note**
>
> Downtimes may overlap with their start and end times. If there
> are multiple downtimes triggered for one object, the overall downtime depth
> will be more than `1`. This is useful when you want to extend
> a maintenance window that is taking longer than expected.
### <a id="fixed-flexible-downtimes"></a> Fixed and Flexible Downtimes
A `fixed` downtime will be activated at the defined start time, and
removed at the end time. During this time window the service state
will change to `NOT-OK` and then actually trigger the downtime.
Notifications are suppressed and the downtime depth is incremented.

Common scenarios are a planned distribution upgrade on your Linux
servers, or database updates in your warehouse. The customer knows
about a fixed downtime window between 23:00 and 24:00. After 24:00
all problems should be alerted again. The solution is simple:
schedule a `fixed` downtime starting at 23:00 and ending at 24:00.

Unlike a `fixed` downtime, a `flexible` downtime does not necessarily
end at the provided end time. Instead, the downtime is triggered
by a state change within the time span defined by start and end time, and
then lasts for a defined duration in minutes.

Imagine the following scenario: Your service is frequently polled
by users trying to grab free deleted domains for immediate registration.
Between 07:30 and 08:00 the impact will hit for 15 minutes and generate
a network outage visible to the monitoring. The service is still alive,
but answers Icinga 2 service checks too slowly.

For that reason, you may want to schedule a downtime between 07:30 and
08:00 with a duration of 15 minutes. The downtime will then last from
its trigger time until the duration is over. After that, the downtime
is removed (which may happen before or after the actual end time!).
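
The difference between the two downtime types can be sketched in a few lines of Python. This is only an illustrative model of the behaviour described above, not Icinga 2 code; the function name `effective_downtime` and its parameters are made up for this example.

```python
from datetime import datetime, timedelta

def effective_downtime(start, end, fixed, duration=None, trigger_time=None):
    """Return the (begin, finish) window in which the downtime is in effect,
    or None if a flexible downtime was never triggered.

    Hypothetical helper for illustration only.
    """
    if fixed:
        # Fixed: active from start to end, a duration is ignored.
        return (start, end)
    if duration is None:
        raise ValueError("a flexible downtime needs a duration")
    # Flexible: only a state change inside [start, end] triggers it.
    if trigger_time is None or not (start <= trigger_time <= end):
        return None
    # Once triggered it lasts for the duration, regardless of the end time.
    return (trigger_time, trigger_time + duration)

day = datetime(2013, 10, 21)
window_start = day.replace(hour=7, minute=30)
window_end = day.replace(hour=8, minute=0)

# The service goes NOT-OK at 07:50, so the downtime lasts until 08:05,
# which is after the scheduled end time.
triggered = effective_downtime(
    window_start, window_end, fixed=False,
    duration=timedelta(minutes=15),
    trigger_time=day.replace(hour=7, minute=50),
)
```

With the domain-grabbing scenario above, a trigger at 07:50 yields a window from 07:50 to 08:05, while a trigger at 07:35 would end the downtime at 07:50, before the scheduled end time.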
### <a id="scheduling-downtime"></a> Scheduling a downtime
Downtimes can be scheduled either through a web interface (Icinga 1.x Classic UI or Web)
or by using the external command pipe provided by the `ExternalCommandListener`
configuration.
Fixed downtimes require a start and end time (a duration will be ignored).
Flexible downtimes need a start and end time for the time span, and a duration
independent from that time span.
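
As a rough sketch, a downtime request for the external command pipe could be assembled like this. It assumes the Nagios-compatible `SCHEDULE_SVC_DOWNTIME` command format (which expects the duration in seconds and Unix timestamps for start and end); the helper names are hypothetical, and the pipe path is deliberately left as a parameter because it depends on your `ExternalCommandListener` configuration.

```python
import time

def schedule_svc_downtime(host, service, start, end, fixed, duration,
                          author, comment, trigger_id=0, now=None):
    """Build one external command line (assumed Nagios-compatible format).

    start/end are Unix timestamps, duration is given in seconds.
    A trigger_id of 0 means the downtime is not triggered by another one.
    """
    now = int(time.time()) if now is None else int(now)
    fields = [
        "SCHEDULE_SVC_DOWNTIME",
        host,
        service,
        str(int(start)),
        str(int(end)),
        "1" if fixed else "0",   # fixed downtimes ignore the duration
        str(int(trigger_id)),
        str(int(duration)),
        author,
        comment,
    ]
    return "[{}] {}".format(now, ";".join(fields))

def send_command(line, pipe_path):
    """Write a single command line to the command pipe at pipe_path."""
    with open(pipe_path, "w") as pipe:
        pipe.write(line + "\n")

# Flexible downtime: 30 minute window, 15 minutes (900 seconds) duration.
example = schedule_svc_downtime(
    "localhost", "load", 1382333400, 1382335200,
    fixed=False, duration=900,
    author="icingaadmin", comment="Domain grab", now=1382300000,
)
```

Passing the resulting line to `send_command()` together with the path of your command pipe submits the request.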
> **Note**
>
> Modern web interfaces treat services in a downtime as `handled`.
### <a id="triggered-downtimes"></a> Triggered Downtimes
Triggering is optional when scheduling a downtime. If a downtime is already
scheduled for a future maintenance, the current downtime can be triggered by
that downtime. This is useful if you have scheduled a host downtime and
are now scheduling a downtime for a child host, which is then triggered by the
parent downtime on a NOT-OK state change.
### <a id="recurring-downtimes"></a> Recurring Downtimes
[ScheduledDowntime objects](#objecttype-scheduleddowntime) can be used to set up
recurring downtimes for services.
Example:

    template ScheduledDowntime "backup-downtime" {
      author = "icingaadmin",
      comment = "Scheduled downtime for backup",

      ranges = {
        monday = "02:00-03:00",
        tuesday = "02:00-03:00",
        wednesday = "02:00-03:00",
        thursday = "02:00-03:00",
        friday = "02:00-03:00",
        saturday = "02:00-03:00",
        sunday = "02:00-03:00"
      }
    }

    object Host "localhost" inherits "generic-host" {
      ...
      services["load"] = {
        templates = [ "generic-service" ],

        check_command = "load",

        scheduled_downtimes["backup"] = {
          templates = [ "backup-downtime" ]
        }
      },
    }

## <a id="comments"></a> Comments
Comments can be added at runtime and are persistent over restarts. You can
add useful information for others on repeating incidents (for example
"last time syslog at 100% cpu on 17.10.2013 due to stale nfs mount") which
is primarily accessible using web interfaces.

Adding and deleting comments is possible through the external command pipe
provided with the `ExternalCommandListener` configuration. The caller must
pass the comment ID when manipulating an existing comment.
## <a id="acknowledgements"></a> Acknowledgements
If a problem has been alerted and notified, you may signal the other notification
recipients that you are aware of the problem and will handle it.

By sending an acknowledgement to Icinga 2 (using the external command pipe
provided with the `ExternalCommandListener` configuration) all future notifications
are suppressed, a new comment is added with the provided description, and
a notification with the type `NotificationFilterAcknowledgement` is sent
to all notified users.
> **Note**
>
> Modern web interfaces treat acknowledged problems as `handled`.
### <a id="expiring-acknowledgements"></a> Expiring Acknowledgements
Once a problem is acknowledged it may disappear from your `handled problems`
dashboard and no-one ever looks at it again, since the acknowledgement
suppresses notifications too.

This `fire-and-forget` action is quite common. If you're sure that a
current problem should be resolved in the future at a defined time,
you can define an expiration time when acknowledging the problem.
Icinga 2 will clear the acknowledgement once it has expired and start to
re-notify if the problem persists.
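
A minimal sketch of an expiring acknowledgement, assuming the Icinga 1.x style `ACKNOWLEDGE_SVC_PROBLEM_EXPIRE` external command and its field order (sticky, notify, persistent, expiry timestamp, author, comment). The helper names and the default values are assumptions made for this example, so check them against the external command reference of your version.

```python
def acknowledge_with_expiry(host, service, author, comment, now,
                            expire_in_seconds, sticky=2, notify=True,
                            persistent=False):
    """Build an acknowledgement command that expires after a while.

    Assumed format; `now` is a Unix timestamp, sticky=2 follows the
    Nagios convention for a sticky acknowledgement.
    """
    expiry = int(now) + int(expire_in_seconds)
    fields = [
        "ACKNOWLEDGE_SVC_PROBLEM_EXPIRE",
        host,
        service,
        str(sticky),
        "1" if notify else "0",
        "1" if persistent else "0",
        str(expiry),             # time at which the acknowledgement is cleared
        author,
        comment,
    ]
    return "[{}] {}".format(int(now), ";".join(fields))

def acknowledgement_active(expiry, now):
    """Model of the expiry rule: the acknowledgement holds until expiry."""
    return now < expiry

# Acknowledge now, expect the fix within one hour.
ack = acknowledge_with_expiry(
    "localhost", "load", "icingaadmin", "Fix planned",
    now=1382300000, expire_in_seconds=3600,
)
```

If the problem still persists after the expiry timestamp, notifications start again as described above.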
## <a id="cluster"></a> Cluster
An Icinga 2 cluster consists of two or more nodes and can reside on multiple
architectures. The base concept of Icinga 2 is the possibility to add additional
features using components. In case of a cluster setup you have to add the
cluster feature to all nodes. Before you start configuring the different nodes
it's necessary to set up the underlying communication layer based on SSL.
### <a id="certificate-authority-certificates"></a> Certificate Authority and Certificates
Icinga 2 comes with two scripts helping you to create CA and node certificates
for your Icinga 2 cluster.

The first step is the creation of the CA using the following command:

    icinga2-build-ca

Please make sure to export a variable pointing to an empty folder for the created
CA files before running the command:

    export ICINGA_CA="/root/icinga-ca"

In the next step you have to create a certificate and a key file for every node
using the following command:

    icinga2-build-key icinga-node-1

Please create a certificate and a key file for every node in the Icinga 2
cluster and save the CA key in case you want to set up certificates for
additional nodes at a later date.

### <a id="enable-cluster-configuration"></a> Enable the Cluster Configuration
Until the cluster component is moved into an independent feature you have to
enable the required libraries in the `icinga2.conf` configuration file:

    library "cluster"

### <a id="configure-clusterlistener-object"></a> Configure the ClusterListener Object
The ClusterListener needs to be configured on every node in the cluster with the
following settings:

Configuration Setting    |Value
-------------------------|------------------------------------
ca_path                  | path to ca.crt file
cert_path                | path to server certificate
key_path                 | path to server key
bind_port                | port for incoming and outgoing connections
peers                    | array of all reachable nodes

A sample config part can look like this:

    /**
     * Load cluster library and configure ClusterListener using certificate files
     */
    library "cluster"

    object ClusterListener "cluster" {
      ca_path = "/etc/icinga2/ca/ca.crt",
      cert_path = "/etc/icinga2/ca/icinga-node-1.crt",
      key_path = "/etc/icinga2/ca/icinga-node-1.key",

      bind_port = 8888,

      peers = [ "icinga-node-2" ]
    }

> **Note**
>
> The certificate files must be readable by the user Icinga 2 is running as. Also,
> the private key file should not be world-readable.
The `peers` attribute configures the direction used to connect multiple nodes together. If you have
a three-node cluster consisting of

* node-1
* node-2
* node-3

and `node-3` is only reachable from `node-2`, you have to consider this in your
peer configuration.
### <a id="configure-cluster-endpoints"></a> Configure Cluster Endpoints
In addition to the configured port and hostname, every endpoint can have specific
abilities to send configuration files to other nodes and to limit the hosts allowed
to send configuration files.

Configuration Setting    |Value
-------------------------|------------------------------------
host                     | hostname
port                     | port
accept_config            | defines all nodes allowed to send configs
config_files             | defines all files to be sent to that node (**must be an absolute path**)

A sample config part can look like this:

    /**
     * Configure config master endpoint
     */

    object Endpoint "icinga-node-1" {
      host = "icinga-node-1.localdomain",
      port = 8888,
      config_files = ["/etc/icinga2/conf.d/*.conf"]
    }

If you update the configuration files on the configured file sender, it will
force a restart on all receiving nodes after validating the new config.

A sample config part for a config receiver endpoint can look like this:

    /**
     * Configure config receiver endpoint
     */

    object Endpoint "icinga-node-2" {
      host = "icinga-node-2.localdomain",
      port = 8888,
      accept_config = [ "icinga-node-1" ]
    }

By default these configuration files are saved in `/var/lib/icinga2/cluster/config`.

In order to load configuration files which were received from a remote Icinga 2
instance you will have to add the following include directive to your
`icinga2.conf` configuration file:

    include (IcingaLocalStateDir + "/lib/icinga2/cluster/config/*/*")

### <a id="initial-cluster-sync"></a> Initial Cluster Sync
In order to make sure that all of your cluster nodes have the same state you will
have to pick one of the nodes as your initial "master" and copy its state file
to all the other nodes.

The state file is located at `/var/lib/icinga2/icinga2.state`. Before copying
the state file you should make sure that all your cluster nodes are properly shut
down.
### <a id="assign-services-to-cluster-nodes"></a> Assign Services to Cluster Nodes
By default all services are distributed among the cluster nodes that have the `Checker`
feature enabled.

If you require specific services to be executed only by one or more checker nodes
within the cluster, you must define `authorities` as an additional service object
attribute. The required endpoints must be defined as an array.

    object Host "dmz-host1" inherits "generic-host" {
      services["dmz-oracledb"] = {
        templates = [ "generic-service" ],
        authorities = [ "icinga-node-1" ],
      }
    }

> **Tip**
>
> The most common use case is building a classic master-slave setup. The master node
> does not have the `Checker` feature enabled, and the slave nodes are checking
> services based on their location, inheriting from a global service template
> defining the authorities.
### <a id="cluster-health-check"></a> Cluster Health Check
The Icinga 2 [ITL](#itl) ships an internal check command checking all configured
`Endpoint` objects in the cluster setup. The check result will become critical if
one or more configured nodes are not connected.

Example:

    object Host "icinga2a" inherits "generic-host" {
      services["cluster"] = {
        templates = [ "generic-service" ],
        check_interval = 1m,
        check_command = "cluster",
        authorities = [ "icinga2a" ]
      },
    }

> **Note**
>
> Each cluster node should execute its own local cluster health check to
> get an idea about network-related connection problems from different
> points of view. Use the `authorities` attribute to assign the service
> check to the configured node.
### <a id="host-multiple-cluster-nodes"></a> Host With Multiple Cluster Nodes
Special scenarios might require multiple cluster nodes running on a single host.
By default Icinga 2 and its features store their runtime data below the prefix
`IcingaLocalStateDir`, which packages set to `/var` by default.
You can either set that variable as a constant in
[icinga2.conf](#icinga2-conf) or pass it as a runtime variable to
the Icinga 2 daemon:

    # icinga2 -c /etc/icinga2/node1/icinga2.conf -DIcingaLocalStateDir=/opt/node1/var

## <a id="domains"></a> Domains
A [Service](#objecttype-service) object can be restricted using the `domains` attribute
array specifying endpoint privileges.
A Domain object specifies the ACLs applied for each [Endpoint](#objecttype-endpoint).

The following example assigns the domain `dmz-db` to the service `dmz-oracledb`. Endpoint
`icinga-node-dmz-1` does not allow any object modification (no commands, check results) and only
relays local messages to the remote node(s). The endpoint `icinga-node-dmz-2` processes all
messages with read and write access (it accepts check results and commands and also relays
messages to remote nodes).

That way the service `dmz-oracledb` on endpoint `icinga-node-dmz-1` will not be modified
by any cluster event message, and could also be checked by the local authority, which would
present a different state history. `icinga-node-dmz-2` still receives all cluster message updates
from the `icinga-node-dmz-1` endpoint.

    object Host "dmz-host1" inherits "generic-host" {
      services["dmz-oracledb"] = {
        templates = [ "generic-service" ],
        domains = [ "dmz-db" ],
        authorities = [ "icinga-node-dmz-1", "icinga-node-dmz-2"],
      }
    }

    object Domain "dmz-db" {
      acl = {
        icinga-node-dmz-1 = (DomainPrivReadOnly),
        icinga-node-dmz-2 = (DomainPrivReadWrite)
      }
    }

## <a id="dependencies"></a> Dependencies
Icinga 2 uses host and service [Dependency](#objecttype-dependency) objects, defined either directly
or inline in a `dependencies` dictionary. The `parent_host` and `parent_service`
attributes are mandatory; the `child_host` and `child_service` attributes can be omitted in
inline definitions within an existing service object or service inline definition.

A service can depend on a host, and vice versa. A service has an implicit dependency (parent)
on its host. A host-to-host dependency implicitly acts as a host parent relation.
A common scenario is the Icinga 2 server behind a router. Checking internet
access by pinging the Google DNS server `google-dns` is a common method, but
will fail in case the `dsl-router` host is down. Therefore the example below
defines a host dependency which implicitly acts as a parent relation too.

Furthermore the host may be reachable but ping samples are dropped by the
router's firewall. In case the `dsl-router` `ping4` service check fails, all
further checks for the `google-dns` `ping4` service should be suppressed.
This is achieved by setting the `disable_checks` attribute to `true`.

    object Host "dsl-router" {
      services["ping4"] = {
        templates = "generic-service",
        check_command = "ping4"
      }

      macros = {
        address = "192.168.1.1",
      },
    }

    object Host "google-dns" {
      services["ping4"] = {
        templates = "generic-service",
        check_command = "ping4",
        dependencies["dsl-router-ping4"] = {
          parent_host = "dsl-router",
          parent_service = "ping4",
          disable_checks = true
        }
      }

      macros = {
        address = "8.8.8.8",
      },

      dependencies["dsl-router"] = {
        parent_host = "dsl-router"
      },
    }

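
The effect of `disable_checks` can be pictured with a small toy model. This is not how Icinga 2 evaluates dependencies internally; the data layout and the function name `checks_suppressed` are assumptions chosen only to mirror the example above.

```python
def checks_suppressed(child, states, dependencies):
    """Return True if checks for `child` should be skipped.

    child        -- (host, service) tuple
    states       -- dict mapping (host, service) to a state string
    dependencies -- dict mapping child to a list of (parent, disable_checks)

    Toy model: a parent without a known state is treated as OK.
    """
    for parent, disable_checks in dependencies.get(child, []):
        if disable_checks and states.get(parent, "OK") != "OK":
            return True
    return False

# Mirrors the configuration above: google-dns/ping4 depends on
# dsl-router/ping4 with disable_checks = true.
deps = {
    ("google-dns", "ping4"): [(("dsl-router", "ping4"), True)],
}

router_down = {("dsl-router", "ping4"): "CRITICAL"}
router_up = {("dsl-router", "ping4"): "OK"}
```

While the router's `ping4` service is failing, `checks_suppressed(("google-dns", "ping4"), router_down, deps)` is true, and checks resume as soon as the parent service is OK again.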
## <a id="check-result-freshness"></a> Check Result Freshness
In Icinga 2 active check freshness is enabled by default. It is determined by the
`check_interval` attribute: a check result is no longer fresh if no new check
result came in within that period of time.

    threshold = last check execution time + check interval

Passive check freshness is calculated from the `check_interval` attribute if set.

    threshold = last check result time + check interval

If a check result is no longer fresh, a new check defined by the
`check_command` attribute is executed.
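
The threshold formula translates directly into code. The sketch below only restates the arithmetic; the function names are hypothetical, and treating a result as stale strictly after the threshold (rather than at it) is an assumption.

```python
def freshness_threshold(last_check, check_interval):
    """threshold = last check (execution or result) time + check interval."""
    return last_check + check_interval

def is_stale(last_check, check_interval, now):
    """A result counts as stale once `now` has passed the threshold."""
    return now > freshness_threshold(last_check, check_interval)

# Last result at t=1000 with a 5 minute (300 second) check interval:
threshold = freshness_threshold(1000, 300)
```

Here the threshold is 1300, so a result last seen at 1000 is still fresh at 1300 and stale from 1301 on, at which point a new check would be executed.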
## <a id="check-flapping"></a> Check Flapping
The flapping algorithm used in Icinga 2 does not store the past states but
calculates a single flapping value based on counters and
half-life values. Icinga 2 compares this value with the flapping threshold
configured in the `flapping_threshold` attribute.
> **Note**
>
> Flapping must be explicitly enabled by setting the `Service` object attribute
> `enable_flapping = 1`.
## <a id="volatile-services"></a> Volatile Services
By default all services remain in a non-volatile state. When a problem
occurs, the `SOFT` state applies, and once the check counter reaches the
`max_check_attempts` attribute, a `HARD` state transition happens.
Notifications are only triggered by `HARD` state changes and are then
re-sent at the interval defined by the `notification_interval` attribute.

It may be reasonable to have a volatile service which stays in a `HARD`
state type if the service stays in a `NOT-OK` state. That way each
service recheck will automatically trigger a notification unless the
service is acknowledged or in a scheduled downtime.
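
The following sketch models a run of consecutive `NOT-OK` check results to show where volatile and non-volatile services differ. It is a simplified illustration with a made-up function name: re-notification via `notification_interval`, acknowledgements and downtimes are deliberately left out.

```python
def problem_timeline(failures, max_check_attempts, volatile=False):
    """Return a list of (state_type, notify) tuples, one per consecutive
    NOT-OK check result.

    Simplified model: only notifications caused directly by a check
    result are considered.
    """
    timeline = []
    for attempt in range(1, failures + 1):
        if attempt < max_check_attempts:
            # Still counting attempts: SOFT state, no notification.
            timeline.append(("SOFT", False))
        elif attempt == max_check_attempts:
            # Counter reached max_check_attempts: HARD state change.
            timeline.append(("HARD", True))
        else:
            # Already HARD: only a volatile service notifies on every recheck.
            timeline.append(("HARD", volatile))
    return timeline

normal = problem_timeline(4, max_check_attempts=3)
noisy = problem_timeline(4, max_check_attempts=3, volatile=True)
```

Both timelines are identical up to the `HARD` state change on the third attempt; only the volatile service notifies again on the fourth check result.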
## <a id="modified-attributes"></a> Modified Attributes
Icinga 2 allows you to modify defined object attributes at runtime, so that they
differ from the attributes in the local configuration. These modified attributes are
stored as a bit-shifted value and made available in backends. Icinga 2 stores
modified attributes in its state file and restores them on restart.

Modified attributes can be reset using external commands.
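
To illustrate what a bit-shifted value means here, the sketch below tracks modified attributes as bits in a single integer. The flag names and bit positions are hypothetical and chosen for the example only; they are not taken from the Icinga 2 source.

```python
# Hypothetical flags: one bit per attribute that can be changed at runtime.
MODATTR_NOTIFICATIONS_ENABLED = 1 << 0   # 1
MODATTR_ACTIVE_CHECKS_ENABLED = 1 << 1   # 2
MODATTR_PASSIVE_CHECKS_ENABLED = 1 << 2  # 4
MODATTR_FLAP_DETECTION_ENABLED = 1 << 4  # 16

def mark_modified(value, flag):
    """Record that the attribute behind `flag` was changed at runtime."""
    return value | flag

def is_modified(value, flag):
    """Check whether the attribute behind `flag` differs from the config."""
    return bool(value & flag)

def reset(value, flag):
    """Clear the flag, as a reset via external command would."""
    return value & ~flag

modified = 0
modified = mark_modified(modified, MODATTR_NOTIFICATIONS_ENABLED)
modified = mark_modified(modified, MODATTR_FLAP_DETECTION_ENABLED)
after_reset = reset(modified, MODATTR_NOTIFICATIONS_ENABLED)
```

A backend that reads the combined value can test each bit individually, which is why a single integer is enough to describe all runtime modifications of an object.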
## <a id="plugin-api"></a> Plugin API
Currently the native plugin API inherited from the `Monitoring Plugins` (formerly
`Nagios Plugins`) project is available.
Future specifications will be documented here.
### <a id="monitoring-plugin-api"></a> Monitoring Plugin API
The `Monitoring Plugin API` (formerly `Nagios Plugin API`) is defined in the
[Monitoring Plugins Development Guidelines](https://www.monitoring-plugins.org/doc/guidelines.html).
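
As a minimal sketch of what the guidelines ask for, a plugin reports its result through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a single line of output, optionally followed by performance data after a `|`. The check name, thresholds and function names below are made up for illustration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

def evaluate(value, warn, crit):
    """Map a measured value to a plugin exit status."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK

def plugin_output(name, value, warn, crit):
    """Return (exit_status, output_line) for a hypothetical check."""
    status = evaluate(value, warn, crit)
    text = "{} {} - value is {}".format(name, LABELS[status], value)
    # Performance data: label=value;warn;crit
    perfdata = "value={};{};{}".format(value, warn, crit)
    return status, "{} | {}".format(text, perfdata)

def main():
    """Print the output line and return the exit status for sys.exit()."""
    status, line = plugin_output("USERS", 7, 5, 10)
    print(line)
    return status
```

A real plugin script would end with `sys.exit(main())` so that the exit status reaches Icinga 2, which maps it to the service state.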