From 6490d03b1fad31da26c1f135469d8724ed563f4b Mon Sep 17 00:00:00 2001 From: Gunnar Beutner Date: Wed, 11 Feb 2015 14:15:03 +0100 Subject: [PATCH] Add missing file --- doc/3-monitoring-basics.md | 2 +- doc/4-advanced-topics.md | 636 +++++++++++++++++++++++++++++++++++++ 2 files changed, 637 insertions(+), 1 deletion(-) create mode 100644 doc/4-advanced-topics.md diff --git a/doc/3-monitoring-basics.md b/doc/3-monitoring-basics.md index 2dc761f83..4ec7d2cd6 100644 --- a/doc/3-monitoring-basics.md +++ b/doc/3-monitoring-basics.md @@ -716,7 +716,7 @@ This can be done for service and user groups the same way: ### Group Membership Assign -Instead of manually assigning each object to a group you can also assign objects +Instead of manually assigning each object to a group you can also assign objects to a group based on their attributes: object HostGroup "prod-mssql" { diff --git a/doc/4-advanced-topics.md b/doc/4-advanced-topics.md new file mode 100644 index 000000000..af4e9b42b --- /dev/null +++ b/doc/4-advanced-topics.md @@ -0,0 +1,636 @@ +# Advanced Topics + +This chapter covers a number of advanced topics. If you're new to Icinga you +can safely skip over things you're not interested in. + +## Downtimes + +Downtimes can be scheduled for planned server maintenance or +any other targetted service outage you are aware of in advance. + +Downtimes will suppress any notifications, and may trigger other +downtimes too. If the downtime was set by accident, or the duration +exceeds the maintenance, you can manually cancel the downtime. +Planned downtimes will also be taken into account for SLA reporting +tools calculating the SLAs based on the state and downtime history. + +Multiple downtimes for a single object may overlap. This is useful +when you want to extend your maintenance window taking longer than expected. +If there are multiple downtimes triggered for one object, the overall downtime depth +will be greater than `1`. + + +If the downtime was scheduled after the problem changed to a critical hard +state triggering a problem notification, and the service recovers during +the downtime window, the recovery notification won't be suppressed. + +### Fixed and Flexible Downtimes + +A `fixed` downtime will be activated at the defined start time, and +removed at the end time. During this time window the service state +will change to `NOT-OK` and then actually trigger the downtime. +Notifications are suppressed and the downtime depth is incremented. + +Common scenarios are a planned distribution upgrade on your linux +servers, or database updates in your warehouse. The customer knows +about a fixed downtime window between 23:00 and 24:00. After 24:00 +all problems should be alerted again. Solution is simple - +schedule a `fixed` downtime starting at 23:00 and ending at 24:00. + +Unlike a `fixed` downtime, a `flexible` downtime will be triggered +by the state change in the time span defined by start and end time, +and then last for the specified duration in minutes. + +Imagine the following scenario: Your service is frequently polled +by users trying to grab free deleted domains for immediate registration. +Between 07:30 and 08:00 the impact will hit for 15 minutes and generate +a network outage visible to the monitoring. The service is still alive, +but answering too slow to Icinga 2 service checks. +For that reason, you may want to schedule a downtime between 07:30 and +08:00 with a duration of 15 minutes. The downtime will then last from +its trigger time until the duration is over. After that, the downtime +is removed (may happen before or after the actual end time!). + +### Scheduling a downtime + +This can either happen through a web interface or by sending an [external command](4-advanced-topics.md#external-commands) +to the external command pipe provided by the `ExternalCommandListener` configuration. + +Fixed downtimes require a start and end time (a duration will be ignored). +Flexible downtimes need a start and end time for the time span, and a duration +independent from that time span. + +### Triggered Downtimes + +This is optional when scheduling a downtime. If there is already a downtime +scheduled for a future maintenance, the current downtime can be triggered by +that downtime. This renders useful if you have scheduled a host downtime and +are now scheduling a child host's downtime getting triggered by the parent +downtime on NOT-OK state change. + +### Recurring Downtimes + +[ScheduledDowntime objects](6-object-types.md#objecttype-scheduleddowntime) can be used to set up +recurring downtimes for services. + +Example: + + apply ScheduledDowntime "backup-downtime" to Service { + author = "icingaadmin" + comment = "Scheduled downtime for backup" + + ranges = { + monday = "02:00-03:00" + tuesday = "02:00-03:00" + wednesday = "02:00-03:00" + thursday = "02:00-03:00" + friday = "02:00-03:00" + saturday = "02:00-03:00" + sunday = "02:00-03:00" + } + + assign where "backup" in service.groups + } + + +## Comments + +Comments can be added at runtime and are persistent over restarts. You can +add useful information for others on repeating incidents (for example +"last time syslog at 100% cpu on 17.10.2013 due to stale nfs mount") which +is primarly accessible using web interfaces. + +Adding and deleting comment actions are possible through the external command pipe +provided with the `ExternalCommandListener` configuration. The caller must +pass the comment id in case of manipulating an existing comment. + + +## Acknowledgements + +If a problem is alerted and notified you may signal the other notification +recipients that you are aware of the problem and will handle it. + +By sending an acknowledgement to Icinga 2 (using the external command pipe +provided with `ExternalCommandListener` configuration) all future notifications +are suppressed, a new comment is added with the provided description and +a notification with the type `NotificationFilterAcknowledgement` is sent +to all notified users. + +### Expiring Acknowledgements + +Once a problem is acknowledged it may disappear from your `handled problems` +dashboard and no-one ever looks at it again since it will suppress +notifications too. + +This `fire-and-forget` action is quite common. If you're sure that a +current problem should be resolved in the future at a defined time, +you can define an expiration time when acknowledging the problem. + +Icinga 2 will clear the acknowledgement when expired and start to +re-notify if the problem persists. + + +## Time Periods + +Time Periods define time ranges in Icinga where event actions are +triggered, for example whether a service check is executed or not within +the `check_period` attribute. Or a notification should be sent to +users or not, filtered by the `period` and `notification_period` +configuration attributes for `Notification` and `User` objects. + +> **Note** +> +> If you are familar with Icinga 1.x - these time period definitions +> are called `legacy timeperiods` in Icinga 2. +> +> An Icinga 2 legacy timeperiod requires the `ITL` provided template +>`legacy-timeperiod`. + +The `TimePeriod` attribute `ranges` may contain multiple directives, +including weekdays, days of the month, and calendar dates. +These types may overlap/override other types in your ranges dictionary. + +The descending order of precedence is as follows: + +* Calendar date (2008-01-01) +* Specific month date (January 1st) +* Generic month date (Day 15) +* Offset weekday of specific month (2nd Tuesday in December) +* Offset weekday (3rd Monday) +* Normal weekday (Tuesday) + +If you don't set any `check_period` or `notification_period` attribute +on your configuration objects Icinga 2 assumes `24x7` as time period +as shown below. + + object TimePeriod "24x7" { + import "legacy-timeperiod" + + display_name = "Icinga 2 24x7 TimePeriod" + ranges = { + "monday" = "00:00-24:00" + "tuesday" = "00:00-24:00" + "wednesday" = "00:00-24:00" + "thursday" = "00:00-24:00" + "friday" = "00:00-24:00" + "saturday" = "00:00-24:00" + "sunday" = "00:00-24:00" + } + } + +If your operation staff should only be notified during workhours +create a new timeperiod named `workhours` defining a work day from +09:00 to 17:00. + + object TimePeriod "workhours" { + import "legacy-timeperiod" + + display_name = "Icinga 2 8x5 TimePeriod" + ranges = { + "monday" = "09:00-17:00" + "tuesday" = "09:00-17:00" + "wednesday" = "09:00-17:00" + "thursday" = "09:00-17:00" + "friday" = "09:00-17:00" + } + } + +Use the `period` attribute to assign time periods to +`Notification` and `Dependency` objects: + + object Notification "mail" { + import "generic-notification" + + host_name = "localhost" + + command = "mail-notification" + users = [ "icingaadmin" ] + period = "workhours" + } + + +## Check Result Freshness + +In Icinga 2 active check freshness is enabled by default. It is determined by the +`check_interval` attribute and no incoming check results in that period of time. + + threshold = last check execution time + check interval + +Passive check freshness is calculated from the `check_interval` attribute if set. + + threshold = last check result time + check interval + +If the freshness checks are invalid, a new check is executed defined by the +`check_command` attribute. + + +## Check Flapping + +The flapping algorithm used in Icinga 2 does not store the past states but +calculcates the flapping threshold from a single value based on counters and +half-life values. Icinga 2 compares the value with a single flapping threshold +configuration attribute named `flapping_threshold`. + +Flapping detection can be enabled or disabled using the `enable_flapping` attribute. + + +## Volatile Services + +By default all services remain in a non-volatile state. When a problem +occurs, the `SOFT` state applies and once `max_check_attempts` attribute +is reached with the check counter, a `HARD` state transition happens. +Notifications are only triggered by `HARD` state changes and are then +re-sent defined by the `interval` attribute. + +It may be reasonable to have a volatile service which stays in a `HARD` +state type if the service stays in a `NOT-OK` state. That way each +service recheck will automatically trigger a notification unless the +service is acknowledged or in a scheduled downtime. + + +## External Commands + +Icinga 2 provides an external command pipe for processing commands +triggering specific actions (for example rescheduling a service check +through the web interface). + +In order to enable the `ExternalCommandListener` configuration use the +following command and restart Icinga 2 afterwards: + + # icinga2 feature enable command + +Icinga 2 creates the command pipe file as `/var/run/icinga2/cmd/icinga2.cmd` +using the default configuration. + +Web interfaces and other Icinga addons are able to send commands to +Icinga 2 through the external command pipe, for example for rescheduling +a forced service check: + + # /bin/echo "[`date +%s`] SCHEDULE_FORCED_SVC_CHECK;localhost;ping4;`date +%s`" >> /var/run/icinga2/cmd/icinga2.cmd + + # tail -f /var/log/messages + + Oct 17 15:01:25 icinga-server icinga2: Executing external command: [1382014885] SCHEDULE_FORCED_SVC_CHECK;localhost;ping4;1382014885 + Oct 17 15:01:25 icinga-server icinga2: Rescheduling next check for service 'ping4' + + +### External Command List + +A list of currently supported external commands can be found [here](18-appendix.md#external-commands-list-detail). + +Detailed information on the commands and their required parameters can be found +on the [Icinga 1.x documentation](http://docs.icinga.org/latest/en/extcommands2.html). + +## Logging + +Icinga 2 supports three different types of logging: + +* File logging +* Syslog (on *NIX-based operating systems) +* Console logging (`STDOUT` on tty) + +You can enable additional loggers using the `icinga2 feature enable` +and `icinga2 feature disable` commands to configure loggers: + +Feature | Description +---------|------------ +debuglog | Debug log (path: `/var/log/icinga2/debug.log`, severity: `debug` or higher) +mainlog | Main log (path: `/var/log/icinga2/icinga2.log`, severity: `information` or higher) +syslog | Syslog (severity: `warning` or higher) + +By default file the `mainlog` feature is enabled. When running Icinga 2 +on a terminal log messages with severity `information` or higher are +written to the console. + +## Performance Data + +When a host or service check is executed plugins should provide so-called +`performance data`. Next to that additional check performance data +can be fetched using Icinga 2 runtime macros such as the check latency +or the current service state (or additional custom attributes). + +The performance data can be passed to external applications which aggregate and +store them in their backends. These tools usually generate graphs for historical +reporting and trending. + +Well-known addons processing Icinga performance data are PNP4Nagios, +inGraph and Graphite. + +### Writing Performance Data Files + +PNP4Nagios, inGraph and Graphios use performance data collector daemons to fetch +the current performance files for their backend updates. + +Therefore the Icinga 2 `PerfdataWriter` object allows you to define +the output template format for host and services backed with Icinga 2 +runtime vars. + + host_format_template = "DATATYPE::HOSTPERFDATA\tTIMET::$icinga.timet$\tHOSTNAME::$host.name$\tHOSTPERFDATA::$host.perfdata$\tHOSTCHECKCOMMAND::$host.checkcommand$\tHOSTSTATE::$host.state$\tHOSTSTATETYPE::$host.statetype$" + service_format_template = "DATATYPE::SERVICEPERFDATA\tTIMET::$icinga.timet$\tHOSTNAME::$host.name$\tSERVICEDESC::$service.name$\tSERVICEPERFDATA::$service.perfdata$\tSERVICECHECKCOMMAND::$service.checkcommand$\tHOSTSTATE::$host.state$\tHOSTSTATETYPE::$host.statetype$\tSERVICESTATE::$service.state$\tSERVICESTATETYPE::$service.statetype$" + +The default templates are already provided with the Icinga 2 feature configuration +which can be enabled using + + # icinga2 feature enable perfdata + +By default all performance data files are rotated in a 15 seconds interval into +the `/var/spool/icinga2/perfdata/` directory as `host-perfdata.` and +`service-perfdata.`. +External collectors need to parse the rotated performance data files and then +remove the processed files. + +### Graphite Carbon Cache Writer + +While there are some Graphite collector scripts and daemons like Graphios available for +Icinga 1.x it's more reasonable to directly process the check and plugin performance +in memory in Icinga 2. Once there are new metrics available, Icinga 2 will directly +write them to the defined Graphite Carbon daemon tcp socket. + +You can enable the feature using + + # icinga2 feature enable graphite + +By default the `GraphiteWriter` object expects the Graphite Carbon Cache to listen at +`127.0.0.1` on TCP port `2003`. + +The current naming schema is + + icinga.. + icinga... + +You can customize the metric prefix name by using the `host_name_template` and +`service_name_template` configuration attributes. + +The example below uses [runtime macros](3-monitoring-basics.md#runtime-macros) and a +[global constant](16-language-reference.md#constants) named `GraphiteEnv`. The constant name +is freely definable and should be put in the [constants.conf](5-configuring-icinga-2.md#constants-conf) file. + + const GraphiteEnv = "icinga.env1" + + object GraphiteWriter "graphite" { + host_name_template = GraphiteEnv + ".$host.name$" + service_name_template = GraphiteEnv + ".$host.name$.$service.name$" + } + +To make sure Icinga 2 writes a valid label into Graphite some characters are replaced +with `_` in the target name: + + \/.- (and space) + +The resulting name in Graphite might look like: + + www-01 / http-cert / response time + icinga.www_01.http_cert.response_time + +In addition to the performance data retrieved from the check plugin, Icinga 2 sends +internal check statistic data to Graphite: + + metric | description + -------------------|------------------------------------------ + current_attempt | current check attempt + max_check_attempts | maximum check attempts until the hard state is reached + reachable | checked object is reachable + downtime_depth | number of downtimes this object is in + execution_time | check execution time + latency | check latency + state | current state of the checked object + state_type | 0=SOFT, 1=HARD state + +The following example illustrates how to configure the storage-schemas for Graphite Carbon +Cache. Please make sure that the order is correct because the first match wins. + + [icinga_internals] + pattern = ^icinga\..*\.(max_check_attempts|reachable|current_attempt|execution_time|latency|state|state_type) + retentions = 5m:7d + + [icinga_default] + # intervals like PNP4Nagios uses them per default + pattern = ^icinga\. + retentions = 1m:2d,5m:10d,30m:90d,360m:4y + +### GELF Writer + +The `Graylog Extended Log Format` (short: [GELF](http://www.graylog2.org/resources/gelf)) +can be used to send application logs directly to a TCP socket. + +While it has been specified by the [graylog2](http://www.graylog2.org/) project as their +[input resource standard](http://www.graylog2.org/resources/gelf), other tools such as +[Logstash](http://www.logstash.net) also support `GELF` as +[input type](http://logstash.net/docs/latest/inputs/gelf). + +You can enable the feature using + + # icinga2 feature enable gelf + +By default the `GelfWriter` object expects the GELF receiver to listen at `127.0.0.1` on TCP port `12201`. +The default `source` attribute is set to `icinga2`. You can customize that for your needs if required. + +Currently these events are processed: +* Check results +* State changes +* Notifications + +### OpenTSDB Writer + +While there are some OpenTSDB collector scripts and daemons like tcollector available for +Icinga 1.x it's more reasonable to directly process the check and plugin performance +in memory in Icinga 2. Once there are new metrics available, Icinga 2 will directly +write them to the defined TSDB TCP socket. + +You can enable the feature using + + # icinga2 feature enable opentsdb + +By default the `OpenTsdbWriter` object expects the TSD to listen at +`127.0.0.1` on port `4242`. + +The current naming schema is + + icinga.host. + icinga.service.. + +for host and service checks. The tag host is always applied. + +To make sure Icinga 2 writes a valid metric into OpenTSDB some characters are replaced +with `_` in the target name: + + \ (and space) + +The resulting name in OpenTSDB might look like: + + www-01 / http-cert / response time + icinga.http_cert.response_time + +In addition to the performance data retrieved from the check plugin, Icinga 2 sends +internal check statistic data to OpenTSDB: + + metric | description + -------------------|------------------------------------------ + current_attempt | current check attempt + max_check_attempts | maximum check attempts until the hard state is reached + reachable | checked object is reachable + downtime_depth | number of downtimes this object is in + execution_time | check execution time + latency | check latency + state | current state of the checked object + state_type | 0=SOFT, 1=HARD state + +While reachable, state and state_type are metrics for the host or service the +other metrics follow the current naming schema + + icinga.check. + +with the following tags + + tag | description + --------|------------------------------------------ + type | the check type, one of [host, service] + host | hostname, the check ran on + service | the service name (if type=service) + +> **Note** +> +> You might want to set the tsd.core.auto_create_metrics setting to `true` +> in your opentsdb.conf configuration file. + + +## Status Data + +Icinga 1.x writes object configuration data and status data in a cyclic +interval to its `objects.cache` and `status.dat` files. Icinga 2 provides +the `StatusDataWriter` object which dumps all configuration objects and +status updates in a regular interval. + + # icinga2 feature enable statusdata + +Icinga 1.x Classic UI requires this data set as part of its backend. + +> **Note** +> +> If you are not using any web interface or addon which uses these files +> you can safely disable this feature. + + +## Compat Logging + +The Icinga 1.x log format is considered being the `Compat Log` +in Icinga 2 provided with the `CompatLogger` object. + +These logs are not only used for informational representation in +external web interfaces parsing the logs, but also to generate +SLA reports and trends in Icinga 1.x Classic UI. Furthermore the +[Livestatus](12-livestatus.md#setting-up-livestatus) feature uses these logs for answering queries to +historical tables. + +The `CompatLogger` object can be enabled with + + # icinga2 feature enable compatlog + +By default, the Icinga 1.x log file called `icinga.log` is located +in `/var/log/icinga2/compat`. Rotated log files are moved into +`var/log/icinga2/compat/archives`. + +The format cannot be changed without breaking compatibility to +existing log parsers. + + # tail -f /var/log/icinga2/compat/icinga.log + + [1382115688] LOG ROTATION: HOURLY + [1382115688] LOG VERSION: 2.0 + [1382115688] HOST STATE: CURRENT;localhost;UP;HARD;1; + [1382115688] SERVICE STATE: CURRENT;localhost;disk;WARNING;HARD;1; + [1382115688] SERVICE STATE: CURRENT;localhost;http;OK;HARD;1; + [1382115688] SERVICE STATE: CURRENT;localhost;load;OK;HARD;1; + [1382115688] SERVICE STATE: CURRENT;localhost;ping4;OK;HARD;1; + [1382115688] SERVICE STATE: CURRENT;localhost;ping6;OK;HARD;1; + [1382115688] SERVICE STATE: CURRENT;localhost;processes;WARNING;HARD;1; + [1382115688] SERVICE STATE: CURRENT;localhost;ssh;OK;HARD;1; + [1382115688] SERVICE STATE: CURRENT;localhost;users;OK;HARD;1; + [1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;disk;1382115705 + [1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;http;1382115705 + [1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;load;1382115705 + [1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;ping4;1382115705 + [1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;ping6;1382115705 + [1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;processes;1382115705 + [1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;ssh;1382115705 + [1382115706] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;localhost;users;1382115705 + [1382115731] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;ping6;2;critical test| + [1382115731] SERVICE ALERT: localhost;ping6;CRITICAL;SOFT;2;critical test + + + + +## DB IDO + +The IDO (Icinga Data Output) modules for Icinga 2 take care of exporting all +configuration and status information into a database. The IDO database is used +by a number of projects including Icinga Web 1.x and 2. + +Details on the installation can be found in the [Configuring DB IDO](#configuring-db-ido) +chapter. Details on the configuration can be found in the +[IdoMysqlConnection](6-object-types.md#objecttype-idomysqlconnection) and +[IdoPgsqlConnection](6-object-types.md#objecttype-idopgsqlconnection) +object configuration documentation. +The DB IDO feature supports [High Availability](9-monitoring-remote-systems.md#high-availability-db-ido) in +the Icinga 2 cluster. + +The following example query checks the health of the current Icinga 2 instance +writing its current status to the DB IDO backend table `icinga_programstatus` +every 10 seconds. By default it checks 60 seconds into the past which is a reasonable +amount of time - adjust it for your requirements. If the condition is not met, +the query returns an empty result. + +> **Tip** +> +> Use [check plugins](10-addons-plugins.md#plugins) to monitor the backend. + +Replace the `default` string with your instance name, if different. + +Example for MySQL: + + # mysql -u root -p icinga -e "SELECT status_update_time FROM icinga_programstatus ps + JOIN icinga_instances i ON ps.instance_id=i.instance_id + WHERE (UNIX_TIMESTAMP(ps.status_update_time) > UNIX_TIMESTAMP(NOW())-60) + AND i.instance_name='default';" + + +---------------------+ + | status_update_time | + +---------------------+ + | 2014-05-29 14:29:56 | + +---------------------+ + + +Example for PostgreSQL: + + # export PGPASSWORD=icinga; psql -U icinga -d icinga -c "SELECT ps.status_update_time FROM icinga_programstatus AS ps + JOIN icinga_instances AS i ON ps.instance_id=i.instance_id + WHERE ((SELECT extract(epoch from status_update_time) FROM icinga_programstatus) > (SELECT extract(epoch from now())-60)) + AND i.instance_name='default'"; + + status_update_time + ------------------------ + 2014-05-29 15:11:38+02 + (1 Zeile) + + +A detailed list on the available table attributes can be found in the [DB IDO Schema documentation](18-appendix.md#schema-db-ido). + + +## Check Result Files + +Icinga 1.x writes its check result files to a temporary spool directory +where they are processed in a regular interval. +While this is extremely inefficient in performance regards it has been +rendered useful for passing passive check results directly into Icinga 1.x +skipping the external command pipe. + +Several clustered/distributed environments and check-aggregation addons +use that method. In order to support step-by-step migration of these +environments, Icinga 2 supports the `CheckResultReader` object. + +There is no feature configuration available, but it must be defined +on-demand in your Icinga 2 objects configuration. + + object CheckResultReader "reader" { + spool_dir = "/data/check-results" + }