From 95f0a7a0221e03b0d95ef3707cd909c44e37b46a Mon Sep 17 00:00:00 2001
From: Michael Friedrich
Date: Fri, 8 Sep 2017 13:40:09 +0200
Subject: [PATCH] Docs: Technical Concepts for cluster and signing

refs #5450
---
 doc/03-monitoring-basics.md                   |   2 +-
 doc/07-agent-based-monitoring.md              |   2 +-
 doc/14-features.md                            |  15 +-
 doc/15-troubleshooting.md                     |   2 +-
 doc/19-technical-concepts.md                  | 270 ++++++++++++++++++
 ...ript-debugger.md => 20-script-debugger.md} |   0
 doc/{20-development.md => 21-development.md}  |   0
 doc/{21-selinux.md => 22-selinux.md}          |   0
 ...a-1x.md => 23-migrating-from-icinga-1x.md} |  16 +-
 doc/{23-appendix.md => 24-appendix.md}        |  12 +-
 10 files changed, 297 insertions(+), 22 deletions(-)
 create mode 100644 doc/19-technical-concepts.md
 rename doc/{19-script-debugger.md => 20-script-debugger.md} (100%)
 rename doc/{20-development.md => 21-development.md} (100%)
 rename doc/{21-selinux.md => 22-selinux.md} (100%)
 rename doc/{22-migrating-from-icinga-1x.md => 23-migrating-from-icinga-1x.md} (99%)
 rename doc/{23-appendix.md => 24-appendix.md} (99%)

diff --git a/doc/03-monitoring-basics.md b/doc/03-monitoring-basics.md
index eb8b6c197..2bd040f37 100644
--- a/doc/03-monitoring-basics.md
+++ b/doc/03-monitoring-basics.md
@@ -2155,7 +2155,7 @@ Rephrased: If the parent service object changes into the `Warning` state, this
dependency will fail and render all child objects (hosts or services) unreachable.

You can determine the child's reachability by querying the `is_reachable` attribute
-in for example [DB IDO](23-appendix.md#schema-db-ido-extensions).
+in for example [DB IDO](24-appendix.md#schema-db-ido-extensions).

### Implicit Dependencies for Services on Host

diff --git a/doc/07-agent-based-monitoring.md b/doc/07-agent-based-monitoring.md
index a79feba47..e2597a341 100644
--- a/doc/07-agent-based-monitoring.md
+++ b/doc/07-agent-based-monitoring.md
@@ -181,7 +181,7 @@ SNMP Traps can be received and filtered by using [SNMPTT](http://snmptt.sourcefo
and specific trap handlers passing the check results to Icinga 2.

Following the SNMPTT [Format](http://snmptt.sourceforge.net/docs/snmptt.shtml#SNMPTT.CONF-FORMAT)
-documentation and the Icinga external command syntax found [here](23-appendix.md#external-commands-list-detail)
+documentation and the Icinga external command syntax found [here](24-appendix.md#external-commands-list-detail)
we can create generic services that can accommodate any number of hosts for a given scenario.

### Simple SNMP Traps

diff --git a/doc/14-features.md b/doc/14-features.md
index 5b7b5e372..f356e3073 100644
--- a/doc/14-features.md
+++ b/doc/14-features.md
@@ -78,11 +78,16 @@ Example for PostgreSQL:

    (1 Zeile)

-A detailed list on the available table attributes can be found in the [DB IDO Schema documentation](23-appendix.md#schema-db-ido).
+A detailed list on the available table attributes can be found in the [DB IDO Schema documentation](24-appendix.md#schema-db-ido).

## External Commands

+> **Note**
+>
+> Please use the [REST API](12-icinga2-api.md#icinga2-api) as a modern and secure alternative
+> for external actions.
+
Icinga 2 provides an external command pipe for processing commands
triggering specific actions (for example rescheduling a service check
through the web interface).
@@ -106,7 +111,7 @@ a forced service check:

    Oct 17 15:01:25 icinga-server icinga2: Executing external command: [1382014885] SCHEDULE_FORCED_SVC_CHECK;localhost;ping4;1382014885
    Oct 17 15:01:25 icinga-server icinga2: Rescheduling next check for service 'ping4'

-A list of currently supported external commands can be found [here](23-appendix.md#external-commands-list-detail).
+A list of currently supported external commands can be found [here](24-appendix.md#external-commands-list-detail).

Detailed information on the commands and their required parameters can be found
on the [Icinga 1.x documentation](https://docs.icinga.com/latest/en/extcommands2.html).

@@ -441,7 +446,7 @@ re-implementation of the Livestatus protocol which is compatible with MK
Livestatus.

Details on the available tables and attributes with Icinga 2 can be found
-in the [Livestatus Schema](23-appendix.md#schema-livestatus) section.
+in the [Livestatus Schema](24-appendix.md#schema-livestatus) section.

You can enable Livestatus using icinga2 feature enable:

@@ -517,7 +522,7 @@ Example using the tcp socket listening on port `6558`:

### Livestatus COMMAND Queries

-A list of available external commands and their parameters can be found [here](23-appendix.md#external-commands-list-detail)
+A list of available external commands and their parameters can be found [here](24-appendix.md#external-commands-list-detail)

    $ echo -e 'COMMAND ' | netcat 127.0.0.1 6558

@@ -618,7 +623,7 @@ Default separators.

The `commands` table is populated with `CheckCommand`, `EventCommand` and `NotificationCommand` objects.

-A detailed list on the available table attributes can be found in the [Livestatus Schema documentation](23-appendix.md#schema-livestatus).
+A detailed list on the available table attributes can be found in the [Livestatus Schema documentation](24-appendix.md#schema-livestatus).

## Status Data Files

diff --git a/doc/15-troubleshooting.md b/doc/15-troubleshooting.md
index f8ee88465..f9702911e 100644
--- a/doc/15-troubleshooting.md
+++ b/doc/15-troubleshooting.md
@@ -29,7 +29,7 @@ findings and details please.
* The newest Icinga 2 crash log if relevant, located in `/var/log/icinga2/crash`
* Additional details
  * If the check command failed, what's the output of your manual plugin tests?
-  * In case of [debugging](20-development.md#development) Icinga 2, the full back traces and outputs
+  * In case of [debugging](21-development.md#development) Icinga 2, the full back traces and outputs

## Analyze your Environment

diff --git a/doc/19-technical-concepts.md b/doc/19-technical-concepts.md
new file mode 100644
index 000000000..10d0b9a07
--- /dev/null
+++ b/doc/19-technical-concepts.md
@@ -0,0 +1,270 @@
# Technical Concepts

This chapter provides insights into specific Icinga 2 components, libraries,
features and other technical concepts and design decisions.

## Features

Features are implemented in specific libraries and can be enabled
using CLI commands.

Features either write specific data or receive data.

Examples of features which write data: [DB IDO](14-features.md#db-ido), [Graphite](14-features.md#graphite-carbon-cache-writer), [InfluxDB](14-features.md#influxdb-writer), [GELF](14-features.md#gelfwriter), etc.
Examples of features which receive data: [REST API](12-icinga2-api.md#icinga2-api), etc.

The implementation of features makes use of existing libraries
and functionality. This makes the code more abstract, but shorter
and easier to read.

Features register callback functions for the specific events they want
to handle.
For example, the `GraphiteWriter` feature subscribes to new CheckResult events.

Each time Icinga 2 receives and processes a new check result, this
event is triggered and forwarded to all subscribers.

The GraphiteWriter feature calls the registered function and processes
the received data. Features which connect Icinga 2 to external interfaces
normally parse and reformat the received data into an applicable format.

The GraphiteWriter uses a TCP socket to communicate with the carbon cache
daemon of Graphite. The InfluxDBWriter instead writes bulk metric messages
to InfluxDB's HTTP API.

## Cluster

### Communication

Icinga 2 uses its own certificate authority (CA) by default. The
public and private CA keys can be generated on the signing master.

Each node certificate must be signed by the private CA key.

Note: The following description uses `parent node` and `child node`.
This also applies to nodes in the same cluster zone.

During the connection attempt, an SSL handshake is performed.
If the public certificate of a child node is not signed by the same
CA, the child node is not trusted and the connection will be closed.

If the SSL handshake succeeds, the parent node reads the
certificate's common name (CN) of the child node and looks for
a matching local Endpoint object name in its configuration.

If no Endpoint object is found, further communication
(runtime and config sync, etc.) is terminated.

The child node also checks the CN from the parent node's public
certificate. If the child node does not find a matching local Endpoint
object name in its configuration, it will not trust the parent node.

Both checks prevent accepting cluster messages from an untrusted
source endpoint.

If an Endpoint match was found, there is one additional security
mechanism in place: Endpoints belong to a Zone hierarchy.

Several cluster messages can only be sent "top down", while others,
like check results, are allowed to be sent from the child node to the parent node.

Once these checks succeed, the cluster messages are exchanged and processed.

### CSR Signing

In order to make things easier, Icinga 2 provides built-in methods
which allow child nodes to request a signed certificate from the
signing master.

Icinga 2 v2.8 introduces the possibility to request certificates
from indirectly connected nodes. This is required for multi-level
cluster environments with masters, satellites and clients.

CSR signing in general starts with the master setup. This step
ensures that the master is in a working CSR signing state with:

* the public and private CA keys in `/var/lib/icinga2/ca`
* the private `TicketSalt` constant defined inside the `api` feature
* cluster communication ready, i.e. Icinga 2 listening on port 5665

The child node setup, which is run with CLI commands, will then
attempt to connect to the parent node. This is not necessarily
the signing master instance, but could also be a parent satellite node.

During this process the child node asks the user to verify the
parent node's public certificate to prevent MITM attacks.

There are two methods to request signed certificates:

* Add a ticket to the request. This ticket was generated on the master
beforehand and contains hashed details of the client it has been created for.
The signing master uses this information to automatically sign the certificate
request.

* Do not add a ticket to the request. It will be sent to the signing master
which stores the pending request. Manual user interaction with CLI commands
is necessary to sign the request.
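
Below is a rough CLI sketch of the two methods. The CN is an example only, and
the `ca list` command is assumed here as the companion to the `ca sign` command
described further down for looking up pending requests.

For the ticket method, the ticket is generated on the signing master beforehand
and passed to the client setup:

    # icinga2 pki ticket --cn 'icinga2-client1.localdomain'

For the ticket-less method, the pending certificate requests are listed on the
signing master and the matching fingerprint is signed manually:

    # icinga2 ca list
    # icinga2 ca sign <fingerprint>
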
The certificate request is sent as a `pki::RequestCertificate` cluster
message to the parent node.

If the parent node is not the signing master, it stores the request
in `/var/lib/icinga2/certificate-requests` and forwards the
cluster message to its parent node.

Once the message arrives on the signing master, it first verifies that
the sent certificate request is valid. This is to prevent unwanted errors
or modified requests from the "proxy" node.

After verification, the signing master checks if the request contains
a valid signing ticket. It hashes the certificate's common name and
compares the value to the received ticket number.

If the ticket is valid, the certificate request is immediately signed
with the CA key. The signed certificate is sent back to the client inside
a `pki::UpdateCertificate` cluster message.

If the receiving node was not the origin of the certificate request,
it only updates the cached request for its child node and sends another
cluster message down to that child node (e.g. from a satellite to a client).

If no ticket was specified, the signing master waits until the certificate
is signed manually with the `ca sign` CLI command.

> **Note**
>
> Push notifications for manual request signing are not yet implemented (TODO).

Once the child node reconnects, it synchronizes all signed certificate requests.
This can take a few minutes and requires all nodes to reconnect to each other.

#### CSR Signing: Clients without parent connection

There is an additional scenario: the setup on a child node does
not necessarily need a connection to the parent node.

This mode leaves the node in a semi-configured state. You need
to manually copy the master's public CA key into `/var/lib/icinga2/certs/ca.crt`
on the client before starting Icinga 2.

The parent node needs to actively connect to the child node.
Once this connection succeeds, the child node will actively
request a signed certificate.

The update procedure works the same way as above.

### High Availability

High availability is automatically enabled between two nodes in the same
cluster zone.

This requires the same configuration and enabled features on both nodes.

HA zone members trust each other and share event updates as cluster messages.
This includes, for example, check results, next check timestamp updates,
acknowledgements and notifications.

This ensures that both nodes are synchronized. If one node goes away, the
remaining node takes over and continues to operate as normal.

Cluster nodes automatically determine the authority for configuration
objects. Objects a node is not authoritative for are activated, but paused.
You can verify this by querying the `paused` attribute for all objects via
the REST API or the debug console.

Nodes inside an HA zone calculate the object authority independently of each other.

The number of endpoints in a zone is defined through the configuration. This number
is used in a local modulo calculation to determine whether the node is
responsible for a given object or not.

This object authority is important for the selected features explained below.

Since features are configuration objects too, you must ensure that all nodes
inside the HA zone share the same enabled features. If configured otherwise,
one node might have the `checker` feature enabled while the other does not.
This leads to late or missing check results, because the node without the
`checker` feature still holds the object authority for half of the checkable
objects, and that half is never executed.
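
A quick way to inspect the authority distribution is to query the `paused`
attribute mentioned above via the REST API. This is only a sketch and assumes
an `ApiUser` named `root` with the password `icinga` on the local node; adjust
the credentials and endpoint to your environment:

    # curl -k -s -u root:icinga 'https://localhost:5665/v1/objects/hosts?attrs=name&attrs=paused' | python -m json.tool

Running the same query against both HA zone members should show each object
paused on one node and active on the other.
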
### High Availability: Checker

The `checker` feature only executes checks for `Checkable` objects (Host, Service)
for which it is authoritative.

That way each node only executes checks for a segment of the overall configuration objects.

The cluster message routing ensures that all check results are synchronized
to nodes which are not authoritative for this configuration object.

### High Availability: Notifications

The `notification` feature only sends notifications for `Notification` objects
for which it is authoritative.

That way each node only executes notifications for a segment of all notification objects.

Notified users and other event details are synchronized throughout the cluster.
This is required if, for example, the DB IDO feature is active on the other node.

### High Availability: DB IDO

If you don't have HA enabled for the IDO feature, both nodes will
write their status and historical data to their own separate database
backends.

In order to avoid data separation and a split view (each node would require its
own Icinga Web 2 installation on top), the high availability option was added
to the DB IDO feature. This is enabled by default with the `enable_ha` setting.

This requires a central database backend. Best practice is to use a MySQL cluster
with a virtual IP.

Both Icinga 2 nodes require the connection and credential details configured in
their DB IDO feature.

During startup, Icinga 2 calculates whether the feature configuration object
is authoritative on this node or not. The order is an alphanumeric
comparison, e.g. if you have `master1` and `master2`, Icinga 2 will enable
the DB IDO feature on `master2` by default.

If the connection between endpoints drops, the object authority is re-calculated.

In order to prevent data duplication in a split-brain scenario where both
nodes would write into the same database, there is another safety mechanism
in place.

The decision which node is allowed to write to the database is derived from
a quorum inside the `programstatus` table. On database connect, each node
verifies whether the `endpoint_name` column contains its own name or that of
another endpoint. In addition, the DB IDO feature compares the `last_update_time`
column against the current timestamp plus the configured `failover_timeout` offset.

That way only one active DB IDO feature writes to the database, even if the
nodes are not currently connected in the cluster zone. This prevents data
duplication in historical tables.

diff --git a/doc/19-script-debugger.md b/doc/20-script-debugger.md
similarity index 100%
rename from doc/19-script-debugger.md
rename to doc/20-script-debugger.md
diff --git a/doc/20-development.md b/doc/21-development.md
similarity index 100%
rename from doc/20-development.md
rename to doc/21-development.md
diff --git a/doc/21-selinux.md b/doc/22-selinux.md
similarity index 100%
rename from doc/21-selinux.md
rename to doc/22-selinux.md
diff --git a/doc/22-migrating-from-icinga-1x.md b/doc/23-migrating-from-icinga-1x.md
similarity index 99%
rename from doc/22-migrating-from-icinga-1x.md
rename to doc/23-migrating-from-icinga-1x.md
index 831841e95..4faeccc3b 100644
--- a/doc/22-migrating-from-icinga-1x.md
+++ b/doc/23-migrating-from-icinga-1x.md
@@ -11,7 +11,7 @@ on your migration requirements.
For a long-term migration of your configuration you should consider re-creating
your configuration based on the proposed Icinga 2 configuration paradigm.

-Please read the [next chapter](22-migrating-from-icinga-1x.md#differences-1x-2) to find out more about the differences
+Please read the [next chapter](23-migrating-from-icinga-1x.md#differences-1x-2) to find out more about the differences
between 1.x and 2.

### Manual Config Migration Hints

@@ -24,7 +24,7 @@ The examples are taken from Icinga 1.x test and production environments and conv
straight into a possible Icinga 2 format. If you found a different strategy, please let us know!

-If you require in-depth explanations, please check the [next chapter](22-migrating-from-icinga-1x.md#differences-1x-2).
+If you require in-depth explanations, please check the [next chapter](23-migrating-from-icinga-1x.md#differences-1x-2).

#### Manual Config Migration Hints for Intervals

@@ -185,7 +185,7 @@ While you could manually migrate this like (please note the new generic command

#### Manual Config Migration Hints for Runtime Macros

-Runtime macros have been renamed. A detailed comparison table can be found [here](22-migrating-from-icinga-1x.md#differences-1x-2-runtime-macros).
+Runtime macros have been renamed. A detailed comparison table can be found [here](23-migrating-from-icinga-1x.md#differences-1x-2-runtime-macros).

For example, accessing the service check output looks like the following in Icinga 1.x:

@@ -257,7 +257,7 @@ while the service check command resolves its value to the service attribute attr

#### Manual Config Migration Hints for Contacts (Users)

Contacts in Icinga 1.x act as users in Icinga 2, but do not have any notification commands specified.
-This migration part is explained in the [next chapter](22-migrating-from-icinga-1x.md#manual-config-migration-hints-notifications).
+This migration part is explained in the [next chapter](23-migrating-from-icinga-1x.md#manual-config-migration-hints-notifications).

    define contact{
      contact_name                   testconfig-user
      use                            generic-user
      alias                          Icinga Test User
      email                          icinga@localhost
    }

-The `service_notification_options` can be [mapped](22-migrating-from-icinga-1x.md#manual-config-migration-hints-notification-filters)
+The `service_notification_options` can be [mapped](23-migrating-from-icinga-1x.md#manual-config-migration-hints-notification-filters)
into generic `state` and `type` filters, if additional notification filtering is required.
`alias` gets renamed to `display_name`.

@@ -319,7 +319,7 @@ Assign it to the host or service and set the newly generated notification comman

Convert the `notification_options` attribute from Icinga 1.x to Icinga 2 `states` and `types`. Details
-[here](22-migrating-from-icinga-1x.md#manual-config-migration-hints-notification-filters). Add the notification period.
+[here](23-migrating-from-icinga-1x.md#manual-config-migration-hints-notification-filters). Add the notification period.

    states = [ OK, Warning, Critical ]
    types = [ Recovery, Problem, Custom ]

@@ -556,7 +556,7 @@ enabled.

      assign where "hg_svcdep2" in host.groups
    }

-Host dependencies are explained in the [next chapter](22-migrating-from-icinga-1x.md#manual-config-migration-hints-host-parents).
+Host dependencies are explained in the [next chapter](23-migrating-from-icinga-1x.md#manual-config-migration-hints-host-parents).
@@ -955,7 +955,7 @@ In Icinga 1.x arguments are specified in the `check_command` attribute and are
separated from the command name using an exclamation mark (`!`).

Please check the migration hints for a detailed
-[migration example](22-migrating-from-icinga-1x.md#manual-config-migration-hints-check-command-arguments).
+[migration example](23-migrating-from-icinga-1x.md#manual-config-migration-hints-check-command-arguments).

> **Note**
>

diff --git a/doc/23-appendix.md b/doc/24-appendix.md
similarity index 99%
rename from doc/23-appendix.md
rename to doc/24-appendix.md
index 3453ee993..f57c1bf33 100644
--- a/doc/23-appendix.md
+++ b/doc/24-appendix.md
@@ -692,16 +692,16 @@ Not supported: `debug_info`.

#### Livestatus Hostsbygroup Table Attributes

-All [hosts](23-appendix.md#schema-livestatus-hosts-table-attributes) table attributes grouped with
-the [hostgroups](23-appendix.md#schema-livestatus-hostgroups-table-attributes) table prefixed with `hostgroup_`.
+All [hosts](24-appendix.md#schema-livestatus-hosts-table-attributes) table attributes grouped with
+the [hostgroups](24-appendix.md#schema-livestatus-hostgroups-table-attributes) table prefixed with `hostgroup_`.

#### Livestatus Servicesbygroup Table Attributes

-All [services](23-appendix.md#schema-livestatus-services-table-attributes) table attributes grouped with
-the [servicegroups](23-appendix.md#schema-livestatus-servicegroups-table-attributes) table prefixed with `servicegroup_`.
+All [services](24-appendix.md#schema-livestatus-services-table-attributes) table attributes grouped with
+the [servicegroups](24-appendix.md#schema-livestatus-servicegroups-table-attributes) table prefixed with `servicegroup_`.

#### Livestatus Servicesbyhostgroup Table Attributes

-All [services](23-appendix.md#schema-livestatus-services-table-attributes) table attributes grouped with
-the [hostgroups](23-appendix.md#schema-livestatus-hostgroups-table-attributes) table prefixed with `hostgroup_`.
+All [services](24-appendix.md#schema-livestatus-services-table-attributes) table attributes grouped with
+the [hostgroups](24-appendix.md#schema-livestatus-hostgroups-table-attributes) table prefixed with `hostgroup_`.