Documentation: Rewrite cluster docs

* Re-organize structure
* New section with HA features
* Permissions and security
* How to add a new node
* Cluster requirements
* Additional hints on installation
* More troubleshooting

fixes #6743
fixes #6703
fixes #6997
Michael Friedrich 2014-08-24 11:21:54 +02:00
parent 3972aa20c4
commit 32c20132d0
3 changed files with 285 additions and 148 deletions


@ -144,74 +144,52 @@ passing the check results to Icinga 2.
remote sender to push check results into the Icinga 2 `ExternalCommandListener`
feature.
> **Note**
>
> This addon works in a similar fashion to the Icinga 1.x distributed model. If you
> are looking for a real distributed architecture with Icinga 2, scroll down.
## <a id="distributed-monitoring-high-availability"></a> Distributed Monitoring and High Availability
An Icinga 2 cluster consists of two or more nodes and can reside on multiple
architectures. The base concept of Icinga 2 is the possibility to add additional
features using components. In case of a cluster setup you have to add the api feature
to all nodes.
An Icinga 2 cluster can be used for the following scenarios:
Building distributed environments with high availability included is fairly easy with Icinga 2.
The cluster feature is built-in and allows you to build many scenarios based on your requirements:
* [High Availability](#cluster-scenarios-high-availability). All instances in the `Zone` elect one active master and run as an Active/Active cluster.
* [Distributed Zones](#cluster-scenarios-distributed-zones). A master zone and one or more satellites in their zones.
* [Load Distribution](#cluster-scenarios-load-distribution). A configuration master and multiple checker satellites.
You can combine these scenarios into a global setup fitting your requirements.
Each instance has its own event scheduler and does not depend on a centralized master
coordinating and distributing the events. In case of a cluster failure, all nodes
continue to run independently. Be alarmed when your cluster fails and a split-brain scenario
is in effect: all instances that are still alive continue to do their job, and their history will begin to differ.
> **Note**
>
> Before you start, make sure to read the [requirements](#cluster-requirements).
### <a id="cluster-requirements"></a> Cluster Requirements
Before you start deploying, keep the following things in mind:
* Your [SSL CA and certificates](#certificate-authority-certificates) are mandatory for secure communication
* Get pen and paper or a drawing board and design your nodes and zones!
    * All nodes in a cluster zone provide high availability functionality and trust each other.
    * Cluster zones can be built in a top-down design where the child trusts the parent.
    * Communication between zones is bidirectional, which means that a DMZ-located node can still reach the master node, or vice versa.
* Update firewall rules and ACLs
* Decide whether to use the built-in [configuration synchronisation](#cluster-zone-config-sync) or an external tool (Puppet, Ansible, Chef, Salt, etc.) to manage the configuration deployment
> **Tip**
>
> If you need to troubleshoot cluster problems, check the general
> [troubleshooting](#troubleshooting-cluster) section.
Before you start configuring the different nodes, it is necessary to set up the underlying
communication layer based on SSL.
### <a id="certificate-authority-certificates"></a> Certificate Authority and Certificates
Icinga 2 ships two scripts assisting with CA and node certificate creation
for your Icinga 2 cluster.
The first step is the creation of CA running the following command:
# icinga2-build-ca
Please make sure to export the environment variable `ICINGA_CA` pointing to
an empty folder for the newly created CA files:
# export ICINGA_CA="/root/icinga-ca"
Now create a certificate and key file for each node running the following command
(replace `icinga2a` with the required hostname):
# icinga2-build-key icinga2a
Repeat the step for all nodes in your cluster scenario. Save the CA key in case
you want to set up certificates for additional nodes at a later time.
Each node requires the following files in `/etc/icinga2/pki` (replace `fqdn-nodename` with
the host's FQDN):
* ca.crt
* &lt;fqdn-nodename&gt;.crt
* &lt;fqdn-nodename&gt;.key
### <a id="configure-nodename"></a> Configure the Icinga Node Name
Instead of using the default FQDN as node name you can optionally set
that value using the [NodeName](#global-constants) constant.
This setting must be unique for each node, and must also match
the name of the local [Endpoint](#objecttype-endpoint) object and the
SSL certificate common name.
const NodeName = "icinga2a"
Read further about additional [naming conventions](#cluster-naming-convention).
If the node name is not specified, Icinga 2 uses the FQDN. Make sure that all
configured endpoint names and common names are in sync.
### <a id="cluster-naming-convention"></a> Cluster Naming Convention
#### <a id="cluster-naming-convention"></a> Cluster Naming Convention
The SSL certificate common name (CN) will be used by the [ApiListener](#objecttype-apilistener)
object to determine the local authority. This name must match the local [Endpoint](#objecttype-endpoint)
@ -240,13 +218,103 @@ The [Endpoint](#objecttype-endpoint) name is further referenced as `endpoints` a
endpoints = [ "icinga2a", "icinga2b" ]
}
Specifying the local node name using the [NodeName](#global-constants) variable requires
Specifying the local node name using the [NodeName](#configure-nodename) variable requires
the same name as used for the endpoint name and common name above. If not set, the FQDN is used.
const NodeName = "icinga2a"
### <a id="configure-clusterlistener-object"></a> Configure the ApiListener Object
### <a id="certificate-authority-certificates"></a> Certificate Authority and Certificates
Icinga 2 ships two scripts assisting with CA and node certificate creation
for your Icinga 2 cluster.
> **Note**
>
> You're free to use your own method to generate a valid CA and signed client
> certificates.
Please make sure to export the environment variable `ICINGA_CA` pointing to
an empty folder for the newly created CA files:
# export ICINGA_CA="/root/icinga-ca"
The scripts will put all generated data and the required certificates in there.
The first step is the creation of the certificate authority (CA) running the
following command:
# icinga2-build-ca
Now create a certificate and key file for each node running the following command
(replace `icinga2a` with the required hostname):
# icinga2-build-key icinga2a
Repeat the step for all nodes in your cluster scenario.
Save the CA key in a secure location in case you want to set up certificates for
additional nodes at a later time.
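
The key generation step above can be scripted when there are several nodes. The following is a minimal sketch, assuming the `icinga2-build-key` script is available on the `PATH`. The function name `build_node_keys` and the `BUILD_KEY_CMD` override are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: create a certificate and key for every node name given.
# Assumes ICINGA_CA already points to the CA folder created by icinga2-build-ca.
# BUILD_KEY_CMD is an illustrative override (not an Icinga 2 setting) that
# defaults to the icinga2-build-key script mentioned above.
build_node_keys() {
    if [ -z "${ICINGA_CA:-}" ]; then
        echo "ICINGA_CA is not set" >&2
        return 1
    fi
    for node in "$@"; do
        # Stop at the first node whose key generation fails.
        "${BUILD_KEY_CMD:-icinga2-build-key}" "$node" || return 1
    done
}
```

A call such as `build_node_keys icinga2a icinga2b` would then run the script once per node name.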
Navigate to the location of your newly generated certificate files, and manually
copy/transfer them to `/etc/icinga2/pki` in your Icinga 2 configuration folder.
> **Note**
>
> The certificate files must be readable by the user Icinga 2 is running as. Also,
> the private key file must not be world-readable.
Each node requires the following files in `/etc/icinga2/pki` (replace `fqdn-nodename` with
the host's FQDN):
* ca.crt
* &lt;fqdn-nodename&gt;.crt
* &lt;fqdn-nodename&gt;.key
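
The file and permission requirements above can be checked with a small script. This is only a sketch: the function name `check_node_pki` is hypothetical, and the readability test reflects the user running the script, so it should be run as the user Icinga 2 is running as.

```shell
# Hypothetical helper: verify that the three required PKI files exist and
# that the private key is not world-readable.
# Usage: check_node_pki <pki-dir> <fqdn-nodename>
check_node_pki() {
    pki_dir=$1
    node=$2
    rc=0
    for f in "ca.crt" "$node.crt" "$node.key"; do
        # -r tests readability for the user running this script.
        if [ ! -r "$pki_dir/$f" ]; then
            echo "missing or unreadable: $pki_dir/$f"
            rc=1
        fi
    done
    # find prints the path only if the "read by others" bit (004) is set.
    if [ -e "$pki_dir/$node.key" ] &&
       [ -n "$(find "$pki_dir/$node.key" -perm -004)" ]; then
        echo "private key is world-readable: $pki_dir/$node.key"
        rc=1
    fi
    return $rc
}
```

For example, `check_node_pki /etc/icinga2/pki icinga2a` returns a non-zero exit code and prints one line per problem found.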
### <a id="cluster-configuration"></a> Cluster Configuration
The following sections describe which configuration must be updated or created
in order to get your cluster running with basic functionality.
* [configure the node name](#configure-nodename)
* [configure the ApiListener object](#configure-apilistener-object)
* [configure cluster endpoints](#configure-cluster-endpoints)
* [configure cluster zones](#configure-cluster-zones)
Once you're finished with the basic setup the following section will
describe how to use [zone configuration synchronisation](#cluster-zone-config-sync)
and configure [cluster scenarios](#cluster-scenarios).
#### <a id="configure-nodename"></a> Configure the Icinga Node Name
Instead of using the default FQDN as node name you can optionally set
that value using the [NodeName](#global-constants) constant.
> **Note**
>
> Skip this step if your FQDN already matches the default `NodeName` set
> in `/etc/icinga2/constants.conf`.
This setting must be unique for each node, and must also match
the name of the local [Endpoint](#objecttype-endpoint) object and the
SSL certificate common name as described in the
[cluster naming convention](#cluster-naming-convention).
vim /etc/icinga2/constants.conf
/* Our local instance name. By default this is the server's hostname as returned by `hostname --fqdn`.
* This should be the common name from the API certificate.
*/
const NodeName = "icinga2a"
Read further about additional [naming conventions](#cluster-naming-convention).
If the node name is not specified, Icinga 2 uses the FQDN. Make sure that all
configured endpoint names and common names are in sync.
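
To see which node name a given constants file results in, the lookup described above can be sketched as follows. The function name `effective_node_name` is hypothetical, and the pattern only recognises a plain, uncommented `const NodeName = "..."` line; it is not a full parser for the Icinga 2 configuration language.

```shell
# Hypothetical helper: print the node name defined in a constants file,
# falling back to the FQDN if NodeName is not set (or is commented out).
# Usage: effective_node_name /etc/icinga2/constants.conf
effective_node_name() {
    constants_file=$1
    # Match only uncommented lines of the form: const NodeName = "value"
    name=$(sed -n 's/^[[:space:]]*const[[:space:]][[:space:]]*NodeName[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$constants_file" | head -n 1)
    if [ -n "$name" ]; then
        echo "$name"
    else
        hostname --fqdn
    fi
}
```

The printed value is the name that the local `Endpoint` object and the certificate common name have to match.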
#### <a id="configure-apilistener-object"></a> Configure the ApiListener Object
The [ApiListener](#objecttype-apilistener) object needs to be configured on
every node in the cluster with the following settings:
@ -272,8 +340,7 @@ synchronisation enabled for this node.
> The certificate files must be readable by the user Icinga 2 is running as. Also,
> the private key file must not be world-readable.
### <a id="configure-cluster-endpoints"></a> Configure Cluster Endpoints
#### <a id="configure-cluster-endpoints"></a> Configure Cluster Endpoints
`Endpoint` objects specify the `host` and `port` settings for the cluster nodes.
This configuration can be the same on all nodes in the cluster only containing
@ -292,8 +359,7 @@ A sample configuration looks like:
If this endpoint object is reachable on a different port, you must configure the
`ApiListener` on the local `Endpoint` object accordingly too.
### <a id="configure-cluster-zones"></a> Configure Cluster Zones
#### <a id="configure-cluster-zones"></a> Configure Cluster Zones
`Zone` objects specify the endpoints located in a zone. That way your distributed setup can be
seen as zones connected together instead of multiple instances in that specific zone.
@ -324,7 +390,7 @@ the defined parent zone `config-ha-master`.
}
#### <a id="cluster-zone-config-sync"></a> Zone Configuration Synchronisation
### <a id="cluster-zone-config-sync"></a> Zone Configuration Synchronisation
By default all objects for specific zones should be organized in
@ -376,12 +442,19 @@ process.
> determines the required include directory. This can be overridden using the
> [global constant](#global-constants) `ZonesDir`.
#### <a id="zone-synchronisation-permissions"></a> Global Configuration Zone
#### <a id="zone-global-config-templates"></a> Global Configuration Zone for Templates
If your zone configuration setup shares the same templates, groups, commands, timeperiods, etc.
you would have to duplicate quite a lot of configuration objects making the merged configuration
on your configuration master unique.
> **Note**
>
> Only put templates, groups, etc. into this zone. DO NOT add checkable objects such as
> hosts or services here. If they are checked by all instances globally, this will lead
> to duplicated check results and an unclear state history, which is not easy to
> troubleshoot. You've been warned.
Defining a global zone that ships all those templates makes this duplication unnecessary. By setting
`global = true` you ensure that this zone serving common configuration templates will be
synchronized to all involved nodes (only if they accept configuration though).
@ -406,11 +479,11 @@ your zone configuration visible to all nodes.
> **Note**
>
> If the remote node does not have this zone configured, it will ignore the configuration
> update, if it accepts configuration.
> update, if it accepts synchronized configuration.
If you don't require any global configuration, skip this setting.
#### <a id="zone-synchronisation-permissions"></a> Zone Configuration Permissions
#### <a id="zone-config-sync-permissions"></a> Zone Configuration Synchronisation Permissions
Each [ApiListener](#objecttype-apilistener) object must have the `accept_config` attribute
set to `true` to receive configuration from the parent `Zone` members. Default value is `false`.
@ -422,15 +495,13 @@ set to `true` to receive configuration from the parent `Zone` members. Default v
accept_config = true
}
### <a id="initial-cluster-sync"></a> Initial Cluster Sync
If `accept_config` is set to `false`, this instance won't accept configuration from remote
master instances anymore.
In order to make sure that all of your cluster nodes have the same state you will
have to pick one of the nodes as your initial "master" and copy its state file
to all the other nodes.
You can find the state file in `/var/lib/icinga2/icinga2.state`. Before copying
the state file you should make sure that all your cluster nodes are properly shut
down.
> **Tip**
>
> Look into the [troubleshooting guides](#troubleshooting-cluster-config-sync) for debugging
> problems with the configuration synchronisation.
### <a id="cluster-health-check"></a> Cluster Health Check
@ -441,12 +512,12 @@ one or more configured nodes are not connected.
Example:
apply Service "cluster" {
object Service "cluster" {
check_command = "cluster"
check_interval = 5s
retry_interval = 1s
assign where host.name == "icinga2a"
host_name = "icinga2a"
}
Each cluster node should execute its own local cluster health check to
@ -458,78 +529,33 @@ connected zones.
Example for the `checker` zone checking the connection to the `master` zone:
apply Service "cluster-zone-master" {
object Service "cluster-zone-master" {
check_command = "cluster-zone"
check_interval = 5s
retry_interval = 1s
vars.cluster_zone = "master"
assign where host.name == "icinga2b"
host_name = "icinga2b"
}
### <a id="host-multiple-cluster-nodes"></a> Host With Multiple Cluster Nodes
Special scenarios might require multiple cluster nodes running on a single host.
By default Icinga 2 and its features will place their runtime data below the prefix
`LocalStateDir`. By default packages will set that path to `/var`.
You can either set that variable as constant configuration
definition in [icinga2.conf](#icinga2-conf) or pass it as runtime variable to
the Icinga 2 daemon.
# icinga2 -c /etc/icinga2/node1/icinga2.conf -DLocalStateDir=/opt/node1/var
### <a id="high-availability-db-ido"></a> High Availability with DB IDO
All instances within the same zone (e.g. the `master` zone as HA cluster) must
have the DB IDO feature enabled.
Example DB IDO MySQL:
# icinga2-enable-feature ido-mysql
The feature 'ido-mysql' is already enabled.
By default the DB IDO feature only runs on the elected zone master. All other nodes
disable the active IDO database connection at runtime.
> **Note**
>
> The DB IDO HA feature can be disabled by setting the `enable_ha` attribute to `false`
> for the [IdoMysqlConnection](#objecttype-idomysqlconnection) or
> [IdoPgsqlConnection](#objecttype-idopgsqlconnection) object on all nodes in the
> same zone.
>
> All endpoints will enable the DB IDO feature then, connect to the configured
> database and dump configuration, status and historical data on their own.
If the instance with the active DB IDO connection dies, the HA functionality will
re-enable the DB IDO connection on the newly elected zone master.
The DB IDO feature will try to determine which cluster endpoint is currently writing
to the database and bail out if another endpoint is active. You can manually verify that
by running the following query:
icinga=> SELECT status_update_time, endpoint_name FROM icinga_programstatus;
status_update_time | endpoint_name
------------------------+---------------
2014-08-15 15:52:26+02 | icinga2a
(1 row)
This is useful when the cluster connection between endpoints breaks, and prevents
data duplication in split-brain scenarios. The failover timeout can be set using the
`failover_timeout` attribute, but not lower than 60 seconds.
### <a id="cluster-scenarios"></a> Cluster Scenarios
All cluster nodes are full-featured Icinga 2 instances. You only need to enable
the features for their role (for example, a `Checker` node only requires the `checker`
feature enabled, but not `notification` or `ido-mysql` features).
Each instance has its own event scheduler and does not depend on a centralized master
coordinating and distributing the events. In case of a cluster failure, all nodes
continue to run independently. Be alarmed when your cluster fails and a split-brain scenario
is in effect: all instances that are still alive continue to do their job, and their history will begin to differ.
#### <a id="cluster-scenarios-security"></a> Security in Cluster Scenarios
While there are certain capabilities to ensure the safe communication between all
nodes (firewalls, policies, software hardening, etc) the Icinga 2 cluster also provides
additional security itself:
* [SSL certificates](#certificate-authority-certificates) are mandatory for cluster communication.
* Child zones only receive event updates (check results, commands, etc.) for their configured objects.
* Zones cannot influence or interfere with other zones. Each checked object is assigned to only one zone.
* All nodes in a zone trust each other.
* [Configuration sync](#zone-config-sync-permissions) is disabled by default.
#### <a id="cluster-scenarios-features"></a> Features in Cluster Zones
@ -539,11 +565,13 @@ Even further all commands are distributed amongst connected nodes. For example,
re-schedule a check or acknowledge a problem on the master, and it gets replicated to the
actual slave checker node.
DB IDO on the left, graphite on the right side - works.
DB IDO on the left, graphite on the right side - works (if you disable
[DB IDO HA](#high-availability-db-ido)).
Icinga Web 2 on the left, checker and notifications on the right side - works too.
Everything on the left and on the right side - make sure to deal with duplicated notifications
and automated check distribution.
Everything on the left and on the right side - make sure to deal with
[load-balanced notifications and checks](#high-availability-features) in a
[HA zone](#cluster-scenarios-high-availability).
#### <a id="cluster-scenarios-distributed-zones"></a> Distributed Zones
That scenario fits if your instances are spread over the globe and they all report
@ -612,7 +640,6 @@ The zones would look like:
The `nuremberg-master` zone will only execute local checks, and receive
check results from the satellite nodes in the zones `berlin` and `vienna`.
#### <a id="cluster-scenarios-load-distribution"></a> Load Distribution
If you are planning to off-load the checks to a defined set of remote workers
@ -663,17 +690,13 @@ Zones:
global = true
}
#### <a id="cluster-scenarios-high-availability"></a> High Availability
#### <a id="cluster-scenarios-high-availability"></a> Cluster High Availability
High availability with Icinga 2 is possible by putting multiple nodes into
a dedicated `Zone`. All nodes will elect their active master, and retry an
a dedicated `Zone`. All nodes will elect one active master, and retry an
election once the current active master fails.
Selected features (such as [DB IDO](#high-availability-db-ido)) will only be
active on the current active master.
All other passive nodes will pause the features without reload/restart.
Selected features provide advanced [HA functionality](#high-availability-features).
Checks and notifications are load-balanced between nodes in the high availability
zone.
@ -693,7 +716,6 @@ Two or more nodes in a high availability setup require an [initial cluster sync]
> configuration files in the `zones.d` directory. All other nodes must not
> have that directory populated. Details are in the [Configuration Sync Chapter](#cluster-zone-config-sync).
#### <a id="cluster-scenarios-multiple-hierachies"></a> Multiple Hierarchies
Your master zone collects all check results for reporting and graphing and also
@ -717,3 +739,110 @@ department instances. Furthermore the master NOC is able to see what's going on.
The instances in the departments will serve a local interface, and allow the administrators
to reschedule checks or acknowledge problems for their services.
### <a id="high-availability-features"></a> High Availability for Icinga 2 features
All nodes in the same zone require the same features enabled for High Availability (HA)
amongst them.
By default the following features provide advanced HA functionality:
* [Checks](#high-availability-checks) (load balanced, automated failover)
* [Notifications](#high-availability-notifications) (load balanced, automated failover)
* DB IDO (Run-Once, automated failover)
#### <a id="high-availability-checks"></a> High Availability with Checks
All nodes in the same zone automatically load-balance the check execution. When one instance
fails, the other nodes will automatically take over the remaining checks.
> **Note**
>
> If a node should not check anything, disable the `checker` feature explicitly and
> reload Icinga 2.
# icinga2-disable-feature checker
# service icinga2 reload
#### <a id="high-availability-notifications"></a> High Availability with Notifications
Notifications are load balanced amongst all nodes in a zone. By default this functionality
is enabled.
If your nodes should notify independently of any other nodes (this will cause
duplicated notifications if not properly handled!), you can set `enable_ha = false`
in the [NotificationComponent](#objecttype-notificationcomponent) feature.
#### <a id="high-availability-db-ido"></a> High Availability with DB IDO
All instances within the same zone (e.g. the `master` zone as HA cluster) must
have the DB IDO feature enabled.
Example DB IDO MySQL:
# icinga2-enable-feature ido-mysql
The feature 'ido-mysql' is already enabled.
By default the DB IDO feature only runs on the elected zone master. All other passive
nodes disable the active IDO database connection at runtime.
> **Note**
>
> The DB IDO HA feature can be disabled by setting the `enable_ha` attribute to `false`
> for the [IdoMysqlConnection](#objecttype-idomysqlconnection) or
> [IdoPgsqlConnection](#objecttype-idopgsqlconnection) object on all nodes in the
> same zone.
>
> All endpoints will enable the DB IDO feature then, connect to the configured
> database and dump configuration, status and historical data on their own.
If the instance with the active DB IDO connection dies, the HA functionality will
re-enable the DB IDO connection on the newly elected zone master.
The DB IDO feature will try to determine which cluster endpoint is currently writing
to the database and bail out if another endpoint is active. You can manually verify that
by running the following query:
icinga=> SELECT status_update_time, endpoint_name FROM icinga_programstatus;
status_update_time | endpoint_name
------------------------+---------------
2014-08-15 15:52:26+02 | icinga2a
(1 row)
This is useful when the cluster connection between endpoints breaks, and prevents
data duplication in split-brain scenarios. The failover timeout can be set using the
`failover_timeout` attribute, but not lower than 60 seconds.
### <a id="cluster-add-node"></a> Add a new cluster endpoint
These steps are required for integrating a new cluster endpoint:
* generate a new [SSL client certificate](#certificate-authority-certificates)
* identify its location in the zones
* update the `zones.conf` file on each involved node ([endpoint](#configure-cluster-endpoints), [zones](#configure-cluster-zones))
    * a new slave zone node requires updates for the master and slave zones
* if the node requires the existing zone history: [initial cluster sync](#initial-cluster-sync)
* add a [cluster health check](#cluster-health-check)
#### <a id="initial-cluster-sync"></a> Initial Cluster Sync
In order to make sure that all of your cluster nodes have the same state you will
have to pick one of the nodes as your initial "master" and copy its state file
to all the other nodes.
You can find the state file in `/var/lib/icinga2/icinga2.state`. Before copying
the state file you should make sure that all your cluster nodes are properly shut
down.
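
On a single machine, the copy step above can be sketched as follows. The function name `sync_state_file` is hypothetical, and the process check only covers the local host; the other cluster nodes still have to be stopped by hand, and copying between hosts needs a transfer tool of your choice instead of `cp`.

```shell
# Hypothetical helper: copy the state file of the chosen initial "master"
# to the state file location of another node.
# Usage: sync_state_file <master-state-file> <target-state-file>
sync_state_file() {
    src=$1
    dst=$2
    # Refuse to copy while an icinga2 process is still running locally.
    if pgrep -x icinga2 >/dev/null 2>&1; then
        echo "icinga2 is still running, shut it down first" >&2
        return 1
    fi
    # -p keeps mode and timestamps of the original state file.
    cp -p "$src" "$dst"
}
```

With the default paths from above, the source would be `/var/lib/icinga2/icinga2.state` on the initial master.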
### <a id="host-multiple-cluster-nodes"></a> Host With Multiple Cluster Nodes
Special scenarios might require multiple cluster nodes running on a single host.
By default Icinga 2 and its features will place their runtime data below the prefix
`LocalStateDir`. By default packages will set that path to `/var`.
You can either set that variable as constant configuration
definition in [icinga2.conf](#icinga2-conf) or pass it as runtime variable to
the Icinga 2 daemon.
# icinga2 -c /etc/icinga2/node1/icinga2.conf -DLocalStateDir=/opt/node1/var
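
For more than one local node, the command line above follows the same pattern for each of them. A small sketch, assuming every node keeps its configuration in `/etc/icinga2/<node>` and its runtime data in `/opt/<node>/var` as in the example; the function name `node_command` is hypothetical.

```shell
# Hypothetical helper: print the daemon command line for one local node,
# following the directory layout used in the example above.
node_command() {
    node=$1
    echo "icinga2 -c /etc/icinga2/$node/icinga2.conf -DLocalStateDir=/opt/$node/var"
}

# Print the command lines for two nodes without starting anything.
for n in node1 node2; do
    node_command "$n"
done
```

The sketch only prints the commands, so they can be reviewed before they are actually run.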


@ -164,6 +164,14 @@ they remain in a Split-Brain-mode and history may differ.
Although the Icinga 2 cluster protocol stores historical events in a replay log for later synchronisation,
you should make sure to check why the network connection failed.
### <a id="troubleshooting-cluster-config-sync"></a> Cluster Troubleshooting Config Sync
If the cluster zones do not sync their configuration, make sure to check the following:
* Within a config master zone, only one configuration master is allowed to have its config in `/etc/icinga2/zones.d`.
    * The master syncs the configuration to `/var/lib/icinga2/api/zones/` during startup and only syncs valid configuration to the other nodes
    * The other nodes receive the configuration into `/var/lib/icinga2/api/zones/`
* The `icinga2.log` log file will indicate whether this ApiListener [accepts config](#zone-config-sync-permissions), or not
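
The first item of the checklist above can be verified on each node with a short script. This is a sketch under the assumption that zone configuration files use the `.conf` suffix; the function name `count_zone_configs` is hypothetical.

```shell
# Hypothetical helper: count the configuration files below a directory.
# Usage: count_zone_configs /etc/icinga2/zones.d
count_zone_configs() {
    find "$1" -type f -name '*.conf' | wc -l
}
```

Running `count_zone_configs /etc/icinga2/zones.d` should print `0` on every node except the configuration master, while `count_zone_configs /var/lib/icinga2/api/zones` shows whether a node has received synced configuration.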
## <a id="debug"></a> Debug Icinga 2


@ -8,7 +8,7 @@ const PluginDir = "@ICINGA2_PLUGINDIR@"
/* Our local instance name. By default this is the server's hostname as returned by `hostname --fqdn`.
* This should be the common name from the API certificate.
*/
*/
//const NodeName = "localhost"
/* Our local zone name. */