icinga2/doc/6.04-cluster.md

17 KiB

Cluster

An Icinga 2 cluster consists of two or more nodes and can reside on multiple architectures. The base concept of Icinga 2 is the possibility to add additional features using components. In case of a cluster setup you have to add the cluster feature to all nodes. Before you start configuring the diffent nodes it's necessary to setup the underlying communication layer based on SSL.

Certificate Authority and Certificates

Icinga 2 comes with two scripts helping you to create CA and node certificates for you Icinga 2 Cluster.

The first step is the creation of CA using the following command:

icinga2-build-ca

Please make sure to export a variable containing an empty folder for the created CA files:

export ICINGA_CA="/root/icinga-ca"

In the next step you have to create a certificate and a key file for every node using the following command:

icinga2-build-key icinga-node-1

Please create a certificate and a key file for every node in the Icinga 2 Cluster and save the CA key in case you want to set up certificates for additional nodes at a later date.

Enable the Cluster Configuration

Until the cluster-component is moved into an independent feature you have to enable the required libraries in the icinga2.conf configuration file:

library "cluster"

Configure the Icinga Node Name

Instead of using the default FQDN as node name you can optionally set that value using the IcingaNodeName constant. This setting must be unique on each cluster node, and must also match the name of the local Endpoint object and the SSL certificate common name.

const IcingaNodeName = "icinga-node-1"

Read further about additional naming conventions.

Not specifying the node name will default to FQDN. Make sure that all configured endpoint names and set common names are in sync.

Configure the ClusterListener Object

The ClusterListener needs to be configured on every node in the cluster with the following settings:

Configuration Setting Value
ca_path path to ca.crt file
cert_path path to server certificate
key_path path to server key
bind_port port for incoming and outgoing conns
peers array of all reachable nodes

A sample config part can look like this:

/**
 * Load cluster library and configure ClusterListener using certificate files
 */
library "cluster"

object ClusterListener "cluster" {
  ca_path = "/etc/icinga2/ca/ca.crt"
  cert_path = "/etc/icinga2/ca/icinga-node-1.crt"
  key_path = "/etc/icinga2/ca/icinga-node-1.key"

  bind_port = 8888

  peers = [ "icinga-node-2" ]
}

The certificate files must be readable by the user Icinga 2 is running as. Also, the private key file should not be world-readable.

Peers configures the direction used to connect multiple nodes together. If have a three node cluster consisting of

  • node-1
  • node-2
  • node-3

and node-3 is only reachable from node-2, you have to consider this in your peer configuration.

Configure Cluster Endpoints

In addition to the configured port and hostname every endpoint can have specific abilities to send configuration files to other nodes and limit the hosts allowed to send configuration files.

Configuration Setting Value
host hostname
port port
accept_config defines all nodes allowed to send configs
config_files defines all files to be send to that node - MUST BE AN ABSOLUTE PATH

A sample config part can look like this:

/**
 * Configure config master endpoint
 */

object Endpoint "icinga-node-1" {
  host = "icinga-node-1.localdomain"
  port = 8888
  config_files = ["/etc/icinga2/conf.d/*.conf"]
}

If you update the configuration files on the configured file sender, it will force a restart on all receiving nodes after validating the new config.

A sample config part for a config receiver endpoint can look like this:

/**
 * Configure config receiver endpoint
 */

object Endpoint "icinga-node-2" {
  host = "icinga-node-2.localdomain"
  port = 8888
  accept_config = [ "icinga-node-1" ]
}

By default these configuration files are saved in /var/lib/icinga2/cluster/config.

In order to load configuration files which were received from a remote Icinga 2 instance you will have to add the following include directive to your icinga2.conf configuration file:

include_recursive IcingaLocalStateDir + "/lib/icinga2/cluster/config"

Cluster Naming Convention

The SSL certificate common name (CN) will be used by the ClusterListener object to determine the local authority. This name must match the local Endpoint object name.

Example:

# icinga2-build-key icinga-node-1
...
Common Name (e.g. server FQDN or YOUR name) [icinga-node-1]:

# vim cluster.conf

object Endpoint "icinga-node-1" {
  host = "icinga-node-1.localdomain"
  port = 8888
}

The Endpoint name is further referenced as peers attribute on the ClusterListener object.

object Endpoint "icinga-node-2" {
  host = "icinga-node-2.localdomain"
  port = 8888
}

object ClusterListener "cluster" {
  ca_path = "/etc/icinga2/ca/ca.crt"
  cert_path = "/etc/icinga2/ca/icinga-node-1.crt"
  key_path = "/etc/icinga2/ca/icinga-node-1.key"

  bind_port = 8888

  peers = [ "icinga-node-2" ]
}

Specifying the local node name using the IcingaNodeName variable requires the same name as used for the endpoint name and common name above. If not set, the FQDN is used.

const IcingaNodeName = "icinga-node-1"

Initial Cluster Sync

In order to make sure that all of your cluster nodes have the same state you will have to pick one of the nodes as your initial "master" and copy its state file to all the other nodes.

You can find the state file in /var/lib/icinga2/icinga2.state. Before copying the state file you should make sure that all your cluster nodes are properly shut down.

Assign Services to Cluster Nodes

By default all services are distributed among the cluster nodes with the Checker feature enabled. If you require specific services to be only executed by one or more checker nodes within the cluster, you must define authorities as additional service object attribute. Required Endpoints must be defined as array.

apply Service "dmz-oracledb" {
  import "generic-service"

  authorities = [ "icinga-node-1" ]

  assign where "oracle" in host.groups
}

The most common use case is building a master-slave cluster. The master node does not have the checker feature enabled, and the slave nodes are checking services based on their location, inheriting from a global service template defining the authorities.

Cluster Health Check

The Icinga 2 ITL ships an internal check command checking all configured EndPoints in the cluster setup. The check result will become critical if one or more configured nodes are not connected.

Example:

apply Service "cluster" {
    import "generic-service"

    check_interval = 1m
    check_command = "cluster"
    authorities = [ "icinga2a" ]

    assign where host.name = "icinga2a"
}

Each cluster node should execute its own local cluster health check to get an idea about network related connection problems from different point of views. Use the authorities attribute to assign the service check to the configured node.

Host With Multiple Cluster Nodes

Special scenarios might require multiple cluster nodes running on a single host. By default Icinga 2 and its features will drop their runtime data below the prefix IcingaLocalStateDir. By default packages will set that path to /var. You can either set that variable as constant configuration definition in icinga2.conf or pass it as runtime variable to the Icinga 2 daemon.

# icinga2 -c /etc/icinga2/node1/icinga2.conf -DIcingaLocalStateDir=/opt/node1/var

Cluster Scenarios

Features in Cluster

Each cluster instance may use available features. If you have multiple locations or departments, they may write to their local database, or populate graphite. Even further all commands are distributed (unless prohibited using Domains).

DB IDO on the left, graphite on the right side - works. Icinga Web 2 on the left, checker and notifications on the right side - works too. Everything on the left and on the right side - make sure to deal with duplicated notifications and automated check distribution.

Location Based Cluster

That scenario fits if your instances are spread over the globe and they all report to a central instance. Their network connection only works towards the central master (or the master is able to connect, depending on firewall policies) which means remote instances won't see each/connect to each other.

All events are synced to the central node, but the remote nodes can still run local features such as a web interface, reporting, graphing, etc.

Imagine the following example with a central node in Nuremberg, and two remote DMZ based instances in Berlin and Vienna. The configuration tree on the central instance could look like this:

conf.d/
  templates/
  germany/
    nuremberg/
      hosts.conf
    berlin/
      hosts.conf
  austria/
    vienna/
      hosts.conf

The configuration deployment should look like:

  • The node nuremberg sends conf.d/germany/berlin to the berlin node.
  • The node nuremberg sends conf.d/austria/vienna to the vienna node.

conf.d/templates is shared on all nodes.

The endpoint configuration on the nuremberg node would look like:

object Endpoint "nuremberg" {
  host = "nuremberg.icinga.org"
  port = 8888
}

object Endpoint "berlin" {
  host = "berlin.icinga.org"
  port = 8888
  config_files_recursive = [ "/etc/icinga2/conf.d/templates",
                             "/etc/icinga2/conf.d/germany/berlin" ]
}

object Endpoint "vienna" {
  host = "vienna.icinga.org"
  port = 8888
  config_files_recursive = [ "/etc/icinga2/conf.d/templates",
                             "/etc/icinga2/conf.d/austria/vienna" ]
}

Each remote node will only peer with the central nuremberg node. Therefore only two endpoints are required for cluster connection. Furthermore the remote node must include the received configuration by the cluster functionality.

Example for the configuration on the berlin node:

object Endpoint "nuremberg" {
  host = "nuremberg.icinga.org"
  port = 8888
}

object Endpoint "berlin" {
  host = "berlin.icinga.org"
  port = 8888
  accept_config = [ "nuremberg" ]
}

include_recursive IcingaLocalStateDir + "/lib/icinga2/cluster/config"

Depenending on the network connectivity the connections can be either established by the remote node or the central node.

Example for berlin node connecting to central nuremberg node:

library "cluster"

object ClusterListener "berlin-cluster" {
  ca_path = "/etc/icinga2/ca/ca.crt"
  cert_path = "/etc/icinga2/ca/berlin.crt"
  key_path = "/etc/icinga2/ca/berlin.key"
  bind_port = 8888
  peers = [ "nuremberg" ]
}

Example for central nuremberg node connecting to remote nodes:

library "cluster"

object ClusterListener "nuremberg-cluster" {
  ca_path = "/etc/icinga2/ca/ca.crt"
  cert_path = "/etc/icinga2/ca/nuremberg.crt"
  key_path = "/etc/icinga2/ca/nuremberg.key"
  bind_port = 8888
  peers = [ "berlin", "vienna" ]
}

The central node should not do any checks by itself. There's two possibilities to achieve that:

  • Disable the checker feature
  • Pin the service object configuration to the remote endpoints using the authorities attribute.

Load Distribution

If you are planning to off-load the checks to a defined set of remote workers you can achieve that by:

  • Deploying the configuration on all nodes.
  • Let Icinga 2 distribute the load amongst all available nodes.

That way all remote check instances will receive the same configuration but only execute their part. The central instance can also execute checks, but you may also disable the Checker feature.

conf.d/
  templates/
  many/

If you are planning to have some checks executed by a specific set of checker nodes just pin them using the authorities attribute.

Example on the central node:

object Endpoint "central" {
  host = "central.icinga.org"
  port = 8888
}

object Endpoint "checker1" {
  host = "checker1.icinga.org"
  port = 8888
  config_files_recursive = [ "/etc/icinga2/conf.d" ]
}

object Endpoint "checker2" {
  host = "checker2.icinga.org"
  port = 8888
  config_files_recursive = [ "/etc/icinga2/conf.d" ]
}

object ClusterListener "central-cluster" {
  ca_path = "/etc/icinga2/ca/ca.crt"
  cert_path = "/etc/icinga2/ca/central.crt"
  key_path = "/etc/icinga2/ca/central.key"
  bind_port = 8888
  peers = [ "checker1", "checker2" ]
}

Example on checker1 node:

object Endpoint "central" {
  host = "central.icinga.org"
  port = 8888
}

object Endpoint "checker1" {
  host = "checker1.icinga.org"
  port = 8888
  accept_config = [ "central" ]
}

object Endpoint "checker2" {
  host = "checker2.icinga.org"
  port = 8888
  accept_config = [ "central" ]
}

object ClusterListener "checker1-cluster" {
  ca_path = "/etc/icinga2/ca/ca.crt"
  cert_path = "/etc/icinga2/ca/checker1.crt"
  key_path = "/etc/icinga2/ca/checker1.key"
  bind_port = 8888
}

High Availability

Two nodes in a high availability setup require an initial cluster sync. Furthermore the active master node should deploy the configuration to the second node, if that does not already happen by your provisioning tool. It primarly depends which features are enabled/used. It is still required that some failover mechanism detects for example which instance will be the notification "master".

Multiple Hierachies

Your central instance collects all check results for reporting and graphing and also does some sort of additional notifications. The customers got their own instances in their local DMZs. They are limited to read/write only their services, but replicate all events back to the central instance. Within each DMZ there are additional check instances also serving interfaces for local departments. The customers instances will collect all results, but also send them back to your central instance. Additionally the customers instance on the second level in the middle prohibits you from sending commands to the down below department nodes. You're only allowed to receive the results, and a subset of each customers configuration too.

Your central instance will generate global reports, aggregate alert notifications and check additional dependencies (for example, the customers internet uplink and bandwidth usage).

The customers instance will only check a subset of local services and delegate the rest to each department. Even though it acts as configuration master with a central dashboard for all departments managing their configuration tree which is then deployed to all department instances. Furthermore the central NOC is able to see what's going on.

The instances in the departments will serve a local interface, and allow the administrators to reschedule checks or acknowledge problems for their services.