Merge pull request #7420 from Icinga/feature/docs-tech-cluster-routing

Docs: Add cluster message routing details
Michael Friedrich 2019-08-14 16:51:50 +02:00 committed by GitHub
commit ada7c3d80e
1 changed file with 65 additions and 3 deletions


@@ -754,7 +754,7 @@ that by grepping the log file for `resumed` and `paused`.
Specific features with HA capabilities are explained below.
#### High Availability: Checker <a id="technical-concepts-cluster-ha-checker"></a>
The `checker` feature only executes checks for `Checkable` objects (Host, Service)
where it is authoritative.
@@ -765,7 +765,7 @@ The cluster message routing ensures that all check results are synchronized
to nodes which are not authoritative for this configuration object.
#### High Availability: Notifications <a id="technical-concepts-cluster-notifications"></a>
The `notification` feature only sends notifications for `Notification` objects
where it is authoritative.
@@ -775,7 +775,7 @@ That way each node only executes notifications for a segment of all notification
Notified users and other event details are synchronized throughout the cluster.
This is required if for example the DB IDO feature is active on the other node.
#### High Availability: DB IDO <a id="technical-concepts-cluster-ha-ido"></a>
If you don't have HA enabled for the IDO feature, both nodes will
write their status and historical data to their own separate database
@@ -1052,6 +1052,68 @@ when e.g. hostgroup membership is used for assign where expressions, but the gro
available on the master.
### Cluster: Message Routing <a id="technical-concepts-cluster-message-routing"></a>
One fundamental part of the cluster message routing is the MessageOrigin object.
It is created when a new JSON-RPC message is received in `JsonRpcConnection::MessageHandler()`
and contains:
- `FromZone`, extracted from the endpoint object which owns the JsonRpcConnection
- `FromClient`, the JsonRpcConnection bound to the endpoint object
These attributes are checked in the message receive API handlers for access control, e.g. whether a
message originates from a child zone which is not allowed to send it.
This is explained in the [JSON-RPC messages](19-technical-concepts.md#technical-concepts-json-rpc-messages) chapter.
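As a rough illustration, a MessageOrigin-like structure and such an access check could look like the
sketch below. These are simplified standalone types with assumed names, not the actual Icinga 2 classes;
the real handlers walk the configured zone hierarchy instead of comparing plain zone names.
```
#include <iostream>
#include <memory>
#include <string>

struct Zone {
    std::string name;
};

struct Endpoint {
    std::string name;
    std::shared_ptr<Zone> zone;
};

struct JsonRpcConnection {
    std::shared_ptr<Endpoint> endpoint;
    std::shared_ptr<Endpoint> GetEndpoint() const { return endpoint; }
};

// Created once per received JSON-RPC message: remembers where it came from.
struct MessageOrigin {
    std::shared_ptr<Zone> FromZone;                // zone of the endpoint owning the connection
    std::shared_ptr<JsonRpcConnection> FromClient; // the connection the message arrived on
};

// Sketch of an access check in a message handler: drop messages whose origin zone
// is not responsible for the object they refer to.
bool CheckResultApiHandler(const MessageOrigin& origin, const std::string& objectZoneName) {
    if (origin.FromZone && origin.FromZone->name != objectZoneName) {
        std::cout << "Discarding message from unauthorized zone '"
                  << origin.FromZone->name << "'\n";
        return false;
    }
    // ... process the check result ...
    return true;
}

int main() {
    auto zone = std::make_shared<Zone>(Zone{"satellite"});
    MessageOrigin origin{zone, nullptr};
    CheckResultApiHandler(origin, "master"); // rejected: origin zone does not own the object
}
```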
Whenever such a message is processed on the client, it may trigger additional cluster events
which are sent back to other endpoints. Therefore it is key to always pass the MessageOrigin
`origin` when processing these messages locally.
Example:
- Client receives a CheckResult from another endpoint in the same zone, call it `sender` for now
- Calls ProcessCheckResult() to store the check result and calculate states, notifications, etc.
- Fires the OnNewCheckResult() signal to trigger IDO updates
OnNewCheckResult() also calls a registered cluster handler which forwards the CheckResult to other cluster members.
Without any origin details, this CheckResult would be relayed to the `sender` endpoint again.
The sender would then process the message again (ProcessCheckResult(), OnNewCheckResult()), relay it back, and so on.
This creates a loop which our cluster protocol needs to prevent at all costs.
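The following minimal sketch of that chain shows why the origin has to travel along with the check
result. The signal mechanism and all names here are simplified assumptions, not the real Icinga 2
signatures:
```
#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct CheckResult { std::string output; };
// Simplified stand-in: the real MessageOrigin carries FromZone and FromClient.
struct MessageOrigin { std::string fromEndpoint; };

// Stand-in for the OnNewCheckResult signal: every handler also receives the origin.
std::vector<std::function<void(const CheckResult&, const std::shared_ptr<MessageOrigin>&)>> OnNewCheckResult;

void ProcessCheckResult(const CheckResult& cr, const std::shared_ptr<MessageOrigin>& origin) {
    // Store the result, recalculate states, notifications, ... (omitted here).
    // Fire the signal; IDO updates and the cluster relay handler all see the origin.
    for (const auto& handler : OnNewCheckResult)
        handler(cr, origin);
}

void ClusterRelayHandler(const CheckResult& cr, const std::shared_ptr<MessageOrigin>& origin) {
    // With the origin available, the relay can skip the endpoint the message came
    // from (see the RelayMessageOne() rules below) instead of echoing it back.
    std::cout << "relaying check result '" << cr.output << "', skipping sender "
              << (origin ? origin->fromEndpoint : "<local>") << "\n";
}

int main() {
    OnNewCheckResult.push_back(ClusterRelayHandler);

    // A check result received from another endpoint keeps its origin attached.
    auto origin = std::make_shared<MessageOrigin>(MessageOrigin{"sender"});
    ProcessCheckResult(CheckResult{"PING OK"}, origin);
}
```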
RelayMessageOne() takes care of the routing. It fetches the targetZone for this message and its endpoints, and applies the following rules for each target endpoint:
- Don't relay messages to ourselves.
- Don't relay messages to disconnected endpoints.
- Don't relay the message to the zone through more than one endpoint unless this is our own zone.
- Don't relay messages back to the endpoint which we got the message from. **THIS**
- Don't relay messages back to the zone which we got the message from.
- Only relay the message to the zone master if we're not currently the zone master.
```
e1 is zone master, e2 and e3 are zone members.
Message is sent from e2 or e3:
!isMaster == true
targetEndpoint e1 is zone master -> send the message
targetEndpoint e3 is not zone master -> skip it, avoid routing loops
Message is sent from e1:
!isMaster == false -> e1 is the zone routing master and sends the message to both e2 and e3.
```
By passing the `origin`, the following condition prevents sending a message back to the sender:
```
if (origin && origin->FromClient && targetEndpoint == origin->FromClient->GetEndpoint()) {
```
This message then simply gets skipped for this specific Endpoint and is never sent.
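Putting the rules together, a hedged sketch of the per-endpoint routing decision might look as follows.
The types and the free-standing function are simplified assumptions for illustration; the real
RelayMessageOne() operates on Icinga 2's own Zone, Endpoint and MessageOrigin objects.
```
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Endpoint;

struct Zone {
    std::vector<std::shared_ptr<Endpoint>> endpoints;
};

struct Endpoint {
    std::string name;
    std::shared_ptr<Zone> zone;
    bool connected = false;
};

struct JsonRpcConnection {
    std::shared_ptr<Endpoint> endpoint;
    std::shared_ptr<Endpoint> GetEndpoint() const { return endpoint; }
};

struct MessageOrigin {
    std::shared_ptr<Zone> FromZone;
    std::shared_ptr<JsonRpcConnection> FromClient;
};

void RelayMessageOne(const std::shared_ptr<Zone>& targetZone,
                     const std::shared_ptr<Zone>& localZone,
                     const std::shared_ptr<Endpoint>& localEndpoint,
                     const std::shared_ptr<Endpoint>& currentZoneMaster,
                     const std::shared_ptr<MessageOrigin>& origin,
                     const std::string& message)
{
    bool relayed = false;
    bool isMaster = (localEndpoint == currentZoneMaster);

    for (const auto& targetEndpoint : targetZone->endpoints) {
        // Don't relay messages to ourselves.
        if (targetEndpoint == localEndpoint)
            continue;

        // Don't relay messages to disconnected endpoints.
        if (!targetEndpoint->connected)
            continue;

        // Don't relay the message to the zone through more than one endpoint
        // unless this is our own zone.
        if (relayed && targetZone != localZone)
            continue;

        // Don't relay messages back to the endpoint which we got the message from.
        if (origin && origin->FromClient &&
            targetEndpoint == origin->FromClient->GetEndpoint())
            continue;

        // Don't relay messages back to the zone which we got the message from.
        if (origin && origin->FromZone && targetZone == origin->FromZone)
            continue;

        // Only relay the message to the zone master if we're not the zone master.
        if (!isMaster && targetEndpoint != currentZoneMaster)
            continue;

        relayed = true;
        std::cout << "sending message to " << targetEndpoint->name << ": " << message << "\n";
    }
}

int main() {
    auto zone = std::make_shared<Zone>();
    auto e1 = std::make_shared<Endpoint>(Endpoint{"e1", zone, true});
    auto e2 = std::make_shared<Endpoint>(Endpoint{"e2", zone, true});
    auto e3 = std::make_shared<Endpoint>(Endpoint{"e3", zone, true});
    zone->endpoints = {e1, e2, e3};

    // Locally generated message on e2 (no origin): e2 is not the zone master,
    // so it only relays to the zone master e1 and skips e3.
    RelayMessageOne(zone, zone, e2, e1, nullptr, "event::CheckResult");
}
```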
This analysis originates from a long-lasting [downtime loop bug](https://github.com/Icinga/icinga2/issues/7198).
## TLS Network IO <a id="technical-concepts-tls-network-io"></a>
### TLS Connection Handling <a id="technical-concepts-tls-network-io-connection-handling"></a>