Merge pull request #6775 from Icinga/feature/docs-tech-concepts-check-scheduler

Add technical docs for the check scheduler (general, initial check, offsets)
2025-07-27 23:54:07 +02:00 · 2018-11-14 14:41:23 +01:00 · 2018-11-14 14:41:23 +01:00 · aec0f6de15
commit aec0f6de15
parent 2ce3e1d49d 9409e93bc7
1 changed files with 190 additions and 0 deletions
--- a/doc/19-technical-concepts.md
+++ b/doc/19-technical-concepts.md
@ -6,6 +6,7 @@ into specific Icinga 2 components such as:
 * [Application](19-technical-concepts.md#technical-concepts-application)
 * [Configuration](19-technical-concepts.md#technical-concepts-configuration)
 * [Features](19-technical-concepts.md#technical-concepts-features)
+* [Check Scheduler](19-technical-concepts.md#technical-concepts-check-scheduler)
 * [Cluster](19-technical-concepts.md#technical-concepts-cluster)
 * [TLS Network IO](19-technical-concepts.md#technical-concepts-tls-network-io)

@ -198,6 +199,195 @@ daemon of Graphite. The InfluxDBWriter is instead writing bulk metric messages
 to InfluxDB's HTTP API, similar to Elasticsearch.


+## Check Scheduler <a id="technical-concepts-check-scheduler"></a>
+
+The check scheduler starts a thread which loops forever. It waits for
+check events being inserted into `m_IdleCheckables`.
+
+If the current pending check event number is larger than the configured
+max concurrent checks, the thread waits up until it there's slots again.
+
+In addition, further checks on enabled checks, check periods, etc. are
+performed. Once all conditions have passed, the next check timestamp is
+calculated and updated. This also is the timestamp where Icinga expects
+a new check result ("freshness check").
+
+The object is removed from idle checkables, and inserted into the
+pending checkables list. This can be seen via REST API metrics for the
+checker component feature as well.
+
+The actual check execution happens asynchronously using the application's
+thread pool.
+
+Once the check returns, it is removed from pending checkables and again
+inserted into idle checkables. This ensures that the scheduler takes this
+checkable event into account in the next iteration.
+
+### Start <a id="technical-concepts-check-scheduler-start"></a>
+
+When checkable objects get activated during the startup phase,
+the checker feature registers a handler for this event. This is due
+to the fact that the `checker` feature is fully optional, and e.g. not
+used on command endpoint clients.
+
+Whenever such an object activation signal is triggered, Icinga 2 checks
+whether it is [authoritative for this object](19-technical-concepts.md#technical-concepts-cluster-ha-object-authority).
+This means that inside an HA enabled zone with two endpoints, only non-paused checkable objects are
+actively inserted into the idle checkable list for the check scheduler.
+
+### Initial Check <a id="technical-concepts-check-scheduler-initial"></a>
+
+When a new checkable object (host or service) is initially added to the
+configuration, Icinga 2 performs the following during startup:
+
+* `Checkable::Start()` is called and calculates the first check time
+* With a spread delta, the next check time is actually set.
+
+If the next check should happen within a time frame of 60 seconds,
+Icinga 2 calculates a delta from a random value. The minimum of `check_interval`
+and 60 seconds is used as basis, multiplied with a random value between 0 and 1.
+
+In the best case, this check gets immediately executed after application start.
+The worst case scenario is that the check is scheduled 60 seconds after start
+the latest.
+
+The reasons for delaying and spreading checks during startup is that
+the application typically needs more resources at this time (cluster connections,
+feature warmup, initial syncs, etc.). Immediate check execution with
+thousands of checks could lead into performance problems, and additional
+events for each received check results.
+
+Therefore the initial check window is 60 seconds on application startup,
+random seed for all checkables. This is not predictable over multiple restarts
+for specific checkable objects, the delta changes every time.
+
+### Scheduling Offset <a id="technical-concepts-check-scheduler-offset"></a>
+
+There's a high chance that many checkable objects get executed at the same time
+and interval after startup. The initial scheduling spreads that a little, but
+Icinga 2 also attempts to ensure to keep fixed intervals, even with high check latency.
+
+During startup, Icinga 2 calculates the scheduling offset from a random number:
+
+* `Checkable::Checkable()` calls `SetSchedulingOffset()` with `Utility::Random()`
+* The offset is a pseudo-random integral value between `0` and `RAND_MAX`.
+
+Whenever the next check time is updated with `Checkable::UpdateNextCheck()`,
+the scheduling offset is taken into account.
+
+Depending on the state type (SOFT or HARD), either the `retry_interval` or `check_interval`
+is used. If the interval is greater than 1 second, the time adjustment is calculated in the
+following way:
+
+`now * 100 + offset` divided by `interval * 100`, using the remainder (that's what `fmod()` is for)
+and dividing this again onto base 100.
+
+Example: offset is 6500, interval 300, now is 1542190472.
+
+```
+1542190472 * 100 + 6500 = 154219053714
+300 * 100 = 30000
+154219053714 / 30000 = 5140635.1238
+
+(5140635.1238 - 5140635.0) * 30000 = 3714
+3714 / 100 = 37.14
+```
+
+37.15 seconds as an offset would be far too much, so this is again used as a calculation divider for the
+real offset with the base of 5 times the actual interval.
+
+Again, the remainder is calculated from the offset and `interval * 5`. This is divided onto base 100 again,
+with an additional 0.5 seconds delay.
+
+Example: offset is 6500, interval 300.
+
+```
+6500 / 300 = 21.666666666666667
+(21.666666666666667 - 21.0) * 300 = 200
+200 / 100 = 2
+2 + 0.5 = 2.5
+```
+
+The minimum value between the first adjustment and the second offset calculation based on the interval is
+taken, in the above example `2.5` wins.
+
+The actual next check time substracts the adjusted time from the future interval addition to provide
+a more widespread scheduling time among all checkable objects.
+
+`nextCheck = now - adj + interval`
+
+You may ask, what other values can happen with this offset calculation. Consider calculating more examples
+with different interval settings.
+
+Example: offset is 34567, interval 60, now is 1542190472.
+
+```
+1542190472 * 100 + 34567 = 154219081767
+60 * 100 = 6000
+154219081767 / 6000 = 25703180.2945
+(25703180.2945 - 25703180.0) * 6000 / 100 = 17.67
+
+34567 / 60 = 576.116666666666667
+(576.116666666666667 - 576.0) * 60 / 100 + 0.5 = 1.2
+```
+
+`1m` interval starts at `now + 1.2s`.
+
+Example: offset is 12345, interval 86400, now is 1542190472.
+
+```
+1542190472 * 100 + 12345 = 154219059545
+86400 * 100 = 8640000
+154219059545 / 8640000 = 17849.428188078703704
+(17849.428188078703704 - 17849) * 8640000 = 3699545
+3699545 / 100 = 36995.45
+
+12345 / 86400 = 0.142881944444444
+0.142881944444444 * 86400 / 100 + 0.5 = 123.95
+```
+
+`1d` interval starts at `now + 2m4s`.
+
+> **Note**
+>
+> In case you have a better algorithm at hand, feel free to discuss this in a PR on GitHub.
+> It needs to fulfill two things: 1) spread and shuffle execution times on each `next_check` update
+> 2) not too narrowed window for both long and short intervals
+> Application startup and initial checks need to be handled with care in a slightly different
+> fashion.
+
+When `SetNextCheck()` is called, there are signals registered. One of them sits
+inside the `CheckerComponent` class whose handler `CheckerComponent::NextCheckChangedHandler()`
+deletes/inserts the next check event from the scheduling queue. This basically
+is a list with multiple indexes with the keys for scheduling info and the object.
+
+
+### Check Latency and Execution Time <a id="technical-concepts-check-scheduler-latency"></a>
+
+Each check command execution logs the start and end time where
+Icinga 2 (and the end user) is able to calculate the plugin execution time from it.
+
+```
+GetExecutionEnd() - GetExecutionStart()
+```
+
+The higher the execution time, the higher the command timeout must be set. Furthermore
+users and developers are encouraged to look into plugin optimizations to minimize the
+execution time. Sometimes it is better to let an external daemon/script do the checks
+and feed them back via REST API.
+
+Icinga 2 stores the scheduled start and end time for a check. If the actual
+check execution time differs from the scheduled time, e.g. due to performance
+problems or limited execution slots (concurrent checks), this value is stored
+and computed from inside the check result.
+
+The difference between the two deltas is called `check latency`.
+
+```
+(GetScheduleEnd() - GetScheduleStart()) - CalculateExecutionTime()
+```
+
+
 ## Cluster <a id="technical-concepts-cluster"></a>

 ### Communication <a id="technical-concepts-cluster-communication"></a>