Normally if for some reason an ack comment still exists on a checkable not
acked anymore, still clean it up. But while replaying log config objects
incl. ack comments come before check results and acks. I.e. 1) ack comment,
2) DOWN check result and 3) ack. Not 1) DOWN check result, 2) ack and 3) ack
comment. So the checkable is temporarily not acked, but already has the ack
comment. In this case the DOWN check result which is older than the ack
comment shall not clean up the latter.
This commit changes the Checkable notification suppression logic (notifications
are currently suppressed on the Checkable if it is unreachable, in a downtime,
or acknowledged) to that after the suppression reason ends, a state
notification is sent if and only if the first hard state after is different
from the last hard state from before. If the checkable is in a soft state after
the suppression ends, the notification is further suppressed until a hard state
is reached.
To achieve this behavior, a new attribute state_before_suppression is added to
Checkable. This attribute is set to the last hard state the first time either a
PROBLEM or a RECOVERY notification is suppressed. Compared to from before,
neither of these two flags in the suppressed_notification will ever be cleared
while the supression is still ongoing but only after the suppression ended and
the current state is compared with the old state stored in
state_before_suppression.
Without this commit, children and parents of a checkable were rescheduled on a
state change while holding the lock for the current checkable. If both ends of
a dependency are checked at the same time and both change state, they could end
up in a deadlock waiting for each other.
This commit fixes this problem by changing the code so that other checkables
are rescheduled only after releasing the lock for the current checkable.
to ensure Checkable#IsReachable() returns correctly for dependency children inside OnReachabilityChanged().
That needs the dependency parent to be already in the correct state.
refs #9143
When triggering a downtime, the time of the causing event is now passed on as
the trigger time. That time is:
* For fixed downtimes: the later one of start and entry time.
* If a check result triggers the downtime: The execution end of the check
result.
* If another downtime triggers the downtime: The trigger time of the first
downtime.
This is done so two nodes in a HA setup can write consistent Icinga DB downtime
history streams.
refs #9101
* Implement scheduling_source attribute
This implements the attribute `scheduling_source` for hosts and services to show which endpoint is running the scheduler for the check.
refs #4814
This patch changes the way checkresults are handled during a restart.
1. Check results coming in during a shutdown are ignored.
2. Upon start, checks which should have ran (next_check in the past),
are re-scheduled within the first minute.
This new behavior means there will be no more "Unknown - Terminated"
checkresults during a restart and checks with high check_interval will
be run earlier if they were already scheduled to run. The downside is
that after Icinga2 was down for a while, there will be a lot of checks
within the first minute. Our max concurrent check should take care of
this though.