1674 Commits

Author SHA1 Message Date
Julian Brost
069a56d84f Use Checkable::GetStateBeforeSuppression() only where relevant
This fixes an issue where recovery notifications get lost if they happen
outside of a notification time period.

Not all calls to `Checkable::NotificationReasonApplies()` need
`GetStateBeforeSuppression()` to be checked. In fact, for one caller,
`FireSuppressedNotifications()` in
`lib/notification/notificationcomponent.cpp`, the state before suppression may
not even be initialized properly, so that the default value of OK is used which
can lead to incorrect return values. Note the difference between suppressions
happening at the `Checkable` object level and at the `Notification`
object level. Only the first sets the state before suppression in the
`Checkable` object, but so far, the latter incorrectly used that value as well.

This commit moves the check of `GetStateBeforeSuppression()` from
`Checkable::NotificationReasonApplies()` to the one place where it's actually
relevant: `Checkable::FireSuppressedNotifications()`. This made the existing
call to `NotificationReasonApplies()` unnecessary as it would always return
true: the `type` argument is computed based on the current check result, so
there's no need to check it against the current check result.
2024-11-14 15:01:14 +01:00
Yonas Habteab
f66f99516b Checkable: Don't recalculate next_check while processing remotely generated check
Currently, when processing a `CheckResult`, it will first trigger an
`OnNextCheckChanged` event, which is sent to all connected endpoints.
Then, when `Checkable::ProcessCheckResult()` returns, an `OnCheckResult`
event is fired, which is of course also sent to all connected endpoints.

Next, the other endpoints receive the `event::SetNextCheck` cluster
event followed by `event::CheckResult` and invoke
`Checkable#SetNextCheck()` and `Checkable#CheckResult()` with the newly
received check. So they also try to recalculate the next check
themselves and invalidate the previously received next check timestamp
from the source endpoint. Since each endpoint randomly initialises its
own scheduling offset, the recalculated next check will always differ by
a split second/millisecond on each of them. As a consequence, two Icinga
DB HA instances will generate two different checksums for the same state,
which causes the state histories to be fully resynchronised after a
takeover/Icinga 2 reload.
2024-09-20 12:24:39 +02:00
Yonas Habteab
20a73f6d13
Merge pull request #10127 from Icinga/AddDowntime-trigger_name-2.13
Downtime::AddDowntime(): NULL-check pointer before deref not to crash
2024-09-18 10:50:16 +02:00
Yonas Habteab
3d1a2b4e44
Merge pull request #10128 from Icinga/broken-timeperiod-2.13
Fix broken `TimePeriod/ScheduledDowntime`s
2024-09-18 10:36:41 +02:00
Alexander A. Klimov
7761d6f78b l_LegacyDowntimesCache: delete removed objects not to leak memory 2024-09-17 16:44:57 +02:00
Alexander A. Klimov
d09d051fad /v1/actions/schedule-downtime: reject request on invalid trigger_name
For this purpose, look up the specified Downtime. Also pass Downtime objects,
not just names, to Downtime::AddDowntime() so that it is not looked up twice.
2024-09-17 16:44:57 +02:00
Alexander A. Klimov
0bba441a86 [Refactor] Downtime::GetDowntimeIDFromLegacyID(): return the Downtime itself
not just its name.
2024-09-17 16:44:57 +02:00
Alexander A. Klimov
5a155346e7 [Refactor] l_LegacyDowntimesCache: store Downtime objects, not just their names
to avoid names of vanished objects.
2024-09-17 16:44:57 +02:00
Yonas Habteab
e611dc65ea Check segment start date inclusively in TimePeriod::IsInside() 2024-09-17 16:44:32 +02:00
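Commit e611dc65ea above makes the segment start inclusive. A minimal sketch of such a check, assuming segments are plain begin/end timestamp pairs and that the end stays exclusive. The `Segment` struct and this free `IsInside()` function are illustrative only, not Icinga 2's actual types:

```cpp
#include <cassert>
#include <vector>

// Illustrative type: a time range given as Unix timestamps.
struct Segment {
    double begin;
    double end;
};

// Returns true if ts lies inside any segment. The start is inclusive (>=);
// treating the end as exclusive (<) is an assumption of this sketch.
bool IsInside(const std::vector<Segment>& segments, double ts)
{
    for (const Segment& s : segments) {
        assert(s.begin <= s.end);

        if (ts >= s.begin && ts < s.end)
            return true;
    }

    return false;
}
```

With an exclusive start (`ts > s.begin`), a timestamp that falls exactly on the first second of a segment would be reported as outside of it.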
Yonas Habteab
3de7ec062b Fix broken timeperiods/scheduleddowntimes 2024-09-17 16:44:32 +02:00
Julian Brost
6986417533 Timeperiods: fix off by one when calculating n-th last weekday of the month
A day specification like "monday -1" refers to the last Monday of the month.
However, there was an off by one if the first day of the next month is the same
day of the week, i.e. a Monday in this example.

LegacyTimePeriod::FindNthWeekday() picks a day to start the search for the day
in question. When given a negative n to search for the n-th last day, it
wrongly used the first day of the following month as the start and counted it
as if it was within the current month. This resulted in a 1/7 chance that the
result was one week too late.

This is fixed by using the last day of the current month instead.
2024-09-17 16:44:09 +02:00
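The fix in commit 6986417533 above (start the backwards search at the last day of the current month) can be illustrated with a small standalone sketch. `NthLastWeekdayOfMonth()` is a hypothetical helper written for this illustration; it is not the code of `LegacyTimePeriod::FindNthWeekday()`:

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: returns the day of month of the n-th last `wday`
// (0 = Sunday ... 6 = Saturday) in the given month (1-12), or 0 if the
// month does not contain that many such weekdays.
int NthLastWeekdayOfMonth(int year, int month, int wday, int n)
{
    assert(n >= 1 && month >= 1 && month <= 12 && wday >= 0 && wday <= 6);

    // Day 0 of the following month is normalized by mktime() to the last
    // day of the requested month, so the search never leaves that month.
    std::tm t{};
    t.tm_year = year - 1900;
    t.tm_mon = month;  // tm_mon is 0-based, so this is the following month
    t.tm_mday = 0;
    t.tm_hour = 12;    // noon keeps clear of DST switches around midnight
    t.tm_isdst = -1;
    std::mktime(&t);

    int lastDay = t.tm_mday;
    int back = (t.tm_wday - wday + 7) % 7;  // days back to the last `wday`
    int day = lastDay - back - 7 * (n - 1);

    return day >= 1 ? day : 0;
}
```

For June 2024, where the first day of the following month (July 1) is a Monday, `NthLastWeekdayOfMonth(2024, 6, 1, 1)` yields 24, the last Monday that is still inside June.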
Alexander Aleksandrovič Klimov
45d5a3f5f3
Merge pull request #9817 from Icinga/flexible-downtimes-disappear-too-early-9797
Downtime#Start(): trigger flexible downtimes not earlier than fixed ones
2023-07-05 17:06:03 +02:00
Alexander Aleksandrovič Klimov
51afc74310
Merge pull request #9820 from Icinga/2.13.8/checkable-processcheckresult-only-clean-up-ack-comments-older-than-check-result-9718
Checkable#ProcessCheckResult(): only clean up ack comments older than check result
2023-07-05 11:20:55 +02:00
Yonas Habteab
ff0b45eca0 PluginUtility: Fix PerfData not being parsed correctly
The problem was that some PerfData labels contained several whitespace characters,
not just one, and therefore it was parsed incorrectly in `SplitPerfdata()`. I.e. the condition
in line 144 checks whether the first and last character is a normal quote, but since the
label can contain spaces at the beginning and at the end respectively, this caused the problems.

This PR fixes the problem by removing all occurring whitespace from the beginning and end,
before starting to parse the actual label.
2023-07-04 11:10:38 +02:00
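The approach described in commit ff0b45eca0 above (trim first, then look at the quotes) might look roughly like the following. `NormalizePerfdataLabel()` is a hypothetical helper, not `SplitPerfdata()` itself, and it assumes that the "normal quote" is the single quote used around perfdata labels:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: strips surrounding whitespace from a perfdata label
// and then removes one pair of enclosing single quotes, if present.
std::string NormalizePerfdataLabel(const std::string& raw)
{
    const char* whitespace = " \t\r\n";

    std::string::size_type begin = raw.find_first_not_of(whitespace);
    if (begin == std::string::npos)
        return "";  // the label consists of whitespace only

    std::string::size_type end = raw.find_last_not_of(whitespace);
    assert(end >= begin);

    std::string label = raw.substr(begin, end - begin + 1);

    // Only after trimming are the first and last characters meaningful.
    if (label.size() >= 2 && label.front() == '\'' && label.back() == '\'')
        label = label.substr(1, label.size() - 2);

    return label;
}
```

Without the trimming step, an input such as `  'disk usage'  ` starts and ends with a space, so the quote check does not match and the quotes stay part of the label.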
Alexander A. Klimov
6dffc57a37 Checkable#ProcessCheckResult(): only clean up ack comments older than check result
Normally, if for some reason an ack comment still exists on a checkable that
is not acked anymore, it is cleaned up. But while replaying the log, config
objects incl. ack comments come before check results and acks, i.e. 1) ack
comment, 2) DOWN check result and 3) ack, not 1) DOWN check result, 2) ack and
3) ack comment. So the checkable is temporarily not acked, but already has the
ack comment. In this case the DOWN check result, which is older than the ack
comment, shall not clean up the latter.
2023-07-04 10:56:30 +02:00
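The entry time filter from commit 6dffc57a37 above can be sketched as follows. The `Comment` struct and `RemoveAckCommentsOlderThan()` are simplified stand-ins invented for this illustration; the real `Checkable#RemoveAckComments()` signature is not shown in this log:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for a comment object.
struct Comment {
    bool isAck;        // true for acknowledgement comments
    double entryTime;  // Unix timestamp at which the comment was entered
};

// Removes only those ack comments that were entered before `olderThan`
// (e.g. the execution time of the check result being processed) and
// returns how many comments were removed.
std::size_t RemoveAckCommentsOlderThan(std::vector<Comment>& comments, double olderThan)
{
    auto firstRemoved = std::remove_if(comments.begin(), comments.end(),
        [olderThan](const Comment& c) { return c.isAck && c.entryTime < olderThan; });

    std::size_t removed = static_cast<std::size_t>(comments.end() - firstRemoved);
    comments.erase(firstRemoved, comments.end());

    return removed;
}
```

An ack comment that is newer than the replayed DOWN check result survives the cleanup, which is the behavior the commit message asks for.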
Alexander A. Klimov
c160c4b62e Checkable#RemoveAckComments(): add optional comment entry time filter 2023-07-04 10:56:30 +02:00
Alexander A. Klimov
0470fe12a7 Checkable#RemoveCommentsByType(): remove redundant parameter 2023-07-04 10:56:30 +02:00
Alexander A. Klimov
43c4feb645 Downtime#Start(): trigger flexible downtimes not earlier than fixed ones
the last state change could be a long time ago. If it's longer than
the new downtime's duration, the downtime expires immediately.

trigger time + duration < now
2023-07-04 10:39:14 +02:00
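The condition `trigger time + duration < now` from commit 43c4feb645 above can be made concrete with numbers. The sketch below assumes the fix amounts to never using a trigger time earlier than the downtime's start; both function names are hypothetical and the actual change in `Downtime#Start()` may differ in detail:

```cpp
#include <algorithm>
#include <cassert>

// True if a flexible downtime triggered at `triggerTime` is already over.
bool ExpiresImmediately(double triggerTime, double duration, double now)
{
    assert(duration >= 0);
    return triggerTime + duration < now;
}

// Assumed fix: do not trigger earlier than the downtime itself starts,
// even if the last state change happened long before that.
double ClampedTriggerTime(double lastStateChange, double downtimeStart)
{
    return std::max(lastStateChange, downtimeStart);
}
```

With a last state change at 1000, a downtime starting at 5000 and a duration of 600, triggering at 1000 gives 1600 < 5000, so the downtime would expire at once; triggering at the clamped time 5000 keeps it active until 5600.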
Alexander A. Klimov
34844c146d Deduplicate and stabilize fragile filesystem transactions
by using AtomicFile so they ensure all or nothing of a file gets replaced.
2023-02-15 17:19:57 +01:00
Yonas Habteab
a15bbfe913 Use service short name for evaluating targeted service rules 2022-11-04 12:47:41 +01:00
Alexander A. Klimov
cc67510063 ApplyRule::RuleMap: reduce complexity, save unnecessary lookups 2022-11-04 12:47:41 +01:00
Alexander A. Klimov
0bf093af14 Targeted apply rules: don't unnecessarily eval filter 2022-11-04 12:47:41 +01:00
Alexander A. Klimov
840bfb6be1 Unify storages of regular/targeted apply rules: std::vector<ApplyRule::Ptr> 2022-11-04 12:47:41 +01:00
Alexander A. Klimov
1935137a8d Separately handle apply rules targeting only specific parent objects
not to unnecessarily run e.g. the filter assign where host.name=="example.com"
for all hosts that are not example.com.
2022-11-04 12:47:41 +01:00
Alexander A. Klimov
fcedb01d0d Lookup apply rules faster by Type*, not String and by map instead of ==/!=
1. The lookup of apply rules per source type now implies
   no String(const char*) (no malloc()) and just pointer (uint64) comparisons
2. Apply rules are now also grouped by target type via a nested map, that obsoletes
   checking the target type while iterating over all rules per source type
2022-11-02 11:06:12 +01:00
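The two points of commit fcedb01d0d above (pointer keys instead of strings, and a nested map grouped by target type) can be sketched like this. `Type`, `ApplyRule`, `RuleMap` and `GetRules()` are simplified, hypothetical stand-ins and do not reproduce the real `ApplyRule::RuleMap`:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for the real classes.
struct Type { std::string name; };
struct ApplyRule { std::string name; };

// Outer key: source type, inner key: target type. Both are plain pointers,
// so a lookup hashes and compares addresses and builds no strings.
using RulesByTarget = std::unordered_map<const Type*, std::vector<ApplyRule>>;
using RuleMap = std::unordered_map<const Type*, RulesByTarget>;

// Returns the rules for (source, target), or an empty list if there are none.
const std::vector<ApplyRule>& GetRules(const RuleMap& rules, const Type* source, const Type* target)
{
    static const std::vector<ApplyRule> empty;
    assert(source != nullptr && target != nullptr);

    auto bySource = rules.find(source);
    if (bySource == rules.end())
        return empty;

    auto byTarget = bySource->second.find(target);
    if (byTarget == bySource->second.end())
        return empty;

    return byTarget->second;
}
```

Because the inner map already selects the target type, the caller no longer has to iterate over every rule of a source type and compare target types with `==`/`!=`.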
Alexander A. Klimov
17c652d8dd Construct string once, not unnecessarily N times 2022-10-31 12:40:59 +01:00
Alexander Aleksandrovič Klimov
6bf16fe4f1
Merge pull request #9395 from Icinga/bugfix/atomic-members-2.13
Replace EventuallyAtomic with AtomicOrLocked which falls back to a mutex
2022-06-23 11:32:18 +02:00
Alexander A. Klimov
20ae49ad49 Introduce Command#arguments[].separator
... for letting check commands produce argv like --key=value,
not just --key value.

refs #6277
2022-06-14 15:04:09 +02:00
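The effect of the `separator` attribute from commit 20ae49ad49 above on the resulting argv can be sketched as follows. `AppendArgument()` is a hypothetical helper for illustration and says nothing about how Icinga 2 builds command lines internally:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: appends one key/value argument to argv.
// With a separator, key and value form a single element (--key=value);
// without one, they are passed as two elements (--key value).
void AppendArgument(std::vector<std::string>& argv, const std::string& key,
    const std::string& value, const std::string* separator = nullptr)
{
    assert(!key.empty());

    if (separator) {
        argv.push_back(key + *separator + value);
    } else {
        argv.push_back(key);
        argv.push_back(value);
    }
}
```

Calling it with `--key`, `value` and the separator `=` produces the single element `--key=value`; omitting the separator produces the two elements `--key` and `value`.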
Julian Brost
184548f4fe Replace EventuallyAtomic with AtomicOrLocked which falls back to a mutex
Apparently there was a reason for making the members of generated classes
atomic. However, this was only done for some types, others were still accessed
using non-atomic operations. For members of type T::Ptr (i.e.  intrusive_ptr<T>),
this can result in a double free when multiple threads access the same variable
and at least one of them writes to the variable.

This commit makes use of std::atomic<T> for more T (it removes the additional
constraint sizeof(T) <= sizeof(void*)) and uses a type including a mutex for
load and store operations as a fallback.
2022-06-14 13:46:40 +02:00
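The idea from commit 184548f4fe above (use `std::atomic<T>` where possible, otherwise fall back to a mutex) can be sketched with a type alias. This is a simplified illustration under the assumption that trivial copyability is the selection criterion; the names mirror the commit message, but the code is not copied from Icinga 2:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <string>
#include <type_traits>
#include <utility>

// Fallback wrapper: protects load and store with a mutex so that types
// like smart pointers or strings can be shared between threads.
template<typename T>
class Locked
{
public:
    T load() const
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        return m_Value;
    }

    void store(T desired)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Value = std::move(desired);
    }

private:
    mutable std::mutex m_Mutex;
    T m_Value{};
};

// Picks std::atomic<T> for trivially copyable T and Locked<T> otherwise.
// Both offer load() and store(), so callers do not need to care which
// one was chosen.
template<typename T>
using AtomicOrLocked = typename std::conditional<
    std::is_trivially_copyable<T>::value, std::atomic<T>, Locked<T>>::type;
```

`AtomicOrLocked<int>` resolves to `std::atomic<int>`, while `AtomicOrLocked<std::string>` resolves to `Locked<std::string>`, so concurrent reads and writes of either member go through a synchronized path.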
Julian Brost
d56afbc51a Take host state into account when sending suppressed notifications
Checkable::FireSuppressedNotifications() compares the time of the current
checkable with the last recovery time of parents to avoid notification right
after a parent recovered and before the current checkable was checked.

This commit makes this check also include the host if the checkable is a
service.  This makes the behavior consistent with the documentation that states
there is an implicit dependency on the host (which isn't realized as implicitly
generating a Dependency object unfortunately).
2022-04-22 12:04:46 +02:00
Alexander A. Klimov
45b723644c Introduce Comment#sticky
Carries whether ack was sticky for ack comments.
2022-03-30 09:45:39 +02:00
Julian Brost
f67a5532dc
Merge pull request #9285 from Icinga/bugfix/suppressed-state-notifications-2.13
Checkable: send state notifications after suppression if and only if the state differs compared to before the suppression started
2022-03-29 15:16:04 +02:00
Julian Brost
3be1202eb3
Merge pull request #9290 from Icinga/bugfix/override-default-template-apply-rules-7914
Apply rules: import default templates first
2022-03-29 13:55:41 +02:00
Alexander A. Klimov
07cd15f48f Apply rules: import default templates first
... to allow to override the attributes they set.

refs #7914
2022-03-24 14:04:58 +01:00
Julian Brost
ccb18a04ec Checkable: Add test for state notifications after a suppression ends 2022-03-09 17:06:09 +01:00
Julian Brost
6303d8df09 Checkable: sync state_before_suppression in cluster
This ensures that in case of a failover in an HA zone, the other can take over
properly and has the required state to send the proper notifications.
2022-03-09 17:06:09 +01:00
Julian Brost
29fc3ad151 Checkable: improve state notifications after suppression ends
This commit changes the Checkable notification suppression logic (notifications
are currently suppressed on the Checkable if it is unreachable, in a downtime,
or acknowledged) so that after the suppression reason ends, a state
notification is sent if and only if the first hard state after is different
from the last hard state from before. If the checkable is in a soft state after
the suppression ends, the notification is further suppressed until a hard state
is reached.

To achieve this behavior, a new attribute state_before_suppression is added to
Checkable. This attribute is set to the last hard state the first time either a
PROBLEM or a RECOVERY notification is suppressed. Unlike before, neither of
these two flags in suppressed_notification is cleared while the suppression is
still ongoing. They are cleared only after the suppression has ended and the
current state has been compared with the old state stored in
state_before_suppression.
2022-03-09 17:06:09 +01:00
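The rule stated in commit 29fc3ad151 above (notify if and only if the first hard state after the suppression differs from the last hard state before it) can be written as a small decision function. The enums and `AfterSuppressionEnds()` are hypothetical names for this sketch and do not correspond to Icinga 2's actual types:

```cpp
#include <cassert>

enum class State { OK, Warning, Critical, Unknown };
enum class StateType { Soft, Hard };
enum class Decision { KeepSuppressed, Notify, Drop };

// Decides what to do once the suppression reason (unreachable, downtime,
// acknowledgement) is gone. `before` plays the role of
// state_before_suppression, `current` is the current state.
Decision AfterSuppressionEnds(State before, State current, StateType currentType)
{
    // A soft state is not final yet: keep waiting for a hard state.
    if (currentType == StateType::Soft)
        return Decision::KeepSuppressed;

    // Hard state reached: notify only if something actually changed.
    return current != before ? Decision::Notify : Decision::Drop;
}
```

A checkable that was CRITICAL before a downtime and is still hard CRITICAL afterwards produces no notification, whereas one that has recovered to a hard OK does.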
Julian Brost
e09eaa3ad2
Merge pull request #9239 from Icinga/bugfix/adjust-behavior-of-service-get-severity
Service#GetSeverity(): behave as the respective IDO query of Icinga Web
2022-03-08 16:34:18 +01:00
Julian Brost
dbe13e2f32
Merge pull request #9238 from Icinga/bugfix/timeperiod-dst-2.0
LegacyTimePeriod::ScriptFunc: fix DST edge-cases
2022-03-08 15:22:09 +01:00
Julian Brost
12293d999c
Merge pull request #9190 from Icinga/bugfix/sync-missing-history-information-213
Icinga DB: ensure consistent history streams in HA setup
2022-03-07 11:32:15 +01:00
Julian Brost
50ef32a0ad
Merge pull request #9228 from Icinga/bugfix/processcheckresult-dependency-deadlock-2.13
Prevent deadlock in ProcessCheckResult
2022-03-07 11:16:00 +01:00
Julian Brost
c6bac19da8
Merge pull request #9241 from Icinga/bugfix/icingadb-reachabilitychangehandler-9143
Icinga DB: ensure is_reachable and severity don't miss updates
2022-03-07 09:27:35 +01:00
Julian Brost
53a389769c
Merge pull request #9260 from Icinga/bugfix/event-handler-spamming-8704-213
Checkable#ExecuteEventHandler(): don't outsource event command run twice
2022-02-25 16:52:27 +01:00
Alexander A. Klimov
74935dad7b Checkable#ExecuteEventHandler(): don't outsource event command run twice
refs #8704
2022-02-24 14:03:57 +01:00
Alexander A. Klimov
88b041c7c9 Checkable#ProcessCheckResult(): call Checkable::OnReachabilityChanged less often
Call it only on state changes to reduce no-op Redis/IDO updates a lot.

refs #9143
2022-02-23 16:06:31 +01:00
Alexander A. Klimov
4ea65076b0 Checkable#ProcessCheckResult(): call Checkable::OnReachabilityChanged last
to ensure Checkable#IsReachable() returns correctly for dependency children inside OnReachabilityChanged().
That needs the dependency parent to be already in the correct state.

refs #9143
2022-02-23 16:06:31 +01:00
Alexander Aleksandrovič Klimov
501691cdde Service#GetSeverity(): behave as the respective IDO query of Icinga Web
which doesn't include host reachability.
2022-02-21 15:30:49 +01:00
Julian Brost
26246a4601 LegacyTimePeriod::ScriptFunc: fix DST edge-cases
This change fixes two problems:
* The internal functions used by ScriptFunc more or less expect to operate on
  full days, but ScriptFunc may have called them with some random timestamp
  during the day. This is fixed by always using midnight of the day as
  reference time.
* Previously, the code advanced a timestamp to the next day by adding 24 hours.
  On days with DST changes, this could either still be on the same day (a day
  may have 25 hours) or skip an entire day (a day may have 23 hours). This is
  fixed by using a struct tm to advance the time to the next day.
2022-02-21 15:24:15 +01:00
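The second point of commit 26246a4601 above (advance by calendar day through a `struct tm` rather than by adding 24 hours) can be sketched as follows. `NextLocalMidnight()` is a hypothetical helper, not the code used by `LegacyTimePeriod::ScriptFunc`; it uses `std::localtime()`, which is not thread-safe, purely to keep the sketch portable:

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: returns local midnight of the day following `ts`.
// Incrementing tm_mday and letting mktime() normalize the result works
// for days with 23, 24 or 25 hours alike.
std::time_t NextLocalMidnight(std::time_t ts)
{
    const std::tm* local = std::localtime(&ts);
    assert(local != nullptr);

    std::tm t = *local;
    t.tm_mday += 1;   // next calendar day; mktime() handles month rollover
    t.tm_hour = 0;
    t.tm_min = 0;
    t.tm_sec = 0;
    t.tm_isdst = -1;  // let the C library decide whether DST applies

    return std::mktime(&t);
}
```

By contrast, `ts + 24 * 60 * 60` lands on the same calendar day when the day has 25 hours and `ts` is shortly after midnight, and can skip a day entirely when the day has 23 hours and `ts` is shortly before midnight.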
Julian Brost
c55615a048 Prevent deadlock in ProcessCheckResult
Without this commit, children and parents of a checkable were rescheduled on a
state change while holding the lock for the current checkable. If both ends of
a dependency are checked at the same time and both change state, they could end
up in a deadlock waiting for each other.

This commit fixes this problem by changing the code so that other checkables
are rescheduled only after releasing the lock for the current checkable.
2022-02-18 14:21:59 +01:00
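The locking pattern described in commit c55615a048 above (collect the work while holding the own lock, perform it after releasing the lock) can be sketched with plain mutexes. The `Node` struct is a heavily simplified, hypothetical stand-in for a checkable:

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Simplified stand-in for a checkable with dependency parents/children.
struct Node {
    std::mutex mutex;
    int state = 0;
    int rescheduleCount = 0;
    std::vector<Node*> related;  // parents and children

    void Reschedule()
    {
        std::lock_guard<std::mutex> lock(mutex);
        ++rescheduleCount;
    }

    void ProcessResult(int newState)
    {
        std::vector<Node*> toReschedule;

        {
            std::lock_guard<std::mutex> lock(mutex);

            if (newState != state) {
                state = newState;
                toReschedule = related;  // only remember what to do
            }
        }  // own lock is released here

        // Other nodes are locked only now, so this thread never holds two
        // node locks at once and cannot take part in a lock cycle.
        for (Node* other : toReschedule) {
            assert(other != nullptr);
            other->Reschedule();
        }
    }
};
```

If `Reschedule()` were called inside the locked block instead, two threads processing results for both ends of a dependency could each hold their own lock while waiting for the other one.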
Julian Brost
4c2f6faa61 Sync comment and downtime removal info for Icinga DB history
When a comment or downtime is removed manually, the name of the requestor and
timestamp have to be synced to other nodes in the cluster to allow all of them
to generate a consistent Icinga DB history stream.

refs #9101
2022-01-24 18:03:03 +01:00