1. The lookup of apply rules per source type now implies
no String(const char*) (no malloc()) and just pointer (uint64) comparisions
2. Apply rules are now also grouped by target type via a nested map, that obsoletes
checking the target type while iterating over all rules per source type
Apparently there was a reason for making the members of generated classes
atomic. However, this was only done for some types, others were still accessed
using non-atomic operations. For members of type T::Ptr (i.e. intrusive_ptr<T>),
this can result in a double free when multiple threads access the same variable
and at least one of them writes to the variable.
This commit makes use of std::atomic<T> for more T (it removes the additional
constraint sizeof(T) <= sizeof(void*)) and uses a type including a mutex for
load and store operations as a fallback.
Checkable::FireSuppressedNotifications() compares the time of the current
checkable with the last recovery time of parents to avoid notification right
after a parent recovered and before the current checkable was checked.
This commit makes this check also include to host if the checkable is a
service. This makes the behavior consistent with the documentation that states
there is an implicit dependency on the host (which isn't realized as implicitly
generating a Dependency object unfortunately).
This commit changes the Checkable notification suppression logic (notifications
are currently suppressed on the Checkable if it is unreachable, in a downtime,
or acknowledged) to that after the suppression reason ends, a state
notification is sent if and only if the first hard state after is different
from the last hard state from before. If the checkable is in a soft state after
the suppression ends, the notification is further suppressed until a hard state
is reached.
To achieve this behavior, a new attribute state_before_suppression is added to
Checkable. This attribute is set to the last hard state the first time either a
PROBLEM or a RECOVERY notification is suppressed. Compared to from before,
neither of these two flags in the suppressed_notification will ever be cleared
while the supression is still ongoing but only after the suppression ended and
the current state is compared with the old state stored in
state_before_suppression.
Without this commit, children and parents of a checkable were rescheduled on a
state change while holding the lock for the current checkable. If both ends of
a dependency are checked at the same time and both change state, they could end
up in a deadlock waiting for each other.
This commit fixes this problem by changing the code so that other checkables
are rescheduled only after releasing the lock for the current checkable.
to ensure Checkable#IsReachable() returns correctly for dependency children inside OnReachabilityChanged().
That needs the dependency parent to be already in the correct state.
refs #9143
When a comment or downtime is removed manually, the name of the requestor and
timestamp have to be synced to other nodes in the cluster to allow all of them
to generate a consistent Icinga DB history stream.
refs #9101
When creating a fixed downtime that starts immediately while the checkable is
in a non-OK state, previously the code path for flexible downtimes was used to
trigger this downtime. This is fixed by this commit which resolves two issued:
1. Missing downtime start notification: notifications work differently for
fixed and flexible downtimes. This resulted in missing downtime start
notifications under the conditions described above.
2. Incorrect downtime trigger time: this code path would incorrectly assume the
timestamp of the last checkable as the trigger time which is incorrect for
fixed downtimes.
not to confuse the state file deserializator with e.g. `"type":32` on startup.
That would unexpectedly restore null (not `{"type":32}`) as there's no type "32".
refs #8186
When triggering a downtime, the time of the causing event is now passed on as
the trigger time. That time is:
* For fixed downtimes: the later one of start and entry time.
* If a check result triggers the downtime: The execution end of the check
result.
* If another downtime triggers the downtime: The trigger time of the first
downtime.
This is done so two nodes in a HA setup can write consistent Icinga DB downtime
history streams.
refs #9101
`ConfigType::GetObjectsByType<Downtime>()` already returns a
`std::vector<Downtime::Ptr>` so there is no point in copying it into another
vector of the same type just to then iterate the copied vector instead of the
original one.
This commit fixes the following build error:
[ 55%] Building CXX object lib/icinga/CMakeFiles/icinga.dir/usergroup.cpp.o
lib/icinga/usergroup.cpp:79:24: error: incomplete type ‘icinga::Notification’ used in nested name specifier
79 | std::set<Notification::Ptr> UserGroup::GetNotifications() const
| ^~~
This change fixes two problems:
* The internal functions used by ScriptFunc more or less expect to operate on
full days, but ScriptFunc may have called them with some random timestamp
during the day. This is fixed by always using midnight of the day as
reference time.
* Previously, the code advanced a timestamp to the next day by adding 24 hours.
On days with DST changes, this could either still be on the same day (a day
may have 25 hours) or skip an entire day (a day may have 23 hours). This is
fixed by using a struct tm to advance the time to the next day.
The loop iterated over the services of the wrong host resulting in duplicate
downtimes scheduled for services of the parent host instead of downtimes for
services of the child host.
Many functions of LegacyTimePeriod take a tm pointer as an input parameter and
then pass it to mktime() which actually modifies it. This causes problems if
tm_isdst was intentionally set to -1 (to automatically detect whether DST is
active at some time) and then a function is called that implicitly sets
tm_isdst and then the values of tm are modified in a way that crosses a DST
change. This resulted in 1 hour offsets with ScheduledDowntimes on days with
DST changes.
* Implement scheduling_source attribute
This implements the attribute `scheduling_source` for hosts and services to show which endpoint is running the scheduler for the check.
refs #4814
Only handle messages with a trusted origin in
ClusterEvents::ExecuteCommandAPIHandler. Previously, it would not locally
execute any command but forward them to other nodes where they would then have
a trusted origin and be executed.
Note that even when passing `nullptr` as target zone to `RelayMessage()`, the
cluster message will still be sent to the parent zone. These incoming messages
will now be rejected by the parent nodes. At the moment, there's no way to only
send within the local zone.
Disregard passive check results while no active checks are being scheduled due to violated dependencies.
This copes with the fact that programs feeding passive check results into Icinga may have no notion of reachability and so drive a checkable into HARD state although dependencies have caused active check scheduling being suspended. This may prevent superflous problem notifications being emitted during recovery.
As disable_checks defaults to false, it was regarded OK (by @Al2Klimov) to make this behaviour (which resembles the active check case) unconditional and not conditionalize it on an additional attribute.
In the description of disable_checks, note that a value of true both disables scheduling of active checks and drops passive check results.
At numerous places in the code, something like this is performed:
String name = Downtime::AddDowntime(...);
Downtime::Ptr downtime = Downtime::GetByName(name);
However, `downtime` can be a `nullptr` after this as it is possible that
the downtime is deleted in between.
This commit changes the return type of `Downtime::AddDowntime` to return
a Downtime::Ptr instead of the full name of the downtime. `AddDowntime`
performs the very same `GetByName()` operation internally, but handles
the `nullptr` case correctly and throws an exception.
`this` could be deleted after `Notification::BeginExecuteNotification`
exited and before `Notification::ExecuteNotificationHelper` finished.
This is fixed by constructing a `Notification::Ptr` and operate on that
one as it is properly reference-counted.
Only two out of three cases were handled properly by the code: host
downtimes referencing a deleted host and service downtimes referencing a
deleted service worked fine. However, if a service downtime references a
deleted host, `Host::GetByName()` returns `nullptr` which isn't
accounted for. Use `Service::GetByNamePair()` instead as this performs a
check for the host being null internally.