It also removes the extra `SendNextUpdate()` call from the
`NewCheckResultHandler` handler in Icinga DB: since that handler is
subscribed to the `NextCheckChanged` event anyway, and that event is
always emitted before the `NewCheckResult` event gets triggered, the
call became redundant.
This commit introduces a new, somewhat special `OnRescheduleCheck` signal
that is emitted whenever we want to tell the checker to reschedule the
checkable at a specific timestamp without actually changing the next
check time. Previously, we called `SetNextCheck` with a more or less
arbitrary timestamp just to force the checker to pick the checkable up
either immediately or at a specific time. Over time, subscribing to the
`OnNextCheckChanged` signal thus became effectively unusable for any
purpose other than informing the checker about a new next check time,
which led to the introduction of a separate signal solely responsible
for informing Icinga DB and the IDO about a new next check time in the
places where calling `SetNextCheck` did make sense. This commit does
quite the opposite: it replaces all calls to `SetNextCheck` that were
only used to inform the checker about a new next check time with
`OnRescheduleCheck` calls. Only the places where we actually want to
change the next check time still call `SetNextCheck` and thus inform
the checker and all other listeners about the new next check time. As
a bonus, we also got rid of holding the two object locks for child and
parent at the same time.
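Below is a minimal sketch of what such a signal could look like, assuming
the boost::signals2 pattern used for the other Checkable signals; the
actual signature and class layout in the code base may differ:

```cpp
#include <boost/signals2.hpp>

class Checkable; // stand-in for the real, intrusive-ptr based object type

// Hypothetical sketch: a dedicated signal that only asks the checker to pick
// the checkable up at the given timestamp, without touching the stored next
// check time and without notifying Icinga DB, IDO or the cluster.
struct RescheduleSignalSketch
{
	static boost::signals2::signal<void (const Checkable*, double)> OnRescheduleCheck;
};

boost::signals2::signal<void (const Checkable*, double)> RescheduleSignalSketch::OnRescheduleCheck;

// Emitting side: instead of SetNextCheck(now), which would also notify every
// OnNextCheckChanged subscriber, only the checker reacts to this signal.
void RescheduleSoon(const Checkable* checkable, double when)
{
	RescheduleSignalSketch::OnRescheduleCheck(checkable, when);
}
```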
Since the scheduler accounts for already running checks, we only need to
update the `next_check` timestamp in `Checkable::ExecuteCheck()` where it
actually makes sense to do so, and for local checks it doesn't make sense
at all. There are only two cases where we need to update the next check
beforehand:
1) The execute command event is sent to a connected remote endpoint, so
we need to set the next check to a time in the future until we actually
receive the check result back from the remote endpoint. However, it must
not be too far in the future, so that the check is still re-run within a
reasonable time in case the remote endpoint never responds.
2) The check is a remote check, but the endpoint is either currently syncing
replay logs or not connected at all, and we are within the magical 5min
cold startup window. In these cases the check is effectively skipped and
no check result will come in for it, so we manually update the next check
as if the check had been executed.
In all other cases, either the check is executed locally, which means the
`m_RunningCheck` flag already prevents the scheduler from re-running the check,
or this is a remote check whose endpoint is not connected while we are outside
the cold startup window, in which case we also don't do anything as we've
already called `Checkable::ProcessCheckResult()` with an appropriate error
state, which in turn calls `Checkable::UpdateNextCheck()`.
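A rough, hypothetical sketch of the resulting decision (the names and the
surrounding Checkable interface are simplified assumptions, not the actual
code):

```cpp
// Stand-in for the real Checkable interface, just enough to show the branching.
struct CheckableStub
{
	void SetNextCheck(double ts) { nextCheck = ts; }
	void UpdateNextCheck() { nextCheck = now + checkInterval; } // as after a real result

	double now = 0, checkInterval = 60, nextCheck = 0;
};

void MaybeUpdateNextCheck(CheckableStub& checkable, bool sentToConnectedEndpoint,
	bool remoteCheckSkipped, bool withinColdStartupWindow, double resultTimeout)
{
	if (sentToConnectedEndpoint) {
		// Case 1: bound how long we wait for the remote result before re-running.
		checkable.SetNextCheck(checkable.now + resultTimeout);
	} else if (remoteCheckSkipped && withinColdStartupWindow) {
		// Case 2: no result will ever arrive, so advance next_check as if the
		// check had been executed.
		checkable.UpdateNextCheck();
	}

	// All other cases: either m_RunningCheck already blocks rescheduling (local
	// checks), or ProcessCheckResult() with an error state updates next_check.
}
```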
This commit changes the ordering of CheckableScheduleInfo in the
multi-index container to ensure that checkables with running checks are
pushed to the end of the ordering. This prevents them from being
prioritized for scheduling ahead of others, which could lead to
unnecessary CPU load due to repeated scheduling attempts. By using a
very large value for the index of checkables with running checks, they
are effectively deprioritized until their current check is completed and
they can be reinserted with their actual next check time.
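A minimal sketch of such an ordering key (hypothetical names; the real code
uses the multi-index container's own key extraction):

```cpp
#include <limits>

struct ScheduleInfoSketch
{
	double NextCheck;
	bool CheckRunning;

	// Checkables with a running check sort behind everything else instead of
	// being picked up again and again at the front of the index.
	double GetOrderingKey() const
	{
		return CheckRunning ? std::numeric_limits<double>::max() : NextCheck;
	}
};
```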
It's annoying to have to wait 10 seconds for the `liveness_disconnect`
test to complete, so make the timeout configurable and set it to a much
lower value when testing the functionality.
So far, calling Checkable::IsReachable() traversed all possible paths to its
parents. In case a parent is reachable via multiple paths, all of its parents
were evaluated multiple times, resulting in worst-case exponential complexity.
With this commit, the implementation keeps track of which checkables were
already visited and uses the already-computed reachability instead of repeating
the computation, ensuring a worst-case linear runtime within the graph size.
Checkable::IsReachable() and DependencyGroup::GetState() call each other
recursively. Moving them to a common helper class allows adding caching to them
in a later commit without having to pass a cache between the functions (through
a public interface) or resorting to thread_local variables.
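A minimal sketch of the idea, with hypothetical names (the real helper wraps
the mutually recursive Checkable::IsReachable()/DependencyGroup::GetState()
lookups and the dependency graph traversal):

```cpp
#include <unordered_map>

class ReachabilityResolverSketch
{
public:
	bool IsReachable(const void* checkable)
	{
		// Reuse the result if this checkable was already visited during this
		// evaluation; this is what turns the exponential worst case into a
		// linear one for diamond-shaped dependency graphs.
		auto it = m_Cache.find(checkable);
		if (it != m_Cache.end())
			return it->second;

		bool reachable = ComputeReachability(checkable); // recurses via the group state
		m_Cache.emplace(checkable, reachable);
		return reachable;
	}

private:
	bool ComputeReachability(const void*) { return true; } // placeholder for the real traversal

	// Lives only as long as one evaluation, so no thread_local state and no
	// cache parameter in the public interface is needed.
	std::unordered_map<const void*, bool> m_Cache;
};
```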
This commit removes the existing m_IsNoOp bool and instead wraps the m_Buffer
std::ostringstream into std::optional. Functionally, this is pretty much the
same, with the exception that std::ostringstream is no longer constructed for
messages that will be discarded later.
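A minimal sketch of the change, using hypothetical names rather than the
actual Log class:

```cpp
#include <optional>
#include <sstream>
#include <string>

class LogSketch
{
public:
	explicit LogSketch(bool noOp)
	{
		// Replaces the separate m_IsNoOp flag: the ostringstream is only
		// constructed for messages that will actually be emitted.
		if (!noOp)
			m_Buffer.emplace();
	}

	bool IsNoOp() const { return !m_Buffer.has_value(); }

	std::string Str() const { return m_Buffer ? m_Buffer->str() : std::string(); }

private:
	std::optional<std::ostringstream> m_Buffer;
};
```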
There already is a template operator<< implemented, so far only for const
references though. Changing this to perfectly forward the argument to the
corresponding operator in the underlying std::ostringstream allows handling all
the cases there, removing the need for a separate overload for const char*.
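A sketch of the forwarding operator under these assumptions (simplified, not
the exact signature in the code base):

```cpp
#include <sstream>
#include <utility>

class LogStreamSketch
{
public:
	// One template handles const references, non-const references and string
	// literals alike, so no dedicated const char* overload is needed.
	template<typename T>
	LogStreamSketch& operator<<(T&& value)
	{
		m_Buffer << std::forward<T>(value);
		return *this;
	}

private:
	std::ostringstream m_Buffer;
};
```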
Once the new worker process has read the config, it also includes an
`include */include.conf` statement within the config package's root
directory, and from then on we must not allow deleting any stage
directory from the config package. Otherwise, when the worker actually
evaluates that include statement, it will fail to find the directory
where the include file is located, or the `active.conf` file, which is
included from each stage's `include.conf` file, thus causing the worker
to fail.
Co-Authored-By: Johannes Schmidt <johannes.schmidt@icinga.com>
Previously, we used a simple boolean to track the state of the package updates
and didn't reset it when the config validation was successful, because it was
assumed that if we successfully validated the config beforehand, the worker
would also successfully reload the config afterwards and the old worker would
be terminated. However, this assumption doesn't always hold, for a number of
reasons. The most obvious one is that after we successfully validated the
config, the config might have changed again before the worker was able to
reload it. If that happens, the new worker might fail to validate the config
due to the recent changes, in which case the old worker remains active and
this flag stays set to true, causing any subsequent request to fail with a
`423` until you manually restart the Icinga 2 service.
So, in order to prevent such a situation, we additionally track the last time
a reload failed and allow bypassing the `m_RunningPackageUpdates` flag only if
the last reload failure time has changed since the previous request.
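A hypothetical sketch of that bypass rule (member and parameter names are
assumptions, not the actual implementation):

```cpp
#include <atomic>

class PackageUpdateStateSketch
{
public:
	// lastSeenReloadFailed: the failure timestamp the client observed with its
	// previous request.
	bool AllowRequest(double lastSeenReloadFailed) const
	{
		if (!m_RunningPackageUpdates)
			return true;

		// The flag is still set, but a reload has failed in the meantime: the
		// old worker kept running, so don't keep answering with 423 forever.
		return m_LastReloadFailed > lastSeenReloadFailed;
	}

	std::atomic<bool> m_RunningPackageUpdates{false};
	std::atomic<double> m_LastReloadFailed{0};
};
```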
This commit introduces a small helper class that wraps any writer and
provides a flush operation that performs the corresponding action if the
writer is an AsyncJsonWriter and does nothing otherwise.
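A minimal sketch of that wrapper, with assumed type names (only the
AsyncJsonWriter name itself is taken from the commit):

```cpp
struct Writer { virtual ~Writer() = default; };

struct AsyncJsonWriterStub : Writer
{
	void Flush() { /* write out any buffered JSON */ }
};

class FlushHelperSketch
{
public:
	explicit FlushHelperSketch(Writer& writer) : m_Writer(writer) { }

	void Flush()
	{
		// Only asynchronous writers need (and offer) an explicit flush; for
		// every other writer type this is a no-op.
		if (auto* async = dynamic_cast<AsyncJsonWriterStub*>(&m_Writer))
			async->Flush();
	}

private:
	Writer& m_Writer;
};
```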
Replacing invalid UTF-8 characters beforehand by ourselves doesn't make
any sense; the serializer can perform the very same replacement, with the
exact same Unicode replacement character (U+FFFD), on its own. So, why not
just use it directly? Instead of wasting memory on a temporary `String`
object to UTF-8-validate each and every value, we just use the serializer
directly to dump the replaced characters (if any) into the output writer.
No memory waste, no fuss!
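For illustration, a standalone example assuming nlohmann/json as the
underlying serializer; its dump() error handler already performs exactly
this substitution:

```cpp
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

int main()
{
	// "\xff" is not valid UTF-8; with error_handler_t::replace the serializer
	// substitutes U+FFFD on its own, so no pre-validation pass is needed.
	nlohmann::json j = std::string("valid \xff invalid");
	std::cout << j.dump(-1, ' ', false, nlohmann::json::error_handler_t::replace) << '\n';
}
```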
The Array, Dictionary, and Namespace types provide a Freeze() method that makes
them read-only. So far, there was the possibility to call some methods with
`overrideFrozen=true` which would then bypass the corresponding check and allow
modification of the data structures nonetheless.
With 24b57f0d3a222835178e88489eabd595755ed883, this possibility was already
removed from the Namespace type. However, for interface compatibility, it kept
the parameter and simply ignores it, throwing an exception on any modification
of a frozen instance.
The only place using `overrideFrozen` was processing of the `-D`/`--define`
command line flag that allows setting additional variables in the DSL. At the
time it is evaluated, there are no user-created data structures yet that could
be frozen, so the only frozen objects that could be encountered are Namespaces
(Icinga doesn't freeze other types by itself) and for these, `overrideFrozen`
already has no effect.
Hence, there is no harm in removing `overrideFrozen` altogether. This
simplifies the code and also means that frozen objects are now indeed read-only
without exceptions, allowing further optimizations regarding locking in the
future.
The wait group gets passed to HttpServerConnection, then down to the
HttpHandlers. For those handlers that modify the program state, the
wait group is locked so ApiListener will wait on Stop() for the
request to complete. If the request iterates over config objects,
a further check on the state of the wait group is added to abort early
and not delay program shutdown. In that case, 503 responses will be
sent to the client.
Additionally, in HttpServerConnection, no requests other than the one
already started will be allowed once the wait group is joining.
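A hypothetical sketch of the handler-side pattern (the WaitGroup interface
shown here is an assumption, not the actual class):

```cpp
class WaitGroupSketch
{
public:
	bool TryLock() { return !m_Joining; }      // fails once Stop() starts joining
	void Unlock() { }
	bool IsLockable() const { return !m_Joining; }

	bool m_Joining = false;
};

// Returns an HTTP status code for brevity.
int HandleRequestSketch(WaitGroupSketch& wg)
{
	if (!wg.TryLock())
		return 503; // shutting down, refuse new state-changing requests

	for (int i = 0; i < 1000; ++i) { // stand-in for iterating over config objects
		if (!wg.IsLockable()) {
			wg.Unlock();
			return 503; // abort early so Stop() isn't delayed
		}
		// ... process one object ...
	}

	wg.Unlock();
	return 200;
}
```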