Once the new worker process has read the config, which also contains an
`include */include.conf` statement in the config package's root
directory, we must no longer allow any stage directory of the config
package to be deleted. Otherwise, when the worker actually evaluates
that include statement, it will fail to find either the directory
containing the include file or the `active.conf` file that is included
from each stage's `include.conf`, causing the worker to fail.
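Roughly, the relationship described above can be pictured like this
(names are placeholders, not necessarily the exact on-disk layout):

    <package root>/
        include.conf        # contains the `include */include.conf` statement
        <stage>/
            include.conf    # includes the stage's active.conf
            active.conf     # the stage's actual configuration

Deleting `<stage>/` between reading the config and evaluating the
include statement therefore breaks both lookups.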
Co-Authored-By: Johannes Schmidt <johannes.schmidt@icinga.com>
Previously, we used a simple boolean to track the state of package
updates and didn't reset it when the config validation was successful,
because it was assumed that if we successfully validated the config
beforehand, the worker would also successfully reload it afterwards and
the old worker would be terminated. However, this assumption does not
always hold. The most obvious counterexample: after we successfully
validated the config, it might have changed again before the worker was
able to reload it. If that happens, the new worker might fail to
validate the config due to the recent changes, in which case the old
worker remains active and the flag stays set to true, causing any
subsequent request to fail with a `423` until you manually restart the
Icinga 2 service.
So, in order to prevent such a situation, we additionally track the
last time a reload failed and allow the `m_RunningPackageUpdates` flag
to be bypassed only if that reload-failed timestamp has changed since
the previous request.
The wait group gets passed to HttpServerConnection, then down to the
HttpHandlers. For those handlers that modify the program state, the
wait group is locked so ApiListener will wait on Stop() for the
request to complete. If the request iterates over config objects,
a further check on the state of the wait group is added to abort early
and not delay program shutdown. In that case, 503 responses will be
sent to the client.
Additionally, in HttpServerConnection, no requests beyond the one
already started will be accepted once the wait group is joining.
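A minimal sketch of how such a wait group can behave (class and method
names are illustrative, not the actual Icinga 2 API): handlers enter it
before modifying program state, long-running loops over config objects
check whether it is joining and bail out with a 503, and shutdown joins
it to wait for in-flight requests:

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>

    class WaitGroup
    {
    public:
        // Called by a request handler before it starts modifying program state.
        // Returns false once Join() has been called, i.e. no new work is accepted.
        bool TryEnter()
        {
            std::unique_lock lock(m_Mutex);
            if (m_Joining)
                return false;
            ++m_Count;
            return true;
        }

        // Called by the handler when the request is done.
        void Leave()
        {
            std::unique_lock lock(m_Mutex);
            if (--m_Count == 0)
                m_CV.notify_all();
        }

        // Checked inside long loops over config objects to abort early with a 503.
        bool IsJoining() const
        {
            std::unique_lock lock(m_Mutex);
            return m_Joining;
        }

        // Called from shutdown (e.g. ApiListener::Stop()): refuse new requests
        // and wait for the ones already running to complete.
        void Join()
        {
            std::unique_lock lock(m_Mutex);
            m_Joining = true;
            m_CV.wait(lock, [this] { return m_Count == 0; });
        }

    private:
        mutable std::mutex m_Mutex;
        std::condition_variable m_CV;
        bool m_Joining = false;
        std::size_t m_Count = 0;
    };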
Pin child objects of hosts (HOST!...) to the same endpoint as the host.
This reduces cross-object action latency within the same host.
If all endpoints know this algorithm, we can use it.
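Roughly, the pinning could look like the sketch below (illustrative
only, not the actual Icinga 2 implementation; a real version needs a
hash that is identical on all endpoints, which `std::hash` does not
guarantee):

    #include <cstddef>
    #include <functional>
    #include <string>

    // Derive the distribution key from the part before the first '!', so that
    // e.g. "myhost!ping" is pinned to the same endpoint as "myhost".
    std::string PinningKey(const std::string& objectName)
    {
        auto bang = objectName.find('!');
        return bang == std::string::npos ? objectName : objectName.substr(0, bang);
    }

    std::size_t PickEndpoint(const std::string& objectName, std::size_t numEndpoints)
    {
        return std::hash<std::string>{}(PinningKey(objectName)) % numEndpoints;
    }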
for (const T& needle : haystack) creates the illusion that haystack is a
container of T and we're just borrowing needle. In these cases that's not true.
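The classic instance of this (not necessarily the exact code touched
here) is iterating a map with the wrong pair type: the loop compiles,
but each `needle` binds to a temporary copy instead of the element
itself.

    #include <map>
    #include <string>

    int main()
    {
        std::map<std::string, int> haystack{{"a", 1}, {"b", 2}};

        // Looks like borrowing, but the map's element type is
        // std::pair<const std::string, int>, so each needle is a copy.
        for (const std::pair<std::string, int>& needle : haystack)
            (void)needle;

        // Spelling out the real element type (or const auto&) avoids the copy.
        for (const auto& needle : haystack)
            (void)needle;
    }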
This was mistakenly introduced with PR #7686 due to too many open
connections (#7680). This was wrong in the sense that closing the
connection is simply out of place here and should have been handled
differently. After we revised the RPC connection disconnect procedure
in `v2.14.4`, it became clear why it is wrong: the connection is closed
abruptly before the corresponding response (`result`) has even been
written. Now, if you remove the disconnect here, shouldn't issue #7680
occur again, you ask? The answer is no, because we now also
have a maximum timeout of `10s` for anonymous connections, after which
they are automatically closed. Thanks to the introduction of this
timeout by @julianbrost in #8479, this `Disconnect()` call has become
superfluous.
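For context, that timeout mechanism boils down to something like the
following sketch (illustrative only, not the actual implementation from
#8479): arm a timer when an unauthenticated peer connects and close the
socket if it has not authenticated within `10s`.

    #include <boost/asio.hpp>
    #include <chrono>
    #include <memory>

    namespace asio = boost::asio;

    void StartAnonymousTimeout(const std::shared_ptr<asio::ip::tcp::socket>& socket,
        const std::shared_ptr<asio::steady_timer>& timer)
    {
        timer->expires_after(std::chrono::seconds(10));
        timer->async_wait([socket](const boost::system::error_code& ec) {
            if (!ec) {
                // The timer was not cancelled, i.e. the peer never authenticated
                // in time: drop the connection.
                boost::system::error_code ignored;
                socket->close(ignored);
            }
        });
    }

    // As soon as the peer authenticates, the timer is cancelled: timer->cancel();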
PR #7445 incorrectly assumed that when a peer had already disconnected
and never reconnected, this was because the endpoint client was dropped
after a successful socket shutdown. However, the actual issue at that
time was that there was no timeout guard that could cancel the
`async_shutdown` call, which could potentially block indefinitely.
Although removing the client from the cache early might have allowed
the endpoint to reconnect, it did not
resolve the underlying problem. Now that we have a proper cancellation
timeout, we can wait until the currently used socket is fully closed
before dropping the client from our cache. When our socket termination
works reliably, the `ApiListener` reconnect timer should attempt to
reconnect this endpoint after the next tick. Additionally, we now log
both before and after socket termination, which may help identify
whether it is hanging somewhere in between.
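A rough sketch of what such a guarded shutdown can look like
(illustrative only, not the actual Icinga 2 code; in real code the
handlers would additionally run on a strand):

    #include <boost/asio.hpp>
    #include <boost/asio/ssl.hpp>
    #include <chrono>
    #include <iostream>
    #include <memory>

    namespace asio = boost::asio;
    using TlsStream = asio::ssl::stream<asio::ip::tcp::socket>;

    void ShutdownWithTimeout(const std::shared_ptr<TlsStream>& stream,
        const std::shared_ptr<asio::steady_timer>& timer)
    {
        std::cout << "Shutting down TLS stream\n"; // log before termination

        // If async_shutdown() hangs, tear down the underlying socket so the
        // operation cannot block indefinitely.
        timer->expires_after(std::chrono::seconds(10));
        timer->async_wait([stream](const boost::system::error_code& ec) {
            if (!ec) {
                boost::system::error_code ignored;
                stream->lowest_layer().close(ignored);
            }
        });

        stream->async_shutdown([stream, timer](const boost::system::error_code& ec) {
            timer->cancel();
            std::cout << "TLS stream shut down: " << ec.message() << "\n"; // log after termination

            // Only now drop the endpoint client from the cache, so the
            // reconnect timer can attempt a fresh connection on its next tick.
        });
    }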
It's not used. Also, the callback shall run to completion in one go.
This ensures that it won't run (or continue to run) once another
coroutine on the strand calls Timeout#Cancel().
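A compact sketch of that invariant (class and member names are
illustrative, not the real implementation): the timer callback and
`Cancel()` both run on the same strand and the callback never suspends,
so after `Cancel()` the callback either has already run to completion
or will never run.

    #include <boost/asio.hpp>
    #include <chrono>
    #include <functional>

    namespace asio = boost::asio;

    class Timeout
    {
    public:
        // The object must outlive the timer; all methods and the callback run
        // on the given strand.
        Timeout(asio::strand<asio::io_context::executor_type> strand,
            std::chrono::seconds after, std::function<void()> onTimeout)
            : m_Timer(strand, after)
        {
            m_Timer.async_wait([this, onTimeout = std::move(onTimeout)](const boost::system::error_code& ec) {
                if (!ec && !m_Cancelled)
                    onTimeout(); // Runs to completion at once, no co_await in between.
            });
        }

        // Must be called from a coroutine running on the same strand.
        void Cancel()
        {
            m_Cancelled = true;
            m_Timer.cancel();
        }

    private:
        asio::steady_timer m_Timer;
        bool m_Cancelled = false;
    };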