PR #7445 incorrectly attributed peers that had disconnected and never
reconnected to the endpoint client being dropped after a successful
socket shutdown. However, the actual issue at that time was that there
was no timeout guard that could cancel the `async_shutdown` call,
potentially blocking indefinitely. Although removing the client from
the cache early might have allowed the endpoint to reconnect, it did
not resolve the underlying problem. Now that we have a proper
cancellation timeout, we can wait until the currently used socket is
fully closed before dropping the client from our cache. Once our
socket termination works reliably, the `ApiListener` reconnect timer
should attempt to reconnect this endpoint after the next tick.
Additionally, we now log both before and after socket termination,
which may help identify whether it is hanging somewhere in between.
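The following is a minimal sketch of that idea using plain Boost.Asio
with hypothetical names (`DisconnectGuarded`, the shared-pointer
plumbing); it is not the actual `Timeout` class, but shows how a
deadline timer can cancel a stalled `async_shutdown`:

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <boost/asio/ssl.hpp>
#include <chrono>
#include <memory>

namespace asio = boost::asio;

using TlsStream = asio::ssl::stream<asio::ip::tcp::socket>;

// Shut down the TLS stream, but never wait longer than 10 seconds for
// the peer to acknowledge the close_notify.
void DisconnectGuarded(const std::shared_ptr<TlsStream>& stream,
    asio::io_context::strand& strand, asio::yield_context yc)
{
    auto deadline (std::make_shared<asio::steady_timer>(
        strand.context(), std::chrono::seconds(10)));

    // Watchdog coroutine: if the deadline expires first, cancel all
    // pending operations so async_shutdown() below completes with
    // operation_aborted instead of hanging forever.
    asio::spawn(strand, [deadline, stream](asio::yield_context yc) {
        boost::system::error_code ec;
        deadline->async_wait(yc[ec]);

        if (!ec) { // deadline actually expired, not cancelled
            boost::system::error_code cancelEc;
            stream->lowest_layer().cancel(cancelEc);
        }
    });

    boost::system::error_code ec;
    stream->async_shutdown(yc[ec]); // may stall without the watchdog
    deadline->cancel();             // finished in time, disarm the watchdog

    stream->lowest_layer().shutdown(asio::ip::tcp::socket::shutdown_both, ec);
    stream->lowest_layer().close(ec);
}
```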
In fact, this is already done for the outer loop (for each bulk), just
not yet for the inner one (for each message of a bulk). So once the
remote signals EOF, don't try to process the remaining queue until a
write error occurs (which can't be associated with a particular
message anyway, due to buffering); instead, just let the peer go.
Flush already half-written messages, though, if possible.
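As a rough illustration with hypothetical names (not the actual
members; the flush assumes a buffered stream such as
boost::asio::buffered_stream), the inner loop gains the same bail-out
check as the outer one, followed by a best-effort flush:

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <deque>
#include <string>
#include <vector>

// Stop writing queued messages once shutdown/EOF has been signalled,
// but still try to flush whatever already sits in the write buffer.
template<class BufferedStream>
void WriteQueuedMessages(BufferedStream& stream,
    std::deque<std::vector<std::string>>& bulks,
    const bool& shuttingDown, boost::asio::yield_context yc)
{
    for (auto& bulk : bulks) { // outer loop: per bulk (already guarded)
        if (shuttingDown)
            break;

        for (auto& message : bulk) { // inner loop: per message (new guard)
            if (shuttingDown)
                break;

            boost::system::error_code ec;
            boost::asio::async_write(stream, boost::asio::buffer(message), yc[ec]);

            if (ec)
                return; // a write error can't be pinned to one message anyway
        }
    }

    // Flush already half-written data if possible; errors don't matter here.
    boost::system::error_code ec;
    stream.async_flush(yc[ec]);
}
```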
When the `Disconnect()` method is called, clients are not disconnected
immediately. Instead, a new coroutine is spawned using the same strand
as the other coroutines. This coroutine calls `async_shutdown` on the
TCP socket, which might block. However, in order not to block
indefinitely, the `Timeout` class cancels all operations on the socket
after `10` seconds. The timeout does not trigger the handler
immediately, though; it spawns another coroutine using the same strand
as in the `JsonRpcConnection` class. This can cause unexpected delays
if e.g. `HandleIncomingMessages` gets resumed before the coroutine
from the timeout class. Apart from that, the coroutine for writing
messages uses the same condition, making the two symmetrical.
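A simplified sketch of that behaviour, assuming plain Boost.Asio (the
class name `SimpleTimeout` and its members are illustrative, not the
actual `Timeout` implementation):

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <chrono>
#include <functional>
#include <memory>

// When the deadline expires, the cancellation logic is not run inline
// in the timer handler; it is spawned as yet another coroutine on the
// shared strand and therefore has to wait its turn behind whatever is
// already scheduled there (e.g. a resumed HandleIncomingMessages).
// Create instances via std::make_shared so shared_from_this() works.
class SimpleTimeout : public std::enable_shared_from_this<SimpleTimeout>
{
public:
    SimpleTimeout(boost::asio::io_context::strand& strand, std::chrono::seconds after,
        std::function<void(boost::asio::yield_context)> onTimeout)
        : m_Timer(strand.context(), after), m_Strand(strand),
          m_OnTimeout(std::move(onTimeout))
    { }

    void Start()
    {
        auto keepAlive (shared_from_this());

        m_Timer.async_wait([this, keepAlive](boost::system::error_code ec) {
            if (ec) // cancelled: the guarded operation finished in time
                return;

            // Only *spawn* the cancellation coroutine here; it runs once
            // every coroutine queued on the strand before it is done.
            boost::asio::spawn(m_Strand, [this, keepAlive](boost::asio::yield_context yc) {
                m_OnTimeout(yc);
            });
        });
    }

    void Cancel()
    {
        m_Timer.cancel();
    }

private:
    boost::asio::steady_timer m_Timer;
    boost::asio::io_context::strand& m_Strand;
    std::function<void(boost::asio::yield_context)> m_OnTimeout;
};
```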
Especially ApiListener#ReplayLog() enqueued lots of messages into
JsonRpcConnection#{m_IoStrand,m_OutgoingMessagesQueue} (RAM) even if
the connection was shut(ting) down. Now #Disconnect() takes effect
ASAP. Although there is locking involved here, it shouldn't take too
long for the thread to actually acquire the lock, since there aren't
that many threads dealing with endpoint clients concurrently; at
worst, a little time is wasted waiting for a CPU slot.
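A self-contained sketch of the idea with hypothetical names (not the
actual JsonRpcConnection members): once a disconnect has been
requested, further enqueue attempts are rejected and the already
queued messages are released right away:

```cpp
#include <deque>
#include <mutex>
#include <string>

struct OutgoingQueue
{
    std::mutex Mutex;
    bool ShuttingDown = false;
    std::deque<std::string> Messages;

    // Returns false once a disconnect has been requested, so callers
    // like a ReplayLog-style loop can stop enqueueing early instead of
    // piling messages up in RAM.
    bool Enqueue(std::string message)
    {
        std::unique_lock<std::mutex> lock (Mutex); // held only briefly

        if (ShuttingDown)
            return false;

        Messages.emplace_back(std::move(message));
        return true;
    }

    void RequestDisconnect()
    {
        std::unique_lock<std::mutex> lock (Mutex);
        ShuttingDown = true;
        Messages.clear(); // free the queued RAM right away
    }
};
```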
Anonymous connections are normally only used for requesting a
certificate and are closed after this request is received. However,
the request is only sent if the child has successfully verified the
certificate of its parent, so that the connection is authenticated
from its perspective. If this verification fails, both ends view it as
an anonymous connection and never actually use it, but attempt a
reconnect after 10 seconds, leaking the connection. Therefore, close
it after a timeout.
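A minimal sketch of such a timeout, assuming plain Boost.Asio and
hypothetical names (`CloseAnonymousAfterTimeout`, the `authenticated`
flag); the grace period is a parameter here, not necessarily what the
actual code uses:

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <chrono>
#include <memory>

// If the connection stays anonymous, i.e. no authentication happened
// within the grace period, close it instead of leaking it until the
// next reconnect attempt.
template<class Stream>
void CloseAnonymousAfterTimeout(const std::shared_ptr<Stream>& stream,
    const std::shared_ptr<bool>& authenticated,
    boost::asio::io_context::strand& strand, std::chrono::seconds grace)
{
    auto timer (std::make_shared<boost::asio::steady_timer>(
        strand.context(), grace));

    boost::asio::spawn(strand, [stream, authenticated, timer](boost::asio::yield_context yc) {
        boost::system::error_code ec;
        timer->async_wait(yc[ec]);

        // Still anonymous after the grace period: neither side will
        // ever use this connection, so shut it down.
        if (!*authenticated) {
            stream->lowest_layer().shutdown(
                boost::asio::ip::tcp::socket::shutdown_both, ec);
            stream->lowest_layer().close(ec);
        }
    });
}
```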
This is required to
- catch all exceptions and wrap them into Boost exceptions. They
are the only ones allowed with Boost.Coroutine.
- set a dedicated coroutine stack size for Windows.
refs #7431
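A hedged sketch of such a wrapper (the name `SpawnCoroutine` and the
chosen stack sizes are assumptions, not the actual implementation):
exceptions derived from std::exception are re-thrown via
BOOST_THROW_EXCEPTION so that only Boost-aware exceptions cross the
coroutine boundary, and the stack size is passed through
boost::coroutines::attributes:

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <boost/coroutine/attributes.hpp>
#include <boost/throw_exception.hpp>
#include <cstddef>
#include <stdexcept>

// Spawn a coroutine that (a) uses an explicit stack size, notably
// larger on Windows, and (b) never lets a plain std::exception escape:
// it is wrapped into a Boost exception via BOOST_THROW_EXCEPTION first.
template<class Strand, class Handler>
void SpawnCoroutine(Strand& strand, Handler handler)
{
#ifdef _WIN32
    constexpr std::size_t stackSize = 8 * 1024 * 1024; // assumed value
#else
    constexpr std::size_t stackSize = 256 * 1024;      // assumed value
#endif

    boost::asio::spawn(strand,
        [handler](boost::asio::yield_context yc) {
            try {
                handler(yc);
            } catch (const std::exception& ex) {
                // Deliberately trades the concrete exception type for
                // one that Boost.Coroutine can transport safely.
                BOOST_THROW_EXCEPTION(std::runtime_error(ex.what()));
            }
        },
        boost::coroutines::attributes(stackSize));
}
```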
Disconnect() might throw exceptions (this has been reworked into
error_code locally), which the Defer destructor swallows because
throwing from a destructor would be dangerous. On the other hand,
swallowing them may corrupt the stack unwinding of the coroutine
layer. It is best to avoid Defer inside lib/remote and to call
Disconnect() directly after breaking out of other operations.
refs #7351
refs #7431
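As a sketch under those assumptions (hypothetical helpers, not the
actual lib/remote code), the error_code overloads via
`yield_context[ec]` keep Disconnect() from throwing, and it is called
explicitly after leaving the read loop rather than from a Defer:

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <boost/asio/ssl.hpp>
#include <cstddef>

namespace asio = boost::asio;

using TlsStream = asio::ssl::stream<asio::ip::tcp::socket>;

// Errors land in ec instead of being thrown, so this never interferes
// with coroutine stack unwinding.
void DisconnectNoThrow(TlsStream& stream, asio::yield_context yc)
{
    boost::system::error_code ec;

    stream.async_shutdown(yc[ec]);
    stream.lowest_layer().shutdown(asio::ip::tcp::socket::shutdown_both, ec);
    stream.lowest_layer().close(ec);
}

void HandleMessages(TlsStream& stream, asio::yield_context yc)
{
    for (;;) {
        boost::system::error_code ec;
        char buf[4096];
        std::size_t n = stream.async_read_some(asio::buffer(buf), yc[ec]);

        if (ec)
            break;  // EOF or error: leave the loop...

        (void)n;    // ...normally the message would be processed here
    }

    // ...and disconnect explicitly, not from a Defer destructor.
    DisconnectNoThrow(stream, yc);
}
```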
config::UpdateObject would create a new object, but this may have been
silently ignored with 'ignore_on_error' - downtimes, etc. Since we
cannot simply fetch the error from inside the config compiler, we just
check whether a config object has been created at this stage. This
happens synchronously, and only once such an object exists do we log
the creation. The previous code always logged the creation, even if
the downtime was ignored, e.g. when the first master sent one for
local host objects. This commit also adds more details (identity,
endpoint, zone) by extracting the MessageOrigin details into the log
messages for better troubleshooting and debugging.
refs #7198
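A heavily hedged sketch of the logging decision (all names are
hypothetical, not the actual config compiler or logging API): only log
the creation after verifying that an object actually exists, and
include the message origin details in the log line:

```cpp
#include <functional>
#include <iostream>
#include <string>

void LogObjectCreationIfAny(
    const std::function<bool(const std::string& type, const std::string& name)>& objectExists,
    const std::string& type, const std::string& name,
    const std::string& identity, const std::string& endpoint, const std::string& zone)
{
    // With 'ignore_on_error' the compiler may have dropped the object
    // without reporting an error, so check synchronously whether it was
    // really created before logging anything.
    if (!objectExists(type, name))
        return;

    std::clog << "Created " << type << " '" << name
              << "' (identity: '" << identity
              << "', endpoint: '" << endpoint
              << "', zone: '" << zone << "')." << std::endl;
}
```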
This basically drops the "corked" implementation, which just stalled
the TLS I/O polling after some requests. If you need some sort of rate
limiting for these events, use an external TLS proxy that terminates
TLS in front of Icinga.
fixes #6635