6375 Commits

Author SHA1 Message Date
Alexander Aleksandrovič Klimov
d66d7672a4 JsonRpcConnection: don't write new messages on shutdown
In fact, this is already done for the outer loop (for each bulk), just not yet for the inner one (for each message of a bulk). So once the remote signals EOF, don't keep processing the remaining queue until a write error occurs (which can't be associated with a particular message anyway, due to buffering), but just let the peer go. Flush already half-written messages, though, if possible.
2024-11-29 08:18:49 +01:00
Yonas Habteab
e0e87a13fe Log before & after an RPC client is disconnected 2024-11-25 11:43:05 +01:00
Yonas Habteab
b16386ff6a IcingaCheckTask: Add available I/O CPU semaphores as perfdata 2024-11-20 15:06:54 +01:00
Julian Brost
081ae6df8a Add comment for remaining uses of async_shutdown() why it's safe
The reason for introducing AsioTlsStream::GracefulDisconnect() was to handle
the TLS shutdown properly with a timeout, since the shutdown can block. However,
the implementation of this timeout involves spawning coroutines which are
redundant in some cases. This commit adds comments to the remaining calls of
async_shutdown() stating why calling it is safe in these places.
2024-11-19 18:10:26 +01:00
Julian Brost
5b057fb971 HttpServerConnection: use AsioTlsStream::GracefulDisconnect()
This new helper function has proper timeout handling which was missing here.
2024-11-19 18:10:20 +01:00
Julian Brost
d0a2b3da32 JsonRpcConnection: use AsioTlsStream::GracefulDisconnect()
This new helper function allows deduplicating the timeout handling for
`async_shutdown()`.
2024-11-19 18:07:34 +01:00
Julian Brost
a7aeefee0e AsioTlsStream: add GracefulDisconnect() and ForceDisconnect()
Calling `AsioTlsStream::async_shutdown()` performs a TLS shutdown which
exchanges messages (that's why it takes a `yield_context`) and thus has the
potential to block the coroutine. Therefore, it should be protected with a
timeout. As `async_shutdown()` doesn't simply take a timeout, this has to be
implemented using a timer. So far, these timers are scattered throughout the
codebase with some places missing them entirely. This commit adds helper
functions to properly shut down a TLS connection with a single function call.
2024-11-19 18:07:34 +01:00
Julian Brost
41433551bc AsioTlsStream: inherit from SharedObject
All usages of `AsioTlsStream` were already using `Shared<AsioTlsStream>` to
keep a reference-counted instance. This commit moves the reference counting to
`AsioTlsStream` itself by inheriting from `SharedObject`. This allows
implementing methods that make use of the fact that these objects are
reference-counted.

The changes outside of `lib/base/tlsstream.hpp` are merely replacing
`Shared<AsioTlsStream>::Ptr` with `AsioTlsStream::Ptr` everywhere.
2024-11-19 18:07:34 +01:00
Yonas Habteab
f8e7a4266b JsonRpcConnection: Log message processing time stats
Co-Authored-By: Julian Brost <julian.brost@icinga.com>
2024-11-19 18:06:48 +01:00
Yonas Habteab
7fe7e8a1ab HttpServerConnection: Log noticeable CPU semaphore wait time 2024-11-19 18:06:48 +01:00
Yonas Habteab
d768c90937 HttpServerConnection: Don't spawn useless coroutines
Currently, for each `Disconnect()` call, we spawn a coroutine, but all of
them except the first one are useless. However, since all
`Disconnect()` usages share the same asio strand and cannot interfere
with each other, spawning another coroutine within `Disconnect()` isn't
even necessary. When a coroutine calls `Disconnect()` now, it will
immediately initiate an async shutdown of the socket, potentially causing
the coroutine to yield and allowing the others to resume. Therefore, the
`m_ShuttingDown` flag is still required by the coroutines to be checked
regularly.
2024-11-19 16:08:37 +01:00
Yonas Habteab
6f9ae05948
Merge pull request #10239 from Icinga/state-before-suppression214
Fix lost recovery notifications after recovery outside of notification time period
2024-11-14 13:49:15 +01:00
Yonas Habteab
eb32283751
Merge pull request #10237 from Icinga/log-connected-endpoint-connection-attempts-214
ApiListener: Log connection attempts from an already connected client
2024-11-14 12:57:41 +01:00
Julian Brost
e313b3f2aa Use Checkable::GetStateBeforeSuppression() only where relevant
This fixes an issue where recovery notifications get lost if they happen
outside of a notification time period.

Not all calls to `Checkable::NotificationReasonApplies()` need
`GetStateBeforeSuppression()` to be checked. In fact, for one caller,
`FireSuppressedNotifications()` in
`lib/notification/notificationcomponent.cpp`, the state before suppression may
not even be initialized properly, so that the default value of OK is used which
can lead to incorrect return values. Note the difference between suppressions
happening on the level of the `Checkable` object and the `Notification`
object. Only the first sets the state before suppression in the
`Checkable` object, but so far, the latter also used that value incorrectly.

This commit moves the check of `GetStateBeforeSuppression()` from
`Checkable::NotificationReasonApplies()` to the one place where it's actually
relevant: `Checkable::FireSuppressedNotifications()`. This made the existing
call to `NotificationReasonApplies()` unnecessary as it would always return
true: the `type` argument is computed based on the current check result, so
there's no need to check it against the current check result.
2024-11-14 12:07:02 +01:00
Yonas Habteab
8acfb9b214 ApiListener: Log connection attempts from an already connected client
Something is definitely going wrong if a client tries to reconnect to
this endpoint while it still has an active connection to that client. So
we shouldn't hide this, but at least log it at info level. Apart from
that, I've added some additional information about the currently active
client, such as when the last message was sent and received.
2024-11-14 11:09:00 +01:00
Yonas Habteab
d5051c7ea3 ApiListener: Log error context only once
When logging at the warning level, the logger automatically looks up
any registered context and appends it to the log entry accordingly.
2024-11-14 11:05:53 +01:00
Yonas Habteab
d5cd5aff2c
Merge pull request #10080 from Icinga/net-stack-2.14.3
Fix network stack stability issues
2024-11-14 11:02:36 +01:00
Yonas Habteab
4850018464 Don't use thread-local variable in coroutine & process final cr in global thread pool 2024-11-13 15:36:57 +01:00
Yonas Habteab
2854c618dd HttpServerConnection: Drop yet another superfluous CpuBoundWork usage 2024-11-13 15:36:57 +01:00
Yonas Habteab
660b82b4f9 JsonRpcConnection: Don't read any data on shutdown
When the `Disconnect()` method is called, clients are not disconnected
immediately. Instead, a new coroutine is spawned using the same strand
as the other coroutines. This coroutine calls `async_shutdown` on the
TCP socket, which might be blocking. However, in order not to block
indefinitely, the `Timeout` class cancels all operations on the socket
after `10` seconds. The timeout does not trigger the handler
immediately, though; it spawns another coroutine using the same strand
as in the `JsonRpcConnection` class. This can cause unexpected delays if
e.g. `HandleIncomingMessages` gets resumed before the coroutine from the
timeout class. Apart from that, the coroutine for writing messages uses
the same condition, making the two symmetrical.
2024-11-13 15:35:57 +01:00
Julian Brost
58197dbbaa
Merge commit from fork
Icinga 2.14.3
2024-11-12 15:01:58 +01:00
Julian Brost
2febc5e18a Security: fix TLS certificate validation bypass
The previous validation in set_verify_callback() could be bypassed, tricking
Icinga 2 into treating invalid certificates as valid. To fix this, the
validation checks were moved into the IsVerifyOK() function.

This is tracked as CVE-2024-49369; more details will be published at a later time.
2024-10-22 10:41:00 +02:00
Yonas Habteab
3b57e4915f
Merge pull request #10163 from Icinga/next-check-cluster-sync-issue-2.14
Checkable: Don't recalculate `next_check` while processing remotely generated check
2024-09-20 11:32:47 +02:00
Yonas Habteab
200f198a35 Checkable: Don't recalculate next_check while processing remotely generated check
Currently, when processing a `CheckResult`, it will first trigger an
`OnNextCheckChanged` event, which is sent to all connected endpoints.
Then, when `Checkable::ProcessCheckResult()` returns, an `OnCheckResult`
event is fired, which is of course also sent to all connected endpoints.

Next, the other endpoints receive the `event::SetNextCheck` cluster
event followed by `event::CheckResult` and invoke
`Checkable#SetNextCheck()` and `Checkable#CheckResult()` with the newly
received check. So they also try to recalculate the next check
themselves and invalidate the previously received next check timestamp
from the source endpoint. Since each endpoint randomly initialises its
own scheduling offset, the recalculated next check will always differ by
a split second/millisecond on each of them. As a consequence, two Icinga
DB HA instances will generate two different checksums for the same state
and cause the state histories to be fully resynchronised after a
takeover/Icinga 2 reload.
2024-09-19 13:27:50 +02:00
Yonas Habteab
eeaa25c284 Unregister invalid config objects properly 2024-09-19 13:24:23 +02:00
Yonas Habteab
bdc089ba95 IcingaDB: Don't sync partially initialised objects 2024-09-18 15:52:13 +02:00
Yonas Habteab
1d0a984d33
Merge pull request #10124 from Icinga/do-not-fail-removing-obsolete-downtimes-2.14
Don't fail to remove obsolete downtimes, remove RemoveAllDowntimes()
2024-09-17 17:17:42 +02:00
Yonas Habteab
2d970bcd3b
Merge pull request #10123 from Icinga/AddDowntime-trigger_name-2.14
Downtime::AddDowntime(): NULL-check pointer before deref not to crash
2024-09-17 16:09:16 +02:00
Yonas Habteab
221041487c
Merge pull request #10122 from Icinga/broken-timeperiod-2.14
Fix broken `TimePeriod/ScheduledDowntime`s
2024-09-17 15:56:49 +02:00
Yonas Habteab
235e4d4824
Merge pull request #10121 from Icinga/broken-runtime-config-sync-2.14
Fix broken runtime config sync
2024-09-17 15:20:56 +02:00
Yonas Habteab
06b01cb574
Merge pull request #10125 from Icinga/output-exit-code-2.14
Mention plugin exit codes outside [0..3] in the plugin output and warning log
2024-09-17 15:18:27 +02:00
Yonas Habteab
7a04966e78
Merge pull request #10131 from Icinga/heavy-update-missing-table-relations-2.14
Update object#config_hash after all relations queries
2024-09-17 15:14:28 +02:00
Yonas Habteab
cad41fa76a
Merge pull request #10129 from Icinga/malloc_info-2.14
/v1/debug/malloc_info: call malloc_info(3) if available
2024-09-17 15:13:47 +02:00
Yonas Habteab
ad7495dc0e Don't fail to remove obsolete downtimes 2024-09-17 15:06:45 +02:00
Yonas Habteab
6347c9089d Don't lose args in recursive Downtime::RemoveDowntime() call 2024-09-17 15:06:45 +02:00
Yonas Habteab
088f4e0f48 Introduce & use enum DowntimeRemovalReason 2024-09-17 15:06:45 +02:00
Alexander A. Klimov
d7a590e3ef Remove unused Checkable#RemoveAllDowntimes() 2024-09-17 15:06:45 +02:00
Yonas Habteab
b70f4da208 Don't allow to modify/create/delete an object concurrently 2024-09-17 12:33:35 +02:00
Yonas Habteab
395a1398f6 ConfigObjectUtility#CreateObject(): Use Defer for config path cleanup 2024-09-17 12:33:35 +02:00
Yonas Habteab
42891028ca ApiListener: Process cluster config updates sequentially 2024-09-17 12:33:35 +02:00
Yonas Habteab
3d5e0fef69 Introduce RAII style ObjectNameLock class 2024-09-17 12:33:35 +02:00
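The RAII idea behind a per-name lock can be sketched as follows. This is only an illustration under assumptions, not the `ObjectNameLock` class from this commit: the class name `NameLock`, its `IsLocked()` helper and the internal `State` struct are hypothetical. The constructor blocks until no other holder owns the same name, and the destructor releases the name again, including during stack unwinding.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>
#include <utility>

// Hypothetical sketch of a RAII lock keyed by object name.
// It is not the Icinga 2 ObjectNameLock implementation.
class NameLock
{
public:
	// Blocks until no other NameLock holds the same name, then claims it.
	explicit NameLock(std::string name) : m_Name(std::move(name))
	{
		State& state = GetState();
		std::unique_lock<std::mutex> lock(state.Mutex);
		state.CV.wait(lock, [&] { return state.Locked.count(m_Name) == 0; });
		state.Locked.insert(m_Name);
	}

	// Releases the name and wakes up waiters, also when an exception unwinds the stack.
	~NameLock()
	{
		State& state = GetState();
		{
			std::lock_guard<std::mutex> lock(state.Mutex);
			state.Locked.erase(m_Name);
		}
		state.CV.notify_all();
	}

	NameLock(const NameLock&) = delete;
	NameLock& operator=(const NameLock&) = delete;

	// Helper for illustration only: reports whether a name is currently held.
	static bool IsLocked(const std::string& name)
	{
		State& state = GetState();
		std::lock_guard<std::mutex> lock(state.Mutex);
		return state.Locked.count(name) > 0;
	}

private:
	struct State
	{
		std::mutex Mutex;
		std::condition_variable CV;
		std::set<std::string> Locked;
	};

	// Function-local static: initialised once, thread-safe since C++11.
	static State& GetState()
	{
		static State state;
		return state;
	}

	std::string m_Name;
};
```

Holding `NameLock lock("host1");` in a scope would serialise concurrent create, modify and delete operations on that one name, while operations on other names proceed independently.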
Yonas Habteab
cf11fe0177 ConfigObjectUtility: Use AtomicFile to store object config files 2024-09-17 12:33:35 +02:00
Yonas Habteab
574cbb4b4e Check segment start date inclusively in TimePeriod::IsInside() 2024-09-17 12:33:27 +02:00
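An inclusive start check can be sketched like this. The `Segment` struct and the free function `IsInside()` are hypothetical stand-ins, not the Icinga 2 code, and treating the segment end as exclusive is an assumption made here for illustration.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical stand-in for a time period segment; not the Icinga 2 type.
struct Segment
{
	std::time_t Begin;
	std::time_t End;
};

// The start is checked inclusively (ts == Begin counts as inside).
// The exclusive end is an assumption of this sketch.
inline bool IsInside(const Segment& segment, std::time_t ts)
{
	return ts >= segment.Begin && ts < segment.End;
}
```

With an exclusive start check (`ts > segment.Begin`), a timestamp falling exactly on the first second of a segment would be treated as outside of it.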
Yonas Habteab
ed16377349 Fix broken timeperiods/scheduleddowntimes 2024-09-17 12:33:27 +02:00
Alexander A. Klimov
b3c914242c l_LegacyDowntimesCache: delete removed objects not to leak memory 2024-09-17 12:33:16 +02:00
Alexander A. Klimov
58a10ad312 /v1/actions/schedule-downtime: reject request on invalid trigger_name
For this purpose, look up the specified Downtime. Also pass Downtime objects,
not just names, to Downtime::AddDowntime() so it isn't looked up twice.
2024-09-17 12:33:16 +02:00
Alexander A. Klimov
a3dd32e6e5 [Refactor] Downtime::GetDowntimeIDFromLegacyID(): return the Downtime itself
not just its name.
2024-09-17 12:33:16 +02:00
Alexander A. Klimov
9cf585b625 [Refactor] l_LegacyDowntimesCache: store Downtime objects, not just their names
to avoid names of vanished objects.
2024-09-17 12:33:16 +02:00
Alexander A. Klimov
1446139125 PluginCheckTask::ProcessFinishedHandler(): warn about exit codes outside 0..3
in the plugin output as well, in addition to the warning log.
2024-09-17 12:33:01 +02:00
Alexander A. Klimov
0605577d4d Make ProcessResult#ExitStatus and CheckResult#exit_status 64-bit ints
so that they can hold Windows exit codes like 3221225477 (>2147483647).
2024-09-17 12:33:01 +02:00
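The range problem can be checked at compile time. In this sketch, `FitsInInt32()` and `kAccessViolation` are hypothetical names used for illustration; the value 3221225477 is 0xC0000005 read as an unsigned number.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: does an exit status fit into a signed 32-bit int?
constexpr bool FitsInInt32(std::int64_t exitStatus)
{
	return exitStatus >= std::numeric_limits<std::int32_t>::min()
		&& exitStatus <= std::numeric_limits<std::int32_t>::max();
}

// 0xC0000005 as an unsigned value, i.e. the example from the commit message.
constexpr std::int64_t kAccessViolation = 3221225477LL;

static_assert(kAccessViolation == 0xC0000005LL, "same value in hex");
static_assert(!FitsInInt32(kAccessViolation), "does not fit into 32 signed bits");
static_assert(FitsInInt32(3), "regular plugin exit codes fit easily");
```

A signed 64-bit field holds such values unchanged, whereas a signed 32-bit field cannot represent them.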