The point of logging to the Windows Event Log was to catch errors that happen
before the full logging configuration has been loaded and enabled. Messages
like the number of loaded objects per type just cause noise in the log and
provide little benefit. Therefore, raise the required log level at this stage.
Note that this commit removes the (never documented) ability to use the -x flag
to change the level. But doing so would require patching the command line of
the service in the registry anyway.
If some kind of query is not supposed to be processed at the moment, there is
little point in checking it. During a full dump, state updates are suppressed
(i.e. delayed), so a long-running dump would previously have resulted in a
false Redis backlog warning.
The check makes no attempt to explicitly connect to Redis; it uses the
connection of the IcingaDB feature, so this message better describes the state
in this situation.
IcingaDB::GetConnection() uses IcingaDB::m_Rcon, which is only initialized in
IcingaDB::Start(); therefore, add a nullptr check to the check command.
Additionally, as m_Rcon is potentially accessed concurrently, add a copy of the
value that is safe for concurrent use.
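A minimal sketch of one way to do this, with a shared-pointer-like handle and
illustrative names rather than the actual Icinga 2 members:

```cpp
#include <memory>
#include <mutex>

class RedisConnection;

class IcingaDB
{
public:
    std::shared_ptr<RedisConnection> GetConnection() const
    {
        std::lock_guard<std::mutex> lock(m_RconMutex);
        return m_Rcon; // copy of the smart pointer, safe to use after the lock is released
    }

private:
    mutable std::mutex m_RconMutex;
    std::shared_ptr<RedisConnection> m_Rcon; // only set in Start()
};

// In the check command, bail out early instead of dereferencing a null pointer:
//
//     auto rcon = icingadb->GetConnection();
//     if (!rcon) {
//         // report e.g. UNKNOWN: the IcingaDB feature has not started yet
//         return;
//     }
```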
icingadb-web shows multiple lines from the check output collapsed into a single
line. The lines containing just minuses make this look cluttered, and making
it a heading provides little to no benefit. Even if markdown is rendered in
the check output at some point, having the lists labeled using normal
paragraphs would look just fine.
- Add icinga2_ and icingadb_ prefixes to make clear which component is
responsible for the value.
- Rename heartbeat_lag to heartbeat_age, which describes it better in my opinion
and sounds a bit less like something that should be as close to zero as possible.
- Rename redis_dump/database_sync into full_dump/full_sync as this is how these
operations are referred to in log messages as well.
- Rename Redis backlog into Redis query backlog, which makes it a bit clearer in
my opinion.
- Rename runtime_backlog into runtime_update_backlog, as the component in
Icinga DB is called that way and this naming is also exposed in log messages.
- Rename dump_config/state/history into config/state/history_dump, which sounds
more natural.
They add no additional information compared to the *_1min values as it's always
the same value divided by 60 anyway. Adding the actual value from the last
second makes little sense for realistic values of check_interval.
Use the already existing format to pass performance data to Icinga 2 rather
than some new JSON structure. This has the additional benefit of doing more
things in Go than in C++.
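For illustration only: the existing format is presumably the conventional
monitoring-plugin perfdata format ('label'=value[UOM], optionally followed by
warn/crit/min/max thresholds) that Icinga 2 already parses. The label and value
below are made up, and the real code would live on the Go side; this sketch is
in C++ for consistency with the other examples:

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Emit one datum as 'label'=value[UOM]; thresholds could be appended with ';'.
std::string FormatPerfdata(const std::string& label, double value, const std::string& uom = "")
{
    std::ostringstream out;
    out << "'" << label << "'=" << value << uom;
    return out.str();
}

int main()
{
    // Label taken from the renames above; the value is invented for illustration.
    std::cout << FormatPerfdata("icingadb_heartbeat_age", 2.5, "s") << "\n";
}
```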
IcingaDB may receive callbacks from Boost signals before being fully started.
This resulted in situations where m_EnvironmentId was used before it was
initialized properly. This is fixed by initializing it earlier (during the
config validation stage). However, at this stage, it should not yet write to
disk; therefore, persisting the environment ID to disk is delayed until later
in the startup process.
Initializing at this stage has an extra benefit: if there is an error for some
reason (possibly corrupt icingadb.env file), this now shows up as a nice error
during config validation.
Additionally, this replaces the use of std::call_once with std::mutex due to a
bug in libstdc++ (see inline comment for reference).
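A minimal sketch of the mutex-based one-time initialization idea, with made-up
member and function names standing in for the actual Icinga DB code:

```cpp
#include <mutex>
#include <string>

class EnvironmentIdHolder
{
public:
    std::string GetEnvironmentId()
    {
        std::unique_lock<std::mutex> lock(m_Mutex);

        if (!m_Initialized) {
            // If this throws, m_Initialized stays false and the next caller
            // simply retries; this is where std::call_once ran into the
            // libstdc++ bug referenced in the inline comment.
            m_EnvironmentId = ComputeOrLoadEnvironmentId();
            m_Initialized = true;
        }

        return m_EnvironmentId;
    }

private:
    std::string ComputeOrLoadEnvironmentId()
    {
        return "0123abcd"; // placeholder; e.g. read or create icingadb.env
    }

    std::mutex m_Mutex;
    bool m_Initialized = false;
    std::string m_EnvironmentId;
};
```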
Icinga 2 treats null (Empty) as if the corresponding attribute is not
specified. However, without this commit, it would serialize the value as "null"
(i.e. type string), so that it ends up in the database as this string instead
of NULL. This commit adds handling for ValueEmpty so that it is serialized as a
JSON null value and ends up in the database as NULL.
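Roughly the idea, shown with a stand-in value type instead of Icinga 2's real
Value class:

```cpp
#include <iostream>
#include <string>

enum class ValueType { Empty, Number, String };

struct Value
{
    ValueType Type = ValueType::Empty;
    double Number = 0;
    std::string String;
};

std::string ToJson(const Value& v)
{
    switch (v.Type) {
        case ValueType::Empty:
            return "null";                 // JSON null -> SQL NULL
        case ValueType::Number:
            return std::to_string(v.Number);
        case ValueType::String:
            return "\"" + v.String + "\""; // naive quoting, illustration only
    }
    return "null";
}

int main()
{
    std::cout << ToJson(Value{}) << "\n"; // prints: null
}
```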
Apparently there was a reason for making the members of generated classes
atomic. However, this was only done for some types; others were still accessed
using non-atomic operations. For members of type T::Ptr (i.e. intrusive_ptr<T>),
this can result in a double free when multiple threads access the same variable
and at least one of them writes to the variable.
This commit uses std::atomic<T> for more types T (it removes the additional
constraint sizeof(T) <= sizeof(void*)) and falls back to a type that guards
load and store operations with a mutex where that is not possible.
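A minimal sketch of such a mutex-based fallback; the actual generated-class
code differs, but the load/store locking idea is the same:

```cpp
#include <mutex>
#include <utility>

// Fallback for member types where std::atomic<T> is not an option:
// every load and store goes through a mutex instead.
template<typename T>
class Locked
{
public:
    T load() const
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        return m_Value;
    }

    void store(T value)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Value = std::move(value);
    }

private:
    mutable std::mutex m_Mutex;
    T m_Value{};
};

// Per member type, the generated code could then pick the representation
// along the lines of (illustrative only, CanUseStdAtomic is a made-up trait):
//   std::conditional_t<CanUseStdAtomic<T>, std::atomic<T>, Locked<T>>
```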
The very same object is already serialized a few lines above and the result is
even stored in a variable, but that variable was not used before. Simply using
this variable results in a noticeable improvement of config validation times.
Checkable::FireSuppressedNotifications() compares the time of the current
checkable with the last recovery time of its parents to avoid sending a
notification right after a parent recovered and before the current checkable
was checked.
This commit makes this check also include the host if the checkable is a
service. This makes the behavior consistent with the documentation, which states
that there is an implicit dependency on the host (which unfortunately is not
realized by implicitly generating a Dependency object).
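A hypothetical sketch of the idea with stand-in types and fields (not the
actual Checkable API):

```cpp
#include <vector>

struct Checkable
{
    double LastRecovery = 0;         // time of the last recovery
    Checkable* Host = nullptr;       // set if this checkable is a service
    std::vector<Checkable*> Parents; // explicit dependencies
};

bool RecoveredAfter(const Checkable& parent, double lastCheckOfCheckable)
{
    return parent.LastRecovery > lastCheckOfCheckable;
}

bool SuppressUntilNextCheck(const Checkable& checkable, double lastCheckOfCheckable)
{
    for (const Checkable* parent : checkable.Parents) {
        if (RecoveredAfter(*parent, lastCheckOfCheckable))
            return true;
    }

    // New: the implicit host dependency of a service gets the same treatment.
    if (checkable.Host && RecoveredAfter(*checkable.Host, lastCheckOfCheckable))
        return true;

    return false;
}
```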
as they would be re-connected in Resume() (HA).
Before, they were still connected while paused and connected X+1 times
after X split-brains (the same data was written X+1 times).
To prevent Icinga 2 from being restarted while one or more requests are still
in progress and ending up as corrupted stages without a status file and startup
logs.
If comments get removed in unintended ways (i.e. not by expiring or by using
the remove-comment API action), the comment object misses information to create
a proper history event for Icinga DB. Therefore, discard these events.
On acknowledgement, Icinga first adds a comment and then acks the checkable so
that the ack event has the comment ID. But because the ack did not exist yet at
that point, the comment was missing is_sticky. That's corrected now.
Icinga DB doesn't expect comment history for ack comments.
Before:
1. Acked checkable recovers
2. Icinga clears ack comments w/o setting removal time
3. Icinga DB gets neither removal time, nor expire time
4. Icinga DB falls back to NULL and violates the NOT NULL constraint
icinga::Array requires locking by the caller when iterating using Begin() and
End(). This is only checked in debug builds, but there it makes this function
fail.
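The expected calling pattern looks roughly like this, assuming the usual
Icinga 2 ObjectLock idiom around Begin()/End():

```cpp
#include "base/array.hpp"
#include "base/objectlock.hpp"

using namespace icinga;

void ProcessArray(const Array::Ptr& arr)
{
    ObjectLock olock(arr); // must be held while iterating with Begin()/End()

    for (auto it = arr->Begin(); it != arr->End(); ++it) {
        const Value& item = *it;
        // ... use item ...
        (void)item;
    }
}
```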
This commit changes the Checkable notification suppression logic (notifications
are currently suppressed on the Checkable if it is unreachable, in a downtime,
or acknowledged) so that after the suppression reason ends, a state
notification is sent if and only if the first hard state after is different
from the last hard state from before. If the checkable is in a soft state after
the suppression ends, the notification is further suppressed until a hard state
is reached.
To achieve this behavior, a new attribute state_before_suppression is added to
Checkable. This attribute is set to the last hard state the first time either a
PROBLEM or a RECOVERY notification is suppressed. Compared to before, neither
of these two flags in suppressed_notifications will ever be cleared while the
suppression is still ongoing, but only after the suppression has ended and the
current state has been compared with the old state stored in
state_before_suppression.
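A minimal sketch of the described logic with stand-in types (not the actual
Checkable code):

```cpp
#include <optional>

enum class State { Ok, Warning, Critical, Unknown };

struct SuppressionState
{
    std::optional<State> StateBeforeSuppression; // set on the first suppressed PROBLEM/RECOVERY
};

// Called when a PROBLEM or RECOVERY notification is being suppressed.
void OnNotificationSuppressed(SuppressionState& s, State lastHardState)
{
    if (!s.StateBeforeSuppression)
        s.StateBeforeSuppression = lastHardState;
}

// Called once the suppression reason has ended and a hard state is reached:
// only notify if the first hard state after differs from the one before.
bool ShouldSendStateNotification(SuppressionState& s, State firstHardStateAfter)
{
    bool send = s.StateBeforeSuppression && *s.StateBeforeSuppression != firstHardStateAfter;
    s.StateBeforeSuppression.reset();
    return send;
}
```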