1. Don't set a custom handler for SIGCHLD (in the umbrella process)
as that handler doesn't actually handle SIGCHLD anymore
2. Don't reset the SIGCHLD handler (in the worker process)
as there's nothing to reset anymore due to the above change
3. Don't block SIGCHLD across fork(2) as its handler doesn't change anymore
due to the above changes
Before:
On SIGCHLD from the forked worker the umbrella process sets a failure flag.
StartUnixWorker() recognises that and does waitpid(), failure message, etc..
On OpenBSD we can't tell the signal source, so we always set the failure flag.
That's not how our IPC shall work, that breaks the IPC sooner or later.
After:
No SIGCHLD handling and no failure flag setting.
Instead StartUnixWorker()'s wait loop uses waitpid(x,y,WNOHANG)
to avoid false positives while watching the forked worker.
This ensures that `frame.Depth` is only decreased when preceding `frame.IncreaseStackDepth()` callee was successful.
This way, `frame.Depth` will have the same depth prior to and after evaluating a frame.
Before (time: vertical, stack: horizontal):
* Checkable::ExecuteCheck
* Checkable::UpdateNextCheck
* IcingaDB::NextCheckChangedHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* RandomCheckTask::ScriptFunc
* Checkable::ProcessCheckResult
* Checkable::UpdateNextCheck
* IcingaDB::NextCheckChangedHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* IcingaDB::NewCheckResultHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* IcingaDB::StateChangeHandler
* XADD icinga:runtime:state
* IcingaDB::ForwardHistoryEntries
* XADD icinga:history:stream:state
After:
* Checkable::ExecuteCheck
* Checkable::UpdateNextCheck
* RandomCheckTask::ScriptFunc
* Checkable::ProcessCheckResult
* Checkable::UpdateNextCheck
* IcingaDB::NewCheckResultHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* IcingaDB::StateChangeHandler
* XADD icinga:runtime:state
* IcingaDB::ForwardHistoryEntries
* XADD icinga:history:stream:state
The first state + nextupdate (for overdue) update comes from next_check being
set to now + interval immediately before doing the actual check (not to trigger
it twice). This update is not only not important for the end user, but even
inappropriate. The end user SHALL see next_check being e.g. in -4s, not 5m, as
the check is running at the moment.
The second one is just redundant as IcingaDB::NewCheckResultHandler (the third
one) is called anyway and will update state + nextupdate as well.
Case:
1. icinga2 api setup
2. icinga2 daemon -C -x debug
Before: Second commands crashes at exit.
After: No crash.
As the comment between the removed lines clearly says:
Our destructors haven't been built for static data.
This is build type independent.
The point of logging to the Windows Event Log was to catch errors that happen
before the full logging configuration has been loaded and enabled. Messages
like the number of loaded objects per type just cause noise in the log and
provide little benefit. Therefore raise the required log level at this stage.
Note that this commit removes the (never documented) ability to use the -x flag
to change the level. But doing so would require patching the command line of
the service in the registry anyways.
If some kind of query is not supposed to be processed at the moment, there is
little point in checking it. During a full dump, state updates are suppressed
(i.e. delayed), so when a dump takes very long, this would have resulted in a
false Redis backlog warning.
The check makes no attempt to explicitly connect to Redis, it uses the
connection of the IcingaDB feature, so this message better describes the state
in this situation.
IcingaDB::GetConnection() uses IcingaDB::m_Rcon which is only initialized in
IcingaDB::Start(), therefore add a nullptr check to the check command.
Additionally, as m_Rcon is potentially accessed concurrently, add a copy of the
value that is safe for concurrent use.
icingadb-web shows multiple lines from the check output collapsed into a single
line. The lines containing just minuses make this look cluttered and making
making it a heading provides little to no benefit. Even when rendering markdown
in the check output at some point, having the lists labeled using normal
paragraphs would look just fine.
- Add icinga2_ and icingadb_ prefixes to make clear which component is
responsible for the value.
- Rename heartbeat_lag to heartbeat_age, describes it better in my opinion and
sound a bit less like something that should be as close to zero as possible.
- Rename redis_dump/database_sync into full_dump/full_sync as this is how these
operations are refered to in log messages as well.
- Rename Redis backlog into Redis query backlog, makes it a bit clearer in my
opinion.
- Rename runtime_backlog into runtime_update_backlog, as the component in
Icinga DB is called that way and this naming is also exposed in log messages.
- Rename dump_config/state/history into config/state/history_dump, makes it
sound more natural.
They add no additional information compared to the *_1min values as it's always
the same value divided by 60 anyways. Adding the actual value from the last
second makes little sense for realistic values of check_interval.
Use the already existing format to pass performance data to Icinga 2 rather
than some new JSON structure. Has the additional benefit of doing more things
in Go than in C++.
IcingaDB may receive callbacks from Boost signals before being fully started.
This resulted in situations where m_EnvironmentId was used before it was
initialized properly. This is fixed by initializing it earlier (during the
config validation stage). However, at this stage, it should not yet write to
disk, therefore, persisting the environment ID to disk is delayed until later
in the startup process.
Initializing at this stage has an extra benefit: if there is an error for some
reason (possibly corrupt icingadb.env file), this now shows up as a nice error
during config validation.
Additionally, this replaces the use of std::call_once with std::mutex due to
bug in libstdc++ (see inline comment for reference).
Icinga 2 treats null (Empty) as if the corresponding attribute is not
specified. However, without this commit, it would serialize the value as "null"
(i.e. type string), so that it ends up in the database as this string instead
of NULL. This commit adds handling for ValueEmpty so that is serialized as JSON
null value and ends up in the database as NULL.
Apparently there was a reason for making the members of generated classes
atomic. However, this was only done for some types, others were still accessed
using non-atomic operations. For members of type T::Ptr (i.e. intrusive_ptr<T>),
this can result in a double free when multiple threads access the same variable
and at least one of them writes to the variable.
This commit makes use of std::atomic<T> for more T (it removes the additional
constraint sizeof(T) <= sizeof(void*)) and uses a type including a mutex for
load and store operations as a fallback.
The very same object is already serialized a few lines above, the result is
even stored in a variable, but that variable was not used before. Simply using
this variable results in a noticeable improvement of config validation times.
Checkable::FireSuppressedNotifications() compares the time of the current
checkable with the last recovery time of parents to avoid notification right
after a parent recovered and before the current checkable was checked.
This commit makes this check also include to host if the checkable is a
service. This makes the behavior consistent with the documentation that states
there is an implicit dependency on the host (which isn't realized as implicitly
generating a Dependency object unfortunately).
as they would be re-connected in Resume() (HA).
Before they were still connected during pause and connected X+1 times
after X split-brains (the same data was written X+1 times).
To prevent Icinga2 from being restarted while
one or more requests are still in progress and end up
as corrupted stages without status file and startup logs.
If comments get removed in unintended ways (i.e. not by expiring or by using
the remove-comment API action), the comment object misses information to create
a proper history event for Icinga DB. Therefor, discard these events.
On ack Icinga first adds a comment, then acks the checkable
so the ack event has the comment ID.
But due to the yet missing ack the comment was missing is_sticky.
That's corrected now.
Icinga DB doesn't expect comment history for ack comments.
Before:
1. Acked checkable recovers
2. Icinga clears ack comments w/o setting removal time
3. Icinga DB gets neither removal time, nor expire time
4. Icinga DB falls back to NULL and violates NOT NULL constraint
icinga::Array requires locking by the caller when iterating using Begin() and
End(). This is only checked in debug builds but there it makes this function
fail.
This commit changes the Checkable notification suppression logic (notifications
are currently suppressed on the Checkable if it is unreachable, in a downtime,
or acknowledged) to that after the suppression reason ends, a state
notification is sent if and only if the first hard state after is different
from the last hard state from before. If the checkable is in a soft state after
the suppression ends, the notification is further suppressed until a hard state
is reached.
To achieve this behavior, a new attribute state_before_suppression is added to
Checkable. This attribute is set to the last hard state the first time either a
PROBLEM or a RECOVERY notification is suppressed. Compared to from before,
neither of these two flags in the suppressed_notification will ever be cleared
while the supression is still ongoing but only after the suppression ended and
the current state is compared with the old state stored in
state_before_suppression.
Without this commit, children and parents of a checkable were rescheduled on a
state change while holding the lock for the current checkable. If both ends of
a dependency are checked at the same time and both change state, they could end
up in a deadlock waiting for each other.
This commit fixes this problem by changing the code so that other checkables
are rescheduled only after releasing the lock for the current checkable.
When there are multiple active IDO instances on the same node, before this
commit, all of them would share a single DbValue object for the notification_id
column of the icinga_contactnotifications table. This resulted in the issue
that one database references the notification_id in another database.
This commit fixes this by using a separate DbValue value for each IDO instance.
This needs a new signal as the existing OnQuery and OnMultipleQueries signals
perform the same queries on all IDO instances, but different queries are needed
here per instance (they only differ in the referenced DbValue). Therefore, a
new signal OnMakeQueries is added that takes a std::function which is called
once per IDO instance and can access callbacks to perform one or multiple
queries only on this specific IDO instance.
to ensure Checkable#IsReachable() returns correctly for dependency children inside OnReachabilityChanged().
That needs the dependency parent to be already in the correct state.
refs #9143
StateChangeHandler() is the function used when the actual hard/soft state
changes and thus also writes state history. This is not desired in this case,
instead, a runtime update should be generated, therefore call UpdateState()
instead.
refs #9063
StateChangeHandler() is the function used when the actual hard/soft state
changes and thus also writes state history. This is not desired in this case,
instead, a runtime update should be generated, therefore call UpdateState()
instead.
refs #9063
StateChangeHandler() is the function used when the actual hard/soft state
changes and thus also writes state history. This is not desired in this case,
instead, a runtime update should be generated, therefore call UpdateState()
instead.
refs #9063
Previously, both funktions did related operations but had unclear and confusing
naming:
- UpdateState updated the icinga:{host,service}:state Redis keys.
- SendStatusUpdate sent a runtime update for the icinga:{host,service}:state.
This commit merges both functions into one with a new mode parameter. The
following modes are now supported:
- Volatile: Update the icinga:{host,service}:state Redis key.
- Full: Perform the volatile state update and in addition send a corresponding
runtime update so that this state update gets written through to the
persistent database by a running icingadb process.
- RuntimeOnly: Special mode for callers that can ensure that a volatile update
for the current state was already performed but has to be upgraded to a full
update.
refs #9063
When a comment or downtime is removed manually, the name of the requestor and
timestamp have to be synced to other nodes in the cluster to allow all of them
to generate a consistent Icinga DB history stream.
refs #9101
i.e. the confusion of the state file deserializator with e.g. `"type":32` on startup.
That would unexpectedly restore (the now ignored) null (not `{"type":32}`) as there's no type "32".
refs #8186
The config parser requires *Command#arguments#order to be a Number, i.e. 42,
4.2 or even "4.2". That's int-casted where needed, now also for Icinga DB.
Before:
```
object CheckCommand "9117" {
command = [ "true" ]
arguments = {
"4.2" = { order = "4.2" }
}
}
```
2022-01-03T13:25:07.166+0100 FATAL icingadb json: cannot unmarshal string into Go value of type int64
m_DataBuffer may be modified concurrently while StatsFunc() is called, thus
it's unsafe to call size() on it. As write access to m_DataBuffer is already
synchronized by only modifying it from the single work queue thread, instead of
adding a mutex, this commit adds a new std::atomic_size_t which is additionally
updated when modifying m_DataBuffer and can safely be accessed in StatsFunc().
There is no explicit synchronization of access to m_DataBuffer which is fine if
it is only accessed from the single-threaded work queue. However, Stop() also
called Flush() in another thread, leading to concurrent write access to
m_DataBuffer which can result in a crash due to use after free/double free.
Changes in this commit:
* Flush() is renamed to FlushWQ() to show that it should only be called from
the work queue. Additionally, it now asserts that it is running on the work
queue.
* Visibility of some data members is changed from protected to private. No
other classes have to access these at the moment. By this change, accidental
concurrent access from derived classes in the future is prevented.
* Stop() now flushes by posting FlushWQ() to the work queue and joining it.
In the 2.12.6 release, the full schema file sets the version to 1.14.3, whereas
the latest available upgrade file 2.11.0.sql sets it to 1.15.0. Therefore, ship
a new upgrade file 2.12.7.sql for all users who imported their schema with
version 2.11.0 or later and never performed an upgrade since then. Their
databases incorrectly state schema version 1.14.3 and is bumped to the correct
version 1.15.0 by the upgrade.
In the 2.13.2 release, the full schema file sets the version to 1.15.0, whereas
the latest available upgrade file 2.13.0.sql sets it to 1.15.1. Therefore,
rename the incorrectly named upgrade file 2.13.1.sql (it was not shipped in
this or any other release so far) to 2.13.3.sql for users who imported their
schema with version 2.13.0 or later and never performed an upgrade since then.
Their databases incorrectly state schema version 1.15.0 and are bumped to the
correct version 1.15.1 by the upgrade.
The full schema is not touched by this commit as for the current branch, this
was already fixed by 815533b334.
When creating a fixed downtime that starts immediately while the checkable is
in a non-OK state, previously the code path for flexible downtimes was used to
trigger this downtime. This is fixed by this commit which resolves two issued:
1. Missing downtime start notification: notifications work differently for
fixed and flexible downtimes. This resulted in missing downtime start
notifications under the conditions described above.
2. Incorrect downtime trigger time: this code path would incorrectly assume the
timestamp of the last checkable as the trigger time which is incorrect for
fixed downtimes.
not to confuse the state file deserializator with e.g. `"type":32` on startup.
That would unexpectedly restore null (not `{"type":32}`) as there's no type "32".
refs #8186
When triggering a downtime, the time of the causing event is now passed on as
the trigger time. That time is:
* For fixed downtimes: the later one of start and entry time.
* If a check result triggers the downtime: The execution end of the check
result.
* If another downtime triggers the downtime: The trigger time of the first
downtime.
This is done so two nodes in a HA setup can write consistent Icinga DB downtime
history streams.
refs #9101
`ConfigType::GetObjectsByType<Downtime>()` already returns a
`std::vector<Downtime::Ptr>` so there is no point in copying it into another
vector of the same type just to then iterate the copied vector instead of the
original one.
This commit fixes the following build error:
[ 55%] Building CXX object lib/icinga/CMakeFiles/icinga.dir/usergroup.cpp.o
lib/icinga/usergroup.cpp:79:24: error: incomplete type ‘icinga::Notification’ used in nested name specifier
79 | std::set<Notification::Ptr> UserGroup::GetNotifications() const
| ^~~
This change fixes two problems:
* The internal functions used by ScriptFunc more or less expect to operate on
full days, but ScriptFunc may have called them with some random timestamp
during the day. This is fixed by always using midnight of the day as
reference time.
* Previously, the code advanced a timestamp to the next day by adding 24 hours.
On days with DST changes, this could either still be on the same day (a day
may have 25 hours) or skip an entire day (a day may have 23 hours). This is
fixed by using a struct tm to advance the time to the next day.
Having the command type be a part of the command ID isn't needed anywhere. Removing this simplifies the way we generate IDs in general, because we don't need Prepend() anymore.
The command type was only needed to prevent ID collisions within the command_envvar and command_argument tables. Those tables have since been separated into {check,event,notification}command_envvar and {check,event,notification}command_argument tables.
PR #9036 introduces some incompatible changes to the Redis schema, most
importantly where Icinga DB has to read the environment from: now it has to use
a new top-level key of the icinga:stats message instead of a value in the
IcingaApplication part of that message.
In order to avoid changes to the environment ID, it is now no longer derived
from the Environment constant but instead from the public key of the CA
certificate. This ensures that it is different between clusters by default, so
no additional changes have to be done to allow two clusters to use Icinga DB to
write into the same database.
To prevent the ID from changing when the CA certificate is replaced, it is also
persisted into the file /var/lib/icinga2/icingadb.env, so if that file exists,
it takes precedence over the CA certificate.
... i.e. UUID -> SHA1(env, eventType, x...) given that SHA1(env, x...) = type-specific ID.
Rationale: allow both masters to write the same history concurrently (while not
in split-brain), so that REPLACE INTO deduplicates the same events written twice.
* ack: SHA1(env, "ack_set"|"ack_clear", checkable.name, setTime)
* comment: SHA1(env, "comment_add"|"comment_remove", comment.name)
* downtime: SHA1(env, "downtime_start"|"downtime_end", downtime.name)
* flapping: SHA1(env, "flapping_start"|"flapping_end", checkable.name, startTime)
* notification: SHA1(env, "notification", notification.name, notificationType, sendTime)
* state: SHA1(env, "state_change", checkable.name, changeTime)
... i.e. UUID -> SHA1(x..., send time) given that SHA1(x...) = notification id.
Rationale: allow both masters to write the same notification history concurrently (while
not in split-brain), so that REPLACE INTO deduplicates the same events written twice.
... i.e. UUID -> SHA1(x..., check time) given that SHA1(x...) = checkable id.
Rationale: allow both masters to write the same state history concurrently (while
not in split-brain), so that REPLACE INTO deduplicates the same events written twice.
add_definitions would set SD_JOURNAL_SUPPRESS_LOCATION for all targets
in directory and sub-directories. However, another future target might
want the opposite, so define it as local as possible to journaldlogger.cpp.
To make this work, we must take journaldlogger.cpp out of the unity
build, because all files from a unity of share compiler definitions.
As proposed in #8857, this adds a Logger subclass that writes structured
log messages via journald's native protocol by calling sd_journal_sendv.
The feature therefore depends on the systemd library. sd_journal_sendv is
available since the early days (systemd v38), so a version check is
probably superflous.
We add the following fields to each record:
- MESSAGE: The log message
- PRIORITY (aka severity): Numeric severity as in RFC5424 section 6.2.1
- SYSLOG_FACILITY: Numeric facility as in RFC5424 section 6.2.1
- SYSLOG_IDENTIFIER: If provided, use value from configuration.
Else use systemd's default behaior, which is to determine the field
by using libc's program_invocation_short_name, resulting in "icinga2".
- ICINGA2_FACILITY: Facility as in Log::Log(..., String facility, ...),
e.g. "ApiListener"
- some more fields are added automatically by systemd
Fields are stored indexed, so we can do fast queries for certain field
values. Example:
$ journalctl -t icinga2 ICINGA2_FACILITY=ApiListener -n 5
Syslog compatiblity is ratained because good old tag, severity and facility
is stored along, and systemd can forward to syslog daemons.
See also https://systemd.io/JOURNAL_NATIVE_PROTOCOL/.
The upcoming JournaldLogger will need the same syslog validation and
conversion logic, so factor it out from SyslogLogger to make it
reusable.
Also explicitely include syslog.h, which defines the syslog()
function.
The loop iterated over the services of the wrong host resulting in duplicate
downtimes scheduled for services of the parent host instead of downtimes for
services of the child host.
IdoPgsqlConnection::Escape() internally uses PQescapeStringConn() and its
documentation states the following:
Furthermore, PQescapeStringConn does not generate the single quotes that must
surround PostgreSQL string literals; they should be provided in the SQL
command that the result is inserted into.
So it's intended to use the result in 'string' literals, not in E'string'
literals as Icinga did. This results in problems as the behavior of
PQescapeStringConn() depends on how the current connection will interpret
regular single quoted literals, namely on the value of the
standard_conforming_strings variable.
The E'string' literals were initially introduced in
ac6f3f8acf to fix#1206 where PostgreSQL started
warning about escape sequences in string literals not supported by the SQL
standard (but by PostgreSQL depending on the value of
standard_conforming_strings). In the meantime the oldest PostgreSQL version on
any platform supported by Icinga increased to 9.2 (CentOS 7) and starting with
9.1, standard_conforming_strings is enabled by default, so there will be no
warnings about escape sequences (as the warning is only issued if the escape
sequence is actually interpreted by PostgreSQL).
Icinga started the initial config sync right after the first Redis connection
was established. If any other connections would take longer to connect than
when it's first needed, queries were discarded.
The loop in the connected callback of the parent connection uses m_Rcons which
previously was only initialized after that connection was already started.
This allows the callback to call RedisConnection::SetConnectedCallback() to set
another callback for this connection. This sets m_ConnectedCallback and thereby
destroys the std::function while it's running resulting in undefined behavior.
By operating on a copy, m_ConnectedCallback can be set without affecting the
currently running callback.
As Icinga first sends a SIGTERM to a check plugin on timeout to allow it to
terminate gracefully, this is not really part of the plugin API specification
and we cannot assume that plugins will handle this correctly and still exit
with an exit code that maps to UNKNOWN. Therefore, once Icinga decides to kill
a process, force its exit code to 128 to be sure the state will be UNKNOWN
after a timeout.