`ConfigType::GetObjectsByType<Downtime>()` already returns a
`std::vector<Downtime::Ptr>` so there is no point in copying it into another
vector of the same type just to then iterate the copied vector instead of the
original one.
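A minimal sketch of the resulting pattern (surrounding code assumed):

    for (const Downtime::Ptr& downtime : ConfigType::GetObjectsByType<Downtime>()) {
        // operate on the elements of the returned vector directly, no copy needed
    }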
This commit fixes the following build error:
[ 55%] Building CXX object lib/icinga/CMakeFiles/icinga.dir/usergroup.cpp.o
lib/icinga/usergroup.cpp:79:24: error: incomplete type ‘icinga::Notification’ used in nested name specifier
79 | std::set<Notification::Ptr> UserGroup::GetNotifications() const
| ^~~
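The fix is presumably to make the type complete in usergroup.cpp by including
its header:

    #include "icinga/notification.hpp"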
This change fixes two problems:
* The internal functions used by ScriptFunc more or less expect to operate on
full days, but ScriptFunc may have called them with some random timestamp
during the day. This is fixed by always using midnight of the day as
reference time.
* Previously, the code advanced a timestamp to the next day by adding 24 hours.
On days with DST changes, this could either still be on the same day (a day
may have 25 hours) or skip an entire day (a day may have 23 hours). This is
fixed by using a struct tm to advance the time to the next day.
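A hedged sketch of the second fix (the helper name NextDayMidnight is assumed,
not taken from the actual patch):

    #include <ctime>

    time_t NextDayMidnight(time_t ts)
    {
        struct tm t;
        localtime_r(&ts, &t);
        t.tm_mday++;                          // one calendar day forward
        t.tm_hour = t.tm_min = t.tm_sec = 0;  // midnight as reference time
        t.tm_isdst = -1;                      // let mktime() detect DST for the new day
        return mktime(&t);
    }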
Having the command type be a part of the command ID isn't needed anywhere. Removing this simplifies the way we generate IDs in general, because we don't need Prepend() anymore.
The command type was only needed to prevent ID collisions within the command_envvar and command_argument tables. Those tables have since been separated into {check,event,notification}command_envvar and {check,event,notification}command_argument tables.
PR #9036 introduces some incompatible changes to the Redis schema, most
importantly where Icinga DB has to read the environment from: now it has to use
a new top-level key of the icinga:stats message instead of a value in the
IcingaApplication part of that message.
In order to avoid changes to the environment ID, it is now no longer derived
from the Environment constant but instead from the public key of the CA
certificate. This ensures that it is different between clusters by default, so
no additional changes have to be done to allow two clusters to use Icinga DB to
write into the same database.
To prevent the ID from changing when the CA certificate is replaced, it is also
persisted into the file /var/lib/icinga2/icingadb.env, so if that file exists,
it takes precedence over the CA certificate.
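Sketched precedence as described above (helper names are assumptions, not the
actual implementation):

    String GetEnvironmentId()
    {
        // The persisted file takes precedence over deriving from the CA certificate.
        if (Utility::PathExists("/var/lib/icinga2/icingadb.env"))
            return ReadEnvFile("/var/lib/icinga2/icingadb.env");

        return SHA1(GetCaCertificatePublicKey());
    }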
... i.e. UUID -> SHA1(env, eventType, x...) given that SHA1(env, x...) = type-specific ID.
Rationale: allow both masters to write the same history concurrently (while not
in split-brain), so that REPLACE INTO deduplicates the same events written twice.
* ack: SHA1(env, "ack_set"|"ack_clear", checkable.name, setTime)
* comment: SHA1(env, "comment_add"|"comment_remove", comment.name)
* downtime: SHA1(env, "downtime_start"|"downtime_end", downtime.name)
* flapping: SHA1(env, "flapping_start"|"flapping_end", checkable.name, startTime)
* notification: SHA1(env, "notification", notification.name, notificationType, sendTime)
* state: SHA1(env, "state_change", checkable.name, changeTime)
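A hedged sketch of how such an ID could be computed (function name, argument
encoding and the absence of separators are assumptions, not the actual
implementation):

    #include <openssl/sha.h>
    #include <string>

    std::string HistoryEventId(const std::string& env, const std::string& eventType,
                               const std::string& objectName, const std::string& time)
    {
        std::string input = env + eventType + objectName + time;
        unsigned char digest[SHA_DIGEST_LENGTH];
        SHA1(reinterpret_cast<const unsigned char*>(input.data()), input.size(), digest);
        return std::string(reinterpret_cast<const char*>(digest), SHA_DIGEST_LENGTH);
    }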
... i.e. UUID -> SHA1(x..., send time) given that SHA1(x...) = notification id.
Rationale: allow both masters to write the same notification history concurrently (while
not in split-brain), so that REPLACE INTO deduplicates the same events written twice.
... i.e. UUID -> SHA1(x..., check time) given that SHA1(x...) = checkable id.
Rationale: allow both masters to write the same state history concurrently (while
not in split-brain), so that REPLACE INTO deduplicates the same events written twice.
add_definitions would set SD_JOURNAL_SUPPRESS_LOCATION for all targets
in the directory and its sub-directories. However, another future target
might want the opposite, so define it as locally as possible to
journaldlogger.cpp.
To make this work, we must take journaldlogger.cpp out of the unity
build, because all files in a unity build share compiler definitions.
As proposed in #8857, this adds a Logger subclass that writes structured
log messages via journald's native protocol by calling sd_journal_sendv.
The feature therefore depends on the systemd library. sd_journal_sendv is
available since the early days (systemd v38), so a version check is
probably superfluous.
We add the following fields to each record:
- MESSAGE: The log message
- PRIORITY (aka severity): Numeric severity as in RFC5424 section 6.2.1
- SYSLOG_FACILITY: Numeric facility as in RFC5424 section 6.2.1
- SYSLOG_IDENTIFIER: If provided, use value from configuration.
Else use systemd's default behavior, which is to determine the field
by using libc's program_invocation_short_name, resulting in "icinga2".
- ICINGA2_FACILITY: Facility as in Log::Log(..., String facility, ...),
e.g. "ApiListener"
- some more fields are added automatically by systemd
Fields are stored indexed, so we can do fast queries for certain field
values. Example:
$ journalctl -t icinga2 ICINGA2_FACILITY=ApiListener -n 5
Syslog compatibility is retained because the good old tag, severity and
facility are stored alongside, and systemd can forward to syslog daemons.
See also https://systemd.io/JOURNAL_NATIVE_PROTOCOL/.
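A minimal sketch of the native-protocol call (the field values are examples;
the actual logger assembles them from the log entry):

    #include <systemd/sd-journal.h>
    #include <sys/uio.h>
    #include <string>
    #include <vector>

    void SendToJournald(const std::vector<std::string>& fields)
    {
        // One iovec per "FIELD=value" entry; journald indexes every field.
        std::vector<struct iovec> iov;
        iov.reserve(fields.size());
        for (const std::string& f : fields)
            iov.push_back({ const_cast<char*>(f.data()), f.size() });

        sd_journal_sendv(iov.data(), static_cast<int>(iov.size()));
    }

    // Example:
    // SendToJournald({"MESSAGE=connected", "PRIORITY=6",
    //                 "SYSLOG_FACILITY=1", "ICINGA2_FACILITY=ApiListener"});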
The upcoming JournaldLogger will need the same syslog validation and
conversion logic, so factor it out from SyslogLogger to make it
reusable.
Also explicitly include syslog.h, which defines the syslog()
function.
The loop iterated over the services of the wrong host, resulting in duplicate
downtimes being scheduled for services of the parent host instead of downtimes
for services of the child host.
IdoPgsqlConnection::Escape() internally uses PQescapeStringConn() and its
documentation states the following:
Furthermore, PQescapeStringConn does not generate the single quotes that must
surround PostgreSQL string literals; they should be provided in the SQL
command that the result is inserted into.
So it's intended to use the result in 'string' literals, not in E'string'
literals as Icinga did. This results in problems as the behavior of
PQescapeStringConn() depends on how the current connection will interpret
regular single quoted literals, namely on the value of the
standard_conforming_strings variable.
The E'string' literals were initially introduced in
ac6f3f8acf to fix #1206 where PostgreSQL started
warning about escape sequences in string literals not supported by the SQL
standard (but by PostgreSQL depending on the value of
standard_conforming_strings). In the meantime the oldest PostgreSQL version on
any platform supported by Icinga increased to 9.2 (CentOS 7) and starting with
9.1, standard_conforming_strings is enabled by default, so there will be no
warnings about escape sequences (as the warning is only issued if the escape
sequence is actually interpreted by PostgreSQL).
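Illustrated with a hypothetical query (Escape() wraps PQescapeStringConn()):

    String escaped = Escape(value);
    String good = "... WHERE name = '" + escaped + "'";   // what PQescapeStringConn() expects
    String bad  = "... WHERE name = E'" + escaped + "'";  // what Icinga did previously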
Icinga started the initial config sync right after the first Redis connection
was established. If one of the other connections was not yet established by
the time it was first needed, queries to it were discarded.
The loop in the connected callback of the parent connection uses m_Rcons which
previously was only initialized after that connection was already started.
This allows the callback to call RedisConnection::SetConnectedCallback() to set
another callback for this connection. This sets m_ConnectedCallback and thereby
destroys the std::function while it's running resulting in undefined behavior.
By operating on a copy, m_ConnectedCallback can be set without affecting the
currently running callback.
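A minimal sketch of the pattern (the method name is assumed; m_ConnectedCallback
is the member described above):

    void RedisConnection::InvokeConnectedCallback()
    {
        auto callback (m_ConnectedCallback); // copy, so the callback may call
                                             // SetConnectedCallback() safely
        if (callback)
            callback();
    }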
Icinga first sends SIGTERM to a check plugin on timeout to allow it to
terminate gracefully. However, this is not really part of the plugin API
specification and we cannot assume that plugins handle it correctly and still
exit with an exit code that maps to UNKNOWN. Therefore, once Icinga decides to
kill a process, force its exit code to 128 to be sure the state will be
UNKNOWN after a timeout.
I.e. do the following in parallel (highest priority first):
* Stream state changes to icinga:runtime:state
* Sync config and initial state,
then let queued runtime updates to the just synced state pass
Many functions of LegacyTimePeriod take a tm pointer as an input parameter and
then pass it to mktime() which actually modifies it. This causes problems if
tm_isdst was intentionally set to -1 (to automatically detect whether DST is
active at some time) and then a function is called that implicitly sets
tm_isdst and then the values of tm are modified in a way that crosses a DST
change. This resulted in 1 hour offsets with ScheduledDowntimes on days with
DST changes.
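A hedged illustration of the failure mode (dates assume a CET/CEST timezone):

    #include <ctime>

    int main()
    {
        struct tm t = {};
        t.tm_year = 121; t.tm_mon = 2; t.tm_mday = 27; // 2021-03-27, before the DST change
        t.tm_hour = 12;
        t.tm_isdst = -1; // auto-detect DST
        mktime(&t);      // side effect: t.tm_isdst is now 0 (no DST on the 27th)
        t.tm_mday++;     // advance across the DST change to 2021-03-28
        mktime(&t);      // interprets 12:00 with the stale tm_isdst: one hour off
    }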
* Implement scheduling_source attribute
This implements the attribute `scheduling_source` for hosts and services to show which endpoint is running the scheduler for the check.
refs #4814
IdoMysqlConnection, IdoPgsqlConnection, IcingaDB, and ElasticsearchWriter
require passwords in their configuration to authenticate against external
services. This commit ensures that these can no longer be accessed using the
API.
Only handle messages with a trusted origin in
ClusterEvents::ExecuteCommandAPIHandler. Previously, it would not execute any
commands locally but forward them to other nodes, where they would then have a
trusted origin and be executed.
So far, the documentation has claimed that loggers have a default severity
(information for FileLogger and warning for SyslogLogger). However, this was
not the case and not setting the severity resulted in a configuration error.
This commit changes the default value to be information for all loggers.
This commit adds a timeout for both establishing new outgoing and incoming
connections. This timeout applies to everything until the connection is in a
state where either JsonRpcConnection or HttpServerConnection takes over.
As silent now no longer only controls the generation of log messages, a better
name is required. This changes its name, inverts its value to reflect the new
name and adds a documentation comment.
When Icinga 2 is started as a service, the early log messages generated
until the FileLogger object is activated are lost and make it really
hard to debug issues that (only) occur when Icinga 2 reloads.
With this commit, these early log messages are written to the Windows
Event Log.
68a0079c26 introduced two problems that are fixed
with this commit:
1. The new truncated/hashed name did not use EscapeName()
2. There was a possible collision of names when creating objects with a full
name of format "[80 characters]...[40 hex digits]" (i.e. the same as the
truncated/hashed variant but short enough that it isn't hashed)
The old validation regex only matched if the name consisted entirely of
invalid characters, not if it merely contained some, i.e. something like
"foo/bar" was considered valid.
This commit replaces the regex with a check that all characters in the name are
allowed characters.
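A hedged sketch of such a check (the exact allowed set is an assumption):

    #include <algorithm>
    #include <string>

    bool IsValidObjectName(const std::string& name)
    {
        static const std::string allowed =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "0123456789_.- ";
        // Every character must be in the allowed set; "foo/bar" now fails.
        return std::all_of(name.begin(), name.end(), [](char c) {
            return allowed.find(c) != std::string::npos;
        });
    }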
I.e. keep the serializations as simple as possible:
null => null
true => true
42.0 => 42
"foobar" => foobar
{{42}} => Object of type 'Function'
(["foobar"] and {"foo"="bar"} can't occur there.)
Even if a double represents an integer value, it might not be safe to cast it
to long long as it may overflow the type. Instead, just print the double value
with 0 decimals using std::setprecision.
Before:
<1> => 18446744073709551616.to_string()
"-9223372036854775808"
After:
<1> => 18446744073709551616.to_string()
"18446744073709551616"
Fixes the following build error:
/home/jbrost/dev/icinga2/lib/base/stdiostream.cpp: In member function ‘virtual size_t icinga::StdioStream::Read(void*, size_t, bool)’:
/home/jbrost/dev/icinga2/lib/base/stdiostream.cpp:28:15: error: invalid use of incomplete type ‘std::iostream’ {aka ‘class std::basic_iostream<char>’}
28 | m_InnerStream->read(static_cast<char *>(buffer), size);
| ^~
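Presumably fixed by completing the type with the proper include in
stdiostream.cpp:

    #include <iostream>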
Note that even when passing `nullptr` as target zone to `RelayMessage()`, the
cluster message will still be sent to the parent zone. These incoming messages
will now be rejected by the parent nodes. At the moment, there's no way to only
send within the local zone.
Unfortunately, the symbol resolution of boost::stacktrace is broken on
FreeBSD, therefore fall back to using backtrace_symbols() to print the
stack trace saved by Boost.
Additionally, -D_GNU_SOURCE is required on FreeBSD for the
_Unwind_Backtrace function used by boost::stacktrace.
This makes the format more similar to what the uncaught C++ and SEH
exception handlers write. Previously there was no indication in the
crash log that a SIGABRT happened.
Maybe this will save the next person who has to look at this code some
time. Please don't blame me for the implementation, I'm just trying to
reconstruct what it does.
The logic for selecting the traces to print stays the same, but there
are fewer nested ifs now. This changes the format of the returned string
a bit by adding a heading for both traces.
By default, DiagnosticInformation uses the stack trace saved when the
exception was thrown, but this mechanism is not in use on Windows.
Gathering a stacktrace in the terminate handler serves as a fallback.
On Windows, the terminate handler is executed for uncaught C++
exceptions unless a SEH unhandled exception filter is also set. In this
case, this filter has to explicitly chain the default filter to keep
this behavior.
Previously:
1. You delete an object from a config file
2. You reload Icinga
3. Icinga fetches all objects and whether they're active from the IDO
4. Icinga recognizes that the just deleted object doesn't exist anymore
5. Icinga marks it as inactive in the IDO, but not in memory
6. You re-create the just deleted object via API
7. Icinga still thinks it's active and doesn't activate it - it's invisible
refs #8584
Previously, the initial config dump was started in a timer executed
every 15 seconds. During the first execution of the timer, the Redis
connection is typically not established yet. Therefore, this delayed the
initial sync by up to 15 seconds.
This commit instead triggers the sync from a callback that is executed
after the connection is successfully established.
The timer is removed completely. On first glance, it looks like it would
ensure that a lost connection is reestablished, but this is handled
internally by RedisConnection. After the config has been dumped once,
that timer wouldn't ever attempt a reconnect anyway.
Disregard passive check results while no active checks are being scheduled due to violated dependencies.
This copes with the fact that programs feeding passive check results into Icinga may have no notion of reachability and so drive a checkable into a HARD state although dependencies have caused active check scheduling to be suspended. This may prevent superfluous problem notifications from being emitted during recovery.
As disable_checks defaults to false, it was regarded OK (by @Al2Klimov) to make this behaviour (which resembles the active check case) unconditional instead of conditioning it on an additional attribute.
In the description of disable_checks, note that a value of true both disables the scheduling of active checks and drops passive check results.
This delays the log message stating that the initial dump is done until
all queries are actually done and now logs a meaningful duration. In
addition, this delays the return of the function and therefore when
state variables are updated by the caller.
This commit sets the activation priority of IcingaDB objects to 100 (the
same value as IDO uses) so that they get activated after most regular
config objects (hosts, services, ...).
Before (note how Icinga 2 continues to activate objects for over a minute
after IcingaDB is started and thinks the initial dump is done):
[2021-01-19 08:33:19 +0000] information/IcingaDB: 'icingadb' started.
[2021-01-19 08:34:02 +0000] information/IcingaDB: Initial config/status dump finished in 28.247 seconds.
[2021-01-19 08:35:49 +0000] information/ConfigItem: Activated all objects.
After (now activation of objects is done right after IcingaDB is
started, as it's one of the last objects to be activated):
[2021-01-19 08:39:01 +0000] information/IcingaDB: 'icingadb' started.
[2021-01-19 08:39:02 +0000] information/ConfigItem: Activated all objects.
[2021-01-19 08:39:38 +0000] information/IcingaDB: Initial config/status dump finished in 21.6606 seconds.
There is an assertion that after activating items, all these items are
active, which sounds reasonable at first. However, with concurrent API
queries, some of these could already be deleted and therefore be
deactivated again.
At numerous places in the code, something like this is performed:
String name = Downtime::AddDowntime(...);
Downtime::Ptr downtime = Downtime::GetByName(name);
However, `downtime` can be a `nullptr` after this as it is possible that
the downtime is deleted in between.
This commit changes the return type of `Downtime::AddDowntime` to return
a Downtime::Ptr instead of the full name of the downtime. `AddDowntime`
performs the very same `GetByName()` operation internally, but handles
the `nullptr` case correctly and throws an exception.
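After the change, callers can rely on a non-null pointer:

    Downtime::Ptr downtime = Downtime::AddDowntime(...); // throws instead of returning nullptr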
Since commit d9010c7b9f, ActivateItems no
longer uses the WorkQueue upq to perform tasks but instead performs
these locally. One instance of `upq.Join()`/`upq.HasExceptions()`
remained in the function, but I believe this was just missed when
removing the `upq.Enqueue()` call just before.
This commit removes the corresponding parameter and updates all call
sites accordingly.
`this` could be deleted after `Notification::BeginExecuteNotification`
exited and before `Notification::ExecuteNotificationHelper` finished.
This is fixed by constructing a `Notification::Ptr` and operating on that
one as it is properly reference-counted.
Only two out of three cases were handled properly by the code: host
downtimes referencing a deleted host and service downtimes referencing a
deleted service worked fine. However, if a service downtime references a
deleted host, `Host::GetByName()` returns `nullptr` which isn't
accounted for. Use `Service::GetByNamePair()` instead as this performs a
check for the host being null internally.
Boost.Beast changed the signature of
boost::beast::http::basic_fields::set in version 1.74 so that it no longer
allows passing an icinga::String instance as value. This adds a
conversion function so that it works again.
Boost.Beast changed the signature of the previously used generic `set`
method so that it no longer accepts integer types. However, there is
already a more specific method for setting the Content-Length header, so
use this one instead.
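A minimal sketch (the response and body variables are placeholders):

    response.content_length(body.size()); // dedicated setter, unaffected by the set() change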
When a CRL is specified in the ApiListener configuration, Icinga 2 so far
only used it when establishing connections, but not when a certificate was
requested. With this change, a node can automatically renew a revoked
certificate if it meets the other conditions for auto-renewal (issued before
2017 or expiring in less than 30 days).
The function was never used and its implementation contains a bug where
a buffer of too small a size is used as a parameter to ERR_error_string.
According to `man 3 ERR_error_string`, the buffer has to be at least
256 bytes in size.
Also, the function seems of limited use as it outputs the tag object used
to attach additional error information to exceptions in Boost. However,
boost::get_error_info<>() just returns the value type, not the full tag
object from the exception.
Scientific notation is generally allowed by our performance data parser.
In an edge case where scientific notation is delivered with an uppercase
E, our parser did not recognize it and dropped it as invalid performance
data.
This wrong interpretation as invalid performance data affects all parts of
the performance data: the value itself, the warning value, the critical
value, the minimum value and the maximum value.
Merge AsyncTryActivateZonesStage and TryActivateZonesStageCallback and
name the result TryActivateZonesStage. The old split was a leftover from
one of them being a callback function and provided no meaningful separation.
Anonymous connections are normally only used for requesting a
certificate and are closed after this request is received. However, the
request is only sent if the child has successfully verified the
certificate of its parent so that it is an authenticated connection from
its perspective. In case this verification fails, both ends view it as
an anonymous connection and never actually use it but attempt a
reconnect after 10 seconds, leaking the connection. Therefore close it
after a timeout.
RelayMessageOne used to relay the message only to one other endpoint for
other zones, which is fine, as long as the target zone is a child/parent
zone but breaks if the target zone is a global one. In this case, the
message has to be forwarded within the local zone as well as to one node
in each child zone.
The initial config object sync for each new connection (in
`ApiListener::SendRuntimeConfigObjects()`) only considers currently
existing objects and has no way to pass the information that objects
were deleted in the meantime.
This commit logs config object deletions to the replay log if required
so that there is a chance that it will be propagated to nodes that were
offline when the deletion happened.
Note that this can only be considered a workaround as the replay log
might be pruned or could even be completely disabled. Also, there still
seems to be a race condition between the config sync and replay log of
multiple new connections at the same time.
Before PostgreSQL 9.1, this setting defaulted to off and the icinga2 code
was making heavy use of this feature. Since PostgreSQL 9.1, this setting
defaults to on. During the adoption of PostgreSQL >= 9.1, the icinga2
postgres IDO code maintained compatibility by setting it to off
explicitly.
In the meantime, the postgres IDO code has been converted to using the
`E'...'` escape literal syntax exclusively.
The last remaining step is now to no longer force the setting to off
because no query is using the feature any longer.
Closes GitHub issue #8122.