The problem was that some PerfData labels contained several whitespace characters,
not just one, and therefore it was parsed incorrectly in `SplitPerfdata()`. I.e. the condition
in line 144 checks whether the first and last character is a normal quote, but since the
label can contain spaces at the beginning and at the end respectively, this caused the problems.
This PR fixes the problem by removing all occurring whitespace from the beginning and end,
before starting to parse the actual label.
This allows for a faster config load due to less locking required.
The change is slightly backwards-incompatible. Before, you could manipulate the
globals namespace at a later stage, but disallowing this feels reasonable for
the performance benefit alone (which especially shows on many-core machines).
Apart from that, it's doubtful if doing so is even useful at all as the DSL
provides no mechanism for you to synchronize your operations that may run in
parallel. The data structures itself are protected from race conditions, but
anything implemented on top of this may still be subject to race conditions.
And even if some user has a good reason for doing this, there's a feasible
workaround by creating your own namespace like globals.mutable and using that
instead.
If a connection hangs for too long in ApiListener#NewClientHandler(),
ApiListener#AddConnection()'s Timeout calls boost::asio::basic_socket#cancel()
on that connection to trigger an exception which unwinds
ApiListener#NewClientHandler(). Previously that unwind could trigger a Defer
which called boost::asio::ssl::stream#async_shutdown() which extended the hang.
Normally if for some reason an ack comment still exists on a checkable not
acked anymore, still clean it up. But while replaying log config objects
incl. ack comments come before check results and acks. I.e. 1) ack comment,
2) DOWN check result and 3) ack. Not 1) DOWN check result, 2) ack and 3) ack
comment. So the checkable is temporarily not acked, but already has the ack
comment. In this case the DOWN check result which is older than the ack
comment shall not clean up the latter.
Traditional behaviour was to regard all dependecies as cumulative (e.g., the parent considered unreachable if any one dependency is violated), commit ed58922389 made all dependencies regarded redundant (e.g., the parent considered unreachable only if all dependency are violated). This may lead to unrelated services (or even hosts vs. services) inadvertantly regarded to be redundant to each other.
Most importantly, applying the explicit "disable-host-service-checks" dependency described in the "Monitoring Basics" chapter will defeat all other dependencies.
This commit introduces a new "redundancy_group" attribute for dependencies.
Specifying a redundancy_group causes a dependency to be regarded as redundant only inside that redundancy group.
Dependencies lacking a redundancy_group attribute are regarded as essential for the parent.
This allows for both cumulative and redundant dependencies and even a combination (cumulation of redundancies, like SSH depeding on both LDAP and DNS to function, while operating redundant LDAP servers as well as redundant DNS resolvers).
This commit lacks changes to the tests.
master before #9627 (a0286e9c6):
<1> => namespace n { x = 42; x = 42 }
^^^^^^
Constant must not be modified.
<2> =>
HEAD of #9627 (24b57f0d3):
<1> => namespace n { x = 42; x = 42 }
null
<2> =>
if caused by dependency or check period.
Now as long as any of the above causes check skips
next check and next update will be up-to-date in Icinga DB,
so the checkable won't slide into false positive overdue.
This was accidentally broken by #9627 because during config sync, a config
validation happens that uses `--define System.ZonesStageVarDir=...` which fails
on the now frozen namespace.
This commit changes this to use `Internal.ZonesStageVarDir` instead. After all,
this is used for internal functionality, users should not directly interact
with this flag.
Additionally, it no longer freezes the `Internal` namespace which actually
allows using `Internal.ZonesStageVarDir` in the first place. This also fixes
`--define Internal.Debug*` which was also broken by said PR. Freezing of the
`Internal` namespace is not necessary for performance reasons as it's not
searched implicitly (for example when accessing `globals.x`) and should users
actually interact with it, they should know by that name that they are on their
own.
This commit moves the initialization of the globals.Types namespace to type.cpp
in order to keep a pointer to the Namespace object in Type::m_Namespace and
simplify Type::GetByName() using it.
The dynamic type check is moved into an assertion after freezing the namespace.
This makes freezing a namespace an irrevocable operation but in return allows
omitting further lock operations. This results in a performance improvement as
reading an atomic bool is faster than acquiring and releasing a shared lock.
ObjectLocks on namespaces remain untouched as these mostly affect write
operations which there should be none of after freezing (if there are some,
they will throw exceptions anyways).
This commit removes EmbeddedNamespaceValue and ConstEmbeddedNamespaceValue and
reduces NamespaceValue down to a simple struct without inheritance or member
functions. The code from these clases is inlined into the Namespace class. The
class hierarchy determining whether a value is const is moved to an attribute
of NamespaceValue.
This is done in preparation for changes to the locking in the Namespace class.
Currently, it relies on a recursive mutex. In the future, a shared mutex
(read/write lock) should be used instead, which cannot allow recursive locking
(without failing or risk deadlocking on lock upgrades). With this change, all
operations requiring a lock for one operation are within one function, no
recursive locking is not needed any more.
This commit adds a new initialization priority `FreezeNamespaces` that is run
last and moves all calls to `Namespace::Freeze()` there. This allows all other
initialization functions to still update namespaces without the use of the
`overrideFrozen` flag.
It also moves the initialization of `System.Platform*` and `System.Build*` to
an initialize function so that these can also be set without setting
`overrideFrozen`.
This is preparation for a following commit that will make the frozen flag in
namespaces finial, no longer allowing it to be overriden (freezing the
namespace will disable locking, so performing further updates would be unsafe).
Now that all values are in one place, there is no reason for this numbering
with gaps anymore. If you need to insert a new value in between, you can just
do so in the enum.
This reverses the sort order of the enum, thereby requiring a change to the
sort order of the std::priority_queue containing the elements.
Change the type of the priority values from int to a new enum. By replacing the
magic int values throughout the code base with named values, there is now a
single place where all priority values are defined and you get an overview over
the initialization order.
InitializeOnceHelper calls Loader::AddDeferredInitializer which takes a
std::function, so it's eventually converted to that anyways. This commit just
does this a bit earlier, and by saving the step of the intermediate C function
pointer, this would now also work for capturing lambdas (which there are none
of at the moment).
to restore the behavior before the previous commit. Otherwise we'd delete all
API object's child objects' files including applied child object rules in /etc.
In essence, namespace behaviors acted as hooks for update operations on
namespaces. Two behaviors were implemented:
- `NamespaceBehavior`: allows the update operation unless it acts on a value
that itself was explicitly marked as constant.
- `ConstNamespaceBehavior`: initially allows insert operations but marks the
individual values as const. Additionally provides a `Freeze()` member
function. After this was called, updates are rejected unless a special
`overrideFrozen` flag is set explicitly.
This marvel of object-oriented programming can be replaced with a simple bool.
This commit basically replaces `Namespace::m_Behavior` with
`Namespace::m_ConstValues` and inlines the behavior functions where they were
called. While doing so, the code was slightly simplified by assuming that
`m_ConstValues` is true if `m_Frozen` is true. This is similar to what the API
allowed in the old code as you could only freeze a `ConstNamespaceBehavior`.
However, this PR moves the `Freeze()` member function and the related
`m_Freeze` member variable to the `Namespace` class. So now the API allows any
namespace to be frozen. The new code also makes sense with the previously
mentioned simplification: a `Namespace` with `m_ConstValues = false` can be
modified without restrictions until `Freeze()` is called. When this is done, it
becomes read-only.
The changes outside of `namespace.*` just adapt the code to the slightly
changed API.
Copying an ObjectLock results in the underlying mutex being unlocked too often.
There's also no good reason for copying a scoped locking class (if at all, it
should be moved).
instead of computing from scratch if they're in the _api package.
For now this changes literally nothing as paths of existing objects still follow
the scheme of paths of new objects which didn't change. Now Icinga only doesn't expect
existing objects at particular paths. However, with the latter in v2.14+ (agent,
satellite) we can just change the path scheme of new objects in v2.16+ (master)
as we wish. The child nodes will just follow the new scheme of paths of new objects.
This prevents the `m_HasMatches` property from being altered simultaneously.
This might seem harmless (since this property can only be set to true by any calling thread),
however, from a technical (C++) point of view, this constitutes a data race.
1. The lookup of apply rules per source type now implies
no String(const char*) (no malloc()) and just pointer (uint64) comparisions
2. Apply rules are now also grouped by target type via a nested map, that obsoletes
checking the target type while iterating over all rules per source type
When committing an item with `ignore_on_error` flag set fails, the `Commit()` method only returns `nullptr`
and the current item is not being dropped from `m_Items`. `CommittNewItems()` also doesn't check the return
value of `Commit()` but just continues and tries to commit all items from `m_Items` in recursive call. Since
this corrupt item is never removed from `m_Items`, it ends up in an endless recursion till it finally crashes.
by caching the total minimum log severity of all loggers in a
"global variable" and whether a message's severity is large enough for any of
the loggers in a per-message no-op flag.
1. Don't set a custom handler for SIGCHLD (in the umbrella process)
as that handler doesn't actually handle SIGCHLD anymore
2. Don't reset the SIGCHLD handler (in the worker process)
as there's nothing to reset anymore due to the above change
3. Don't block SIGCHLD across fork(2) as its handler doesn't change anymore
due to the above changes
Before:
On SIGCHLD from the forked worker the umbrella process sets a failure flag.
StartUnixWorker() recognises that and does waitpid(), failure message, etc..
On OpenBSD we can't tell the signal source, so we always set the failure flag.
That's not how our IPC shall work, that breaks the IPC sooner or later.
After:
No SIGCHLD handling and no failure flag setting.
Instead StartUnixWorker()'s wait loop uses waitpid(x,y,WNOHANG)
to avoid false positives while watching the forked worker.
This ensures that `frame.Depth` is only decreased when preceding `frame.IncreaseStackDepth()` callee was successful.
This way, `frame.Depth` will have the same depth prior to and after evaluating a frame.
Before (time: vertical, stack: horizontal):
* Checkable::ExecuteCheck
* Checkable::UpdateNextCheck
* IcingaDB::NextCheckChangedHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* RandomCheckTask::ScriptFunc
* Checkable::ProcessCheckResult
* Checkable::UpdateNextCheck
* IcingaDB::NextCheckChangedHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* IcingaDB::NewCheckResultHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* IcingaDB::StateChangeHandler
* XADD icinga:runtime:state
* IcingaDB::ForwardHistoryEntries
* XADD icinga:history:stream:state
After:
* Checkable::ExecuteCheck
* Checkable::UpdateNextCheck
* RandomCheckTask::ScriptFunc
* Checkable::ProcessCheckResult
* Checkable::UpdateNextCheck
* IcingaDB::NewCheckResultHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* IcingaDB::StateChangeHandler
* XADD icinga:runtime:state
* IcingaDB::ForwardHistoryEntries
* XADD icinga:history:stream:state
The first state + nextupdate (for overdue) update comes from next_check being
set to now + interval immediately before doing the actual check (not to trigger
it twice). This update is not only not important for the end user, but even
inappropriate. The end user SHALL see next_check being e.g. in -4s, not 5m, as
the check is running at the moment.
The second one is just redundant as IcingaDB::NewCheckResultHandler (the third
one) is called anyway and will update state + nextupdate as well.
Case:
1. icinga2 api setup
2. icinga2 daemon -C -x debug
Before: Second commands crashes at exit.
After: No crash.
As the comment between the removed lines clearly says:
Our destructors haven't been built for static data.
This is build type independent.
The point of logging to the Windows Event Log was to catch errors that happen
before the full logging configuration has been loaded and enabled. Messages
like the number of loaded objects per type just cause noise in the log and
provide little benefit. Therefore raise the required log level at this stage.
Note that this commit removes the (never documented) ability to use the -x flag
to change the level. But doing so would require patching the command line of
the service in the registry anyways.
If some kind of query is not supposed to be processed at the moment, there is
little point in checking it. During a full dump, state updates are suppressed
(i.e. delayed), so when a dump takes very long, this would have resulted in a
false Redis backlog warning.
The check makes no attempt to explicitly connect to Redis, it uses the
connection of the IcingaDB feature, so this message better describes the state
in this situation.
IcingaDB::GetConnection() uses IcingaDB::m_Rcon which is only initialized in
IcingaDB::Start(), therefore add a nullptr check to the check command.
Additionally, as m_Rcon is potentially accessed concurrently, add a copy of the
value that is safe for concurrent use.
icingadb-web shows multiple lines from the check output collapsed into a single
line. The lines containing just minuses make this look cluttered and making
making it a heading provides little to no benefit. Even when rendering markdown
in the check output at some point, having the lists labeled using normal
paragraphs would look just fine.
- Add icinga2_ and icingadb_ prefixes to make clear which component is
responsible for the value.
- Rename heartbeat_lag to heartbeat_age, describes it better in my opinion and
sound a bit less like something that should be as close to zero as possible.
- Rename redis_dump/database_sync into full_dump/full_sync as this is how these
operations are refered to in log messages as well.
- Rename Redis backlog into Redis query backlog, makes it a bit clearer in my
opinion.
- Rename runtime_backlog into runtime_update_backlog, as the component in
Icinga DB is called that way and this naming is also exposed in log messages.
- Rename dump_config/state/history into config/state/history_dump, makes it
sound more natural.
They add no additional information compared to the *_1min values as it's always
the same value divided by 60 anyways. Adding the actual value from the last
second makes little sense for realistic values of check_interval.
Use the already existing format to pass performance data to Icinga 2 rather
than some new JSON structure. Has the additional benefit of doing more things
in Go than in C++.
IcingaDB may receive callbacks from Boost signals before being fully started.
This resulted in situations where m_EnvironmentId was used before it was
initialized properly. This is fixed by initializing it earlier (during the
config validation stage). However, at this stage, it should not yet write to
disk, therefore, persisting the environment ID to disk is delayed until later
in the startup process.
Initializing at this stage has an extra benefit: if there is an error for some
reason (possibly corrupt icingadb.env file), this now shows up as a nice error
during config validation.
Additionally, this replaces the use of std::call_once with std::mutex due to
bug in libstdc++ (see inline comment for reference).
Icinga 2 treats null (Empty) as if the corresponding attribute is not
specified. However, without this commit, it would serialize the value as "null"
(i.e. type string), so that it ends up in the database as this string instead
of NULL. This commit adds handling for ValueEmpty so that is serialized as JSON
null value and ends up in the database as NULL.
Apparently there was a reason for making the members of generated classes
atomic. However, this was only done for some types, others were still accessed
using non-atomic operations. For members of type T::Ptr (i.e. intrusive_ptr<T>),
this can result in a double free when multiple threads access the same variable
and at least one of them writes to the variable.
This commit makes use of std::atomic<T> for more T (it removes the additional
constraint sizeof(T) <= sizeof(void*)) and uses a type including a mutex for
load and store operations as a fallback.
The very same object is already serialized a few lines above, the result is
even stored in a variable, but that variable was not used before. Simply using
this variable results in a noticeable improvement of config validation times.
Checkable::FireSuppressedNotifications() compares the time of the current
checkable with the last recovery time of parents to avoid notification right
after a parent recovered and before the current checkable was checked.
This commit makes this check also include to host if the checkable is a
service. This makes the behavior consistent with the documentation that states
there is an implicit dependency on the host (which isn't realized as implicitly
generating a Dependency object unfortunately).
as they would be re-connected in Resume() (HA).
Before they were still connected during pause and connected X+1 times
after X split-brains (the same data was written X+1 times).
To prevent Icinga2 from being restarted while
one or more requests are still in progress and end up
as corrupted stages without status file and startup logs.
If comments get removed in unintended ways (i.e. not by expiring or by using
the remove-comment API action), the comment object misses information to create
a proper history event for Icinga DB. Therefor, discard these events.
On ack Icinga first adds a comment, then acks the checkable
so the ack event has the comment ID.
But due to the yet missing ack the comment was missing is_sticky.
That's corrected now.
Icinga DB doesn't expect comment history for ack comments.
Before:
1. Acked checkable recovers
2. Icinga clears ack comments w/o setting removal time
3. Icinga DB gets neither removal time, nor expire time
4. Icinga DB falls back to NULL and violates NOT NULL constraint
icinga::Array requires locking by the caller when iterating using Begin() and
End(). This is only checked in debug builds but there it makes this function
fail.