mirror of https://github.com/Icinga/icinga2.git
Icinga 2 is a network monitoring application that tries to improve upon
the success of Icinga 1.x while fixing some of its shortcomings. A few
frequently encountered issues are:

- Scalability problems in large monitoring setups

- Difficult configuration with dozens of "magic" tweaks and several
  ways of defining services

- Code quality and the resulting inability to implement changes
  without breaking add-ons

- Limited access to the runtime state of Icinga (e.g. for querying a
  service’s state or for dynamically creating new services)

Fixing these issues would involve major breaking changes to the Icinga
1.x core and configuration syntax. Icinga users would likely experience
plenty of problems with the Icinga versions introducing these changes.
Many of these changes would likely break add-ons which rely on the NEB
API and other core internals.

From a developer standpoint this may be justifiable in order to get to a
better end product. However, for (business) users, spending time on
getting familiar with these changes for each new version may become
quite frustrating and may easily cause them to lose confidence in
Icinga.

Nagios™ 4 is currently following this approach and it remains to be seen
how this fares with its users.

Instead, the Icinga project will maintain two active development
branches. There will be one branch for Icinga 1.x which focuses on
improving the existing Icinga 1.x code base, just as has been done
so far.

Independently of Icinga 1.x, development of Icinga 2 will happen in a
separate branch, and some of the long-term design goals are outlined
in this document. Status updates for Icinga 2 will be posted on the
project website (www.icinga.org) as they become available.

Code Quality
============

Icinga 2 will not be using any code from the Icinga 1.x branch due to
the rampant code quality issues with the existing code base. However, an
important property of the Icinga development process has always been to
rely on proven technologies and Icinga 2 will be no exception.

A lot of effort has gone into designing a maintainable architecture for
Icinga 2 and making sure that algorithmic choices are in alignment with
our scalability goals for Icinga 2.

There are plans to implement unit tests for most Icinga 2 features in
order to make sure that changes to the code base do not break things
that were known to work before.

Language Choice
===============

Icinga 1.x is written in C and while in general C has quite a number of
advantages (e.g. performance and relatively easy portability to other
\*NIX-based platforms), some of its disadvantages show in the context of
a project that is as large as Icinga.

With a complex software project like Icinga, an object-oriented design
helps tremendously with keeping things modular and making changes to the
existing code easier.

While it is true that you can write object-oriented software in C (the
Linux kernel is one of the best examples of how to do that), a truly
object-oriented language makes the programmer's life just a little bit
easier.

For Icinga 2 we have chosen C++ as the main language. This decision was
influenced by a number of criteria including performance, support on
different platforms and general user acceptability.

In general there is nothing wrong with other languages like Java, C\# or
Python; however, even when ignoring technical problems for just a
moment, in a community as conservative as the monitoring community
these languages seem out of place.

Knowing that users will likely want to run Icinga 2 on older systems
(which will remain fully vendor-supported for years to come), we will
make every effort to ensure that Icinga 2 can be built and run on
commonly used operating systems and refrain from using new and exotic
features like C++11.

Unlike Icinga 1.x, Icinga 2 will support Windows. Some of
the compatibility features (e.g. the command pipe) which rely on \*NIX
features may not be supported on Windows, but all new features will be
designed in such a way as to support \*NIX as well as Windows.

Configuration
=============

Icinga 1.x has a configuration format that is fully backwards-compatible
with the Nagios™ configuration format. This has the advantage of allowing
users to easily upgrade their existing Nagios™ installations as well as
to downgrade if they choose to do so (even though this is generally not
the case).

The Nagios™ configuration format has evolved organically over time and
for the most part it does what it’s supposed to do. However, this
evolutionary process has brought with it a number of problems that make
it difficult for new users to understand the full breadth of available
options and ways of setting up their monitoring environment.

Experience with other configuration formats like the one used by Puppet
has shown that it is often better to have a single "right" way of doing
things rather than having multiple ways like Nagios™ does (e.g. defining
host/service dependencies and parent/child relationships for hosts).

Icinga 2 tries to fix those issues by introducing a new object-based
configuration format that is heavily based on templates and supports
user-friendly features like freely definable macros.

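To make the idea of template-based objects with freely definable macros
more concrete, here is a minimal Python sketch. It is not the Icinga 2
configuration syntax, which this document does not specify. The
attribute names (`inherits`, `macros`, `check_command`), the `$name$`
macro notation and all object names are assumptions made purely for
illustration.

```python
import re

# Hypothetical object definitions: a template and a service that inherits from it.
objects = {
    "generic-service": {
        "check_interval": 300,
        "macros": {"timeout": "10"},
    },
    "ping": {
        "inherits": ["generic-service"],
        "check_command": "check_ping -H $address$ -t $timeout$",
        "macros": {"address": "192.0.2.10"},
    },
}


def resolve(name, objects):
    """Merge an object's attributes with those inherited from its templates."""
    obj = objects[name]
    merged = {"macros": {}}
    for parent in obj.get("inherits", []):
        inherited = resolve(parent, objects)
        # Macros are merged key by key; other attributes are simply overridden.
        merged["macros"].update(inherited.pop("macros"))
        merged.update(inherited)
    merged["macros"].update(obj.get("macros", {}))
    for key, value in obj.items():
        if key not in ("inherits", "macros"):
            merged[key] = value
    return merged


def expand(text, macros):
    """Replace $name$ placeholders; unknown macros are left untouched."""
    return re.sub(r"\$(\w+)\$", lambda m: macros.get(m.group(1), m.group(0)), text)


service = resolve("ping", objects)
command = expand(service["check_command"], service["macros"])
print(command)
```

The service only states what differs from its template, while shared
settings such as the check interval live in one place.
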
External Interfaces
===================

While Icinga 1.x has easily accessible interfaces to its internal state
(e.g. status.dat, objects.cache and the command pipe) there is no
standards-based way of getting that information.

For example, using Icinga’s status information in a custom script
generally involves writing a parser for the status.dat format and there
are literally dozens of Icinga-specific status.dat parsers out there.

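The following Python sketch shows the kind of ad-hoc parser such scripts
typically contain. It assumes the usual status.dat layout of named
blocks holding `key=value` lines; the sample data and the function name
`parse_status_dat` are invented for illustration, and a real file
contains many more block types and attributes.

```python
SAMPLE = """\
hoststatus {
	host_name=web01
	current_state=0
	}

servicestatus {
	host_name=web01
	service_description=HTTP
	current_state=2
	plugin_output=CRITICAL - Socket timeout
	}
"""


def parse_status_dat(text):
    """Return a list of {"type": ..., "attrs": {...}} dicts, one per block."""
    blocks = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if current is None:
            if line.endswith("{"):
                current = {"type": line[:-1].strip(), "attrs": {}}
        elif line == "}":
            blocks.append(current)
            current = None
        else:
            # Split on the first "=" only, since values may contain "=" too.
            key, _, value = line.partition("=")
            current["attrs"][key] = value
    return blocks


blocks = parse_status_dat(SAMPLE)
critical = [
    b["attrs"]["service_description"]
    for b in blocks
    if b["type"] == "servicestatus" and b["attrs"]["current_state"] == "2"
]
print(critical)
```

Every script that needs this information has to repeat this work, which
is the motivation for a unified interface.
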
Icinga 2 will support these legacy interfaces in order to make
migration easier and to allow users to keep using the existing CGIs and
whatever other scripts they may have. Its focus, however, will be on
providing a unified interface to Icinga’s state as well as functionality
similar to that provided by the command pipe in Icinga 1.x. The exact
details of such an interface are yet to be determined, but this will
likely be an RPC interface based on one of the commonly used web-based
remoting technologies.

Icinga 1.x exports historical data using the IDO database interface
(Icinga Data Output). Icinga 2 will support IDO in a
backwards-compatible fashion in order to support icinga-web.
Additionally, there will be a newly designed backend for historical data
which can be queried using the built-in API when available. Effort will
be put into making this new data source more efficient for use with SLA
reporting.

Icinga 2 will also feature dynamic reconfiguration using the API, which
means users can create, delete and update any configuration object (e.g.
hosts and services) on the fly. Based on the API there are plans to
implement a command-line configuration tool similar to what Pacemaker
has with "crm". Later on this API may also be used to implement
auto-discovery for new services.

The RPC interface may also be used to receive events in real time, e.g.
when service checks are being executed or when a service’s state
changes. Some possible uses of this interface would be to export
performance data for services (RRD, graphite, etc.) or general log
information (logstash, graylog2, etc.).

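As a rough illustration of the performance data use case, the Python
sketch below turns a check result event into lines for Graphite's
plaintext protocol (`<path> <value> <timestamp>`). Since the event
interface is still undecided, the event layout, the `icinga` prefix and
the function names are assumptions and not part of any existing API.

```python
import re


def sanitize(part):
    """Replace characters that would otherwise split or break a metric path."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", part)


def to_graphite_lines(event, prefix="icinga"):
    """Format one hypothetical check result event as Graphite plaintext lines."""
    lines = []
    for label, value in sorted(event["perfdata"].items()):
        path = ".".join(
            sanitize(p) for p in (prefix, event["host"], event["service"], label)
        )
        lines.append(f"{path} {value} {event['timestamp']}")
    return lines


# Hypothetical event as it might be delivered by a real-time interface.
event = {
    "host": "web01.example.org",
    "service": "HTTP",
    "timestamp": 1356998400,
    "perfdata": {"time": 0.042, "size": 5120},
}

lines = to_graphite_lines(event)
print("\n".join(lines))
```

A subscriber built along these lines would write the resulting lines to
a Graphite listener instead of printing them.
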
Checks
======

In Icinga 2 services are the only checkable objects. Hosts only have a
calculated state and no checks are ever run for them.

In order to maintain compatibility with the hundreds of existing check
plugins for Icinga 1.x there will be support for Nagios™-style checks.
The check interface, however, will be modular so that support for other
kinds of checks can be implemented later on (e.g. built-in checks for
commonly used services like PING, HTTP, etc. in order to avoid spawning
a process for each check).

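A Nagios™-style check boils down to spawning a process, reading its
first line of output and mapping the exit code to a state (0 = OK,
1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The Python sketch below
illustrates that convention. It is not Icinga 2's implementation; the
function name `run_plugin` and the stand-in plugin are made up for the
example.

```python
import subprocess
import sys

# Exit code convention used by Nagios-style plugins.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv, timeout=10):
    """Run a plugin and return its state, text output and performance data."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"state": "UNKNOWN", "output": "plugin timed out", "perfdata": ""}
    except OSError as exc:
        return {"state": "UNKNOWN", "output": str(exc), "perfdata": ""}
    first_line = proc.stdout.splitlines()[0] if proc.stdout.strip() else ""
    # Performance data, if any, follows the first "|" on the output line.
    output, _, perfdata = first_line.partition("|")
    return {
        "state": STATES.get(proc.returncode, "UNKNOWN"),
        "output": output.strip(),
        "perfdata": perfdata.strip(),
    }


# Stand-in for a real plugin so that the example runs anywhere.
fake_plugin = (
    "print('PING WARNING - rta 120ms|rta=120ms;100;500'); raise SystemExit(1)"
)
result = run_plugin([sys.executable, "-c", fake_plugin])
print(result)
```

The cost of one process per check is exactly what built-in checks would
avoid.
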
Based on the availability of remote Icinga 2 instances the core can
delegate execution of service checks to them in order to support
large-scale distributed setups with a minimal amount of maintenance.
Services can be assigned to specific check instances using configuration
settings.

Notifications
=============

Event handlers and notifications will be supported in a similar way to
Icinga 1.x. Thanks to the dynamic configuration it is possible to easily
adjust the notification settings at runtime (e.g. in order to implement
on-call rotation).

Scalability
===========

Icinga 1.x has some serious scalability issues, which explains why there
are several add-ons which try to improve the core’s check performance.
One of these add-ons is mod\_gearman which can be used to distribute
checks to multiple workers running on remote systems.

A problem that remains is the performance of the core when processing
check results. Scaling Icinga 1.x beyond 25,000 services proves to be a
challenging problem and usually involves setting up a cascade of Icinga
1.x instances and dividing the service checks between those instances.
This significantly increases the maintenance overhead when updating the
configuration for such a setup.

Icinga 2 natively supports setting up multiple Icinga 2 instances in a
cluster to distribute work between those instances. Independent tasks
(e.g. performing service checks, sending notifications, updating the
history database, etc.) are implemented as components which can be
loaded for each instance. Configuration as well as program state is
automatically replicated between instances.

In order to support using Icinga 2 in a partially trusted environment,
SSL is used for all network communication between individual instances.
Objects (like hosts and services) can be grouped into security domains
for which permissions can be specified on a per-instance basis (so e.g.
you can have a separate API or checker instance for a specific domain).

Agent-based Checks
==================

Traditionally most service checks have been performed actively, meaning
that check plugins are executed on the same server that is also running
Icinga. This works great for checking most network-based services, e.g.
PING and HTTP. However, there are a number of services which cannot be
checked remotely, either because they are not network-based or because
firewall settings or network policies ("no unencrypted traffic")
disallow accessing these services from the network where Icinga is
running.

To solve this problem two add-ons have emerged, namely NRPE and NSCA.
NRPE can be thought of as a light-weight remote shell which allows the
execution of a restricted set of commands while supporting some
Nagios™-specific concepts like command timeouts. However, unlike in the
design of commonly used protocols like SSH, security in NRPE is merely an
afterthought.

In most monitoring setups all NRPE agents share the same secret key
which is embedded into the NRPE binary at compile time. This means that
users can extract this secret key from their NRPE agent binary and use
it to query sensitive monitoring information from other systems running
the same NRPE binary. NSCA has similar problems.

Based on Icinga 2’s code for check execution there will be an agent
which can be used on \*NIX as well as on Windows platforms. The agent
will use the same configuration format as Icinga 2 itself and
will support SSL and IPv4/IPv6 to communicate with Icinga 2.

Business Processes
==================

In most cases users don’t care about the availability of individual
services but rather the aggregated state of multiple related services.
For example, one might have a database cluster that is used for a web
shop. For an end user the shop is available as long as at least one of
the database servers is working.

Icinga 1.x does not have any support for business processes out of the
box. There are several add-ons which implement business process support
for Icinga; however, none of them is well integrated into Icinga.

Icinga 2 will have native support for business processes which are built
right into the core and can be configured in a similar manner to
Nagios™-style checks. Users can define their own services based on
business rules which can be used as dependencies for other hosts or
services.

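The web shop example can be expressed as a small set of aggregation
rules. The Python sketch below is only an illustration of the idea: the
rule names `at_least` and `all_of`, the reduction to two states and the
service names are assumptions, and this document does not define how
business rules will actually be written in Icinga 2.

```python
OK, CRITICAL = 0, 2


def at_least(minimum, states):
    """OK if at least `minimum` of the given states are OK, else CRITICAL."""
    working = sum(1 for state in states if state == OK)
    return OK if working >= minimum else CRITICAL


def all_of(states):
    """OK only if every given state is OK, else CRITICAL."""
    return OK if all(state == OK for state in states) else CRITICAL


def shop_state(db_states, web_state):
    """The shop needs the web server and at least one database server."""
    db_cluster = at_least(1, db_states)
    return all_of([db_cluster, web_state])


# One database server is down, but the shop is still available.
degraded = shop_state(db_states=[CRITICAL, OK], web_state=OK)
# Both database servers are down, so the shop is down as well.
outage = shop_state(db_states=[CRITICAL, CRITICAL], web_state=OK)
print(degraded, outage)
```

Because the result of a rule is itself a state, such a rule can be
treated like any other service and used as a dependency.
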
Logging
=======

Icinga 2 supports file-based logging as well as syslog (on \*NIX) and
event log (on Windows). Additionally, Icinga 2 supports remote logging to
a central Icinga 2 instance.