Icinga 2
========

Icinga 2 is a network monitoring application that tries to improve upon the
success of Icinga 1.x while fixing some of its shortcomings. A few frequently
encountered issues are:

- Scalability problems in large monitoring setups
- Difficult configuration with dozens of "magic" tweaks and several ways of
  defining services
- Code quality and the resulting inability to implement changes without
  breaking add-ons
- Limited access to the runtime state of Icinga (e.g. for querying a service's
  state or for dynamically creating new services)

Fixing these issues would involve major breaking changes to the Icinga 1.x core
and configuration syntax. Icinga users would likely experience plenty of
problems with the Icinga versions introducing these changes. Many of these
changes would likely break add-ons which rely on the NEB API and other core
internals.

From a developer standpoint this may be justifiable in order to get to a better
end-product. However, for (business) users spending time on getting familiar
with these changes for each new version may become quite frustrating and may
easily cause users to lose their confidence in Icinga.

Nagios(TM) 4 is currently following this approach and it remains to be seen how
this fares with its users.

Instead the Icinga project will maintain two active development branches. There
will be one branch for Icinga 1.x which focuses on improving the existing
Icinga 1.x code base - just like it has been done so far.

Independently from Icinga 1.x development on Icinga 2 will happen in a separate
branch and some of the long-term design goals will be outlined in this
document. Status updates for Icinga 2 will be posted on the project website
(www.icinga.org) as they become available.

Code Quality
------------

Icinga 2 will not be using any code from the Icinga 1.x branch due to the
rampant code quality issues with the existing code base. However, an important
property of the Icinga development process has always been to rely on proven
technologies and Icinga 2 will be no exception.

A lot of effort has gone into designing a maintainable architecture for Icinga
2 and making sure that algorithmic choices are in alignment with our
scalability goals for Icinga 2.

There are plans to implement unit tests for most Icinga 2 features in order to
make sure that changes to the code base do not break things that were known
to work before.

Language Choice
---------------

Icinga 1.x is written in C and while in general C has quite a number of
advantages (e.g. performance and relatively easy portability to other *NIX-
based platforms) some of its disadvantages show in the context of a project
that is as large as Icinga.

With a complex software project like Icinga an object-oriented design helps
tremendously with keeping things modular and making changes to the existing
code easier.

While it is true that you can write object-oriented software in C (the Linux
kernel is one of the best examples of how to do that) a truly object-oriented
language makes the programmers' life just a little bit easier.

For Icinga 2 we have chosen C++ as the main language. This decision was
influenced by a number of criteria including performance, support on different
platforms and general user acceptability.

In general there is nothing wrong with other languages like Java, C# or Python;
however - even when ignoring technical problems for just a moment - in a
community as conservative as the monitoring community these languages seem out
of place.

Knowing that users will likely want to run Icinga 2 on older systems (which
are still fully vendor-supported even for years to come) we will make every
effort to ensure that Icinga 2 can be built and run on commonly used operating
systems and refrain from using new and exotic features like C++11.

Unlike Icinga 1.x there will be Windows support for Icinga 2. Some of the
compatibility features (e.g. the command pipe) which rely on *NIX features
may not be supported on Windows but all new features will be designed in such
a way as to support *NIX as well as Windows.

Configuration
-------------

Icinga 1.x has a configuration format that is fully backwards-compatible to the
Nagios(TM) configuration format. This has the advantage of allowing users to
easily upgrade their existing Nagios(TM) installations as well as downgrading
if they choose to do so (even though this is generally not the case).

The Nagios(TM) configuration format has evolved organically over time and
for the most part it does what it's supposed to do. However this evolutionary
process has brought with it a number of problems that make it difficult for
new users to understand the full breadth of available options and ways of
setting up their monitoring environment.

Experience with other configuration formats like the one used by Puppet has
shown that it is often better to have a single "right" way of doing things
rather than having multiple ways like Nagios(TM) does (e.g. defining
host/service dependencies and parent/child relationships for hosts).

Icinga 2 tries to fix those issues by introducing a new object-based
configuration format that is heavily based on templates and supports
user-friendly features like freely definable macros.

External Interfaces
-------------------

While Icinga 1.x has easily accessible interfaces to its internal state (e.g.
status.dat, objects.cache and the command pipe) there is no standards-based
way of getting that information.

For example, using Icinga's status information in a custom script generally
involves writing a parser for the status.dat format and there are literally
dozens of Icinga-specific status.dat parsers out there.

While Icinga 2 will support these legacy interfaces in order to make migration
easier and allowing users to use the existing CGIs and whatever other scripts
they may have Icinga 2 will focus on providing a unified interface to Icinga's
state and providing similar functionality to that provided by the command pipe
in Icinga 1.x. The exact details for such an interface are yet to be determined
but this will likely be an RPC interface based on one of the commonly used
web-based remoting technologies.

Icinga 1.x exports historical data using the IDO database interface (Icinga
Data Output). Icinga 2 will support IDO in a backwards-compatible fashion in
order to support icinga-web. Additionally there will be a newly-designed
backend for historical data which can be queried using the built-in API when
available. Effort will be put into making this new data source more efficient
for use with SLA reporting.

Icinga 2 will also feature dynamic reconfiguration using the API which means
users can create, delete and update any configuration object (e.g. hosts and
services) on-the-fly. Based on the API there are plans to implement a
command-line configuration tool similar to what Pacemaker has with "crm". Later
on this API may also be used to implement auto-discovery for new services.

The RPC interface may also be used to receive events in real-time, e.g. when
service checks are being executed or when a service's state changes. Some
possible uses of this interface would be to export performance data for
services (RRD, graphite, etc.) or general log information (logstash, graylog2,
etc.).

Checks
------

In Icinga 2 services are the only checkable objects. Hosts only have a
calculated state and no check are ever run for them.

In order to maintain compatibility with the hundreds of existing check plugins
for Icinga 1.x there will be support for Nagios(TM)-style checks. The check
interface however will be modular so that support for other kinds of checks
can be implemented later on (e.g. built-in checks for commonly used services
like PING, HTTP, etc. in order to avoid spawning a process for each check).

Based on the availability of remote Icinga 2 instances the core can delegate
execution of service checks to them in order to support large-scale distributed
setups with a minimal amount of maintenance. Services can be assigned to
specific check instances using configuration settings.

Notifications
-------------

Event handlers and notifications will be supported similar to Icinga 1.x.
Thanks to the dynamic configuration it is possible to easily adjust the
notification settings at runtime (e.g. in order to implement on-call rotation).

Scalability
-----------

Icinga 1.x has some serious scalability issues which explains why there are
several add-ons which try to improve the core's check performance. One of
these add-ons is mod_gearman which can be used to distribute checks to
multiple workers running on remote systems.

A problem that remains is the performance of the core when processing check
results. Scaling Icinga 1.x beyond 25.000 services proves to be a challenging
problem and usually involves setting up a cascade of Icinga 1.x instances and
dividing the service checks between those instances. This significantly
increases the maintenance overhead when updating the configuration for such a
setup.

Icinga 2 natively supports setting up multiple Icinga 2 instances in a cluster
to distribute work between those instances. Independent tasks (e.g. performing
service checks, sending notifications, updating the history database, etc.) are
implemented as components which can be loaded for each instance. Configuration
as well as program state is automatically replicated between instances.

In order to support using Icinga 2 in a partially trusted environment SSL is
used for all network communication between individual instances. Objects (like
hosts and services) can be grouped into security domains for which permissions
can be specified on a per-instance basis (so e.g. you can have a separate API
or checker instance for a specific domain).

Agent-based Checks
------------------

Traditionally most service checks have been performed actively, meaning that
check plugins are executed on the same server that is also running Icinga.
This works great for checking most network-based services, e.g. PING and HTTP.
However, there are a number of services which cannot be checked remotely either
because they are not network-based or because firewall settings or network
policies ("no unencrypted traffic") disallow accessing these services from the
network where Icinga is running.

To solve this problem two add-ons have emerged, namely NRPE and NSCA. NRPE
can be thought of as a light-weight remote shell which allows the execution
of a restricted set of commands while supporting some Nagios(TM)-specific
concepts like command timeouts. However unlike with the design of commonly used
protocols like SSH security in NRPE is merely an afterthought.

In most monitoring setups all NRPE agents share the same secret key which is
embedded into the NRPE binary at compile time. This means that users can
extract this secret key from their NRPE agent binary and use it to query
sensitive monitoring information from other systems running the same NRPE
binary. NSCA has similar problems.

Based on Icinga 2's code for check execution there will be an agent which can
be used on *NIX as well as on Windows platforms. The agent will be using the
same configuration format like Icinga 2 itself and will support SSL and
IPv4/IPv6 to communicate with Icinga 2.

Business Processes
------------------

In most cases users don't care about the availability of individual services
but rather the aggregated state of multiple related services. For example one
might have a database cluster that is used for a web shop. For an end-user the
shop is available as long as at least one of the database servers is working.

Icinga 1.x does not have any support for business processes out of the box.
There are several add-ons which implement business process support for Icinga,
however none of those are well-integrated into Icinga.

Icinga 2 will have native support for business processes which are built right
into the core and can be configured in a similar manner to Nagios(TM)-style
checks. Users can define their own services based on business rules which can
be used as dependencies for other hosts or services.

Logging
-------

Icinga 2 supports file-based logged as well as syslog (on *NIX) and event log
(on Windows). Additionally Icinga 2 supports remote logging to a central Icinga
2 instance.