Icinga 2
========

Icinga 2 is a network monitoring application that tries to improve upon the
success of Icinga 1.x while fixing some of its shortcomings. A few frequently
encountered issues are:

- Scalability problems in large monitoring setups
- Difficult configuration with dozens of "magic" tweaks and several ways of
  defining services
- Code quality and the resulting inability to implement changes without
  breaking add-ons
- Limited access to the runtime state of Icinga (e.g. for querying a service's
  state or for dynamically creating new services)

Fixing these issues would involve major breaking changes to the Icinga 1.x core
and configuration syntax. Icinga users would likely experience plenty of
problems with the Icinga versions introducing these changes. Many of these
changes would likely break add-ons which rely on the NEB API and other core
internals.

From a developer standpoint this may be justifiable in order to get to a better
end-product. However, for (business) users, spending time on getting familiar
with these changes for each new version may become quite frustrating and may
easily cause users to lose their confidence in Icinga.

Nagios(TM) 4 is currently following this approach and it remains to be seen how
this fares with its users. Instead, the Icinga project will maintain two active
development branches. There will be one branch for Icinga 1.x which focuses on
improving the existing Icinga 1.x code base - just like it has been done so far.

Independently from Icinga 1.x, development on Icinga 2 will happen in a separate
branch, and some of the long-term design goals will be outlined in this
document. Status updates for Icinga 2 will be posted on the project website
(www.icinga.org) as they become available.

Code Quality
------------

Icinga 2 will not be using any code from the Icinga 1.x branch due to the
rampant code quality issues with the existing code base.
However, an important property of the Icinga development process has always been
to rely on proven technologies and Icinga 2 will be no exception.

A lot of effort has gone into designing a maintainable architecture for Icinga 2
and making sure that algorithmic choices are in alignment with our scalability
goals for Icinga 2.

There are plans to implement unit tests for most Icinga 2 features in order to
make sure that changes to the code base do not break things that were known to
work before.

Language Choice
---------------

Icinga 1.x is written in C and while in general C has quite a number of
advantages (e.g. performance and relatively easy portability to other *NIX-based
platforms) some of its disadvantages show in the context of a project that is as
large as Icinga.

With a complex software project like Icinga an object-oriented design helps
tremendously with keeping things modular and making changes to the existing code
easier.

While it is true that you can write object-oriented software in C (the Linux
kernel is one of the best examples of how to do that) a truly object-oriented
language makes the programmers' life just a little bit easier.

For Icinga 2 we have chosen C++ as the main language. This decision was
influenced by a number of criteria including performance, support on different
platforms and general user acceptability.

In general there is nothing wrong with other languages like Java, C# or Python;
however - even when ignoring technical problems for just a moment - in a
community as conservative as the monitoring community these languages seem out
of place.

Knowing that users will likely want to run Icinga 2 on older systems (which are
still fully vendor-supported even for years to come) we will make every effort
to ensure that Icinga 2 can be built and run on commonly used operating systems
and refrain from using new and exotic features like C++11.

Unlike Icinga 1.x there will be Windows support for Icinga 2.
Some of the compatibility features (e.g. the command pipe) which rely on *NIX
features may not be supported on Windows but all new features will be designed
in such a way as to support *NIX as well as Windows.

Configuration
-------------

Icinga 1.x has a configuration format that is fully backwards-compatible with
the Nagios(TM) configuration format. This has the advantage of allowing users to
easily upgrade their existing Nagios(TM) installations as well as to downgrade
if they choose to do so (even though this is generally not the case).

The Nagios(TM) configuration format has evolved organically over time and for
the most part it does what it's supposed to do. However, this evolutionary
process has brought with it a number of problems that make it difficult for new
users to understand the full breadth of available options and ways of setting up
their monitoring environment.

Experience with other configuration formats like the one used by Puppet has
shown that it is often better to have a single "right" way of doing things
rather than having multiple ways like Nagios(TM) does (e.g. defining
host/service dependencies and parent/child relationships for hosts).

Icinga 2 tries to fix those issues by introducing a new object-based
configuration format that is heavily based on templates and supports
user-friendly features like freely definable macros.

External Interfaces
-------------------

While Icinga 1.x has easily accessible interfaces to its internal state (e.g.
status.dat, objects.cache and the command pipe) there is no standards-based way
of getting that information.

For example, using Icinga's status information in a custom script generally
involves writing a parser for the status.dat format and there are literally
dozens of Icinga-specific status.dat parsers out there.
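
To give an idea of what such a custom parser tends to look like, here is a
minimal sketch in Python. It assumes the usual status.dat layout of
"type {" blocks containing tab-indented key=value lines; the function name
parse_status_dat and the sample data are made up for illustration and are not
part of Icinga.

```python
def parse_status_dat(text):
    """Parse status.dat-style text into a list of (block_type, attributes)."""
    blocks = []
    current_type = None
    current_attrs = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if current_attrs is None:
            # Assumed layout: a block opens with "<type> {".
            if line.endswith("{"):
                current_type = line[:-1].strip()
                current_attrs = {}
        elif line == "}":
            blocks.append((current_type, current_attrs))
            current_type, current_attrs = None, None
        else:
            # Split on the first "=" only; values may contain "=" themselves.
            key, sep, value = line.partition("=")
            if sep:
                current_attrs[key] = value
    return blocks


# Hypothetical sample data, not taken from a real installation.
SAMPLE = """\
hoststatus {
\thost_name=web1
\tcurrent_state=0
\t}

servicestatus {
\thost_name=web1
\tservice_description=HTTP
\tcurrent_state=2
\tplugin_output=HTTP CRITICAL - status=500
\t}
"""

# List all services that are currently in the CRITICAL state (2).
critical = [
    (attrs["host_name"], attrs["service_description"])
    for block_type, attrs in parse_status_dat(SAMPLE)
    if block_type == "servicestatus" and attrs.get("current_state") == "2"
]
print(critical)  # prints [('web1', 'HTTP')]
```

Every script that consumes status.dat ends up carrying some variation of this
code, which is the duplication a unified interface is meant to avoid.
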
While Icinga 2 will support these legacy interfaces in order to make migration
easier and to allow users to keep using the existing CGIs and whatever other
scripts they may have, Icinga 2 will focus on providing a unified interface to
Icinga's state and on providing similar functionality to that provided by the
command pipe in Icinga 1.x. The exact details for such an interface are yet to
be determined but this will likely be an RPC interface based on one of the
commonly used web-based remoting technologies.

Icinga 1.x exports historical data using the IDO database interface (Icinga Data
Output). Icinga 2 will support IDO in a backwards-compatible fashion in order to
support icinga-web. Additionally there will be a newly-designed backend for
historical data which can be queried using the built-in API when available.
Effort will be put into making this new data source more efficient for use with
SLA reporting.

Icinga 2 will also feature dynamic reconfiguration using the API which means
users can create, delete and update any configuration object (e.g. hosts and
services) on-the-fly. Based on the API there are plans to implement a
command-line configuration tool similar to what Pacemaker has with "crm". Later
on this API may also be used to implement auto-discovery for new services.

The RPC interface may also be used to receive events in real-time, e.g. when
service checks are being executed or when a service's state changes. Some
possible uses of this interface would be to export performance data for services
(RRD, graphite, etc.) or general log information (logstash, graylog2, etc.).

Checks
------

In Icinga 2 services are the only checkable objects. Hosts only have a
calculated state and no checks are ever run for them.

In order to maintain compatibility with the hundreds of existing check plugins
for Icinga 1.x there will be support for Nagios(TM)-style checks.
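
For readers unfamiliar with the plugin convention behind Nagios(TM)-style
checks: a plugin is an ordinary executable whose exit code encodes the state
(0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and whose first output line
carries the status text, optionally followed by "|" and performance data. The
Python sketch below illustrates that convention only; it is not how Icinga 2
executes checks, and the run_check helper and the fake plugin command are
hypothetical.

```python
import subprocess
import sys

# Exit-code convention used by Nagios(TM)-style plugins.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv, timeout=10):
    """Run a plugin command and return (state, output, perfdata)."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A plugin that cannot be started or runs too long is reported
        # as UNKNOWN in this sketch.
        return "UNKNOWN", str(exc), ""
    # Exit codes outside 0..3 are treated as UNKNOWN.
    state = STATES.get(proc.returncode, "UNKNOWN")
    lines = proc.stdout.splitlines()
    first_line = lines[0] if lines else ""
    # Text before "|" is the status message, text after it is perfdata.
    output, _, perfdata = first_line.partition("|")
    return state, output.strip(), perfdata.strip()


# Hypothetical stand-in for a real plugin: prints one line and exits with 1.
fake_plugin = [
    sys.executable,
    "-c",
    "import sys; print('DISK WARNING - 12% free | free=12%;20;10'); sys.exit(1)",
]
result = run_check(fake_plugin)
print(result)  # prints ('WARNING', 'DISK WARNING - 12% free', 'free=12%;20;10')
```

Note that every such check costs one process spawn, which matters once the
number of services grows.
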
The check interface, however, will be modular so that support for other kinds of
checks can be implemented later on (e.g. built-in checks for commonly used
services like PING, HTTP, etc. in order to avoid spawning a process for each
check).

Based on the availability of remote Icinga 2 instances the core can delegate
execution of service checks to them in order to support large-scale distributed
setups with a minimal amount of maintenance. Services can be assigned to
specific check instances using configuration settings.

Notifications
-------------

Event handlers and notifications will be supported in a similar way to Icinga
1.x. Thanks to the dynamic configuration it is possible to easily adjust the
notification settings at runtime (e.g. in order to implement on-call rotation).

Scalability
-----------

Icinga 1.x has some serious scalability issues, which explains why there are
several add-ons which try to improve the core's check performance. One of these
add-ons is mod_gearman which can be used to distribute checks to multiple
workers running on remote systems.

A problem that remains is the performance of the core when processing check
results. Scaling Icinga 1.x beyond 25,000 services proves to be a challenging
problem and usually involves setting up a cascade of Icinga 1.x instances and
dividing the service checks between those instances. This significantly
increases the maintenance overhead when updating the configuration for such a
setup.

Icinga 2 natively supports setting up multiple Icinga 2 instances in a cluster
to distribute work between those instances. Independent tasks (e.g. performing
service checks, sending notifications, updating the history database, etc.) are
implemented as components which can be loaded for each instance. Configuration
as well as program state is automatically replicated between instances.

In order to support using Icinga 2 in a partially trusted environment SSL is
used for all network communication between individual instances.
Objects (like hosts and services) can be grouped into security domains for which
permissions can be specified on a per-instance basis (so e.g. you can have a
separate API or checker instance for a specific domain).

Agent-based Checks
------------------

Traditionally most service checks have been performed actively, meaning that
check plugins are executed on the same server that is also running Icinga. This
works great for checking most network-based services, e.g. PING and HTTP.
However, there are a number of services which cannot be checked remotely, either
because they are not network-based or because firewall settings or network
policies ("no unencrypted traffic") disallow accessing these services from the
network where Icinga is running.

To solve this problem two add-ons have emerged, namely NRPE and NSCA. NRPE can
be thought of as a light-weight remote shell which allows the execution of a
restricted set of commands while supporting some Nagios(TM)-specific concepts
like command timeouts. However, unlike in commonly used protocols like SSH,
security in NRPE is merely an afterthought.

In most monitoring setups all NRPE agents share the same secret key which is
embedded into the NRPE binary at compile time. This means that users can extract
this secret key from their NRPE agent binary and use it to query sensitive
monitoring information from other systems running the same NRPE binary. NSCA has
similar problems.

Based on Icinga 2's code for check execution there will be an agent which can be
used on *NIX as well as on Windows platforms. The agent will be using the same
configuration format as Icinga 2 itself and will support SSL and IPv4/IPv6 to
communicate with Icinga 2.

Business Processes
------------------

In most cases users don't care about the availability of individual services but
rather the aggregated state of multiple related services. For example, one might
have a database cluster that is used for a web shop.
For an end-user the shop is available as long as at least one of the database
servers is working.

Icinga 1.x does not have any support for business processes out of the box.
There are several add-ons which implement business process support for Icinga;
however, none of those are well-integrated into Icinga.

Icinga 2 will have native support for business processes which is built right
into the core and can be configured in a similar manner to Nagios(TM)-style
checks. Users can define their own services based on business rules which can be
used as dependencies for other hosts or services.

Logging
-------

Icinga 2 supports file-based logging as well as syslog (on *NIX) and event log
(on Windows). Additionally Icinga 2 supports remote logging to a central Icinga
2 instance.