Programmatic Real-time Monitoring and Alerting with Riemann
Agile, Alerting, DevOps, and Monitoring
A streaming event-processor that empowers development teams to define their own monitoring and alerting in a manner more compatible with Agile projects than traditional approaches.
I've worked on a lot of software projects, but have yet to see one fully embrace the power of software engineering in all aspects of its delivery. I'm talking specifically about operational monitoring and alerting and, anecdotally, how they are handled in most projects.
Software monitoring as a discipline has grown dramatically in recent years, no doubt fuelled by the shift in how we operate and manage our infrastructure (among many other factors). We instrument our applications heavily and measure everything, with the knowledge that this will give us better insight into our applications and how to keep them running as efficiently and effectively as possible. We want to know when things go wrong... more importantly: we want to know before things go wrong so we can avert failure! We test our software rigorously and regularly, applying well-established programming disciplines like version control, TDD, code coverage analysis, static code analysis, and more. Yet somehow we've allowed operational alerting to escape our clutches!
Production systems tend to make use of a number of different tools for monitoring and alerting. These include the likes of Graphite, OpenTSDB, Nagios, Ganglia, InfluxDB, OpsView, Logstash, Kibana, Splunk, and many more. The measurements we emit from our applications ultimately find their way into a number of products such as these, and are then analysed both manually and automatically for patterns that indicate (potential) failure.
Unfortunately, it's been my experience that few of these tools provide a convenient programmatic API to the developers that wrote the applications being monitored. It appears that the typical path to production involves manual definition of alerting, separate from the application itself - and certainly not involving the software engineers that actually wrote it! Additionally, I've not come across any teams that define their application alerting in code, version-controlled alongside the application itself. This is a major weakness and an area that we should improve.
The result of this limitation is that the developers and testers are only involved up to the point of defining alerting requirements, after which the "Ops" team take over and implement the alerts. Seldom are these alerts tested, and they're certainly not visible to the teams that will receive the 2AM phone call. Claims of "Agile" are thrown around in many projects these days, but we still insist on throwing software over the proverbial wall. The only difference is that we've just moved the wall: instead of separating development and testing, it now separates delivery and operations!
Developers are the most knowledgeable about the applications they build and what might constitute failure, so surely they should be responsible for defining the alerts for their apps. And if that's the case, they should be responsible for testing the alerts too! What good is an alert that doesn't work? Who really wants to find out that their alert was incorrectly defined after major production system failure, just because they didn't test it beforehand?
Riemann is a Clojure-based event stream processor. It receives events (which are really just maps) from a number of different sources in real-time, performs some processing (splitting, aggregating, throttling, time-slicing, etc.) and takes some actions (transforming, emailing, graphing, etc.).
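To make the "events are just maps flowing through functions" idea concrete, here is a minimal sketch in Python. It is not Riemann's Clojure API: the helper names `where` and `with_fields`, the field names, and the 0.5 threshold are all assumptions I've made up for illustration.

```python
from typing import Callable, Dict

# An event is just a map of fields (all field names here are illustrative).
Event = Dict[str, object]
Stream = Callable[[Event], None]


def where(predicate: Callable[[Event], bool], *children: Stream) -> Stream:
    """Filter: pass an event to the child streams only if the predicate holds."""
    def stream(event: Event) -> None:
        if predicate(event):
            for child in children:
                child(event)
    return stream


def with_fields(extra: Event, *children: Stream) -> Stream:
    """Transform: forward a copy of the event with extra fields merged in."""
    def stream(event: Event) -> None:
        updated = {**event, **extra}  # copy, so the original event is untouched
        for child in children:
            child(updated)
    return stream


sent = []  # stand-in for a real action such as an email or chat notifier

# Hypothetical rule: API latency above 0.5 is marked critical and "sent".
pipeline = where(
    lambda e: e["service"] == "api latency" and e["metric"] > 0.5,
    with_fields({"state": "critical"}, sent.append),
)

events = [
    {"host": "web-1", "service": "api latency", "metric": 0.2},
    {"host": "web-2", "service": "api latency", "metric": 0.9},
    {"host": "web-2", "service": "disk", "metric": 0.95},
]
for event in events:
    pipeline(event)

print(sent)
```

Only the slow `web-2` latency event reaches the notifier, tagged with a `critical` state. The point is the shape: small functions that filter, transform, and act on maps, composed into a pipeline.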
All of this behaviour is defined in Clojure using a simple API, and it needs nothing more than a JVM to run. This also means it's easily testable; something that we don't really get with other tools. With this setup, we can define our alerts in Clojure: version-controlled, tested, and deployed alongside our application. This is a key requirement for Continuous Delivery in my books. Every run through a delivery pipeline produces a working application that has been tested and can be deployed into production - together with accurate monitoring and alerting for that version of the app!
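As a rough illustration of what "tested alerts" could look like, the sketch below treats an alert rule as a plain function and exercises it with ordinary unit tests. It is written in Python rather than Clojure and does not use Riemann's own test support. The rule itself (alert after three consecutive `error` events from a host) and every name in it are hypothetical.

```python
from typing import Dict, Iterable, List


def hosts_to_alert(events: Iterable[Dict[str, str]], threshold: int = 3) -> List[str]:
    """Return hosts in the order they reach `threshold` consecutive errors."""
    streak: Dict[str, int] = {}
    alerts: List[str] = []
    for event in events:
        host = event["host"]
        if event["state"] == "error":
            streak[host] = streak.get(host, 0) + 1
            if streak[host] == threshold:  # fire once, when the streak is reached
                alerts.append(host)
        else:
            streak[host] = 0  # any non-error state resets the streak
    return alerts


def ev(host: str, state: str) -> Dict[str, str]:
    """Build a minimal event for tests."""
    return {"host": host, "state": state}


def test_fires_after_three_consecutive_errors() -> None:
    events = [ev("db-1", "error"), ev("db-1", "error"), ev("db-1", "error")]
    assert hosts_to_alert(events) == ["db-1"]


def test_recovery_resets_the_streak() -> None:
    events = [ev("db-1", "error"), ev("db-1", "error"),
              ev("db-1", "ok"), ev("db-1", "error")]
    assert hosts_to_alert(events) == []


def test_hosts_are_tracked_independently() -> None:
    events = [ev("db-1", "error"), ev("db-2", "error"), ev("db-1", "error"),
              ev("db-2", "ok"), ev("db-1", "error")]
    assert hosts_to_alert(events) == ["db-1"]


if __name__ == "__main__":
    test_fires_after_three_consecutive_errors()
    test_recovery_resets_the_streak()
    test_hosts_are_tracked_independently()
    print("all alert tests passed")
```

Tests like these can run in the same pipeline as the application's own tests, so a change that breaks an alert fails the build instead of surfacing during an incident.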
In some respects, you can think of this as similar to how we define application logging. We want to log in a well-defined manner, but how we configure each environment will vary. Some will use console appenders, others will use rolling file appenders, others still will use syslog. The app doesn't and shouldn't care - we've decoupled the log appender from the logging API. We can configure Riemann in a similar way: alerting to different targets depending on which environment you're running in. Development and UAT environments might alert over XMPP. Production might alert to multiple targets like email, PagerDuty, HipChat, or others. What matters is that the conditions that triggered the alert are the SAME in all environments. This allows our alerting to be just as well tested leading up to production as every other aspect of our code.
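The separation between alert conditions and alert targets can be sketched as follows. Again this is Python for illustration only, not Riemann configuration. The environment names, the `latency_breached` rule, and the recording "notifiers" are all assumptions standing in for real XMPP, email, or PagerDuty integrations.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Notifier = Callable[[Event], None]


def latency_breached(event: Event) -> bool:
    """The alert condition: defined once, shared by every environment."""
    return event.get("service") == "api latency" and event.get("metric", 0) > 0.5


def make_recorder(name: str, log: List[Tuple[str, object]]) -> Notifier:
    """Stand-in notifier that records (target, host) instead of sending anything."""
    def notify(event: Event) -> None:
        log.append((name, event["host"]))
    return notify


def build_stream(notifiers: List[Notifier]) -> Callable[[Event], None]:
    """Wire the shared condition to whichever targets an environment uses."""
    def stream(event: Event) -> None:
        if latency_breached(event):
            for notify in notifiers:
                notify(event)
    return stream


log: List[Tuple[str, object]] = []

# Only the targets vary per environment (hypothetical mapping).
TARGETS: Dict[str, List[Notifier]] = {
    "uat": [make_recorder("xmpp", log)],
    "production": [make_recorder("email", log), make_recorder("pagerduty", log)],
}

slow = {"host": "web-2", "service": "api latency", "metric": 0.9}
fast = {"host": "web-1", "service": "api latency", "metric": 0.1}

for env in ("uat", "production"):
    stream = build_stream(TARGETS[env])
    stream(slow)
    stream(fast)

print(log)
```

The same slow event triggers an alert in both environments; the only thing that changes is where it is delivered. That is what lets a rule proven in UAT be trusted in production.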
Hopefully the benefits of this approach are obvious to you already, but here are a few that come to mind for those that may be less familiar with this problem:
- Alerts are written by the people that know the system best
- Alerts are version-controlled alongside the app, and evolve with it
- Alerts are tested to ensure they do what we expect of them
- Events are handled in real-time
- We have more opportunities to feed event analysis back into our apps: e.g. trigger circuit-breakers, auto-scaling / node re-provisioning, etc.
Just to be clear, I'm not suggesting there are no drawbacks. I'm also not suggesting that Riemann in particular is the best solution to this problem. What I am trying to do is highlight how this approach differs markedly from what I've come to see as the norm for software projects; how empowering development teams to become delivery teams can produce better and more resilient software.
I expect to see more tools emerging in this direction in the near future. It's a very exciting time to be involved in building and delivering software, and I hope to see more software projects breaking down the walls that separate the various disciplines involved in delivering software.
Side Note: This does not obviate the benefits of metric visualisations via tools such as OpenTSDB, Graphite, etc. It is a complementary practice for running systems in production.
If you want to learn more about Riemann, here are a few resources to get you started:
- Riemann Learnings
- Getting Started with Riemann
- Riemann usage at BlueMountain
- Riemann + InfluxDB + Grafana
- Monitoring Cassandra with Riemann