Preparing for Production Operations
Agile software development provides many opportunities to get ready for production. This article discusses some key issues that are often neglected and aims to bring them to the foreground where they belong.
Procedural memory is memory for the performance of particular types of action. Procedural memory guides the processes we perform and most frequently resides below the level of conscious awareness. [...] Procedural memory is created through "procedural learning" or, repeating a complex activity over and over again until all of the relevant neural systems work together to automatically produce the activity. — Wikipedia
If there's one thing that Continuous Integration did for software engineering, it's that it brought repetition front and centre. Not automation. Repetition. Yes, of course automation is a good thing, but it arises out of the need to repeat the same tasks ad infinitum. There's seldom any real business value in automating something unless you intend on repeating that task (or perhaps if it is sufficiently complex that automation would help).
On any given day, we go through the same cycle tens, sometimes hundreds or even thousands of times. Commit, build, test, deploy, test, deploy further down the pipeline, test more, analyse failures, fix, commit... and so the cycle goes. The key areas I want to focus on in this cycle are deployment and failure analysis [or root cause analysis (RCA)].
Building software takes time. In that time we have ample opportunity to rehearse for the real deal when everything is live — assuming, of course, we follow an iterative process that facilitates it. By the time the software is in production, everyone involved in the operation of the software should be so well versed in how it operates that everything is second nature. Imagine an F1 team making a mid-race tyre change. That's something the team would've rehearsed repeatedly, so everyone knows where they belong, how to do their job, where all the equipment is, and what's expected of them. This is the kind of environment we all dream of in production... when things go wrong (and they will), everyone is ready for action.
Let's make this practical and refer to some real projects I've been involved in. I've worked on teams where production operation was a well-oiled machine; and then I've worked on teams where production was an afterthought and effectively thrown over the proverbial wall to the Operations Team. The ones that worked well included the ops guys early on (embedding them within the team)... long before production was in sight. Here are the points that stood out to me from the more successful cases:
Deploy to Production Early
Assuming that a production deployment will go smoothly when you've not practised it many times in advance is foolish. Deploying to other environments — no matter how technically similar — is never the same. It's something you want to rehearse until it's second nature.
Often driven by a fear of putting 'incomplete' software into production, many teams wait until everything is 'perfect'. First of all, it'll never be perfect. And more importantly, deploying to production doesn't mean exposing the system to the world. The incomplete system, warts and all, can be deployed into the real production environment but locked down to be accessible only by those you want to see it. A common pattern I've seen is for these 'dark prod' environments to be restricted to internal IP addresses only — so you effectively gain your internal organisation as beta testers. This example is driven from consumer-facing web applications, but similar examples can be conjured up for other scenarios.
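For a consumer-facing web application, the IP restriction described above can often be expressed in a few lines of reverse-proxy configuration. The sketch below uses nginx purely as an illustration; the server name, upstream name, and CIDR ranges are placeholders, not values from any real system:

```nginx
# Hypothetical 'dark prod' lockdown: the deployment is live in the real
# production environment, but only internal addresses can reach it.
server {
    listen 80;
    server_name app.example.com;   # placeholder hostname

    allow 10.0.0.0/8;      # assumed internal corporate network
    allow 192.168.0.0/16;  # assumed office VPN range
    deny  all;             # everyone else gets 403

    location / {
        proxy_pass http://app_backend;   # placeholder upstream
    }
}
```

When the system is ready for a wider audience, removing the `allow`/`deny` block is all it takes to open it up — the deployment itself doesn't change.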
Going all the way to production as early as possible helps to build a deeper understanding of the system architecture within the development and operations teams. Ultimately, this understanding will be used to perform RCA, fix issues, and ideally preempt and mitigate against failure.
Deployment should be something that happens without any drama. If you're not practising deployment regularly in the course of development, you're likely to have problems when you deploy into production. This goes beyond the argument for automation, as even a well-rehearsed manual process is better than an unrehearsed one.
RCA is a Practiced Skill
It often surprises me to see how few developers have the skills and disciplined approach that debugging and root cause analysis require. I've worked with some outstanding people in many companies, but have unfortunately found that this crucial skill is deficient in many teams. We cannot rely on a small number of team members to do all the failure or defect diagnosis. Aside from the fact that it is a joint responsibility that the team owns, it's also a major risk that must be mitigated.
We should be able to call on any member of the team to do this, and be confident that they are up to the job. To do that, we must provide the opportunity to learn and practise the skill. Some people are naturally good at it, but others need to be guided. There are many ways this can be done, and a number of decent books to get started (e.g. Debug It!). Make sure these resources are available to the team if required.
This kind of work requires a logical mindset and a methodical approach to diagnosis, coupled with a strong understanding of the system at all levels; so it should be an achievable skill for any member of the development team. If not, you may find that there are underlying problems that need to be solved.
A 'Production-Like Environment' Includes Tooling
One significant issue for many software development teams is the provision of a production-like environment. There are many well-established arguments justifying this requirement, but they tend to focus on the hardware (whether physical or virtual) rather than the entire production environment. Without a doubt, the hardware should be as similar as is feasible, but there are many supporting tools that should be in place too.
Some of the major supporting tools are around logging, monitoring and alerting. Tools like Splunk and Ganglia are invaluable in production, but you need to know how to use them effectively. Unless they're also in use in development and test environments, there's a good chance many team members won't know how to use them in production. They will not have practised the skill until it's second nature, which means they won't use these tools effectively when it really matters and time is of the essence.
The same goes for any automated deployment tools like Chef and Pallet. They are far more valuable when any member of the team can operate them. Understanding how your software is deployed is crucial to providing a smooth delivery process all the way through to production.
Finally, to be able to prevent failures in production, you need to recognise the behavioural patterns your application exhibits prior to these failures. You need some reliable means of predicting how, why and when your application will fail. Once you have that, you need a means of avoiding that failure scenario. John Allspaw talks about this in The Art of Capacity Planning where he gives examples of predicting capacity requirements and application failure at Flickr.
Far too many applications rely on little more than the built-in JVM stats for memory usage, garbage collection, threadpools, etc. These are extremely valuable metrics, but their measurements are often too fine-grained. Coarser-grained metrics can be highly advantageous when predicting failure. For example, if you know that your application tends to fail when some external service takes more than 2 seconds to respond, then that's a good indicator to monitor. The interrelationships between the various aspects of a system can be quite complex, and there's seldom a single metric that can be examined in isolation.
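The 2-second example above can be turned into a coarse-grained metric with very little code. The following is a minimal sketch, not production code: it tracks recent response times of an external service in a rolling window and raises a flag when the average crosses a threshold known to precede failure. The threshold and window size are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of a coarse-grained failure predictor: track recent response
 * times of an external service and flag when the rolling average exceeds
 * a threshold observed to precede application failure.
 */
public class LatencyMonitor {
    private final Deque<Long> window = new ArrayDeque<>();
    private final int windowSize;
    private final long thresholdMillis;

    public LatencyMonitor(int windowSize, long thresholdMillis) {
        this.windowSize = windowSize;
        this.thresholdMillis = thresholdMillis;
    }

    /** Record one observed response time, in milliseconds. */
    public void record(long responseMillis) {
        window.addLast(responseMillis);
        if (window.size() > windowSize) {
            window.removeFirst();   // keep only the most recent samples
        }
    }

    /** True when the rolling average suggests impending trouble. */
    public boolean degraded() {
        if (window.isEmpty()) return false;
        long sum = 0;
        for (long t : window) sum += t;
        return (sum / window.size()) > thresholdMillis;
    }

    public static void main(String[] args) {
        // Illustrative values: 5-sample window, 2-second threshold.
        LatencyMonitor monitor = new LatencyMonitor(5, 2000);
        monitor.record(300);
        monitor.record(400);
        System.out.println(monitor.degraded()); // prints "false"

        for (int i = 0; i < 5; i++) monitor.record(3500);
        System.out.println(monitor.degraded()); // prints "true"
    }
}
```

In practice you'd feed this kind of signal into whatever alerting system you already run, rather than polling it by hand; the point is that the metric is about the relationship the system has with its dependencies, not raw JVM internals.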
With sufficient monitoring in place and a profile to predict failure, our next step is to have a failure avoidance mechanism. This could involve switching off parts of the system that are considered non-essential, or perhaps having a back-off strategy that lets the system self-heal by easing off the pressure until the component under load has recovered (this is particularly effective when dealing with external services). Clearly there needs to be a good understanding of the business environment to make these judgements, as these strategies won't suit every system — although most systems can facilitate them (or alternatives) in some way or another. It's rare that an entire system can either be operational or non-operational with no middle ground for turning off non-critical functionality.