Keep It Running or Fix It Quick?
I was recently involved in a discussion about IT services and how to deliver acceptable levels of availability. The discussion was triggered by a failure of the London air traffic control (ATC) system on 12 December 2014, but the ideas apply to any system, not just safety-critical services like air traffic control.
Although the ATC failure did not last long, the impact was enormous, as many flights were diverted, resulting in lots of aircraft being in the wrong place. Airline schedules took a full day to get back to normal, many passengers were stranded, and there was a lot of disruption to travel plans.
There are two ways to improve the availability of an IT service. One is to reduce the frequency of failure. The other is to reduce the time needed to recover from it.
The ATC system is a safety-critical service. Failure is unacceptable, since it could result in deaths and injuries, and this is why planes had to be grounded. Some of my colleagues argued that since failure of the ATC system is unacceptable, it should have been designed to prevent any possible failure; fast recovery would not have helped, as planes would still have been grounded. I, however, argued that in the real world we can never prevent every possible failure, so reduced recovery time will always be essential.
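The two levers can be made concrete with the standard steady-state approximation, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to recover. The sketch below is only an illustration: the formula is a textbook simplification rather than anything taken from the ATC incident, and the figures (1,000 hours between failures, 4 hours to recover) are made up.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 1,000 hours, 4 hours to recover.
baseline = availability(mtbf_hours=1000, mttr_hours=4)

# Lever 1: halve the failure frequency (double the time between failures).
fewer_failures = availability(mtbf_hours=2000, mttr_hours=4)

# Lever 2: halve the recovery time instead.
faster_recovery = availability(mtbf_hours=1000, mttr_hours=2)

print(f"baseline:        {baseline:.4%}")
print(f"fewer failures:  {fewer_failures:.4%}")
print(f"faster recovery: {faster_recovery:.4%}")
```

Under this simple model, halving the recovery time improves availability by exactly as much as halving the failure rate, so the choice between the two comes down to which is cheaper and more achievable for the service in question.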
I found support for my view in an article published by The Register, which said that ATC can continue to operate for up to 8 minutes when they lose access to flight plans (which is what happened on 12 December), but that after 8 minutes they must start to divert planes. So a failure that recovers within 8 minutes has negligible impact, and one that lasts even a few minutes longer has a major impact.
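One consequence of a threshold like this is that total downtime alone does not describe impact. The sketch below uses the 8-minute figure reported in the article; the function name and the two outage patterns are hypothetical and chosen only to illustrate the point.

```python
TOLERANCE_MINUTES = 8  # from the article: how long ATC can operate without flight plans


def disruptive_outages(outage_minutes: list[float],
                       tolerance: float = TOLERANCE_MINUTES) -> list[float]:
    """Return only the outages that lasted longer than the service can tolerate."""
    return [minutes for minutes in outage_minutes if minutes > tolerance]


# Two hypothetical years with the same total downtime (50 minutes each).
many_short = [5] * 10  # ten outages of 5 minutes
one_long = [50]        # a single outage of 50 minutes

print(sum(many_short), len(disruptive_outages(many_short)))
print(sum(one_long), len(disruptive_outages(one_long)))
```

Both patterns give the same percentage uptime, but only the single long outage crosses the threshold at which planes would have to be diverted.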
It’s Not Just About Air Traffic Control Systems
I have come across similar issues in many other IT services. In one case we designed a service that could fully fail over to a backup location within 300 milliseconds of any hardware or software failure (yes, that really is less than a third of a second). This kind of solution is not needed for the sort of IT services that most of us work with, but it was a viable solution for this particular customer, albeit one that was difficult to design and expensive to provide.
About 20 years ago I was involved in a project to provide laptops to mobile engineers. This service enabled the engineers to collect and update their calls remotely, providing a significant competitive advantage over the previous telephone-based system. Management suggested that we needed to make sure the laptops were locked down, to prevent the engineers from making changes that could impact the key business application, but I know something about the way engineers behave, and I didn’t think this would be possible. The solution we designed involved giving every engineer a CD that took about 20 minutes to restore the laptop to its initial working configuration and, crucially, did so without erasing any of the data that they had already stored on the laptop. This meant that nothing they did to the laptop could result in extended downtime, unless they actually managed to physically break it.
I often see service level agreements that specify availability in the form of percentage uptime, with figures like 99.95% availability during business hours. The problem with this is that it is almost impossible to design a solution to meet this target. We can predict the likely frequency of predictable hardware failures, but most real IT failures aren’t due to predictable hardware failures; they are caused by complex interactions of people, processes, software and networks. In these circumstances the best we can do is have a good plan to restore service to our users when it does go wrong, and this means getting the designers to focus on recovery time.
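To see what a figure like 99.95% means in practice, it helps to turn it into a downtime budget. The sketch below assumes that "business hours" means 8 hours a day, 5 days a week, 52 weeks a year; that definition, the function names, and the incident count of 4 are all assumptions for illustration, and a real agreement would define its own service hours.

```python
BUSINESS_HOURS_PER_YEAR = 8 * 5 * 52  # assumption: 8 h/day, 5 days/week, 52 weeks


def downtime_budget_minutes(target: float,
                            service_hours: float = BUSINESS_HOURS_PER_YEAR) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1.0 - target) * service_hours * 60


def max_recovery_minutes(target: float, incidents_per_year: int) -> float:
    """Longest average recovery time that still meets the target."""
    return downtime_budget_minutes(target) / incidents_per_year


budget = downtime_budget_minutes(0.9995)
print(f"Yearly downtime budget: {budget:.1f} minutes")
print(f"Per incident, with 4 incidents a year: {max_recovery_minutes(0.9995, 4):.1f} minutes")
```

With these assumptions the yearly budget is roughly an hour. Since the number of incidents is hard to predict, the part of this calculation that designers can actually control is the recovery time per incident.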
How many of your IT services have been designed with recovery time as a key design constraint? How confident are you that you could recover each of your IT services within a time that is acceptable to your customers? How well tested are your recovery plans? If you can’t confidently provide a positive answer to all of these questions, then maybe it’s time to review how you plan to meet your customers’ availability needs.