We had an outages yesterday that was totally unexpected. It was so big that the major news reported them i.e. Akamai outages. It was surprising because we would have thought this type of outages were a relic of the past. How should outages being handled in a world where users cannot tolerate 5 min of downtime?
Formation of Outage Team
Do you have an outage team? It is likely majority of organisations do not have an outage team. Neither are outage drills conducted in large scale for the organisation. You can relate this like a fire drill where you will have a fire drill team for your entire buildings. This team is usually cross functional and well drilled in various skillsets from infrastructure, security and applications. Although this team could be costly, it is worthwhile to note that outages may cost a 100 times more.
Outage management is different from disaster recovery. Disaster recovery usually deals mostly with internal applications or related downtime. These SOP (Standard Operating Procedure) could be formal processes with your external parties. When outage happens in the old ways, it usually starts with a Sev1 (Severity 1) and manage with standard incident management process. It will take an hour or more before this is deem an outage. The slower pace and passive nature of incident management may not suit today’s outage.
Outages are no longer isolated in nature due to Cloud and networked environment. Case study like Akamai outage will lead organisations to consider a need for outage team who are highly trained and focused to manage outages swiftly.