AWS. IBM Cloud. Microsoft Azure. What do they have in common? Yes, they’re all public clouds, but there’s something else that binds them: They’ve all experienced a major outage or service interruption of some consequence to customers.
This is not to say they should be called out and shamed over whose failure was worst. The fact remains that technology, in all its various forms and offerings, is inherently vulnerable given the massive complexity of virtualized IT environments.
While it’s possible to look back and calculate the effects – in lost revenue, lost productivity, customer churn, or other measures – one thing is certain: no business can afford an outage in today’s ultra-competitive environment. These outages bring to light a stark reality: as the cloud continues to evolve, serious questions remain about its resiliency against unforced errors.
What does this mean to businesses and IT teams? It means you must have resiliency-in-layers for your cloud infrastructure. In other words, think how you can enable an organization to quickly avoid disruptions, minimizing impact to end-users and customers to the point that they are unaware a disruption even occurred. Business impact avoidance needs to be the main goal of a layered approach.
The effects of downtime
The effect of the AWS S3, Microsoft Azure, and other outages will have a shelf life and will fade from view for a time. But in a world that relies ever more heavily on IT capabilities, much of them supported in the cloud, this shift is fraught with vulnerability. Some businesses – those without resiliency-in-layers – are more vulnerable than others. At the time of the AWS outage, some organizations, including the Review, found their entire website, and consequently their business, offline.
Others, such as Yahoo, Adobe, Netflix, Instagram, and even Apple’s iCloud, experienced problems that directly affected their business. The outage even reached smart homes and IoT services: parking meters and thermostats were reported to have issues directly linked to S3. Born-in-the-cloud, all-eggs-in-one-basket business applications will find it hardest to add these multiple layers, because they are designed in place and not built for portability. To solve this challenge and put the right safety nets in place, many organizations are now looking at building a more hybrid cloud, leveraging a managed service provider or their own data centers.
The cost of downtime is something many businesses cannot afford. One Gartner study estimated that, on average, every minute of downtime costs a business $5,600, which adds up to over $300k per hour. For SMBs especially, the financial hit can be too great and the business can fail. For large companies, the reputational hit can be just as damaging. A hybrid plan with resiliency-in-layers makes it possible to avoid both.
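As a back-of-the-envelope check on that Gartner figure, the arithmetic is simple enough to sketch. The function name and the flat per-minute rate are illustrative only; real downtime cost varies enormously by business.

```python
# Illustrative downtime cost estimate using the Gartner average cited
# above ($5,600 per minute). A flat rate is a simplification: real
# losses depend on the business, the application, and the time of day.
COST_PER_MINUTE = 5_600  # USD, Gartner average

def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE) -> float:
    """Estimated loss (USD) for an outage of the given length in minutes."""
    return minutes * cost_per_minute

# One hour at the average rate lands well above the $300k mark:
print(downtime_cost(60))  # → 336000
```

Even a four-hour outage at this average rate exceeds $1.3 million, which is why the hybrid, layered approach below is framed in terms of avoided business impact rather than infrastructure cost alone.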
Disaster recovery plans are good, but must be tested
Having a “DR plan” is not enough. Again, resiliency-in-layers is key to business continuity. This means examining your vendors, locations, and technologies and deliberately keeping them heterogeneous. Why? Diversity adds to the resiliency-in-layers effect by preventing any one action, activity, bug, or catastrophic event from impacting the rest of the business environment. And you must test that plan regularly, so that when the poop hits the fan you avoid downtime because the automation muscles are already built into your plan. It is therefore critical that organizations implement DR strategies that are easy to test often, at every layer of the infrastructure stack.
A dirty but widely known secret within the IT industry is how often DR tests fail because of un-scalable, complicated, manual processes and incompatible technologies. Because of this shortcoming, many organizations don’t conduct regular tests, or lie to themselves because a domain controller came up and they could ping it – or worse, skip testing entirely while a piece of undeployed shelfware gives the illusion of coverage.
A successful DR infrastructure needs to be highly automated and to replicate data continuously, allowing applications to be quickly “rewound” to the seconds just before an outage. It must be able to meet the recovery point objective (RPO) defined by the business, with little to no loss of data or application availability. Even a few seconds can cost a business tens of thousands of dollars, whether in revenue lost while an application is down and unable to transact or in fines from a compliance failure.
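The RPO logic described above can be sketched as a simple monitoring check: measure how far the newest replicated checkpoint lags behind the present moment and compare that lag to the business-defined target. The function name and the five-second target here are hypothetical, not drawn from any specific product.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: is continuous replication keeping the recovery
# point within the business-defined RPO? The 5-second target is
# illustrative; the business sets the real number.
RPO = timedelta(seconds=5)

def meets_rpo(last_replicated: datetime, now: datetime) -> bool:
    """True if the newest replicated checkpoint is inside the RPO window."""
    return (now - last_replicated) <= RPO

now = datetime.now(timezone.utc)
assert meets_rpo(now - timedelta(seconds=3), now)       # 3 s behind: within RPO
assert not meets_rpo(now - timedelta(seconds=30), now)  # 30 s behind: RPO breach
```

In an automated DR test, a check like this would run continuously and alert the moment replication lag exceeds the target, rather than discovering the gap during an actual failover.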
IT professionals and companies need to build and adopt tools and platforms with redundant, scalable, simple recovery and DR testing processes. The quicker a company can recover data, the smaller the effect on the business, with significant cost and time savings realized.
IT Resilience – The Case for Hybrid Cloud
For many different reasons, IT leaders have their own preferences and organizational requirements: some face compliance challenges, others data-locality issues. For this reason, DR plans can seem as unique as a fingerprint in how they are built, maintained, and where they recover to. While IT is clearly moving toward cloud-based infrastructures, the centerpiece of this trend is the ability to thrive through every permutation of disaster – not just natural causes, but common power failures and human error.
Though each element within a hybrid cloud has its own strengths and weaknesses, this raises the larger question: what is the best way to manage against technology service disruptions? Here are the three pillars that help enterprise-class organizations achieve IT resilience:
- You must have resiliency-in-layers, meaning one or more secondary, geographically and meteorologically diverse, off-premises recovery data centers. This ensures that, should anything happen to your primary site, you always have a redundant location to reduce the risk of an extended outage. The potentially high capital costs of building or renting data center space need to be weighed. Still, some larger enterprises with strict compliance mandates, such as those in financial services or healthcare, must have such a facility for regulatory reasons alone.
- Use a managed service provider (MSP) or cloud service provider (CSP). This switches the financial model to OpEx and lets you leverage ready-made infrastructure and the provider’s experts, who are contractually obligated to deliver on the defined service level agreement (SLA). What you give up, in some cases, is fine-grained day-to-day administration and control. But many companies end up asking themselves, “Is my company in the data center and IT business, or the business of healthcare?” (for example).
- Dip your toe into public cloud infrastructure. More and more organizations are rolling their own or leveraging MSP/CSP partners to “test drive” public cloud as a second or third site. Businesses must match their data and application priorities with the associated target and SLA requirements. In a roll-your-own situation, you are still on the hook for the SLA, which is why many companies turn to an MSP/CSP. A major advantage is the tremendous scalability it enables.
While every public cloud outage demonstrates that no provider is immune to catastrophes, treating public cloud as one part of your resiliency-in-layers, hybrid-based plan can be a cost-effective way to gain a third (or more) site and add geographic and meteorological diversity to your plan. Augment and leverage the expertise of a managed service provider to help achieve your SLAs – and have the right answer when the CEO asks, “What are we doing in the cloud?”
Rob Strechay, Senior Vice President of Product, Zerto
Strechay serves as SVP of Product at Zerto, a leading provider of cloud IT resilience and application mobility solutions. With 20 years of experience in disaster recovery, storage, application performance management, and cloud and virtual infrastructure management, Strechay is a hands-on evangelist who specializes in taking products from concept to revenue.