Amazon Web Services is a leader in providing cloud infrastructure as a service, so when its system went down in Northern Virginia during a major electrical storm on June 29 and took high-profile clients like Netflix, Instagram and Pinterest down with it, people sat up and took notice.
A few days later, Amazon issued a long and very detailed explanation of what happened during and after the several-hour outage, citing a previously unencountered load balancing issue. But the event helped highlight what infrastructure as a service is, and what it isn’t.
AWS and other cloud providers may eventually be headed towards being a completely fault-tolerant utility like water, electricity and Internet access that large enterprises require, but that’s a long way off, according to Forrester analyst Lauren Nelson.
It’s not plug-and-play like other utilities, Nelson said: applications must be custom built to run on a cloud environment, and a cloud outage would leave most large enterprises with legacy applications in the lurch with significant scalability issues.
“A lot of enterprises talk about cloud not having great enough performance, or having challenges with up time … don’t really understand cloud services,” she said. “It’s a give and take. Those that are using cloud environments really need to think about cloud services not as what they already have in house. It’s a very different type of approach.”
Nelson said that the companies that run best on a cloud environment have built applications from the ground up to handle sudden outages and can switch instantly to a new cloud environment to avoid loss of service. She pointed to Message Bus, a CRM company built entirely for the cloud by former Twitter employees, as a good example. “For small and medium businesses, and they’re designing from the ground up, they have the ability to not have to worry about an existing data center, having existing resources and having to completely redesign applications,” she said. “For most cases for the enterprises, a lot of these have people on staff have spent the last 10 years of their lives specifically customizing one of these enterprise applications to best fit that organization.”
“It’s going to take a great deal of time to be able to move that core set of resources into a cloud environment, and I don’t see that happening anytime soon,” Nelson said. “There are going to be some drawbacks, and I think we’re very far from considering Amazon to be a utility. It’s hard to get past that status quo.”
For its part, Amazon said in its statement about the outage that it is trying to be exactly that kind of resilient resource and pledged to do better. “We know how critical our services are to our customers’ businesses,” the company noted. “If you’ve followed the history of AWS, the customer focus we have, and the pace with which we iterate, we think you know that we will do everything we can to learn from this event and use it to drive improvement across our services. We will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes.”
Amazon did not respond to interview requests.
Bizo’s Resilient AWS System
Bizo is an organization that built its applications from the ground up to run on Amazon Web Services. Bizo, a business-to-business marketing firm that launched its product on the cloud service in 2008, was a runner-up in the 2009 Amazon Web Services Challenge, where start-ups competed for AWS credits by showing off the best use case of the service.
Donald Flood, Bizo’s vice president of engineering, said Amazon’s quick and detailed explanation of the outage reinforced his confidence in the decision to run Bizo’s infrastructure entirely on AWS. But he went into the relationship with eyes wide open, he said.
“With Amazon, they created such a big infrastructure and I think it’s such a viable place to run your business that they’re going to have some growing pains,” Flood said. “Let’s assume there are going to be growing pains, how do you create infrastructure around your business that mitigates that risk?”
Flood said Bizo hasn’t dropped service during any of Amazon’s outages because of the tools he and his staff have built to instantly switch to a new environment. Flood wrote his own blog post about how other companies using AWS can avoid downtime. Among the tips: “Take advantage of AWS support for multiple regions and availability zones,” and “learn from your service outages and expect more failures.”
“We do hundreds of millions of requests a day in Amazon’s infrastructure, and we have zero operations people that I employ,” Flood said. “We don’t have any operations people that manage those machines. We actually have tools we built on top of their infrastructure for the software engineers to run their applications in a scalable way. [Amazon] gives you the tools and the infrastructure to run your services in a redundant way and in a way to mitigate that risk, and it’s really up to the user to build their infrastructure in such a way that it takes advantage of that possibility.”
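Flood’s advice to spread across multiple regions and availability zones can be illustrated with a minimal Python sketch. It is not Bizo’s tooling, which the article does not describe. The function names, the `ConnectionError` convention and the zone labels are all assumptions made for illustration, and the "network" here is simulated rather than a real AWS call.

```python
from typing import Callable, Dict, Sequence, Set


class AllZonesFailed(Exception):
    """Raised when no availability zone could serve the request."""


def call_with_failover(zones: Sequence[str], send: Callable[[str], str]) -> str:
    """Try each zone in order and return the first successful response."""
    errors: Dict[str, Exception] = {}
    for zone in zones:
        try:
            return send(zone)
        except ConnectionError as exc:
            # Remember the failure and move on to the next zone.
            errors[zone] = exc
    raise AllZonesFailed(f"all zones failed: {sorted(errors)}")


def make_sender(down: Set[str]) -> Callable[[str], str]:
    """Build a simulated sender in which the zones in `down` are unreachable."""
    def send(zone: str) -> str:
        if zone in down:
            raise ConnectionError(f"{zone} unreachable")
        return f"served from {zone}"
    return send


# Illustrative zone labels only; a real deployment would use its own.
ZONES = ["us-east-1a", "us-east-1b", "us-west-2a"]
```

In this sketch, a request sent while the first zone is down is quietly answered by the second. The application only fails if every zone it knows about is unreachable, which is the kind of redundancy Flood says is left to the user to build.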
The View from Netflix
Netflix was one of the major companies taken down for a few hours on that Friday night by the AWS outage. But Netflix did not go down because it was running in only a single AWS availability zone on the East Coast. According to its blog post, Netflix ran into its own previously undiscovered load-balancing issue, which created a networking bottleneck in a part of its system designed to prevent outages.
Netflix did not respond to interview requests, but the company noted on its blog that it is dedicated to the cloud and its cloud operations and reliability engineering team is working to improve its systems.
“The state of the cloud will continue to mature and improve over time,” the company said. “We’re working closely with Amazon on ways that they can improve their systems, focusing our efforts on eliminating single points of failure that can cause region-wide outages and isolating the failures of individual zones. We take our availability very seriously and strive to provide an uninterrupted service to all our members. We’re still bullish on the cloud and continue to work hard to insulate our members from service disruptions in our infrastructure.”
Forrester’s Nelson said Netflix, to the company’s credit, has created its own “Chaos Monkey,” a program that randomly shuts down the virtual machines (VMs) running in its cloud environment to test the company’s ability to quickly reroute critical applications to other cloud environments.
“This is the type of process that enterprises need to do in order to really effectively run applications in a cloud environment,” Nelson said. “The Chaos Monkey just spontaneously shuts down VMs, all the time just to make sure that application is resilient so in the case where this is actually happening in real time, they’re not going to suffer unexpected downtime.
“VM shut down is very common in a cloud environment,” Nelson said. “Another one starts up but with [most] specific [enterprise] applications that aren’t designed for scalability. It actually makes the whole application fail and you suffer down time.”
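The idea Nelson describes can be shown with a toy simulation in Python. This is not Netflix’s Chaos Monkey, and it touches no real VMs. The `Fleet` class, the `vm-N` names and the healing rule are hypothetical, chosen only to show why an application with spare instances survives random shutdowns while a single-instance application does not.

```python
import random


class Fleet:
    """A pretend pool of VMs serving one application."""

    def __init__(self, size: int) -> None:
        self.target = size
        self.instances = {f"vm-{i}" for i in range(size)}
        self._next_id = size

    def terminate(self, name: str) -> None:
        self.instances.discard(name)

    def heal(self) -> None:
        # Start replacements until the fleet is back at its target size.
        while len(self.instances) < self.target:
            self.instances.add(f"vm-{self._next_id}")
            self._next_id += 1

    def is_serving(self) -> bool:
        return len(self.instances) > 0


def chaos_round(fleet: Fleet, rng: random.Random) -> str:
    """Shut down one randomly chosen VM and return its name."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.terminate(victim)
    return victim


def run_experiment(rounds: int, size: int = 3, seed: int = 0) -> bool:
    """Return True if the application kept serving through every shutdown."""
    rng = random.Random(seed)
    fleet = Fleet(size)
    for _ in range(rounds):
        chaos_round(fleet, rng)
        if not fleet.is_serving():
            return False  # downtime: nothing was left to take the traffic
        fleet.heal()
    return True
```

Run with three instances, the simulated application never stops serving, because a replacement starts while the survivors carry the load. Run with `size=1`, it fails on the first shutdown, which mirrors Nelson’s point about applications that were not designed for this environment.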
Email Staff Writer Ian B. Murphy at firstname.lastname@example.org. Follow him on Twitter.