Euro Security Watch with Mathew J. Schwartz

Business Continuity/Disaster Recovery, Governance

'Kill Your Darlings' for Better Disaster Recovery
Lesson From Amazon S3 Outage: Identify Weaknesses

For any of the tens of thousands of organizations that may be smarting from this week's outage of Amazon Web Services' Simple Storage Service (S3), take the following advice to heart: "You must kill your darlings."


Recommendations don't get more Gothic than that. The famous quote, widely attributed to William Faulkner, is writing advice that warns against the danger of getting too comfortable with what you know.


The same advice, however, also applies to technology, and handily highlights the disaster recovery secrets practiced by the world's most cloud-savvy organizations.

Netflix, for example, uses an internal tool called Chaos Monkey. "This service pseudo-randomly plucks a server from our production deployment on AWS and kills it. At the time we were met with incredulity and skepticism. Are we crazy? In production?!?" the company's chaos team recounts in a 2015 blog post.
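To make the idea concrete, here is a minimal sketch of that pseudo-random kill, run against an in-memory stand-in for a server fleet. It is not Netflix's tool and it calls no AWS API: the `Instance` and `Fleet` classes, the `unleash_monkey` function and the `min_healthy` threshold are all hypothetical names invented for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A stand-in for one production server (hypothetical model, not an AWS API)."""
    instance_id: str
    running: bool = True


@dataclass
class Fleet:
    """A toy service that counts as healthy while `min_healthy` instances run."""
    instances: list
    min_healthy: int = 1

    def running_instances(self):
        return [i for i in self.instances if i.running]

    def is_healthy(self):
        return len(self.running_instances()) >= self.min_healthy


def unleash_monkey(fleet, rng):
    """Pseudo-randomly pick one running instance and kill it.

    Returns the victim, or None if nothing is left to kill.
    """
    candidates = fleet.running_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


if __name__ == "__main__":
    rng = random.Random(2017)  # seeded so the same drill can be replayed
    fleet = Fleet([Instance(f"web-{n}") for n in range(4)], min_healthy=2)
    victim = unleash_monkey(fleet, rng)
    print(f"killed {victim.instance_id}; service healthy: {fleet.is_healthy()}")
```

The point of the exercise is the health check after the kill: if losing one server is enough to make `is_healthy()` return False, better to learn that during a drill than during an outage.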

The group has even coined a related term: chaos engineering. "We need to identify weaknesses before they manifest in system-wide, aberrant behaviors," the group says.

The principle is simple: Summon demons and learn how to beat them before they sneak up and eat you for lunch.

Netflix says it's continued to refine its approach. "Building on the success of Chaos Monkey, we looked at an extreme case of infrastructure failure. We built Chaos Kong, which doesn't just kill a server. It kills an entire AWS Region."
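A region-level drill can be sketched in the same spirit. The example below is an illustration only, not how Chaos Kong works internally: the `Region` class, the `route_traffic` and `kill_region` functions and every capacity figure are assumptions made up for the example. It takes one region away and checks whether the remaining regions can absorb the demand.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical model of one cloud region serving traffic."""
    name: str
    capacity: int  # requests per second the region can absorb (made-up figure)
    available: bool = True


def route_traffic(regions, demand):
    """Fill available regions in order, up to each one's capacity.

    Returns (allocation by region name, demand that could not be served).
    """
    allocation = {}
    remaining = demand
    for region in regions:
        if not region.available or remaining == 0:
            continue
        served = min(region.capacity, remaining)
        allocation[region.name] = served
        remaining -= served
    return allocation, remaining


def kill_region(regions, name):
    """Simulate the loss of an entire region."""
    for region in regions:
        if region.name == name:
            region.available = False


if __name__ == "__main__":
    regions = [
        Region("us-east-1", 600),
        Region("us-west-2", 500),
        Region("eu-west-1", 400),
    ]
    kill_region(regions, "us-east-1")
    allocation, unserved = route_traffic(regions, demand=1000)
    print(allocation, "unserved:", unserved)
```

With these invented numbers, the surviving regions fall 100 requests per second short, which is exactly the kind of weakness such a drill is meant to surface ahead of time.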

While such outages are unusual, they do happen, and Netflix says its preparatory work has helped it sidestep many availability blips that it would have otherwise suffered.

No Amazon S3 For You

Such advice is relevant as more organizations and services rely on cloud-based infrastructure for everything from serving websites, to cloud-enabling IoT devices, to storing backups.

Of course, cloud-connected services can have bad days. Early on Feb. 28, for example, Amazon reported that it was seeing "high error rates with S3" in its eastern United States region, tied to a data center in northern Virginia. "We are working hard at repairing S3," it promised.


Numerous organizations were affected, including Netflix. Indeed, users of the service from around the world experienced disruptions, as did a range of other sites and services, including Medium, GitHub, Yahoo Mail and more.

The outage also affected users of various internet-connected devices, with some people complaining that they couldn't turn their internet-connected lights on or their internet-connected oven off.

Later on Feb. 28, however, Amazon reported that the problem had been fixed. "As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally."

The AWS dashboard as seen on Feb. 28.

Cloud Upside: Uptime

Outage aside, cloud-based services from the likes of Akamai, Amazon, Cloudflare and Google on the whole still provide better uptime and availability than what the vast majority of enterprises could concoct by themselves, as Microsoft's Carmen Crincoli has noted. The services are also billed based on usage, which can make them especially affordable for smaller organizations.

It behooves any organization that relies on such services to test what might happen if a major part of its cloud-based infrastructure becomes unavailable, and then to put better disaster recovery and failover plans in place. In other words, before disaster strikes, please unleash your chaos monkey.
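What might such a test look like for a storage dependency? The sketch below is one possible shape, under stated assumptions: `ObjectStore` is an in-memory stand-in rather than the S3 API, and `get_with_failover`, `StoreUnavailable` and the `down` flag are hypothetical names. The drill deliberately marks the primary store as down and checks that reads are still served from a replica.

```python
class StoreUnavailable(Exception):
    """Raised by a store that is simulating an outage."""


class ObjectStore:
    """In-memory stand-in for a cloud object store (hypothetical, not the S3 API)."""

    def __init__(self, name, objects):
        self.name = name
        self._objects = dict(objects)
        self.down = False  # flip to True to inject an outage

    def get(self, key):
        if self.down:
            raise StoreUnavailable(self.name)
        return self._objects[key]


def get_with_failover(key, stores):
    """Try each store in order; return (data, name of the store that answered)."""
    errors = []
    for store in stores:
        try:
            return store.get(key), store.name
        except StoreUnavailable as exc:
            errors.append(str(exc))
    raise StoreUnavailable("all stores failed: " + ", ".join(errors))


if __name__ == "__main__":
    primary = ObjectStore("primary", {"logo.png": b"\x89PNG"})
    replica = ObjectStore("replica", {"logo.png": b"\x89PNG"})
    primary.down = True  # the drill: take the dependency away on purpose
    data, served_by = get_with_failover("logo.png", [primary, replica])
    print("served by", served_by)
```

A real drill would also have to cover what this toy leaves out, such as keeping the replica in sync and handling writes during the outage.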

Unfortunately, it's not clear that many organizations - beyond the likes of Netflix and its peers - have adopted these principles. "Sadly, I think organizations adopt these approaches rather like they adopt technology: there are leaders, [the] mainstream and, of course, the laggards," Alan Woodward, a computer science professor at the University of Surrey, tells me. "You just have to hope the service you may be dependent upon is a leader, not a laggard."

This piece has been updated with comment from Alan Woodward.



About the Author

Mathew J. Schwartz


Executive Editor, DataBreachToday & Europe

Schwartz is an award-winning journalist with two decades of experience in magazines, newspapers and electronic media. He has covered the information security and privacy sector throughout his career. Before joining Information Security Media Group in 2014, where he now serves as the Executive Editor, DataBreachToday and for European news coverage, Schwartz was the information security beat reporter for InformationWeek and a frequent contributor to DarkReading, amongst other publications. He lives in Scotland.



