3 things we learned from this week’s Amazon S3 outage

KEY TAKEAWAYS
Enterprises are generally best served by a multi-faceted cloud strategy that includes both public cloud and enterprise cloud
It sucks when another company’s employee takes down your infrastructure

Don’t put all your eggs in one cloud basket.

Earlier this week, Amazon’s S3 service experienced a significant outage. In total it lasted almost four hours and caused service disruption and revenue loss for many businesses.

It certainly affected Tintri. Even in my team, a number of third-party SaaS-based technologies that we use went down—our invoicing system, HR system and A/B testing service to pick just a few.

So, what caused the outage? You can read Amazon’s explanation yourself. Here’s an excerpt:

At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems … While these subsystems were being restarted, S3 was unable to service requests.

Translation: an Amazon employee was running an established playbook, and a typo in one command took far more servers offline than intended. These kinds of mistakes are learning opportunities. Here’s what we learned:

1. It’s risky to put all your eggs in one basket

A number of CIOs have publicly proclaimed that their company is “all in” on public cloud, but that’s rarely a practical solution for two reasons:

First is cost control. Some workloads are resource-intensive—they use “free” network bandwidth in your data center, but when moved to public cloud will run up your credit card bill. Our customers are proof; at our recent Customer Advisory Board, most had a billing horror story, including one customer’s surprise $50,000 monthly charge for network bandwidth.

Second is performance SLAs. Mission critical workloads that perform consistently in your data center might struggle in public cloud if they’re not designed for that environment. For example, in public cloud your applications have to provide their own availability and resiliency… which leads us to a second learning …

2. Delivering availability is hard

The oft-cited bar for enterprise availability is five-nines (99.999% availability), which allows only about five minutes of downtime per year. After a 3 hour and 50 minute outage, Amazon would need roughly 44 years of otherwise flawless performance to average five-nines again.
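As a rough check on that arithmetic, here is a minimal Python sketch. It assumes availability is averaged over one continuous window, that the outage is the only downtime in that window, and that a year is 365.25 days. The function name is illustrative and does not come from any Amazon or Tintri tooling.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def years_to_absorb_outage(outage_minutes: float, nines: int) -> float:
    """Return the window length (in years) over which a single outage
    averages out to the target availability, e.g. nines=5 for 99.999%."""
    allowed_downtime_fraction = 10 ** -nines
    window_minutes = outage_minutes / allowed_downtime_fraction
    return window_minutes / MINUTES_PER_YEAR


outage = 3 * 60 + 50  # 3 hours 50 minutes, as cited above
print(f"{years_to_absorb_outage(outage, 5):.1f} years")  # roughly 44 years
```

The same function shows how much more forgiving lower tiers are: at four-nines the same outage averages out in a little over four years.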

Many organizations that are migrating workloads to public cloud may not be aware that most service tiers are three-nines or four-nines. It’s important to understand how much revenue can be lost over a difference of a few decimal places of availability. You need to plan for events like this week’s outage when considering your own SLAs to your customers.
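To make the difference between tiers concrete, the sketch below converts a number of nines into an annual downtime allowance. It is illustrative only: the helper name is made up here, a year is assumed to be 365.25 days, and real SLAs define their own measurement windows and exclusions.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def annual_downtime_minutes(nines: int) -> float:
    """Downtime allowed per year at the given number of nines
    (3 -> 99.9%, 4 -> 99.99%, 5 -> 99.999%)."""
    return MINUTES_PER_YEAR * 10 ** -nines


for nines in (3, 4, 5):
    minutes = annual_downtime_minutes(nines)
    print(f"{nines} nines: {minutes:.1f} minutes per year")
```

Under these assumptions, three-nines allows nearly nine hours of downtime a year, four-nines about 53 minutes, and five-nines a little over five minutes.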

Enterprise cloud offers some advantages in the effort to guarantee performance. Tintri’s platform offers autonomous operation—every virtual machine is assigned its own lane to eliminate conflict over resources. And tiers of performance can be established with VM-level Quality of Service (QoS) controls.

3. Recovery in seconds is better than hours

Amazon’s extended downtime also highlights a critical point—that when $h!t happens you need to be able to recover as rapidly as possible.

And with enterprise cloud that can only happen when you’re working at the right level of abstraction (VMs vs. LUNs or volumes). Tintri SyncVM (copy data management) is proof; it allows you to move back and forth between recovery points for individual VMs without ever losing performance history.

The need to stay a step ahead is also why we recently added synchronous replication to allow for zero RTO and near-zero RPO. And it’s why more of our customers are using Tintri for BOTH primary and secondary storage, with automated snapshots as a first line of defense.

Look, we’re not saying public cloud is bad, nor are we disparaging Amazon or S3. The point is that some workloads work fine in public cloud as a function of their design, use and requirements. But nearly every organization today has some workloads that are better suited to its own data center, and that’s why organizations need a multi-faceted approach that includes BOTH public cloud and enterprise cloud.
