AWS claims human error to blame for US cloud storage outage – ComputerWeekly.com

Amazon Web Services (AWS) says human error caused the cloud storage outage that lasted several hours and affected thousands of customers earlier this week.


Amazon's Simple Storage Service (S3), which provides backend support for websites, applications and other cloud services, ran into technical difficulties on the morning of Tuesday 28 February in the US, returning error messages to those trying to use it.

The cloud service giant revealed the cause in a post-mortem-style blog post, explaining that the issue could be traced back to exploratory work its engineers were doing to establish why the S3 billing system was performing so slowly.

During this process, a number of servers providing underlying support for two S3 subsystems were accidentally removed, requiring a full restart, which caused the problems.

"An authorised S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," said the blog post.

"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."
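The blog post does not detail the command involved, but a common guardrail against this class of mistake is a hard cap on how much capacity a single invocation can remove, so a mistyped input cannot take out a large fleet in one step. A minimal sketch, assuming a hypothetical `remove_capacity` helper and an arbitrary threshold (neither is AWS's actual tooling):

```python
# Hypothetical illustration only, not AWS's real playbook tooling: a
# capacity-removal helper that refuses oversized requests outright.

MAX_REMOVALS_PER_COMMAND = 5  # assumed safety threshold


def remove_capacity(requested_hosts, max_per_command=MAX_REMOVALS_PER_COMMAND):
    """Return the hosts approved for removal, rejecting oversized requests."""
    if len(requested_hosts) > max_per_command:
        raise ValueError(
            f"Refusing to remove {len(requested_hosts)} hosts; "
            f"limit is {max_per_command}. Split the request or get an override."
        )
    return list(requested_hosts)


# A small, intended request succeeds...
print(remove_capacity(["index-host-01", "index-host-02"]))

# ...but a fat-fingered input matching far more hosts is rejected.
try:
    remove_capacity([f"index-host-{i:02d}" for i in range(40)])
except ValueError as err:
    print(err)
```

The point of such a cap is that removing too little capacity fails safe (the operator simply reruns the command), whereas removing too much, as in this incident, can force a full subsystem restart.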

This affected instances of S3 run out of the firm's US East-1 datacentre region in Virginia, US, causing havoc for a number of high-profile websites and service providers, including the cloud-based collaboration platform Box and the instant and group messaging service Slack.

The outage also had a knock-on impact on a number of AWS services hosted from US East-1 that rely on S3 for backend support, including Amazon Elastic Compute Cloud (EC2), AWS Elastic Block Store and AWS Lambda.

It also caused the AWS service status page to stop working, leaving users unable to find out when the firm's systems would be back up and running again.

The downtime has prompted numerous industry commentators to speak up about the risks of running a business off the infrastructure of a single cloud provider, while others have seized on it to reinforce the importance of having a robust business continuity strategy in place.

AWS, however, goes on to say its platforms are built to be highly resilient, but the full-scale restart of S3 took much longer than anticipated.

"We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes," said the post.

"While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years."

"S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected," it added.

The incident has prompted AWS to re-evaluate the setup of its S3 infrastructure, the blog post continues, to prevent similar incidents from occurring in future.

"We want to apologise for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further," it concluded.

