Thursday, March 2, 2017

Amazon: A Simple Error Brought Down AWS

I would hate to be the guy who has to live out the rest of his days at Amazon known simply as the "guy who brought down the internet."

The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
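
The failure mode described above, a removal command that accepts a mistyped input and acts on far more servers than intended, can be illustrated with a small sketch. This is not Amazon's tooling: the function `plan_removal`, the `max_count` limit, and the server names are all hypothetical, chosen only to show how a capacity-removal command might refuse a request that is larger than expected.

```python
class RemovalRefused(Exception):
    """Raised when a removal request fails a safety check."""


def plan_removal(fleet, requested, max_count=2):
    """Return the servers to remove, or raise if the request looks wrong.

    fleet: names of servers currently in service (hypothetical).
    requested: names the operator asked to remove.
    max_count: hypothetical cap on how many servers one command may remove.
    """
    fleet_set = set(fleet)

    # Reject names that are not part of the fleet (likely a typo).
    unknown = sorted(s for s in set(requested) if s not in fleet_set)
    if unknown:
        raise RemovalRefused(f"unknown servers: {unknown}")

    # De-duplicate, then refuse requests larger than the cap.
    targets = sorted(set(requested))
    if len(targets) > max_count:
        raise RemovalRefused(
            f"asked to remove {len(targets)} servers, limit is {max_count}"
        )
    return targets
```

With a guard like this, a request for two servers goes through, while an input that accidentally matches half the fleet raises `RemovalRefused` instead of removing anything.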