You no doubt heard about the Amazon S3 outage that happened earlier this week. It was reported far and wide by media outlets that normally don’t delve into the details of the technology supporting our connected world. It’s interesting to think that most people have heard of The Cloud but never of AWS, and certainly not of S3.
We didn’t report on the outage itself, but we ate up the details of the aftermath. It’s an excellent look under the hood. We say kudos to Amazon for adding to the growing trend of companies sharing the gory details surrounding events like this, so that we can all understand what went wrong and how they plan to avoid it in the future.
Turns out the S3 team was working on a problem with part of the billing system and, to do so, needed to take a few servers down. A command entered with an incorrect input while taking those machines down ended up removing a much larger set of servers than intended. They went out like a light switch, but turning that switch back on wasn’t nearly as easy.
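We don’t know what the actual command looked like, but as a purely hypothetical sketch (the host names and helper below are made up for illustration), here’s how one slightly-off input to a capacity-removal tool can balloon from a handful of hosts to a whole fleet:

```python
# Hypothetical illustration only -- not Amazon's actual tooling.
# A removal command that selects hosts by name prefix: one character
# off in the filter and the selection grows by an order of magnitude.

FLEET = [f"index-{i:03d}" for i in range(200)] + [f"billing-{i:03d}" for i in range(10)]

def hosts_to_remove(prefix: str) -> list[str]:
    """Return every host whose name starts with the given prefix."""
    return [h for h in FLEET if h.startswith(prefix)]

# Intended: pull a few billing hosts for debugging.
print(len(hosts_to_remove("billing-00")))  # 10 hosts -- fine

# Mistyped filter: now half the index fleet is on the chopping block.
print(len(hosts_to_remove("index-0")))     # 100 hosts -- outage territory
```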
The servers that went down handle requests made through the S3 API. A full restart of this kind hadn’t been attempted in several years, and with the explosive growth of the Simple Storage Service in the meantime, it took far longer than expected. Compounding this was a backlog of tasks that built up while the API servers were being brought back online, and working through that backlog took time as well. The process was like waiting for a bathtub to fill up with water. It must have been an agonizing experience for those involved, but certainly not as bad as what the folks who had to restore GitLab service a few weeks back went through.
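To put some entirely made-up numbers on that bathtub analogy: once the fleet is back, the backlog only drains at the difference between how fast requests can be served and how fast new ones keep arriving, so a narrow margin means a painfully slow recovery.

```python
# Back-of-the-envelope sketch with hypothetical numbers, just to show
# why a queued-up backlog can take so long to clear after recovery.

backlog = 2_000_000      # requests queued while the servers were down
arrival_rate = 50_000    # new requests per second still coming in
service_rate = 60_000    # requests per second the recovered fleet can handle

drain_rate = service_rate - arrival_rate
print(f"Time to clear backlog: {backlog / drain_rate:.0f} seconds")
# -> 200 seconds in this toy example; shave the margin and it gets much worse.
```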
[via /r/programming]