You no doubt heard about the Amazon S3 outage that happened earlier this week. It was reported far and wide by media outlets that normally don’t delve into the details of the technology supporting our connected world. It’s interesting to think that most people have heard of The Cloud but never of AWS, and certainly not of S3.
We didn’t report on the outage, but we ate up the details of the aftermath. It’s an excellent look under the hood. We say kudos to Amazon for adding to the growing trend of companies sharing the gory details surrounding events like this so that we can all understand what caused this and how they plan to avoid it in the future.
Turns out the S3 team was working on a problem with some part of the billing system and, to do so, needed to take a few servers down. An incorrect command used when taking those machines down ended up affecting a much larger block of servers than expected. So they went out like a light switch, but turning that switch back on wasn’t nearly as easy.
The servers that went down run various commands in the S3 API. With the explosive growth of the Simple Storage Service, this “reboot” hadn’t been tried in several years and took far longer than expected. Compounding this was a backlog of tasks that built up while they were bringing the API servers back online. Working through that backlog took time as well. The process was like waiting for a bathtub to fill up with water. It must have been an agonizing process for those involved, but certainly not as bad as what the folks who had to restore GitLab service a few weeks back went through.
[via /r/programming]
I am frankly somewhat surprised they were able to get it back up as “quickly” as they did, given all of the mess that had to be diagnosed and then resolved in bulk while everything (in theory) had to be kept in sync. There are a LOT of moving parts going on there.
Oops, Amazon broke a giant chunk of the internet.
I would hate to be the guy who blew away the index system for most of AWS.
I mean, it’s a toss-up whether he lost his job over it, but it certainly would keep me awake at night thinking about the blunder I had made.
I don’t think it was an issue with “blowing away” the index system. The issue was that it hadn’t been restarted in years and bringing it back up took a lot longer than expected. Sure, the result was an outage of a few hours, but the initial mistake wasn’t the cause of that long delay. In an ideally functioning system the mistake would have been corrected quickly, and now they’re working to make sure that is true if the indexing system does go down again.
At this kind of scale, if an error by one person caused it, then it is the methods that are at fault and not the person who mistyped the command. That is, the command should require confirmation before execution.
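As a rough sketch of what I mean (purely hypothetical: the fleet, the 5% threshold, and the confirmation prompt are all made up here, and this is not Amazon’s actual tooling), a decommissioning script could refuse to run when the request covers more of the fleet than expected and make the operator acknowledge the exact blast radius first:

```python
# Hypothetical "confirm before execution" guard for a decommissioning tool.
# Fleet size, threshold, and host names are invented for illustration only.
import sys

FLEET = [f"api-{n:03d}" for n in range(1, 101)]  # pretend fleet of 100 hosts
MAX_REMOVAL_FRACTION = 0.05                      # never take down more than 5% at once

def remove_hosts(requested):
    # A typo'd host name is a strong hint the whole command is wrong.
    unknown = [h for h in requested if h not in FLEET]
    if unknown:
        sys.exit(f"Aborting: unknown hosts {unknown} (possible typo?)")

    # Refuse requests that would remove too much capacity in one go.
    if len(requested) > len(FLEET) * MAX_REMOVAL_FRACTION:
        sys.exit(f"Aborting: {len(requested)} hosts exceeds the "
                 f"{MAX_REMOVAL_FRACTION:.0%} safety limit")

    # Echo the exact blast radius and require explicit confirmation.
    print("About to take down:", ", ".join(sorted(requested)))
    answer = input(f"Type the number of hosts ({len(requested)}) to confirm: ")
    if answer != str(len(requested)):
        sys.exit("Confirmation failed; nothing was touched.")

    for host in requested:
        print(f"removing {host} from service...")  # real decommission work would go here

if __name__ == "__main__":
    remove_hosts(sys.argv[1:])
```

The exact check matters less than the principle: a fat-fingered argument should become a hard stop the operator has to consciously acknowledge, not an outage.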
I use AWS a lot and it is an excellent service. I’m no network guru, yet I find it easy to use.
I understand what you are saying, but creating entirely new checks and methods to prevent this specific situation, while it might be helpful, still somewhat misses the bigger issue and is also somewhat unrealistic. The bigger issue is that there are hundreds of thousands of other potential problems that could also be at fault.
Creating systems to check for and prevent every possible issue is not just nearly impossible, it adds all sorts of other costs (not just monetary) and complexity as well. There is no clean way to prevent issues that are similar but different. That said, there are probably a number of broad architectural and process improvements that would help keep things like this from happening again, and I am sure they are working on those right now.
*this* – From my experience, many times when a user does something that causes an issue in an application, it turns into ‘the system’s fault’ because it let them do it, rather than an issue with user training on the process/application, or corrective action with the user. You can only (reasonably, for time/cost) put so many checks in to keep a user from doing something bad/dumb, and the more you ‘are you sure’ them to death, the more they get in the habit of clicking ‘yes/continue’ without even bothering to read the confirmation, and it becomes a pointless hindrance.

That is more at an application level, and the S3 issue (without having read the details behind it yet) likely applies more at an administrator level, where there are not usually these babysitter ‘are you sure’ type checks. If I’m on a system as a user and do an ‘rm’, sure, I’ll get a file delete confirmation, but if I’m in as root, you could easily blow away the system with an unchecked recursive delete. This isn’t poor system design (in my opinion, and most AIX/’nix/enterprise systems seem to agree); it is trusting that someone operating at this level intends to do what they are doing, and not preventing or hindering them from doing their job.
This isn’t uncommon. A buddy of mine got a new job at a tech company running entirely on IPv6. Excited, he began exploring the network until an incorrectly typed IPv4 command took down the entire company network.
My company network has a fluke where running a process with a very specific set of parameters triggers an annoying-as-hell bug that causes one of the gateway servers to hang, sometimes for hours. The hilarity is that the parameters are a moving target. After hours of testing, I have a Perl script that predicts what parameters will hang the server at any given time, but I still can’t find the damn bug. Yes… I said “any given time”: the parameters literally change depending on the date and, to a lesser extent, the time. What hangs the server today won’t hang it tomorrow. Even worse, the bug doesn’t manifest itself on the test bed…
Sounds like some self-modifying worm or virus, written to annoy or cause productivity losses through apparently random problems. Or it’s simply using a pseudorandom number generator (or something in the server that acts like one) and the system clock to set off its payload.
Had any recently dismissed and disgruntled employees? https://en.wikipedia.org/wiki/Omega_Engineering
The test bed works, so why not “nuke and pave” the gateway server with a copy of the perfectly functioning test system? If it still has issues then you know the problem is coming from outside that server.
I often find it’s much quicker to back up the files that need saving and then do a clean install than to clean a computer of malware and fix the damage. I’d much rather spend an hour reinstalling, knowing there’s nothing bad hanging around, than spend several hours attempting to get the computer cleaned of an infestation.
Then, if you know what caused the problem, you can set up defenses to protect the clean system, knowing that it’s fine and will stay fine, especially if the user pays attention to the advice on things not to do.