Fail of the Week: GitLab Goes Down

Has work been a little stressful this week? Are things getting you down? Spare a thought for an unnamed sysadmin at the GitHub-alike startup GitLab, who early yesterday performed a deletion task on a PostgreSQL database in response to problems the company was having in the wake of an attack by spammers. Unfortunately, due to a command line error, they ran the deletion on one of the databases behind the company’s main service, forcing it to be taken down. By the time the deletion was stopped, only 4.5 GB of the 300 GB trove of data remained.

Reading their log of the incident, the scale of the disaster unfolds, and we can’t help wincing at the phrase “out of 5 backup/replication techniques deployed none are working reliably or set up in the first place”. In the end they were able to restore most of the data from a staging server, but at the cost of six hours’ worth of lost issues and merge requests. Fortunately for them, their git repositories were not affected.

For 707 GitLab users then there has been a small amount of lost data, the entire web service was down for a while, and the incident has gained them more publicity in a day than their marketing department could have achieved in a year. The post-mortem document makes for a fascinating read, and will probably leave more than one reader nervously thinking about the integrity of whichever services they are responsible for. We have to hand it to them for being so open about it all, and for admitting the backup failures as a failure of the whole company rather than heaping blame on one employee. In many companies it would all have been swept under the carpet. We suspect that GitLab’s data will be shepherded with much more care henceforth.

We trust an increasing amount of our assets to online providers these days, and this tale highlights some of the hazards inherent in placing absolute trust in them. GitLab had moved from a cloud provider to their own data centre, though whether or not this incident would have been any less harmful wherever it was hosted is up for debate. Perhaps it’s a timely reminder to us all: keep your own backups, and most importantly: test them to ensure they work.

Thanks [Jack Laidlaw] for the tip.

Rack server image: Trique303 [CC BY-SA 4.0], via Wikimedia Commons.

54 thoughts on “Fail of the Week: GitLab Goes Down”

  1. It is the one single universal truth of all EDP for as long as it has been around: there are two types of folks, those who keep good backups, and those who will one day wish they had.

        1. On the balance of probability, most servers shipped with RAID will have disks from the same batch. RAID is not a backup. Backups are backups. RAIDs are RAIDs. RAIDs fail. Monitoring systems for RAIDs fail. Backups fail. You can never have too many backups. Back up your RAID, back up your backups, back up the backups to your backups. One more point: don’t leave old data tapes out with the refuse; either keep them forever, or destroy them with fire. Hackers rake bins (not that I would ever do such a thing, of course).

          1. I was responsible for backups for many years at my last employer, and kept two sets in different safes in two different buildings two kilometers apart. I don’t know how many times I saved users who had deleted documents by mistake.

          2. RAID can be used as a backup, if you do it correctly.

            Imagine you want a daily backup for a week back in time, and a monthly backup for 12 months back in time.

            Then buy a RAID-1 cage with 4 slots, and then buy 21 drives.

            The RAID controller must be hot-pluggable and store its configuration on itself (it shouldn’t need to store anything on the drives to keep the array up, and the disks must be readable without the RAID controller), and it must assume that a drive inserted into an already-valid RAID-1 set should be overwritten.

            7 of the 21 drives you mark Monday, Tuesday, Wednesday and so on.
            12 of the 14 remaining drives, you mark January, February, March, April and so on.

            —–

            Now to start the backup strategy:
            First you put in the 2 unmarked drives into the cage, and then the drive for the current day, and the drive for the current month.
            Install your system and carry on.

            Next day, you pull out the drive marked with yesterday’s day, and put in the drive marked with today’s. Wait until the RAID controller has copied the set to today’s drive.
            If the month changes, you do the same, but ONLY after the day’s drive has finished copying.

            All drives not inserted in the RAID set, you store in a fireproof safe or something similar.
            Then you simply rotate around the drives as indicated.

            —–

            Imagine a disaster happens and you need to restore to a prior point in time.
            Now pull OUT all drives in the RAID set, so it’s empty (this means pulling out both unmarked disks, today’s disk, and the current month’s disk).
            Insert the disk for the day or month you want to restore to. Let’s say you want to restore from last February’s backup; then you insert that disk. Now wait until the RAID set declares this February disk as “master”.
            Now insert the disk for today. It will be overwritten with “February” data.
            When it’s done, insert ONE of the unmarked disks. Wait until this is overwritten too.
            When it’s done, insert the last unmarked disk. Wait until this is overwritten.
            Now pull out the February disk and insert the current month’s disk. The current month’s disk will now be overwritten with February data.

            Voilà, you have restored to February. Now you can carry on as usual.
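The rotation schedule described in this comment can be sketched as a small lookup. This is a hypothetical illustration only; the drive labels and the function name are made up, and a real setup would live in runbooks rather than code.

```python
from datetime import date

# Sketch of the commenter's scheme: 7 drives labelled by weekday,
# 12 labelled by month, plus 2 unmarked drives that never leave the cage.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def drives_in_cage(today: date) -> list:
    """Return the labels of the four drives that belong in the RAID-1 cage."""
    return [
        "unmarked-1",
        "unmarked-2",
        WEEKDAYS[today.weekday()],   # swapped every day
        MONTHS[today.month - 1],     # swapped when the month changes
    ]

print(drives_in_cage(date(2017, 2, 1)))
# ['unmarked-1', 'unmarked-2', 'Wednesday', 'February']
```

The point of the lookup is that at any moment exactly four drives are live and the other seventeen sit in the safe, which is what makes the scheme a backup rotation rather than plain mirroring.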

          3. re “Don’t leave old data tapes with the refuse”: Please encrypt your backups. There are so many possible situations where the person that can properly handle the backup medium isn’t present to advise on the backup’s disposal or transport.

          4. I disagree with this new trend of saying that RAID (RAID-1) is not a backup. You’ve got your data on two different disks; how is that not a backup? Sure, you’ll lose everything if the partition gets corrupted or if you accidentally delete a file, but that just means this particular backup solution, like others, has a vulnerability. Copying your data to a separate computer is considered a backup, but if a surge busts both computers then you’re out of luck. Does that mean it’s not a backup? RAID is a backup, but just like every other backup solution, it shouldn’t be used alone. It also depends on your level of comfort. I have some sandbox VMs that I run off the RAID. I do that because I want to protect myself against HDD failures, but I don’t find these images important enough to copy to a second computer. I would even consider copying data into two different folders on the same partition to be a backup. A risky backup, but depending on the importance of that data, it might be just enough. Copying on RAID-1 is still better than nothing at all anyway. And the day one drive fails and you recover the data from the remaining drive, you will say: “Damn, I’m happy to have a backup!”

  2. “out of 5 backup/replication techniques deployed .. set up in the first place”.
    I didn’t realise not doing something counted as deployment? This is great news; I always fancied a tape drive but couldn’t justify the cost, and now I can simply ‘deploy’ one for free.
    It’s great that they are open and transparent about it? Not like they had much choice. Perhaps they should have been more open and transparent about their data integrity to begin with.

    1. Can’t justify the cost? That means your data is completely worthless to you. Honestly, if your data has any value at all, you will justify the cost of a backup system.

      I had an executive use the same words you did, I responded with, “so if I delete the whole accounting database right now you won’t care, as it’s not worth backing it up?”

      Suddenly he decided spending money on a backup solution was justifiable.

      1. +100 to that.
        Backup frequently, often and to multiple locations. Assume the jumbo jet scenario and plan for it. The jumbo jet scenario is where your offices and the local data centre both get wiped out by the same crashing jumbo jet and you have to start from scratch… Do you have backups… do you have kit to deploy them to, do you have kit to read them from? If not.. you don’t have a snowballs chance in hell of recovering.
        You can never have too many backups.
        One customer years ago, did everything right… backups on site, backups off site, backups in the local bank vault. Floods came, bank vault flooded sysadmins car (with most recent backup) washed away, offices trashed and all kit ruined…. Paper invoice reports were all that was recoverable initially, which were dried off carefully on clothes lines believe it or not. Much data recovery work on stinky hard disks and knackered tapes got most of it back… their insurance company paid for a lot of the work. This was before the days of cloud storage….
        Backup.. Backup the backup… and backup the backup to the backup.

      2. Good restore hygiene (testing restore under duress to potentially degraded targets) will always beat good backup hygiene!
        Too commonly heard – “we had good backups but…..”

    1. I’ve spent a lot of money on backup software only to find out it’s not worth anything; it is crap.
      I have two sets of hard drives: one is my backup and the other is my running disks.
      I no longer use any software to do my main backups. I do it by hand, and it takes about 3 to 4 weeks to copy and
      verify 50 terabytes of my stuff. I keep a running backup going until I do my main one, and I do a main backup twice a year, then rotate my hard drives and pack away the new backup.
      And yes, I have lost a lot of data over the years, and now I have saved a lot as well.

    1. An interesting assertion. I never use -rf. I will first try rmdir, and if it doesn’t work (and I know I’m deleting the correct directory) I’ll do a “sudo rm -r [dirname]”.

      Is this bad? I’m assuming subdirectories inside of subdirectories. Should I be stepping into the victim dir and running rm -r from there?
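The same try-the-safe-call-first habit can be sketched in Python, where `os.rmdir` refuses non-empty directories (like `rmdir`) while `shutil.rmtree` deletes recursively (like `rm -r`). A hedged illustration only; the directory here is a throwaway temp path, not anyone's real data.

```python
import os
import shutil
import tempfile

# Make a sacrificial non-empty directory to stand in for the "victim dir".
victim = tempfile.mkdtemp()
open(os.path.join(victim, "file.txt"), "w").close()

try:
    os.rmdir(victim)        # safe first attempt: fails if dir is not empty
except OSError:
    # Only after the safe call refuses do we reach for the recursive delete.
    shutil.rmtree(victim)

print(os.path.exists(victim))   # prints False
```

The design point is the same as in the comment: the destructive recursive form is never the first command you type, only the fallback once you have confirmed what you are deleting.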

        1. McDonald’s has a thing on their receipts where you can do an online survey (up to 5x a month) to get a free Quarter Pounder With Cheese. For quite a while it was on the receipts as a QPC with Cheese. I asked if that meant it came with double the cheese.

          Dairy Queen has a couple of lunchtime meals for five dollars, but it’s on their menu boards as $5 Buck Lunch. What? Five Dollar Buck Lunch?

          Sometimes people just fail to think a thing through.

    1. We noticed an error in the script used in a company’s backup strategy, and had to go back through years of LTO tapes (after getting access to the tape room) before we found one with an actual backup, with data that would read, while we still had the hardware available to access it. The sysadmins had been religiously following the backup strategy document for years; only at no point did it ever mention checking the contents or trying a restore.

      Thankfully this was not during a disaster rebuild, or major flying things would have hit the fan, given how much a secure offsite tape room and that many tapes cost to maintain, let alone the hours invested in making them all…

      I spent 2016 and 2017 telling my manager that we had no backups whatsoever of a pair of test servers that played a mandatory role in deployments for a multi-million-pound operation. I still think they have no backup, but I left at Christmas so it’s SEP now :)

    2. I had one ‘chance’ to test a Win7 backup– there were two, one made with MS own builtin utility including the whole system image, and partimage compressed block device dumps from Linux. When Windows decided the user registries or directories or whatever were snarfed (“cough, logging you in with a temporary profile”), restoring a MS-made backup didn’t help (surprise) but using the partimages worked fine. One reason nobody *keeps* validating is that there are no extra hard drives or machines to test on, and at least one of them is on 24/7 :( Schroedinger’s backups are still being made… at least there’s a third machine running BackupPC which has a RAID-1 and pulls daily so the files are still around for e.g. laptops if a system eats itself. Testing that backup is as easy as downloading a file and opening it, times a few thousand.

  3. One of the beautiful things about git is that it stores all the history and revisions locally. So even if they DID lose data in the git repositories, almost all of the projects would have an up to date copy of each branch somewhere, either in production or on a developer’s machine, that they could push back up to the server.

    Still no excuse for not testing your backups, and it’s a pain to recover all the branches that way, but at least the data wouldn’t be gone forever.

  4. While this should never have been allowed to happen to a serious online service, humans will make mistakes and will continue to until we are wiped out. It seems they have most of it backed up. A lot of companies would have gone to ground when something like this happened and spun all sorts of stories. I think they are great for admitting it and explaining exactly what went wrong. I know it shouldn’t have happened, but they did the right thing once they messed up.
    I am a web designer and my data is my livelihood. I’ve made some stupid mistakes in the past and had to use backups, but thankfully have never lost any data forever. Sure, there was one time my customers’ sites were offline for nearly 24 hours; I sent out the dreaded mass email being as upfront as I possibly could be. I thankfully didn’t lose a single customer, though I did give free services for 3 months to all affected. When you obfuscate or try to hide things from customers, they lose trust. Just tell them exactly what went wrong. Glad GitLab chose this path.

    1. Well said Jack. I think you outlined the point of the Fail of the Week series. People screw up, and responding appropriately is important. When these stories are told it becomes a way for more people to learn a bit less painfully.

      1. I like Fail of the Week. It does give insight into the different ways people have tried and failed. Focusing solely on success doesn’t teach you much; analysing a failure is where the real learning can be found.

  5. I once had backups that proved ineffective the time my machine’s HDD failed.
    A family member was around one day, and he seems to be Zeus in human form.

    I completed a backup onto two other HDDs. The morning before the failure day, I retested the two HDDs to make sure everything was on them. The family member visited and was interested in the HDDs.
    Him handling them somehow killed them… and I myself have yet to accidentally kill a piece of hardware with static.

    Sometimes you just can’t prepare for every act of God or personal/group stupidity.

  6. This story highlights one of the things I keep telling clients: there is a lot of difference between a hardware failure and something done from software. It doesn’t matter if a software problem is a bad command or a virus; you have to do things to make sure you can survive it. I’ve seen a company that mirrored data across three different sites, and they didn’t understand that this wasn’t backing up until a program on one of them corrupted data that was then happily mirrored around… oops.

    Nobody seems to take backups seriously until they have suffered a major loss… And with hard discs getting bigger – but the error rate per byte staying about the same – and software systems getting more complex, it’s not a matter of IF this is going to happen but WHEN.

    If you only have a small amount of crucial data – like the story above, which only had 300 GB – you should have so many copies of it that the main work is the system that manages your copies :-) More data is harder. My current small company has about 40 TB, which is not that big in the scheme of things, and we keep two live copies (not blindly mirrored!) at two different sites, plus quite a few offline copies. We keep a copy of end of year for the last 5 years, end of quarter for the last 3 years, end of month for the last 6 months, end of day for the last 7 days, and daily changes for the last 3 months… And yes, we test them – for a start, the online ones are read in their entirety with hash checks on a regular basis.

    And I’m still nervous about losing things… :-)
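The tiered retention policy this commenter describes (yearly, quarterly, monthly, daily) can be sketched as a predicate over backup dates. A rough illustration under stated assumptions: the cut-off day counts and the `keep` function name are invented here, not anyone's production policy.

```python
from datetime import date

def keep(backup: date, today: date) -> bool:
    """Decide whether a backup taken on `backup` is still retained on `today`,
    roughly following: daily for 7 days, monthly for ~6 months,
    quarterly for 3 years, yearly for 5 years."""
    age = (today - backup).days
    if age <= 7:                                     # end of day, last 7 days
        return True
    if backup.day == 1 and age <= 183:               # month boundary, ~6 months
        return True
    if backup.day == 1 and backup.month in (1, 4, 7, 10) and age <= 3 * 365:
        return True                                  # quarter boundary, 3 years
    if backup.day == 1 and backup.month == 1 and age <= 5 * 365:
        return True                                  # year boundary, 5 years
    return False

today = date(2017, 2, 1)
print(keep(date(2017, 1, 30), today))   # True  (within the last 7 days)
print(keep(date(2016, 9, 1), today))    # True  (month boundary within ~6 months)
print(keep(date(2015, 3, 15), today))   # False (matches no retention rule)
```

Everything the predicate rejects is a candidate for recycling, which is exactly the kind of rule a backup-managing system would apply automatically rather than by hand.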

  7. What a refreshing story. We can expect to read more of the like in the future — that #DevOps trend really is a big step up, watching a (nodejs|ruby|php|whatever)-developer’s take on ill-designed system infrastructures collapsing in realtime is priceless :)

  8. It is indeed eyerollingly notable that the internet was originally designed to be deliberately able to withstand taking any part of it out and to keep functioning, and now even basic things require goddamn googleapis and such to be up and connectable else the site is broken.

  9. As a long time sysadmin this tickles so many things with me. The first one is picking on many people who put blind faith in the “cloud”. They somehow try and convince themselves that the cloud is magic and everything is thoroughly backed up and redundant and bad things just can not happen. Apparently this is not necessarily always the case (smile). Managers seem to love it because they can pass the buck.

    It tickles another issue that I had first-hand experience with too: picking up the pieces after a disaster. I had my predecessor’s backup system in place. He was a tape saver and did lots of incremental backups. He did save a lot of tape, but getting everything back required finding no less than 6 tapes and praying there were no errors on any of them. This was a terrible system for doing a major restore from, and it took a lot of thinking when you were already very stressed. After that I rolled out a new system that changed a lot of things: full restores depended on one tape and the way back was easy, and “oops” restores from the last backup became self-service for the developers. There were a few other things stirred in as well.

    The last thing is RAID. RAID is not a backup system for files. It is a fallible system to HELP prevent hardware issues from taking a storage system down. It is fallible for a number of reasons. One: you generally get all the disks at the same time, so they are all about the same age and, in theory, they should all start wearing out at around the same time. Even if you have a hot spare, it has been powered on for the same amount of time. Next: all of the disks are typically in the same physical environment, so if your cooling system takes a dump on you, all of the disks in the array get hot; much like them all having the same on-time, they are all exposed to the same environment. The last one is a bit sneakier yet. About the most stressful thing you can do to the disks in an array is replace one member of it. That starts all the other disks clicking through sector by sector, computing the missing data to write to the new disk. However, the other disks do not get to do this in a nice orderly manner, as they are also being hit on by users, so the heads are just whacking around like crazy. Watch the activity light sometime when you rebuild an array.

    BTW, what got me was a combination of two things. Our main chiller failed and the backup chiller popped its breaker coming on. The over-temp alarm was part of the backup chiller. Oops. The aftermath was one disk, and the real pisser was that the disk was not even dead; it was in predicted-failure mode, but the system had mapped it out. Being the dutiful admin, I got a new disk ordered and had it the next day. So far so good. Now recall that comment about rebuilding the array being stressful? I got weird and unexpected emails that night about the system being down, when at worst it should have just been limping along shy one disk and rebuilding. Sadly, the rebuilding stressed another disk that had also gotten really hot when the chiller failed, and that one went into hard failure. I had two dead disks. So much for RAID saving the day. The long and short of it is that RAID is a tool, but it is far from guaranteed.

  10. The database wasn’t logging what was deleted or changed? Ours does.

    Always do a “select count(*) from x where …” to see how many rows you will be deleting before running the delete statement.
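That count-before-delete habit is easy to demonstrate. A minimal sketch using an in-memory SQLite database; the table and column names here are invented for illustration, nothing to do with GitLab's actual schema.

```python
import sqlite3

# Throwaway in-memory database with a made-up table of "issues".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id INTEGER, spam INTEGER)")
conn.executemany("INSERT INTO issues VALUES (?, ?)",
                 [(1, 0), (2, 1), (3, 1)])

where = "spam = 1"

# First see how many rows the predicate matches...
(count,) = conn.execute(
    "SELECT COUNT(*) FROM issues WHERE " + where).fetchone()
print(count)        # prints 2 -- sanity-check the blast radius first

# ...then run the delete with the very same predicate.
conn.execute("DELETE FROM issues WHERE " + where)
(remaining,) = conn.execute("SELECT COUNT(*) FROM issues").fetchone()
print(remaining)    # prints 1
```

Reusing the identical WHERE clause for the count and the delete is the whole trick: if the count surprises you, you stop before any rows are gone.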
