Has work been a little stressful this week, are things getting you down? Spare a thought for an unnamed sysadmin at the GitHub-alike startup GitLab, who early yesterday performed a deletion task on a PostgreSQL database in response to some problems they were having in the wake of an attack by spammers. Unfortunately due to a command line error he ran the deletion on one of the databases behind the company’s main service, forcing it to be taken down. By the time the deletion was stopped, only 4.5 Gb of the 300 Gb trove of data remained.
Reading their log of the incident the scale of the disaster unfolds, and we can’t help wincing at the phrase “out of 5 backup/replication techniques deployed none are working reliably or set up in the first place“. In the end they were able to restore most of the data from a staging server, but at the cost of a lost six hours of issues and merge requests. Fortunately for them their git repositories were not affected.
For 707 GitLab users then there has been a small amount of lost data, the entire web service was down for a while, and the incident has gained them more publicity in a day than their marketing department could have achieved in a year. The post-mortem document makes for a fascinating read, and will probably leave more than one reader nervously thinking about the integrity of whichever services they are responsible for. We have to hand it to them for being so open about it all and for admitting a failure of their whole company for its backup failures rather than heaping blame on one employee. In many companies it would all have been swept under the carpet. We suspect that GitLab’s data will be shepherded with much more care henceforth.
We trust an increasing amount of our assets to online providers these days, and this tale highlights some of the hazards inherent in placing absolute trust in them. GitLab had moved from a cloud provider to their own data centre, though whether or not this incident would have been any less harmful wherever it was hosted is up for debate. Perhaps it’s a timely reminder to us all: keep your own backups, and most importantly: test them to ensure they work.
Thanks [Jack Laidlaw] for the tip.
Rack server image: Trique303 [CC BY-SA 4.0], via Wikimedia Commons.