Fail Of The Week: GitLab Goes Down

Has work been a little stressful this week? Are things getting you down? Spare a thought for an unnamed sysadmin at the GitHub-alike startup GitLab, who early yesterday performed a deletion task on a PostgreSQL database in response to some problems they were having in the wake of an attack by spammers. Unfortunately, due to a command line error, he ran the deletion on one of the databases behind the company’s main service, forcing it to be taken down. By the time the deletion was stopped, only 4.5 GB of the 300 GB trove of data remained.

Reading their log of the incident, the scale of the disaster unfolds, and we can’t help wincing at the phrase “out of 5 backup/replication techniques deployed none are working reliably or set up in the first place”. In the end they were able to restore most of the data from a staging server, but at the cost of six hours of lost issues and merge requests. Fortunately for them, their git repositories were not affected.

For 707 GitLab users, then, there has been a small amount of lost data, the entire web service was down for a while, and the incident has gained them more publicity in a day than their marketing department could have achieved in a year. The post-mortem document makes for a fascinating read, and will probably leave more than one reader nervously thinking about the integrity of whichever services they are responsible for. We have to hand it to them for being so open about it all, and for admitting that the backup failures were a failure of the whole company rather than heaping the blame on one employee. In many companies it would all have been swept under the carpet. We suspect that GitLab’s data will be shepherded with much more care henceforth.

We trust an increasing amount of our assets to online providers these days, and this tale highlights some of the hazards inherent in placing absolute trust in them. GitLab had moved from a cloud provider to their own data centre, though whether this incident would have been any less harmful had it been hosted elsewhere is up for debate. Perhaps it’s a timely reminder to us all: keep your own backups, and most importantly, test them to ensure they work.
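
A restore test doesn’t need to be elaborate to be worth something. As a rough sketch only (the paths, and the idea of checking a file-level copy with checksums, are our own illustration rather than anything GitLab does), a few lines of Python can walk a backup and flag anything missing or corrupted:

    import hashlib
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Return the SHA-256 digest of a file, read in 1 MB chunks."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_backup(live_root: Path, backup_root: Path) -> list[Path]:
        """Compare every file under live_root with its copy under backup_root,
        returning the files that are missing from, or differ in, the backup."""
        bad = []
        for live_file in live_root.rglob("*"):
            if not live_file.is_file():
                continue
            backup_file = backup_root / live_file.relative_to(live_root)
            if not backup_file.is_file() or sha256(live_file) != sha256(backup_file):
                bad.append(live_file)
        return bad

    # Hypothetical paths; point these at your own data and backup mount.
    problems = verify_backup(Path("/srv/data"), Path("/mnt/backup/data"))
    print("backup OK" if not problems else f"{len(problems)} files failed verification")

Comparing against a live, changing tree will of course flag files modified since the last backup run, so in practice you would check against a snapshot; the point is simply that a backup nobody has ever restored from is a hope, not a backup.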

Thanks [Jack Laidlaw] for the tip.

Rack server image: Trique303 [CC BY-SA 4.0], via Wikimedia Commons.

57 thoughts on “Fail Of The Week: GitLab Goes Down”

  1. It is the one single universal truth of all EDP for as long as it has been around: There are two types of folks, those that keep good backups, and those that will one day wish they had.

        1. On the balance of probability, most servers shipped with RAID will have disks from the same batch. RAID is not a backup. Backups are backups. RAIDs are RAIDs. RAIDs fail. Monitoring systems for RAIDs fail. Backups fail. You can never have too many backups. Back up your RAID, back up your backups, back up the backups to your backups. One more point… Don’t leave old data tapes out with the refuse… either keep them forever, or destroy them with fire. Hackers rake bins (not that I would ever do such a thing, of course).

          1. I have been responsible for backups for many years for my last employer, and was keeping two sets in different safes in two different buildings 2 kilometers apart. I don’t know how many times I saved users who had deleted documents by mistake.

          2. RAID can be used as a backup, if you do it correctly.

            Imagine you want a daily backup for a week back in time, and a monthly backup for 12 months back in time.

            Then buy a RAID-1 cage with 4 slots, and then buy 21 drives.

            The RAID controller must be hot-plug capable and store its configuration on itself (it shouldn’t need to store anything on the drives to keep up the RAID, and the disks must be readable without the RAID controller), and it must assume that a drive inserted into a valid RAID-1 set should be overwritten.

            7 of the 21 drives you mark Monday, Tuesday, Wednesday and so on.
            12 of the 14 remaining drives, you mark January, February, March, April and so on.

            —–

            Now to start the backup strategy:
            First you put the 2 unmarked drives into the cage, then the drive for the current day, and the drive for the current month.
            Install your system and carry on.

            Next day, you pull out the drive marked with yesterday’s day, and put in the drive marked for today. Wait until the RAID controller has copied the set to today’s drive.
            If the month changes, you do the same, but ONLY after the day’s drive has finished copying.

            All drives not inserted in the RAID set, you store in a fireproof safe or something similar.
            Then you simply rotate around the drives as indicated.

            —–

            Imagine a disaster happens and you need to restore to a prior point in time.
            Now pull OUT all the drives in the RAID set, so it’s empty. (This means pulling out both unmarked disks, today’s disk and the current month’s disk.)
            Insert the disk for the day or month you want to restore to. Let’s say you want to restore from last February’s backup; then you insert that disk. Now wait until the RAID set declares this February disk as “master”.
            Now insert the disk for today. It will now be overwritten with “February” data.
            When it’s done, insert ONE of the unmarked disks. Wait until this is overwritten too.
            When it’s done, insert the last unmarked disk. Wait until this is overwritten.
            Now pull out the February disk, and insert the current month’s disk. The current month’s disk will now be overwritten with February data.

            Voilà, you have restored to February. Now you can carry on as usual.
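
            If it helps to see the rotation written down, here is a minimal Python sketch of the scheme described above (the drive labels are just the ones suggested; nothing here talks to a real controller) showing which four drives belong in the cage on a given day:

                from datetime import date

                def drives_in_cage(today: date) -> list[str]:
                    # The two unmarked drives stay in the cage during normal operation;
                    # the other two slots rotate by weekday and by month.
                    day_label = today.strftime("%A")     # one of the 7 drives marked Monday..Sunday
                    month_label = today.strftime("%B")   # one of the 12 drives marked January..December
                    return ["unmarked-1", "unmarked-2", day_label, month_label]

                print(drives_in_cage(date(2017, 2, 1)))
                # ['unmarked-1', 'unmarked-2', 'Wednesday', 'February']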

          3. re “Don’t leave old data tapes with the refuse”: Please encrypt your backups. There are so many possible situations where the person that can properly handle the backup medium isn’t present to advise on the backup’s disposal or transport.

          4. I disagree with this new trend of saying that RAID is not a backup (RAID 1). You’ve got your data on 2 different disks, how is that not a backup? Sure, you’ll lose everything if the partition gets corrupted or if you accidentally delete a file. But that just means that this particular backup solution, like others, has a vulnerability. Copying your data on a separate computer is considered a backup, but if a surge busts both computers then you’re out of luck. Does it mean it’s not a backup? RAID is a backup, but just like every other backup solution, it shouldn’t be used alone. It also depends on your level of comfort. I have some sandbox VMs that I run off the RAID. I do that because I wanna protect myself against HDD failures. But I don’t find that these images are important enough to copy on a second computer. I would even consider copying data in 2 different folders on the same partition to be a backup. A risky backup, but depending on the importance of that data, it might be just enough. Copying on RAID 1 is still better than nothing at all anyway. And the day that one drive fails and you recover the data from the remaining drive, you will say: “Damn, I’m happy to have a backup!”

  2. “out of 5 backup/replication techniques deployed .. set up in the first place”.
    I didn’t realise not doing something counted as deployment? This is great news; I always fancied a tape drive but couldn’t justify the cost, and now I can simply ‘deploy’ one for free.
    It’s great that they are open and transparent about it? Not like they had much choice. Perhaps they should have been more open and transparent about their data integrity to begin with.

    1. Can’t justify the cost means your data is completely worthless to you. Honestly, if your data has any value at all, you will justify the cost of a backup system.

      I had an executive use the same words you did; I responded with, “So if I delete the whole accounting database right now you won’t care, as it’s not worth backing it up?”

      Suddenly he decided spending money on a backup solution was justifiable.

      1. +100 to that.
        Backup frequently, often and to multiple locations. Assume the jumbo jet scenario and plan for it. The jumbo jet scenario is where your offices and the local data centre both get wiped out by the same crashing jumbo jet and you have to start from scratch… Do you have backups… do you have kit to deploy them to, do you have kit to read them from? If not… you don’t have a snowball’s chance in hell of recovering.
        You can never have too many backups.
        One customer years ago did everything right… backups on site, backups off site, backups in the local bank vault. Floods came, the bank vault flooded, the sysadmin’s car (with the most recent backup) washed away, the offices were trashed and all the kit ruined… Paper invoice reports were all that was recoverable initially, and they were dried off carefully on clothes lines, believe it or not. Much data recovery work on stinky hard disks and knackered tapes got most of it back… their insurance company paid for a lot of the work. This was before the days of cloud storage…
        Backup.. Backup the backup… and backup the backup to the backup.

      2. Good restore hygiene (testing restore under duress to potentially degraded targets) will always beat good backup hygiene!
        Too commonly heard – “we had good backups but…..”

    1. I’ve spent a lot of money on backup software only to find out it’s not worth anything; it is crap.
      I have two sets of hard drives: one is my backup and the other is my running disks.
      I no longer use any software to do my main backups. I do it by hand, and it takes about 3 to 4 weeks to copy and
      verify 50 terabytes of my stuff. I have a running backup that I keep going until I do my main, and I do a main backup 2 times a year, then rotate my hard drives and pack away my new backup.
      And yes, I have lost a lot of data over the years, and now I have saved a lot as well.

    1. An interesting assertion. I never use -rf. I will first try rmdir, and if it doesn’t work (and I know I’m deleting the correct directory) I’ll do a “sudo rm -r [dirname]”.

      Is this bad? I’m assuming subdirectories inside of subdirectories. Should I be stepping into the victim dir and running rm -r from there?

        1. McDonalds has a thing on their receipts where you can do an online survey (up to 5x a month) to get a free Quarter Pounder With Cheese. For quite a while it was on the receipts as a QPC with Cheese. I asked if that meant it came with double the cheese.

          Dairy Queen has a couple of lunchtime meals for five dollars, but it’s on their menu boards as $5 Buck Lunch. What? Five Dollar Buck Lunch?

          Sometimes people just fail to think a thing through.

    1. We noticed an error in the script used in a company’s backup strategy and had to go back through years of LTO tapes, after getting access to the tape room, before we found one with an actual backup: one with data that would read, and for which we still had the hardware available to access it. The sysadmins had been religiously following the backup strategy document for years, only at no point did it ever mention checking the contents or trying a restore.

      Thankfully not during a disaster rebuild, or major flying things would have hit the fan, given how much a secure offsite tape room and that many tapes take to produce, let alone the hours invested in taking them all…

      I spent 2016 and 2017 telling my then manager that we had no backups whatsoever of a pair of test servers that played a mandatory role in deployments for a multi-million-pound operation. I still think they have no backup, but I left at Xmas so it’s SEP now :)

    2. I had one ‘chance’ to test a Win7 backup: there were two, one made with MS’s own built-in utility including the whole system image, and partimage compressed block device dumps from Linux. When Windows decided the user registries or directories or whatever were snarfed (“cough, logging you in with a temporary profile”), restoring an MS-made backup didn’t help (surprise) but using the partimages worked fine. One reason nobody *keeps* validating is that there are no extra hard drives or machines to test on, and at least one of them is on 24/7 :( Schroedinger’s backups are still being made… at least there’s a third machine running BackupPC which has a RAID-1 and pulls daily, so the files are still around for e.g. laptops if a system eats itself. Testing that backup is as easy as downloading a file and opening it, times a few thousand.

  3. One of the beautiful things about git is that it stores all the history and revisions locally. So even if they DID lose data in the git repositories, almost all of the projects would have an up to date copy of each branch somewhere, either in production or on a developer’s machine, that they could push back up to the server.

    Still no excuse for not testing your backups, and it’s a pain to recover all the branches that way, but at least the data wouldn’t be gone forever.

  4. While this should never have been allowed to happen if you are a serious online service, humans will make mistakes and will continue to until we are wiped out. It seems they have most of it backed up. A lot of companies would have gone to ground when something like this happens and spun all sorts of stories. I think they are great for admitting it and explaining exactly what went wrong. I know it shouldn’t have happened, but they did the right thing once they messed up.
    I am a web designer and my data is my livelihood. I’ve made some stupid mistakes in the past and had to use backups, but thankfully have never lost any data forever. Sure, there was one time my customers’ sites were offline for nearly 24 hours; I sent out the dreaded mass email, being as upfront as I possibly could be. I thankfully didn’t lose a single customer, and I did give free services for 3 months to all affected. When you obfuscate or try to hide things from customers they lose trust. Just tell them exactly what went wrong. Glad GitLab chose this path.

    1. Well said Jack. I think you outlined the point of the Fail of the Week series. People screw up, and responding appropriately is important. When these stories are told it becomes a way for more people to learn a bit less painfully.

      1. I like Fail of the Week; it does give insight into different ways people have tried and failed. Focusing solely on success doesn’t teach you much. Analysing a failure is where the real learning can be found.

  5. I once had backups that lay ineffective when my machine’s HDD failed.
    A family member was around one day and he seems to be Zeus in human form.

    I completed a backup onto two other HDDs. The morning before the failure day, I retested the two HDDs to make sure everything was on them. The family member visited and was interested in the HDDs.
    Him handling them somehow killed them…. And I (myself) have yet to accidentally statically kill some piece of hardware.

    Sometimes you just can’t prepare for every act of God or personal/group stupidity.

  6. This story highlights one of the things I keep telling clients: there is a lot of difference between a hardware failure and something done from software. It doesn’t matter if a software problem is a bad command or a virus, you have to do things to make sure you can survive it. For instance, I’ve seen a company that mirrored data across three different sites, and they didn’t understand that this wasn’t backing up until a program on one of them corrupted data that was then happily mirrored around… oops.

    Nobody seems to take backups seriously until they have suffered a major loss… And with hard discs getting bigger – but the error rate per byte staying about the same – and software systems getting more complex, it’s not a matter of IF this is going to happen but WHEN.

    If you only have a small amount of crucial data – like the story above, which only had 300 GB – you should have so many copies of it that the main work is the system that manages your copies :-) More data is harder; my current small company has about 40 TB, which is not that big in the scheme of things, and we keep two live copies (not blindly mirrored!) at two different sites, and quite a few offline copies. We keep a copy of end of year for the last 5 years, end of quarter for the last 3 years, end of month for the last 6 months, end of day for the last 7 days, and daily changes for the last 3 months (a rough sketch of that retention check follows below)… And yes, we test them – for a start the online ones are read in their entirety with hash checks on a regular basis…

    And I’m still nervous about losing things… :-)
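
    To make that retention schedule concrete, here is a minimal Python sketch using the same cut-offs (the thresholds mirror the comment above; the “daily changes” incrementals are left out, and everything else is illustrative) that decides whether a dated full backup is still worth keeping:

        from datetime import date, timedelta

        def should_keep(backup_date: date, today: date) -> bool:
            """Retention check: yearly for 5 years, quarterly for 3 years,
            monthly for 6 months, daily for 7 days."""
            age_days = (today - backup_date).days
            is_month_end = (backup_date + timedelta(days=1)).month != backup_date.month
            is_quarter_end = is_month_end and backup_date.month in (3, 6, 9, 12)
            is_year_end = backup_date.month == 12 and backup_date.day == 31

            if age_days <= 7:                           # end of day, last 7 days
                return True
            if is_month_end and age_days <= 183:        # end of month, last ~6 months
                return True
            if is_quarter_end and age_days <= 3 * 365:  # end of quarter, last 3 years
                return True
            if is_year_end and age_days <= 5 * 365:     # end of year, last 5 years
                return True
            return False

        print(should_keep(date(2016, 12, 31), date(2017, 2, 1)))  # True: a recent year end
        print(should_keep(date(2016, 11, 15), date(2017, 2, 1)))  # False: mid-month, older than 7 days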

  7. What a refreshing story. We can expect to read more of the like in the future — that #DevOps trend really is a big step up, watching a (nodejs|ruby|php|whatever)-developer’s take on ill-designed system infrastructures collapsing in realtime is priceless :)

  8. It is indeed eyerollingly notable that the internet was originally designed to be deliberately able to withstand taking any part of it out and to keep functioning, and now even basic things require goddamn googleapis and such to be up and connectable, else the site is broken.

  9. As a long time sysadmin this tickles so many things with me. The first one is picking on many people who put blind faith in the “cloud”. They somehow try and convince themselves that the cloud is magic and everything is thoroughly backed up and redundant and bad things just can not happen. Apparently this is not necessarily always the case (smile). Managers seem to love it because they can pass the buck.

    It tickles another issue that I had first-hand experience with too: picking up the pieces after a disaster. I had my predecessor’s backup system in place. He was a tape saver and did lots of incremental backups. He did save a lot of tape. But to get everything back required finding no less than 6 tapes, and praying there were no errors on any of them. This was a terrible system for doing a major restore from, and it took a lot of thinking when you were already very stressed. After that I rolled out a new system that changed a lot of things. Full restores depended on one tape, and the way back was easy. It also made “oops” restores from the last backup self-service for the developers. There were a few other things stirred in as well.

    The last thing is RAID. RAID is not a backup system for files. It is a fallible system to HELP prevent hardware issues from taking a storage system down. It is fallible for a number of reasons. One, you generally get all the disks at the same time, so they are all about the same age, so in theory they should all start wearing out at around the same time. Even if you have a hot backup, that has been on for the same amount of time. Next, all of the disks are typically in the same physical environment. So if your cooling system, for example, takes a dump on you, all of the disks in the array get hot; much like them all having the same on time, they are exposed to the same environment. The last one is a bit sneakier yet. About the most stressful thing you can do to the disks in an array is replace one member of it. That starts all the other disks clicking through sector by sector and computing the missing data to write to the new disk. However, the other disks do not get to do this in a nice orderly manner, as they are also being hit on by users, so the heads are just whacking around like crazy. Watch the activity light sometime when you rebuild an array.

    BTW, what got me was a combination of two things. Our main chiller failed and the backup chiller popped its breaker coming on. The over-temp alarm was part of the backup chiller. Oops. The aftermath was one disk, and the real pisser was the disk was not even dead, it was in predicted failure mode, but the system had mapped it out. Being the dutiful admin, I got a new disk ordered and had it the next day. So far so good. Now recall that comment about rebuilding the array being stressful? I got weird and unexpected emails that night about the system being down, when at worst it should have just been limping along shy one disk and rebuilding. Sadly, the rebuilding stressed another disk that had also gotten really hot when the chiller failed, and that one went into hard failure. I had two dead disks. So much for RAID saving the day. So the long and short is RAID is a tool, but it is far from guaranteed.

  10. The database wasn’t logging what was deleted or changed? Ours does.

    Always do a SELECT COUNT(*) FROM x WHERE … to see how many rows you will be deleting before running the DELETE statement.
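
    In a script, the same habit can be enforced by refusing to run the DELETE unless the count looks sane. A rough sketch using psycopg2 (the connection string, table, predicate and threshold are placeholders, not anything from GitLab’s setup):

        import psycopg2

        # Placeholder connection string and predicate; adjust for your own schema.
        conn = psycopg2.connect("dbname=app user=app")
        where_sql = "created_at < %s"
        params = ("2017-01-01",)

        with conn:  # commits on success, rolls back if anything raises
            with conn.cursor() as cur:
                # Count first, with exactly the same WHERE clause...
                cur.execute("SELECT count(*) FROM events WHERE " + where_sql, params)
                (n,) = cur.fetchone()
                print(f"about to delete {n} rows")
                if n > 10_000:  # arbitrary sanity threshold for this sketch
                    raise SystemExit("that looks like too many rows, aborting")
                # ...then delete inside the same transaction.
                cur.execute("DELETE FROM events WHERE " + where_sql, params)
        conn.close()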
