How A Storage Company Builds Their Own

September 4, 2009

Want 67 terabytes of local storage? That’ll be $7,867, but only if you build it yourself. Backblaze sells online storage, but when setting up their company they found the only economical way was to build their own storage pods. Lucky for us, they followed the lead of other companies and decided to share how they built their own storage farm using some custom, some consumer, and some open source components.

Each pod is a standalone HTTPS-connected storage unit with 45 hard drives in it. Nine SATA port expanders connect to four SATA controller cards on the mainboard. The system boots from a 46th hard drive into 64-bit Debian. The drives run RAID 6 and use the Journaled File System (JFS). Our first thought when reading this was about the heat generated by those drives. A custom case houses all of this hardware and includes six big fans to take care of the cooling.
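For a rough sense of the numbers, here is a small Python sketch of the capacity and cost arithmetic. The drive count and pod price come from the post. The 1.5 TB drive size and the layout of three 15-drive RAID 6 arrays are taken from the reader comments and are assumptions here, as are all the names in the code.

```python
# Figures from the post: 45 data drives and $7,867 per pod.
# Assumptions (taken from the comments, not confirmed by the post):
# 1.5 TB drives, arranged as three 15-drive RAID 6 arrays.
DRIVES = 45
DRIVE_TB = 1.5
POD_COST_USD = 7867
ARRAYS = 3
PARITY_PER_ARRAY = 2  # RAID 6 spends two drives' worth of space on parity


def pod_capacity_tb(drives=DRIVES, drive_tb=DRIVE_TB,
                    arrays=ARRAYS, parity=PARITY_PER_ARRAY):
    """Return (raw_tb, usable_tb), ignoring filesystem overhead."""
    raw = drives * drive_tb
    usable = (drives - arrays * parity) * drive_tb
    return raw, usable


raw_tb, usable_tb = pod_capacity_tb()
print(f"raw capacity: {raw_tb} TB, usable after parity: {usable_tb} TB")
print(f"cost per raw TB: ${POD_COST_USD / raw_tb:.0f}")
print(f"cost per usable TB: ${POD_COST_USD / usable_tb:.0f}")
```

Under these assumptions the parity drives take the usable space noticeably below the headline figure, so the cost per usable terabyte is somewhat higher than the cost per raw terabyte.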

[Thanks Dave]

28 thoughts on “How A Storage Company Builds Their Own”

Hackius says:

September 4, 2009 at 11:20 am

I did the exact same thing for a company on a contract on a slightly smaller scale (20 TB/unit). They needed all that storage for huge video files.

It’s simple really but it’s an elegant writeup.

Hackius says:

September 4, 2009 at 11:22 am

BTW what software did they use to make these swanky diagrams:
http://blog.backblaze.com/wp-content/uploads/2009/08/backblaze-storage-pod-power-wiring-diagram.jpg

I’d like to make diagrams like that too.

louis ii says:

September 4, 2009 at 11:23 am

Yeah, very simple but good writeup.

Schreiaj says:

September 4, 2009 at 11:24 am

http://www.c0t0d0s0.org/archives/5899-some-perspective-to-this-diy-storage-server-mentioned-at-storagemojo.html

For another perspective on this.

BRANKKO says:

September 4, 2009 at 11:27 am

This is really cool, but what about electricity?

Hackius says:

September 4, 2009 at 11:33 am

The Seagate ST31500341AS seems to have a huge failure rate. People seem to be going through one a month. Even with RAID 6, is this an acceptable failure rate?

bob says:

September 4, 2009 at 11:34 am

This setup is getting slated around the net. Please don’t use this as a good example of how to do storage, there are several pretty major flaws. It’s a nice writeup though.

colecoman1982 says:

September 4, 2009 at 12:02 pm

@bob: Thanks for the input, but it’s “put up or shut up” time. I don’t know you, and I doubt many of the other people here do either. If you want anyone to take what you have to say seriously then you should be able to give, at least some, examples of what “flaws” you’re referring to. Otherwise, we have to assume that, like the majority of people that post on internet message boards, you’re just talking out your rectum and don’t have a clue.

anon says:

September 4, 2009 at 12:08 pm

major flaws not worth mentioning obviously.

daryl says:

September 4, 2009 at 12:14 pm

I think most of the negative comments on this solution are due to the lack of access to the software that runs the solution. The redundancy isn’t in the hardware (each red box is essentially designed to be throw away as I understand the solution). The beauty of this solution is in the software (which no one has access to, and they don’t really want to talk about – naturally, because that is their business edge) and how that software doesn’t require huge, complex redundant hardware systems.

When you can build the hardware for next to nothing compared to commercial hardware storage solutions, who cares if you lose an entire rack of these boxes, because they were so cheap to build, they have 6 other copies of the data (made up number) in other racks and data centers. I’ve read some comments trashing speed, but when you realize that most people don’t have a fat enough pipe to strain these servers, that’s not an issue either.

My assumption is that these boxes are seen as expensive 67TB hard drives in themselves. Cheap interface board (mobo, cpu, ram, boot drive) and high capacity in the hardware were the goal of the project, allowing the maintainers to write the redundancy into the software end and not worry about a higher downtime per box, because the box itself isn’t critical.

*** I could be completely wrong. I’m just saying take another look at it and recognize that none of us have the whole picture due to the missing software component. I wouldn’t build one of these for personal use or for my business. ***

phoenix says:

September 4, 2009 at 12:17 pm

I wonder why they went with three two-port and one four-port SATA controllers instead of three four-port controllers. To save $35?

Also, what was the development cost of the “custom Backblaze application layer logic”?

daryl says:

September 4, 2009 at 12:20 pm

@phoenix

Could it be PCI bus speeds? Better I/O per RAID array via more channels? I dunno on that one. Good question. I do know they have nine backplanes, and this way they only have one extra port instead of three extra SATA ports. No real logic there, just pointing it out.

redbeard says:

September 4, 2009 at 1:57 pm

There are a number of points I’d like to address. Now, there are also solutions here (unlike a lot of comments here). Please, in the truest sense of research and commentary, provide feedback. If I’m chasing this down the wrong path, please let me know.

So each of these nodes consists of 46 disks: one for the core operating system, and 45 disks provisioned in three 15-disk RAID 6 arrays. One obvious point of failure is the operating system disk, but losing it would be very unlikely to lead to total data loss, so we can rule that out.

Additionally, let’s analyze the problems from the perspective of a single 15-disk RAID 6 array. A failure in one array does not make a failure in the other two more likely, so let’s start there.

They’re using 1.5 TB drives in their arrays, thus providing approx 19.5TB in each RAID 6 array. Important metrics to note:
* These disks have an MTBF of 750,000 hours. (With 15 disks in the array, that’s 15 disk-hours for every clock hour)
* The MTBF is for average use, not greater than average activity
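To put the MTBF figure in more familiar terms, the sketch below converts it to an annual failure rate and an expected number of drive failures per year. It assumes a constant failure rate (an exponential lifetime model), which is a simplification, and it takes the 750,000-hour datasheet figure at face value. The function names are made up for illustration.

```python
import math

MTBF_HOURS = 750_000          # datasheet figure quoted above
HOURS_PER_YEAR = 24 * 365


def annual_failure_rate(mtbf_hours):
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(n_drives, mtbf_hours):
    """Expected number of drive failures per year across n_drives."""
    return n_drives * HOURS_PER_YEAR / mtbf_hours


afr = annual_failure_rate(MTBF_HOURS)
print(f"per-drive annual failure rate: {afr:.2%}")
print(f"expected failures per year, 15-disk array: "
      f"{expected_failures_per_year(15, MTBF_HOURS):.2f}")
print(f"expected failures per year, 45-disk pod: "
      f"{expected_failures_per_year(45, MTBF_HOURS):.2f}")
```

Under this model a 45-disk pod would see roughly one failed drive every two years. Rates observed in the field can differ from datasheet numbers, so treat this as a lower-bound style estimate rather than a prediction.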

The note about increased activity is important. When a drive fails, you need to read from every remaining disk in order to recover the parity information and rebuild the failed drive. Doing this while the array is in use causes extra head seeks, leading to an even higher than “normal” amount of activity and increasing the chance of failure.

More insidious is the chance for bit error issues.

On a SATA disk, the unrecoverable read error rate is 1 bit in 10^14. Problem is, a single 1.5 TB disk holds 1.31941395 × 10^13 bits, and a rebuild has to read many disks. If there is a problem with one of the disks, we will need to read 13 disks in order to produce the replacement data. This means, given the formula:

Percentage_of_Loss = ( (Recovery_Disks * Disk_Capacity) / Error_Rate ) * 100

And the values:
Recovery Disks = 13
Disk Capacity = 1.32 * 10^13 bits
Error Rate = 10^14 bits

We get roughly 170%, meaning more than one unreadable block should be expected during a single rebuild.

So, where to go from here? Well, first, we need a better filesystem. A copy-on-write filesystem with checksumming alleviates these problems. On FreeBSD or Solaris, that would be ZFS. On Linux we’re looking at BTRFS or NILFS.

A hardware RAID controller will do little to take care of this problem. While it can do checksumming of the disk and periodic health checks, it doesn’t change the fact that, due to the size of disks now, there WILL be errors. We need to move to a method of compensating for this by design.

redbeard says:

September 4, 2009 at 1:57 pm

Apparently, it won’t let me post my bibliography, so let’s try it with broken links:

Bibliography:
hxtp://db.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
hxtp://research.google.com/archive/disk_failures.pdf
hxtp://permabit.wordpress.com/2008/08/20/are-fibre-channel-and-scsi-drives-more-reliable
hxtp://www.eweek.com/c/a/Data-Storage/Hard-Disk-MTBF-Flap-or-Farce
hxtp://blog.econtech.selfip.org/2008/09/56-chance-your-hard-drive-is-not-fully-readable-a-lawsuit-in-the-making

M4CGYV3R says:

September 4, 2009 at 2:21 pm

I’m totally making one. That rules. I’ll never need a new hard drive again.

bob says:

September 4, 2009 at 2:48 pm

@colecoman1982 – I was going to post a link but didn’t in the end because out of the several places I’ve seen criticisms, no single post summed up everything flawlessly. Also the link that was posted 2 or 3 comments up did cover most things.

I guess a lot of storage experts are a bit annoyed because they’ve been making cheap storage units without the noobular errors for some time but obviously you only get attention with a shiny red paintjob and some nice exploded diagrams. It is a shame that doing it properly is too boring.

blerik says:

September 4, 2009 at 3:20 pm

How about this for a major flaw: when the primary power supply croaks, you really do not know what happened to the data you wrote in the last second. No RAID 6 is gonna fix that. There is no battery-backed RAM here. Oh, and when the secondary power supply croaks, you mess up a full set of 15 disks since half of them lose power (and RAID 6 cannot fix that either).

Why they are even using RAID 6 on the boxes is a mystery to me. The only way to make this setup work is replicating the data across multiple units in multiple racks so they don’t take power from the same bus.

vikki says:

September 4, 2009 at 3:46 pm

To M4CGYV3R: untrue. They are using conventional drives, not solid state, and with data storage traffic they will degrade over time. Of course solid state would about quadruple your total cost, but so far there is no long-term, permanent, foolproof storage option for data. Vigilance is always going to be a necessary ingredient.

redbeard says:

September 4, 2009 at 4:01 pm

Whoops. Math snafu. The failure progression is not linear, but the idea still holds true. More disks, more bits, more failures.
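One way to sketch the non-linear version in Python: when reading N bits with a per-bit error probability p, the chance of hitting at least one unreadable bit is 1 - (1 - p)^N rather than N * p. The code below assumes 1.5 TB means 1.5 × 10^12 bytes (1.2 × 10^13 bits) per drive, 13 drives read in full during a rebuild, an error rate of 1 in 10^14 bits, and independent errors. These are modelling assumptions, and the names are illustrative only.

```python
import math

BITS_PER_DRIVE = 1.5e12 * 8   # assumption: 1.5 TB in decimal units
URE_RATE = 1e-14              # one unrecoverable read error per 10^14 bits
DRIVES_READ = 13              # drives read in full to rebuild one failed drive


def p_at_least_one_error(bits_read, ure_rate=URE_RATE):
    """Probability of one or more read errors, assuming independent bit errors."""
    # 1 - (1 - p)^N, written with log1p/expm1 so tiny p does not round away.
    return -math.expm1(bits_read * math.log1p(-ure_rate))


bits_read = DRIVES_READ * BITS_PER_DRIVE
expected_errors = bits_read * URE_RATE
p_error = p_at_least_one_error(bits_read)

print(f"bits read during rebuild: {bits_read:.3g}")
print(f"expected read errors: {expected_errors:.2f}")
print(f"chance of at least one read error: {p_error:.0%}")
```

The linear formula gives the expected number of errors, which can exceed 1, while the probability stays below 100%. Under these assumptions it is still high enough to support the point that bigger arrays of bigger disks make read errors during a rebuild likely.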

fcisler says:

September 4, 2009 at 11:29 pm

I wonder what the price difference is between using these cards versus a couple nice 3ware cards.

The newer 3ware can span multiple arrays over multiple controllers – and has a 16 port model (IIRC). Performance would be significantly better.

fra5 says:

September 5, 2009 at 4:42 am

I wonder what temperature the hard drives operate at? They seem to be mounted very close to each other. And how does temperature relate to failure rate?

b0red says:

September 5, 2009 at 3:31 pm

I see comments and an article below that addresses the low speed of the system, how it is limited by the bus, the board, the ram, etc etc.

This system is meant to be connected to the internet… via ethernet…. the best it would ever see in speed is 1 Gbit, the max the ethernet card can provide….

What is the disadvantage in speed then in the sata bus or whatever… if your major bottleneck is the actual ethernet card?
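As a back-of-the-envelope check on the Ethernet bottleneck, this Python sketch estimates how long it would take to move a full pod's worth of data over a single 1 Gbit/s link. It assumes 67 TB means 67 × 10^12 bytes and that the link runs at full line rate with no protocol overhead, so real transfers would take longer. The names are illustrative.

```python
POD_BYTES = 67e12              # assumption: 67 TB in decimal units
LINK_BITS_PER_SECOND = 1e9     # 1 Gbit/s Ethernet at ideal line rate
SECONDS_PER_DAY = 86_400


def fill_time_days(capacity_bytes, link_bps):
    """Days needed to transfer capacity_bytes over a link of link_bps."""
    return capacity_bytes * 8 / link_bps / SECONDS_PER_DAY


link_mb_per_s = LINK_BITS_PER_SECOND / 8 / 1e6
days = fill_time_days(POD_BYTES, LINK_BITS_PER_SECOND)

print(f"link throughput: {link_mb_per_s:.0f} MB/s")
print(f"time to move a full pod: {days:.1f} days")
```

At about 125 MB/s for the whole box, the network ceiling sits well below what 45 drives could deliver in aggregate, which is the point being made about the SATA bus not being the limiting factor.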

Also, I see many comments on redundancy… on another site someone said how the system was not for “real life like in yahoo or google” because of the lack of redundancy…. I don’t see dual power supplies on the google server (linked on this very same article)… and those are servers used on very real life applications.

This system seems perfect for what it was designed for IMHO.

Bluedodo says:

September 5, 2009 at 8:56 pm

So how would you do an offsite backup of so much data?

anonymous says:

September 6, 2009 at 4:09 pm

@b0red: The google servers are for computation, not data storage. If and when a google server fails, the upstream software will simply have a different server do the work the broken one was supposed to do. Very little data is kept on each server, and all of the data is easily replaceable, of little consequence, and probably replicated elsewhere anyway.

andrei says:

September 7, 2009 at 5:06 am

I may have missed it in the article, but how do they delay the power-up of the second PSU? Is that achieved with a microcontroller, or can it be triggered via the motherboard by the operating system, once it has loaded with the first PSU?

Robert Paulson says:

April 16, 2010 at 6:59 am

For that $ you could get twice the TBs off of Newegg with old AMANDA backing ya up. f(x) = bulk purchase external FireWire drives, copy and paste the config file from the renegade site and let the daisy chain via USB and FireWire cook. I’ve got 30 TBs going now and it is smoking!

Shadyman says:

October 21, 2010 at 12:47 pm

@blerik:

These pods use JFS, “Journaled File System”; however, like you said, “JFS journals metadata only, which means that metadata will remain consistent but user files may be corrupted after a crash or power loss.”[wikipedia:JFS]

I would imagine this is mitigated in their ‘system logic’ such that if a write fails, it gets sent to another machine.

I would also hazard to guess that the data center has some form of UPS and backup generator.

Tanner says:

October 25, 2010 at 9:52 am

@Shadyman
He’s not talking about power loss due to the grid going out, he’s talking about the power supply giving power to the hard drives failing.
