How A Storage Company Builds Their Own

[Image: Backblaze storage pods]

Want 67 terabytes of local storage? That’ll be $7,867, but only if you build it yourself. Backblaze sells online storage, but when setting up the company they found that the only economical way to get the capacity they needed was to build their own storage pods. Lucky for us, they followed the lead of other companies and shared how they built their storage farm using a mix of custom, consumer, and open source components.

Each pod is a standalone HTTPS-connected storage unit holding 45 hard drives. Nine SATA port multiplier backplanes connect to four SATA controller cards on the motherboard. The system boots from a 46th hard drive into 64-bit Debian. The data drives run RAID 6 and use the Journaled File System (JFS). Our first thought when reading this was about the heat generated by all those drives; a custom case houses the hardware and includes six large fans to handle the cooling.
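If you want a rough feel for the economics, here’s a quick back-of-the-envelope sketch in Python. It assumes the 45 data drives are 1.5 TB units split into three 15-drive RAID 6 arrays (the layout a commenter walks through below); the numbers are illustrative, not Backblaze’s own accounting.

```python
# Back-of-the-envelope numbers for one pod, assuming 45 x 1.5 TB drives
# arranged as three 15-drive RAID 6 arrays (a layout described in the
# comments below). Figures are illustrative, not Backblaze's own.

DRIVE_TB = 1.5          # marketed capacity per data drive
DRIVES_PER_ARRAY = 15   # drives in each RAID 6 array
ARRAYS = 3              # RAID 6 arrays per pod
POD_COST_USD = 7_867    # build cost quoted in the article

raw_tb = DRIVE_TB * DRIVES_PER_ARRAY * ARRAYS            # 67.5 TB raw
usable_tb = DRIVE_TB * (DRIVES_PER_ARRAY - 2) * ARRAYS   # RAID 6 gives up 2 drives per array

print(f"raw capacity:    {raw_tb:.1f} TB")
print(f"usable capacity: {usable_tb:.1f} TB")
print(f"cost per raw TB: ${POD_COST_USD / raw_tb:,.2f}")
```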

[Thanks Dave]

28 thoughts on “How A Storage Company Builds Their Own”

  1. I did the exact same thing for a company on a contract on a slightly smaller scale (20 TB/unit). They needed all that storage for huge video files.

    It’s simple really but it’s an elegant writeup.

  2. @bob: Thanks for the input, but it’s “put up or shut up” time. I don’t know you, and I doubt many of the other people here do either. If you want anyone to take what you have to say seriously, you should be able to give at least some examples of the “flaws” you’re referring to. Otherwise, we have to assume that, like the majority of people who post on internet message boards, you’re just talking out your rectum and don’t have a clue.

  3. I think most of the negative comments on this solution are due to the lack of access to the software that runs it. The redundancy isn’t in the hardware (each red box is essentially designed to be throwaway, as I understand the solution). The beauty of this solution is in the software (which no one has access to, and they don’t really want to talk about – naturally, because that is their business edge) and how that software doesn’t require huge, complex, redundant hardware systems.

    When you can build the hardware for next to nothing compared to commercial hardware storage solutions, who cares if you lose an entire rack of these boxes, because they were so cheap to build, they have 6 other copies of the data (made up number) in other racks and data centers. I’ve read some comments trashing speed, but when you realize that most people don’t have a fat enough pipe to strain these servers, that’s not an issue either.

    My assumption is that these boxes are seen as expensive 67TB hard drives in themselves. Cheap interface board (mobo, cpu, ram, boot drive) and high capacity in the hardware were the goal of the project, allowing the maintainers to write the redundancy into the software end and not worry about a higher downtime per box, because the box itself isn’t critical.

    *** I could be completely wrong. I’m just saying take another look at it and recognize that none of us have the whole picture due to the missing software component. I wouldn’t build one of these for personal use or for my business. ***

  4. I wonder why they went with three two-port and one four-port SATA controllers instead of three four-port controllers. To save $35?

    Also, what was the development cost of the “custom Backblaze application layer logic”?

  5. @phoenix

    Could it be PCI bus speeds? Better I/O per RAID array via more channels? I don’t know on that one. Good question. I do know they have nine backplanes, and this way they only have one extra SATA port instead of three. No real logic there, just pointing it out.

  6. There are a number of points I’d like to address, and unlike a lot of the comments here, I’ll also offer solutions. Please, in the truest sense of research and commentary, provide feedback: if I’m chasing this down the wrong path, let me know.

    So each of these nodes consists of 46 disks: one for the core operating system, and 45 provisioned as three 15-disk RAID 6 arrays. One obvious failure point is the operating system disk, but that is very unlikely to lead to total data loss, so we can rule it out.

    Additionally, let’s analyze the problems from the perspective of a single 15-disk RAID 6 array. A failure in one array doesn’t make a failure in the other two any more likely, so let’s start there.

    They’re using 1.5 TB drives in their arrays, thus providing approximately 19.5 TB of usable space in each RAID 6 array. Important metrics to note (see the sketch after this list):
    * These disks have an MTBF of 750,000 hours. (With 15 disks in the array, that’s 15 drive-hours accumulated for every hour on the clock.)
    * The MTBF assumes average use, not greater-than-average activity.
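    As a rough illustration of what that MTBF figure means in practice, here is a minimal sketch that treats the 750,000-hour MTBF as a constant failure rate, which is a big simplification of how drives actually fail:

    ```python
    import math

    MTBF_HOURS = 750_000        # quoted per-drive MTBF
    DRIVES = 15                 # drives in one RAID 6 array
    HOURS_PER_YEAR = 24 * 365

    # Under a constant-failure-rate model, expected drive failures per
    # array per year:
    expected_failures_per_year = DRIVES * HOURS_PER_YEAR / MTBF_HOURS
    print(f"expected failures per array-year: {expected_failures_per_year:.2f}")  # ~0.18

    # Equivalently, the array burns 15 drive-hours per clock hour, so one
    # failure is expected roughly every 750,000 / 15 = 50,000 hours:
    print(f"hours between expected failures:  {MTBF_HOURS / DRIVES:,.0f}")

    # Probability that a given array sees at least one drive failure in a
    # year (Poisson approximation):
    print(f"P(>=1 failure in a year):         {1 - math.exp(-expected_failures_per_year):.0%}")
    ```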

    The note about increased activity is important. When a drive fails, you need to read across every remaining disk in the array to reconstruct the failed drive from parity. Doing this while the array is still serving requests causes extra head seeks, leading to an even higher than “normal” amount of activity and increasing the chance of another failure.

    More insidious is the chance of bit-error issues.

    On a SATA disk, the quoted unrecoverable read error rate is one bad bit per 10^14 bits read. The problem is that a single 1.5 TB disk holds roughly 1.32 × 10^13 bits, and if one disk fails we need to read 13 disks’ worth of data to reproduce the lost drive. That means, given the formula:

    Expected_Errors = (Recovery_Disks * Disk_Capacity) / Bits_Per_Error

    And the values:
    Recovery Disks = 13
    Disk Capacity = 1.32 * 10^13 bits
    Bits Per Error = 10^14 bits

    we expect roughly 1.7 unrecoverable read errors over the course of a single rebuild. In other words, hitting at least one unreadable block, and with it losing data, is all but guaranteed.
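    Worked through in code, the same estimate looks like this; the disk-capacity figure is my own conversion of 1.5 TB to bits, and the Poisson step at the end is an extra assumption, not part of the original formula:

    ```python
    import math

    URE_RATE = 1e-14            # unrecoverable read error rate: 1 bit in 10^14
    DISK_BITS = 1.32e13         # approx. bits on one 1.5 TB drive (my conversion)
    RECOVERY_DISKS = 13         # surviving data disks read during a rebuild

    bits_read = RECOVERY_DISKS * DISK_BITS
    expected_ures = bits_read * URE_RATE

    print(f"bits read during rebuild:  {bits_read:.2e}")
    print(f"expected UREs per rebuild: {expected_ures:.2f}")      # ~1.7

    # Extra assumption: modelling errors as Poisson, the chance of a rebuild
    # completing with zero unreadable blocks is only about:
    print(f"P(clean rebuild):          {math.exp(-expected_ures):.0%}")
    ```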

    So, where do we go from here? Well, first, we need a better filesystem. A copy-on-write filesystem with checksumming alleviates these problems: on FreeBSD or Solaris, that would be ZFS; on Linux we’re looking at Btrfs or NILFS.

    A hardware RAID controller will do little to take care of this problem. While it can checksum the disks and run periodic health checks, it doesn’t change the fact that, at today’s disk sizes, there WILL be errors. We need to move to a method that compensates for this by design.
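    To make the checksumming point concrete, here is a toy sketch of the per-block verification idea that ZFS and Btrfs build in; it is purely illustrative and bears no resemblance to either filesystem’s actual implementation:

    ```python
    import hashlib
    import os

    BLOCK_SIZE = 4096

    def write_block(data: bytes):
        """Store a block alongside a checksum of its contents."""
        return data, hashlib.sha256(data).digest()

    def read_block(data: bytes, checksum: bytes) -> bytes:
        """Verify the checksum before handing the block back to the caller."""
        if hashlib.sha256(data).digest() != checksum:
            raise IOError("checksum mismatch: block silently corrupted")
        return data

    block, csum = write_block(os.urandom(BLOCK_SIZE))

    # Flip a single bit: the corruption is caught on read instead of being
    # returned to the application as good data.
    corrupted = bytes([block[0] ^ 0x01]) + block[1:]
    try:
        read_block(corrupted, csum)
    except IOError as err:
        print(err)
    ```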

  7. Apparently, it won’t let me post my bibliography; let’s try it with broken links:

    Bibliography:
    hxtp://db.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
    hxtp://research.google.com/archive/disk_failures.pdf
    hxtp://permabit.wordpress.com/2008/08/20/are-fibre-channel-and-scsi-drives-more-reliable
    hxtp://www.eweek.com/c/a/Data-Storage/Hard-Disk-MTBF-Flap-or-Farce
    hxtp://blog.econtech.selfip.org/2008/09/56-chance-your-hard-drive-is-not-fully-readable-a-lawsuit-in-the-making

  8. @colecoman1982 – I was going to post a link, but in the end I didn’t, because of the several places I’ve seen criticisms, no single post summed everything up flawlessly. Also, the link posted two or three comments up did cover most things.

    I guess a lot of storage experts are a bit annoyed because they’ve been building cheap storage units without the noobular errors for some time, but obviously you only get attention with a shiny red paint job and some nice exploded diagrams. It’s a shame that doing it properly is too boring.

  9. How about this for a major flaw: when the primary power supply croaks, you really don’t know what happened to the data you wrote in the last second. No RAID 6 is going to fix that; there is no battery-backed RAM here. And when the secondary power supply croaks, you mess up a full set of 15 disks, since half of them lose power (and RAID 6 cannot fix that either).

    Why they’re even using RAID 6 on the boxes is a mystery to me. The only way to make this setup work is replicating the data across multiple units in multiple racks so they don’t take power from the same bus.

  10. To M4CGYV3R: untrue. They are using conventional drives, not solid state, and with data-storage traffic they will degrade over time. Of course, solid state would roughly quadruple the total cost, but so far there is no long-term, permanent, foolproof storage option for data. Vigilance is always going to be a necessary ingredient.

  11. I wonder what the price difference is between these cards and a couple of nice 3ware cards.

    The newer 3ware cards can span multiple arrays over multiple controllers – and there’s a 16-port model (IIRC). Performance would be significantly better.

  12. I see comments, and an article linked below, that address the low speed of the system: how it is limited by the bus, the board, the RAM, etc.

    This system is meant to be connected to the internet via Ethernet, so the best it will ever see is 1 Gbit/s, the maximum the Ethernet card can provide.

    So what is the disadvantage of the SATA bus (or whatever) being slow, if your major bottleneck is the Ethernet card anyway?
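    A quick sanity check on that bottleneck claim, assuming a ballpark 100 MB/s of sequential throughput per drive (my figure, not the article’s):

    ```python
    NIC_MB_PER_S = 1_000 / 8      # 1 Gbit/s is at most ~125 MB/s of payload
    DRIVE_MB_PER_S = 100          # assumed sequential rate of one drive (ballpark)
    DRIVES = 45

    aggregate_disk_mb_per_s = DRIVE_MB_PER_S * DRIVES
    print(f"network ceiling:        {NIC_MB_PER_S:.0f} MB/s")
    print(f"aggregate disk ceiling: {aggregate_disk_mb_per_s:,} MB/s (theoretical)")
    print(f"disks outrun the NIC by ~{aggregate_disk_mb_per_s / NIC_MB_PER_S:.0f}x")
    ```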

    Also, I see many comments on redundancy. On another site someone said the system is not for “real life, like at Yahoo or Google” because of the lack of redundancy, yet I don’t see dual power supplies on the Google server (linked from this very article), and those are servers used in very real-life applications.

    This system seems perfect for what it was designed for, IMHO.

  13. @b0red: The Google servers are for computation, not data storage. If and when a Google server fails, the upstream software simply has a different server do the work the broken one was supposed to do. Very little data is kept on each server, and all of it is easily replaceable, of little consequence, and probably replicated elsewhere anyway.

  14. I may have missed it in the article, but how do they delay the power-up of the second PSU? Is that achieved with a microcontroller, or can it be triggered via the motherboard by the operating system that boots on the first PSU?

  15. For that money you could get twice the terabytes off of Newegg with old AMANDA backing you up: bulk-purchase external FireWire drives, copy and paste the config file from the renegade site, and let the daisy chain of USB and FireWire cook. I’ve got 30 TB going now and it is smoking!

  16. @blerik:

    These pods use JFS, “Journaled File System”; however, like you said, “JFS journals metadata only, which means that metadata will remain consistent but user files may be corrupted after a crash or power loss.”[wikipedia:JFS]

    I would imagine this is mitigated in their ‘system logic’ such that if a write fails, it gets sent to another machine.

    I would also hazard to guess that the data center has some form of UPS and backup generator.
