Moore’s Law Of Raspberry Pi Clusters

[James J. Guthrie] just published a rather formal announcement that his 4-node Raspberry Pi cluster greatly outperforms a 64-node version. Of course the differentiating factor is the version of the hardware. [James] is using the Raspberry Pi 2 while the larger version used the Model B.

We covered that original build almost three years ago. It’s a cluster called the Iridris Pi supercomputer. The difference is a 700 MHz single core versus the 900 MHz quad-core with double-the ram. This let [James] benchmark his four-node-wonder at 3.048 gigaflops. You’re a bit fuzzy about what a gigaflops is exactly? So were we… it’s a billion floating point operations per second… which doesn’t matter to your human brain. It’s a ruler with which you can take one type of measurement. This is triple the performance at 1/16th the number of nodes. The cost difference is staggering, with the Iridris ringing in at around £2500 and the light-weight 4-node build at just £120. That’s more than an order of magnitude.
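A quick back-of-envelope check of that claim, sketched in Python. Note the Iridris performance figure is inferred from “triple the performance” above; the announcement doesn’t state it directly.

```python
# Cost per GFLOPS for each cluster, using the figures above.
# Assumption: Iridris performance ~ 3.048 / 3 GFLOPS, inferred from
# "triple the performance"; it is not stated directly in the announcement.
iridris_cost_gbp, iridris_gflops = 2500, 3.048 / 3
newbuild_cost_gbp, newbuild_gflops = 120, 3.048

iridris_ppg = iridris_cost_gbp / iridris_gflops      # ~2460 pounds per GFLOPS
newbuild_ppg = newbuild_cost_gbp / newbuild_gflops   # ~39 pounds per GFLOPS

print(iridris_ppg / newbuild_ppg)  # ~62x, comfortably more than an order of magnitude
```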

Look, there’s nothing fancy to see in [James’] project announcement. Yet. But it seems somewhat monumental to stand back and think that a $35 computer aimed at education is being used to build clusters for crunching Ph.D. level research projects.

52 thoughts on “Moore’s Law Of Raspberry Pi Clusters”

    1. From the article “The 3 GIGAFLOPS performance means simulations take around the same time to complete as on a single core on a relatively modern workstation”, so it is comparable in performance, but it is cheaper in cost.
      There is no explanation for the huge performance gain, so here it is: it’s not the 512 MB vs 1 GB of RAM, and it’s not the 700 vs 900 MHz that makes the huge difference; it’s the multicore processor. The performance of a distributed computer depends a lot on the communication performance between nodes. The Pis (all of them) are pretty bad at spitting out data over Ethernet: it starts with a USB port on the CPU and goes through a hub which has a USB-to-Ethernet adapter in it. So any messages exchanged between nodes are slow. But with the newer Pi, each core has 3 brothers (sisters?) with which it can communicate at super speed and low latency.
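The bottleneck argument above can be put into rough numbers with a toy latency-plus-bandwidth model (a sketch: the 0.5 ms Ethernet latency and 8 GB/s shared-memory bandwidth are assumed round figures, not measurements; only the 100 Mb Ethernet rate comes from the Pi’s spec).

```python
def transfer_time(nbytes, bandwidth_bps, latency_s):
    """Naive model: fixed startup latency plus bits over the wire."""
    return latency_s + nbytes * 8 / bandwidth_bps

MSG = 1_000_000  # a 1 MB chunk of simulation data

# Node to node over the Pi's USB-attached 100 Mb Ethernet (latency assumed)
eth = transfer_time(MSG, 100e6, 0.5e-3)   # ~0.08 s
# Core to core via shared memory on one Pi 2 (both figures assumed)
shm = transfer_time(MSG, 8e9, 1e-6)       # ~0.001 s

print(eth / shm)  # ~80x: inter-node messages dominate the runtime
```

With numbers anywhere in this neighborhood, keeping communication on-chip between the four cores is worth far more than the modest clock bump.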

          1. Or some custom token ring bus built over the GPIO pins? Interconnects for supercomputers are always an important area of research, and cheap computers like the Pi seem like a good place for new hackers to try their hands at it.

      1. >From the article “The 3 GIGAFLOPS performance means simulations take around the same time to complete as on a single core on a relatively modern workstation”, so it is comparable in performance,
        >but it is cheaper in cost.

        3 gigaflops is around the performance of a NINE YEAR OLD LAPTOP CPU running at 1.6 GHz, something you get for free or pay $20 for at Goodwill

        1. My graphics card which is not the newest can do 3.78 TFlops, so yes this doesn’t seem economical, except for the people selling raspis :)

          Still, the point is that 4 new raspi2’s are equal to 3 times 64 old ones it seems, and that’s interesting to hear.

      2. Or we could assume that the person already owns a computer and then spend $100 on a GTX 640 and then have processing power that is over 3 orders of magnitude faster.

        Heck my 5 year old Core i5 does around 12 GFLOPs per core and I have 4 cores.

        Other than “because you can” there really is no way to justify clustering RaspberryPis for any kind of real workload.

    2. For something optimized to run many, many threads, not necessarily. Also, cost and energy use do fairly heavily factor into the equation, both of which the desktop is worse for. Does it perform better? Sure, but it uses several times the power, costs a minimum of 3x as much, isn’t as expandable, and takes significantly more space. For about the cost of said PC “cluster”, you could have a minimum of ten pi nodes and all the cores that implies, which for some purposes is significantly more useful than one faster CPU.

      1. A desktop PC can run many threads with the same ease as a handful of pi nodes. Communication between threads is also much easier and faster because they can share memory, plus it’s easy to extend the memory. And even with 10 pi nodes, you’d only get 10 GFLOPS versus 75 GFLOPS for a single quad-core i7.

          1. If you compare the Raspi with a graphics card, it would be fair to also include the GPU performance on the Raspi. The GTX640 would still win, though.

    3. My single-node Intel desktop is 69 GFLOPS for $490, not including the screen but including HD, power supply, and premium metal box.
      The £120 is a lie: you need 4 SD cards and 4 power supplies, so it is more like £200.
      So 20 times more GFLOPS for less than 3 times the price.
      Who wins?
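Checking that comparison’s arithmetic (a sketch: the £200 figure is the commenter’s all-in estimate above, and the $1.50/£ conversion is an assumed rough 2015 rate, not part of the original comment).

```python
desktop_gflops, desktop_usd = 69, 490      # figures from the comment above
cluster_gflops, cluster_gbp = 3.048, 200   # 4-node Pi 2 cluster, all-in estimate
usd_per_gbp = 1.5                          # assumed rough 2015 exchange rate

print(desktop_gflops / cluster_gflops)            # ~22.6x the GFLOPS
print(desktop_usd / (cluster_gbp * usd_per_gbp))  # ~1.6x the price
```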

  1. “it’s a billion floating point operations per second… which doesn’t matter to your human.”

    It scares me a bit to think that so many androids and computers are reading HaD that the postings are now addressed to them. B^)

  2. PhD-level research projects with 3 GFLOPS? Any decent PC greatly outperforms this by some orders of magnitude. A $100 video card such as the Radeon HD 7770 has a peak performance of 1280 GFLOPS

        1. It will be interesting to see the results, but I doubt it will scale efficiently.

          As always, it comes down to the application. Using a pile of RPis as web servers or a build cluster will scale nicely since the applications have no interdependence on each other. [James] is running a CFD simulation on his cluster, which frequently needs to ferry data between processes.

          Fortunately for [James] the RPi2 has 4 cores and some additional cache behind the memory controller, so communicating between processes on the same RPi is speedy. But neither USB-HS nor 100 Mb Ethernet is suitable here, and the combination wears away at the performance advantage of the RPi2 as the size of the fabric grows.

          This is why HaD readers are so critical of compute clusters based on the embedded board of the month. GPUs that can get hundreds of times the performance of this cluster for less than a couple of times the price are available at your local computer shop, and if a PhD can’t get time on a university supercomputer then I’m sure they’ve got a friend with a fancy gamer PC they could bribe some idle time from.

          But of course this isn’t about that: it’s about seeing what you can do on the bottom end. It’s a fun toy research exercise, and I’ve kind of got an itch now to buy a few more of these to play with. HPC is mature enough that [James] should be able to estimate the performance of a larger cluster based on his current build, and I’m looking forward to seeing how that holds up to reality.
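That kind of estimate hinges on how communication overhead grows with node count. An Amdahl-style toy model shows why scaling from 4 nodes is unlikely to be linear (a sketch: the 5% communication fraction is an assumed illustrative figure, not a measurement from this cluster).

```python
# Toy strong-scaling model: a fixed fraction of each simulation step is
# serial/communication work that doesn't parallelize across nodes.
# The 5% figure is an assumed illustration, not measured on this cluster.
def speedup(nodes, comm_fraction=0.05):
    """Amdahl-style speedup over a single node."""
    return 1 / (comm_fraction + (1 - comm_fraction) / nodes)

for n in (4, 16, 64):
    print(n, round(speedup(n), 1))  # 4 nodes ~3.5x, 16 ~9.1x, 64 ~15.4x
```

Even this optimistic model caps out well short of linear scaling, which is why extrapolating from a 4-node build needs care.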

            1. Oddly enough, I’ve worked with some guys that swear by a rack of single-core machines over a single large multi-core machine for CFD simulations. It’s about memory bandwidth: the communication of work units and results between head node and workers doesn’t come close to saturating a modern network connection, but when you’re dealing with the amounts of data involved in the intermediate calculations, having a dedicated 1:1 path from CPU core to memory apparently makes a difference. This was a bunch of years back; I’m not sure if Intel’s push into triple- and then quad-channel memory renders that observation obsolete, or if the tools have improved to make better use of the hardware. It may also depend on the particulars of the workload.

            For what it’s worth, that was university work, and the hardware and energy costs were irrelevant (free).

        2. I still don’t feel I’m the one who misses the point here. This is actually interesting if you are researching parallel computing (which is the original scope of Dr. Cox), but in this case the objective looks like building computing power to get a job done, i.e. calculate a CFD. And to get 1280 GFLOPS of performance (= a 100-buck consumer video card) you would probably need $50–80k worth of RPIs and various equipment (but I suspect much more, as this system will scale awfully).

    1. Agreed. I sigh because I would love these RPi Beowulf tests to be killer per kWh, but I already tried something of this level around 2007.

      I was trying to build a Beowulf cluster from four motherboards with dual socket-370 CPUs and Intel P3 Tualatin 1.2 GHz chips. I loved the Tualatin chip: 130 nm instead of 180 nm of the P3 Coppermine or even the first P4s, twice the on-die L2 cache.

      Finding slightly older server parts was easy (and cheap) as a geek with a car living halfway between Cambridge and route 128. I got all the parts together — CPUs, motherboards, 133 MHz ECC RAM (server boards), power supplies, and a separate router to handle the subnet for message passing. Then I realized a single Athlon64 x2 on a microATX board was beating my numbers and only needed one power supply. It didn’t even need a separate router to talk to itself.

      The power was the expensive part. I was living in Boston, where power cost about 60% more than it does for me now that I’m in Los Angeles. Even though I got quiet fans for the CPUs (which didn’t run as hot as the P4s of their time), I still had a noisy setup that needed 1.2 kilowatts.

      This is how I learned the lesson of throughput. The weakest point in any system may still have redundancy. You may have a lot of pipelines, but their skinniness means the data is just as slow to obtain. It’s a house on stilts instead of a house with a single, cement foundation.

      I also learned that money is not the only cost. I had already learned when I was very poor that I could trade time for money, though I would later learn that I may have also traded my future health as well. In this case I didn’t learn anything so painful: noise and space are also costs. Complexity of maintenance is also a cost, but that also files under time. I can get all the parts for less than $500 but never have the time to keep the beast running without suffering.

      Nevertheless, I love following these experiments. Everyone that conducts one learns a lot. The most valuable long-term lesson is that each challenge met is its own payment, one that can be turned into cash when applying for and getting more interesting jobs in the future. Each hard lesson is burned into the mind, one that will seem like a stroke of genius ten years later.

      Besides, handheld Beowulf? Hwaet!

  3. This is neat. The one thing I don’t get is why use a Pi? I get that it is a decent introductory board but the fact that even with the Pi2 they still slave the Ethernet to the USB bus seems to make this a suboptimal board for this kind of application.

      1. Yes, a used i7 machine from ebay or similar would easily get at least 70 gigaflops for nowhere near 20 times the price of 4 RPi’s. While the RPi2 is “faster” than the original, they are still rather weak in just about every metric, especially for this use case.

      2. Can’t say anything really comes to mind that costs less. However, the Odroid C1 costs effectively the same, has just about twice the clock, and has Gigabit Ethernet that isn’t on the USB bus. I also want to say I have read in passing that the GPU on the Odroid is slightly better documented; at the least, it sounded like there was more out there than just a mystery blob. How much more, I have no clue. If that ends up being correct, there is the potential that you could also leverage the onboard GPU that is otherwise going to sit there doing next to nothing with the thing running headless. At the very least it would probably shave some cycles not having to run through the USB to get at the Ethernet.

        However, if we are going price-per-gigaflop, neither is really all that practical. Like some people have pointed out, the GPU on a decent graphics card can run circles around one of these clusters. Still, it is an interesting exercise in the application of parallel computing. It’s just that the USB bottleneck drives me nuts :P

      3. If you just factor in the cost of the time to set up a cluster like this, and to adapt the original problem so that it will run efficiently on this cluster, you’ve already spent more than the cost of an i7 machine.

      4. On a pure cost-per-gigaflop basis a GPU will win. Nvidia and AMD both have offerings in the 1–6 Teraflop range.

        For example, for double-precision floating point operations, a Radeon HD 6990 (wikipedia) can do about 1.27 TFLOPS, compared to about 93 MFLOPS for a Pi 2 (hackaday).
        http://hackaday.com/2015/02/05/benchmarking-the-raspberry-pi-2/

        According to http://en.wikipedia.org/wiki/FLOPS, there’s a system at $0.08 per GFLOP.

        Define “plug and play”: pop the video card in and run code. Seems easy enough.

        If I did the math right, one video card would be worth 13763 Pi 2 units on a flop-for-flop basis. That’s before factors such as power, cooling, and networking are accounted for.
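For what it’s worth, the division checks out (a sketch using the figures quoted above: 1.27 TFLOPS over 93 MFLOPS gives about 13656, within roughly 1% of the 13763 stated, which would correspond to a 1.28 TFLOPS figure instead).

```python
gpu_mflops = 1.27e6   # Radeon HD 6990 double-precision figure quoted above
pi2_mflops = 93       # Pi 2 figure from the hackaday benchmark linked above

print(round(gpu_mflops / pi2_mflops))  # 13656, within ~1% of the 13763 above
```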

  4. Better performance at what? If the first one was built to test and develop network-distributed processing algorithms (I had assumed all along that it was; why build it otherwise?), then the 4-board system is not a substitute.

  5. There are tons of comments on the performance of the cluster, and how pointless it is…

    but *nothing* on the fact that he built the case out of Mega Blocks? Cases built out of Legos are neat, but they have one drawback: they use valuable Legos that could be used to build something else!

    A case built out of Mega Blocks is awesome, because at some point, those Mega Blocks lying around my house are going to be useless once my kids grow out of them. I’m totally stealing that idea in the future.

    1. I had no idea that was a thing. I had Duplo when I was little, with farm animals and things. I recall my brother had a Duplo train that my animals could ride in.
      Oops, I got lost in a nostalgia trip…

  6. “quad-core with double-the ram”

    “four-node-wonder”

    “light-weight 4-node”

    Are-we hyphenating every-thing in Hack-A-Day-articles-now? I understand they are probably not hiring English majors, but still.

    1. And thank god they aren’t. If you want a poorly written article with multiple grammatical errors, just pick up a local newspaper. The Hackaday staff does a fine job with their writing, all things considered; yes, they still provide wrong info from time to time and do silly stuff, but it is far better than most aggregated news sources.

      1. Agreed. Although newspapers are nearly dead, they were called fishwrap for a reason. I’d much rather read a tech article written and/or edited by a knowledgeable tech person than an English major attempting to write a tech article that makes sense. I’m a former tech editor married to an English major, and I worked at a place where they attempted to make a tech person out of an English major; it didn’t work out well for him. I wonder if he’s even still in the publishing industry at all anymore. Keep up the good work, HaD staff; don’t hire any English majors as editors. The only advice I can give is to proofread it twice and have another person proofread it, or proofread it slowly 3 times and shove it out the door. Given the ongoing quick pace that hacks come out at, I understand deadlines and think you’re doing great. Just leave the English majors to their Brit Lit and out of the HaD stuff.

  7. “You’re a bit fuzzy about what a gigaflops is exactly? So were we…”
    I’m afraid you need to run round the quad 3 times at midnight with your underpants round your ankles.

  8. Having some experience with real-world simulation: the Pi2 is not only waaaay too slow, but the leaner instruction set compared to a CISC CPU will also harm speed. The biggest issue, however, is RAM: even a small (modern) dataset can use tens of GB per core… PER CORE. The RPi2 is limited to 256 MB/core, which is most likely barely enough to do much of anything beyond the simplest of geometry and conditions.

  9. “Look, there’s nothing fancy to see in [James’] project announcement.”
    [translation]
    –=Our smart friend James wants to do something…==-
    *reads intently, studies in analytical discretion, grunts in approval*

    “Yet”.
    [translation]
    –=He wants to go to the bathroom since he is planning on eating some taco bell. Watch out now, he will use the extra spicy sauce. Since he is smart pay close attention to any revelations on how to wash our hands and even wipe.=–

    Wow. Oh wow. We are all quivering in anticipation now. You got us.

    HOW DOTH ARTICLE FORMED? I HAS PAPER NAPKIN WITH SCRIBBLES OF A “FM” CLOUD AND A BOOGER. MAEK ME HACKADAY SHOWCASE.

    http://www.researchgate.net/profile/James_Guthrie5
    And holy hell.. The article has been downloaded TWO times.

    http://personal.strath.ac.uk/james.guthrie/jabopi/
    “and the cost will likely be met by the author.”

    Ahhh… There it is. You know, he should have just done a Kickstarter and promised to copyleft the documentation on how to build it.
