Moore’s Law Of Raspberry Pi Clusters

[James J. Guthrie] just published a rather formal announcement that his 4-node Raspberry Pi cluster greatly outperforms a 64-node version. Of course the differentiating factor is the version of the hardware. [James] is using the Raspberry Pi 2 while the larger version used the Model B.

We covered that original build almost three years ago. It’s a cluster called the Iridris Pi supercomputer. The difference is a 700 MHz single core versus the 900 MHz quad-core with double-the ram. This let [James] benchmark his four-node-wonder at 3.048 gigaflops. You’re a bit fuzzy about what a gigaflops is exactly? So were we… it’s a billion floating point operations per second… which doesn’t matter to your human brain. It’s a ruler with which you can take one type of measurement. This is triple the performance at 1/16th the number of nodes. The cost difference is staggering, with the Iridris ringing in at around £2500 and the light-weight 4-node build at just £120. That’s more than an order of magnitude.
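A quick back-of-envelope check of that claim, sketched in Python. Note the Iridris performance figure is inferred from “triple the performance” above; the announcement doesn’t state it directly.

```python
# Cost per GFLOPS for each cluster, using the figures above.
# Assumption: Iridris performance ~ 3.048 / 3 GFLOPS, inferred from
# "triple the performance"; it is not stated directly in the announcement.
iridris_cost_gbp, iridris_gflops = 2500, 3.048 / 3
newbuild_cost_gbp, newbuild_gflops = 120, 3.048

iridris_ppg = iridris_cost_gbp / iridris_gflops      # ~2460 pounds per GFLOPS
newbuild_ppg = newbuild_cost_gbp / newbuild_gflops   # ~39 pounds per GFLOPS

print(iridris_ppg / newbuild_ppg)  # ~62x, comfortably more than an order of magnitude
```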

Look, there’s nothing fancy to see in [James’] project announcement. Yet. But it seems somewhat monumental to stand back and think that a $35 computer aimed at education is being used to build clusters for crunching Ph.D. level research projects.

52 thoughts on “Moore’s Law Of Raspberry Pi Clusters”

    1. From the article “The 3 GIGAFLOPS performance means simulations take around the same time to complete as on a single core on a relatively modern workstation”, so it is comparable in performance, but it is cheaper in cost.
      There is no explanation for the huge performance gain, so here it is: it’s not the 512 MB vs 1 GB of RAM, and it’s not the 700 vs 900 MHz that makes the huge difference; it’s the multicore processor. The performance of a distributed computer depends a lot on the communication performance between nodes. The Pis (all of them) are pretty bad at spitting out data over Ethernet: it starts with a USB port on the CPU and goes through a hub which has a USB-to-Ethernet adapter in it. So any messages exchanged between nodes are slow. But with the newer Pi, each core has 3 brothers (sisters?) with which it can communicate at super speed and low latency.
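The bottleneck argument above can be put into rough numbers with a toy latency-plus-bandwidth model (a sketch: the 0.5 ms Ethernet latency and 8 GB/s shared-memory bandwidth are assumed round figures, not measurements; only the 100 Mb Ethernet rate comes from the Pi’s spec).

```python
def transfer_time(nbytes, bandwidth_bps, latency_s):
    """Naive model: fixed startup latency plus bits over the wire."""
    return latency_s + nbytes * 8 / bandwidth_bps

MSG = 1_000_000  # a 1 MB chunk of simulation data

# Node to node over the Pi's USB-attached 100 Mb Ethernet (latency assumed)
eth = transfer_time(MSG, 100e6, 0.5e-3)   # ~0.08 s
# Core to core via shared memory on one Pi 2 (both figures assumed)
shm = transfer_time(MSG, 8e9, 1e-6)       # ~0.001 s

print(eth / shm)  # ~80x: inter-node messages dominate the runtime
```

With numbers anywhere in this neighborhood, keeping communication on-chip between the four cores is worth far more than the modest clock bump.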

          1. Or some custom token ring bus built over the GPIO pins? Interconnects for supercomputers are always an important area of research, and cheap computers like the Pi seem like a good place for new hackers to try their hands at it.

      1. >From the article “The 3 GIGAFLOPS performance means simulations take around the same time to complete as on a single core on a relatively modern workstation”, so it is comparable in performance,
        >but it is cheaper in cost.

        3 gigaflops is around the performance of a NINE YEAR OLD LAPTOP CPU running at 1.6 GHz, something you get for free or pay $20 for at Goodwill

        1. My graphics card which is not the newest can do 3.78 TFlops, so yes this doesn’t seem economical, except for the people selling raspis :)

          Still, the point is that 4 new raspi2’s are equal to 3 times 64 old ones it seems, and that’s interesting to hear.

      2. Or we could assume that the person already owns a computer and then spend $100 on a GTX 640 and then have processing power that is over 3 orders of magnitude faster.

        Heck my 5 year old Core i5 does around 12 GFLOPs per core and I have 4 cores.

        Other than “because you can” there really is no way to justify clustering RaspberryPis for any kind of real workload.

    2. For something optimized to run many, many threads, not necessarily. Also, cost and energy use do fairly heavily factor into the equation, both of which the desktop is worse for. Does it perform better? Sure, but it uses several times the power, costs a minimum of 3x as much, isn’t as expandable, and takes significantly more space. For about the cost of said PC “cluster”, you could have a minimum of ten pi nodes and all the cores that implies, which for some purposes is significantly more useful than one faster CPU.

      1. A desktop PC can run many threads with the same ease as a handful of pi nodes. Communication between threads is also much easier and faster because they can share memory, plus it’s easy to extend the memory. And even with 10 pi nodes, you’d only get 10 GFLOPS versus 75 GFLOPS for a single quad-core i7.

          1. If you compare the Raspi with a graphics card, it would be fair to also include the GPU performance on the Raspi. The GTX640 would still win, though.

    3. My single-node Intel desktop is 69 GFLOPS for $490, not including the screen but including HD, power supply, and premium metal box.
      The £120 is a lie: you need 4 SD cards and 4 power supplies, so it is more like £200.
      So 20 times more GFLOPS for less than 3 times the price.
      Who wins?
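Checking that comparison’s arithmetic (a sketch: the £200 figure is the commenter’s all-in estimate above, and the $1.50/£ conversion is an assumed rough 2015 rate, not part of the original comment).

```python
desktop_gflops, desktop_usd = 69, 490      # figures from the comment above
cluster_gflops, cluster_gbp = 3.048, 200   # 4-node Pi 2 cluster, all-in estimate
usd_per_gbp = 1.5                          # assumed rough 2015 exchange rate

print(desktop_gflops / cluster_gflops)            # ~22.6x the GFLOPS
print(desktop_usd / (cluster_gbp * usd_per_gbp))  # ~1.6x the price
```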

  1. “it’s a billion floating point operations per second… which doesn’t matter to your human.”

    It scares me a bit to think that so many androids and computers are reading HaD that the postings are now addressed to them. B^)

  2. PhD-level research projects with 3 GFLOPS? Any decent PC greatly outperforms this by some orders of magnitude. A $100 video card such as the Radeon HD 7770 has a peak performance of 1280 GFLOPS

        1. It will be interesting to see the results, but I doubt it will scale efficiently.

          As always, it comes down to the application. Using a pile of RPis as web servers or a build cluster will scale nicely since the applications have no interdependence on each other. [James] is running a CFD simulation on his cluster, which frequently needs to ferry data between processes.

          Fortunately for [James] the RPi2 has 4 cores and some additional cache behind the memory controller, so communicating between processes on the same RPi is speedy. But neither USB-HS nor 100 Mb Ethernet is suitable here, and the combination wears away at the performance advantage of the RPi2 as the size of the fabric grows.

          This is why HaD readers are so critical of compute clusters based on the embedded board of the month. GPUs that can get hundreds of times the performance of this cluster for less than a couple of times the price are available at your local computer shop, and if a PhD can’t get time on a university supercomputer then I’m sure they’ve got a friend with a fancy gamer PC they could bribe some idle time from.

          But of course this isn’t about that: it’s about seeing what you can do on the bottom end. It’s a fun toy research exercise, and I’ve kind of got an itch now to buy a few more of these to play with. HPC is mature enough that [James] should be able to estimate the performance of a larger cluster based on his current build, and I’m looking forward to seeing how that holds up to reality.
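That kind of estimate hinges on how communication overhead grows with node count. An Amdahl-style toy model shows why scaling from 4 nodes is unlikely to be linear (a sketch: the 5% communication fraction is an assumed illustrative figure, not a measurement from this cluster).

```python
# Toy strong-scaling model: a fixed fraction of each simulation step is
# serial/communication work that doesn't parallelize across nodes.
# The 5% figure is an assumed illustration, not measured on this cluster.
def speedup(nodes, comm_fraction=0.05):
    """Amdahl-style speedup over a single node."""
    return 1 / (comm_fraction + (1 - comm_fraction) / nodes)

for n in (4, 16, 64):
    print(n, round(speedup(n), 1))  # 4 nodes ~3.5x, 16 ~9.1x, 64 ~15.4x
```

Even this optimistic model caps out well short of linear scaling, which is why extrapolating from a 4-node build needs care.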

            1. Oddly enough, I’ve worked with some guys that swear by a rack of single-core machines over a single large multi-core machine for CFD simulations. It’s about memory bandwidth: the communication of work units and results between head node and workers doesn’t come close to saturating a modern network connection, but when you’re dealing with the amounts of data involved in the intermediate calculations, having a dedicated 1:1 path from CPU core to memory apparently makes a difference. This was a bunch of years back; I’m not sure if Intel’s push into triple- and then quad-channel memory renders that observation obsolete, or if the tools have improved to make better use of the hardware. It may also depend on the particulars of the workload.

            For what it’s worth, that was university work, and the hardware and energy costs were irrelevant (free).

        2. I still don’t feel I’m the one who misses the point here. This is actually interesting if you are researching parallel computing (which is the original scope of Dr. Cox), but in this case the objective looks like building computing power to get a job done, i.e. calculate a CFD. And to get 1280 GFLOPS of performance (= a 100-buck consumer video card) you would probably need $50–80k worth of RPIs and various equipment (but I suspect much more, as this system will scale awfully).

    1. Agreed. I sigh because I would love these RPi Beowulf tests to be killer per kWh, but I already tried something of this level around 2007.

      I was trying to build a Beowulf cluster from four motherboards with dual socket-370 CPUs and Intel P3 Tualatin 1.2 GHz chips. I loved the Tualatin chip: 130 nm instead of 180 nm of the P3 Coppermine or even the first P4s, twice the on-die L2 cache.

      Finding slightly older server parts was easy (and cheap) as a geek with a car living halfway between Cambridge and route 128. I got all the parts together — CPUs, motherboards, 133 MHz ECC RAM (server boards), power supplies, and a separate router to handle the subnet for message passing. Then I realized a single Athlon64 x2 on a microATX board was beating my numbers and only needed one power supply. It didn’t even need a separate router to talk to itself.

      The power was the expensive part. I was living in Boston, where power cost about 60% more than it does for me now that I’m in Los Angeles. Even though I got quiet fans for the CPUs (which didn’t run as hot as the P4s of their time), I still had a noisy setup that needed 1.2 kilowatts.

      This is how I learned the lesson of throughput. The weakest point in any system may still have redundancy. You may have a lot of pipelines, but their skinniness means the data is just as slow to obtain. It’s a house on stilts instead of a house with a single, cement foundation.

      I also learned that money is not the only cost. I had already learned when I was very poor that I could trade time for money, though I would later learn that I may have also traded my future health as well. In this case I didn’t learn anything so painful: noise and space are also costs. Complexity of maintenance is also a cost, but that also files under time. I can get all the parts for less than $500 but never have the time to keep the beast running without suffering.

      Nevertheless, I love following these experiments. Everyone that conducts one learns a lot. The most valuable long-term lesson is that each challenge met is its own payment, one that can be turned into cash when applying for and getting more interesting jobs in the future. Each hard lesson is burned into the mind, one that will seem like a stroke of genius ten years later.

      Besides, handheld Beowulf? Hwaet!

  3. This is neat. The one thing I don’t get is why use a Pi? I get that it is a decent introductory board but the fact that even with the Pi2 they still slave the Ethernet to the USB bus seems to make this a suboptimal board for this kind of application.

      1. Yes, a used i7 machine from ebay or similar would easily get at least 70 gigaflops for nowhere near 20 times the price of 4 RPi’s. While the RPi2 is “faster” than the original, they are still rather weak in just about every metric, especially for this use case.

      2. Can’t say anything really comes to mind that costs less. However, the Odroid C1 costs effectively the same, has just about twice the clock, and has Gigabit Ethernet that isn’t on the USB bus. I also want to say I have read in passing that the GPU on the Odroid is slightly better documented; at the least, it sounded like there was more out there than just a mystery blob. How much more, I have no clue. If that ends up being correct, there is the potential that you could also leverage the onboard GPU that is otherwise going to sit there doing next to nothing with the thing running headless. At the very least it would probably shave some cycles not having to run through the USB to get at the Ethernet.

        However, if we are going price-per-gigaflop, neither is really all that practical. Like some people have pointed out, the GPU on a decent graphics card can run circles around one of these clusters. Still, it is an interesting exercise in the application of parallel computing. It’s just that the USB bottleneck drives me nuts :P

      3. If you just factor in the cost of the time to set up a cluster like this, and to adapt the original problem so that it will run efficiently on this cluster, you’ve already spent more than the cost of an i7 machine.

      4. On a pure cost-per-gigaflop basis a GPU will win. Nvidia and AMD both have offerings in the 1–6 Teraflop range.

        For example, for double-precision floating point operations, a Radeon HD 6990 (wikipedia) can do about 1.27 TFLOPS, compared to about 93 MFLOPS for a Pi 2 (hackaday).
        http://hackaday.com/2015/02/05/benchmarking-the-raspberry-pi-2/

        According to http://en.wikipedia.org/wiki/FLOPS, there’s a system at $0.08 per GFLOP.

        Define “plug and play”: pop the video card in and run code. Seems easy enough.

        If I did the math right, one video card would be worth 13763 Pi 2 units on a flop-for-flop basis. That’s before factors such as power, cooling, and networking are accounted for.
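For what it’s worth, the division checks out (a sketch using the figures quoted above: 1.27 TFLOPS over 93 MFLOPS gives about 13656, within roughly 1% of the 13763 stated, which would correspond to a 1.28 TFLOPS figure instead).

```python
gpu_mflops = 1.27e6   # Radeon HD 6990 double-precision figure quoted above
pi2_mflops = 93       # Pi 2 figure from the hackaday benchmark linked above

print(round(gpu_mflops / pi2_mflops))  # 13656, within ~1% of the 13763 above
```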

  4. Better performance at what? If the first one was built to test and develop network-distributed processing algorithms (I had assumed all along that it was; why build it otherwise?), then the 4-board system is not a substitute.

  5. There are tons of comments on the performance of the cluster, and how pointless it is…

    but *nothing* on the fact that he built the case out of Mega Blocks? Cases built out of Legos are neat, but they have one drawback: they use valuable Legos that could be used to build something else!

    A case built out of Mega Blocks is awesome, because at some point, those Mega Blocks lying around my house are going to be useless once my kids grow out of them. I’m totally stealing that idea in the future.

    1. I had no idea that was a thing. I had Duplo when I was little, with farm animals and things. I recall my brother had a Duplo train that my animals could ride in.
      Oops, I got lost in a nostalgia trip…

  6. “quad-core with double-the ram”

    “four-node-wonder”

    “light-weight 4-node”

    Are-we hyphenating every-thing in Hack-A-Day-articles-now? I understand they are probably not hiring English majors, but still.

    1. And thank god they aren’t. If you want a poorly written article with multiple grammatical errors, just pick up a local newspaper. The Hackaday staff does a fine job with their writing, all things considered; yes, they still provide wrong info from time to time and do silly stuff, but it is far better than most aggregated news sources.

      1. Agreed. Although newspapers are nearly dead, they were called fishwrap for a reason. I’d much rather read a tech article written and/or edited by a knowledgeable tech person than an English major attempting to write a tech article that makes sense. I’m a former tech editor married to an English major, and I worked at a place where they attempted to make a tech person out of an English major; it didn’t work out well for him. I wonder if he’s even still in the publishing industry at all anymore. Keep up the good work, HaD staff; don’t hire any English majors as editors. The only advice I can give is to proofread it twice and have another person proofread it, or proofread it slowly 3 times and shove it out the door. Given the ongoing quick pace that hacks come out at, I understand deadlines and think you’re doing great. Just leave the English majors to their Brit Lit and out of the HaD stuff.

  7. “You’re a bit fuzzy about what a gigaflops is exactly? So were we…”
    I’m afraid you need to run round the quad 3 times at midnight with your underpants round your ankles.

  8. Having some experience with real-world simulation: the Pi2 is not only waaaay too slow, but the leaner instruction set compared to a CISC CPU will also harm speed. The biggest issue, however, is RAM: even a small (modern) dataset can use tens of GB per core… PER CORE. The RPi2 is limited to 256 MB/core, which is most likely barely enough to do much of anything beyond the simplest of geometry and conditions.

  9. “Look, there’s nothing fancy to see in [James’] project announcement.”
    [translation]
    –=Our smart friend James wants to do something…==-
    *reads intently, studies in analytical discretion, grunts in approval*

    “Yet”.
    [translation]
    –=He wants to go to the bathroom since he is planning on eating some taco bell. Watch out now, he will use the extra spicy sauce. Since he is smart pay close attention to any revelations on how to wash our hands and even wipe.=–

    Wow. Oh wow. We are all quivering in anticipation now. You got us.

    HOW DOTH ARTICLE FORMED? I HAS PAPER NAPKIN WITH SCRIBBLES OF A “FM” CLOUD AND A BOOGER. MAEK ME HACKADAY SHOWCASE.

    http://www.researchgate.net/profile/James_Guthrie5
    And holy hell.. The article has been downloaded TWO times.

    http://personal.strath.ac.uk/james.guthrie/jabopi/
    “and the cost will likely be met by the author.”

    Ahhh… There it is. You know, he should have just done a Kickstarter and promised to copyleft the documentation on how to build it.
