PentaPico: A Pi Pico Cluster For Image Convolution

The five picos on two breadboards and the results of image convolution.

Here’s something fun. Our hacker [Willow Cunningham] has sent us a copy of their homework. This is their final project for the “ECE 574: Cluster Computing” course at the University of Maine, Orono.

It was enjoyable going through the process of having a good look at everything in this project. The project is a “cluster” of 5x Raspberry Pi Pico microcontrollers — with one head node as the leader and four compute nodes that work on tasks. The software for both types of node is written in C. The head node is connected to a workstation via USB 1.1 allowing the system to be controlled with a Python script.

The cluster is configured to process an embarrassingly parallel image convolution. The input image is copied into the head node via USB which then divvies it up and distributes it to n compute nodes via I2C, one node at a time. Results are given for n = {1,2,4} compute nodes.

It turns out that the work of distributing the data dwarfs the compute by three orders of magnitude. The result is that the whole system gets slower the more nodes we add. But we’re not going to hold that against anyone. This was a fascinating investigation and we were impressed by [Willow]’s technical chops. This was a complicated project with diverse hardware and software challenges and they’ve done a great job making it all work and in the best scientific tradition.

It was fun reading their journal in which they chronicled their progress and frustrations during the project. Their final report in IEEE format was created using LaTeX and Overleaf, at only six pages it is an easy and interesting read.

For anyone interested in cluster tech be sure to check out the 256-core RISC-V megacluster and a RISC-V supercluster for very low cost.

21 thoughts on “PentaPico: A Pi Pico Cluster For Image Convolution

  1. “It turns out that the work of distributing the data dwarfs the compute by three orders of magnitude.”
    An excellent project and an excellent learning exercise. Nice work. :) And this experience will inform his work for years. This is why doing these projects are so important, to learn. Core principled of science is reproducibility and not taking someone’s word for it. Detractors of these kinds of project say “what is the point, just get a cheap PC” and they are missing the point and the fun in building these mini clusters.

    Good work Willow Cunningham. :)

    1. Now you know why PCIe 7.0 is going to be so fast.

      Seems that another student could use this project as a springboard to implement faster data transfers via less conventional means.

    2. hahah yeah my heart was exploding with “that the work of distributing the data dwarfs the compute by three orders of magnitude!!!” but then there it was right there on the page

  2. But why waste time on hardware when the entire project could be done with QEMU on a single 8-core CPU. It’s not like it’s still 2008 and dual-core Wolfdale is the duck’s guts.

    1. 12 and 20 core CPUs are cheap. 14700 is around $200 if you go looking. Tough to beat for raw compute. Did a BMW Blender render in 24 seconds on the one I built. 14400 is probably the price/performance leader though. Although when I see a $154 12700KF (Amazon) or $170 Ryzen 7700 (SZCPU) it gives me pause.

        1. I was just saying if you are going QEMU why stop at 8 cores?
          8 is midrange, soon to be entry level.
          Also a modern $80 6 core like 8400F can probably do 8 cores worth of work.

    1. looks like deconvolution since it’s deblurring, but since deconvolution is convolution with the inverse point spread function, then I guess you can also call it ‘convolution with a kernel for sharpening’

    1. There’s enough (in this case unused) I/O to make an extremely high bandwidth custom communications channel between these chips, but I somewhat understand the desire to go with something standard, to reduce complexity of the overall project.

    2. Eyup
      QSPI, or even better if the project allows it, shared RAM module.
      Granted, one have to order the access, but that shouldn’t be too hard to do.

      But kudoes anyway to him, that’s a very nice project as is!

      1. I think a Pico can be used as a virtual RAM chip, so it’s possible that could handle the accesses.

        Also if the project scales up it might be better to use bare RP2040 (or the new chip), and share the power lanes. Imagine a SBC size device with 10-20 RP2040 chips on it 😂

    3. I think SPI with DMA would be nearly ideal for moving image rows between devices. Or maybe a 4 or 8 bit parallel with DMA. I did not see DMA in the docs but the M0 can do it.

Leave a Reply

Please be kind and respectful to help make the comments section excellent. (Comment Policy)

This site uses Akismet to reduce spam. Learn how your comment data is processed.