Parallel Processing Was Never Quite Done Like This

Parallel processing is an idea that will be familiar to most readers. Few of you will be reading this on a device with only one processor core, and quite a few of you will have experimented with clusters of Raspberry Pi or similar SBCs. Instead of one processor doing tasks sequentially, the idea goes, take a bunch of processors and hand out the tasks to be done simultaneously.
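As a minimal sketch of that idea (it is not taken from the linked project), the Python below runs the same list of tasks first on one worker and then on a small pool. The `square` task and both function names are illustrative assumptions. Note that threads in CPython mainly help with I/O-bound work; for CPU-bound tasks, `ProcessPoolExecutor` from the same module offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor


def square(n: int) -> int:
    """Stand-in for one unit of work (hypothetical task)."""
    return n * n


def run_sequential(tasks):
    """One worker handles every task, one after another."""
    return [square(t) for t in tasks]


def run_parallel(tasks, workers=4):
    """A pool of workers shares the tasks; map() keeps results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, tasks))


if __name__ == "__main__":
    tasks = list(range(8))
    print(run_sequential(tasks))
    print(run_parallel(tasks))
```

Both calls return the same results in the same order; only the way the work is distributed differs.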

It’s a fair bet, though, that few of you will have designed and constructed your own parallel processing architecture. [BB] sends us a link which, though it’s an old one, is interesting enough to bring you today: [Michael] created a massively parallel array of Parallax Propeller microcontrollers back in 2008, and he did so on a breadboard.

The Parallax Propeller is an 8-core RISC microcontroller from the company that had found success in the 1990s with the BASIC Stamp, the PIC-based board that was all the rage before Arduino came into the world. In the last decade it was seen as an extremely exciting prospect, but a high price and arcane development tools, compared to a new generation of low-cost and easy-to-code competitors, meant that it never quite caught on, and it remains today something of an intriguing oddity. So today’s value in this project lies not in something that you should run out and do yourselves, but instead in what the work tells us about the nuts and bolts of parallel processing architecture. It involves more than simply hooking up a load of chips and hoping for the best, and we gain some insight into the different strategies involved.

The Propeller certainly wasn’t the first attempt at a massively parallel microcontroller, and we doubt it will be the last. We’re certainly seeing microcontrollers with more than one core becoming more mainstream even in our community, but even with those how many of you have made use of the second core in your dual-core ESP32? Is a multicore microcontroller a solution searching for a problem, or will somebody one day crack it and the world will never be the same again? As always, the comments are below.

64 thoughts on “Parallel Processing Was Never Quite Done Like This”

      1. The Transputer lives on as the xCORE devices; buy them at Digilent. You can have a 4000 MIPS 32-core MCU for £25, and can parallel them up.

        Occam lives on as xC, and is a delight to use.

        1. Chuck Moore is a great scientist but a bad businessman. What you call a “pet project” needs money to roll, and this money can’t come from his pocket because the guy doesn’t have that much. There must be customers and investors to back it.

    1. Concerning the GreenArrays GA144, I would like to experiment with it but I’m not ready to pay 495 US$ for their EVB002 board. And the chip itself is a BGA type, which is not easy to work with.

          1. Verilog code for the Propeller has been available for a long time now at the Parallax web site. A version 2 Propeller was in development as of 2015 with 16 CPUs (cogs), a higher clock speed and other enhancements.

  1. Jenny,
    “will somebody one day crack it and the world will never be the same again?” One word: conservatism.

    Don’t forget XMOS multicore microcontrollers, a very interesting product. I worked with them a few years ago. The problem is that most users of µCs are used to working in C/C++. To work with an XMOS µC one has to learn the XC language and its multi-program model. The reason for the limited success of products like the XMOS ones, and even more so for the GreenArrays GA144, is conservatism.

    1. I certainly haven’t forgotten XMOS, I too have encountered them professionally. But they’re the spiritual successor to the Transputer, and I thought since I mention them in the linked article that would be appropriate.

    2. They are a niche product and quite successful as their chip is found in Amazon’s Alexa and automotive audio products.

      Conservatism has nothing to do with a lack of adoption. The Parallax Prop suffers from a bunch of shortcomings. The GreenArrays chip requires you to be a Forth guru to do anything with it. Worse, there is no real documentation that teaches a person how to code it. Typical Forther hostility at work.

      1. Bunker Mentality isn’t unique to Forth fanatics, they are just a little more extreme than most.

        Having to work in RPN notation might have something to do with that…

          1. And the odd “Cog” central hub to transfer data from one core to another. You can wait anywhere from 1 to 7 clock cycles for the cog to come around to the core you want to transfer to. The only other option is to assign IO lines to each core you want to transfer data, but that eats up IO.
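The waiting described in the comment above can be illustrated with a deliberately simplified model, which is an assumption for illustration and not the Propeller’s documented timing: a hub that visits 8 cogs in strict round-robin order, one slot at a time. The function name `hub_wait` and the slot granularity are hypothetical; the real access window length in clock cycles should be taken from the Parallax datasheet.

```python
def hub_wait(cog_id: int, now: int, n_cogs: int = 8) -> int:
    """Slots a cog must wait until the round-robin hub reaches it.

    Simplified model (assumption): at time slot `now`, the hub is serving
    cog number `now % n_cogs`, and advances by one cog per slot.
    """
    return (cog_id - now) % n_cogs


if __name__ == "__main__":
    # Cog 3 asks for the hub at every possible point in the rotation.
    waits = [hub_wait(3, t) for t in range(8)]
    print(waits)                    # one wait value per request time
    print(sum(waits) / len(waits))  # average wait in slots
```

In this model the wait depends only on where the rotation happens to be when the request is made, which is why code that lines its hub accesses up with the rotation sees shorter waits than code that does not.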

          2. It’s been a long time but I’ve worked with the Parallax Propeller. Everyone should take the Propeller out for a “spin” to get some experience with a commutating core micro-controller. Here’s a link to the page for the 40-PDIP part which has links to documentation etc.

            https://www.parallax.com/product/p8x32a-d40

            The 40-pin PDIP Propeller costs $8 for the raw part. It’s easy to breadboard a working system from there. The part is well documented with complete schematics of working systems. The Propeller has an 80MHz system clock, 20 MIPS per core (“cog”) x 8 cogs = 160 MIPS max. There are a maximum of 32 GPIO pins, but two pins are usually dedicated to serial programming via an external USB/UART dongle during development. There’s nothing “lousy” about the on-die memory itself, but it would be nice to have a bit more of it on a per cog basis. The chip does need an external EEPROM that holds the SPIN language interpreter and user-space. The cogs communicate over a 32-bit parallel commutating circular bus. There are two powerful timer/counter/PLLs per cog. One popular misconception is that the commutating bus slows down the individual cogs. This is generally not true; in fact, the part is designed to let all the cogs run independently in parallel. The interpreted high-level language is called SPIN. I found SPIN was easy to use, powerful, but slow. I think the SPIN GUI IDE is nice. There are C/C++ tools available now for the Propeller. The Propeller Assembler (PASM) is fully documented and powerful. For speed-intensive applications you run a supervisor SPIN program with PASM code inline loaded into the cogs and executed. I seem to remember that you can start/stop/reprogram the cogs with SPIN on the fly. The worst problem I encountered is the number of clock cycles required per instruction. The cogs are not RISC cores. I’m working from memory but I think the fastest PASM instructions took 4 clock cycles.
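The throughput figures quoted in the comment above follow from simple arithmetic, sketched here in Python. The function name `peak_mips` is hypothetical, and the calculation assumes every instruction takes the same number of clocks and that no cog ever stalls, so it gives an upper bound rather than a measured figure.

```python
def peak_mips(clock_hz: float, clocks_per_instruction: int, cores: int = 1) -> float:
    """Upper-bound instruction rate in millions of instructions per second.

    Assumes a fixed clocks-per-instruction count and no stalls (an idealisation).
    """
    return clock_hz / clocks_per_instruction * cores / 1e6


if __name__ == "__main__":
    # Figures from the comment: 80 MHz clock, 4 clocks per instruction, 8 cogs.
    print(peak_mips(80_000_000, 4))           # per cog
    print(peak_mips(80_000_000, 4, cores=8))  # whole chip
```

With the commenter’s numbers this reproduces 20 MIPS per cog and 160 MIPS for all eight cogs together.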

        1. What I’ve encountered with Forth fanatics is they seem to be mostly about adding stuff to the Forth language rather than actually using it to *do stuff*.

          Forth games? Forth word processor and other office software? Forth web browser? What’s the killer app written in Forth?

          1. In 2014, when I worked with XMOS products, it was not a niche product line, but the lack of acceptance forced them to find a niche to survive. My hope is that they gather enough momentum in their niche to get out of it and spread to a larger market. The product is worth it.

            Forth, Inc., created in the ’70s by Chuck Moore and Elizabeth Rather, is still alive in 2019. There must be something there. To know more about it visit https://www.forth.com/software-development-company/. In Britain there is MPE selling VFX Forth. So it seems that Forth is a sustainable business.

            The fact is that most people cling to what is most popular, and it is not necessarily what is best.
            It is momentum that counts, that’s all. Java and C/C++ have a lot of momentum, and all languages that have had success in recent years are those that have a look and feel familiar to C/C++ programmers.

      1. Moore’s law is an observation that transistor counts in newer chips double roughly every two years, while maintaining rough price parity with the older ones.

        The current problems in staying on this course mean that transistor resources are not becoming cheaper at the same rate.

        And since parallel systems are nearly always more resource intensive than non-parallel ones (a quad-core CPU usually needs 4x more transistors than a single-core CPU using the same architecture implementation, for rather logical reasons…), the end of Moore’s law wouldn’t be all that beneficial to parallel processing systems.

        The more important reason why parallel processing is on the upswing in consumer/enterprise products is the ever-increasing RAM access latency (in CPU clock cycles) associated with higher core clock speeds. This makes efficient prefetching more complex, usually resulting in less efficient use of memory bandwidth due to speculative prefetching of data when looking at branches in execution.

        1. Yes and no. Moore’s law relates to miniaturisation. Twice as many transistors in the same space, as a result, about the same price. So yes, cost falls, but at the edge of performance, more importantly, speed is increased, as the smaller geometry results in lower capacitances. You can always spend more money, but as the speed is not increasing at an exponential rate anymore, the only way left is to alter the architecture, add more processing happening in parallel.

          1. Yes, there is a bit of a lack of increasing clock speeds during the last 5-10 years. (Partly due to thermal issues, and due to diminishing returns in terms of memory latency and the slew of problems it brings with it, and how it plays together with the branching problem and so forth. (Not to mention latency between processors, but this isn’t all that applicable to single socketed systems.))

            But my main point was rather that there are also diminishing returns in terms of transistor costs, with the current struggle to live up to Moore’s law. That also affects the future of parallel systems, since an increase in parallel performance needs a similar increase in transistor count. (I.e., the cost of a billion transistors is not going to drop all that much in the next 2-4 years compared to how much it has fallen in the prior 2-4 years. Roughly speaking. Unless there is some “breakthrough”, which isn’t technically needed, as manufacturing processes tend to hover around the Moore’s law line per se. Though they have been more on the behind side of things lately.)

            Though, systems in the future will likely improve in more than one area. I.e., improve both serial and parallel performance, while likely also adding hardware acceleration of specific tasks. But none of this is a new thing. (It has been done since the ’80s.)

            So making a statement like “Now that Moore’s law is dead, the era of parallel processing can start.” is fairly misinformed, to be honest, since parallel computing has been done for decades and is also negatively affected by the struggle to live up to Moore’s law.

            Whether a parallel system is more or less affected than a serial system is all down to the specific architecture and hardware implementations, not to mention the application area.

  2. The problem, as always, with getting the most out of parallel processing is this: can the problem or app be mapped across multiple processors successfully, and can it benefit from them?

    It’s not as easy as you think and some problems don’t map out at all.

      1. Please explain why you say that. The Propeller does indeed perform true parallel processing; each cog is independent, has its own memory for code and data, and doesn’t need to wait for anything unless there’s a need to use shared resources, like a shared memory for coordination which is often not needed except for fleeting moments in an application.

  3. I wonder if BBCode is supported here:

    A-----B
    |\   /|
    | E-F |
    | | | |
    | G-H |
    |/   \|
    C-----D

    After reading the above I’m now thinking about the new RPi4B, well 8 of them, with 2 external USB 3.0 gigabit NICs connected to each, configured in an eight-node hypercube network. It is funny how brans work.
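The eight-node network in the diagram above is a 3-dimensional hypercube, and its wiring can be sketched in a few lines of Python. This assumes the usual convention of numbering nodes 0 to 7 and linking any two nodes whose numbers differ in exactly one bit; the function names are hypothetical. Each node then needs three links, which would match two USB NICs plus the Pi’s onboard Ethernet port, if that is what the commenter had in mind.

```python
def hypercube_neighbors(node: int, dimensions: int = 3) -> list[int]:
    """Peers of `node`: flip each of its address bits in turn."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def hypercube_links(dimensions: int = 3) -> list[tuple[int, int]]:
    """Every cable in the network, each listed once as (low, high)."""
    links = set()
    for node in range(2 ** dimensions):
        for peer in hypercube_neighbors(node, dimensions):
            links.add((min(node, peer), max(node, peer)))
    return sorted(links)


if __name__ == "__main__":
    for node in range(8):
        print(node, hypercube_neighbors(node))
    print(len(hypercube_links()), "links in total")
```

For three dimensions this gives 12 links, the same number of edges as the cube drawn in the comment.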

    1. “It is funny how brans work.”

      Apparently the rough texture of indigestible material acts as a mild irritant, so the G-I tract tries to expel it as quickly as possible :)

      1. I really wish that there was an edit facility, but at the end of the day typos happen, especially with an old keyboard on its last legs (7 keys are worn blank and the springs are almost dead).

  4. I use the Prop chip a lot. It’s a beast. We were doing long range autonomous boats with it 10 years ago. Being as it came out roughly at the same time as the Arduino, wasn’t much more expensive, and the software toolchain was friendlier to set up at the time, I don’t know why it didn’t become a mainstay.

    1. The Propeller would have changed the world if it had come with C from the beginning.
      Even today $THEY treat C like a stepchild. A situation to cry oceans of tears…
      …or just change the microcontroller!

      Last night I tried to get Propeller-GCC compiled again after an OS upgrade. Looks like I need to find new workarounds, tricks and hacks again to make that happen. That’s sooooooooooo… adrenaline… you know what I mean… currently I really am thinking of giving away all my Propeller boards. I’m so fed up with this situation.

      1. Funny, I had to re-write a display driver in Spin because Parallax is pretty much only supporting C these days for a lot of their hardware. Their Blockly stuff is all C based libraries.

        I love the P1 and have one of the P2 Engineering Samples. It is so freaking fast!

  5. Get thee to a copy editor (or at least a grammar checker).
    Read Strunk & White.

    Seems like we are seeing a lot of parallel processing bits.
    Graphics processors have been doing it for decades, and we are getting these
    add-on boards for various computation tasks from GPU makers.

    (Note difference between parallel processing and multiprocessing and multi-computing.)

  6. Worth mentioning, the Propeller 2 is coming out within a couple of months; some of us have the first gen silicon which turned out to have some bugs but was workable enough to do early dev work on. P2 will blow P1 out of the water with 512K instead of 32K, 160 MHz 2-cycle instruction instead of 80 MHz 4-cycle, native support for what we once called LMM executed directly from Hub RAM and burst mode without branches as fast as Cog RAM execution via the “eggbeater.” Add in native support for HDMI and USB and it’s going to be up to date.

    1. Doesn’t matter unless Parallax steps up and officially supports something other than Spin2 for a developer tool chain. Really, another proprietary language. Wow, that’s a great way of impressing potential customers.

      How about GCC for the P2 – oh yeah, crickets from Parallax. They aren’t about to invest any time or $$$ on it. And really, what does it say about a company when it won’t even support a C/C++ compiler for its new processor? It’s an important question for those who write C/C++ for embedded controllers.

      BTW the documentation better be rock solid for the P2, it’s a bizarre processor in a lot of ways, user FAQ’s and Chip notes aren’t enough. Does Parallax even have anyone doing documentation for the P2?

      1. Zerg, your post suggests that the decision (or not) from Parallax to support GCC for P2 is somehow an elective choice. It’s not. After 13 years of R&D, several fabrications, we don’t have a lot of financial latitude to port our GCC to P2 the way it should be done. We’ll be very fortunate to finish the hardware as it’s a pretty amazing accomplishment that we’ve gotten this far. Our outcomes are based on the realities of boot-strapping this project. We did this with retained earnings, hard work, and no external investors. This will let us be true to our customers.

        The P2 has been a community effort and we’re going to rely on the community as much as possible to help us with Spin/PASM drivers, MicroPython, and documentation. We will provide the core documentation, of course.

        Happy to be very open with everybody about how we arrived where we are with the P2, what you can expect, and what we can’t readily do internally.

        Join us! – Ken

  7. Nothing wrong with the Propeller chip. 8 32-bit RISC cores give truly parallel operation. Builtin support for composite video and VGA with a few resistors. Back when it was released in 2006 it was the only? micro with 32KB of internal RAM (actually there is 32 + 8*2 = 48KB).

    Ever wanted a micro with 16 UARTS? What about 16 individual I2C? Or multiple SPI?
    The prop can do any mix of these simple interfaces, all under software control.

    There are about 120 engineering samples of the P2 in the wild. It’s again 8 32-bit cores, 2 clock instructions. Each core has a dedicated 4KB (up to 2K of that may be addressed as 512x32bit registers), and 512KB of shared RAM between the cores. Cores can execute code from both their dedicated RAM and the shared RAM. Pairs of cores can share 2KB of their dedicated RAM. There are 64 I/O pins, each with ADC, varying pull-up and pull-down values. Each pin has dedicated smart-pin controls. We have successfully overclocked the ES chips at 350MHz. Inbuilt DACs allow VGA connections using 5 pins and I have driven my 24” 1080×1920 monitor with coloured text 240×135 lines IIRC. Some are driving multiple VGA monitors. It can boot directly from SD Cards with or without FAT32.

    There are dedicated streamers that can stream data in and out of the chip – we use that to stream out VGA in the background of a core. How would you like to visualise your code in memory on a VGA monitor. One of the guys has done just that, with the VGA core just displaying a big chunk of memory real-time, while the “program” runs real-time on another core. FWIW he was watching micropython running.

    The boot ROM contains, in addition to boot from external SPI FLASH, SD Card, or Serial with autobaud and download, TAQOZ, which is an interactive FORTH implementation, and a simple interactive Monitor/Debugger.

    In the last few days, Catalina C is now running. Micropython is running limited code. And there is a general compiler that can take C, Basic, PASM, Spin1 and output PASM. (PASM is prop assembler). We have soft-USB running.

    There is inbuilt hardware for polar-Cartesian conversion, multiply, divide, CORDIC, CRC, etc.

    Peripherals are again soft, on any pin, and as many as you want, all aided by the smart pins. So if you want 32 UARTS you can have them!

    ETA for production chips is around August.

    Full disclosure – I am just a customer/hobbyist/user of propeller chips. I have no other relationship with Parallax.

  8. The Propeller (prop as we affectionately call it) is a true 8 core 32-bit RISC CPU with 2KB dedicated RAM/Registers per core and shared 32KB RAM and 32KB ROM. It runs at 80MHz with mostly 4 clocks/instruction although I overclock to 96-104MHz.
    Each core can access any of the 32 I/O, do composite video or VGA, counters/timers, etc. The prop utilises soft peripherals which means you can have 16 UARTs, 16 separate I2C busses, multiple SPI, etc. It also comes in 40 pin DIP as well as QFP44.
    At the time of its release in 2006, I don’t know of any micro that had 48KB of internal RAM.

    There are ~120 P2 Engineering Samples in the wild. There are a few bugs which are being corrected. The P2 has 8 true cores (not threads) with 4KB of dedicated RAM each, of which 2KB can also be used as 512 32-bit registers. Each core can access the 512KB of shared RAM. Code can run from both the dedicated 4KB and shared 512KB RAM. Each core can access all 64 I/O. Each pair of cores can share 2K of its dedicated RAM with each other. The clock speed is rated at 160MHz (175MHz on the next silicon IIRC) but we have been successfully overclocking at 350MHz! Each instruction mostly takes 2 clocks. Each of the 64 I/O pins has programmable smartpin hardware, and numerous pull-up and pull-down options, as well as ADC on every pin! There are inbuilt DACs enabling VGA to be output on 5 pins (RGB and H/V) without any external hardware/resistors. HDMI can be done too.

    Each core can execute code from either its own 4KB dedicated RAM or the shared 512KB RAM. Each core is deterministic, meaning code can be accurately timed, a task that cannot be achieved with most modern processors.

    There is hardware for multiply and divide, CORDIC, cartesian-polar conversion, rotation, etc. There are streamers which can stream data in/out of the P2. I have VGA running to my 24″ 1080×1920 monitor with color text of 240 chars x 135 lines. Some have dual VGA monitors running. One guy has a monitor displaying the shared RAM watching micropython executing in a separate core.

    The P2 can boot directly from an SD Card with or without FAT32, from FLASH, from Serial with download capability and autobaud, or from the internal ROM which has TAQOZ – a Forth implementation that runs interactively on the serial port, or a simple inbuilt Monitor/Debugger. We have a compiler which can take C, Spin, Basic and PASM (Propeller assembly) and compile to P2 PASM. Catalina C is now working in the last few days.

    Production P2 chips are expected in August.

    Disclosure: I don’t work for Parallax. I am just a happy customer/hobbyist/user.

    1. In August we will have about 2,000 P2 chips in our inventory. Many of them will be mounted on the P2 Evaluation Board and some individual units will be for sale. In October we’ll have production volumes (at least that’s how it looks now). But first, before any of this happens, we need to verify the final P2 – which will happen late July! – Ken

  9. Back in 2005 I did use a micro with 48k RAM, the LPC2148 ARM. Back in 2005 I wrote a Forth for it and made it generate VGA, audio, and handle FAT16 SD cards etc. It was a self-contained PC that I called “noPC”. I entered this in the Philips ARM Design Contest then and claimed 2nd prize even though I never used any of the sponsors’ compilers or boards.

    The reason I say this is because I then abandoned ARM chips, and many other chips after I discovered the Propeller in 2006. Even though I continue to evaluate the latest ARM chips etc I haven’t looked back since and have only used the Prop chip in scores of commercial and industrial designs even implementing FTP and web servers on them.

    The P2 is a fantastic chip which I have been using in engineering sample form for the past year and I have made hardware and software and open-sourced, including a datasheet for the P2. It blows the old P1 Prop out of the water and I regularly run my chips at 300MHz. My P2s can run as stand-alone PCs with VGA and keyboard, and 128GB FAT32 SD cards which the P2 itself can format to actual FAT32, and I can play music and videos and view images etc. However I am still only using 2 cores and one of those is for VGA but sometimes I run up to 4 cores. The P2 core is cheap, around $1 :) but since there are 8 of them I expect the whole 8-core P2 price to be around $8…$10.

    BTW, I cringe every time I see BB (though not for many years) and his “Parallel Computer” on breadboards. This guy is a breadboard colored wire showman and his show pony never even blinked an LED or said Hello World. Please disregard any of his stuff as representative of Parallax or the Propeller. I’ve always considered him “light entertainment” but not in any nasty way. I’m sorry I even had to say this just to set the record straight.

  10. One of the main reasons I use the esp32 is the 2nd core, and I suspect others are the same – so the answer to “you have made use of the second core in your dual-core ESP32?” is “yes!”.

    It simply makes sense to have time sensitive stuff on one core and your general logic on another. In fact, I really want the esp32 to go to 4 cores, as then one could be doing wifi, the other bluetooth, another time sensitive stuff, and the last the overall logic and glue…

  11. I agree that the initial Propeller nearly died for lack of language support. The architecture was odd, the memory was limited and the Spin language (which was built into the chip) was just a step too far for most mainstream users. Each processor could only access the RAM 1/8th of the time in a round-robin fashion. But even so it was an awesome and innovative design that found a niche with hobbyists (like me) who had always wanted to experiment with low-cost parallel processing. The things people could do even with that limited chip and its 32KB of RAM were truly awesome. I particularly liked the fact that it had no interrupts – because with 8 processors to throw around it didn’t really need them!

    But the new Propeller 2 is a completely different proposition. The current version has 512KB of RAM, and because of yet another innovative design, each of the 8 processors can now access that RAM at full speed. It now has interrupts (I’m not sure I like that, but I can see that some people will) and a bunch of new features that even the early adopters have not had time to explore yet – like the ability to have two processors “paired” to tackle a task, communicating via dedicated dual-ported shared memory space, separate to the main RAM.

    And, importantly, Parallax seems to be taking a different tack on languages this time. There is not even a version of the Spin language for the chip yet (although one is coming). But it will not be built into the chip. Instead, this time around Parallax has released the chip “into the wild” (so to speak) for independent developers to support. So far we have 2 “homegrown” C compilers (disclaimer: I wrote one of them!), a Forth compiler, and a bunch of other languages. I am sure there will eventually be a GCC toolchain – but I’m not sure we want that too early. Putting a GCC toolchain on this beast at this stage would be like having a towbar option on a sports car. Yes, you can then sell it on the basis that it can tow a caravan, but that’s not why you would buy one! And even without GCC, this chip will be launched with far better tool support than the original Propeller was.

    1. Every mcu released has a GCC tool chain released with it. It’s basically expected and what people look for when considering a new mcu.

      BTW GCC ports can be very polished and slick. Rowley Crossworks does it well. Parallax ought to ask them to do a port for the P2.

      It’s a no brainer for Parallax to support it but for some bizarre reason they refuse to support anything except Spin.

      And Spin is no selling point. Schools don’t want it and neither does industry. Parallax ought to be at least supporting Python and GCC. I’m no fan of Python but it’s popular among educators and can be a selling point. GCC support is a must so engineers can tell their PHB’s ‘see it has GCC support’.

    1. You can do the same with a $5 Raspberry Pi Zero and do it better. The fact is you don’t use an Arduino for video; you feed data from the Arduino to a Raspberry Pi or tablet that can do a far better job at video display and data management.

      1. The Pi has a full blown OS running on a SoC implementation. P2 is a microcontroller, not a SoC. The fact that you compare the P2 to a Pi speaks volumes to its capability. It’s a microcontroller with 512K of RAM.

  12. Well Zerg, while you seem intent on bagging, there is a lot you just don’t know about the P2 obviously, but you do seem to be interested enough to comment. While I can playback VGA video and audio, the point of that is that I still have all those cores and smart pins free for real-time tasks.

    In a tiny nutshell the P2 combines the power of 8 real-time cores and 64 smart pins so that it can handle real life embedded applications that I might be able to do with other chips. However I can do it much easier and faster and better with just one chip, without the headaches and actually have fun doing it.

    1. I went the propeller route and couldn’t find an adequate book to teach it.

      I bought a getting started kit and was told it was the wrong kit after it was recommended by the forums.

      I kept getting the wrong power supply sent to me and spent half the summer waiting for the product to come in the mail.

      Then I was told by an engineer that it is a course and now they push the block language that I am not interested in.

      I called for help and was told all the workers were at a show.

      The real question is why should I shell out money for kits that are more expensive for an obscure language like Spin that no one uses outside of Parallax when I could be learning C, Python or Linux which is in vogue?

      Compare the prices of the Arduino to the Propeller and I’m getting gouged as a customer with ineffective support.

      The reason block language is pushed is because they don’t expect people to learn.

      I bought their product and had expectations that were not met, so why should I buy their updated products and extend credit when they didn’t give me something with documentation that I could use?

      1. I am sorry to hear you had a bad experience. I guess you and I had very different experiences with Parallax. When I decided to go the Propeller route, I purchased their Education Kit. It came with the DIP-40, regulators, bypass caps, crystal, bread board,LEDs, etc and a nice book that started with the “Hello, World!” of microcontrollers, a project to blink the LED.

        I have never had to contact customer support, as the forums they host are a wealth of high quality information. The friendliest forums of that type I have seen.

        I picked up the Propeller Manual and it wasn’t long before I had a dozen or so props as a part of a 200+ channel DMX universe running 110V Christmas lights.

        I am not a huge fan of Blockly for myself, but I did see my 8 YO daughter take to it at the Museum of Computing History in Mountain View, CA. She picked it right up and was programming. I thought that aspect of it was pretty cool. It makes it accessible to anyone at any skill level, even 8-year-olds.

        Spin is pretty much a Python work-alike. If you want to use C, go for it. There are multiple C compilers. Want to learn assembly? PASM is about the easiest ASM to learn. Very straightforward.
