Musings On A Good Parallel Computer

Until the late 1990s, the concept of a 3D accelerator card was something generally associated with high-end workstations. Video games and kin would run happily on the CPU in one’s desktop system, with later extensions like MMX, 3DNow!, and SSE providing a significant performance boost for games that supported them. As 3D accelerator cards (now commonly called graphics processing units, or GPUs) became prevalent, they took over almost all SIMD vector tasks, but one thing they’re not good at is being a general-purpose parallel computer. This really ticked off [Raph Levien], and it inspired him to air his grievances.

Although the interaction between CPUs and GPUs has become tighter over the decades, with PCIe in particular being a big improvement over AGP and PCI, GPUs are still terrible at running arbitrary computing tasks, and even PCIe links remain glacial compared to communication within the GPU and CPU dies. With the introduction of asynchronous graphics APIs, this divide became even more pronounced. [Raph]’s proposal is to invert this relationship.

There’s precedent for this already, with Intel’s Larrabee and IBM’s Cell processor merging CPU and GPU characteristics on a single die, though developers struggled with such a new kind of architecture in both cases. Sony’s PlayStation 3 was ultimately forced to add a discrete GPU because of these issues. There is also the DirectStorage API in DirectX, which bypasses the CPU when loading assets from storage, effectively adding CPU features to GPUs.

As [Raph] notes, so-called AI accelerators have these characteristics as well, often with multiple SIMD-capable, CPU-like cores. Maybe the future is Cell after all.

So What Is A Supercomputer Anyway?

Over the decades, many terms have been coined to classify computer systems, usually when they were adopted in new fields or when technological improvements caused significant shifts. While the very first electronic computers were very limited and often not programmable, they would soon morph into something that we’d recognize today as a computer, starting with World War 2’s Colossus and ENIAC, which saw use in cryptanalysis and military weapons programs, respectively.

The first commercial digital electronic computer wouldn’t appear until 1951, however, in the form of the Ferranti Mark 1. These 4.5 ton systems mostly found their way to universities and kin, where they’d find welcome use in engineering, architecture, and scientific calculations. That kind of number-crunching became the focus of new computer systems, making them effectively the equivalent of a scientific calculator. Until the invention of the transistor, the idea of a computer being anything but a hulking, room-sized monstrosity was preposterous.

A few decades later, more computing power could be crammed into less space than ever before, including ever-higher-density storage. Computers were even found in toys, and amidst a whirlwind of mini-, micro-, super-, home-, minisuper-, and mainframe computer systems, one could be excused for asking the question: what even is a supercomputer?

Continue reading “So What Is A Supercomputer Anyway?”

Raspberry Pi Pico Parallel Mandelbrot Computation

The Mandelbrot set, when visualized with some colors, is an interesting shape with infinite detail. While the patterns are immediately obvious to the human eye, anyone who’s rendered one can tell you that they’re pretty computationally expensive to produce. Fortunately, as with many things in graphics, rendering the Mandelbrot set can be easily parallelized.

That’s what [rak277] and [ir93] demonstrate in their RP2040-based final project. Computron, as they call it, is a network of Raspberry Pi Picos that work together to compute a visualization of the Mandelbrot set and show it on a VGA display. The Computron is made of two or more “math units” and one “projection unit”. The math units communicate with the projection unit over a shared I²C bus to first divide the workload and then compute their share of the work.
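The reason the work divides so cleanly is that the escape-time iteration for each pixel is completely independent of every other pixel. Here is a minimal Python sketch of that row-splitting idea, purely an illustration rather than the Computron’s actual firmware; the frame size, iteration cap, and worker count are made-up values.

```python
from concurrent.futures import ProcessPoolExecutor

WIDTH, HEIGHT, MAX_ITER = 320, 240, 64  # hypothetical frame size and iteration cap

def escape_time(cx, cy):
    """Iterate z = z^2 + c and count steps until |z| exceeds 2."""
    zx = zy = 0.0
    for i in range(MAX_ITER):
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
        if zx * zx + zy * zy > 4.0:
            return i
    return MAX_ITER

def render_rows(rows):
    """Compute one worker's share of the image as (row, iteration counts) pairs."""
    out = []
    for y in rows:
        cy = -1.2 + 2.4 * y / HEIGHT
        out.append((y, [escape_time(-2.0 + 3.0 * x / WIDTH, cy) for x in range(WIDTH)]))
    return out

if __name__ == "__main__":
    workers = 4  # stand-in for the number of "math units"
    shares = [range(i, HEIGHT, workers) for i in range(workers)]  # interleave the rows
    with ProcessPoolExecutor(max_workers=workers) as pool:
        strips = list(pool.map(render_rows, shares))
    # A "projection unit" would now reassemble the rows and push pixels out to VGA.
    print(sum(len(strip) for strip in strips), "rows computed across", workers, "workers")
```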

This project shows both the strengths and limitations of parallel computation. It makes use of multiple math units on a highly parallelizable workload, but as more math units are added there are diminishing performance gains due to the increased communications load on the network, which [rak277] and [ir93] suspect to be the current bottleneck in the Computron.

If you’re fresh out of Pi Picos and don’t mind waiting a while, you could always crank out a Mandelbrot set on your trusty Atari 800 in BASIC.

Parallel Computing On The PicoCray RP2040 Cluster

[ExtremeElectronics] cleverly demonstrates that if one Raspberry Pi Pico is good, then nine must be awesome. The PicoCray project connects multiple Raspberry Pi Pico microcontroller modules into a parallel architecture, leveraging an I²C bus to communicate between nodes.

The same PicoCray code runs on all nodes, but a grounded pin on one of the Pico modules indicates that it is to operate as the controller node.  All of the remaining nodes operate as processor nodes.  Each processor node implements a random back-off technique to request an address from the controller on the shared bus. After waiting a random amount of time, a processor will check if the bus is being used.  If the bus is in use, the processor will go back to waiting.  If the bus is not in use, the processor can request an address from the controller.
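The arbitration is simple enough to sketch in a few lines. Here is a rough Python illustration of the random back-off idea, not the PicoCray firmware itself; bus_busy() and request_address() are made-up stand-ins for the real I²C traffic.

```python
import random
import time

def bus_busy():
    """Stand-in: the real node would sample the shared I2C bus for ongoing traffic."""
    return random.random() < 0.3  # pretend the bus is busy about 30% of the time

def request_address():
    """Stand-in: the real node would ask the controller for an address over I2C."""
    return random.randint(0x10, 0x70)

def acquire_address(max_wait_ms=50):
    """Random back-off: wait a random time, retry while the bus is busy."""
    while True:
        time.sleep(random.randint(1, max_wait_ms) / 1000.0)  # random wait
        if bus_busy():
            continue                  # someone else is talking, so back off again
        return request_address()      # bus looks free, ask the controller

print("Got address:", hex(acquire_address()))
```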

Once a processor node has an address, it can be sent tasks from the controller node.  In the example application, these tasks involve computing elements of the Mandelbrot set. The particular elements to be computed in a given task are allocated by the controller node, which later collects the results from each processor node and aggregates them for display.

The name for this project is inspired by Seymour Cray. Our Father of the Supercomputer biography tells his story, including why the Cray-1 supercomputer was referred to as “the world’s most expensive loveseat.” For even more Cray-1 inspiration, check out this Raspberry Pi Zero Cluster.

Parsing PNGs Differently

There are millions of tiny bugs all around us, in everything from our desktop applications to the appliances in the kitchen: hidden, arbitrary conditions that cause unintended outputs and behaviors. There are many ways to find these bugs, but one we don’t hear about very often is finding a bug in your own code, only to realize someone else made the same mistake. For example, [David Buchanan] found a bug in his multi-threaded PNG decoder and realized that Apple’s PNG decoder had the same bug.

PNG (Portable Network Graphics) is an image format, just like JPEG, WebP, or TIFF, that was designed to replace GIF. After a short header, the rest of the file is entirely chunks. Each chunk is tagged with a four-letter identifier, and a few chunk types are critical. The essential sections are IHDR (the header), IDAT (the actual image data), PLTE (the palette information), and IEND (the last chunk in the file). Compression is via the DEFLATE method used in zlib, which is inherently serial. If you’re interested, there’s a convenient poster about the format from a great resource we covered a while back.
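To get a feel for how approachable that chunk layout is, here is a small, single-threaded Python sketch that walks a PNG’s chunks and verifies their CRCs. It is a toy parser for illustration only, nothing like [David]’s multi-threaded decoder.

```python
import struct
import sys
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the eight-byte header every PNG starts with

def walk_chunks(path):
    """Print every chunk's type, length, and whether its CRC checks out."""
    with open(path, "rb") as f:
        if f.read(8) != PNG_SIGNATURE:
            raise ValueError("not a PNG file")
        while True:
            head = f.read(8)
            if len(head) < 8:
                break                                    # ran off the end of the file
            length, ctype = struct.unpack(">I4s", head)  # 4-byte length, 4-letter type
            data = f.read(length)
            (crc,) = struct.unpack(">I", f.read(4))
            ok = zlib.crc32(ctype + data) == crc
            print(f"{ctype.decode('ascii')}: {length} bytes, CRC {'ok' if ok else 'BAD'}")
            if ctype == b"IEND":
                break                                    # IEND is always the last chunk

if __name__ == "__main__":
    walk_chunks(sys.argv[1])
```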

Continue reading “Parsing PNGs Differently”

Linux-Fu: Parallel Universe

At some point, you simply run out of processing power. Admittedly, that point keeps getting further and further away, but you can still get there. If you run out of CPU time, the answer might be to add more CPUs. Sometimes, though, there are other bottlenecks like memory or disk space. It is also likely that you have access to multiple computers. Who doesn’t have a few Raspberry Pis sitting around their network? Or maybe a server in the basement? Or even some remote servers “in the cloud”? GNU Parallel is a tool that lets you spread work across multiple tasks, either locally or on remote machines. In some ways, it is simple, since it looks sort of like xargs but with parallel execution. On the other hand, it has myriad options and configurations that can make it a little daunting to use. Continue reading “Linux-Fu: Parallel Universe”

Neural Nets In The Browser: Why Not?

We keep seeing more and more TensorFlow neural network projects. We also keep seeing more and more things running in the browser. You don’t have to be Mr. Spock to see this one coming. TensorFire runs neural networks in the browser and claims that WebGL allows it to run as quickly as it would on the user’s desktop computer. The main page is a demo that stylizes images, but if you want more detail you’ll probably want to visit the project page instead. You might also enjoy the video from one of the creators, [Kevin Kwok], below.

TensorFire has two parts: a low-level language for writing massively parallel WebGL shaders that operate on 4D tensors, and a high-level library for importing models from Keras or TensorFlow. The authors claim it will work on any GPU and, in some cases, will actually be faster than running native TensorFlow.

Continue reading “Neural Nets In The Browser: Why Not?”