Turing Pi 2: The Low Power Cluster

We’re not in the habit of recommending Kickstarter projects here at Hackaday, but when prototype hardware shows up on our desk, we just can’t help but play with it and write it up for the readers. And that is exactly where we find ourselves with the Turing Pi 2. You may be familiar with the original Turing Pi, the carrier board that runs seven Raspberry Pi Compute Modules at once. That one supports Compute Module versions 1 and 3, but a new design was clearly needed for the Compute Module 4. Not content with just supporting the CM4, the developers at Turing Machines have designed a 4-slot carrier board based on the NVIDIA Jetson pinout. The entire line of Jetson devices is supported, and a simple adapter makes the CM4 work. There’s even a brand new module planned around the RK3588, which should be quite impressive.

One of the design decisions for the TP2 was to use the mini-ITX form factor and a 24-pin ATX power connector, giving us the option of installing the board in a small computer case. There’s even a custom rack-mountable case being planned by the folks over at My Electronics. So if you want 4 or 8 Raspberry Pis in a rack mount, this one’s for you.
Continue reading “Turing Pi 2: The Low Power Cluster”

Parallel Processing Was Never Quite Done Like This

Parallel processing is an idea that will be familiar to most readers. Few of you will be reading this on a device with only one processor core, and quite a few of you will have experimented with clusters of Raspberry Pi or similar SBCs. Instead of one processor doing tasks sequentially, the idea goes, take a bunch of processors and hand out the tasks to be done simultaneously.

It’s a fair bet, though, that few of you will have designed and constructed your own parallel processing architecture. [BB] sends us a link which, though it’s an old one, is interesting enough to bring you today: [Michael] created a massively parallel array of Parallax Propeller microcontrollers back in 2008, and he did so on a breadboard.

The Parallax Propeller is an 8-core RISC microcontroller from the company that had found success in the 1990s with the BASIC Stamp, the PIC-based board that was all the rage before the Arduino came into the world. In the last decade it was seen as an extremely exciting prospect, but its high price and arcane development tools, compared with a new generation of low-cost and easy-to-code competitors, meant that it never quite caught on, and today it remains something of an intriguing oddity. So the value in this project lies not in something you should run out and do yourselves, but in what the work tells us about the nuts and bolts of parallel processing architecture. It involves more than simply hooking up a load of chips and hoping for the best, and we gain some insight into the different strategies involved.

The Propeller certainly wasn’t the first attempt at a massively parallel microcontroller, and we doubt it will be the last. We’re already seeing microcontrollers with more than one core becoming more mainstream even in our community, but even with those, how many of you have made use of the second core in your dual-core ESP32? Is a multicore microcontroller a solution searching for a problem, or will somebody one day crack it and the world will never be the same again? As always, the comments are below.

CUDA Is Like Owning A Supercomputer

The word supercomputer gets thrown around quite a bit. The original Cray-1, for example, operated at about 150 MIPS and had about eight megabytes of memory. A modern Intel i7 CPU can hit almost 250,000 MIPS and is unlikely to have less than eight gigabytes of memory, and probably has quite a bit more. Sure, MIPS isn’t a great performance number, but clearly, a top-end PC is way more powerful than the old Cray. The problem is, it’s never enough.

Today’s computers have to process huge numbers of pixels, video data, audio data, neural networks, and long-key encryption. Because of this, video cards have become what in the old days would have been called vector processors. That is, they are optimized to do operations on multiple data items in parallel. There are a few standards for using the video card’s processing power for computation, and today I’m going to show you how simple it is to use CUDA, the NVIDIA proprietary library for this task. You can also use OpenCL, which works with many different kinds of hardware, but as I’ll show you, it is a bit more verbose.
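Before you click through, it’s worth seeing just how little code a minimal CUDA program needs. The sketch below is not taken from the full article; it’s a generic, hedged example of the classic vector-add kernel, with invented names, that should build with nvcc on any CUDA-capable system.

```
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements; the GPU runs thousands of these in parallel.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;              // one million elements
    size_t bytes = n * sizeof(float);

    // Allocate and fill host-side buffers.
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2.0f * i; }

    // Allocate device buffers and copy the inputs over.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(da, db, dc, n);

    // Copy the result back and spot-check one element.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[12345] = %f (expected %f)\n", hc[12345], ha[12345] + hb[12345]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

The triple-angle-bracket launch syntax is the CUDA extension doing the heavy lifting: every thread runs the same kernel body on a different element, which is exactly the vector-processor behavior described above.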
Continue reading “CUDA Is Like Owning A Supercomputer”

Neural Nets In The Browser: Why Not?

We keep seeing more and more TensorFlow neural network projects. We also keep seeing more and more things running in the browser. You don’t have to be Mr. Spock to see this one coming. TensorFire runs neural networks in the browser and claims that WebGL allows it to run as quickly as it would on the user’s desktop computer. The main page is a demo that stylizes images, but if you want more detail, you’ll probably want to visit the project page instead. You might also enjoy the video from one of the creators, [Kevin Kwok], below.

TensorFire has two parts: a low-level language for writing massively parallel WebGL shaders that operate on 4D tensors, and a high-level library for importing models from Keras or TensorFlow. The authors claim it will work on any GPU and, in some cases, will actually be faster than running native TensorFlow.

Continue reading “Neural Nets In The Browser: Why Not?”

1000 CPUs On A Chip

Often, CPUs that work together operate on SIMD (Single Instruction, Multiple Data) or MIMD (Multiple Instruction, Multiple Data) principles, two categories from Flynn’s taxonomy. For example, your video card probably has the ability to apply a single operation (an instruction) to lots of pixels simultaneously (multiple data). Researchers at the University of California, Davis recently constructed a single chip with 1,000 independently programmable processors onboard. The device is energy efficient and can compute up to 1.78 trillion instructions per second.
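To make the SIMD half of that comparison concrete, here is a short, hedged CUDA sketch (the names are invented for illustration) in which thousands of GPU threads execute the same instruction stream, each on its own pixel. The KiloCore’s processors, by contrast, can each run an entirely independent program.

```
// One thread per pixel: the same "add and clamp" operation is applied to every
// element of the image at once (single instruction, multiple data).
// A typical launch: brighten<<<(count + 255) / 256, 256>>>(devPixels, count, 10);
__global__ void brighten(unsigned char *pixels, int count, int amount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        int v = pixels[i] + amount;
        pixels[i] = (v > 255) ? 255 : (unsigned char)v;   // clamp to the 8-bit range
    }
}
```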

The KiloCore chip (not to be confused with the 2006 Rapport chip of the same name) has 621 million transistors and uses special techniques to be energy efficient, an important design feature when dealing with so many CPUs. Each processor operates at 1.78 GHz or less and can shut itself down when not needed. The team reports that even when computing 115 billion instructions per second, the device only consumes about 700 milliwatts.
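For perspective, those two numbers pin down the energy budget directly: 0.7 watts divided by 115 billion instructions per second works out to roughly 6 picojoules per instruction.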

Unlike some multicore designs that use a shared memory area to communicate between processors, the KiloCore allows processors to directly communicate. If you are just a diehard Arduino user, maybe you could scale up this design. Or, if you want to make use of the unused power in your video card under Linux, you can always try to bring KGPU up to date.

Tote Boards: The Impressive Engineering Of Horse Gambling

Horse racing has been around since the time of the ancient Greeks. Often called the sport of kings, it was an early platform for making friendly wagers. Over time, private bets among friends gave way to bookmaking, and the odds of winning skewed in favor of a new concept called the “house”.

During the late 1860s, an entrepreneur in Paris named Joseph Oller invented a new form of betting he called pari-mutuel. In this method, bettors wager among themselves instead of against the house. Bets are pooled together, and the winnings are divided among the winning bettors. Pari-mutuel betting creates more organic odds than those set by a profit-driven bookmaker.
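To see why that takes serious real-time arithmetic, consider a simplified, illustrative example (the 15% “takeout” below is a made-up figure; real tracks deduct their own percentages for expenses, purses, and taxes). The payout per $1 staked on the winning horse is the total pool, minus the takeout, divided by the amount bet on that horse. If $10,000 has been wagered in total and $2,000 of it is riding on the eventual winner, each winning $1 ticket returns $10,000 × 0.85 / $2,000 = $4.25. Every incoming bet changes both of those numbers, and with them the displayed odds.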

Oller’s method caught on quite well. It brought fairness and transparency to betting, which made it even more attractive. But it takes a lot of quick calculations to show real-time bet totals and changing odds, and human adding machines presented a bottleneck. In the early 1900s, a man named George Julius would change pari-mutuel technology forever with an automatic totalisator that began life as a vote-counting machine built in his garage.

Continue reading “Tote Boards: The Impressive Engineering Of Horse Gambling”

Massively Parallel CPU Processes 256 Shades Of Gray


The 1980s were a heyday for strange computer architectures; instead of the von Neumann architecture you’d find in one of today’s desktop computers or the Harvard architecture of a microcontroller, a lot of companies experimented with exotic parallel designs. While not used much today, these were some of the most powerful computers of their day and served as the main research tools of the AI renaissance of the 1980s.

Over at the Norwegian University of Science and Technology, a huge group of students (13 members!) designed a modern take on the massively parallel computer. It’s called 256 Shades of Gray, and it processes 320×240 pixel 8-bit grayscale graphics like no microcontroller could.

The idea for the project was to create an array-based parallel image processor with an architecture similar to the Goodyear MPP formerly used by NASA, or the Connection Machine found in the control room of Jurassic Park. Unlike those earlier machines, the team implemented their array processor in an FPGA, giving rise to their Lena processor. This processor is in turn controlled by a 32-bit AVR microcontroller with a custom-built VGA output.

The entire machine can process 320×240 grayscale video at 10 frames per second. There’s a presentation video available (in Norwegian), but the highlight might be their demo of the Game of Life rendered in real time on their computer. An awesome build, and a very cool experience for all the members of the class.