The 13.5 Million Core Computer

Having a dual- or quad-core CPU is not very exotic these days and CPUs with 12 or even 16 cores aren’t that rare. The Andromeda from Cerebras is a supercomputer with 13.5 million cores. The company claims it is one of the largest AI supercomputers ever built (but not the largest) and can perform 120 Petaflops of “dense compute.”

We aren’t sure about the methodology, but they also claim more than one exaflop of “AI computing.” The computer has a fabric backplane that can handle 96.8 terabits per second between nodes. According to a post on Extreme Tech, the core technology is a 3-plane wafer processor, WSE-2. One plane is for communications, one holds 40 GB of static RAM, and the math plane has 850,000 independent cores and 3.4 million floating point units.
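As a quick sanity check on those figures, the quoted numbers can be divided out to see how many wafers and how many floating point units per core they imply. This is only arithmetic on the values quoted above; the variable names are ours, and the wafer count is an inference rather than a figure taken from the post.

```python
# Figures quoted in the article
TOTAL_CORES = 13_500_000      # cores in the whole machine
CORES_PER_WSE2 = 850_000      # cores on one WSE-2 math plane
FPUS_PER_WSE2 = 3_400_000     # floating point units on one WSE-2

# Implied number of wafers (an inference, not a quoted figure)
wafers_estimate = TOTAL_CORES / CORES_PER_WSE2

# Floating point units available to each core
fpus_per_core = FPUS_PER_WSE2 / CORES_PER_WSE2

print(f"Implied wafer count: {wafers_estimate:.1f}")
print(f"FPUs per core: {fpus_per_core:.0f}")
```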

The data is sent to the cores and collected by a bank of 64-core AMD EPYC 3 processors. Andromeda is optimized to handle sparse matrix computations. The company claims that the performance scales “almost linearly.” That is, as you double the number of cores used, you roughly halve the total run time.
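To make the “almost linearly” claim concrete, here is a minimal sketch of ideal versus slightly imperfect scaling. The `efficiency` parameter and the 0.95 value are hypothetical illustrations, not numbers published by Cerebras.

```python
import math

def ideal_runtime(base_seconds, base_cores, cores):
    """Perfectly linear scaling: doubling the cores halves the run time."""
    return base_seconds * base_cores / cores

def scaled_runtime(base_seconds, base_cores, cores, efficiency=1.0):
    """Each doubling of cores gives a speedup of 2 * efficiency.

    efficiency is a hypothetical knob: 1.0 is perfectly linear,
    values just below 1.0 model "almost linear" scaling.
    """
    doublings = math.log2(cores / base_cores)
    return base_seconds / (2 * efficiency) ** doublings

# Compare a one-hour job under ideal and 95%-efficient scaling
for cores in (1, 2, 4, 8):
    ideal = ideal_runtime(3600, 1, cores)
    almost = scaled_runtime(3600, 1, cores, efficiency=0.95)
    print(f"{cores} cores: ideal {ideal:.0f} s, almost linear {almost:.0f} s")
```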

The machine is available for remote use and cost about $35 million to build. Since it uses 500 kW at peak run times, it isn’t free to operate, either. Extreme Tech notes that the Frontier computer at Oak Ridge National Laboratory is both larger and more precise, but it cost $600 million, so you’d expect it to be more capable.

Most homebrew “supercomputers” we see are more for learning how to work with clusters than trying to hit this sort of performance. Of course, if you have a modern graphics card, OpenCL and CUDA will let you do some of this, too, but at a much smaller scale.

The Fastest Fourier Transform In The West

An interesting aspect of time-varying waveforms is that by using a trick called a Fourier Transform (FT), they can be represented as the sum of their underlying frequencies. This mathematical insight is extremely helpful when processing signals digitally, and allows a simpler way to implement frequency-dependent filtering in a digital system. [klafyvel] needed this capability for a project, so they started researching the best method that would fit into an Arduino Uno. In an effort to understand exactly what was going on, they have significantly improved on the code size, execution time, and accuracy of the previous crown-wearer.

Computing a Fourier Transform directly in real time is a resource-heavy operation that needs more than an Arduino Uno can offer, so faster algorithms known as Fast Fourier Transforms (FFTs) have been developed over the years. On small microcontrollers, implementations often go a step further and exchange absolute precision for speed and size. [klafyvel] set about diving deep into the mathematics involved, as well as some low-level programming techniques, to figure out if the trade-offs offered in the existing solutions had been optimized. The results are impressive.
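The gap between a direct transform and a fast one can be sketched in a few lines. The following is a textbook recursive radix-2 FFT in floating-point Python next to a direct discrete Fourier transform. It is not [klafyvel]’s Arduino code, and it assumes the input length is a power of two.

```python
import cmath

def dft(x):
    """Direct discrete Fourier transform: roughly N*N multiplications."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * i * k / n) for k in range(n))
        for i in range(n)
    ]

def fft(x):
    """Recursive radix-2 FFT: roughly N*log2(N) multiplications.

    Assumes len(x) is a power of two.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

# Eight samples of a sine wave that completes two cycles in the window
samples = [0, 1, 0, -1, 0, 1, 0, -1]
spectrum = fft(samples)
print([round(abs(v), 3) for v in spectrum])
```

Both functions return the same spectrum up to rounding error; the FFT simply reuses intermediate results instead of recomputing them.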

Benchmarking results (in ms) showing the speed of the implementation versus the competition (ApproxFFT)

Not content with producing one new award-winning algorithm, [klafyvel] has documented on the blog a masterclass in really understanding a problem, with no fewer than four algorithms to choose from depending on how you rank the importance of execution speed, accuracy, code size, or array size.

Along the way, we are treated to some great diversions: how to approximate floats by their exponents (in French), how to control, program, and gather data from an Arduino using Julia, how to massively improve the speed of the code by using trigonometric identities, and how to deal with overflows when the variables get too large. There is a lot to digest in here, but the explanations are very clear and peppered with code snippets to make things easier. If you have the time to read through, you’re sure to learn a lot! The code is on GitHub.
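To give a flavor of the trigonometric trick, here is one common way identities save time in FFT code: the angle-addition formulas let you step through all the sine and cosine values you need with one `sin` and one `cos` call, followed by multiplications only. This is a generic sketch and not necessarily the exact identity used in [klafyvel]’s write-up; the function name `twiddles` is ours.

```python
import math

def twiddles(n):
    """Return (cos(k*theta), sin(k*theta)) for k = 0..n-1, theta = 2*pi/n.

    Uses the angle-addition identities
        cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
        sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
    so only one cos() and one sin() call are needed.
    """
    theta = 2 * math.pi / n
    c1, s1 = math.cos(theta), math.sin(theta)
    c, s = 1.0, 0.0  # cos(0), sin(0)
    out = []
    for _ in range(n):
        out.append((c, s))
        # Rotate the current angle by theta using multiplications only
        c, s = c * c1 - s * s1, s * c1 + c * s1
    return out

table = twiddles(64)
print(table[16])  # close to (0, 1): a quarter turn
```

Rounding errors accumulate slowly along the recurrence, which is one reason accuracy has to be weighed against speed.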

If you’re interested in FFTs, we’ve seen them before around these parts. Fill your boots with our collection of tagged projects.

A 3D printed cat treat dispenser on a table with a laptop in the background, a treat in its tray, and a cat on the left about to eat the treat.

Local IOT Cat Treat Dispenser

[MostElectronics], like many of us, loves cats, and so wanted to make an internet connected treat dispenser for their most beloved. The result is an ingenious 3D printed mechanism connected to a Raspberry Pi that’s able to serve treats through a locally run web application.

The inside of a 3d printed cat treat dispenser, showing the different compartments, shaft and wires running out the back.

On the software side, the Raspberry Pi exposes a RESTful API that one can connect to through a static IP. The API is implemented as a Python Flask application running under a standalone web server Python script. The web application itself keeps track of the number of treats left and provides a simple interface to dispense treats at the operator’s leisure. The RpiMotorLib Python library is used to control a 28BYJ-48 stepper motor through its ULN2003 controller module, and the motor rotates the inside shaft of the treat dispenser.
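The bookkeeping behind such an interface might look something like the sketch below. It is framework-agnostic on purpose: the class name `TreatDispenser`, its methods, and the `rotate` callback are all hypothetical, and it does not reproduce [MostElectronics]’ actual Flask routes or any RpiMotorLib calls. In a real build, a Flask view would call `dispense()` and the callback would drive the stepper.

```python
class TreatDispenser:
    """Tracks remaining treats and triggers the motor once per treat.

    `rotate` is a hypothetical callback that advances the drum by one
    compartment; on real hardware it would wrap the stepper driver.
    """

    def __init__(self, capacity, rotate):
        self.capacity = capacity
        self.treats_left = capacity
        self._rotate = rotate

    def dispense(self):
        """Dispense one treat if any remain; return a JSON-friendly dict."""
        if self.treats_left == 0:
            return {"ok": False, "treats_left": 0}
        self._rotate()
        self.treats_left -= 1
        return {"ok": True, "treats_left": self.treats_left}

    def refill(self):
        """Call after the compartments have been reloaded by hand."""
        self.treats_left = self.capacity
        return {"ok": True, "treats_left": self.treats_left}

# Stand-in for the motor: just record that a rotation was requested
rotations = []
demo = TreatDispenser(capacity=22, rotate=lambda: rotations.append("step"))
print(demo.dispense())
```

Keeping the motor behind a callback means the counting logic can be exercised on a laptop without any GPIO hardware attached.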

The mechanism to dispense treats is a stacked, compartmentalized drum, with two drum layers of food compartments that turn to drop treats. The bottom drum dispenses treats through a chute connected to the tray for the cat, leaving an empty compartment that the top drum can replenish by dropping its treats through a staggered opening. Each compartmentalized drum layer holds 11 treats, allowing for a total of 22 treats with two layers stacked on top of each other. One could imagine extending the treat dispenser by stacking even more drum layers.

Source code is available on GitHub and the STL files for the dispenser are available on Thingiverse. We’ve seen electronic cat feeders before, sometimes with escalating consequences that shake us to our core and leave us questioning our superiority.

Video after the break!


Fuel Cell Catalyst: Less Is More

A fuel cell is almost like a battery that has replenishable fuel. Instead of charging a battery with an electric current, you recharge a fuel cell with something like hydrogen, or you simply consume it from a tank much as an internal combustion engine consumes gasoline. However, fuel cells usually need a catalyst. It isn’t consumed in the reaction, but it is necessary, and many fuel cells use platinum, which is expensive. But what if you could use less catalyst and get a better result? That’s what researchers in Canada and the US are claiming in a recent paper. The key isn’t how much catalyst they are using, but rather the shape of the catalyst.

Of course, everyone wants to use less of the expensive catalyst, but polymer electrolyte fuel cells have had a particular problem: reducing the amount of catalyst causes a disproportionate drop in cell performance. This new approach uses a spherical catalyst support that improves the distribution and utilization of the catalyst.


Home-Built CPU Runs With Home-Built Toolchain

A few years ago, [Takaya Saeki] and fellow students at the University of Tokyo were given a very limited instruction during their ‘CPU exercise’ class, along the lines of:

Take this ray-tracing program written in OCaml and run it on your CPU implemented on an FPGA

Splitting into groups to cover the CPU, FPU, simulator tool, and compiler toolchain, the students started with designing a RISC ISA, then designed a CPU around that. You can follow along with the retrospective writeup of the class, then dive into the GitHub pages for each of the components of the system, although the commentary is mainly in Japanese. Hey, you can use Google Translate, right?

New Metric Prefixes Get Bigger And Smaller

It always fascinates us that every single thing that is made had to be designed by someone. Even something as simple as a bag and box that holds cereal. Someone had to work out the dimensions, the materials, the printing on it, and assign it a UPC code. Those people aren’t always engineers, but someone has to think it out no matter how mundane it is before it can be made. But what about the terms we use to express things? Someone has to work those out, too. In the case of metric prefixes like kilo, mega, and pico, it is apparently the General Conference on Weights and Measures that recently had its 27th session. As a result of that, we have four more metric prefixes to learn: ronna, quetta, ronto, and quecto.

Apparently, the new prefixes are to accommodate “big data,” which is rapidly producing more data than there are atoms in the Universe. They were actually proposed earlier in a slightly different form before being accepted at the conference. Apparently quecca is too close to a Portuguese swear word. So what do these actually mean? A QB (quettabyte) would be 10^30 bytes while an RB (ronnabyte) is only 10^27. So 1 QB would be 1,000,000 yottabytes (YB), the previous top of the scale.
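If you want to check the arithmetic, the prefixes are just powers of ten, so converting between them is a matter of subtracting exponents. A minimal sketch, with a table and helper name of our own choosing:

```python
from fractions import Fraction

# Power-of-ten exponent for a handful of SI prefixes,
# including the four new ones
SI_EXPONENTS = {
    "quetta": 30,
    "ronna": 27,
    "yotta": 24,
    "zetta": 21,
    "kilo": 3,
    "pico": -12,
    "yocto": -24,
    "ronto": -27,
    "quecto": -30,
}

def ratio(prefix_a, prefix_b):
    """How many units of prefix_b make up one unit of prefix_a."""
    diff = SI_EXPONENTS[prefix_a] - SI_EXPONENTS[prefix_b]
    # Fraction keeps the result exact for negative exponents too
    return Fraction(10) ** diff

print(ratio("quetta", "yotta"))  # yottabytes in one quettabyte
print(ratio("ronna", "yotta"))   # yottabytes in one ronnabyte
```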


A breadboard with a few DIP chips

Minimalist 6502 System Uses A CPU And Not Much Else

A central processing unit, or CPU, is the heart of any computer system. But it’s definitely not the only part: you also need RAM, ROM and at least some peripherals to turn it into a complete system that can actually do something useful. Modern microcontrollers typically have some or all of these functions integrated into a single chip, but classic CPUs don’t: they were meant to be placed on motherboards along with dozens of other chips. That’s why [c0pperdragon]’s latest project, the SingleBreadboardComputer, is such an amazing design: assisting its 6502 CPU are just four companion chips.

The entire system takes up just one strip of solderless breadboard. Next to the CPU we find 32 KB of SRAM, 32 KB of flash and a clock oscillator. The fifth chip is a 74HC00 quad two-input NAND gate, which is used as a very tiny piece of glue logic to connect everything together. Two of its NAND gates are used for address decoding logic, allowing either the ROM or RAM chip to be selected depending on the state of the CPU’s A15 line as well as blocking the RAM during the low phase of the system clock. The latter function is needed because the address lines are not guaranteed to be stable during the low phase and could cause writes to random memory locations.
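The address decoding described above can be simulated with nothing but a NAND function. The wiring below is a guess that matches the description, not a transcription of [c0pperdragon]’s schematic: it assumes active-low chip enables, ROM in the upper half of the address space (A15 high), and that the system clock is the 6502’s PHI2. The function and signal names are ours.

```python
def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 0 if (a and b) else 1

def decode(a15, phi2):
    """Return (rom_ce_n, ram_ce_n), both active low.

    Assumed wiring using two of the four gates in a 74HC00:
      gate 1 inverts A15 and drives the ROM enable directly,
      gate 2 enables RAM only when A15 is low AND the clock is high.
    """
    rom_ce_n = nand(a15, a15)        # low (selected) when A15 is high
    ram_ce_n = nand(rom_ce_n, phi2)  # low only if A15 low and clock high
    return rom_ce_n, ram_ce_n

# Print the full truth table
for a15 in (0, 1):
    for phi2 in (0, 1):
        rom_ce_n, ram_ce_n = decode(a15, phi2)
        print(f"A15={a15} PHI2={phi2} -> ROM /CE={rom_ce_n} RAM /CE={ram_ce_n}")
```

With this arrangement the RAM is never selected while the clock is low, so unstable address lines cannot trigger a stray write.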

The remaining two NAND gates are connected as an RS-flipflop in order to implement a serial output. This is needed because the CPU cannot keep its outputs in the same state for multiple clock cycles, which is required for a serial port. Instead, [c0pperdragon] uses the MLB pin, normally used to implement multiprocessor systems, to generate two-clock pulses, and stores the state in the flipflop for as long as needed. A few well-timed software routines can then be used to transmit and receive serial data without any further hardware.
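For anyone who has not built one, the flip-flop itself is easy to model: two cross-coupled NAND gates hold a bit until one of the active-low inputs is pulsed. The sketch below simulates only that latch behavior. How the pulses are derived from the CPU’s MLB pin is not modeled, and the class and signal names are ours.

```python
def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 0 if (a and b) else 1

class NandLatch:
    """RS flip-flop made from two cross-coupled NAND gates.

    Inputs are active low: pulse set_n low to store 1,
    pulse reset_n low to store 0, keep both high to hold.
    """

    def __init__(self):
        self.q, self.q_n = 0, 1

    def update(self, set_n, reset_n):
        # Re-evaluate both gates until the outputs stop changing
        for _ in range(4):
            q = nand(set_n, self.q_n)
            q_n = nand(reset_n, q)
            if (q, q_n) == (self.q, self.q_n):
                break
            self.q, self.q_n = q, q_n
        return self.q

latch = NandLatch()
trace = [
    latch.update(0, 1),  # set pulse
    latch.update(1, 1),  # hold
    latch.update(1, 0),  # reset pulse
    latch.update(1, 1),  # hold
]
print(trace)
```

The held output is what lets a serial line stay at one level across many clock cycles, even though the pulse that set it was short.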

Currently, the only software for this system is a simple demonstration that sends back data received on its serial port, but if you fancy a challenge you could write programs to do pretty much anything. You could probably find some inspiration in other minimalist 6502 boards, or projects that emulate a complete motherboard in an FPGA.