Faster Integer Division With Floating Point

December 22, 2024 by Al Williams 24 Comments

Multiplication on a common microcontroller is easy. But division is much more difficult. Even with hardware assistance, a 32-bit division on a modern 64-bit x86 CPU can run between 9 and 15 cycles. Doing array processing with SIMD (single instruction multiple data) instructions like AVX or NEON often don’t offer division at all (although the RISC-V vector extensions do). However, many processors support floating point division. Does it make sense to use floating point division to replace simpler division? According to [Wojciech Mula] in a recent post, the answer is yes.

The plan is simple: cast the 8-bit numbers into 32-bit integers and then to floating point numbers. These can be divided in bulk via the SIMD instructions and then converted in reverse to the 8-bit result. You can find several code examples on GitHub.

Continue reading “Faster Integer Division With Floating Point” →

SIMD-Accelerated Computer Vision On The ESP32-S3

July 1, 2024 by Maya Posch 11 Comments

One of the fun parts of the ESP32-S3 microcontroller is that it got upgraded to the newer Cadence Xtensa LX7 processor core, which turns out to have a range of SIMD instructions that can help to significantly speed up a range of tasks. [Shranav Palakurthi] recently used this to speed up the processing of video frames to detect corners using the FAST method. By moving some operations that benefit from SIMD over to an optimized version written in LX7 ASM, the algorithm’s throughput was increased by 220%, from 5.1 MP/s to 11.2 MP/s, albeit with some caveats.

The problem with the SIMD instructions in the LX7 other than them being very poorly documented – unless you sign an NDA with Cadence – is that it misses many instructions that would be really useful. For [Shranav] the lack of support for direct misaligned reads and comparing of unsigned 8-bit numbers were hurdles, but could be worked around, with the results available on GitHub.

Much of the groundwork for this SIMD implementation was laid by [Larry Bank], who reverse-engineered the SIMD instructions from available documentation and code samples, finding that the ESP32-S3 misses quite a few common SIMD instructions, including various shifts and unaligned reads and writes. Still, it’s good enough for quite a few tasks, as long as you can make it work with the available instructions.

AVX-512: When The Bits Really Count

May 9, 2022 by Matthew Carlson 16 Comments

For the majority of workloads, fiddling with assembly instructions isn’t worth it. The added complexity and code obfuscation generally outweigh the relatively modest gains. Mainly because compilers have become quite fantastic at generation code and because processors are just so much faster, it is hard to get a meaningful speedup by tweaking a small section of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets fast. Intel’s fancy AVX-512 SIMD instructions can offer some meaningful performance gains with relatively low custom assembly.

Like many software engineers, [Daniel Lemire] had many bitsets (a range of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than checking if just a specific flag is present (a bitwise and), [Daniel] wanted to know all the flags in a given bitset. The easiest way would be to iterate through all of them like so:

while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  i++;
}

The naive version of this look is very likely to have a branch misprediction, and either you or the compiler would speed it up by unrolling the loop. However, the AVX-512 instruction set on the latest Intel processors has some handy instructions just for this kind of thing. The instruction is vpcompressd and Intel provides a handy and memorable C/C++ function called _mm512_mask_compressstoreu_epi32.

The function generates an array of integers and you can use the infamous popcnt instruction to get the number of ones. Some early benchmark testing shows the AVX-512 version uses 45% fewer cycles. You might be wondering, doesn’t the processor downclock when wide 512-bite registers are used? Yes. But even with the downclocking, the SIMD version is still 33% faster. The code is up on Github if you want to try it yourself.

Massively Parallel CPU Processes 256 Shades Of Gray

February 26, 2013 by Brian Benchoff 22 Comments

256

The 1980s were a heyday for strange computer architectures; instead of the von Neumann architecture you’d find in one of today’s desktop computers or the Harvard architecture of a microcontroller, a lot of companies experimented with strange parallel designs. While not used much today, at the time these were some of the most powerful computers of their day and were used as the main research tools of the AI renaissance of the 1980s.

Over at the Norwegian University of Science and Technology a huge group of students (13 members!) designed a modern take on the massively parallel computer. It’s called 256 Shades of Gray, and it processes 320×240 pixel 8-bit grayscale graphics like no microcontroller could.

The idea for the project was to create an array-based parallel image processor with an architecture similar to the Goodyear MPP formerly used by NASA or the Connection Machine found in the control room of Jurassic Park. Unlike these earlier computers, the team implemented their array processor in an FPGA, giving rise to their Lena processor this processor is in turn controlled by a 32-bit AVR microcontroller with a custom-build VGA output.

The entire machine can process 10 frames per second of 320×240 resolution grayscale video. There’s a presentation video available (in Norwegian), but the highlight might be their demo of The Game of Life rendered in real-time on their computer. An awesome build, and a very cool experience for all the members of the class.