Multiplication on a common microcontroller is easy, but division is much more difficult. Even with hardware assistance, a 32-bit division on a modern 64-bit x86 CPU can take between 9 and 15 cycles. SIMD (single instruction, multiple data) instruction sets like AVX or NEON, often used for array processing, frequently don’t offer integer division at all (although the RISC-V vector extensions do). However, many processors do support floating point division. Does it make sense to replace integer division with floating point division? According to [Wojciech Mula] in a recent post, the answer is yes.
The plan is simple: cast the 8-bit numbers to 32-bit integers and then to floating point numbers. These can be divided in bulk via SIMD instructions and then converted back to 8-bit results. You can find several code examples on GitHub.
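To make the pipeline concrete, here is a minimal AVX2 sketch of the idea, dividing eight unsigned bytes at a time. This is our own illustration, not code from the post or its GitHub repo; the function name is made up, and production code would narrow the results with pack/shuffle instructions rather than scalar stores.

```c
#include <immintrin.h>
#include <stdint.h>

/* Divide eight unsigned bytes in a[] by eight in b[] (all b[i] nonzero)
 * using AVX2 float division. Hypothetical helper, not code from the post. */
static void div_u8x8_avx2(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    /* Zero-extend 8 bytes to eight 32-bit lanes. */
    __m256i va = _mm256_cvtepu8_epi32(_mm_loadl_epi64((const __m128i *)a));
    __m256i vb = _mm256_cvtepu8_epi32(_mm_loadl_epi64((const __m128i *)b));

    /* Convert to floats and divide; for 8-bit operands the rounded float
     * quotient truncates to the exact integer quotient. */
    __m256 q = _mm256_div_ps(_mm256_cvtepi32_ps(va), _mm256_cvtepi32_ps(vb));

    /* Truncate back to 32-bit integers. */
    __m256i qi = _mm256_cvttps_epi32(q);

    /* Narrow 32 -> 8 bits; scalar stores are used here for clarity.
     * Real code would use pack/shuffle instructions instead. */
    uint32_t tmp[8];
    _mm256_storeu_si256((__m256i *)tmp, qi);
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)tmp[i];
}
```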
Since modern processors support several generations of SIMD instructions, the post takes the time to benchmark many variations of a program that divides in a loop. The basic scalar program is the reference and, thus, has a “speed factor” of 1. Unrolling the loop, a common loop optimization technique, doesn’t help much here and, on some CPUs, can even make the loop slower.
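For reference, unrolling simply means performing several independent iterations per pass so the CPU has more work available to overlap. A minimal sketch of the kind of variation being benchmarked, with made-up names and assuming no zero divisors:

```c
#include <stddef.h>
#include <stdint.h>

/* 4x-unrolled byte division loop. Illustrative sketch only. */
void div_bytes_unrolled(const uint8_t *a, const uint8_t *b,
                        uint8_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  /* four independent divisions per pass */
        out[i + 0] = a[i + 0] / b[i + 0];
        out[i + 1] = a[i + 1] / b[i + 1];
        out[i + 2] = a[i + 2] / b[i + 2];
        out[i + 3] = a[i + 3] / b[i + 3];
    }
    for (; i < n; i++)            /* leftover elements */
        out[i] = a[i] / b[i];
}
```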
Converting to floating point and using AVX2 sped the program up by a factor of 8 to 11, depending on the CPU. Some of the processors supported AVX-512, which also offered considerable speed-ups.
This is one of those examples of why profiling is so important. If you’d asked us whether converting integer division to floating point could make a program run faster, we’d have bet the answer was no, but we’d have been wrong.
As CPUs get more complex, optimizing gets a lot less intuitive. If you are interested in things like AVX-512, we’ve got you covered.
For 8-bit by 8-bit division without vector extensions, it might be fastest to look up a reciprocal in a table and use multiplication. But the vectorized versions will still be significantly faster.
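A minimal sketch of that reciprocal-table idea (our illustration, not code from the post): store m = ceil(2^16 / b) for each divisor, and then (a * m) >> 16 gives the exact quotient, because the rounding error term a * e is at most 255 * 254, which is less than 2^16.

```c
#include <stdint.h>

/* Reciprocal table: recip[b] = ceil(2^16 / b), so (a * recip[b]) >> 16
 * equals a / b exactly for all a in [0, 255] and b in [1, 255].
 * Illustrative sketch with made-up names. */
static uint32_t recip[256];

static void init_recip(void)
{
    for (uint32_t b = 1; b < 256; b++)
        recip[b] = (0x10000u + b - 1) / b;  /* ceil(65536 / b) */
}

static inline uint8_t div_u8_recip(uint8_t a, uint8_t b)
{
    return (uint8_t)((a * recip[b]) >> 16); /* exact floor(a / b) */
}
```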
Yep, I’ve done this on the MSP430 (F5529) because it has a hardware multiplier, but the multiplier cannot do division.
WRT the fact that the FPU is sometimes faster than integer math, that has been the case (in some circles) for 40 years.
https://arxiv.org/pdf/2207.08420
If speed is of utmost importance, why not just have a direct LUT for any slow operation taking two 8-bit operands? That’s a 64 kB LUT, and the solution doesn’t need vector operations or some minimum number of operations to break even.
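For what it’s worth, a minimal sketch of that direct-table approach (illustrative names; the b = 0 entries have to be defined as something, here 0):

```c
#include <stdint.h>

/* Direct 64 kB lookup table indexed by both 8-bit operands. */
static uint8_t div_lut[256 * 256];

static void init_div_lut(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            div_lut[(a << 8) | b] = b ? (uint8_t)(a / b) : 0;
}

static inline uint8_t div_u8_lut(uint8_t a, uint8_t b)
{
    return div_lut[((unsigned)a << 8) | b];
}
```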
A LUT is usually 1 or 2 cycles per value on out-of-order architectures. The vectorized versions shown here get more than a 4x speedup over that.
I’ve been working as hard as I can over the past two years to completely retrain my intuition about good code.
A cache miss is 200 cycles (for the sake of argument), so that’s potentially 10-20 floating point divisions!
Data-oriented design completely overturns just about everything I thought I knew about fast code.