Multiplication on a common microcontroller is easy. Division is much more difficult. Even with hardware assistance, a 32-bit division on a modern 64-bit x86 CPU can take between 9 and 15 cycles. Worse, SIMD (single instruction, multiple data) instruction sets used for array processing, like AVX or NEON, often don't offer integer division at all (although the RISC-V vector extensions do). However, many processors do support floating point division. Does it make sense to replace plain integer division with floating point division? According to [Wojciech Mula] in a recent post, the answer is yes.
The plan is simple: widen the 8-bit numbers to 32-bit integers, convert those to floating point numbers, divide in bulk with SIMD instructions, and then convert back the other way to get the 8-bit results. You can find several code examples on GitHub.
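As a very rough sketch of that pipeline (our own illustration, not the code from the post; the function name and the eight-values-at-a-time framing are ours), one AVX2 step might look something like this:

#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: divide 8 unsigned 8-bit numerators by 8 unsigned
   8-bit denominators via float division. Assumes nonzero denominators. */
static inline void div_u8x8_avx2(const uint8_t *num, const uint8_t *den,
                                 uint8_t *out)
{
    /* Load 8 bytes of each operand and widen to 8x 32-bit integers. */
    __m128i n8  = _mm_loadl_epi64((const __m128i *)num);
    __m128i d8  = _mm_loadl_epi64((const __m128i *)den);
    __m256i n32 = _mm256_cvtepu8_epi32(n8);
    __m256i d32 = _mm256_cvtepu8_epi32(d8);

    /* Convert to float, divide, truncate back to 32-bit integers. For
       operands up to 255 the truncated float quotient equals the integer
       quotient. */
    __m256 nf   = _mm256_cvtepi32_ps(n32);
    __m256 df   = _mm256_cvtepi32_ps(d32);
    __m256i q32 = _mm256_cvttps_epi32(_mm256_div_ps(nf, df));

    /* Narrow 8x 32-bit back down to 8x 8-bit. The pack instructions work
       within 128-bit lanes, so pull the upper lane out first. */
    __m128i lo  = _mm256_castsi256_si128(q32);
    __m128i hi  = _mm256_extracti128_si256(q32, 1);
    __m128i q16 = _mm_packus_epi32(lo, hi);    /* 8x 16-bit */
    __m128i q8  = _mm_packus_epi16(q16, q16);  /* 8x 8-bit  */
    _mm_storel_epi64((__m128i *)out, q8);
}

The real implementations process wider blocks and handle the leftover tail of the array, but the widen, divide, and narrow shape is the same.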
Since modern processors have several SIMD instruction sets, the post takes the time to benchmark many different variations of a program dividing in a loop. The basic program is the reference and, thus, has a “speed factor” of 1. Unrolling the loop, a common loop optimization technique, doesn’t help much and, on some CPUs, even makes the loop slower.
Converting to floating point and using AVX2 sped the program up by a factor of 8 to 11, depending on the CPU. Some of the processors supported AVX-512, which offered further considerable speed-ups.
This is one of those examples of why profiling is so important. If you’d asked us whether converting integer division to floating point might make a program run faster, we’d have bet the answer was no, but we’d have been wrong.
As CPUs get more complex, optimizing gets a lot less intuitive. If you are interested in things like AVX-512, we’ve got you covered.
For 8-bit by 8-bit division without vector extensions, it might be fastest to look up a reciprocal in a table and use multiplication. But vectorized versions will still be significantly faster.
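For example (just a sketch of the idea, the table width and constants are my own choices): a 256-entry table of ceil(65536/d) values gives exact 8-bit quotients with one lookup, one multiply, and a shift.

#include <stdint.h>

/* 17-bit reciprocals: recip[d] = ceil(65536 / d). For n, d in 0..255
   (d != 0), (n * recip[d]) >> 16 equals n / d exactly, because the
   rounding error introduced by the reciprocal stays below 1/d.
   Entry 0 is unused. */
static uint32_t recip[256];

static void init_recip(void)
{
    for (uint32_t d = 1; d < 256; d++)
        recip[d] = (65536u + d - 1) / d;
}

static inline uint8_t div_u8(uint8_t n, uint8_t d)
{
    return (uint8_t)(((uint32_t)n * recip[d]) >> 16);
}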
Yep, I’ve done this on the MSP430 (F5529) because it has a hardware multiplier but no hardware divide.
WRT the fact that sometimes using the FPU is faster than doing integer math: that has been the case (in some circles) for 40 years.
And sometimes doing magic math is faster than either. ;)
https://en.m.wikipedia.org/wiki/Fast_inverse_square_root
Thanks for the rabbit-hole! Fun read…
I kinda doubt it. Definitely not for a single operation – the cache hit alone would kill you. And in order to get the right answer you can’t just do “value * reciprocal”, because for any reasonable LUT you’d need to round to avoid the precision loss in the reciprocal (and I’m not sure you can actually do it exactly without a LUT with precision bigger than 8 bits?). The fetch + mult is already costing you like 6 cycles in the best case with everything in cache. With multiple operations you might start to win.
There are times when, contrary to popular belief, LUTs can give surprising results, even when I go in with the idea that what I am doing must be horrible because it will trash the cache and make things even slower. That reputation might even keep most people from considering a LUT at all, because it’s something from the 90s that is supposedly bad for current hardware. But that’s why it sometimes makes sense to test these assumptions. Unless, of course, one’s code is so horrible that even a LUT would make it faster; nothing is absolute.
I could point to examples from my experience. All of them involve iterating over lots of pixels. One has to do with replacing a division with LUTs, on a modern CPU; as we know, integer divisions are still bad. The other was when I made a big 128 kB table that takes two 8-bit pixel values and gives a 16-bit pixel color, something like a fake palette colormap in 16 bpp, on a Pentium 1. That was one of the early CPUs where I first read, in Abrash’s books maybe, that things have changed and making a LUT, especially one that big, is a no-no. Previously I would do integer shifts and ORs to create 16 bpp RGB values the typical way. It wasn’t as fast as I’d like on the Pentium. Then I decided to try this, and I was in horror when I did because I knew “it was considered a horrible idea, especially with a huge 128 kB LUT for the small cache of the Pentium”. And yet it went about double speed. I just stopped assuming any modern wisdom at that point and decided to try anything, no matter how absurd, and if it makes my code faster, that’s fine. Sure, maybe I was doing something stupid in the original code, but I am well versed in optimizing such things, so I’m relatively confident the non-LUT approach wasn’t written so badly that it suffered more than the cache misses would.
However, all ideas are welcome. I was watching a recent Casey Muratori video (with the Primeagen) where he in fact replaced an integer modulo with floating point division and subtraction, with all the conversions from integer to float and back, got the compiler to use SIMD really well, and got a 4x speedup. I’d also seen that code before and been like, “What? That’s gonna be way slower, what about the conversions?”
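If I remember the shape of it right (this is only my own sketch of the general trick, not his code), it is roughly: divide in float, truncate to get the quotient, multiply back and subtract. That is only exact when the operands are small enough (say, 16-bit) that the truncated float quotient matches the integer quotient.

#include <stdint.h>

/* Sketch of modulo via float division: q = trunc(n / d), r = n - q * d.
   Safe for small operands (e.g. 16-bit, d != 0), where single-precision
   division followed by truncation gives the exact integer quotient. */
static inline uint32_t mod_via_float(uint32_t n, uint32_t d)
{
    uint32_t q = (uint32_t)((float)n / (float)d);
    return n - q * d;
}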
https://arxiv.org/pdf/2207.08420
If speed is of utmost importance, why not just have a direct LUT for any slow operation taking two 8-bit operands? That’s a 64 kB LUT. And the solution doesn’t need vector operations, and there’s no minimum number of operations before it breaks even.
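Something like this (just a sketch; the divide-by-zero convention here is arbitrary):

#include <stdint.h>

/* Direct 64 kB lookup table: one byte of result per (numerator,
   denominator) pair. Division by zero is mapped to 0 just to fill
   the table; pick whatever convention you need. */
static uint8_t div_lut[256][256];

static void init_div_lut(void)
{
    for (int n = 0; n < 256; n++)
        for (int d = 0; d < 256; d++)
            div_lut[n][d] = d ? (uint8_t)(n / d) : 0;
}

/* One load replaces the division. */
static inline uint8_t div_u8_lut(uint8_t n, uint8_t d)
{
    return div_lut[n][d];
}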
LUT is usually 1 or 2 cycles per value on out-of-order architectures. The vectorized versions shown here get more than 4x speedup over that.
Could you explain this, as I don’t quite understand how that works.
If I need one instruction to perform that lookup-division, but the article’s method would be 4 times as fast, I’d end up with a quarter instruction?!
Vectorized instructions handle multiple values per cycle (SIMD, single instruction multiple data). AVX512 can handle 64x 8-bit values.
Right, but this all falls apart if you have 32-bit ints (as would a LUT, or rather it would be very expensive) or you can’t parallelize these divisions.
Anyway, subtle but understood. Thank you.
four results per cycle
IIRC vectorized means they run multiple divisions in parallel on the same core at the same time. So if you are doing more than one at a time then it will be faster.
“why not just have a direct LUT for any slow operation taking two 8-bit operands? That’s a 64kB LUT. ”
That’s a huge LUT. On a modern CPU you’d eventually evict the entire L1 cache doing that. And obviously embedding a LUT like that in the CPU is impractical, because it’s, uh, exactly the same as the L1 cache.
Embedded in the cpu, it would be in ROM, which is typically smaller than the RAM of the L1 cache. So not exactly the same. Nevertheless not a good idea, probably.
I was assuming the poster meant a programmable LUT, hence “any 8 x 8 function.”
Even for a fixed function LUT, though, for a math function a full LUT is a poor choice because there’s structure to the LUT, and it’s faster to use math since LUTs inevitably slow down as they get larger.
It’s funny because working with FPGAs you’d figure LUTs are a good choice for math in a lot of cases: only a single critical path. But I’ve ended up implementing so many of them as actual math after trying LUTs because the LUT is so big and the math can be trivialized and compacted.
If you really need the speed on a wide variety of processors, write it multiple ways and test in a realistic manner at run time. You’d be surprised how varied things are even within an architecture. (Examples: FFTW, SETI@home…)
I’ve been working hard as I can over the past 2 years to completely retrain my intuition about good code.
A cache miss is 200 cycles (for the sake of argument), so it’s possibly 10-20 floating point divisions!
Data-oriented design completely overturns just about everything I thought I knew about fast code.
This.
In my opinion it goes even further:
The data is the core, and any program is just a replaceable convenience to handle the data. The language is not important; it just makes certain manipulations more convenient.
Probably would be much faster using a GPU
Most people have those installed
Or have integrated graphics on the cpu…
Algebra, trigonometry, and calculus are things a GPU handles extremely efficiently compared to a CPU
Especially logarithmic operations and floating point math
Except that you’d have to wait for a round-trip on the bus to talk to the GPU. The APU should at least avoid that.
It all comes down to dataset size. If it’s big enough, then the latency doesn’t matter.
While I can and do use OpenCL, I never even look at SIMD intrinsics.
I write code that’s simple and clear enough that GCC can vectorise.
I can and do use OpenACC, and OpenMP (in the past).
Once you take the perspective of the data flow (data-oriented programming), you find that much of the micro-optimization is a false economy.
The amazing work being showcased here is vital and interesting, though. This isn’t a criticism of that.
It’s just that no matter which route you take you better have the data laid out so it can be applied efficiently.
If the data is buried in large structs, in long arrays of those structs, then no matter what you do you will have bad performance as you utterly thrash your cache.
You might think that’s contrived, but it’s actually the most common data pattern in use by “career programmers” using OOP.
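A toy example of the difference (mine, not from the article): dividing one field out of a fat struct drags the whole struct through the cache, while splitting the fields into separate arrays touches only the bytes you need and gives the vectorizer dense, contiguous data.

#include <stdint.h>
#define N 4096

/* Array-of-structs: each division drags in ~64 bytes of unrelated state. */
struct Particle {
    uint8_t num, den;              /* den assumed nonzero */
    uint8_t other_state[62];       /* stand-in for unrelated fields */
};
struct Particle particles[N];

void divide_aos(uint8_t *out)
{
    for (int i = 0; i < N; i++)
        out[i] = particles[i].num / particles[i].den;
}

/* Struct-of-arrays: the same loop reads two dense byte streams, which is
   the layout the vectorized division routines want to see. */
uint8_t nums[N], dens[N];          /* dens assumed nonzero */

void divide_soa(uint8_t *out)
{
    for (int i = 0; i < N; i++)
        out[i] = nums[i] / dens[i];
}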
Interesting. Reading this I was curious if this would be applicable to the AMMX SIMD instruction set in the Apollo 68080 CPU core but it doesn’t support division so no. It does support XOR though so I wonder if this is workable using SIMD XOR division … 🤔