Binary Math Tricks: Shifting To Divide By Ten Ain’t Easy

On small CPUs, you often don’t have a multiply or divide instruction. Of course, good programmers know that shifting right and left will multiply or divide by a power of two. But there are always cases where you need to use something that isn’t a power of two. Sometimes you can work it out for multiplication.

For example, multiplying by 10 is common when converting between binary and decimal. Since 10n is equal to 8n + 2n, you can express the product as your original value shifted left three times (to multiply by eight) added to your original value shifted left once (to multiply by two).

But division is a different problem: n/10 can’t be built from n/8 and n/2 or anything else simple like that. The other day a friend showed me a very convoluted snippet of code on Stack Overflow by user [realtime] that divides a number by 10 and wanted to know how it worked. It is pretty straightforward if you just stick with the math, and I’ll show you what I mean in this post. It turns out the post referenced the venerable Hacker’s Delight book, which has a wealth of little tricks like this.

First Multiply

First, let’s do some shifts to multiply. Each left shift is a power of two, so n&lt;&lt;1 is 2*n and n&lt;&lt;8 is 256*n. That’s easy. The hard part is decomposing it for non-powers of two: (n&lt;&lt;3) + (n&lt;&lt;1) is the same as n*10. (The parentheses matter in C, since &lt;&lt; binds less tightly than +.)

If you are in assembly doing one shift at a time, you can often save a little time by reusing the intermediate shift (sketched in a hypothetical accumulator-style assembly):

SHL A     ; A = 2n
MOV T, A  ; save 2n to temporary storage
SHL A     ; A = 4n
SHL A     ; A = 8n
ADD A, T  ; A = 8n + 2n = 10n

That’s pretty efficient even on processors that can multiply.

Division Shifts Work but Composites Are Tricky

Division is the same way, but it doesn’t combine nicely. So n&gt;&gt;1 is n/2 and n&gt;&gt;8 is n/256. But there’s no easy way to combine divisions like you can with multiplications.

The code we looked at was this:

unsigned divu10(unsigned n) {
  unsigned q, r;
  q = (n >> 1) + (n >> 2);
  q = q + (q >> 4);
  q = q + (q >> 8);
  q = q + (q >> 16);
  q = q >> 3;
  r = n - (((q << 2) + q) << 1);
  return q + (r > 9);
}

That’s a mouthful! But it does work. The secret to understanding this is to treat each shift as taking a fraction of the number. Look at the first working line:

q = (n >> 1) + (n >> 2);

This is really n/2 + n/4. If you remember your high school math, that’s the same as 3n/4. Of course, this is the same as multiplying by 0.75. If you look ahead to the last assignment of q, you’ll get a clue: q = q >> 3;

That’s saying q = q/8. So if our goal is to divide by 10, it may be easier to think of it as multiplying by 0.1. To fit well with powers of two, we really want to multiply the whole thing by 0.8 and then divide by 8.

So adding one right shift to two right shifts gets us 0.75, which isn’t too far away from 0.8, but it isn’t quite there. The next line adds a little bit more to our 0.75 factor. How much more? The amount is 3n/64, and the total is now 51n/64, which works out to about 0.797. Getting closer to 0.8 already. Each successive term adds a little bit less and gets us a little bit closer. Here’s how it breaks down; note that the last term gains only a tiny amount, since the 13107n/16384 term is already almost there.

Expression      Base Value                           Delta              Total Value            Ratio
(n>>1)+(n>>2)   3n/4                                 0                  3n/4                   0.75
q+(q>>4)        3n/4 + (3n/4)/16                     3n/64              51n/64                 0.7969
q+(q>>8)        51n/64 + (51n/64)/256                51n/16384          13107n/16384           0.7999
q+(q>>16)       13107n/16384 + (13107n/16384)/65536  13107n/1073741824  858993459n/1073741824  0.8000

I couldn’t help but think that this would be pretty easy to implement in an FPGA.

Commenting the Code

Here’s the code with comments, which is a bit easier to follow:

unsigned divu10(unsigned n) {
  unsigned q, r;
  q = (n >> 1) + (n >> 2); // q = n/2 + n/4 = 3n/4
  q = q + (q >> 4);        // q = 3n/4 + (3n/4)/16 = 3n/4 + 3n/64 = 51n/64
  q = q + (q >> 8);        // q = 51n/64 + (51n/64)/256 = 51n/64 + 51n/16384 = 13107n/16384
  q = q + (q >> 16);       // q = 13107n/16384 + 13107n/1073741824 = 858993459n/1073741824
  // note: q is now roughly 0.8n
  q = q >> 3;              // q = q/8, so roughly 0.8n/8 = 0.1n (about n/10)
  r = n - (((q << 2) + q) << 1); // r = n - 10q, since ((4q + q) << 1) = 10q
  return q + (r > 9);      // if r >= 10, q was one too low; adjust by 1
}

Once you break it down, it isn’t that hard to understand. This method dates back a while, and it appears the original source is the book Hacker’s Delight (see Figure 10-12 in the second edition). I couldn’t find the associated web site, but there is an archival copy. The only difference in that code is that the last line is replaced with: return q + ((r + 6) >> 4);

The book explains how to figure out the optimal shifts and adds and gives several other examples. There are also many other division tricks in that chapter. If you like this sort of thing, have a look at the bit twiddling hacks page. If your division dreams tend to floating point, you might find this interesting.

51 thoughts on “Binary Math Tricks: Shifting To Divide By Ten Ain’t Easy”

  1. “Division is the same way, but it doesn’t combine nicely. So n>>1 is n/2 and n>>8 is n/256. But there’s no easy way to combine divisions like you can multiplications.

    The code we looked at looked like this”

    Or you can just subtract 10 from the passed number, returning if it carries, incrementing if it doesn’t and looping. Much *longer*, time-wise, but *significantly* more compact without a barrel-shifter. Sometimes execution time doesn’t matter. (Also can be faster, too, depending on what you expect the range to be). Always cracks me up when people forget that division’s repeated subtraction.

    1. “returning if it carries,” this is poor practice because unsigned integers don’t “carry”, the behavior is Undefined and you’ve created a really nasty bug that will bite you hard when you change compilers or platforms.

      Welcome once again to the poorly defined world of C arithmetic and why we should all give up on this stupid language.

      1. “Welcome once again to the poorly defined world of C arithmetic ”

        (looks at what I wrote)
        Uh… exactly where was the C code in there?

        That’s an algorithm. If you can’t figure out how to implement it in your language, that’s not the algorithm’s problem. For reference, in C, it’s just while (1) { if (n < 10) return q; n -= 10; q++; } – and if the compiler can't figure out that it can do subtract / return if C / increment / jump, the compiler's stupid, not the code.

      2. Stupid language? So what is better in your mind then, Python perhaps? Dare to guess what Python was written in? I think you call it stupid because you don’t know how to use it.

    2. Yes, that method can work just fine for some cases. You can also just use a general-purpose division subroutine. Clearly this post is about needing it to be fast. One last pedantic point: division isn’t repeated subtraction; division is best defined as a scaling operation.

  2. In the code, q=quotient, r=remainder. The remainder is calculated as (input number) – 10 * q. If the quotient were exact, the remainder would always be < 10. However, since it is not exact, the remainder can be larger. If this happens, it means q was too small, so this is why it's adjusted by 1. Presumably, q is calculated with enough precision that r won't be bigger than 19, so an adjustment of 1 is always enough.

    1. Divide by 10 in Z80 using ADD only, works on 8-bit numbers, uses Z80’s 16-bit capability and then uses the upper 8-bits as the answer

      ; H = E \ 10
      ld d,0
      ld h,d
      ld l,e
      add hl,hl
      add hl,de
      add hl,hl
      add hl,hl
      add hl,de
      add hl,hl

      it works, but it’s a play on numbers.

  3. Easy, divide by almost anything… Let’s say you’re working on a 16 bit number. Multiply it with (0x10000 / 10) and shift the result 16 bits to the right.
    As long as your desired divide-by amount is constant, it works pretty efficiently. The compiler takes the hit for you.

      1. It’s true, the solution above does assume a hardware multiply. A hardware multiply is pretty cheap in silicon, while a divide is not. Even the humble 8-bit ATmega8 includes a hardware multiply, while the latest 32-bit Cortex-M0+ still excludes a divide. The above solution is clean, easily understandable, and extends to division by any real number. Algorithms designed around specific constants inevitably result in different solutions for different constants. It unnecessarily adds workload for programmers and debuggers. Again, you’re right that there is a need, but it’s very small.

        1. The 6502 and 65C02 and its variants had no multiply or divide.
          Neither did the Z80.
          While you may be spoiled by what you have now, for many of us in the 70’s and 80’s we had to make do with what we had…

          Division on the Z80 was easy.

            ; HL / D -> quotient in HL, remainder in A (simple version, D < 128)
                        xor a           ; clear remainder
                        ld b,16         ; 16 bits to process
            LoopForDiv: add hl,hl       ; shift HL left, MSB into carry
                        rla             ; rotate that bit into the remainder
                        cp d            ; remainder < divisor?
                        jp c,NxtBit     ; yes: this quotient bit stays 0
                        sub d           ; no: subtract divisor
                        inc l           ; and set the quotient bit
            NxtBit:     djnz LoopForDiv
      2. Some compilers (e.g. gcc) will convert constant multiply to shift/add/subtract where it makes sense. In that case, it’s better to write it as a multiply, and increase portability of the code (or simply reduce size of the code with -Os option, so it will go back to using library call)

    1. Where I most often use this type of technique is in ADC scaling. Let’s say you have a 10 bit ADC, and the ADC maxes out at 23V (this is the actual input, before the voltage divider). To calculate the amount of millivolts without a divide, multiply the ADC result by 23000 and shift right by the ADC precision (in this case 10 bits). This makes for efficient, portable code that is easy to read, and requires no divide.

  4. How many even decently modern CPUs would benefit from such hackery? I don’t think knowing many tricks of that kind makes you a good programmer. It actually held me back in the past, worrying too much about performance where it wasn’t an issue. Others take it to the opposite extreme, which is why so much software is a real resource hog now.

    Most CPUs even in the 90s took just a few cycles for division. With pipelining it matters even less. Maybe this is useful for an FPGA or some very low performance µC. In general it’s not too useful, but it’s a nice curiosity.

    1. Not so much modern cpu’s, I’m sure, but I actually could’ve totally used this article about 3 months ago when doing some embedded stuff, where there was no division available, but shifts were cheap. Ended up having to resort to pre-computed lookup tables, which, in the end, still might have been the better choice, but this would’ve been nice to at least benchmark.

      The amount of effort needed to hack around dividing by 10 was insane.

    2. It may matter a lot if you take into account huge data manipulations, as with matrix calculus, when performing similar operations on a large number of values.
      You won’t notice the time benefit on one operation, but you will on several million operations.

  5. “left shift three times to multiply by eight, adding that value to your original value shifted left once to multiply by two.”
    no — that’s a total of 3+1 = 4 left shifts (although with a barrel shifter it’s only 2 shift operations) — instead left shift TWICE, add to the original and left shift once more. Fewer operations, same result.

    1. You mention but don’t explicitly say this only matters if you don’t have a barrel shifter. It also could actually be slower in some superscalar architectures because you’re introducing a data dependency.

      That being said if you care about hand optimizing execution speed that much, you probably don’t have a barrel shift or a superscalar architecture.

  6. How we used to do it on the PDP-11 back in the 70s was to have a table like this:

    table: .word 40000, 20000, 10000, 8000, 4000, 2000, 1000, 800, 400, 200, 100, 80, 40, 20, 10, 8, 4, 2, 1

    Then for each word in the table compare it with the source value. If table < source subtract table from source and set a bit in the destination. The destination was then 20 bit BCD, add ASCII 0 to each nibble and you have a conversion.

    1. Here’s an algorithm for BCD that doesn’t require a table: iterate over the input register, left shifting each MS bit into the bottom of the output register. After each shift, add 3 to any nibble in the output reg that is >=5.

      If you’re using a CPU that has BCD arithmetic, or if you’re dealing with words larger than the ALU size, this would be a more efficient way of doing it.

      1. A little PoC of the above, as we’re code golfing…

        #include <stdio.h>

        int main(void)
        {
            int i;
            int a = 0xbc614e, b = 0;
            int x;

            printf("0x%08x ", a);
            for (i = 1; i; i <<= 1) {
                /* add 3 to every BCD nibble that is 5 or more */
                x = ((b >> 3) | ((b >> 2) & ((b >> 1) | b))) & 0x11111111;
                x |= x << 1;
                b = (b + x) << 1;
                if (a & (1 << 31))
                    b |= 1;
                a <<= 1;
            }

            printf("= %08x\n", b);
            return 0;
        }

  7. This is from what is now extremely flakey memory. When doing my training at Control Data in London 40 odd years ago, one language we had to learn was ICL Plan (for ICL 1900). There was no native divide but there was a library routine. For our code we were not allowed to use the library but had to implement divide ourselves.

  8. Great article and I’m already a good way through Hacker’s Delight. You have a typo in the example code comments line 4: 51n/644. It should be 51n/64. It initially confused me as I’d never come across the fractional approach to division.

  9. Hello,

    Maybe the BLD format is what you want. (BLD: binary-look decimal)

    For example: let N = 19735.

    Now convert it to BLD.

    04312 : 11011
    02101 : 00110
    01000 : 00101
    00000 : 01000

    So, the result is: 01000, 00101, 00110, 11011..

    Note that these are not binary numbers, these are regular decimal numbers, but only made up of 1s and 0s.

    Actually it looks like BCD but is totally different.

    Backward: 1000 * 2^3 + 101 * 2^2 + 110 * 2 + 11011 = 8000 + 404 + 220 + 11011 = 19635

    Advantage is you only operate with 1s and 0s.


    PS: (C) Copyright by Adem GÜNEŞ

    1. Dammit html sanitizer!
      …what should have been between the angular brackets:
      “Division by shifts is just multiplication by a fraction where denominator is a power of two, so you can use traditional composition (by addition) of left shifts in the numerator, and just a right shift for the denominator. The higher power of two you can afford to pick for the denominator, the closer you will generally get to exact result.”

  10. Yes, this would be easy to implement in an FPGA, but you’re more apt to use one of the hard-wired multiplier cores, which probably take less power and fewer gates than this version, which would be implemented with fabric LUTs and carry chains. You’d just multiply by 1/10 with the multiplier.
