On small CPUs, you often don’t have a multiply or divide instruction. Of course, good programmers know that shifting right and left will multiply or divide by a power of two. But there are always cases where you need to use something that isn’t a power of two. Sometimes you can work it out for multiplication.
For example, multiplying by 10 is common when dealing with conversion between binary and decimal. Since 10n is equal to 8n + 2n, you can express that as the original value shifted left three times (to multiply by eight) added to the original value shifted left once (to multiply by two).
But division is a different problem. n/10 does not equal n/8 - n/2 or anything else simple like that. The other day a friend showed me a very convoluted snippet of code on Stack Overflow by user [realtime] that divides a number by 10 and wanted to know how it worked. It is pretty straightforward if you just stick with the math, and I'll show you what I mean in this post. Turns out the post referenced the venerable Hacker's Delight book, which has a wealth of little tricks like this.
First Multiply
First, let's do some shifts to multiply. Each left shift is a power of two, so n<<1 is 2*n and n<<8 is 256*n. That's easy. The hard part is decomposing it for non-powers of two: (n<<3) + (n<<1) is the same as n*10.
If you are in assembly doing one shift at a time, you can often save a little time by combining the two shifts:
SHL A
MOV T,A ; Move A to temporary storage
SHL A
SHL A
ADD A, T ; Add T to A
That’s pretty efficient even on processors that can multiply.
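For comparison, here's the same decomposition as a quick C sketch (mul10 is just an illustrative name; note the parentheses, since + binds tighter than << in C):

unsigned mul10(unsigned n)
{
    /* 10*n = 8*n + 2*n = (n << 3) + (n << 1) */
    return (n << 3) + (n << 1);
}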
Division Shifts Work but Composites Are Tricky
Division is the same way, but it doesn't combine nicely. So n>>1 is n/2 and n>>8 is n/256. But there's no easy way to combine divisions like you can multiplications.
The code we looked at looked like this:
unsigned divu10(unsigned n) {
    unsigned q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q >> 3;
    r = n - (((q << 2) + q) << 1);
    return q + (r > 9);
}
That’s a mouthful! But it does work. The secret to understanding this is to treat each shift as taking a fraction of the number. Look at the first working line:
q = (n>>1) + (n>>2)
This is really n/2 + n/4. If you remember your high school math, that's the same as 3n/4. Of course, this is the same as multiplying by 0.75. If you look ahead to the last assignment of q, you'll get a clue: q = q>>3;
That's saying q = q/8. So if our goal is to divide by 10, it may be easier to think of it as multiplying by 0.1. To fit well with powers of two, we really want to think about multiplying the whole thing by 0.8 and then dividing by 8, since 0.8/8 = 0.1.
So adding one right shift to two right shifts gets us 0.75, which isn't too far away from 0.8, but it isn't quite there. The next line adds a little bit more to our 0.75 factor. How much more? The amount is 3n/64 and the total is now 51n/64. That equates to 0.797 or so. Getting closer to 0.8 already. Each term adds a little bit less and gets us a little bit closer. Here's how it breaks down (the last term gets you only a little bit closer; the 13107n/16384 term is almost as close):
Expression | Base Value | Delta | Total Value | Ratio |
---|---|---|---|---|
(n>>1)+(n>>2) | 3n/4 | 0 | 3n/4 | 0.75 |
q+(q>>4) | 3n/4 + (3n/4)/16 | 3n/64 | 51n/64 | 0.7969 |
q+(q>>8) | 51n/64 + (51n/64)/256 | 51n/16384 | 13107n/16384 | 0.7999 |
q+(q>>16) | 13107n/16384 + (13107n/16384)/65536 | 13107n/1073741824 | 858993459n/1073741824 | 0.8 |
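As a quick sanity check (a little sketch of my own, not from the original code), you can track the scale factor each step applies, ignoring the truncation of the integer shifts:

#include <stdio.h>

int main(void)
{
    double f = 0.5 + 0.25;      /* (n>>1)+(n>>2): 3/4           */
    printf("%.10f\n", f);
    f = f + f / 16;             /* q + (q>>4): 51/64            */
    printf("%.10f\n", f);
    f = f + f / 256;            /* q + (q>>8): 13107/16384      */
    printf("%.10f\n", f);
    f = f + f / 65536;          /* q + (q>>16): ~0.79999999981  */
    printf("%.10f\n", f);
    return 0;
}

The printed values match the Ratio column above.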
I couldn’t help but think that this would be pretty easy to implement in an FPGA.
Commenting the Code
Here’s the code with comments, which is a bit easier to follow:
unsigned divu10(unsigned n) {
    unsigned q, r;
    q = (n >> 1) + (n >> 2);       // q = n/2 + n/4 = 3n/4
    q = q + (q >> 4);              // q = 3n/4 + (3n/4)/16 = 3n/4 + 3n/64 = 51n/64
    q = q + (q >> 8);              // q = 51n/64 + (51n/64)/256 = 51n/64 + 51n/16384 = 13107n/16384
    q = q + (q >> 16);             // q = 13107n/16384 + (13107n/16384)/65536 = 13107n/16384 + 13107n/1073741824 = 858993459n/1073741824
                                   // note: q is now roughly 0.8n
    q = q >> 3;                    // q = q/8, or about 0.8n/8 = 0.1n (that is, n/10)
    r = n - (((q << 2) + q) << 1); // rounding: r = n - 2*(4*(n/10) + n/10) = n - 2*(5n/10) = n - 10*(n/10)
    return q + (r > 9);            // adjust answer by error term
}
Once you break it down, it isn’t that hard to understand. This method dates back a while, and it appears the original source is the book Hacker’s Delight (see Figure 10-12 in the second edition). I couldn’t find the associated web site, but there is an archival copy. The only difference in that code is the last line is commented out and replaced with: return q+((r+6)>>4);
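If you want to convince yourself that both adjustments really track n/10, a quick brute-force test sketch (my own, with an added flag to switch between the two final lines) does the job:

#include <stdio.h>

static unsigned divu10(unsigned n, int hd_adjust)
{
    unsigned q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q >> 3;
    r = n - (((q << 2) + q) << 1);
    /* hd_adjust selects the Hacker's Delight form of the final correction */
    return hd_adjust ? q + ((r + 6) >> 4) : q + (r > 9);
}

int main(void)
{
    /* Spot-check the first ten million values */
    for (unsigned n = 0; n < 10000000u; n++) {
        if (divu10(n, 0) != n / 10 || divu10(n, 1) != n / 10) {
            printf("mismatch at %u\n", n);
            return 1;
        }
    }
    printf("both variants match n/10 over the tested range\n");
    return 0;
}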
The book explains how to figure out the optimal shifts and adds and gives several other examples. There are also many other division tricks in that chapter. If you like this sort of thing, have a look at the bit twiddling hacks page. If your division dreams tend to floating point, you might find this interesting.
“Division is the same way, but it doesn’t combine nicely. So n>>1 is n/2 and n>>8 is n/256. But there’s no easy way to combine divisions like you can multiplications.
The code we looked at looked like this”
Or you can just subtract 10 from the passed number, returning if it carries, incrementing if it doesn’t and looping. Much *longer*, time-wise, but *significantly* more compact without a barrel-shifter. Sometimes execution time doesn’t matter. (Also can be faster, too, depending on what you expect the range to be). Always cracks me up when people forget that division’s repeated subtraction.
“returning if it carries,” this is poor practice because unsigned integers don’t “carry”, the behavior is Undefined and you’ve created a really nasty bug that will bite you hard when you change compilers or platforms.
Welcome once again to the poorly defined world of C arithmetic and why we should all give up on this stupid language.
“Welcome once again to the poorly defined world of C arithmetic ”
(looks at what I wrote)
Uh… exactly where was the C code in there?
That’s an algorithm. If you can’t figure out how to implement it in your language, that’s not the algorithm’s problem. For reference, in C, it’s just while (1) { if (n < 10) return; n-= 10; q++; } – and if the compiler can't figure out that it can do subtract / return if C / increment / jump, the compiler's stupid, not the code.
You use the C shifting operators, so a C assumption seems reasonable as far as the arithmetic rules, or lack of them.
That wasn’t me, that was the quoted parent article.
C didn’t invent shifting.
Stupid language? So what is better in your mind then Python perhaps? Dare to guess what Python was written in? I think you call it stupid because you don’t know how to use it.
Yes, that method can work just fine for some cases. You can also just use a general purpose division subroutine. Clearly this post is about needing it to be fast. One last pedantic point: division isn't repeated subtraction; division is best defined as a scaling operation.
General purpose subroutines offer some trade between space and speed. The routine here is a speed corner, the one I gave is a space corner.
Just do ‘ / 10’ and let the compiler figure it out.
In the commented code, you’re missing a new-line in line 5.
Some small µCs such as the old 8051 have both MUL and DIV. They both take 4 cycles. Not enough to make me want to use it.
http://www.keil.com/support/man/docs/is51/is51_mul.htm
If you think 4 cycles is bad, try doing without them. Here’s an asm code snippet of an 8x8bit multiply algorithm for the 8051 using the shift-add method (no MUL opcode): https://archive.thebearsenal.com/2016/01/multiplication-by-shift-and-add-method.html
A quick tally of the asm code looks to be (worst case) around 107 cycles, versus ‘only’ 4 cycles with the specific MUL opcode. So it did indeed help! :-)
In the code, q=quotient, r=remainder. The remainder is calculated as (input number) - 10 * q. If the quotient were exact, the remainder would always be < 10. However, since it is not exact, the remainder can be larger. If this happens, it means q was too small, so this is why it's adjusted by 1. Presumably, q is calculated with enough precision that r won't be bigger than 19; if it could be, an adjustment of 1 wouldn't be enough.
Ah, bitshifting math in c, loverly.
Also, see “multiplying by magic numbers to achieve division” from Hackers Delight.
This one worked out for me:
// via https://cboard.cprogramming.com/c-programming/128545-division-ten-magic-numbers.html
div by 10 – works up to 99,999
int x = 150;
(x * 6554UL) >> 16;
https://github.com/cups/Binary-fun/blob/master/code/bitshift-division.c
Divide by 10 in Z80 using ADD only, works on 8-bit numbers, uses Z80’s 16-bit capability and then uses the upper 8-bits as the answer
; H = E \ 10
e_div_10:
ld d,0
ld h,d
ld l,e
add hl,hl
add hl,de
add hl,hl
add hl,hl
add hl,de
add hl,hl
it works, but it’s a play on numbers.
Similar would work on the 6809 I presume or anything else that has 2 registers that can be glommed into one 16 bit.
Easy, divide by almost anything…. Let’s say you’re working on a 16 bit number. Multiply it with (0x10000 / 10) and shift the result 16 bits to the right.
As long as your desired divided by amount is constant, it works pretty efficiently. The compiler takes the hit for you.
maybe the point was to use just shifting and +,-, so no multiplication
It's true, the solution above does assume a hardware multiply. A hardware multiply is pretty cheap in silicon, while a divide is not. Even the humble 8-bit ATmega8 includes a hardware multiply, while the latest 32-bit Cortex-M0+ still excludes a divide. The above solution is clean, easily understandable, and extends to division by any real number. Algorithms designed around specific constants inevitably result in a different solution for each constant, which unnecessarily adds workload for programmers and debuggers. Again, you're right there is a need, but it's very small.
The 6502 and 65C02 and its variants had no multiply or divide.
Neither did the Z80.
While you may be spoiled by what you have now, for many of us in the 70's and 80's we had to make do with what we had…
Division on the Z80 was easy.
Some compilers (e.g. gcc) will convert a constant multiply to shift/add/subtract where it makes sense. In that case, it's better to write it as a multiply and increase the portability of the code (or simply reduce the size of the code with the -Os option, so it will go back to using a library call).
It’s better to write it as a multiply, compile, and look at the assembly to make sure. GCC can be a right git plenty of times.
Where I most often use this type of technique is in ADC scaling. Let's say you have a 10-bit ADC, and the ADC maxes out at 23V (this is the actual input, before the voltage divider). To calculate the number of millivolts without a divide, multiply the ADC result by 23000 and shift it right by the ADC precision (in this case 10 bits). This makes for efficient, portable code that is easy to read, and requires no divide.
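That scaling might look something like this in C (a minimal sketch; the names and constants just mirror the example above):

#include <stdint.h>

#define ADC_BITS      10        /* 10-bit converter                   */
#define FULLSCALE_MV  23000UL   /* input corresponding to full scale  */

static uint32_t adc_to_millivolts(uint16_t adc_counts)
{
    /* mV = counts * 23000 / 2^10, with the divide done as a shift */
    return ((uint32_t)adc_counts * FULLSCALE_MV) >> ADC_BITS;
}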
How many even decently modern CPUs would benefit from such hackery? I don't think it makes you a good programmer to know many tricks of that kind. It actually held me back in the past, worrying too much about performance where it wasn't an issue. Others take it to the opposite extreme, which is why so much software is a real resource hog now.
Most CPUs even in the 90s took just a few cycles for division. With pipelining it matters even less. Maybe this is useful for an FPGA or some very low-performance µC. In general it's not too useful, but it's a nice curiosity.
Not so much modern cpu’s, I’m sure, but I actually could’ve totally used this article about 3 months ago when doing some embedded stuff, where there was no division available, but shifts were cheap. Ended up having to resort to pre-computed lookup tables, which, in the end, still might have been the better choice, but this would’ve been nice to at least benchmark.
The amount of effort needed to hack around dividing by 10 was insane.
It may matter a lot if you take into account huge data manipulations, as with matrix calculations, when performing similar operations on a large number of numbers.
You won’t notice the time benefit on one operation but you will on several million operations.
If you’re doing that kind of stuff, it’s probably a better idea to find a CPU that has a fast divide
“left shift three times to multiply by eight, adding that value to your original value shifted left once to multiply by two.”
no — that’s a total of 3+1 = 4 left shifts (although with a barrel shifter it’s only 2 shift operations) — instead left shift TWICE, add to the original and left shift once more. Fewer operations, same result.
You mention but don’t explicitly say this only matters if you don’t have a barrel shifter. It also could actually be slower in some superscalar architectures because you’re introducing a data dependency.
That being said if you care about hand optimizing execution speed that much, you probably don’t have a barrel shift or a superscalar architecture.
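In C, that reordering is just the following (a tiny sketch; it is also the same expression the divu10() code above uses to rebuild 10*q when computing the remainder):

unsigned mul10_alt(unsigned n)
{
    /* shift twice (4n), add the original (5n), then shift once more (10n) */
    return ((n << 2) + n) << 1;
}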
Z80 asm code snippet in the article for the win!
All other CPU’s suck eggs.
And Im wrong.
What a dumb comment I made.
Better yet, use the fact you’re living mod 2^32 and multiply by the modular inverse – division by invariant integers via multiplication (and a shift or two)
Ahhh, that's a good one. I've known for years that division works by storing remainders somewhere in computers, yet I didn't think about that.
How we used to do it on the PDP-11 back in the 70s was to have a table like this:
table: .word 40000, 20000, 10000, 8000, 4000, 2000, 1000, 800, 400, 200, 100, 80, 40, 20, 10, 8, 4, 2, 1
Then for each word in the table compare it with the source value. If table < source subtract table from source and set a bit in the destination. The destination was then 20 bit BCD, add ASCII 0 to each nibble and you have a conversion.
On my first 6502 home computer, the BASIC ROM had a similar table, but only with powers of 10 (in 32 bit), and then did repeated subtraction for each value.
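For reference, here is a rough C rendering of the table method described above (a sketch with made-up names, not the original PDP-11 code):

#include <stdint.h>

static const uint16_t table[19] = {
    40000, 20000, 10000,
     8000,  4000,  2000, 1000,
      800,   400,   200,  100,
       80,    40,    20,   10,
        8,     4,     2,    1
};

uint32_t bin_to_bcd(uint16_t n)
{
    uint32_t bcd = 0;                 /* packed BCD, one nibble per digit    */
    for (unsigned i = 0; i < 19; i++) {
        if (n >= table[i]) {          /* entry fits: subtract it, set its bit */
            n -= table[i];
            bcd |= 1ul << (18 - i);   /* bit 18 is the '4' of the top digit  */
        }
    }
    return bcd;                       /* e.g. 12345 -> 0x12345 */
}

Adding ASCII '0' to each nibble then gives printable digits, as described.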
Here’s an algorithm for BCD that doesn’t require a table: iterate over the input register, left shifting each MS bit into the bottom of the output register. After each shift, add 3 to any nibble in the output reg that is >=5.
If you're using a CPU that has BCD arithmetic or if you're dealing with words > the ALU size, this would be a more efficient way of doing it.
A little PoC of the above, as we’re code golfing…
#include <stdio.h>

int main(void)
{
    unsigned i;
    unsigned a = 0xbc614e, b = 0;   /* 0xbc614e = 12345678 decimal */
    unsigned x;
    printf("0x%08x ", a);
    for (i = 1; i; i <<= 1) {
        /* mark every BCD nibble of b that is >= 5... */
        x = ((b >> 3) | ((b >> 2) & ((b >> 1) | b))) & 0x11111111;
        x |= x << 1;                /* ...and turn each mark into a 3      */
        b = (b + x) << 1;           /* add 3 where needed, then shift left */
        if (a & (1u << 31))
            b |= 1;                 /* bring in the next input bit         */
        a <<= 1;
    }
    printf("= %08x\n", b);
    return 0;
}
Oh dear, html comments evidently not compatible with C shift operator. Here’s a link instead:
https://tio.run/##TY/dasMwDIWv46c4tGxYrBtxO8ogP0@yG9tJNkPnlCwdhqbPnklNO@YbH32SDjr@@cP7eV6H6A@npkX5PTahf/mslQpxxJcNUf/0oSF1VpmQUCy/RYU8Ob83r@0GTqpbJxVKZceBZadXeXrI3xJWG1jiftcP0IGHTcFOCChL1gQ2zxJjrR3qGjvCdNdbwuNdG@GOSFCezO0V1@WpQmI7XEu5h1eekOiPhQ7aipcRtDNEDHlyqpa@XY5hefkXoIIEeI@cwEmCoR1PQ5Swl3n@BQ
Dang, what does that site do, uuencode the content in the URLs?
if you “left shift” too many times, windows will prompt you to turn on sticky keys…
Dad jokes.. why, just why..
You, sir, made me chuckle! thanks for that :-D
Reminds me of the famous Q_rsqrt, the fast inverse square root that was used in Quake III Arena (but presumably existed before then).
https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code
Carmack's comments (the code is on the Wikipedia page) describe it best.
(410*N) >> 12 in unsigned 32 bit integers. If your N won't cause overflow.
It’s close enough for many things.
410 / 2^12 = 0.10009765625
This is from what is now extremely flakey memory. When doing my training at Control Data in London 40 odd years ago, one language we had to learn was ICL Plan (for ICL 1900). There was no native divide but there was a library routine. For our code we were not allowed to use the library but had to implement divide ourselves.
Great article and I'm already a good way through Hacker's Delight. You have a typo in the example code comments line 4: 51n/644. It should be 51n/64. It initially confused me as I'd never come across the fractional approach to division.
Hello,
Maybe the BLD format is what you want. (BLD: binary-look decimal)
For example: let N = 19635.
Now convert it to BLD.
19635
04312 : 11011
02101 : 00110
01000 : 00101
00000 : 01000
So, the result is: 01000, 00101, 00110, 11011..
Note that these are not binary numbers; these are regular decimal numbers, but made up of only 1s and 0s.
Actually it looks like BCD, but it is totally different.
Backward: 1000 * 2^3 + 101 * 2^2 + 110 * 2 + 11011 = 8000 + 404 + 220 + 11011 = 19635
Advantage is you only operate with 1s and 0s.
Regards
PS: (C) Copyright by Adem GÜNEŞ
How the article would have looked if taking the short path:
<>
Dammit html sanitizer!
…what should have been between the angular brackets:
“Division by shifts is just multiplication by a fraction where denominator is a power of two, so you can use traditional composition (by addition) of left shifts in the numerator, and just a right shift for the denominator. The higher power of two you can afford to pick for the denominator, the closer you will generally get to exact result.”
Yes, this would be easy to implement in an FPGA, but you’re more apt to use one of the hard-wired multiplier cores, which probably take less power and fewer gates than this version, which would be implemented with fabric LUTs and carry chains. You’d just multiply by 1/10 with the multiplier.