A Detailed Tutorial On Speeding Up AVR Division

July 15, 2012

[Alan Burlison] is working on an Arduino project with an accelerometer and a few LEDs. Having the LEDs light up as his board is tilted to one side or another is an easy enough project a computer cowboy could whip out in an hour, but [Alan] – ever the perfectionist – decided to optimize his code so his accelerometer-controlled LEDs don’t jitter. The result is a spectacular blog post chronicling the pitfalls of floating point math and division on an AVR.

To remove the jitter from his LEDs, [Alan] used a smoothing algorithm known as an exponential moving average. This algorithm uses multiplication and is usually implemented using floating point arithmetic. Unfortunately, AVRs don’t have floating point arithmetic so [Alan] used fixed point arithmetic – a system similar to balancing your checkbook in cents rather than dollars.

With a clever use of bit shifting to calculate the average with scaling, [Alan] was able to make the fixed point version nearly six times faster than the floating point algorithm implementation. After digging into the assembly of his fixed point algorithm, he was able to speed it up to 10 times faster than floating point arithmetic.

The takeaway from [Alan]’s adventures in arithmetic is that division on an AVR is slow. Not very surprising after you realize the AVR doesn’t have a division instruction. Of course, sometimes you can’t get around having to divide so multiplying by the reciprocal and using fixed point arithmetic is the way to go if speed is an issue.

Sure, squeezing every last cycle out of an 8 bit microcontroller is a bit excessive if you’re just using an Arduino as a switch. If you’re doing something with graphics or need very fast response times, [Alan] gives a lot of really useful tips.

32 thoughts on “A Detailed Tutorial On Speeding Up AVR Division”

tuxfool says:

July 15, 2012 at 10:09 am

This may be a stupid question, but how do you calculate the reciprocal of a number without dividing?

The only way i see this working is if you want to divide by a constant which is pre calculated…

Report comment

Reply
1. tuxfool says:
  
  July 15, 2012 at 10:13 am
  
  Well, division by multiples of 2 is also possible with bit shifting. but otherwise….
  
  (Note to HackA Day, being able to edit comments would be quite useful).
  
  Report comment
  
  Reply
  1. Alan Burlison says:
    
    July 15, 2012 at 2:11 pm
    
    That’s the whole point – by choosing reciprocals where the denominators are a power of two, and by choosing fixed-point scaling factors that are also a power of two the divisions and some of the multiplications can just be replaced with shifts. The remaining multiplications can be done with the hardware MUL instruction. And in some cases, the operations will even partially cancel out, which is why the new sample value is just shifted left by 1 place, because scaling (multiplying) by 32 and then multiplying by 1/16 is just the same as multiplying by 2, i.e. 1 left shift.
    
    Report comment
    
    Reply
2. doragasu says:
  
  July 15, 2012 at 11:11 am
  
  There are a lot of fast algorithms to calculate the reciprocal of a number without using a division. A fast and easy to implement one is for example the Newton-Raphson algorithm. I have used it lots of times when working with fixed point DSPs.
  
  Report comment
  
  Reply
  1. tuxfool says:
    
    July 15, 2012 at 1:17 pm
    
    Cool, I should have gone to wikipedia. They show how to use Newton-Raphson and Goldschmidt division.
    
    Report comment
    
    Reply
  2. guan says:
    
    July 15, 2012 at 3:17 pm
    
    Why don’t the existing software floating point library in avr-gcc simply calculate the reciprocal using Newton-Raphson and multiply by the reciprocal? Or are there situations where that’s not faster, or would violate IEEE 754?
    
    Report comment
    
    Reply
3. Old School says:
  
  July 16, 2012 at 4:50 am
  
  Nice tutorial on reciprocal multiplication : http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
  fix point math : http://homepage.cs.uiowa.edu/~jones/bcd/
  
  Report comment
  
  Reply
4. Bill Stewart says:
  
  July 16, 2012 at 8:42 pm
  
  For exponentially weighted moving averages, you’re going to be running a very large number of cycles of avg = avg*p + measurement*(1-p), so you want to think about roundoff error accumulation, and about doing calculations upfront whenever possible instead of doing them for each measurement. So using division to calculate a reciprocal once and then using multiplication each time is going to be a win.
  
  If you’re using fixed-point calculations, you want your base to be a power of 2, not 10, so you can use shifting instead of division, and so you get precise answers instead of repeating decimals. For instance, instead of avg*95/100 + measurement*5/100, you may want to do avg*122/128 + measurement*6/128, as (avg*122+measurement*6)>>7, or maybe (avg*122+measurement*6+64)>>7 so you get the round-off errors to balance.
  
  Report comment
  
  Reply
Brett_cgb says:

July 15, 2012 at 11:00 am

You calculate the reciprocal of a number at once at compile-time rather than every time during execution. This assumes the number is known at prior to compiling.

These “divide” issues are common to processors that lack division support in hardware (virtually all the small, low-speed microcontrollers that most of use will encounter).

Report comment

Reply
imroy264 says:

July 15, 2012 at 11:05 am

Division is always the slowest of the basic arithmetic operations, no matter the hardware. It’s just not an easy thing to do!

Report comment

Reply
1. Paul says:
  
  July 15, 2012 at 2:05 pm
  
  The Cortex M3 & M4 chips have a hardware divide instruction that executes in 2 cycles.
  
  Report comment
  
  Reply
  1. Megol says:
    
    July 15, 2012 at 3:51 pm
    
    A radix-65536 divider in an embedded processor? That sounds very unlikely.
    
    Report comment
    
    Reply
  2. Rick says:
    
    July 15, 2012 at 7:06 pm
    
    Integer division between 2 & 12 cycles depending on # of leading zeroes:
    
    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337h/CHDDIGAC.html
    
    Report comment
    
    Reply
mikeselectricstuff says:

July 15, 2012 at 12:36 pm

It is rarely necessary to use floating point on a micro, and definitely not for controlling LEDs.
Fixed point is much faster, and usually enirelt adequate ( and sometimes more accurate)

Report comment

Reply
1. Megol says:
  
  July 15, 2012 at 3:53 pm
  
  Exactly. Floating point have its uses but for embedded stuff fixed point is often faster, more accurate and more reliable.
  
  Report comment
  
  Reply
Tom says:

July 15, 2012 at 3:41 pm

Wow, Exponential moving average is a beautifully simplistic way to remove jitter from data, here’s a pic of it on mouse input with the drop-off factor equal to 0.1:

Report comment

Reply
1. Tom says:
  
  July 15, 2012 at 3:42 pm
  
  Seem my link tag failed
  https://twitter.com/T_Hodson/status/224634074181017600/photo/1
  
  Report comment
  
  Reply
Cyril says:

July 15, 2012 at 10:05 pm

If you’re after speed why not use simple rolling average with a 2^n sized stack? Then there’s no need for MUL (granted a 2 cycle MUL is slick) or DIV at all. Assuming unsigned, just ADD, SUB and LSL. Use a stack pointer so you don’t have to roll the whole stack each time. The only problem with this method is you do get a stack size * sample rate (approx) initial lag to the result while the stack fills whereas the EMA method gives you somewhat noisy data initially. In theory the lag seems like a problem but in practice it rarely (for me) is.

Report comment

Reply
Alan Burlison says:

July 15, 2012 at 11:46 pm

@Cyril, I’ve used rolling average (RA) in the past but in this case it didn’t work very well. The raw sensor values tend to be longish sequences of one value, then of another. If I make the RA stack long enough to kill the jitter it also kill the responsiveness to movement. EMA kills the jitter without killing responsiveness to movement.

Report comment

Reply
ftorama says:

July 16, 2012 at 12:02 am

I never understood that people needed to go through floating-point arithmetics to go from an entire value (the ADC reading of accelerometer) to another entire value (the port for driving leds).

First thing before programming is to think about it and find a way to optimize before writing.

The calcultation seems to be this one:
“avg = val * 0.1 + avg * 0.9;”

Simply replace it with:
“avg= (val + 9 * avg)/10;”

And you’re done with it….

Report comment

Reply
1. Kaz says:
  
  July 16, 2012 at 6:40 am
  
  You mean (val + avg + (avg << 3)) ;)
  
  Report comment
  
  Reply
Alan Burlison says:

July 16, 2012 at 2:02 am

@ftorama, except of course that there’s a division in there so your suggested equivalent calculation will end up being much slower as a result. That’s the whole point of the article.

Report comment

Reply
1. ftorama says:
  
  July 16, 2012 at 9:23 am
  
  But he talks about float divisions while I use integer divisions simply by thinking as a microcontroller instead of thinking as a human….
  
  The µC doesn’t care if its ADC reading is °C, Volts or bananas, it’s an integer from 0 to 1023 and we should work with that.
  
  Report comment
  
  Reply
  1. Kaz says:
    
    July 16, 2012 at 3:45 pm
    
    As I read it, the point is that division of ANY kind sucks, regardless of whether it’s fixed or floating-point. If you read the article, “avg = val * 0.1 + avg * 0.9;” came out faster than “v2 = ((val <> 5;” because of the two divisions in there. I agree that simply weighting them in advance and the dividing should probably be investigated too, to see what benefits that provides.
    
    Report comment
    
    Reply
  2. Alan Burlison says:
    
    July 16, 2012 at 9:30 pm
    
    Integer division can end up being slower that floating-point multiplication, it says that at the start of the article.
    
    Report comment
    
    Reply
bflorea says:

July 16, 2012 at 2:19 am

As long as you use powers of 2, almost everything is fast.

Report comment

Reply
Kaz says:

July 16, 2012 at 5:11 am

I came across this site awhile back. It’s pretty handy for finding “good enough” fractional equivalents to floating point numbers.
http://www.mindspring.com/~alanh/fracs.html
It takes any decimal number in and gives you various fractional versions of that number pretty much instantly at varying levels of accuracy. I’ve used it also to avoid overflows in my code (for example, when setting up an audio IC one time I needed to multiply an already large number by 1048576/48000 and this shrank it down to 8192/375 for me, which avoided overflow.

Report comment

Reply
engblaze says:

July 16, 2012 at 11:06 pm

Very cool. We actually ran into this issue ourselves and did some simple division speed testing on an Arduino: http://www.engblaze.com/faster-code-fridays-understand-division-and-speed-of-operations/

This article goes into much more depth. Great find!

Report comment

Reply
asdf says:

July 17, 2012 at 7:46 am

As Blaise Pascal stated; the ‘point’ is imaginary and thus pointless.

i.e.
point = cake

Jones’ post on reciprocal multiplication is a great one.

Additionally, why not just add a low pass filter (cap and resistor, ~$0.01). ZERO additional lines of software. No loss in accuracy. Completely adjustable.

Report comment

Reply
1. Alan Burlison says:
  
  July 17, 2012 at 7:55 am
  
  The accelerometer has a SPI interface, so I don’t think analogue filtering would help much ;-)
  
  Report comment
  
  Reply
Paul W. Dent says:

October 28, 2019 at 6:58 pm

Love this thread. Pity it’s so old.
This subject has legs. I am doing small FFTs on an ATMEGA1284 ! ( and a TINY)

Report comment

Reply
Paul W. Dent says:

October 28, 2019 at 7:26 pm

Reason I commented that this subject has legs is the following:
-Logarithmetic on AVR (where only adds and subtracts are quirky but mul, div , raise to power and sqrt are trivial
– Complex in log(r), theta format (3 bytes)
(note: The +/- sign in real can be considered a special case of 1 bit quantizing of the angle to 0 or 180 )
– Speeding up calculations by casting them to use MACs and Butterflies which can be implemented more efficiently than the separate operations
-Karatsuba-Offman for longer integer multiplications e.g. 32, 64 or 128 bit or more
-Fast division of any length integer by any other length integer byte-at-a-time using 256-word precalculated lookup table of remainders

Report comment

Reply