# A Detailed Tutorial On Speeding Up AVR Division

[Alan Burlison] is working on an Arduino project with an accelerometer and a few LEDs. Having the LEDs light up as his board is tilted to one side or another is an easy enough project a computer cowboy could whip out in an hour, but [Alan] – ever the perfectionist – decided to optimize his code so his accelerometer-controlled LEDs don’t jitter. The result is a spectacular blog post chronicling the pitfalls of floating point math and division on an AVR.

To remove the jitter from his LEDs, [Alan] used a smoothing algorithm known as an exponential moving average. This algorithm uses multiplication and is usually implemented using floating point arithmetic. Unfortunately, AVRs don’t have floating point arithmetic so [Alan] used fixed point arithmetic – a system similar to balancing your checkbook in cents rather than dollars.

With a clever use of bit shifting to calculate the average with scaling, [Alan] was able to make the fixed point version nearly six times faster than  the floating point algorithm implementation. After digging into the assembly of his fixed point algorithm, he was able to speed it up to 10 times faster than floating point arithmetic.

The takeaway from [Alan]’s adventures in arithmetic is that division on an AVR is slow. Not very surprising after you realize the AVR doesn’t have a division instruction. Of course, sometimes you can’t get around having to divide so multiplying by the reciprocal and using fixed point arithmetic is the way to go if speed is an issue.

Sure, squeezing every last cycle out of an 8 bit microcontroller is a bit excessive if you’re just using an Arduino as a switch. If you’re doing something with graphics or need very fast response times, [Alan] gives a lot of really useful tips.

## 32 thoughts on “A Detailed Tutorial On Speeding Up AVR Division”

1. tuxfool says:

This may be a stupid question, but how do you calculate the reciprocal of a number without dividing?

The only way i see this working is if you want to divide by a constant which is pre calculated…

1. tuxfool says:

Well, division by multiples of 2 is also possible with bit shifting. but otherwise….

(Note to HackA Day, being able to edit comments would be quite useful).

1. That’s the whole point – by choosing reciprocals where the denominators are a power of two, and by choosing fixed-point scaling factors that are also a power of two the divisions and some of the multiplications can just be replaced with shifts. The remaining multiplications can be done with the hardware MUL instruction. And in some cases, the operations will even partially cancel out, which is why the new sample value is just shifted left by 1 place, because scaling (multiplying) by 32 and then multiplying by 1/16 is just the same as multiplying by 2, i.e. 1 left shift.

2. There are a lot of fast algorithms to calculate the reciprocal of a number without using a division. A fast and easy to implement one is for example the Newton-Raphson algorithm. I have used it lots of times when working with fixed point DSPs.

1. tuxfool says:

Cool, I should have gone to wikipedia. They show how to use Newton-Raphson and Goldschmidt division.

2. Why don’t the existing software floating point library in avr-gcc simply calculate the reciprocal using Newton-Raphson and multiply by the reciprocal? Or are there situations where that’s not faster, or would violate IEEE 754?

3. Bill Stewart says:

For exponentially weighted moving averages, you’re going to be running a very large number of cycles of avg = avg*p + measurement*(1-p), so you want to think about roundoff error accumulation, and about doing calculations upfront whenever possible instead of doing them for each measurement. So using division to calculate a reciprocal once and then using multiplication each time is going to be a win.

If you’re using fixed-point calculations, you want your base to be a power of 2, not 10, so you can use shifting instead of division, and so you get precise answers instead of repeating decimals. For instance, instead of avg*95/100 + measurement*5/100, you may want to do avg*122/128 + measurement*6/128, as (avg*122+measurement*6)>>7, or maybe (avg*122+measurement*6+64)>>7 so you get the round-off errors to balance.

2. Brett_cgb says:

You calculate the reciprocal of a number at once at compile-time rather than every time during execution. This assumes the number is known at prior to compiling.

These “divide” issues are common to processors that lack division support in hardware (virtually all the small, low-speed microcontrollers that most of use will encounter).

3. imroy264 says:

Division is always the slowest of the basic arithmetic operations, no matter the hardware. It’s just not an easy thing to do!

1. The Cortex M3 & M4 chips have a hardware divide instruction that executes in 2 cycles.

1. Megol says:

A radix-65536 divider in an embedded processor? That sounds very unlikely.

4. mikeselectricstuff says:

It is rarely necessary to use floating point on a micro, and definitely not for controlling LEDs.
Fixed point is much faster, and usually enirelt adequate ( and sometimes more accurate)

1. Megol says:

Exactly. Floating point have its uses but for embedded stuff fixed point is often faster, more accurate and more reliable.

5. Tom says:

Wow, Exponential moving average is a beautifully simplistic way to remove jitter from data, here’s a pic of it on mouse input with the drop-off factor equal to 0.1:

6. Cyril says:

If you’re after speed why not use simple rolling average with a 2^n sized stack? Then there’s no need for MUL (granted a 2 cycle MUL is slick) or DIV at all. Assuming unsigned, just ADD, SUB and LSL. Use a stack pointer so you don’t have to roll the whole stack each time. The only problem with this method is you do get a stack size * sample rate (approx) initial lag to the result while the stack fills whereas the EMA method gives you somewhat noisy data initially. In theory the lag seems like a problem but in practice it rarely (for me) is.

7. @Cyril, I’ve used rolling average (RA) in the past but in this case it didn’t work very well. The raw sensor values tend to be longish sequences of one value, then of another. If I make the RA stack long enough to kill the jitter it also kill the responsiveness to movement. EMA kills the jitter without killing responsiveness to movement.

8. ftorama says:

I never understood that people needed to go through floating-point arithmetics to go from an entire value (the ADC reading of accelerometer) to another entire value (the port for driving leds).

First thing before programming is to think about it and find a way to optimize before writing.

The calcultation seems to be this one:
“avg = val * 0.1 + avg * 0.9;”

Simply replace it with:
“avg= (val + 9 * avg)/10;”

And you’re done with it….

1. You mean (val + avg + (avg << 3)) ;)

9. @ftorama, except of course that there’s a division in there so your suggested equivalent calculation will end up being much slower as a result. That’s the whole point of the article.

1. ftorama says:

But he talks about float divisions while I use integer divisions simply by thinking as a microcontroller instead of thinking as a human….

The µC doesn’t care if its ADC reading is °C, Volts or bananas, it’s an integer from 0 to 1023 and we should work with that.

1. As I read it, the point is that division of ANY kind sucks, regardless of whether it’s fixed or floating-point. If you read the article, “avg = val * 0.1 + avg * 0.9;” came out faster than “v2 = ((val <> 5;” because of the two divisions in there. I agree that simply weighting them in advance and the dividing should probably be investigated too, to see what benefits that provides.

2. Integer division can end up being slower that floating-point multiplication, it says that at the start of the article.

10. As long as you use powers of 2, almost everything is fast.

11. I came across this site awhile back. It’s pretty handy for finding “good enough” fractional equivalents to floating point numbers.
http://www.mindspring.com/~alanh/fracs.html
It takes any decimal number in and gives you various fractional versions of that number pretty much instantly at varying levels of accuracy. I’ve used it also to avoid overflows in my code (for example, when setting up an audio IC one time I needed to multiply an already large number by 1048576/48000 and this shrank it down to 8192/375 for me, which avoided overflow.

12. asdf says:

As Blaise Pascal stated; the ‘point’ is imaginary and thus pointless.

i.e.
point = cake

Jones’ post on reciprocal multiplication is a great one.

Additionally, why not just add a low pass filter (cap and resistor, ~\$0.01). ZERO additional lines of software. No loss in accuracy. Completely adjustable.

1. The accelerometer has a SPI interface, so I don’t think analogue filtering would help much ;-)

13. Paul W. Dent says:

Love this thread. Pity it’s so old.
This subject has legs. I am doing small FFTs on an ATMEGA1284 ! ( and a TINY)

14. Paul W. Dent says:

Reason I commented that this subject has legs is the following:
-Logarithmetic on AVR (where only adds and subtracts are quirky but mul, div , raise to power and sqrt are trivial
– Complex in log(r), theta format (3 bytes)
(note: The +/- sign in real can be considered a special case of 1 bit quantizing of the angle to 0 or 180 )
– Speeding up calculations by casting them to use MACs and Butterflies which can be implemented more efficiently than the separate operations
-Karatsuba-Offman for longer integer multiplications e.g. 32, 64 or 128 bit or more
-Fast division of any length integer by any other length integer byte-at-a-time using 256-word precalculated lookup table of remainders

Please be kind and respectful to help make the comments section excellent. (Comment Policy)

This site uses Akismet to reduce spam. Learn how your comment data is processed.