The effect of code on power consumption

Of course putting a microcontroller into sleep mode or changing the clock rate has an effect on the power consumption of the chip, but what about different bits of code? Is multiplying two numbers more efficient than adding them, and does ORing two values consume more power than NOPping? [jcw] wanted to compare the power draw of a microcontroller running different loops, so he threw some code on a JeeNode and hooked it up to an oscilloscope.

For his test, [jcw] compared two operations: multiply and shift left. The loops run 50,000 and 5,000 times, respectively (bit shifting is really slow on ATmegas, apparently), and he watched the oscilloscope while the JeeNode did its work.

Surprisingly, there is a difference in power consumption between the multiply and shift loops. The shift loop draws 8.4 mA, while the multiply loop draws 8.8 mA. Not much, but clearly visible and measurable. While you’re probably not going to optimize the power draw of a project by only using low-power instructions, it’s still very interesting to watch a microcontroller do its thing.

Comments

  1. Sheldon says:

    On the higher-end processors a lot is made about minimising the number of times you update registers, in particular when you change the state of the register i.e. 1->0 or 0->1. If it changes to the same value (as in the case of most of the bits in the bit-shift) then you don’t consume that much.

    On the sorts of geometries that are used for micro-controllers the power consumption of the combinatorial logic doesn’t equate to much (on the really small stuff it’s down to the wires and general leakage of the technology – hence very low voltages)

  2. 0x4368726973 says:

    When optimizing, you have to keep in mind the time taken per instruction: if different instructions take different amounts of time, or if you have to combine multiple lower-power instructions to do the same job as one higher-power instruction, you may be using more energy for the same job.

    • taintedkernel says:

      Yes – I recall discussion about powersaving modes on laptops that may not benefit much from large computation due to the longer execution time. Granted they are completely opposite ends of the spectrum comparing a modern CPU to a uC.

      It’s always made me wonder a bit though – bringing in the age-old car analogy, there is a huge difference: slow and steady is vastly more fuel efficient than fast and short. I’m guessing the mechanical nature and friction at higher speeds has a lot to do with that.

      • Alex says:

        Intel recommends a “hurry up and wait” strategy. More power can be saved if the CPU is run at full speed when work needs to be done, so it can go back to sleep at the lowest state for the longest time possible.

  3. rasz says:

    site is dead, but something tells me he did it wrong, used C or Crapuino or some Basic instead of assembler, not to mention he probly loops after one instruction instead of lining up 1KB of same instruction and looping that.

    • yup. bit shifting on AVR is fast (but instructions only do it by one bit) … he used gcc, which makes a loop to shift the right amount…fail

      • andygoth says:

        As documented on the linked site and commented on here, the AVR lacks a barrel shifter, so gcc has no choice but to generate a loop. Using assembly language would make no difference. Also, the reference to “Crapuino” is very out of place since this work was done on a JeeNode, not an Arduino. The purpose of the exercise was not to get high performance, only to measure power consumption in typical operations, so loop unrolling isn’t appropriate. And the site is certainly not dead; there’s a huge archive of great stuff there, which I suggest you read.

      • adcurtin says:

        there’s a big difference:

        gcc generating a loop, in pseudo-asm:
        #define LOOP_COUNT 10

        r1 = 0
        loop: if r1 >= LOOP_COUNT
        goto end
        left shift r0 //assuming no second input to specify
        //shift amount since there’s no barrel shifter
        r1++
        goto loop
        end: halt

        doing the same thing, manually:
        left shift r0
        left shift r0
        left shift r0
        left shift r0
        left shift r0
        left shift r0
        left shift r0
        left shift r0
        left shift r0
        left shift r0

        The first one takes about 50 instructions to complete (it’s not 100% optimal for a loop, but good enough to make a point)

        The second one takes 10 instructions.

        They’re both O(n), but one is n and the other is 5n. n beats 5n in terms of efficiency any day.

        Note: yes this could go to shit with caches and using cached instructions in the loop vs. uncached in the non loop. Also, space taken up is O(1) for the loop, and O(n) for the nonloop. It’s a tradeoff, but for a test like this it probably doesn’t matter, especially if he’s doing the same method for each instruction (so the repeated loading of instructions vs. cached loop would effectively be nullified; it would lead to a possible increase in power use, but that would be included in the baseline, and you could better tell the difference between the power each instruction uses).
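The instruction counts in the comment above can be checked with a toy model. This is a Python sketch, not AVR assembly. The function names are hypothetical, and charging one count per pseudo-instruction is an assumption for illustration (real AVR instructions take differing numbers of cycles).

```python
def shift_looped(value, count, width=16):
    """Shift left one bit per pass of a counted loop, tallying pseudo-instructions."""
    mask = (1 << width) - 1
    executed = 1                      # r1 = 0
    i = 0
    while True:
        executed += 2                 # compare r1 with the count, conditional branch
        if i >= count:
            break
        value = (value << 1) & mask   # single-bit left shift
        i += 1
        executed += 3                 # the shift, the increment, the jump back
    return value, executed


def shift_unrolled(value, count, width=16):
    """Model `count` single-bit shifts written out back to back."""
    mask = (1 << width) - 1
    executed = 0
    for _ in range(count):            # stands in for `count` copies of the shift
        value = (value << 1) & mask
        executed += 1
    return value, executed


print(shift_looped(1, 10))    # (1024, 53)
print(shift_unrolled(1, 10))  # (1024, 10)
```

Both versions produce the same value. The looped one spends roughly five pseudo-instructions per shift on bookkeeping, which is where the "about 50" figure comes from.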

  4. McMonster says:

    I’m just curious what exactly the test case looks like. Specifically the output assembler code. Did he take optimization into account?

  5. Volfram says:

    “Oh hay, multiplication is more expensive than bitshifting.”

    Well YEAH. It was mentioned early in my first microcontrollers class that if you need to multiply or divide an integer by a power of 2, you should probably use bitshifting instead.

    • Volfram says:

      Incidentally, this is why I also try to use & operations instead of % operations to catch odd numbers. I haven’t tested it, but something tells me that the fact that a % involves division will make it fairly expensive.

      • Ken says:

        Don’t bother replacing “x%2” with “x&1” if the former reads better to you. An optimizing compiler will produce the same code for both expressions.

      • Volfram says:

        @Ken
        That’s comforting to know. I’ve been making a habit of it for long enough that they both read about the same to me, though.

      • charliex says:

        “optimizing compiler will produce”

        optimizing compiler might produce

      • Ken says:

        Charliex: Any compiler that misses this does not deserve to be called optimizing!

        Compilers have been producing better (e.g. faster) code than experienced assembly language programmers for decades, and they don’t miss obvious bottlenecks like division by a constant. Check out the code your favorite compiler generates for division by a constant (any constant), and you’ll see what I mean.
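A quick way to see why a compiler is free to treat the two odd-number tests alike is to check that they always agree. The sketch below is Python, with a hypothetical helper `c_rem` that mimics C99's truncating remainder. One caveat it makes visible: in C, for a negative signed `x` the value of `x % 2` is -1 while `x & 1` is 1, so the two are interchangeable only as a nonzero test (or for unsigned types).

```python
def c_rem(x, m):
    """Remainder with C99 semantics (quotient truncates toward zero); m > 0 assumed."""
    r = abs(x) % m
    return -r if x < 0 else r


def is_odd_mod(x):
    """Odd test written with the remainder operator."""
    return c_rem(x, 2) != 0


def is_odd_and(x):
    """Odd test written with a bitwise AND of the lowest bit."""
    return (x & 1) != 0


# The two tests agree for every value tried, negative ones included.
print(all(is_odd_mod(x) == is_odd_and(x) for x in range(-1000, 1001)))
```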

    • Volfram says:

      Note to self: read article first.

      Error in article: the Multiply required more power than the bitshift (though this is probably offset by the factor of 10 increase in runtime)

      My work was with a Motorola HC11.

      Hmm. Now I’m curious what processors (the ones that are the core of a microcontroller and the ones that you install in personal computers, either way) have barrel shifters and what don’t. It seems like it would be a pretty bad oversight for Intel to leave something like that out.

    • Ren says:

      Volfram,
      From my Computer Architectures class I recall the time penalty of OP codes. Some operations take more machine cycles than others. It also makes sense that what each command does with the gates of the core matters, i.e. using more gates, or making more changes to gate states, will consume more power.

      I’m asking you to consider the “tone” of your reply “well YEAH”. I (and I suspect most of HaD readers) did not have a microcontrollers class and so this is new to many of them.

      • Volfram says:

        Bit late to change the tone of my OP now, HaD doesn’t have an edit button ;p

        It should still be obvious to anyone who thinks about it, though. Multiplication and division are inherently expensive operations, and even done by hand, they are represented as sequential addition and subtraction. Computers don’t have the ability to intuit like humans do. If you go just by silicon chip space, it’s significantly cheaper to handle both operations in software as sequential addition or subtraction. (Multiplication takes up a lot more space than addition, and it reduces the maximum clock speed. Division even more so. I wrote an unclocked divider for my final project at school, ended up multiplexing the outputs to reduce the number of dividers from around 20 to closer to 8, and still 80-90% of the FPGA I was using ended up being occupied with dividers.)

        Doing it that way is really slow, though.

  6. Sam says:

    I actually believe this is one of the more sophisticated ways of cracking some RSA encryption… by watching the time it takes for a processor to complete its cycles, you can infer what it was calculating. crazy stuff.
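To make the timing idea concrete, here is a minimal Python sketch of textbook left-to-right square-and-multiply exponentiation, the kind of routine at the heart of RSA. It is not taken from any particular crypto library. The function name is hypothetical, and the multiply count is only a stand-in for the time or power an observer would measure.

```python
def modexp_counting(base, exponent, modulus):
    """Square-and-multiply; returns (result, number of extra multiplies)."""
    result = 1
    multiplies = 0
    for bit in bin(exponent)[2:]:                # exponent bits, most significant first
        result = (result * result) % modulus     # every bit costs a squaring
        if bit == "1":
            result = (result * base) % modulus   # only 1 bits cost a multiply
            multiplies += 1
    return result, multiplies


# Two 7-bit exponents: same length, very different amounts of work.
print(modexp_counting(7, 0b1000001, 1009)[1])   # 2
print(modexp_counting(7, 0b1111111, 1009)[1])   # 7
```

Because the extra multiply happens only for 1 bits, the amount of work leaks information about the secret exponent unless the implementation is written to run in constant time.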

  7. The bit shift operations on AVRs (LSL/LSR/ROL/ROR) all take only one cycle so they aren’t ‘slow’ per se, but they only shift left/right one bit at a time. He’s using 16bit integers *and* a ridiculous number of shifts (321). Under those circumstances gcc quite reasonably implements the shift as a loop. In addition, an 8bit x 8bit multiply only takes 2 cycles but of course you need several of those plus some additions and shifts to implement a 16bit x 16bit multiply – and because of C’s type promotion rules, most arithmetic ops end up being 16bit, as that’s the size of an int with gcc on AVR.

    There’s really nothing here other than ‘Making the CPU do work uses energy’ – well, duh.
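The "several 8-bit multiplies plus some additions and shifts" can be sketched as follows. This Python model shows the decomposition only; it is not the code avr-gcc emits. `mul8` is a hypothetical stand-in for the hardware 8x8 multiply, and the result is truncated to 16 bits on the assumption that it is stored back into a 16-bit int.

```python
def mul8(a, b):
    """Model of an 8-bit x 8-bit -> 16-bit hardware multiply."""
    assert 0 <= a <= 0xFF and 0 <= b <= 0xFF
    return a * b


def mul16_truncated(x, y):
    """16-bit x 16-bit product, truncated to 16 bits, built from three 8x8 multiplies."""
    x_lo, x_hi = x & 0xFF, (x >> 8) & 0xFF
    y_lo, y_hi = y & 0xFF, (y >> 8) & 0xFF
    result = mul8(x_lo, y_lo)                   # contributes to bits 0..15
    result += (mul8(x_lo, y_hi) & 0xFF) << 8    # only its low byte lands in bits 8..15
    result += (mul8(x_hi, y_lo) & 0xFF) << 8    # likewise
    # x_hi * y_hi would start at bit 16, so it never reaches a 16-bit result
    return result & 0xFFFF


print(hex(mul16_truncated(0x1234, 0x0056)))
```

In this model the work comes to three 8-bit multiplies, two shifts by a whole byte and two additions. A lone 8-bit multiply would be much cheaper.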

    • Volfram says:

      Performing 321 shifts on a 16-bit register strikes me as a really expensive way of clearing that register. Small wonder it would take 10x as long as multiplication, he’s overloading the register by 2000%. (I really need to get in there and read the whole article to figure out what he was trying to do.)

      This is like firing a howitzer at a piece of safety glass to determine how bullet-proof it is. You’re going to fail regardless, and it’s not really going to tell you anything useful about what you’re actually trying to test.
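The point about clearing the register can be checked with a small model. This Python sketch assumes, as the comments here describe, that a single 16-bit value is shifted left one bit at a time; the function name is hypothetical.

```python
def shifts_until_zero(value, width=16):
    """Count single-bit left shifts until a fixed-width register reads zero."""
    mask = (1 << width) - 1
    value &= mask
    shifts = 0
    while value != 0:
        value = (value << 1) & mask   # the top bit falls off the end
        shifts += 1
    return shifts


print(shifts_until_zero(1))        # 16
print(shifts_until_zero(0x8000))   # 1
```

No 16-bit value survives more than 16 shifts, so under that assumption at least 305 of the 321 shifts operate on a register that is already zero.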

  8. Bogdan says:

    HAD you got things swapped, multiply consumes more.

    The code from the website shifts a 16-bit number by 321 places, so there is a lot of added code because the CPU doesn’t support it directly; it takes more than a simple 8-bit shift instruction.

  9. E says:

    this is the insight that enables many side channel attacks against crypto on micros.

    http://www.cryptography.com/public/pdf/DPA.pdf

  10. Lionel Brits says:

    Besides the obvious implied application of this (energy efficient code), my understanding is that the real magic is in side-channel attacks.

  11. nes says:

    Gating the clock to the multiplier unit perhaps? It would make sense if the manufacturer had implemented it as an iterative multiplier and is using an internal ring oscillator to run it much faster than the rest of the CPU. Just switch it off when not in use.

  12. Chris C. says:

    Power consumption varying with the complexity of operations? Not surprising, though cool to see it documented.

    The ATMega doesn’t have a barrel shifter? Now that is surprising, especially since it’s so common to optimize by bit-shifting instead of multiplying/dividing by powers of two. Makes me glad I’m a PIC user.

  13. neimad says:

    I once heard that on a Commodore 64, it might be possible to make the red LED power light on the top of the console ‘flicker’ or change brightness depending on the code that was running. I never got it to work, but this certainly does shed some light on how it might have worked.

  14. NateOcean says:

    For some applications measuring *power* consumption may not be as important as measuring *energy* consumption. In the example given the multiply drew only 5% more current (thus 5% more power) than using the shifts. But if the multiply completed the job a lot faster than a similar shift based multiply routine, then potentially the micro could return to some very low power sleep mode thus extending battery life.
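The power-versus-energy distinction comes down to energy = voltage x current x time. In the Python sketch below the two currents are the measured figures from the article. The 3.3 V supply and both run times are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
def energy_millijoules(current_ma, seconds, supply_volts=3.3):
    """Energy in millijoules: volts * milliamps * seconds."""
    return supply_volts * current_ma * seconds


# Hypothetical run times: the multiply finishes in 1 s, a shift-based routine needs 2 s.
multiply_mj = energy_millijoules(8.8, 1.0)
shift_mj = energy_millijoules(8.4, 2.0)
print(multiply_mj < shift_mj)   # True: the higher current still costs less energy
```

With these made-up run times, the routine that draws about 5% more current uses roughly half the energy, because it finishes sooner and the chip can go back to sleep.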

  15. xyxy1024 says:

    Still it would be more efficient to use a single multiply instruction instead of multiple shift instructions. You would expect that to be done by an optimising compiler – but is GCC that smart yet?

  16. He updated the article. Bit shifting is faster.
