Want Lower Power? Add More Cores!

[Jacob Beningo] over at Embedded.com recently posted his thoughts on how to do a low-power microcontroller design. On the surface, some of his advice seems a little counter-intuitive. Even he admits, “…I’m suggesting adding more cores! I must be crazy!” There are a few tips, but the part he’s talking about is that you can save power by using CPUs with multiple cores and optimizing for speed.

This seems strange, since you’d expect additional cores and more speed to consume more power. But the idea is that the faster you get your work done, the sooner you can go to sleep. We’ve seen that in our own projects — faster work means more napping, and that’s good for power consumption.
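For the curious, here’s roughly what that “race to sleep” pattern looks like in firmware — a minimal sketch for a Cortex-M class part, where `do_work()` is a placeholder for your application’s burst of processing (the SCB->SCR address and SLEEPDEEP bit are standard Cortex-M, the rest is illustrative):

```c
#include <stdint.h>

extern void do_work(void);  /* placeholder: the application's burst of work */

int main(void) {
    for (;;) {
        do_work();  /* race: finish the work at full clock speed */

        /* Then sleep as deeply as possible until the next interrupt.
         * On Cortex-M, setting SLEEPDEEP (bit 2) in the System Control
         * Register (SCB->SCR at 0xE000ED10) selects deep sleep, and
         * WFI halts the core until an interrupt arrives. */
        *(volatile uint32_t *)0xE000ED10UL |= (1u << 2);
        __asm volatile ("wfi");
    }
}
```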

Of course, it isn’t just that simple. Multiple cores don’t help you if you don’t use them. The overarching goal is to get done quickly so you can get back to sleep. You know, kind of like work. The other advice in the post is generally good, too. Measure your power consumption, respond to events, and — maybe slightly surprising — according to [Jacob], with modern CPUs the variation within a CPU family isn’t very significant. Instead, he reports that the big changes come from switching to the least-capable processor family that can still do the job.

Naturally, Hackaday readers are no strangers to low-power design. If you get your power consumption low enough, you can consider a low-tech battery or even a potato.

35 thoughts on “Want Lower Power? Add More Cores!”

    1. The Propeller is still a good idea, though quite niche. It’s also the only processor I’ve found that you can leave running slow code on a slow clock for less than 10 µA. (10 µA is enough for 3-4 cores at 32 kHz.) It’s just a whole lot less painful to write low-power UI and sensing code when you can just let it tick over slowly vs. stop-start in bursts.

  1. “Low”, “high”, etc. — this is not information.
    Very simple question: how long can my PDA or microcontroller-based computer run? A month? What eats the power — the screen? Memory? Peripherals? Keyboard?
    I need real information and numbers,
    not “low”, “very low”, “tiny”, “super ultra top speed ++”.

  2. Assuming exactly the same architecture, and the same set of instructions performing the same work, by performing the work quicker (by adding more cores), you use less energy?

    What a wonderful universe you live in. In mine, the law of conservation of energy makes this impossible.

    1. Raising the core voltage increases the maximum frequency roughly linearly, but power consumption goes up with the square of the voltage. Other factors like caches and bus utilization can also have an impact.

    2. no reason to be so snooty – this makes perfect sense to me.

      When the processor is in deep sleep, a lot of subsystems are powered down.
      When you come out of sleep, for instance, you have to start the clocks up to high speed, and that clock is then available to all subsystems.
      So perhaps instead of being awake for 10 time units at power level 3 for sequential processing, you are awake for 6 time units at power level 4 for parallel processing.

    3. It’s because you live in the Textbook Universe and he does not. In his universe, capacitors leak over time, wires have resistance and inductance, transistors have resistance, parallel wires have capacitance.

      Longer running times let those effects add up as losses, so if you do the job 10 times faster, those pesky little energy losses only get to steal a tenth as much power.

      In the Textbook Universe those effects don’t exist.

      1. …and to do the same work 10 times faster, you’ll have to add at least 10 times the instantaneous energy (more, for those “pesky” losses – your base losses, excluding the work you want to perform, increase exponentially along the way).

        You can shift the work around in time, distribute it over bigger time intervals, or compress it into shorter time intervals, but you can NEVER use less energy while performing the same work in exactly the same manner. Instantaneous current does NOT equal energy consumed. Personally I think I’m the one living in the real universe…hence my comment.

        1. In our universe we’re nowhere close to the theoretical efficiency where power usage scales with computation. There’s this pesky thing called “parasitic power usage” that has a large constant term that doesn’t increase with the amount of computations done, or even with core count. A uC that’s halted but still fully powered consumes a lot more energy than one that’s in a deep sleep. That’s because sleep disables parasitic draws such as external bus buffers, on-chip peripherals, and clock generators–stuff that doesn’t scale with the number of cores.
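          To put rough numbers on that duty-cycle effect (figures made up for illustration, not from any datasheet), the arithmetic looks something like this:

          ```c
          #include <stdio.h>

          int main(void) {
              /* Illustrative numbers only: the awake draw includes the parasitic
               * loads (bus buffers, peripherals, clocks); deep sleep gates them. */
              const double p_awake = 12.0e-3; /* W while awake */
              const double p_sleep = 2.0e-6;  /* W in deep sleep */
              const double period  = 1.0;     /* s per wake/sleep cycle */

              /* The less time spent awake per cycle, the less energy per cycle. */
              for (double t_awake = 0.4; t_awake > 0.04; t_awake /= 2) {
                  double e = p_awake * t_awake + p_sleep * (period - t_awake);
                  printf("awake %.3f s -> %.4f mJ per cycle\n", t_awake, e * 1e3);
              }
              return 0;
          }
          ```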

        2. :P Johan… come on now. You know exactly how you can make it more efficient, and yet, because technically you’re correct that just adding more cores doesn’t help that much at all, you won’t break it down for everyone to show what you can do?

          Let’s ELI84, shall we?

          Let’s say you have to walk to the grocery store every day to buy groceries; each time you make the trip there and back you expend energy (energy spent per trip = 100 Auts). However, your parents aren’t home from work yet, so every time you go to buy groceries you have to bring your sluggish little brother Verschwenderisch along with you. So now every time you go to the store there is double the energy being used (200 Auts), and because Versch is so damn much slower than you, you actually take longer to finish the job too. Gawd, so lame, Versch, why don’t you grow up already!

          Anyways, if you went to the grocery store every day with your brother to buy your meals, you’re spending 1400 Auts a week in energy. So instead you hatch a plan: you convince your 6 neighbours to help out and walk to the store with you. Now you can carry enough groceries for the entire week! So looking at how much energy we spent, we’re at 200 Auts for you and Versch, plus an additional 600 Auts for your neighbours, for a total of 800. So now, just by adding some more cores… errr, neighbours, redistributing the tasks, and compressing everything into a smaller time interval, we’ve somehow managed to drop our total energy usage by 600 Auts a week! We didn’t actually use less energy per task, but we were able to rearrange things so we used less energy overall.

          Ok ok, I am trivializing things. Honestly it’s crazy complex, and the further you dig, the more dirt falls back down in on you.

          Johan is mostly right, though. In reality we don’t really save that much energy at all on the face of it. Yes, we cut out inefficiencies, but unless we go further we don’t gain that much at all, and we’re increasing costs and other stuff to boot. But we can do some other tricks to help. A big one would be lowering the clock speed in conjunction with throwing in multiple cores. Lowering the clock speed will increase the individual task time a bit, but since we’re running them all at once it’s not a big deal, as overall it’s still faster than the serialized tasks. And while turning down the frequency may slow things down, it helps us because of the relationship between frequency and power: dynamic power scales with frequency times voltage squared, and since the voltage can drop along with the frequency, halving the frequency can cut power by roughly 8x, in turn making your task roughly 4x more energy efficient (it runs twice as long at an eighth of the power), and since you’re running things in parallel now, the speed losses are offset.
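          To sanity-check that frequency/power claim, here’s a tiny C sketch under the usual idealized assumption that dynamic power goes as P ≈ C·V²·f and that voltage can scale down proportionally with frequency (real parts have voltage floors, so treat the 8x/4x as a best case):

          ```c
          #include <stdio.h>

          int main(void) {
              /* Normalized units; V assumed proportional to f (idealized DVFS). */
              double f1 = 1.0, v1 = 1.0;   /* full speed */
              double f2 = 0.5, v2 = 0.5;   /* half speed, half voltage */

              double p1 = v1 * v1 * f1;    /* relative dynamic power at full speed */
              double p2 = v2 * v2 * f2;    /* 0.125 -> roughly 8x less power */

              double e1 = p1 * (1.0 / f1); /* energy = power x task time (time ~ 1/f) */
              double e2 = p2 * (1.0 / f2); /* 0.25 -> roughly 4x less energy */

              printf("power: %gx less, energy: %gx less\n", p1 / p2, e1 / e2);
              return 0;
          }
          ```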

          1. Good example, but only partially correct. You are missing baseload energy inputs…your neighbours need energy input just to exist, not only when they perform work. All electronic circuitry has the same requirement…it needs to consume current just to be on (without performing any work). More cores, more baseload, and even with the additional cores switched off, the core arbiter will always need power. And if you have a cache, power gets a lot worse.

        3. The author wasn’t even suggesting multiple cores of the same type. He suggested a low-power core for routine ongoing housekeeping, and a higher-power core which only runs when a beefy computation needs to be done quickly. The example he gives is a low-power, slow M0+ core that listens for an audio keyword and wakes up a higher-powered, faster M4 core that does the heavy *real-time* number-crunching needed to analyze and respond to the request. The low-power core might not even be able to drive the needed peripherals (such as an Ethernet interface), might not be able to address all the RAM needed for buffers or an ML model, and even if it could, it would waste a lot of time and energy dealing with word-size limitations and swapping between internal and external buffers. It would be like shipping a semi-trailer load of potatoes by making multiple trips in a pickup.

          1. Reminds me of the 8087 floating point coprocessor. You could use code to (slowly) perform math on the 8086, OR send the values to 8087 registers and invoke the FP instructions.

            Now I’m wondering: where do you draw the distinction between multi-core and multiple CPUs? A series of small cores could each run portions of a DSP filter’s code, for example. But to be truly efficient, they should probably each have their own code, which means their own PC register, stack register… this could get very messy quite quickly.

            Maybe I’m missing something.

        4. I don’t think you understand how math works.

          Let’s say we have a CPU that consumes 10W when awake and 1W for every core that’s active running instructions. Your program needs to run 1000 instructions, and each core can run 1000 instructions in 1 second.

          A single core will take 1 second, so (10W x 1s) + (1W x 1s) = 11 Watt-seconds of energy consumed.

          10 cores will take 0.1s, so (10W x 0.1s) + (10 x 1W x 0.1s) = 2 Watt-seconds.

          I think your confusion is that you think that work as defined in physics is the same as logical operations. Or perhaps you think that cores are the main energy consumer in a CPU? Either way, that’s just not how it works. A dual-core dsPIC33 doesn’t consume twice the energy of a single-core version, and turning off half the cores in an i9-9900K doesn’t cut energy consumption by 50%.
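          Those numbers check out; here they are as a throwaway C snippet, using the same hypothetical 10 W uncore / 1 W-per-core figures as above:

          ```c
          #include <stdio.h>

          int main(void) {
              const double p_uncore = 10.0;   /* W whenever the chip is awake */
              const double p_core   = 1.0;    /* W per active core */
              const double insns    = 1000.0; /* instructions to execute */
              const double ips      = 1000.0; /* instructions/second per core */

              for (int n = 1; n <= 10; n *= 10) {  /* 1 core, then 10 cores */
                  double t = insns / (ips * n);             /* seconds awake */
                  double e = (p_uncore + p_core * n) * t;   /* watt-seconds */
                  printf("%2d core(s): %.1f s awake, %.1f Ws\n", n, t, e);
              }
              return 0;
          }
          ```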

          1. I think you don’t understand how processors work in real life…

            If you have only 2 cores connected to the same memory (assuming they are going to share the same work), your available bus access time halves when both cores are running, unless you double the core clocks (assuming your memory can keep up) or double the memory bus speed (if you want exactly the same computation time for your work units), which pushes your power consumption up again. In your example with 10 cores, in a “real life” processor you can bargain on closer to 1 s execution time or more for every core when all 10 are running in parallel (each core will only have a tenth of the available access time to memory). In practical processor designs, you generally end up bringing the processor core clocks down to get everyone to play nicely together. You cannot create a free-energy machine by conveniently ignoring certain loads and energy inputs.

            P.S. – I’m an engineer…not a physicist. However, you cannot do engineering without understanding physics, as the two are not mutually exclusive.

      2. Let’s start with the most obvious disadvantage… the faster you switch a FET, the more time it spends in a power-wasting regime. And the waste is non-linear (and not in our favor), so you can’t just say 10x speed uses 10x more power.

        There is likely an advantage to using some slower cores to do mundane chores while the faster ones crank out your hardcore problem… the minimum time slot necessary to do whatever the most difficult task is sets your boundary. But to say “everything should be done faster” is a major oversimplification.

        1. Haha, damn, I wrote a stupid story with my post and now I’ve been beaten to it. So ignore my post below regarding the kid brother and core speed, etc.

          MacGyverS2K said it much simpler.

        2. Correct… therefore I did say “at least 10X”… :-)
          There is also that other “pesky” little gotcha with multiple cores… they have to share the memory bandwidth, so your computation time increases. To counter that you have to bump up the core clocks or the memory bus speed (assuming you don’t want to increase the computation time), both of which increase your power.
          Our processing paradigm is over 70 years old. If it were this easy to get more work done faster, in less time, it would have been done decades ago. Getting multiple cores to work in harmony, and getting anything productive done with them, is already a very precarious exercise.

          1. “they have to share the memory bandwidth, so your computation time increases.”, that just isn’t true. If you have computations that are bottlenecked by the memory bandwidth then yes it would be true but the majority of calculations aren’t. If you have two cores running the same instructions and they each require less than half of the memory bandwidth then they can both execute simultaneously without running into any memory bandwidth bottlenecks. It’s even less of a problem if you have caches for each core.

            You have had the power argument explained to you multiple times but seem incapable of understanding it properly. Your whole argument seems to revolve around the fact that energy usage scales linearly with number of cores or amount of computation done, which it absolutely doesn’t in any modern processor. There are multiple situations where increasing the core count can increase the efficiency and as others have pointed out a dual core device doesn’t generally use double the power of a single core device.

          2. “If it was this easy to get more work done faster, in less time”

            ….??

            You think it’s *easy* to shove multiple processing cores on a die?

            It wasn’t done decades ago because the silicon manufacturing processes weren’t good enough that you could put multiple ALUs/register banks/etc. and not get killed by the excess leakage and still produce something usable for pennies.

            Seriously, just go look at any multi-core microcontroller’s datasheet where they break down current draw. Running 2 cores does not result in twice the current draw, just like (for practically all situations) running at twice the frequency does not result in twice the current draw.

      3. thoriumbr:
        2X * 0.9 = X*0.9 + X*0.9
        if you do not get it I can explain.

        While I do agree with some statements from the article,
        I do not agree with your assessment of Johan’s.

        The article does not provide enough technical context to argue the details.
        Johan just stated a general truth.

        A note on shutting down parts of the system: bringing the controller back to full operation
        from all-off (including crystals) requires huge overhead to initialise the system, stabilise the clocks, etc.
        It is not great for responding to events; waking up may take longer than some events allow to be serviced. Again, more technical info is required to argue the details.

        A note on keyword detection: this requires the same computational power as normal speech recognition; the difference is in the dictionary size (a way-oversimplified statement, but I hope you catch the drift), so using the “smaller” core at the start is not great for response time.

    4. The idea is that you add a mix of big & small cores, and let the small core do the light work so that the big core doesn’t have to wake up. For instance, small core could handle individual characters from a UART, and then wake up the big core when a complete message has been received.
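      In code, that division of labor might look something like this minimal sketch — `small_core_uart_isr()` and `wake_big_core()` are hypothetical names, not any particular vendor’s API:

      ```c
      #include <stdint.h>

      #define MSG_LEN 16

      extern void wake_big_core(void);  /* placeholder: platform's inter-core wake */

      static volatile uint8_t msg_buf[MSG_LEN];
      static volatile uint8_t msg_idx;

      /* Runs on the small, always-on core: one cheap wakeup per character. */
      void small_core_uart_isr(uint8_t ch) {
          msg_buf[msg_idx++] = ch;
          if (msg_idx == MSG_LEN) {  /* complete message assembled */
              msg_idx = 0;
              wake_big_core();       /* one expensive wakeup per whole message */
          }
      }
      ```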

    5. “the law-of-conservation-of-energy makes this impossible”

      If each core was perfectly efficient then yes, that makes sense. But perfect efficiency is also impossible in this universe.

      If each core spends more energy creating heat just because it is on than it spends actually solving the problem, then if two cores can get the work done and back to sleep in half the time, you will save power. Of course, that also assumes an algorithm that can run perfectly in parallel with no overhead. That’s probably also impossible, but that just means the ratio of power wasted to power used performing calculations has to be that much higher before extra cores become a benefit.

      Now.. are real world cores actually that inefficient? Are people doing things with them that can be parallelized that well? I have no idea. I’m just saying before you get to apply those textbook physics laws you have more real world variables to consider.

    6. The obvious reason is that there is significant overhead power use that’s the same (or similar) regardless of the number of cores used. If the overhead draws the same amount as one core, then the power use per unit time is 2, and a quad core would have a power use of 5; but if the quad core works 4 times faster, then it’s 5 vs 8 (2×4), so yes, the multi-core is less power.

      Though, this may not be true for all processors.
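      That overhead argument generalizes neatly; here’s a sketch of the formula E(n) = (P_overhead + n·P_core)·(T/n), assuming perfect parallel scaling (which, as others note, is an idealization):

      ```c
      #include <stdio.h>

      int main(void) {
          /* Matching the example above: overhead equal to one core's draw,
           * a job that takes 4 time units on a single core. */
          const double p_oh = 1.0, p_core = 1.0, t_single = 4.0;

          for (int n = 1; n <= 4; n *= 2) {
              double e = (p_oh + n * p_core) * (t_single / n);
              printf("%d core(s): energy = %.2f units\n", n, e);  /* 8, 6, 5 */
          }
          return 0;
      }
      ```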

      1. This depends on many things, one of which is the technology used. Static CMOS, dynamic CMOS, passive load NMOS, etc., are going to have different optimizations. Even within static CMOS, choosing fast transistors is going to mean having more leakage than slow transistors.

        Of course, when purchasing off-the-shelf processors, the choice of technology has already been made by the manufacturer, and the system designer is left to make optimizations based on datasheets. The choices of number of cores, burst versus low-speed-low-voltage processing, design effort versus expected sales, all vary with the job to be done.

    7. Can you find me a datasheet for an ARM CPU that’s offered in 1, 2, and 4 cores? Is the power draw at idle and load exactly 2x and 4x for those systems? If it’s more, you missed components in your model. If it’s less, then it sounds like your theory is incorrect from the jump. But if it’s spot on? There’s more to talk about!

    8. No, in your contrived case, of course not. If we take our fingers off the scale and write code that actually takes advantage of having multiple cores and parallelizes the problem, you can get better than linear speedup.

      You act like this is a novel concept, the device you’re writing this on is guaranteed to have a unit in it that exploits this very thing to save power by splitting certain workloads across identical cores.

      In reality, parallelism, the overhead of context switching, response latency, latency to access slow resources and other issues means that for some problems, 2 cores may finish in a lot less than T/2 time.

    1. Actually, while the article they link to does suggest making full use of the peripherals (such as DMA), it does also suggest having more cores… however it’s not suggesting more *of the same* cores but instead suggesting adding lower-power cores when the work isn’t very complex in the same way that most modern processor chips have a mixture of high-performance (high-power consuming) cores and low-performance (low-energy) cores.

  3. It very much depends what your application is.

    Doing work faster usually results in more energy consumed overall. Even if your faster cores don’t have additional power consuming hardware like caches, branch prediction, out of order execution and the like, power consumption doesn’t scale linearly with frequency. That is especially true of very high performance cores, but even with low end 8 bit MCUs it applies.

    If you need to power something up to do your work, like say you are reading data from an SD card and processing it, then there is a decent chance that doing it faster will save power overall.

    On the very low power side the source of your clock also matters. Crystals pull more power than the MCU’s internal RC oscillators, for example.

  4. The main controversial thing in the article isn’t the “more cores is better!” it’s the “microcontroller selection isn’t that important.”

    It really, really is. The problem with his statement is that it sounds like you can just grab a microcontroller based on its processing capabilities and ignore the low-power features, and that’s… really not true. There are *definitely* microcontrollers out there that have utterly horrible low-power features, where it’s very hard to get into or out of a low-power mode.

    What he’s really trying to say is “don’t believe manufacturer hype” and that idea is fine. But you *absolutely* need to understand the processor’s low-power capabilities if you want a low-power design.

    I don’t get a lot of what’s said in that section, where he seems to be touting tens-of-microamp static draw as like, difficult or something? I’ve had designs with microamp-level static draw for *years*. It actually isn’t just about power, it’s also about switching noise – you can stick microcontrollers right by sensitive equipment for configuration changes if you can shut down their switching sections entirely.

  5. I have seen these claims numerous times, and seen them fail numerous times, due to one basic fact: you need a very good low-power architecture to do that, and once you have it, you no longer need the “fast lane to sleep”.

    The main reason is that today leakage current is the main contributor, and the leakage current is mostly related to SRAM size. And when you have multiple cores, you have multiple copies of the same SRAM, so there is a sweet spot between the number of cores and sleep power.
    OK, you can decide not to retain the SRAMs, but then wakeup time increases…

    So in the end the balance is very difficult to achieve, and the simpler very low power core may be better 90% of the time.

  6. My only question is: how can I harvest the wasted watts of “spirited” debates on posts like this to pop popcorn? It seems inefficient to use my own electricity when there’s so much wasted wattage in these debates :D It’s probably worth pointing out this isn’t a one-size-fits-all solution, but I mean, this is Hackaday; I don’t think I’ve seen a single article claim “this odd approach will absolutely revolutionize every usage of X”. And no, editors, that’s not a challenge, please do not do this :D
