Templates Speed Up Arduino I/O

It is easy to forget, but the Arduino does use C++. Typically, the C++ part lives in the libraries and the framework, and most people write their main programs in a C style, treating the library objects like C-language extensions. [Fredllll] recently created a template library to speed up Arduino I/O and he shared it on GitHub.

If you’ve ever done anything serious with the Arduino, you probably know that while digitalWrite is handy, it does a lot of work behind the scenes to make sure the pin is set up, and this adds overhead to every call. [Fredllll’s] template versions can switch a pin’s state in two cycles. You can cut that in half if you don’t mind disturbing the state of other pins on the same port.

You can use a literal pin number to turn on a pin, like this:

switchOn<1>();

If you don’t like to use magic numbers (and that’s smart) you can define a constant:

const uint8_t ledPin=1;
switchOn<ledPin>();
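
To see why passing the pin as a template argument can be cheap, here is a minimal host-side sketch of the idea. It is not the library's actual code: `fakePortD`, `fakePortB`, `portOf`, `bitOf`, and the three `...Sketch` functions are hypothetical names, the registers are ordinary variables so the example runs on a PC, and the pin mapping (pins 0 to 7 on one port, 8 to 13 on another) is assumed to follow an Uno-style board. The point is only that the port and bit mask are already constants when the function body is compiled.

```cpp
#include <cstdint>

// Stand-ins for hardware output registers so the sketch runs on a PC.
// On a real AVR these would be memory-mapped registers.
static volatile uint8_t fakePortD = 0; // assumed to hold pins 0..7
static volatile uint8_t fakePortB = 0; // assumed to hold pins 8..13

// Hypothetical compile-time lookup: which register a pin lives on.
template <uint8_t pin>
inline volatile uint8_t& portOf() {
    static_assert(pin < 14, "pin out of range for this sketch");
    return (pin < 8) ? fakePortD : fakePortB;
}

// Hypothetical compile-time lookup: the bit mask of a pin within its port.
template <uint8_t pin>
constexpr uint8_t bitOf() {
    return static_cast<uint8_t>(1u << (pin % 8));
}

// Set one pin and leave the others alone (read-modify-write).
template <uint8_t pin>
inline void switchOnSketch() {
    volatile uint8_t& port = portOf<pin>();
    port = static_cast<uint8_t>(port | bitOf<pin>());
}

// Clear one pin and leave the others alone.
template <uint8_t pin>
inline void switchOffSketch() {
    volatile uint8_t& port = portOf<pin>();
    port = static_cast<uint8_t>(port & ~bitOf<pin>());
}

// Cheaper variant: a plain write with no read first. It sets this pin
// but clears every other pin on the same port.
template <uint8_t pin>
inline void switchOnOnlySketch() {
    portOf<pin>() = bitOf<pin>();
}
```

Calling `switchOnSketch<1>()` sets bit 1 of the simulated port, while `switchOnOnlySketch<4>()` overwrites the whole port, which is the trade-off described above for the faster calls.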

Because you probably want to do some fancy timing, there’s also a nop template that lets you delay a set number of cycles. Here’s some test code from Reddit that generates a 1.3 MHz square wave, for example:

const uint8_t myPin = 5;
void loop(){
    cli(); // disable interrupts as they would screw up the timing
    do {
        switchOnExclusive<myPin>(); // 1 cycle
        nop<5>(); // 5 cycles
        switchOffPortOfPin<myPin>(); // 1 cycle
        nop<3>(); // 3 cycles
    } while(1); // jump back to do is 2 cycles
}

Obviously, this isn’t the maximum frequency, either, since eight of the cycles in the loop are deliberate delays.
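
As a worked check of that figure, the cycle counts from the loop's comments can be added up at compile time. This is only arithmetic on the numbers quoted above. The 16 MHz clock is an assumption (the article does not state it, but it is the value that makes 12 cycles come out near 1.3 MHz), and `squareWaveHz` and the `k...` constants are names made up for this example.

```cpp
#include <cstdint>

// Cycle counts taken from the comments in the loop above.
constexpr uint32_t kHighCycles = 5 + 1;      // nop<5>, then the switch-off
constexpr uint32_t kLowCycles  = 3 + 2 + 1;  // nop<3>, jump back, switch-on
constexpr uint32_t kPeriodCycles = kHighCycles + kLowCycles;

// ASSUMPTION: a 16 MHz CPU clock, as on many AVR-based Arduino boards.
constexpr uint32_t kClockHz = 16000000UL;

// Output frequency of a loop that takes cyclesPerPeriod clock cycles.
constexpr uint32_t squareWaveHz(uint32_t clockHz, uint32_t cyclesPerPeriod) {
    return clockHz / cyclesPerPeriod;
}

static_assert(kPeriodCycles == 12, "6 cycles high + 6 cycles low");
static_assert(squareWaveHz(kClockHz, kPeriodCycles) == 1333333UL,
              "about 1.33 MHz with the delays in place");
static_assert(squareWaveHz(kClockHz, 4) == 4000000UL,
              "1 + 1 + 2 cycles with both nop calls removed");
```

Taking the per-line counts at face value, dropping both nop calls leaves a 4-cycle loop, although the output would then be high for one cycle and low for three rather than a symmetric square wave.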

You don’t need to know much about templates to use this library, but if you want to know more, we’ve covered them in the past. We’ve noted before that digitalWrite is about fifty times slower than a direct port access, and the other I/O operations aren’t much better. It would be interesting to explore if templates could make other operations more efficient.

25 thoughts on “Templates Speed Up Arduino I/O”

  1. Oh gods why, just inline the functions, it’s going to be exactly the same code output without the terrible template syntax abuse.

    Also, those nops are recursive and always expanded at compile time. I’m pretty sure you’ll run out of stack in the compiler trying to implement a nop()

      1. The generated code is like unrolling a loop.

        Is the compiler doing it through recursion? Who knows but r4m0m seems to have tested it and found that 1000 was the limit so probably.
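
For readers wondering what a recursively expanded nop might look like, here is a minimal sketch of the general technique. It is not taken from the library: `nopSketch`, `oneNop`, and `nopCounter` are hypothetical names, and a counter stands in for the real instruction so the example can run on a PC. Each `nopSketch<N>` instantiates `nopSketch<N - 1>`, so a large N means a deep chain of template instantiations, which is the kind of depth a compiler limit would count.

```cpp
#include <cstdint>

// Counts how many "nops" ran, so the sketch can be checked on a PC.
// On an AVR, the body of oneNop() would instead be a single inline-asm
// nop instruction.
static volatile uint32_t nopCounter = 0;

inline void oneNop() {
    nopCounter = nopCounter + 1;
}

// General case: emit one nop, then the remaining N - 1.
template <unsigned N>
inline void nopSketch() {
    oneNop();
    nopSketch<N - 1>();
}

// Base case: zero nops, which ends the recursion.
template <>
inline void nopSketch<0>() {}
```

With inlining, `nopSketch<5>()` flattens into five consecutive `oneNop()` bodies and no loop, which is why the result resembles an unrolled loop.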

    1. It might not result in the same code. You’d need to try it to find out. An inline function is a hint to the compiler. In one of my articles I found that after a certain number of calls to a function the compiler optimized for size by ignoring the inline. (Which is another issue – the optimization level changes things.)

      But for a template the compiler may continue to inline. Don’t know without testing.

      Personally, I’d like to see a template class for a pin. That gets rid of the syntax issues. May try it if I ever get back onto little processors again.

      Pin pin1;
      pin1.switchOn();
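
A pin class along those lines could be sketched as follows. This only illustrates the commenter's idea and is not part of the library: the template parameter on `Pin`, the `simulatedPort` variable, and the `isOn()` helper are all assumptions made for the example, and the register is an ordinary variable so the code runs on a PC.

```cpp
#include <cstdint>

// Stand-in for an 8-bit output register so the sketch runs on a PC.
static volatile uint8_t simulatedPort = 0;

// Hypothetical pin wrapper: the bit position is fixed at compile time,
// so each method is a constant-mask operation on the register.
template <uint8_t bit>
class Pin {
    static_assert(bit < 8, "this sketch models a single 8-bit port");
    static constexpr uint8_t mask = static_cast<uint8_t>(1u << bit);

public:
    // Set this pin, leaving the other bits unchanged.
    void switchOn() const {
        simulatedPort = static_cast<uint8_t>(simulatedPort | mask);
    }

    // Clear this pin, leaving the other bits unchanged.
    void switchOff() const {
        simulatedPort = static_cast<uint8_t>(simulatedPort & ~mask);
    }

    // Report whether this pin's bit is currently set.
    bool isOn() const {
        return (simulatedPort & mask) != 0;
    }
};
```

With this shape, the pin number appears once, at the declaration (`Pin<1> pin1;`), and later calls such as `pin1.switchOn();` need no angle brackets.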

    2. The point is to expand the nops at compile time, because it is harder to know how many cycles you spend looping. If you hit a recursion limit you can always use -ftemplate-depth

  2. I took a different tack in my ATTinyCore fork, where I kept pinMode, digitalRead, etc., but if the arguments are static (and a couple of other compile-time prerequisites are met), calls to those functions can in many cases optimise and inline down to a single instruction; when they can’t be optimised at compile time, the call jumps to the full function.

    To do this, in short, I renamed, for example, the existing pinMode to _pinMode and defined a wrapper pinMode function in the always-gets-included Arduino.h, with a couple of GCC attributes to ensure this definition is always inlined and to keep some compiler warnings quiet. The wrapper function uses the magic function __builtin_constant_p to check when the arguments are static, and if so the code will optimize down to one or two instructions (a single sbi/cbi in many cases), while if the arguments are not constant it will call the old function to do the full job.
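
The wrapper pattern described above can be sketched roughly like this for GCC or Clang. It is not the code from the fork: `demoPort`, `slowPinWrite`, and `fastPinWrite` are hypothetical names, and the register is an ordinary variable so the example runs on a PC. Only `__builtin_constant_p` and the `always_inline` attribute are real compiler features. Both paths produce the same register state, and which one the compiler picks depends on the optimization level.

```cpp
#include <cstdint>

// Stand-in for an 8-bit output register so the sketch runs on a PC.
static volatile uint8_t demoPort = 0;

// Plays the role of the renamed original: the general path that works
// for a pin number known only at run time.
void slowPinWrite(uint8_t pin, bool high) {
    if (pin >= 8) return; // this sketch models one 8-bit port only
    uint8_t mask = static_cast<uint8_t>(1u << pin);
    demoPort = high ? static_cast<uint8_t>(demoPort | mask)
                    : static_cast<uint8_t>(demoPort & ~mask);
}

// Wrapper: always inlined, and it takes the short path only when the
// compiler can prove that both arguments are constants.
static inline __attribute__((always_inline))
void fastPinWrite(uint8_t pin, bool high) {
    if (__builtin_constant_p(pin) && __builtin_constant_p(high) && pin < 8) {
        // With constant arguments an optimizing compiler can fold this
        // into a single constant-mask update of the register.
        if (high) demoPort = static_cast<uint8_t>(demoPort | (1u << pin));
        else      demoPort = static_cast<uint8_t>(demoPort & ~(1u << pin));
    } else {
        slowPinWrite(pin, high); // fall back to the full function
    }
}
```

A call such as `fastPinWrite(2, true)` is a candidate for the short path, while `fastPinWrite(someVariable, true)` goes through `slowPinWrite`; without optimization the compiler may simply report that nothing is constant and always take the fallback.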

    Now of course this does mean that calling digitalWrite(1) and calling digitalWrite(ANonConstantVariable) produce dramatically different compiled code (one will be a simple sbi, and the other will be a load of stuff), so if timing is important to you, you need to be aware of whether you are passing a constant or not. For the most part, though, timing isn’t important but code size very much is, and since most code dealing with I/O DOES use constant pin numbers known at compile time, it can save a bunch of space. On, for example, a Tiny25, Tiny13, or even a Tiny5, that’s very useful.

    Most of the details are here, I think I made a few more adjustments later…
    https://github.com/sleemanj/ATTinyCore/commit/23f1aab07d3074811b2daa22e4bcf393ebb77f01

      1. Then we need passwords and maybe a real email address to post a comment here. Not to mention the hacking risk on a site called “hackaday”. No, just review your comment before you post.

          1. or spit out a one-time password (i.e. Random 6 digit number) that expires in two minutes,
            you’ll have to click edit and type in the password in order to re-open the reply textbox on said comment.

            After expiry, the edit button/link disappears.

            Two minutes should also be both too short for brute-force attacks and short enough to keep the comments area clean-looking.

    1. I’ve been doing this __builtin_constant_p trick for AVR-based Teensy since 2009. To give credit where credit is due, it was first proposed (as far as I know) in early 2009 by Ben Combee on the Arduino Developers Mail List. Since late 2009, digitalWrite() on Teensy has compiled to a single instruction if the inputs are constant expressions. At the time David Mellis expressed interest in including this in Arduino, but then in early 2010 David changed his mind, feeling such code makes the core library less easily ported to other microcontrollers. Honestly, that is a valid concern. Arduino has kept the slow but very easily portable code ever since.

      For ARM-based boards, I’ve been implementing it as digitalWriteFast(). By default the I/O pins use slew rate limiting, which dramatically reduces high frequency noise and ringing, especially if driving lengthy wires without impedance matching resistors. The trouble with speeding things up so much is you can easily execute digitalWriteFast(pin, HIGH) and digitalWriteFast(pin, LOW) before the voltage on the pin manages to change through the logic threshold, when running at 180 MHz. Or if you’ve turned off the slew rate feature, you can end up with a pulse only several nanoseconds wide, which is too fast for common 74HC logic chips and lots of other widely used circuitry.

      Anyway, the point is the __builtin_constant_p idea isn’t new. Ben Combee proposed it a little over eight years ago, in the early days of Arduino. I implemented it later that year. I also implemented it for Duemilanove, and when asked I wrote the code for the Mega too (those were their 2 boards on the market at the time). Arduino almost accepted the contribution, but didn’t….

      1. I know of some minimum timing specs for some HC chips and some pins (at least at 5 V) which are down to 3.3 ns. The shortest pulse with a 180 MHz clock is 11 ns. But today there are also faster logic families available, so I do not see a reason to artificially restrict the possible speed. It’s good to know that the …Fast functions exist. Of course, you need to know what you are doing and add some delay yourself if it is necessary.

    1. Ehhh no!!

      digitalWriteFast seems like a mature lib: it has more compatibility with a lot of microcontrollers, and you can more easily port/exchange, like “digitalWrite” to “digitalWriteFast”, BUT it does not set ports and pins at compile time.

    1. Portability is a valid use case, like those 3D printer firmwares that must support a plethora of different controller boards and configurations. But for many projects I prefer direct access to ports and peripherals. I often port my own code between different series of PIC microcontrollers, but it doesn’t take that much time and keeps performance consistent and fast.

      Porting Arduino code to plain C in XC8 is a bit of a pain, though, but still doable.

      As for delay loops I prefer using timers and interrupts. It’s more consistent and doesn’t depend completely on compiler optimizations and is easier to port to other platforms. But if you really need a higher resolution than a timer can provide delay loops are hard to beat.
