Developed On Hackaday: Sometimes, All You Need Is A Few Flags

October 1, 2014

The development of the Hackaday community offline password keeper has been going on for a little less than a year now. Since July our beta testers have been hard at work giving us constant suggestions about features they’d like to see implemented and improvements the development team could make. This has added up to more than 1100 GitHub commits and ten thousand lines of code. As you can guess, our little 8-bit microcontroller’s flash memory was filling up pretty quickly.

One of our contributors, [Miguel], recently discovered one compiler flag and one linker flag that saved us around 3KB of Flash storage on our 26KB firmware with little added processing overhead. Hold on to your hats, this write-up is going to get technical…

Many coders from all around the globe work at the same time on the Mooltipass firmware. Depending on the functionality they want to implement, a dedicated folder is assigned for them to work in. Logically, the code they produce is split into many C functions depending on the required task. This adds up to many function calls that the GCC compiler usually makes using the CALL assembler instruction.

On our 8-bit AVR, this instruction carries a 22-bit value containing the absolute address of the function to call. Hence, a total of 4 flash bytes are used per function call (without argument passing). However, the AVR instruction set also offers another way to call functions, using relative addressing. This instruction is RCALL and carries a 12-bit signed value containing the offset, in 16-bit words, between the current program counter and the function to call. This reduces a function call to 2 bytes and takes one less clock cycle, but it only reaches targets within about 4KB of the call site in either direction. The -mrelax flag therefore saved us 1KB by having the linker replace CALL with RCALL instructions whenever possible.
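
The reach of RCALL can be checked with a little arithmetic. The sketch below assumes the 12-bit signed word offset documented in the AVR instruction set manual and ignores the address wrap-around that small-flash devices allow. The helper names are made up for illustration and are not part of the Mooltipass firmware or the toolchain.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sizes from the AVR instruction set: CALL is two 16-bit words, RCALL is one. */
#define CALL_SIZE_BYTES  4
#define RCALL_SIZE_BYTES 2

/* RCALL encodes a signed 12-bit offset k, counted in 16-bit words,
 * and continues execution at PC + k + 1. */
#define RCALL_MIN_OFFSET (-2048)
#define RCALL_MAX_OFFSET 2047

/* Hypothetical helper: true if a call located at word address `from`
 * can reach the function at word address `to` using RCALL. */
bool rcall_reachable(uint32_t from, uint32_t to)
{
    int32_t k = (int32_t)to - (int32_t)from - 1;
    return k >= RCALL_MIN_OFFSET && k <= RCALL_MAX_OFFSET;
}

/* Flash bytes saved when `relaxed_calls` call sites shrink from CALL to RCALL. */
uint32_t bytes_saved(uint32_t relaxed_calls)
{
    return relaxed_calls * (CALL_SIZE_BYTES - RCALL_SIZE_BYTES);
}
```

By this count, a 1KB saving would correspond to at most about 512 shortened call sites. The real number is probably lower, since linker relaxation can also shorten JMP to RJMP.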

Finally, the -mcall-prologues compiler flag freed 2KB of Flash storage. It creates shared prologue/epilogue routines that are called at the start and end of program routines instead of being expanded inline in each one. To put things simply, it prepares the AVR stack and registers in the same manner before any function is executed. This costs a little execution time while saving a lot of code space.

More space-saving techniques can be found by clicking this link. Want to stay informed about the Mooltipass launch date? Subscribe to our official Google Group!

26 thoughts on “Developed On Hackaday: Sometimes, All You Need Is A Few Flags”

gannon says:

October 1, 2014 at 7:34 am

Don’t forget to inline functions that are only ever used once also

1. Mathieu Stephan says:
  
  October 1, 2014 at 7:38 am
  
  yep! we did that quite a few times.
  We also compared flash space savings by declaring functions as inline depending on the number of times they were called..
  
ganzuul says:

October 1, 2014 at 8:01 am

Is unrolling loops a 6502 demo scene thing only?

1. Generic Human says:
  
  October 1, 2014 at 9:03 am
  
  No, but loop unrolling is a speed optimization that costs code space. The Mooltipass is short on code space.
  
Depot says:

October 1, 2014 at 8:02 am

I’m not an expert in these things, but I wonder if your development is going on for an unusually long time. With indies I hear it’s very fast and you need the funds quickly to sustain overall effort. Perhaps your beta would’ve been released as a product if developed by someone working on it alone. Or maybe it’s a part-time job. Anyway, curious. Neat that you’re talking about developing it.

Any particular reason to optimize your mcu rather than starting with a bigger one?

1. Mathieu Stephan says:
  
  October 1, 2014 at 8:07 am
  
  Well there’s an enormous amount of development on the software side of things as well (chrome plugin, python scripts for bundle generation), and now that we have beta testers we’re changing a few things as well.
  All of us are developing during our spare time.
  Main reason we stick with this MCU is arduino compatibility. Anyway until now it has proven good enough for our use..
  
Tyler says:

October 1, 2014 at 8:36 am

I saved lots of space by changing:
void setLED(bool on) {
    if (on) {
        P1OUT |= 1;
    } else {
        P1OUT &= ~1;
    }
}

Into two calls:

void setLEDOn() {
    P1OUT |= 1;
}

void setLEDOff() {
    P1OUT &= ~1;
}

The first one typically isn’t inlined and it has to pass an arg. The second was always inlined. This depends on the usage of setLED of course, in my code it was always setting the LED to a static state; either true or false. Sometimes more is less!

Ralph Doncaster (Nerd Ralph) says:

October 1, 2014 at 8:42 am

Using link-time optimization (-flto) should reduce the size even more.
http://nerdralph.blogspot.ca/2014/04/gcc-link-time-optimization-can-fix-bad.html

1. Mathieu Stephan says:
  
  October 1, 2014 at 9:51 am
  
  The code saving using this flag seems to be a good metric for knowing if a coder is good: for our solution, it actually increases flash use by 24 bytes!
  
  1. Ralph Doncaster (Nerd Ralph) says:
    
    October 1, 2014 at 11:00 am
    
    That’s a bit surprising. With 4.8.3 and 4.9.1 I have yet to see an example where lto, when used with all libraries linked to the elf, increased the size.
    Which gcc version are you using?
    
    1. Mathieu Stephan says:
      
      October 1, 2014 at 11:11 am
      
      for me it’d be 4.8.1, I’ll ask the other contributors to try the flag and let me know their flash savings
      
  2. Ralph Doncaster (Nerd Ralph) says:
    
    October 1, 2014 at 11:08 am
    
    Just looked at the github and it looks like you’re using the Arduino IDE.
    Arduino builds libs in a funky way that can lead to problems with lto.
    http://nerdralph.blogspot.ca/2014/07/gcc-lto-call-graph-generation.html
    
    Using Ino (inotool.org) would give you more control over the build process. And it would be easier to use gcc 4.9.1 vs trying to integrate it with the Arduino IDE. I’m pretty sure the 1.5 beta Arduino nightly builds are still on 4.8.1 (possibly with a couple patches).
    
    1. Mathieu Stephan says:
      
      October 1, 2014 at 11:12 am
      
      we’re not using Arduino IDE for fw dev.
      Some contributors use avrstudio, others a makefile
      
charliex says:

October 1, 2014 at 10:14 am

i’m surprised that is not a default optimisation, pc relative code is extremely common. though it is gcc, so not much surprises me after all.

1. daid303 says:
  
  October 1, 2014 at 10:50 am
  
  Not all code will work with -mrelax, (see the gcc manual for details)
  
  1. charliex says:
    
    October 1, 2014 at 11:53 am
    
    yeah i was more referring to the pc relative nature of the code gen vs relax , which isn’t what the article was referring too,. i’d assume the non working code is just too far for a relative jump and should be a warning anyway. i don’t see any specifics in the gcc manual otherwise, there might be some bugs related to it though. add more pragma/attributes etc that i haven’t seen.
    
charliex says:

October 1, 2014 at 10:36 am

if no one minds a bit of code golf (ha ha).. if i were tight on code/data space here are some of the things i’d consider after premature optimisation was dealt with, each is on a case by case with careful testing to see if plus or minus, not just globally changed with no checking (as in internet coding) and i did just glance over bits of code while reading coffee and ordering 80/20, but these are also general tips.

a bunch of true/false flags in 8 bits, use bitfields instead where possible.
lots of temp buffers, where possible and reentrant/race conditions aren’t going to happen consider a global buffer with indirect accesses, this is two fold in that it reduces stack usage and can reduce code size if implemented correctly, if you’re able to use naked functions with no stack setup for instance, maybe some relative addressing gains
move single globals into structs, that’ll give you relative addressing.
volatile can be costly, a lot of optimisers will just stop prematurely when they see one, so consider that
static isn’t const, make sure if the data is ready only, that its const (unless you’ve got more ram than codespace) same goes for pointers or arrays of points,often people forget to make both sides of a pointers to pointers const
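
As a rough illustration of the bitfield tip, here is a minimal sketch. The flag names are hypothetical and do not come from the Mooltipass source. Whether this is a net win on a given target has to be measured, since packing saves RAM but reading or writing a single bit can take more instructions than accessing a whole byte.

```c
#include <stdbool.h>
#include <stdint.h>

/* One byte per flag: four booleans take four bytes of RAM. */
struct loose_flags {
    bool card_inserted;
    bool screen_on;
    bool usb_ready;
    bool locked;
};

/* The same flags declared as 1-bit fields; GCC typically packs
 * all four into a single byte. */
struct packed_flags {
    bool card_inserted : 1;
    bool screen_on     : 1;
    bool usb_ready     : 1;
    bool locked        : 1;
};
```

Access looks the same in both cases (for example `flags.locked = true;`), so switching between the two layouts does not change the calling code.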

1. Mathieu Stephan says:
  
  October 1, 2014 at 12:32 pm
  
  Hey Charliex,
  
  We actually talked about using a global buffer but for the moment we prefer using temp buffers as we have a dedicated function monitoring stack usage and prefer code clarity :).
  Would you have some literature on the benefits of const vars in function calls? That’s quite an interesting topic.
  
  1. charliex says:
    
    October 1, 2014 at 1:11 pm
    
    const vars is a misnomer, i’m taking about read only data (typo as ready only in my previous post)
    so if you have a data struct that is filled in and read only, just marking it static only changes its scope, it doesn’t move it out of RAM , again coming down to whats more expensive for you ram or rom.
    
    so static const vs static for predefined RO data arrays, for pointer arrays const * const ptr; vs const *ptr;
    
    I do disagree on the code clarity of using a shared buffer vs local temps, you can keep code readability since instead of fixed local auto array, its just an aliased pointer the accesses all look the same. It’s also just as simple to add overwrite guards to a global buffer vs an auto stack one, since if you kill the stack, you kill the execution and limited recovery.
    
    You can still use monitoring but its also making you consider how much auto space you need, embedded devices obviously run in a limited space so you do want to have all your data has to be fixed sizes, since that’s the case you can pre allocate in the global buffer, you can’t go over the max anyway.
    
    It does make it harder in terms of making sure you can be re-entrant or multi threading etc, but if its all single threaded simple code, its not an issue, adding code guards takes care of that.
    
    most if not all of this (including relax) is covered in the avr-gcc notes .
    
    http://www.atmel.com/images/doc8453.pdf
    
    http://www.tty1.net/blog/2008/avr-gcc-optimisations_en.html
    
    1. Mathieu Stephan says:
      
      October 1, 2014 at 2:25 pm
      
      Hey charliex,
      
      Actually most read only data is either stored in PROGMEM or in the external flash (graphics and such, see bitmaps dir).
      You make a good point on the shared buffer… I’ll relaunch a discussion with the team!
      
      Thanks again
      
      1. charliex says:
        
        October 1, 2014 at 4:39 pm
        
        anim.c was a place i think i was looking at the static non const that appeared to be used RO and not in PROGMEM. whats the structure alignment set to?
        
      2. harlequin says:
        
        October 2, 2014 at 1:58 am
        
        The anim.c code is a work-in-progress and does not represent the rest of the code. Those static structures will become media files in the SPI flash.
        
        The AVR aligns to 8-bits.
        
      3. harlequin says:
        
        October 2, 2014 at 2:03 am
        
        Also, on the AVR making a structure const does not move it into code-space. You have to explicitly put it into the progmem section to achieve that and then use special functions to access it.
        
        See http://www.nongnu.org/avr-libc/user-manual/pgmspace.html for more details.
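
A minimal sketch of what that looks like with avr-libc is shown below. The table name and its contents are hypothetical. The `#else` branch is only a host-side stand-in, added here so the example also compiles off-target, and is not part of avr-libc.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/pgmspace.h>
#else
/* Host-side stand-ins (an assumption of this example, not avr-libc). */
#define PROGMEM
#define pgm_read_byte(addr) (*(const uint8_t *)(addr))
#endif

/* Hypothetical lookup table kept in flash instead of being copied to RAM. */
static const uint8_t brightness_table[4] PROGMEM = { 0, 32, 128, 255 };

uint8_t brightness_at(uint8_t index)
{
    /* Flash is a separate address space on the AVR, so a plain
     * brightness_table[index] would read RAM at that address instead. */
    return pgm_read_byte(&brightness_table[index]);
}
```

Without PROGMEM, a `static const` table like this one would still be copied into RAM at startup on the AVR, which is the point made above.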
        
Galane says:

October 1, 2014 at 1:24 pm

Can you store each function just once in the code then use much shorter tokens that point to the location of each function? For any function used 2 or more times that would save space.

Texas Instruments used that method in their Extended BASIC.

RobHeffo says:

October 3, 2014 at 7:42 pm

If the project is hitting the limits of available resources in the MCU developing the stock firmware you are going to make updates and fixes difficult.

Also, being a device that is meant to be open and hackable, you will all but block those who want to add their own features to the stock firmware since there is no flash space left for them to use.

Definitely need a bigger MCU.

1. Mathieu Stephan says:
  
  October 4, 2014 at 4:14 am
  
  well the main point of this article was to say that we have now plenty of space for users wanting to implement their own features…
  