The development of the Hackaday community offline password keeper has been going on for a little less than a year now. Since July our beta testers have been hard at work, giving us constant suggestions about features they’d like to see implemented and improvements the development team could make. This has led to more than 1100 GitHub commits and ten thousand lines of code. As you can guess, our little 8-bit microcontroller’s flash memory was starting to fill up pretty quickly.
One of our contributors, [Miguel], recently discovered one compiler flag and one linker flag that saved us around 3KB of flash storage on our 26KB firmware, with little added processing overhead. Hold on to your hats, this write-up is going to get technical…
Many coders from all around the globe work at the same time on the Mooltipass firmware. Depending on the functionality they want to implement, a dedicated folder is assigned for them to work in. Logically, the code they produce is split into many C functions depending on the required task. This adds up to many function calls, which the GCC compiler usually implements with the CALL assembly instruction.
This particular instruction encodes a 22-bit value containing the absolute address of the function to call, so a total of 4 flash bytes are used per function call (without argument passing). However, the AVR instruction set also offers a way to call functions using relative addressing. This instruction is RCALL, which encodes a signed 12-bit offset between the current program counter and the function to call. This reduces a function call to 2 bytes and takes one less clock cycle. The -mrelax flag therefore saved us 1KB by having the linker replace CALL with RCALL instructions whenever the target is within range.
Finally, the -mcall-prologues compiler flag freed 2KB of flash storage. It creates shared master prologue/epilogue routines that are called at the start and end of program routines. To put things simply, it prepares the AVR stack and registers in the same manner before any function is executed. This wastes a little execution time while saving a lot of code space.
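For reference, wiring both optimisations into a typical avr-gcc build looks something like the fragment below (a sketch only — the target name, object list and MCU are placeholders, not our actual build files):

```make
# Hypothetical Makefile fragment -- adjust MCU and target to your project
MCU     = atmega32u4
CFLAGS  = -mmcu=$(MCU) -Os -mrelax -mcall-prologues
LDFLAGS = -mmcu=$(MCU) -Wl,--relax

firmware.elf: $(OBJS)
	avr-gcc $(LDFLAGS) -o $@ $^
```

Note that -mrelax on the compile line tells the driver to emit relaxable code, while -Wl,--relax asks the linker to actually perform the CALL-to-RCALL substitution.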
More space-saving techniques can be found by clicking this link. Want to stay informed about the Mooltipass launch date? Subscribe to our official Google Group!
Don’t forget to also inline functions that are only ever used once.
yep! we did that quite a few times.
We also compared flash space savings when declaring functions as inline, depending on the number of times they were called.
Is unrolling loops a 6502 demo scene thing only?
No, but loop unrolling is a speed optimization that costs code space. The Mooltipass is short on code space.
I’m not an expert in these things, but I wonder if your development is taking an unusually long time. With indies I hear it’s very fast, and you need the funds quickly to sustain the overall effort. Perhaps your beta would already have been released as a product if developed by someone working on it alone. Or maybe it’s a part-time job. Anyway, curious. Neat that you’re talking about developing it.
Any particular reason to optimize for your MCU rather than starting with a bigger one?
Well, there’s an enormous amount of development on the software side of things as well (Chrome plugin, Python scripts for bundle generation), and now that we have beta testers we’re changing a few things too.
All of us are developing during our spare time.
The main reason we stick with this MCU is Arduino compatibility. Anyway, until now it has proven good enough for our use.
I saved lots of space by changing:
void setLED(bool on) {
    if (on) {
        P1OUT |= 1;
    } else {
        P1OUT &= ~1;
    }
}
Into two calls:
void setLEDOn() {
    P1OUT |= 1;
}

void setLEDOff() {
    P1OUT &= ~1;
}
The first one typically isn’t inlined, and it has to pass an argument. The second pair was always inlined. This depends on the usage of setLED, of course; in my code it was always setting the LED to a static state, either true or false. Sometimes more is less!
Using link-time optimization (-flto) should reduce the size even more.
http://nerdralph.blogspot.ca/2014/04/gcc-link-time-optimization-can-fix-bad.html
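For anyone wanting to try it: LTO has to be enabled at both compile and link time, since the optimisation happens when the linker sees all translation units at once. A sketch of the relevant fragment (placeholders, not an actual project Makefile):

```make
# Hypothetical fragment: -flto must appear on BOTH compile and link lines,
# and the optimisation level should be repeated at link time.
CFLAGS  += -flto
LDFLAGS += -flto -Os
```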
The space saved by this flag actually seems to be a good metric of how good a coder is: for our solution, it increases flash use by 24 bytes!
That’s a bit surprising. With 4.8.3 and 4.9.1 I have yet to see an example where LTO, when used with all libraries linked into the ELF, increased the size.
Which gcc version are you using?
For me it’d be 4.8.1. I’ll ask the other contributors to try the flag and let me know their flash savings.
Just looked at the GitHub repo, and it looks like you’re using the Arduino IDE.
Arduino builds libs in a funky way that can lead to problems with lto.
http://nerdralph.blogspot.ca/2014/07/gcc-lto-call-graph-generation.html
Using Ino (inotool.org) would give you more control over the build process. And it would be easier to use gcc 4.9.1 vs trying to integrate it with the Arduino IDE. I’m pretty sure the 1.5 beta Arduino nightly builds are still on 4.8.1 (possibly with a couple patches).
We’re not using the Arduino IDE for firmware development.
Some contributors use AVR Studio, others a makefile.
I’m surprised that’s not a default optimisation; PC-relative code is extremely common. Though it is GCC, so not much surprises me after all.
Not all code will work with -mrelax (see the GCC manual for details).
Yeah, I was more referring to the PC-relative nature of the code generation rather than relaxation, which isn’t what the article was referring to. I’d assume the non-working code is just too far away for a relative jump, and that should be a warning anyway. I don’t see any specifics in the GCC manual otherwise; there might be some bugs related to it, though, or pragmas/attributes I haven’t seen.
If no one minds a bit of code golf (ha ha): if I were tight on code/data space, here are some of the things I’d consider after premature optimisation was dealt with. Each is a case-by-case change with careful testing to see whether it’s a plus or a minus, not something to change globally with no checking (as in internet coding). I only glanced over bits of the code while drinking coffee and ordering 80/20, but these are also general tips.
A bunch of true/false flags in 8 bits: use bitfields instead where possible.
Lots of temp buffers: where possible, and where re-entrancy/race conditions aren’t going to happen, consider a global buffer with indirect accesses. The benefit is two-fold, in that it reduces stack usage and can reduce code size if implemented correctly, for instance if you’re able to use naked functions with no stack setup, and maybe some relative-addressing gains.
Move single globals into structs; that’ll give you relative addressing.
volatile can be costly; a lot of optimisers will just stop prematurely when they see one, so consider that.
static isn’t const. Make sure that if the data is read-only, it’s const (unless you’ve got more RAM than code space). The same goes for pointers or arrays of pointers; people often forget to make both sides of a pointer-to-pointer const.
Hey Charliex,
We actually talked about using a global buffer but for the moment we prefer using temp buffers as we have a dedicated function monitoring stack usage and prefer code clarity :).
Would you have some literature on the benefits of const vars in function calls? That’s quite an interesting topic.
“const vars” is a misnomer; I’m talking about read-only data.
So if you have a data struct that is filled in and read-only, marking it static only changes its scope; it doesn’t move it out of RAM. Again, it comes down to what’s more expensive for you, RAM or ROM.
So: static const vs static for predefined read-only data arrays, and for pointer arrays, const * const ptr vs const *ptr.
I do disagree on the code clarity of using a shared buffer vs local temps: you can keep code readability since, instead of a fixed local auto array, it’s just an aliased pointer, and the accesses all look the same. It’s also just as simple to add overwrite guards to a global buffer as to an auto stack one, and if you kill the stack, you kill execution with limited chance of recovery.
You can still use monitoring, but this also makes you consider how much auto space you need. Embedded devices obviously run in limited space, so all your data has to be fixed-size anyway; since that’s the case, you can pre-allocate in the global buffer and can’t go over the max.
It does make it harder in terms of staying re-entrant or multi-threaded, but if it’s all single-threaded simple code it’s not an issue; adding code guards takes care of that.
Most if not all of this (including relax) is covered in the avr-gcc notes.
http://www.atmel.com/images/doc8453.pdf
http://www.tty1.net/blog/2008/avr-gcc-optimisations_en.html
Hey charliex,
Actually most read only data is either stored in PROGMEM or in the external flash (graphics and such, see bitmaps dir).
You make a good point on the shared buffer… I’ll relaunch a discussion with the team!
Thanks again
anim.c was a place I think I was looking at: static non-const data that appeared to be used read-only and not in PROGMEM. What’s the structure alignment set to?
The anim.c code is a work-in-progress and does not represent the rest of the code. Those static structures will become media files in the SPI flash.
The AVR aligns to 8-bits.
Also, on the AVR making a structure const does not move it into code-space. You have to explicitly put it into the progmem section to achieve that and then use special functions to access it.
See http://www.nongnu.org/avr-libc/user-manual/pgmspace.html for more details.
Can you store each function just once in the code, then use much shorter tokens that point to the location of each function? For any function used two or more times, that would save space.
Texas Instruments used that method in their Extended BASIC.
If the project is hitting the limits of the MCU’s available resources while developing the stock firmware, you are going to make updates and fixes difficult.
Also, since this is a device meant to be open and hackable, you will all but block those who want to add their own features to the stock firmware, since there is no flash space left for them to use.
Definitely need a bigger MCU.
Well, the main point of this article was to say that we now have plenty of space for users wanting to implement their own features…