Pi Zero Power Optimization Leaves No Stone Unturned

October 3, 2024 by Arya Voronova 17 Comments

If you’ve ever designed a battery-powered device with a Pi Zero, you have no doubt looked into decreasing its power consumption. Generic advice, like disabling the HDMI interface and the onboard LED, is omnipresent, but [Manawyrm] from [Kittenlabs] goes beyond the surface-level, and gifts us an extensive write-up where every recommendation is backed with measurements. Armed with the Nordic Power Profiler kit and an SD card mux for quick experimentation, she aimed at two factors, boot time and power consumed while booting, and made sure to get all the debug information we could use.

Thanks to fast experimentation cycles and immediate feedback, we learn plenty of new things about what a Pi Zero does and when, and how we can tame various power-hungry aspects of its behavior. Disabling the GPU or its aspects like HDMI output, tweaking features like HAT and other peripheral probing, and even tactical overclocking during boot – it’s an extensive look at what makes a Pi Zero tick, and no chance for spreading baseless advice or myths.

All in all, this write-up helps you decrease the boot time from twelve seconds to just three seconds, and slash the power budget of the boot process by 80%. Some recommendations are as simple as config.txt entries, while others require you to recompile the kernel. No matter the amount of effort you can put into power optimization, you’ll certainly find things worth learning while following along, and [Manawyrm]’s effort in building her solar-powered Pi setup will help us all build better Pi-Zero-powered solar devices and handhelds.

Make Your Code Slower With Multithreading

June 7, 2024 by Dave Rowntree 51 Comments

With the performance of modern CPU cores plateauing recently, the main performance gains are with multiple cores and multithreaded applications. Typically, a fast GPU is only so mind-bogglingly quick because thousands of cores operate in parallel on the same set of tasks. So, it would seem prudent for our applications to try to code in a multithreaded fashion to take advantage of this parallelism. Or so it would seem, but as [Marc Brooker] illustrates, it’s not as simple as one would assume, and it’s very easy to end up with far worse overall performance and no easy way to fix it.

[Marc] was rerunning an old experiment to calculate the expected number of birthdays in a shared group of people using brute force. The experiment was essentially a tight loop running a pseudorandom number generator, the standard libc rand() function. [Marc] profiled the code for single-thread and multithreaded versions and noted the runtime dramatically increased beyond two threads. Something fishy was going on. Running perf, [Marc] noted that there were significant L1 cache misses, but the real killer for performance was the increase in expensive context switches. Perf indicated that for four threads, the was an overhead of nearly 50% servicing spin locks. There were no locks in the code, so after more perf magic, the syscalls taking all the time were identified. Something in there was using a futex (or fast userspace mutex) a whole lot.

Continue reading “Make Your Code Slower With Multithreading” →

Where The Work Is Really Done – Casual Profiling

July 29, 2019 by Bryan Cockfield 13 Comments

Once a program has been debugged and works properly, it might be time to start optimizing it. A common way of doing this is a method called profiling – watching a program execute and counting the amount of computing time each step in the program takes. This is all well and good for most programs, but gets complicated when processes execute on more than one core. A profiler may count time spent waiting in a program for a process in another core to finish, giving meaningless results. To solve this problem, a method called casual profiling was developed.

In casual profiling, markers are placed in the code and the profiler can measure how fast the program gets to these markers. Since multiple cores are involved, and the profiler can’t speed up the rest of the program, it actually slows everything else down and measures the markers in order to simulate an increase in speed. [Daniel Morsig] took this idea and implemented it in Go, with an example used to demonstrate its effectiveness speeding up a single process by 95%, resulting in a 22% increase in the entire program. Using a regular profiler only counted a 3% increase, which was not as informative as the casual profiler’s 22% measurement.

We got this tip from [Greg Kennedy] who notes that he hasn’t seen much use of casual profiling outside of the academic world, but we agree that there is likely some usefulness to this method of keeping track of a multi-threaded program’s efficiency. If you know of any other ways of solving this problem, or have seen causal profiling in use in the wild, let us know in the comments below.

Header image: Alan Lorenzo [CC BY-SA 3.0].

Gcc: Some Assembly Required

June 8, 2016 by Al Williams 51 Comments

There was a time when you pretty much had to be an assembly language programmer to work with embedded systems. Yes, there have always been high-level languages available, but it took improvements in tools and processors for that to make sense for anything but the simplest projects. Today, though, high-quality compilers are readily available for a lot of languages and even an inexpensive CPU is likely to outperform even desktop computers that many of us have used.

So assembly language is dead, right? Not exactly. There are several reasons people still want to use assembly. First, sometimes you need to get every clock cycle of performance out of a chip. It can be the case that a smart compiler will often produce better code than a person will write off the cuff. However, a smart person who is looking at performance can usually find a way to beat a compiler’s generated code. Besides, people can make value trades of speed versus space, for example, or pick entirely different algorithms. All a compiler can do is convert your code over as cleverly as possible.

Besides that, some people just like to program in assembly. Morse code, bows and arrows, and steam engines are all archaic, but there are still people who enjoy mastering them anyway. If you fall into that category, you might just want to write everything in assembly (and that’s fine). Most people, though, would prefer to work with something at a higher level and then slip into assembly just for that critical pieces. For example, a program might spend 5% of its time reading data, 5% of its time writing data, and 90% of the time crunching data. You probably don’t need to recreate the reading and writing parts. They won’t go to zero, after all, and so even if you could cut them in half (and you probably can’t) you get a 2.5% boost for each one. That 90% section is the big target.

The Profiler

Sometimes it is obvious what’s taking time in your programs. When it isn’t, you can actually turn on profiling. If you are running GCC under Linux, for example, you can use the -pg option to have GCC add profiling instrumentation to your code automatically. When you compile the code with -pg, it doesn’t appear to do anything different. You run your program as usual. However, the program will now silently write a file named gmon.out during execution. This file contains execution statistics you can display using gprof (see partial output below). The function b_fact takes up 65.9% of CPU time.

If you don’t have a profiling option for your environment, you might have to resort to toggling I/O pins or writing to a serial port to get an idea of how long your code spends doing different functions. However you do it, though, it is important to figure it out so you don’t waste time optimizing code that doesn’t really affect overall performance (this is good advice, by the way, for any kind of optimization).

Assembly

If you start with a C or C++ program, one thing you can do is ask the compiler to output assembly language for you. With GCC, use a file name like test.s with the -o option and then use -S to force assembly language output. The output isn’t great, but it is readable. You can also use the -ahl option to get assembly code mixed with source code in comments, which is useful.

You can use this trick with most, if not all, versions of GCC. Of course, the output will be a lot different, depending. A 32-bit Linux compiler, a 64-bit Linux compiler, a Raspberry Pi compiler, and an Arduino compiler are all going to have very different output. Also, you can’t always figure out how the compiler mangles your code, so that is another problem.

If you find a function or section of code you want to rewrite, you can still use GCC and just stick the assembly language inline. Exactly how that works depends on what platform you use, but in general, GCC will send a string inside asm() or __asm__() to the system assembler. There are rules about how to interact with the rest of the C program, too. Here’s a simple example from the a GCC HOWTO document (from a PC program):

__asm__ ("movl %eax, %ebx\n\t"
"movl $56, %esi\n\t"
"movl %ecx, $label(%edx,%ebx,$4)\n\t"
"movb %ah, (%ebx)");

You can also use extended assembly that lets you use placeholders for parts of the C code. You can read more about that in the HOWTO document. If you prefer Arduino, there’s a document for that, too. If you are on ARM (like a Raspberry Pi) you might prefer to start with this document.

So?

You may never need to mix assembly language with C code. But if you do, it is good to know it is possible and maybe not even too difficult. You do need to find what parts of your program can benefit from the effort. Even if you aren’t using GCC, there is probably a way to mix assembly and your language, you just have to learn how. You also have to learn the particulars of your platform.

On the other hand, what if you want to write an entire program in assembly? That’s even more platform-specific, but we’ll look at that next time.

Profiling An Arduino

April 28, 2014 by Brian Benchoff 10 Comments

profiling

In proper, high-dollar embedded development environments – and quite a few free and open source ones, as well – you get really cool features like debugging, emulation, and profiling. The Arduino IDE doesn’t feature any of these bells a whistles, so figuring out how much time is spent in one section of code is nigh impossible. [William] came up with a clever solution to this problem, and while it doesn’t tell you exactly how much time is spent on a specific line of code, it’s still a good enough tool to be a great help in optimization.

[William]’s solution is to create a ‘bin’ for arbitrary chunks of code – one for each subroutine or deeply nested loop. When the profiler run, you end up with a histogram of how much time is spent per block of code. This is done with an interrupt that runs at about 1 kHz, with macros sprinkled around the code. Each time the interrupt ticks, the macro runs and increases a counter by one. Let the sketch run for a minute or so, and you get an idea of how much time is spent in a specific area of code.

It’s a bit of a kludge, but when you’re dealing with extremely minimal tools, any sort of help in debugging is sorely needed and greatly appreciated.