We all know what bugs in code are. We don’t like them when they are in programs we use, and they’re even worse when they are in code which we have written. Clearly, the best code is bug-free, but how do we get there?
This isn't a new question, of course, just one that has become ever more important as the total number of lines of code (LoC) running modern-day society keeps increasing, and one that affects even hobbyists more and more often now that everything has a microcontroller inside.
Although many of us know the smug satisfaction of watching a full row of green result markers light up across the board after running the unit tests for a project, the painful reality is that you don’t know whether the code really is functionally correct until it runs in an environment that is akin to the production environment. Yet how can one test an application in this situation?
This is where tools like those contained in the Valgrind suite come into play, allowing us to profile, analyze and otherwise nitpick every single opcode and memory read or write. Let’s take a look, shall we?
It’s Broken, Make It Work Again
When it comes to software development (and hardware development to some extent as well), there are three possible states of being broken:
- It is obviously broken.
- It works, but sometimes it breaks.
- It works fine, but is actually broken.
The first type should come as no surprise to anyone. It's the kind of failure that happily announces itself with such cheerful terms as 'SIGSEGV' (segmentation fault) and 'SIGBUS' (bus error), which indicate that the operating system's kernel has detected that the application is about to do something that is illegal, or impossible. Dividing by zero is a good example of the latter.
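To pick a trivial sketch of the first category (my own example, not one from Valgrind), dereferencing a null pointer is all it takes:

    int main() {
        int* p = nullptr;
        *p = 42;  // invalid write to address zero: the kernel answers with SIGSEGV
        return 0;
    }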
The second type of brokenness — where it does run but sometimes throws errors — is more intriguing, in that it allows the application to be run through its paces, transferring data, opening and writing files, and displaying data on screen without any issues. Until suddenly, doing the same thing a second time, it fails. Or after an hour of working fine it fails. Or it starts doing something 'weird', after which the application's behavior begins to feel almost random.
The third type of brokenness — where it runs but it shouldn’t — is also known as ‘how the heck did this ever work in the first place’, with its discovery usually accompanied by loud exclamations, the questioning of the very fabric of reality, and possibly a few quick prayers to one’s deity of choice depending on theological affinity. This kind of code has managed to reach just the perfect balance within a perfect storm of mistakes that allows it to do the right thing by sheer chance. Until one dares to alter a line of code, of course.
It’s Not Magic, It’s Just Complicated
In its most elementary form, software is merely a series of instructions for the underlying hardware. This hardware attempts to carry out these instructions to the best of its abilities, which involve not only the processing core(s) of the CPU, but also its caches, cache synchronization logic (for multi-core CPUs), memory controller(s) and system memory. On top of this there is usually an operating system (OS) which serves to make life easy for application developers, as they don’t have to worry about implementing a task scheduler, heap and stack management, as well as a lot of other fun details that no application developer wants to mess with.
Each of these elements of the OS and underlying hardware can affect the execution of the code, and each issue will affect different parts of this whole system. This is why we need to have a range of tools. In the case of a suite like Valgrind, the main tools that we find ourselves using are called Memcheck, DRD and Helgrind.
Using Valgrind to Monitor Memory
The default tool that Valgrind uses when started is Memcheck. As the name suggests, it checks memory. More specifically, it inserts a layer between the OS and the application that is being tested. Much like a debugger, it then tracks each memory write and read, keeping track of references, valid memory ranges, whether blocks of memory are still reachable or not, and so on.
A common use case of Memcheck is detecting memory leaks, e.g.:
    int main() {
        int* foo = new int;
        int* bar = new int;
        *foo = 42;
        *bar = 24;
        bar = foo;  // the second allocation's address is now lost
    }
Which would spit out something like this in the Memcheck log:
4 bytes in 1 blocks are definitely lost in loss record 1 of 14
Followed by a backtrace indicating when access to the data (previously pointed to by bar) was lost. When passing --leak-check=full to Memcheck, it will also let you know where the data that has been lost was allocated. Here Memcheck may report 'definitely', 'indirectly' or 'possibly' lost data. Unless you have an obvious problem, the 'definitely' lost blocks of data are the ones to focus on. Indirectly lost data is usually the result of losing the address of a block of pointers, so fixing the 'definitely lost' issue for that should also resolve any 'indirectly lost' issues.
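As a minimal sketch of how that typically happens (again my own example), consider losing the only pointer to an object that itself owns another allocation:

    struct Node {
        int* payload;  // owned by the Node
    };

    int main() {
        Node* node = new Node;
        node->payload = new int(42);
        node = nullptr;  // the Node is now definitely lost, and its payload is
                         // indirectly lost, as it was only reachable via the Node
        return 0;
    }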
Usually one runs Memcheck with this CLI command to get the most useful information:
$ valgrind --tool=memcheck --log-file=memcheck00.txt --leak-check=full --read-var-info=yes path/to/binary
This way the output will be written to a log file (memcheck00.txt), we will get the full leak report, and Memcheck will use any debug information in the binary, if present, to make the trace even more readable. It’s highly advisable to use binaries that have all debug symbols in place to make one’s life easier.
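For example, assuming a hypothetical main.cpp, building with debug info (and without heavy optimization) keeps the backtraces readable:

    $ g++ -g -O0 -o myapp main.cpp
    $ valgrind --tool=memcheck --log-file=memcheck00.txt --leak-check=full --read-var-info=yes ./myapp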
Finding Other Memory Problems
Memcheck is also very useful for detecting invalid reads and writes, as well as the freeing of memory that was not allocated by the application. This would suggest that the application is doing something naughty with memory, which could lead to crashes, corrupted data and other fun. This also includes the use of mismatched free() and delete calls, which can be an issue when mixing C and C++ code in the same application.
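A minimal sketch of the classic mismatch, allocating with new[] but releasing with free():

    #include <cstdlib>

    int main() {
        int* data = new int[16];  // C++ array allocation
        std::free(data);          // wrong deallocator: Memcheck flags this as a
                                  // mismatched free() / delete / delete[]
        return 0;
    }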
Finally, Memcheck will also sanity check your arguments to malloc and similar memory allocation functions, as well as memcpy and similar C functions, catching a lot of issues that would otherwise show up during testing if one is lucky. The Memcheck manual has an assortment of examples, as do various Memcheck tutorials out there (like this one, which covers debugging a memory leak).
Keep Your Threads Where We Can See Them
The other two tools in Valgrind that are exceedingly useful are Helgrind and DRD, which focus primarily on multithreading and all the issues that this may cause. Depending on the settings used, they can track thread activity in a fairly coarse fashion, or log every single mutex movement and so on. Of course, the more one tracks, the more one’s application slows to a crawl.
Although it may seem redundant for Valgrind to have two tools which at first glance appear to do the same thing, Helgrind and DRD are not identical. Each uses a different approach for analyzing application behavior and thus each may give (slightly) different results. It’s often a good idea to run both for this reason.
Issues that we can track down using Helgrind and DRD include, for example, deadlocks, where two or more threads try to obtain the lock (mutex, rwlock, or similar) to a resource, while also holding a lock themselves. As each thread will only release its lock after it has obtained the other one, nothing happens and the application is effectively frozen.
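To make that concrete, here is a minimal sketch (my own, not one of Valgrind's test cases) of two threads taking the same two mutexes in opposite order:

    #include <mutex>
    #include <thread>

    std::mutex a, b;

    void worker1() {
        std::lock_guard<std::mutex> lock_a(a);  // takes a first...
        std::lock_guard<std::mutex> lock_b(b);  // ...then b
    }

    void worker2() {
        std::lock_guard<std::mutex> lock_b(b);  // takes b first...
        std::lock_guard<std::mutex> lock_a(a);  // ...then a: opposite order
    }

    int main() {
        std::thread t1(worker1), t2(worker2);
        t1.join();
        t2.join();  // with unlucky timing neither join ever returns;
                    // Helgrind flags the inconsistent lock order even on runs
                    // where the deadlock does not actually trigger
        return 0;
    }

With the usual GCC/glibc toolchain, std::mutex and std::thread sit on top of the pthreads primitives that these tools intercept.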
With DRD we can also trace the behavior of locks, including the time that a specific lock was held for:
$ valgrind --tool=drd --exclusive-threshold=10 drd/tests/hold_lock -i 500
...
==10668== Acquired at:
==10668==    at 0x4C267C8: pthread_mutex_lock (drd_pthread_intercepts.c:395)
==10668==    by 0x400D92: main (hold_lock.c:51)
==10668== Lock on mutex 0x7fefffd50 was held during 503 ms (threshold: 10 ms).
==10668==    at 0x4C26ADA: pthread_mutex_unlock (drd_pthread_intercepts.c:441)
==10668==    by 0x400DB5: main (hold_lock.c:55)
Here we set a threshold value of 10 ms, with the test application being instructed to hold the lock for 500 ms. As we can see, the lock (mutex) was held for 503 ms, according to DRD.
Sometimes Order Matters
A useful feature of Helgrind is that it tracks the order in which locks are normally acquired, and reports when that order changes:
Thread #1: lock order "0x7FF0006D0 before 0x7FF0006A0" violated

Observed (incorrect) order is: acquisition of lock at 0x7FF0006A0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x400825: main (tc13_laog1.c:23)

 followed by a later acquisition of lock at 0x7FF0006D0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x400853: main (tc13_laog1.c:24)

Required order was established by acquisition of lock at 0x7FF0006D0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x40076D: main (tc13_laog1.c:17)

 followed by a later acquisition of lock at 0x7FF0006A0
   at 0x4C2BC62: pthread_mutex_lock (hg_intercepts.c:494)
   by 0x40079B: main (tc13_laog1.c:18)
The thing about lock ordering is that acquiring locks in different orders throughout the execution of the application might be totally valid, or it might be indicative of a logic error.
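The report above references one of Helgrind's own regression tests (tc13_laog1.c); stripped to its essence, the pattern it complains about looks roughly like this sketch:

    #include <pthread.h>

    pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

    int main() {
        // this pair establishes the order "m1 before m2"
        pthread_mutex_lock(&m1);
        pthread_mutex_lock(&m2);
        pthread_mutex_unlock(&m2);
        pthread_mutex_unlock(&m1);

        // this pair violates it: "m2 before m1"
        pthread_mutex_lock(&m2);
        pthread_mutex_lock(&m1);
        pthread_mutex_unlock(&m1);
        pthread_mutex_unlock(&m2);
        return 0;
    }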
Thinking About Multithread Flow
Both tools will track data races, which occur when two or more threads try to access the same resource simultaneously, without a locking mechanism or the use of atomics to prevent data corruption and worse. This can be as subtle as a single unsigned 64-bit integer that is being read by one thread while another writes to it. If the read operation isn't atomic (i.e., the whole 64-bit value is read as one indivisible operation), the value can be changed by the writing thread halfway through the reading operation.
Data races are generally bad news, and must be fixed. Though a data race is reported even for operations that happen to be atomic on the hardware (e.g. reading a boolean or 8-bit integer on most architectures), specifying the type as an atomic type (e.g. via the STL's <atomic> header for C++) is an easy way to make DRD and Helgrind happy, while also being the technically correct approach to writing multithreaded code.
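As a minimal sketch of that one-line fix (the racy declaration is left in as a comment):

    #include <atomic>
    #include <cstdint>
    #include <thread>

    // std::uint64_t counter = 0;           // plain 64-bit integer: DRD/Helgrind report a data race
    std::atomic<std::uint64_t> counter{0};  // atomic: reads and writes are indivisible

    int main() {
        std::thread writer([] { for (int i = 0; i < 100000; ++i) counter++; });
        std::thread reader([] { volatile std::uint64_t snapshot = counter; (void)snapshot; });
        writer.join();
        reader.join();
        return 0;
    }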
But Wait, There’s More
In this article we only addressed the Valgrind tools that are most useful for debugging, as these tend to be memory and multithreading-related issues. This raises the prospect of another highly enjoyable and educational pursuit for any software developer: optimizing code.
After your application has stopped crashing, no longer corrupts data and is behaving itself, what better use of one's time than to dive deep into its performance statistics to eke out more performance? This is where tools such as Cachegrind, Callgrind and Massif are useful to figure out where the bottlenecks in the application lie, and where one should focus any optimization efforts.
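To give a small taste already (the binary name is just a placeholder), these profilers are started the same way as Memcheck, only with a different --tool, and each comes with its own post-processing command:

    $ valgrind --tool=callgrind ./myapp        # per-function call counts and instruction costs
    $ callgrind_annotate callgrind.out.<pid>   # summarize the recorded data
    $ valgrind --tool=massif ./myapp           # heap usage over time
    $ ms_print massif.out.<pid>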
We will have to save that joyful topic for another day, however.
I’ve always referred to things/software that worked, but should not have as “Dancing Bears” as in the old Russian/Jewish saying: “The true wonder of the dancing bear is not how well he dances, but that he dances at all.”
This would be more useful to the HAD crowd if you focused on how it can be used with embedded code where typically you don’t have any dynamic memory allocation at all.
Obviously it can’t run on embedded platforms but it’s common to produce a software test harness that lets at least part of the firmware run on a PC so you can use these kinds of tools and make unit tests.
One way to use Valgrind for embedded development is to compile and run your code as part of a unit test on a PC with a standard GCC compiler. You simply execute the compiled unit test through Valgrind, and BAM – errors / leaks are presented.
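For example (file and binary names made up for illustration):

    $ gcc -g -O0 -o test_ringbuffer test_ringbuffer.c ringbuffer.c
    $ valgrind --leak-check=full ./test_ringbuffer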
Furthermore, if continuous integration is your thing, there is a nice Jenkins plugin for easily presenting Valgrind errors.
While I am surprised an entire plugin was made for this type of test:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=bork.log myprog
It will work 99.98% of the time for C/C++ based programs. One issue this line will often not catch is extremely slow leaks in anon memory within buggy esoteric shared libraries. A common issue if you are writing something that ideally needs to run for years, and losing a few KiB a week is unacceptable as a general rule anyway.
Thus it is not fun, but one can end up going back to manual inspection
ps -o pid,user,%mem,command ax | sort -b -k3 -r
sudo pmap 12345
But at least this will usually confirm if your stuff is responsible for a leak…
after a few days of waiting…
so lame when this happens once in a blue moon… ;-)
73
In many cases bugs are in the tiny details, the things that seem unimportant. Therefore these things don't get the attention they deserve… this lack of attention feeds the anger of these little critters, and while everything seems to work at first, then when you least expect it… the bug bites you.
After many hours of searching for the problem you find it… you needed to code “greater or equal” instead of just “greater”.
Or perhaps it was that one time where you wrote 35 but it should have been 34… because computers start to count from 0 (not 1).
Or perhaps it was that time where you forgot to check if the variable was 0 before you entered the loop, and now your loop runs 256 times instead of not at all.
And then there is the copy paste bug… the piece of code you use everywhere, so you can just copy it and change the values where needed. So you do this everywhere… well almost, because you missed one value, because that was the moment the phone rang and you lost your attention and forgot where you were exactly with the modifying of the values.
I won’t even mention the part of the code where you need to check for an exception situation, but when you got to that piece of code you decided to write it later… to stay focused on the big picture… and then you forget all about it. Hmmm… should have prepared that more thoroughly before starting to type the code.
Well… those are the bugs that I make, it's perfectly fine code. But it doesn't always do what it needs to when it needs to, which depends on the situation (or the pressure to fail). Sure, the brilliant loop where you scan for the special situation is all worked out flawlessly, and one day it will get you a Nobel prize and people will recognize you for the fine programmer that you are, but why didn't you pay enough attention to those tiny things?
In many cases bugs aren’t about the things you find difficult, it’s in the things you think you know well, the things you don’t need to think of, because they seem to be so easy… and then the bugs are born. Stay focused!
And if you can't find the problem in 15 minutes after staring at your loops, do something else, take a break, stop searching, then when you start searching a few hours later you'll have a clear mind and find the obvious.
And ohhh… a strategically placed printf or debug LED, both of which can work wonders.
Step 1: write the documentation first
Step 2: write the unit tests
Step 3: write the code last
You should feel like you want to write code first, resist the temptation. Keep writing more and more documentation.
Every function needs at least one unit test. Your tests will probe out to the edges of the ranges of each parameter. Try to make all of the errors happen. Read over your documentation and make sure you are testing every bit of what it says your program will do. It will seem like stupid busy work at this point but it will all pay off soon enough.
By the time you get to this point, you will feel the code in your fingers, just itching to get out. It will all come pouring out at once. Let it flow! Get it all out of you and start testing right away while it is all fresh in your head. Fix all of the warning messages. Compile with different compilers on different platforms and fix the warning messages you get from those other compilers, too. If you don't have one, get a macbook and test with xcode, you will surely find even more bugs. Test and fix, lather and repeat until all your tests are passing. Test on 32 and 64 bit systems, big-endian and little-endian.
If you do things in this order then bugs like typos in your code, edge cases, etc. get caught immediately before they fester into weirdness. Also it loses the mystique and it becomes more like an ordinary thing that you can just do every day. It’s also the fastest way to arrive at your destination of a working program.
and Valgrind absolutely rules! figure it out and use it! Run it on your unit tests. Run it on your program. It will probably find bugs in the libraries that you use. File bug reports and get them fixed so they don't break your program.
I’d be more interested in this project if it had two things:
1. A reasonable website that easily provides the info I need: why I should use it, who its target audience is, and its use-cases.
2. Regular software updates; Last release shows: “Valgrind 3.15.0 — 14 April 2019”, which is one year ago.
Great philosophical question there, should you be more worried if experts on bughunting rarely release new versions, or are doing nightly builds, with a stable release every few days???
Tools like valgrind belong to a class that:
a) Don’t access the Internet.
b) Don’t do GUI.
c) Have been around for a long time.
These need much less support than do the hottest new applications. I would say that having a last release one year old is pretty current for this class of programs.
The most infuriating bugs are the memory that don’t occur when you are running using a debugger. Valgrind is good for tracking those bastards down.
memory bugs*
Heisenbugs as I’ve often heard them referred to are very annoying. Performance and race conditions similarly can be changed by the act of inspecting them.
1. There’s an explanation on their front page. If that’s not enough, it’s probably not for you.
2. Don’t fix what ain’t broke. Frequent updates can be considered a good sign or a not so good sign.
Valgrind is a fairly well known name. I don’t think it’s desperate for users.
Issues with pointers and memory corruption are why most of the MCU code our company writes is designed in ways that avoid them at all costs. Any use of pointers must be explicitly described in a JIRA task and a related Confluence story page.
If you want to go -way- down the rabbit hole of testing multi-threaded applications for concurrency problems, consider using SPIN.
I love valgrind/GDB and miss it anytime I can’t use it. What will really bake your noodle is you can run valgrind with GDB so when it does have an error it will stop and you can debug/trace! Valgrind will catch a lot of bugs you didn’t even know you had too.
valgrind --vgdb=yes --vgdb-error=0
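(Rough sketch of the workflow, per the Valgrind manual, with the program name as a placeholder: start the program under Valgrind in one terminal, then attach GDB from a second one.)

    $ valgrind --vgdb=yes --vgdb-error=0 ./myprog   # waits for the debugger before running
    $ gdb ./myprog                                  # in another terminal
    (gdb) target remote | vgdb                      # connect GDB to Valgrind's gdbserver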
Talking about finding memory and thread problems without even mentioning the clang/gcc sanitizers is a bit strange
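For reference, a quick sketch of what that looks like (file names are placeholders): compile with the sanitizer baked in and just run the binary; reports are printed as the errors occur.

    $ g++ -g -fsanitize=address,undefined -o myapp main.cpp   # ASan + UBSan
    $ g++ -g -fsanitize=thread -o myapp main.cpp              # TSan for data races (not combinable with ASan)
    $ ./myapp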