Using Valgrind To Analyze Code For Bottlenecks Makes Faster, Less Power-Hungry Programs

What is the right time to optimize code? This is a very good question, which usually comes down to two answers. The first answer is to have a good design for the code to begin with, because ‘optimization’ does not mean ‘fixing bad design decisions’. The second answer is that it should happen after the application has been sufficiently debugged and its developers are at risk of getting bored.

There should also be a goal for the optimization, based on what makes sense for the application. Does it need to process data faster? Should it send less data over the network or to disk? Shouldn’t one really have a look at that memory usage? And just what is going on inside those CPU caches that makes performance sometimes drop off a cliff on a single core?

All of this and more can be analyzed using tools from the Valgrind suite, including Cachegrind, Callgrind, DHAT and Massif.

Keeping Those Cores Cool

Modern day processors are designed with low power usage in mind, regardless of whether they are aimed at servers, desktop systems or embedded applications. This essentially means that they are in a low power state when not doing any work (idle loop), with some CPUs and microcontrollers turning off power to parts of the chip which are not being used. Consequently, the more the processor has to do, the more power it will use and the hotter it will get.

Code that needs fewer instructions to perform the same task due to a more efficient algorithm or fewer abstractions not only runs cooler, but also faster. This means that for the user, the experience is that not only does the task complete faster, but the device also gets less hot, with less fan noise. If it is battery powered, the battery will also last longer with a single charge. Basically everyone will be happier.

The weapons of choice here are Cachegrind and Callgrind. Although heap profiling (covered later in this article) can also be useful for power saving, the main focus should be on the processor. This means that we need to know what our code is doing, especially in terms of what parts of our code run most often, as those would be prime targets for optimization.

Track and Trace Those Calls

Running Cachegrind and Callgrind is quite uncomplicated. One simply passes the executable name and any flags it needs to Valgrind along with the tool you wish to use:

$ valgrind --tool=callgrind my_program 

This command would start the Callgrind tool for our program called my_program. Optionally, we can also have Callgrind simulate the CPU caches with --simulate-cache=yes. While running, Callgrind generates an output file called callgrind.out.<pid>, where <pid> is the process ID of the application while it was running. This file is then converted to a human-readable format:

$ callgrind_annotate callgrind.out.<pid> > callgrind00.txt

This produces a file that contains (among other things) a function call summary, ranked by how long execution spent in that particular function, making it obvious that a lot of speed could be gained if that function were to be optimized.

As explained in this article over at Stanford, the use of cache simulation adds details on cache hit/misses:

  • Ir: I cache reads (instructions executed)
  • I1mr: I1 cache read misses (instruction wasn’t in I1 cache but was in L2)
  • I2mr: L2 cache instruction read misses (instruction wasn’t in I1 or L2 cache, had to be fetched from memory)
  • Dr: D cache reads (memory reads)
  • D1mr: D1 cache read misses (data location not in D1 cache, but in L2)
  • D2mr: L2 cache data read misses (location not in D1 or L2)
  • Dw: D cache writes (memory writes)
  • D1mw: D1 cache write misses (location not in D1 cache, but in L2)
  • D2mw: L2 cache data write misses (location not in D1 or L2)

Seeing a large number of cache misses in an algorithm or loop would be a strong hint to optimize it to take up less data in the cache, employ prefetching to prevent cache misses, or take other measures applicable to the code in question.

Using Cachegrind is fairly similar to Callgrind, except that Cachegrind focuses on CPU caches first and foremost, with function calls a secondary concern. This should make it obvious which tool to pick of these two, depending on one’s most pressing questions.

Doing More with Less Memory

Even though computers and even microcontrollers often come with more memory in the form of caches and main system memory (RAM) than a developer in the 1990s could even begin to dream of, there are two negatives that relate to RAM:

  • RAM isn’t infinite; at some point heap space will run out. Best case, it’s just your application getting terminated by the OS, not the entire OS (or RTOS) bailing out and causing a cascade of faults throughout the system.
  • Active RAM costs power. Each part of a dynamic RAM (DRAM) module has to be constantly refreshed in order for the capacitive charges that store the value to be retained. Is becomes an especially important issue for battery-powered devices.

Reducing the amount of memory used does not only affect the system RAM, but also helps with the caches between the CPU’s processing units and RAM. Less data means fewer cache misses and fewer delays as the memory subsystem scrambles to move the requested data from RAM into the L3, L2 and (usually) the L1 cache.

Although a beefy Xeon or Epyc server processors tend to have a healthy 128 MB of L3 cache (or more), a commonly used ARM processor such as the one in the Raspberry Pi 3 (the BCM2837 SoC) has 16 kB L1 cache for data and instructions each, as well 512 kB L2 cache. There is no L3 cache here. Unless your application uses less than 512 kB memory in total (stack and heap), system RAM will be hit regularly, which will heavily affect the application’s performance.

One distinction to make here is that every application tends to have data stored in RAM — whether on the heap or the stack — that may get accessed regularly or only occasionally. Using Valgrind’s Massif and DHAT tools it’s fairly easy to figure out the usage patterns of this data on the heap, as well as which data doesn’t need to be stored at all any more.

Running the Numbers

Massif is the easiest to use of these two tools, all it takes is a single call on the command line:

$ valgrind --tool=massif my_program

This will run the application and output to the file massif.out.<pid>, where <pid> is the process ID of the application as it was being executed. Before we can use this data which Massif gathered, we first have to process it:

$ ms_print massif.out.<pid> > massif00.txt

This will direct the output from the ms_print utility to a file with the details in a human-readable form. It contains a graph of heap usage over time, like this example from the Massif documentation:

    MB
3.952^                                                                    # 
     |                                                                   @#:
     |                                                                 :@@#:
     |                                                            @@::::@@#: 
     |                                                            @ :: :@@#::
     |                                                          @@@ :: :@@#::
     |                                                       @@:@@@ :: :@@#::
     |                                                    :::@ :@@@ :: :@@#::
     |                                                    : :@ :@@@ :: :@@#::
     |                                                  :@: :@ :@@@ :: :@@#:: 
     |                                                @@:@: :@ :@@@ :: :@@#:::
     |                           :       ::         ::@@:@: :@ :@@@ :: :@@#:::
     |                        :@@:    ::::: ::::@@@:::@@:@: :@ :@@@ :: :@@#:::
     |                     ::::@@:  ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |                    @: ::@@:  ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |                    @: ::@@:  ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |                    @: ::@@:::::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |                ::@@@: ::@@:: ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |             :::::@ @: ::@@:: ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |           @@:::::@ @: ::@@:: ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   626.4

Number of snapshots: 63
 Detailed snapshots: [3, 4, 10, 11, 15, 16, 29, 33, 34, 36, 39, 41,
                      42, 43, 44, 49, 50, 51, 53, 55, 56, 57 (peak)]

The graph shows KDE’s Konquerer web browser as it’s started and run for a while. The vertical axis shows the heap usage (in megabytes), and the horizontal axis the number of instructions that have been executed since the application was launched. This way one can get an idea what heap usage looks like, with each of the slices in the graph detailed further in the file, e.g.:

--------------------------------------------------------------------------------
  n        time(B)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 15         21,112           19,096           19,000            96            0
 16         22,120           18,088           18,000            88            0
 17         23,128           17,080           17,000            80            0
 18         24,136           16,072           16,000            72            0
 19         25,144           15,064           15,000            64            0
 20         26,152           14,056           14,000            56            0
 21         27,160           13,048           13,000            48            0
 22         28,168           12,040           12,000            40            0
 23         29,176           11,032           11,000            32            0
 24         30,184           10,024           10,000            24            0
99.76% (10,000B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->79.81% (8,000B) 0x80483C2: g (example.c:5)
| ->39.90% (4,000B) 0x80483E2: f (example.c:11)
| | ->39.90% (4,000B) 0x8048431: main (example.c:23)
| |   
| ->39.90% (4,000B) 0x8048436: main (example.c:25)
|   
->19.95% (2,000B) 0x80483DA: f (example.c:10)
| ->19.95% (2,000B) 0x8048431: main (example.c:23)
|   
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)

The first column is the slice number, with detailed slices being expanded, showing the percentage of heap space being taken up by specific data in the heap, as well as where in the code it was allocated. Obviously it is required to compile the application with debug symbols included (-g option for GCC) to get the most out of this functionality.

The use of DHAT is similar to Massif, though it outputs to a JSON format, requiring the browser-based viewer (dh_view.html) to actually analyze the data. DHAT can provide more detailed information about the allocated data in the heap than Massif, including things like allocations which are never fully used. Whether this is necessary depends on what kind of optimizations one desires.

Keep That Toolbox Filled

After previously looking at the other commonly used tools in the Valgrind suite, you should have a pretty good idea of when to use Valgrind to assist in debugging and optimization efforts. Although they are all incredibly useful tools, they are not the end all and be all of debugging analysis tools. As every experienced developer knows, what counts is knowing when to use each approach.

Sometimes all one needs is a simple debugger or solid logging inside the application to shake out the most serious issues. Only when that doesn’t help is it time to start breaking out the heavier tools, with the cautionary warning that with powerful tools comes great responsibility in interpreting the data. The experience and knowledge to make the right decisions and draw the right conclusions is as essential a tool as any other.

27 thoughts on “Using Valgrind To Analyze Code For Bottlenecks Makes Faster, Less Power-Hungry Programs

  1. Most/Much code simply doesn’t need to be optimized.

    That which does can often have the biggest improvements by major strategy changes, but sometimes hair splitting optimization can have benefits. Usually though it is wasted time, so save it for when you are bored.

      1. No, I’ve been in the business for over 40 years and spent a lot of my early years wasting time on code efficiency and advocating assembly language as the cure for all ills.

        First make it right, then if needed make it fast.

        1. Meanwhile back on planet earth, database vendors are still finding that a heavy focus on optimization is opening new markets for analytics and creating vast new revenue opportunities.

      2. Here is a concrete example. An embedded RT system has a clock that ticks at 100 Hz. Every tick it has to do the same processing and the existing code can do it in 8 ms, leaving 2ms idle. The requirements will not change. Is there any point in optimizing that code?

        And another — I have some program that runs once a day and generates some kind of summary. It takes 10 seconds to run. My junior programmer tells me he has been busy all week and now it runs in 5 seconds. What do I say to him? You’re fired!

        1. While an embedded system for a particular purpose may not have changing requirements, requirements do change when systems are repurposed and repurposing errors do occur. Such improvements may prevent future errors from occurring (however I also agree that they may not be caught because of this). For example, it is also likely for optimisations to create errors – eg Ariane 4 / Ariane 5.

          I would say to her, “Fantastic work. The optimisation you achieved may not be useful in this particular case; however, please share the knowledge gained from this work with others so we can improve our solutions in the future.”

          1. I don’t know that I’d have fired him, but I’d certainly have told him he’d wasted his time, possibly the time of others, and needlessly cost the company money. There may not have been any worthwhile knowledge gained, either.

            As a specific example, back in the early 1980s, I worked for NBI, which made one of the early dedicated word processing systems. It was built around an 8086-based central server with 6800-based terminals.

            I was assigned at one point to fix a problem with a certain piece of code on the server that had been highly optimized – the initial programmer had spent three weeks coming up with what he claimed was absolutely the most efficient way of performing the task. It took me three weeks to understand how things worked well enough to fix the code. The engineer who had fixed a previous problem with the code also took three weeks to understand it before applying her fix.

            The optimized code saved perhaps a few hundred milliseconds compared to a straightforward implementation. It ran once during system boot, which, for the server, was probably about once per week.

            As another example, in the 1990s at another company, I had to fix a problem with a serial port ring buffer in one of our products – a customer had discovered that a buffer overrun caused a persistent error state in which characters were put into the buffer twice. When I went into the code, I found comments that lauded the robustness of the implementation, but no comments describing the code, nor about the thought process that led to implementing a ring buffer using several hundred lines of code for a state machine with 64 states.

            Some programmers don’t care about business justifications, future maintainers, or anything else, they just want to be tricky to show off.

        2. there sure is a point in optimizing the code if you can reduce the clock speed to reduce power consumption or even substitute a lower powered part.

          ” I have some program that runs once a day and generates some kind of summary. It takes 10 seconds to run. My junior programmer tells me he has been busy all week and now it runs in 5 seconds. What do I say to him? You’re fired!”

          If you actually sold your products to customers you would give him a raise for saving them so much time. My boss would certainly give me a raise if I saved each of them 5 seconds a week.

    1. Well, somebody has to post it:

      “There is no reason for any individual to have a computer in his home.” – Ken Olsen

      “Linux is a cancer that attaches itself in an intellectual property sense to everything it touches,” – Steve Ballmer

      We can pretend all day that these folks were “visionary”

    2. “Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

      Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.”

      — Donald Knuth, 1974, “Structured Programming with go to Statements”

  2. Nowaday most programmers relies on dynamic allocation where as, in the old time, you’d use them the less often possible. First because managing memory was cumbersome and it was easy to leak it, but mostly because dynamic variables wouldn’t be optimized as much (if at all) by the compiler. Having fixed execution path with static variable is so much easier for the compiler to optmize and keep those variables inside the registers.

    1. Using static variables is a great way to make sure your code will never make its way into a modern multithreaded design. Processor designers spent a lot of effort to give us threads, mutexes, etc. in even simple microcontrollers and we programmers should take more advantage of them.

  3. An early-in-development round of profiling can reveal bad design decisions that seemed like good ideas at the time. If you wait til the very end of a project to look at optimization, you cost yourself more work fixing things.

    On the other hand, the really fine tuned optimization really does need to wait. There’s no use optimizing code that’s never actually run.

  4. kcachegrind – The tool that makes interpreting ‘callgrind’ output easy! Readily available on github.

    I have no affiliation with the makers of this tool, but i have used this tool many times over the years.
    Totally worth any effort to get it installed and running. The UI, graphs, and charts make it easy to see the pain-points in the code.

    On my philosophy of “optimization”, it must be a balanced with “time and place”.
    * Is it the right time to optimize?
    * Is it the right place to optimize?
    * What is the amount of developer time (which is $) your willing to spend?

    Because optimization is an iterative process, you need to repeatedly ask yourself about your optimization objective:
    * Is your optimization going for a 5% or a 25% performance gain?
    * Are you spending 95% of your time for a 5% gain?

  5. I’ve worked on multi core systems where every clock mattered. Profiling with custom tooling was the first thing created. That way every little change, a hit can be noticed and further improvement became intuitive.

  6. Valgrind is a useful tool. However, for the performance tuning I’ve done, I’ve found an execution profiler to be a better tool. This allows the program to run at full speed on the desired processor, with it being periodically interrupted (every few micro-seconds), and the program counter recorded. Afterward, the results can be computed to determine where the processor is spending most of its time.

    One of the earliest discussions of a program execution analyzer/profiler was by Leigh Power in 1983, in the IBM System Journal article, “Design and use of a program execution analyzer”.

    https://dl.acm.org/doi/abs/10.5555/1717752.1717761

  7. Optimizing your code is a bit of good citizenship. These days so many systems are virtualized that the few seconds you did not care about are depriving someone else of those compute cycles. There is little reason to assume your code will always run on exclusive hardware.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.