Linux performance testing with perf, gprof and Valgrind

Linux offers many tools for profiling and instrumenting executables. Some are non-intrusive: they require no changes to the source code or the build process, and some add little or no runtime overhead. Others can make code run up to a few hundred times slower. Below we look at a few of these tools and list their pros and cons for profiling and debugging. They are ordered from the least to the most intrusive.

perf

`perf` (also known as `perf_events` or `Performance counters for Linux`) is a potent Linux tool that abstracts away CPU hardware differences and presents performance measurements through a command-line interface. It can be used for finding bottlenecks, analysing where an application spends its time, measuring thread wait latency, or even for accurate benchmarking. Gathered data can be stored and visualised using external tools like Brendan Gregg’s `FlameGraph`. All you need to do is choose which aspect of the application you want to analyse, find the events related to it and start a recording session. Watch out: `perf` logs with many events enabled can get big really fast!

`perf` reads hardware CPU registers that count hardware events such as cache misses, branch misses and CPU cycles. Besides those, there are software events (e.g. page faults) and static tracepoint events in the kernel that can be traced unobtrusively. To get the full list (which can be long), run `perf list`. Most tracepoints are static and cover low-level calls and events, but `perf` also supports dynamic tracing in both the kernel and userspace through `kprobes` and `uprobes`; that is a topic for a whole other article.

Example of available events that can be traced:

$ perf list

List of pre-defined events (to be used in -e):

cpu-cycles OR cycles                       [Hardware event]
instructions                               [Hardware event]
cache-references                           [Hardware event]
cache-misses                               [Hardware event]
branch-instructions OR branches            [Hardware event]
branch-misses                              [Hardware event]
bus-cycles                                 [Hardware event]

cpu-clock                                  [Software event]
task-clock                                 [Software event]
page-faults OR faults                      [Software event]
minor-faults                               [Software event]
major-faults                               [Software event]
context-switches OR cs                     [Software event]
cpu-migrations OR migrations               [Software event]
alignment-faults                           [Software event]
emulation-faults                           [Software event]

L1-dcache-loads                            [Hardware cache event]
L1-dcache-load-misses                      [Hardware cache event]
L1-dcache-stores                           [Hardware cache event]
L1-dcache-store-misses                     [Hardware cache event]
L1-dcache-prefetches                       [Hardware cache event]
L1-dcache-prefetch-misses                  [Hardware cache event]
L1-icache-loads                            [Hardware cache event]
L1-icache-load-misses                      [Hardware cache event]
L1-icache-prefetches                       [Hardware cache event]
L1-icache-prefetch-misses                  [Hardware cache event]
LLC-loads                                  [Hardware cache event]
LLC-load-misses                            [Hardware cache event]
LLC-stores                                 [Hardware cache event]
LLC-store-misses                           [Hardware cache event]

rNNN (see 'perf list --help' on how to encode it) [Raw hardware event descriptor]

mem:<addr>[:access]                        [Hardware breakpoint]

kvmmmu:kvm_mmu_pagetable_walk              [Tracepoint event]

[...]

sched:sched_stat_runtime                   [Tracepoint event]
sched:sched_pi_setprio                     [Tracepoint event]
syscalls:sys_enter_socket                  [Tracepoint event]
syscalls:sys_exit_socket                   [Tracepoint event]

[...]

Measuring counters for `ls -la` on a virtual machine (hence lack of some statistics):

$ perf stat ls -la

[output omitted...]

Performance counter stats for 'ls -la':

         1,120814    task-clock (msec)         # 0,772 CPUs utilized
               48 context-switches          # 0,043 M/sec
                0 cpu-migrations            # 0,000 K/sec
              131 page-faults               # 0,117 M/sec
  <not supported>      cycles
  <not supported>      instructions
  <not supported>      branches
  <not supported>      branch-misses

      0,001452149 seconds time elapsed

gprof

`gprof` is a tool for instrumentation and performance analysis on Linux-based platforms. Unlike `perf`, it requires additional profiling information to be included when the application is compiled and linked. For C++ programs, add `-pg` to the compiler and linker flags and run your program. On exit it writes a file called `gmon.out`, which can be read using `gprof`. There you will find a lot of information about the execution, such as the time spent in each function and how often it was called, divided into two parts:

– flat profile,

– call graph.

In essence, `gprof` relies on instrumentation added to the compiled code and samples the execution at a statistically relevant rate. You won’t be able to debug your application with it, but with the compiler flag `-finstrument-functions` you can add your own code to be run on function entry and exit. If you would like `gmon.out` to contain more useful information, compile your code with debug symbols so that you can read the stack information.

Valgrind

`Valgrind` is an underutilised toolkit. Most people know only its `memcheck` tool, but `Valgrind` is really a whole framework with a myriad of tools for different purposes. You can also write your own tools, but first we need to discuss how it works. When you run your code under `Valgrind`, it does not run directly on the real, physical CPU in your computer, but on a synthetic CPU provided by the `Valgrind core`. Instructions are passed to the selected tool, which adds instrumentation code and passes the whole package back to the core, which executes it. It is one of the most intrusive methods of profiling applications, but also the most powerful. With great power comes great time overhead: a program can run more than 200 times slower under `Valgrind` than during regular execution.

Commonly used tools:

– _memcheck_ – memory leaks,

– _massif_ – heap memory usage profiling,

– _cachegrind_ – cache usage profiling,

– _callgrind_ – call graph tracing,

– _drd_ – data race condition detection,

– _helgrind_ – thread error detection (data races, lock ordering problems that can lead to deadlocks).

The example below shows output from the `memcheck` tool and gives an idea of what can be checked. It tracks dynamically allocated memory and examines whether there are any actual or potential memory leaks. Running it with `--leak-check=full` produces a full log with information about the functions (or their addresses, in the case of stripped binaries) that allocate memory but do not free it.

==10934== Process terminating with default action of signal 2 (SIGINT)
==10934==    at 0x4F43C73: __recvfrom_nocancel (syscall-template.S:84)
==10934==    by 0x108E7A: main (in /home/example/sockets)
==10934==
==10934== HEAP SUMMARY:
==10934==     in use at exit: 3,044 bytes in 1 blocks
==10934==   total heap usage: 2 allocs, 1 frees, 4,068 bytes allocated
==10934==
==10934== LEAK SUMMARY:
==10934==    definitely lost: 0 bytes in 0 blocks
==10934==    indirectly lost: 0 bytes in 0 blocks
==10934==      possibly lost: 0 bytes in 0 blocks
==10934==    still reachable: 3,044 bytes in 1 blocks
==10934==         suppressed: 0 bytes in 0 blocks
==10934== Rerun with --leak-check=full to see details of leaked memory
==10934==
==10934== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

`massif`, on the other hand, measures how much heap memory is allocated (and, with `--stacks=yes`, stack memory as well) and takes snapshots over time to create graphs showing memory usage. Here you can see the visualisation made with `massif-visualizer` based on data acquired during short usage of the `gedit` text editor.

Here we have the call graph obtained with `callgrind` and visualised with `kcachegrind`. We can explore the whole execution tree and see which function calls took the most time (indicated here as a percentage of CPU time). Unfortunately, without debug information we are unable to pinpoint exact locations and function names, so it’s imperative to compile with debug symbols when instrumenting and analysing code with `Valgrind` tools.

Conclusion

There are many different approaches to profiling and instrumentation. There is no single _best_ or _optimal_ solution; you have to decide which one suits your needs. For production environments and a fast overview of performance, try `perf` and other non-intrusive tools. If you have specific functions that are slow and need to be overhauled, use `gprof` or `Valgrind`. Sometimes even those tools are not enough and you will have to create your own, just as we did with `xprof`.

If you liked this overview you might also want to read other articles from the tools series, which gave a rundown on /proc, netstat and JMap.
