OProfile manual


Table of Contents

1. Introduction
1. Applications of OProfile
2. System requirements
3. Internet resources
4. Installation
5. Uninstalling OProfile
2. Overview
1. Getting started
2. Tools summary
3. Controlling the profiler
1. Using opcontrol
1.1. Examples
2. Using oprof_start
3. Configuration details
3.1. Hardware performance counters
3.2. OProfile in RTC mode
3.3. OProfile in timer interrupt mode
3.4. Pentium 4 support
3.5. Intel Itanium 2 support
3.6. Dangerous counter settings
4. Other features
4.1. pid/pgrp filter
4.2. Unloading the kernel module
4. Obtaining results
1. oprofpp usage
2. op_time: overall view of all system binaries
3. op_to_source: outputting annotated source
4. op_merge: merging samples files
5. Interpreting profiling results
1. Profiling interrupt latency
2. Kernel profiling
2.1. Interrupt masking
2.2. Idle time
2.3. Exiting tasks
2.4. Profiling kernel modules
3. Inaccuracies in annotated source
3.1. Side effects of optimizations
3.2. Prologues and epilogues
3.3. Inlined functions
3.4. Inaccuracy in line number information
4. Assembly functions
5. Other discrepancies
6. Profiling overhead
7. Acknowledgments

Chapter 1. Introduction

This manual applies to OProfile version 0.5.4. OProfile is a profiling system for Linux 2.2/2.4/2.6 systems on a number of architectures. It is capable of profiling all parts of a running system, from the kernel (including modules and interrupt handlers) to shared libraries to binaries. It runs transparently in the background, collecting information with low overhead. These features make it ideal for profiling entire systems to determine bottlenecks in real-world systems.

Many CPUs provide "performance counters", hardware registers that can count "events"; for example, cache misses, or CPU cycles. OProfile provides profiles of code based on the number of these occurring events: repeatedly, every time a certain (configurable) number of events has occurred, the PC value is recorded. This information is aggregated into profiles for each binary image.
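As a rough illustration of how the configurable event count translates into sampling frequency, consider the following arithmetic (the clock rate and count here are assumptions chosen for the example, not OProfile defaults) :

```shell
# On a hypothetical 800 MHz CPU counting CPU cycles, a reset count of
# 400000 events between samples gives this approximate rate per CPU:
cpu_hz=800000000      # assumed clock rate
count=400000          # events between samples
echo $((cpu_hz / count))   # samples per second: 2000
```

Halving the count doubles the sampling rate, and the profiling overhead along with it.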

Some hardware setups do not allow OProfile to use performance counters: in these cases, no events are available, and OProfile operates in timer/RTC mode, as described in later chapters.

1. Applications of OProfile

OProfile is useful in a number of situations. You might want to use OProfile when you :

  • need low overhead

  • cannot use highly intrusive profiling methods

  • need to profile interrupt handlers

  • need to profile an application and its shared libraries

  • need to capture the performance behaviour of entire system

  • want to examine hardware effects such as cache misses

  • want detailed source annotation

  • want instruction-level profiles

OProfile is not a panacea. OProfile might not be a complete solution when you :

  • require call graph profiles

  • don't have root permissions

  • require 100% instruction-accurate profiles

  • need function call counts or an interstitial profiling API

  • cannot tolerate any disturbance to the system whatsoever

  • need to profile interpreted or dynamically compiled code such as Java or Python

2. System requirements

Linux kernel 2.2/2.4

OProfile uses a kernel module that can be compiled for 2.2.11 or later and 2.4. Versions 2.4.10 or above are recommended, and required if you use the boot-time kernel option nosmp. AMD Hammer support requires a recent (>= 2.4.19) kernel with the line EXPORT_SYMBOL(do_fork); present in kernel/ksyms.c. Such a kernel is present in the x86-64.org CVS repository. 2.5 kernels are supported with the in-kernel OProfile driver.

modutils 2.4.6 or above

You should have installed modutils 2.4.6 or higher (in fact earlier versions work well in almost all cases).

Supported architecture

For Intel IA32, a CPU with either a P6 generation or Pentium 4 core is required. In marketing terms this translates to anything between an Intel Pentium Pro (not Pentium Classics) and a Pentium 4 / Xeon, including all Celerons. The AMD Athlon, Duron, and Hammer CPUs are also supported. Other IA32 CPU types only support the RTC mode of OProfile; please see later in this manual for details. Hyper-threaded Pentium 4s are not supported in 2.4. For 2.4 kernels, the Intel IA-64 CPUs are also supported. For 2.5 kernels, there is additionally support for Alpha processors, and sparc64, ppc64, and PA-RISC in timer mode.

Uniprocessor or SMP

SMP machines are fully supported.

Required libraries

These libraries are required : popt, bfd, liberty (debian users: libiberty is provided in binutils-dev package), dl, plus the standard C++ libraries.

Bash version 2

The opcontrol script requires at least bash version 2, installed as /bin/bash or /bin/bash2.
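A quick way to check which version /bin/bash provides (a sketch; $BASH_VERSION is a variable set by bash itself) :

```shell
# Print the major version number of the bash installed as /bin/bash;
# it should be 2 or higher for opcontrol to work.
/bin/bash -c 'echo "${BASH_VERSION%%.*}"'
```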

OProfile GUI

The use of the GUI to start the profiler requires the Qt 2 library. Qt 3 should also work.

ELF

Probably not too strenuous a requirement, but older a.out binaries/libraries are not supported.

K&R coding style

OK, so it's not really a requirement, but I wish it was...

3. Internet resources

Web page

There is a web page (which you may be reading now) at http://oprofile.sf.net/.

Download

You can download a source tarball or get anonymous CVS at the sourceforge page, http://sf.net/projects/oprofile/.

Mailing list

There is a low-traffic OProfile-specific mailing list, details at http://sf.net/mail/?group_id=16191.

Bug tracker

There is a bug tracker for OProfile at SourceForge, http://sf.net/tracker/?group_id=16191&atid=116191.

IRC channel

Several OProfile developers and users sometimes hang out on channel #oprofile on the freenode network.

4. Installation

First you need to build OProfile and install it. ./configure, make, make install is often all you need, but note these arguments to ./configure :

--with-linux

Use this option to specify the location of the kernel source tree you wish to compile against. The kernel module is built against this source and will only work with a running kernel built from the same source with similar options, so it is important you specify this option if you need to.

--with-kernel-support

Use this option with 2.5 and above kernels to indicate the kernel provides the OProfile device driver.

--with-qt-dir/includes/libraries

Specify the location of Qt headers and libraries. It defaults to searching in $QTDIR if these are not specified.

--enable-abi

Activate code within the OProfile sample collection daemon oprofiled which records information about the binary format of sample files in /var/lib/oprofile/abi, to permit their transport between hosts using the op_import utility. See op_import. This option is primarily intended for embedded systems or remote analysis of production machines; if you will be performing all sample analysis on the same machine as you are profiling, it is safe to omit this option.

--enable-gcov

Activate GCC compile-time options if available to record information in the build directory to determine which sections of the C and C++ code are executed. This option is for developers and it is safe to omit this option from normal builds.

For 2.4 kernels, you'll need a configured kernel source tree for the current kernel in order to build the module. It is also recommended that, on uniprocessor machines, you enable local APIC / IO_APIC support in your kernel (this is automatically enabled for SMP kernels).

On machines with power management, such as laptops, power management must be turned off when using OProfile with 2.4 kernels: the power management software in the BIOS cannot handle the non-maskable interrupts (NMIs) used by OProfile for data collection. If you use the NMI watchdog, be aware that the watchdog is disabled when profiling starts, and not re-enabled until the OProfile module is removed (or, in 2.5, when OProfile is not running).

If you compile OProfile for a 2.2 kernel you must be root to compile the module. If you are using 2.5 kernels or higher, you do not need kernel source, as long as the OProfile driver is enabled; additionally, you should not need to disable power management.

Please note that you must save or have available the vmlinux file generated during a kernel compile, as OProfile needs it (you can use --no-vmlinux, but this will prevent kernel profiling).

5. Uninstalling OProfile

You must have the source tree available to uninstall OProfile; a make uninstall will remove all installed files except your configuration file in the directory ~/.oprofile.

Chapter 2. Overview

Table of Contents

1. Getting started
2. Tools summary

1. Getting started

Before you can use OProfile, you must set it up. The minimum setup required for this is to tell OProfile where the vmlinux file corresponding to the running kernel is, for example :

opcontrol --vmlinux=/boot/vmlinux-`uname -r`

If you don't want to profile the kernel itself, you can tell OProfile you don't have a vmlinux file :

opcontrol --no-vmlinux

Now we are ready to start the daemon (oprofiled) which collects the profile data :

opcontrol --start

When we want to stop profiling, we can do so with :

opcontrol --shutdown

Note that unlike gprof, no instrumentation (-pg and -a options to gcc) is necessary.

Periodically (or on opcontrol --shutdown or opcontrol --dump) the profile data is written out into the /var/lib/oprofile/samples directory. These profile files cover shared libraries, applications, the kernel (vmlinux), and kernel modules. You can get summaries of this data in a number of ways at any time. To get a summary of data across the entire system for all of these profiles, you can do :

op_time

Or to get a more detailed summary, for a particular image, you can do something like :

oprofpp -l /boot/vmlinux-`uname -r`

There are also a number of other ways of presenting the data, as described later in this manual. Note that OProfile will choose a default profiling setup for you. However, there are a number of options you can pass to opcontrol if you need to change something, also detailed later.

2. Tools summary

This section gives a brief description of the available OProfile utilities and their purpose.

op_help

This utility lists the available events and short descriptions.

opcontrol

Used for controlling the OProfile data collection, discussed in Chapter 3, Controlling the profiler.

oprofpp

This is the main tool for retrieving useful profile data, described in Section 1, “oprofpp usage”.

op_time

This utility is useful for examining the relative profile values for all images on the system to determine the applications with the largest impact on system performance, described in Section 2, “op_time: overall view of all system binaries”.

op_to_source

This utility can be used to produce annotated source, assembly or mixed source/assembly. Source level annotation is available only if the application was compiled with debugging symbols. See Section 3, “op_to_source: outputting annotated source”.

op_merge

This utility is useful for merging sample files which belong to the same application, especially when you profile with separate samples for shared libraries. See Section 4, “op_merge: merging samples files”.

op_import

This utility converts sample database files from a foreign binary format (abi) to the native format. This is useful only when moving sample files between hosts, for analysis on platforms other than the one used for collection. The abi format of the file to be imported is described in a text file located in /var/lib/oprofile/abi, if the --enable-abi configure-time option was enabled. Furthermore, the op_import tool is not built unless --enable-abi is given. See --enable-abi.

Chapter 3. Controlling the profiler

1. Using opcontrol

In this section we describe the configuration and control of the profiling system with opcontrol in more depth. The opcontrol script has a default setup, but you can alter this with the options given below. In particular, if your hardware supports performance counters, you can configure them. There are a number of counters (for example, counter 0 and counter 1 on the Pentium III). Each of these counters can be programmed with an event to count, such as cache misses or MMX operations. The event chosen for each counter is reflected in the profile data collected by OProfile: functions and binaries at the top of the profiles reflect that most of the chosen events happened within that code.

Additionally, each counter has a "count" value: this corresponds to how detailed the profile is. The lower the value, the more frequently profile samples are taken. A counter can choose to sample only kernel code, user-space code, or both (both is the default). Finally, some events have a "unit mask" - this is a value that further restricts the types of event that are counted. The event types and unit masks for your CPU are listed by opcontrol --list-events.

The opcontrol script provides the following actions :

--init

Loads the OProfile module if required and makes the OProfile driver interface available.

--setup

Followed by a list of arguments for profiling setup; the arguments are saved in /root/.oprofile/daemonrc. Giving this option is not necessary; you can just directly pass one of the setup options, e.g. opcontrol --no-vmlinux.

--start-daemon

Start the oprofile daemon without starting actual profiling. The profiling can then be started using --start. This is useful for avoiding measuring the cost of daemon startup, as --start is a simple write to a file in oprofilefs. Not available in 2.2/2.4 kernels.

--start

Start data collection with either arguments provided by --setup or information saved in /root/.oprofile/daemonrc. Specifying the additional --verbose option makes the daemon generate lots of debug data whilst it is running.

--dump

Force a flush of the collected profiling data to the daemon.

--stop

Stop data collection (this separate step is not possible with 2.2 or 2.4 kernels).

--shutdown

Stop data collection and kill the daemon.

--reset

Clears out data from current session, but leaves saved sessions.

--save=session_name

Save data from current session to session_name.

--deinit

Shut down the daemon, and unload the OProfile module and oprofilefs.

--list-events

List event types and unit masks.

--help

Generate usage messages.

There are a number of possible settings, of which only --vmlinux (or --no-vmlinux) is required. These settings are stored in ~/.oprofile/daemonrc.

--buffer-size=num

Number of samples in kernel buffer.

--ctrN-event=[none,name]

Set counter N to measure symbolic event name, or "none" to disable this counter. The event names are listed by --list-events.

--ctrN-count=val

Number of events between samples for counter N.

--ctrN-unit-mask=val

Set unit mask for counter N (e.g. --ctr0-unit-mask=0xf). The possible unit mask values are listed by --list-events.

--ctrN-kernel=[0|1]

Whether to count kernel events for counter N.

--rtc-value=val

Set RTC counter value (see Section 3.2, “OProfile in RTC mode”).

--pid-filter=pid

Only profile the given process PID (only available for 2.4 version). Set to "none" to re-enable profiling of all PIDs.

--pgrp-filter=pgrp

Only profile the given process tty group (only available for 2.4 version). Set to "none" to re-enable profiling of all PGRPs.

--separate=[none,library,kernel,all]

By default, every profile is stored in a single file. Thus, for example, samples in the C library are all credited to the /lib/libc.so profile. However, you can choose to create separate sample files by specifying one of the options below.

none: No profile separation (default)
library: Create per-application profiles for libraries
kernel: Create per-application profiles for the kernel and kernel modules
all: Both of the above options

Note that --separate=kernel also turns on --separate=library. When using --separate=kernel, samples in hardware interrupts, soft-irqs, or other asynchronous kernel contexts are credited to the task currently running. This means you will see seemingly nonsense profiles such as /bin/bash showing samples for the PPP modules, etc.

On 2.2/2.4 only kernel threads already started when profiling begins are correctly profiled; newly started kernel thread samples are credited to the vmlinux (kernel) profile. On 2.5 there is no kernel thread profiling, all these samples are credited to the vmlinux profile. The -k option to op_time or oprofpp will show these per-application profiles.

--vmlinux=file

vmlinux kernel image.

--no-vmlinux

Use this when you don't have a kernel vmlinux file, and you don't want to profile the kernel. Note that overall profiling time through op_time always counts kernel samples.

1.1. Examples

1.1.1. Intel performance counter setup

Here, we have a Pentium III running at 800MHz, and we want to look at where data memory references are happening most, and also get results for CPU time.

# opcontrol --ctr0-event=CPU_CLK_UNHALTED --ctr0-count=400000
# opcontrol --ctr1-event=DATA_MEM_REFS --ctr1-count=10000
# opcontrol --vmlinux=/boot/2.5.66/vmlinux
# opcontrol --start

1.1.2. RTC mode

Here, we have an Intel laptop without support for performance counters, running on 2.4 kernels.

# op_help -r
CPU with RTC device
# opcontrol --vmlinux=/boot/2.5.66/vmlinux --rtc-value=1024
# opcontrol --start

1.1.3. Starting the daemon separately

If we're running 2.5 kernels, we can use --start-daemon to avoid the profiler startup affecting results.

# opcontrol --vmlinux=/boot/2.5.66/vmlinux
# opcontrol --start-daemon
# my_favourite_benchmark --init
# opcontrol --start ; my_favourite_benchmark --run ; opcontrol --stop

1.1.4. Separate profiles for libraries and the kernel

Here, we want to see a profile of the OProfile daemon itself, including when it was running inside the kernel driver, and its use of shared libraries.

# opcontrol --separate=kernel --vmlinux=/boot/2.5.66/vmlinux
# opcontrol --start
# my_favourite_stress_test --run
# oprofpp -kl -p /lib/modules/2.5.66/kernel /usr/local/bin/oprofiled

1.1.5. Profiling sessions

It can often be useful to split up profiling data into several different time periods. For example, you may want to collect data on an application's startup separately from the normal runtime data. You can use the simple command opcontrol --save to do this. For example :

# opcontrol --save=blah

will create a sub-directory in /var/lib/oprofile/samples containing the samples up to that point (the current session's sample files are moved into this directory). You can then pass this name as, for example, a parameter to op_time to only get data up to the point you named the session. If you do not want to save a session, you can do rm -rf /var/lib/oprofile/samples/blah or, for the current session, opcontrol --reset.
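The effect of --save can be pictured with a throwaway simulation using plain mkdir/mv in a temporary directory (the file name app.sample below is a stand-in for illustration, not OProfile's actual sample file naming scheme) :

```shell
# Simulate: saving a session moves the current sample files into a
# named sub-directory, after which the current session starts empty.
samples=$(mktemp -d)            # stand-in for /var/lib/oprofile/samples
touch "$samples/app.sample"     # pretend this is a current sample file
mkdir "$samples/blah"           # opcontrol --save=blah creates this
mv "$samples"/*.sample "$samples/blah/"
ls "$samples/blah"              # the saved session now holds app.sample
rm -rf "$samples"
```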

2. Using oprof_start

The oprof_start application provides a convenient way to start the profiler. Note that oprof_start is just a wrapper around the opcontrol script, so it does not provide more services than the script itself.

After oprof_start is started you can select the event type for each counter; the sampling rate and other related parameters are explained in Section 1, “Using opcontrol”. The "Configuration" section allows you to set general parameters such as the buffer size, kernel filename etc. The counter setup interface should be self-explanatory; Section 3.1, “Hardware performance counters” and related links contain information on using unit masks.

A status line shows the current status of the profiler: how long it has been running, and the average number of interrupts received per second and the total, over all processors. Note that quitting oprof_start does not stop the profiler.

Your configuration is saved in the same file as opcontrol uses; that is, ~/.oprofile/daemonrc. In addition, the per-event parameters are saved in ~/.oprofile/oprof_start_event.

3. Configuration details

3.1. Hardware performance counters

Note

Your CPU type may not include the requisite support for hardware performance counters, in which case you must use OProfile in RTC mode in 2.4 (see Section 3.2, “OProfile in RTC mode”), or timer mode in 2.5 (see Section 3.3, “OProfile in timer interrupt mode”). You do not really need to read this section unless you are interested in using events other than the default event chosen by OProfile.

The hardware performance counters are detailed in the Intel IA-32 Architecture Manual, Volume 3, available from http://developer.intel.com/. The AMD Athlon/Duron implementation is detailed in http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf. These processors are capable of delivering an interrupt when a counter overflows. This is the basic mechanism on which OProfile is based. The delivery mode is NMI, so blocking interrupts in the kernel does not prevent profiling. When the interrupt handler is called, the current PC value and the current task are recorded into the profiling structure. This allows the overflow event to be attached to a specific assembly instruction in a binary image. The daemon receives this data from the kernel, and writes it to the sample files.

If we use an event such as CPU_CLK_UNHALTED or INST_RETIRED (GLOBAL_POWER_EVENTS or INSTR_RETIRED, respectively, on the Pentium 4), we can use the overflow counts as an estimate of actual time spent in each part of code. Alternatively we can profile interesting data such as the cache behaviour of routines with the other available counters.

However there are several caveats. First, there are those issues listed in the Intel manual. There is a delay between the counter overflow and the interrupt delivery that can skew results on a small scale - this means you cannot rely on the profiles being perfectly accurate at the instruction level. If you are using an "event-mode" counter such as the cache counters, a count registered against an instruction doesn't mean that instruction is responsible for the event; it only implies that the counter overflowed in the dynamic vicinity of that instruction, to within a few instructions. Further details on this problem can be found in Chapter 5, Interpreting profiling results and also in the Digital paper "ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors".

Each counter has several configuration parameters. First, there is the unit mask: this simply further specifies what to count. Second, there is the counter value, discussed below. Third, there is a parameter controlling whether to increment counts whilst in kernel space, user space, or both. You can configure these separately for each counter.

The counter value is set with the --ctrX-count option, where X is the logical counter number (which differs between architectures - opcontrol --help tells you the maximum counter number). Using multiple counters is useful for profiling several aspects of the same running program. After each overflow event, the counter will be re-initialized such that another overflow will occur after this many events have been counted. Picking a good value for this parameter is, unfortunately, somewhat of a black art. It is of course dependent on the event you have chosen. Specifying too large a value will mean not enough interrupts are generated to give a realistic profile (though this problem can be ameliorated by profiling for longer). Specifying too small a value can lead to excessive performance overhead.

3.2. OProfile in RTC mode

Note

This section applies to 2.4 kernels only.

Some CPU types do not provide the needed hardware support to use the hardware performance counters. This includes some laptops, classic Pentiums, and other CPU types not yet supported by OProfile (such as Cyrix). On these machines, OProfile falls back to using the real-time clock interrupt to collect samples. This interrupt is also used by the rtc module: you cannot have both the OProfile and rtc modules loaded, nor can you have RTC support compiled into the kernel.

RTC mode is less capable than the hardware counters mode; in particular, it is unable to profile sections of the kernel where interrupts are disabled. There is just one available event, "RTC interrupts", and its value corresponds to the number of interrupts generated per second (that is, a higher number means a better profiling resolution, and higher overhead). The current implementation of the real-time clock supports only power-of-two sampling rates from 2 to 4096 per second. Other values within this range are rounded to the nearest power of two.
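The rounding rule can be sketched as follows (this is an illustration of the rule just stated, not OProfile's actual implementation) :

```shell
# Round a requested RTC sampling rate to the nearest supported
# power of two in the range 2..4096 interrupts per second.
nearest_pow2() {
    want=$1; p=2; best=2
    while [ "$p" -le 4096 ]; do
        # distance from the request to this candidate and to the best so far
        d1=$(( want > p ? want - p : p - want ))
        d2=$(( want > best ? want - best : best - want ))
        if [ "$d1" -lt "$d2" ]; then best=$p; fi
        p=$(( p * 2 ))
    done
    echo "$best"
}
nearest_pow2 300   # prints 256 (closer to 300 than 512 is)
```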

Setting the value from the GUI should be straightforward. On the command line, you need to specify the --rtc-value option to opcontrol, e.g. :

opcontrol --rtc-value=256

3.3. OProfile in timer interrupt mode

Note

This section applies to 2.5 kernels only.

In 2.5 kernels on CPUs without OProfile support for the hardware performance counters, the driver falls back to using the timer interrupt for profiling. Like the RTC mode in 2.4 kernels, this is not able to profile code that has interrupts disabled. Note that there are no configuration parameters for setting this, unlike the RTC and hardware performance counter setup.

You can force use of the timer interrupt by using the timer=1 module parameter (or oprofile.timer=1 on the boot command line if OProfile is built-in).

3.4. Pentium 4 support

The Pentium 4 / Xeon performance counters are organized around 3 types of model specific registers (MSRs): 45 event selection control registers (ESCRs), 18 counter configuration control registers (CCCRs) and 18 counters. ESCRs describe a particular set of events which are to be recorded, and CCCRs bind ESCRs to counters and configure their operation. Unfortunately the relationship between these registers is quite complex; they cannot all be used with one another at any time. There is, however, a subset of 8 counters, 8 ESCRs, and 8 CCCRs which can be used independently of one another, so OProfile only accesses those registers, treating them as a bank of 8 "normal" counters, similar to those in the P6 or Athlon families of CPU.

There is currently no support for Precision Event-Based Sampling (PEBS), nor any advanced uses of the Debug Store (DS). Current support is limited to the conservative extension of OProfile's existing interrupt-based model described above. Performance monitoring hardware on Pentium 4 / Xeon processors with Hyperthreading enabled (multiple logical processors on a single die) is not supported in 2.4 kernels (you can use OProfile if you disable hyper-threading, though).

3.5. Intel Itanium 2 support

The Itanium 2 performance monitoring unit (PMU) organizes the counters as four pairs of performance event monitoring registers. Each pair is composed of a Performance Monitoring Configuration (PMC) register and a Performance Monitoring Data (PMD) register. The PMC selects the performance event being monitored and the PMD determines the sampling interval. The PMU triggers sampling with maskable interrupts; thus, samples will not occur in sections of the IA64 kernel where interrupts are disabled.

None of the advanced features of the Itanium 2 performance monitoring unit, such as opcode matching, address range matching, or precise event sampling, are supported by this version of OProfile. The Itanium 2 support only maps OProfile's existing interrupt-based model to the PMU hardware.

3.6. Dangerous counter settings

OProfile is a low-level profiler which allows continuous profiling with low overhead. However, if too low a count reset value is set for a counter, the system can become overloaded with counter interrupts and appear to be frozen. Whilst some validation is done, it is not foolproof.

Note

This can happen as follows: When the profiler count reaches zero an NMI handler is called which stores the sample values in an internal buffer, then resets the counter to its original value. If the count is very low, a pending NMI can be sent before the NMI handler has completed. Due to the priority of the NMI, the local APIC delivers the pending interrupt immediately after completion of the previous interrupt handler, and control never returns to other parts of the system. In this way the system seems to be frozen.

If this happens, it will be impossible to bring the system back to a workable state. There is no way to provide real security against this happening, other than making sure to use a reasonable value for the counter reset. For example, setting CPU_CLK_UNHALTED event type with a ridiculously low reset count (e.g. 500) is likely to freeze the system.

In short : don't try a foolish sample count value. Unfortunately the definition of a foolish value is really dependent on the event type - if ever in doubt, e-mail the OProfile mailing list.

4. Other features

4.1. pid/pgrp filter

There are situations where you are only interested in the profiling results of a particular running process, or process tty group. You can set the pid/pgrp values via the --pid-filter and --pgrp-filter options to opcontrol, which will make the daemon ignore samples for processes that don't match the filter. These options are not available in 2.5 kernels.

4.2. Unloading the kernel module

Note

This section applies to 2.4 kernels only; in 2.5, OProfile can be unloaded safely.

The kernel module can be unloaded, but is designed to take very little memory when profiling is not underway. There is no need to unload the module between profiler runs.

lsmod and similar utilities will still show the module's use count as -1. However, this is not to be relied on - the module will become unloadable a short time after stopping profiling.

Note that by default, module unloading is disabled on SMP systems. This is because of a small chance that a module unload race could crash the kernel. As the race window is very small, you can re-enable module unloading by specifying the "allow_unload" parameter to the module :

modprobe oprofile allow_unload=1

This option can be DANGEROUS and should only be used on non-production systems.

Chapter 4. Obtaining results

OK, so the profiler has been running, but it's not much use unless we can get some data out. Fairly often, OProfile does a little too good a job of keeping overhead low, and no data reaches the profiler. This can happen on lightly-loaded machines. Remember you can force a dump at any time with :

opcontrol --dump

Remember to do this before complaining there is no profiling data ! Now that we've got some data, it has to be processed. That's the job of oprofpp, op_time, or op_to_source. These work on sample files in the /var/lib/oprofile/samples/ directory, along with the binary files being profiled, to produce human-readable data. Note that if the binary file changes after the sample file was created, you won't be able to get useful data out. This situation is detected for you. Several instances of a binary are merged into one sample file. By default, all samples from a dynamically linked library are merged into one sample file as well.

A different scenario happens when re-starting profiling with different parameters: the old sample files from previous sessions don't get deleted (allowing you to build profiles over many distinct profiling sessions). If the last session is determined to be out of date due to the use of different profiling parameters, all the sample files are backed up in a sub-directory named session-#nr. If during profiling the daemon detects a change to a binary image, and a sample file belonging to this binary exists, the sample file is silently deleted. So if you change a binary during profiling, it is your responsibility to save the binary image and the sample files if you need them.

1. oprofpp usage

The oprofpp utility can be used in three major modes: list symbol mode, detailed symbol mode, or gprof mode. The first gives sorted histogram output of sample counts against functions as shown in the walkthrough. The second can show individual sample counts against instructions inside a function, useful for detailed profiling, whilst the third mode is handy if you're used to gprof-style output. Note that only flat gprof profiles are supported, however.

Some interesting options of the post-processor:

--samples-file filename, -f filename

The sample file to use. By default, the current sample file for the given binary is used; this option can be used to examine older sample files.

--image-file filename, -i filename

The binary image (shared library, kernel vmlinux, or application) to produce data for.

--demangle, -d

Demangle C++ symbol names.

--smart-demangle, -D

Demangle GNU C++ symbol names, then apply a set of regular expressions to simplify STL library demangled names. For example, foo(std::list<std::string, std::allocator<std::string>> &) becomes foo(list<string> &).
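The actual regular expressions are internal to oprofpp, but the flavor of the rewriting can be sketched with a couple of sed substitutions (an illustrative approximation only, not oprofpp's real rule set):

```shell
# Approximate the STL name simplification: strip the std:: prefix,
# then drop the defaulted allocator template argument.
echo 'foo(std::list<std::string, std::allocator<std::string>> &)' \
  | sed -e 's/std:://g' -e 's/, allocator<[^>]*>//g'
# prints: foo(list<string> &)
```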

--counter nr

Which counter (0 to N) to extract information for. N depends on your CPU type: 1 for P6-generation CPUs, 3 for Athlon-based CPUs, 7 for Pentium 4 / Xeon CPUs.

--list-symbols, -l

List a histogram of sample counts against symbols. Each line shows the function name, its starting address, the relative percentage of hits across that image, and the absolute number of samples in this function.

--list-symbol name, -s name

Provide a detailed listing for the specified symbol name. This shows each sampled address within the function, along with the number of samples at that address.

--dump-gprof-file filename, -g filename

Dump output to the specified file in gprof format. If you specify gmon.out, you can then call gprof -p <binary>.

--list-all-symbols-details, -L

Provide a detailed listing for all symbols. Each line shows number of samples at the given address for all counters.

--output-linenr-info, -o

Show the function and line number for all samples. This requires that the image was compiled with debug symbols (-g), and is usable only with --list-all-symbols-details, --list-symbol and --list-symbols.

--exclude-symbol symbol[,symbol], -e symbol[,symbol]

Comma-separated list of symbols to ignore. This can be useful to ignore the leading contributors to the sample histogram, as the percentage values are re-calculated.

--show-shared-libs, -k

Show the details of the shared library and kernel profiles specific to this application. This option is useful only if you have profiled with the --separate option, and it has no effect if you also specify --dump-gprof-file on the oprofpp command line.

--output-format vsSpPqQnlLiIh, -t vsSpPqQnlLiIh

Specify the output and formatting, as given in this table:

v VMA (symbol address)
s Sample count
S Cumulative sample count
p Sample percentage relative to image
P Cumulative sample percentage relative to image
q Sample percentage relative to all images
Q Cumulative sample percentage relative to all images
n Name of symbol
l Source filename and line number
L Short source filename and line number
i Image name
I Short image name
d Detailed sample results
h Print a columns header

This option is not relevant to --dump-gprof-file.

--session session-name

Specify the session name you want to use. The session name can be an absolute path where samples reside, or a session name relative to the sample files base directory. If you specify a samples filename with an absolute path this option is ignored.

--path path_list, -p path_list

Specify an alternate list of pathnames to locate image files. Use if the sample filenames do not match the image filenames, e.g. modules loaded at boot time through a RAM disk. The path list is a comma-separated list of directories. Each directory is scanned recursively.

2. op_time: overall view of all system binaries

You can get an overall summary of the relative profiles of all binaries using op_time. This utility displays the relative number of samples for each profiled application, sorted in decreasing order of sample count. Running op_time [options] [image_name[,image_name]] produces output such as:

/lib/libc-2.1.2.so 19 32.7586%
/usr/X11R6/bin/XF86_SVGA 13 22.4138%
...
/usr/bin/grep 1 1.72414%
/usr/X11R6/lib/libXt.so.6.0 1 1.72414%

If you don't specify any image names on the command line, op_time reports information about all profiled binary images. You can use shell wildcards, for example: op_time /usr/bin/*

Options allowed are:

--counter nr, -c nr

Use counter nr when sorting by sample count.

--show-shared-libs, -k

Show the details for the shared libraries/kernel for each application. This option is useful only if you have profiled with the --separate option.

--list-symbols, -l

Show details for each symbol in each profiled file.

--demangle, -d

Demangle GNU C++ symbol names.

--smart-demangle, -D

Demangle GNU C++ symbol names, then apply a set of regular expressions to simplify STL library demangled names. For example, foo(std::list<std::string, std::allocator<std::string>> &) becomes foo(list<string> &).

--show-image-name, -n

Show the image name when using --list-symbols.

--reverse, -r

Reverse the sort order.

--path path_list, -p path_list

Specify an alternate list of pathnames to locate image files. Use if the sample filenames do not match the image filenames, e.g. modules loaded at boot time through a RAM disk. The path list is a comma-separated list of directories. Each directory is scanned recursively.

--output-format vsSpPnlLiIeEh, -t vsSpPnlLiIeEh

Specify the output and formatting, as given in this table:

v VMA (symbol address)
s Sample count
S Cumulative sample count
p Sample percentage relative to image
P Cumulative sample percentage relative to image
n Name of symbol
l Source filename and line number
L Short source filename and line number
i Image name
I Short image name
e Application name
E Basename of application
d Detailed sample results
h Print a columns header

Options e and E have no effect unless you profile with --separate=library.

These options can only be used in conjunction with --list-symbols.

--exclude-symbol symbol[,symbol], -e symbol[,symbol]

Comma-separated list of symbols to ignore. This can be useful to ignore the leading contributors to the sample histogram, as the percentage values are re-calculated.

--session session-name

Specify the session name you want to use. The session name can be an absolute path where samples reside, or a session name relative to the sample files base directory.

3. op_to_source: outputting annotated source

op_to_source generates annotated source files or assembly listings, optionally mixed with source. If you want to see annotated source, the profiled application needs to have debug information, and the source must be available through this debug information. For GCC, you must use the -g option when compiling. If the binary doesn't contain sufficient debug information, you can still use op_to_source --assembly to get annotated assembly.

Note that for the reason explained in Section 3.1, “Hardware performance counters” the results can be inaccurate. The debug information itself can add other problems; for example, the line number for a symbol can be incorrect. Assembly instructions can be re-ordered and moved by the compiler, and this can lead to crediting source lines with samples not really "owned" by this line. Also see Chapter 5, Interpreting profiling results.

The options allowed are:

--assembly, -a

Output assembly code. Currently the assembly code is sorted in increasing order of the VMA address. The --sort-by-counter, --with-more-than-samples percent_nr and --until-more-than-samples percent_nr options can also be used with this option to provide filtering capabilities.

--demangle, -d

Demangle C++ symbol names.

--smart-demangle, -D

Demangle GNU C++ symbol names, then apply a set of regular expressions to simplify STL library demangled names. For example, foo(std::list<std::string, std::allocator<std::string>> &) becomes foo(list<string> &).

--source-dir dirname

This option is used in conjunction with --output-dir. You can use it to specify the base directory of the source which you wish to produce annotated output for. With this option, any source files outside the directory (for example, system header files) are ignored.

--output-dir dirname

Specify that you want to produce an annotated source tree, rather than getting all output to stdout. This creates a hierarchy of annotated source files, and is affected by the --source-dir, --output, and --no-output options.

--output patterns

Specify a set of comma-separated patterns for matching annotated source output filenames. If this option is present, a file is only output if it matches one of the given patterns (the patterns are applied to the filename and to each component of the containing directory names). For example:

--output '*.c,user.h'

--no-output patterns

Specify a set of comma-separated patterns for filtering annotated source output filenames. If this option is present, a file is only output if it does not match any of the given patterns (the patterns are applied to the filename and to each component of the containing directory names). For example:

--no-output 'boring.c,boring*.h'
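The semantics of these pattern lists can be sketched in plain shell: each comma-separated entry is a glob, and --output keeps a name if any glob matches it, while --no-output drops it. The matches helper below is hypothetical, purely to illustrate the matching rule:

```shell
# matches NAME PATTERNS: succeed if NAME matches any glob in the
# comma-separated PATTERNS list (hypothetical helper, for illustration).
matches() {
  name=$1
  patterns=$2
  set -f          # no pathname expansion while we split the list
  old_ifs=$IFS
  IFS=','
  for pat in $patterns; do
    case $name in
      $pat) IFS=$old_ifs; set +f; return 0 ;;
    esac
  done
  IFS=$old_ifs
  set +f
  return 1
}

matches user.h '*.c,user.h' && echo "user.h kept by --output '*.c,user.h'"
matches boring2.h 'boring.c,boring*.h' && echo "boring2.h dropped by --no-output"
```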

--source-with-assembly, -s

Output assembly code mixed with the source file. Implies --assembly.

--objdump-params, -o

Pass the given comma-separated additional parameters to objdump. Check the objdump man page to see what options objdump accepts; e.g. -o '--disassembler-options=intel' to get Intel assembly syntax instead of AT&T syntax. This option can be used only with --assembly or --source-with-assembly.

--sort-by-counter counter_nr, -c counter_nr

Sort by decreasing number of samples on counter_nr. For assembly output this option provides only filtering and not a sort order.

--with-more-than-samples percent_nr, -w percent_nr

Only output source files which contain at least percent_nr percent of the samples. Cannot be combined with --until-more-than-samples.

--until-more-than-samples percent_nr, -m percent_nr

Output source files until the cumulative number of samples in the output files reaches percent_nr percent of the total samples. Cannot be combined with --with-more-than-samples.

--samples-file filename, -f filename

Specify the sample file. At least one of --samples-file and --image-file must be specified.

--image-file filename, -i filename

Specify the image file.

--exclude-symbol symbol[,symbol], -e symbol[,symbol]

Comma-separated list of symbols to ignore. This can be useful to ignore the leading contributors to the sample histogram, as the percentage values are re-calculated.

--include-symbol symbol[,symbol], -y symbol[,symbol]

Comma-separated list of symbols to include. This can be useful to only see the leading contributors to the sample histogram, as the percentage values are re-calculated.

--session session-name

Specify the session name you want to use. The session name can be an absolute path where samples reside, or a session name relative to the sample files base directory. If you specify a samples filename with an absolute path this option is ignored.

4. op_merge: merging samples files

op_merge is used to merge sample files which belong to the same binary image. Its main purpose is to merge sample files created by profiling with the --separate option, so that you can create one sample file containing all the samples for a shared library: op_merge /usr/lib/ld-2.1.2.so will create a sample file named }usr}lib}ld-2.1.2.so, ready to use with oprofpp or the other post-profiling tools. Additionally, you can merge a subset of sample files into one sample file by specifying explicitly the sample files to merge. This allows you to use the post-profiling tools on shared libraries for a subset of applications.
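A note on that odd-looking filename: sample files in this version of OProfile encode the full image path by replacing each '/' with '}'. The mapping can be reproduced with tr (a sketch of the naming convention, not a tool invocation):

```shell
# Derive the sample file name for an image: every '/' becomes '}'.
echo /usr/lib/ld-2.1.2.so | tr '/' '}'
# prints: }usr}lib}ld-2.1.2.so
```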

The options allowed are:

--counter nr, -c nr

Use the given counter number to select the appropriate sample files.

Chapter 5. Interpreting profiling results

The standard caveats of profiling apply in interpreting the results from OProfile: profile realistic situations, profile different scenarios, profile for as long a time as possible, avoid system-specific artifacts, and don't trust the profile data too much. Also bear in mind the comments on the performance counters above - you cannot rely on totally accurate instruction-level profiling. However, in almost all circumstances the data can be useful. Ideally a utility such as Intel's VTune would be available to allow careful instruction-level analysis; go hassle Intel for this, not me ;)

1. Profiling interrupt latency

This is an example of how the latency of delivery of profiling interrupts can impact the reliability of the profiling data. This is pretty much a worst-case-scenario example: these problems are fairly rare.

double fun(double a, double b, double c)
{
 double result = 0;
 for (int i = 0 ; i < 10000; ++i) {
  result += a;
  result *= b;
  result /= c;
 }
 return result;
}

Here the last instruction of the loop is very costly, and you would expect the results to reflect that - but (showing only the instructions inside the loop):

$ op_to_source -a -w 10 ./a.out

     88 15.38% : 8048337:       fadd   %st(3),%st
     48 8.391% : 8048339:       fmul   %st(2),%st
     68 11.88% : 804833b:       fdiv   %st(1),%st
    368 64.33% : 804833d:       inc    %eax
               : 804833e:       cmp    $0x270f,%eax
               : 8048343:       jle    8048337

The problem comes from the x86 hardware: when the counter overflows, the IRQ is asserted, but the hardware has features that can delay the NMI interrupt. x86 hardware is synchronous (i.e. it cannot interrupt in the middle of an instruction); there is also a latency between the IRQ being asserted and delivered, and the multiple execution units and the out-of-order model of modern x86 CPUs also cause problems. This is the same function, with source annotation:

$ op_to_source -w 10 ./a.out

               :double fun(double a, double b, double c)
               :{ /* _Z3funddd total:     572 100.0% */
               : double result = 0;
    368 64.33% : for (int i = 0 ; i < 10000; ++i) {
     88 15.38% :  result += a;
     48 8.391% :  result *= b;
     68 11.88% :  result /= c;
               : }
               : return result;
               :}

The conclusion: don't trust samples at the end of a loop, particularly if the last instruction generated by the compiler is costly. This case can also occur with branches. Always bear in mind that samples can be delayed by a few cycles from their real position. This is a hardware problem, and OProfile can do nothing about it.

2. Kernel profiling

2.1. Interrupt masking

OProfile uses non-maskable interrupts (NMI) on P6-generation, Pentium 4, Athlon and Duron processors. These interrupts can occur even in sections of the Linux kernel where interrupts are disabled, allowing collection of samples in virtually all executable code. The RTC, timer interrupt mode, and Itanium 2 collection mechanisms use maskable interrupts. Thus, these data collection mechanisms have "sample shadows", or blind spots: regions where no samples will be collected. Typically, the samples will be attributed to the code immediately after interrupts are re-enabled.

2.2. Idle time

Your kernel is likely to support halting the processor when a CPU is idle. As the typical hardware events like CPU_CLK_UNHALTED do not count when the CPU is halted, the kernel profile will not reflect the actual amount of time spent idle. You can change this behaviour by booting with the idle=poll option, which uses a different idle routine. This will appear as poll_idle() in your kernel profile.
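For example, with a LILO-style boot configuration the option can be added to the kernel command line via an append entry (an illustrative fragment; the image path and label are placeholders, and other boot loaders have equivalent mechanisms):

```
# /etc/lilo.conf fragment (paths and label are hypothetical)
image=/boot/vmlinuz
        label=linux-poll
        append="idle=poll"
```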

2.3. Exiting tasks

The internal implementation of the 2.5 OProfile code means that tasks that are inside the kernel do_exit() routine cannot be profiled.

2.4. Profiling kernel modules

OProfile profiles kernel modules by default. However, there are a couple of problems you may have when trying to get results. First, you may have booted via an initrd; this means that the actual path for the module binaries cannot be determined automatically. To get around this, you can use the -p option to the profiling tools to specify where to look for the kernel modules.

In 2.5, the information on where kernel module binaries are located has been removed. This means OProfile needs guiding with the -p option to find your modules. Normally, you can just use your standard module top-level directory for this. Note that due to this problem, OProfile cannot check that the modification times match; it is your responsibility to make sure you do not modify a binary after a profile has been created.

If you have run insmod or modprobe to insert a module from a particular directory, it is important that you specify this directory with the -p option first, so that it overrides an older module binary that might exist in other directories you've specified with -p. It is up to you to make sure that these values are correct: 2.5 kernels simply do not provide enough information for OProfile to work this out.

3. Inaccuracies in annotated source

3.1. Side effects of optimizations

The compiler can introduce some pitfalls in the annotated source output. The optimizer can move pieces of code in such a manner that two lines of code are interleaved (instruction scheduling). Also, debug info generated by the compiler can show strange behavior. This is especially true for complex expressions, e.g. inside an if statement:

	if (a && ...
	    b && ...
	    c)

Here the problem comes from the position of the line numbers. The available debug info does not give enough detail for the if condition, so all samples are accumulated at the position of the closing brace of the expression. Using op_to_source -a can help to show the real samples at the assembly level.

3.2. Prologues and epilogues

The compiler generally needs to generate "glue" code across function calls, dependent on the particular function call conventions used. Additionally other things need to happen, like stack pointer adjustment for the local variables; this code is known as the function prologue. Similar code is needed at function return, and is known as the function epilogue. This will show up in annotations as samples at the very start and end of a function, where there is no apparent executable code in the source.

3.3. Inlined functions

You may see that a function is credited with a certain number of samples, but the listing does not add up to the correct total. To pick a real example:

               :internal_sk_buff_alloc_security(struct sk_buff *skb)
 353 2.342%    :{ /* internal_sk_buff_alloc_security total: 1882 12.48% */
               :
               :        sk_buff_security_t *sksec;
  15 0.0995%   :        int rc = 0;
               :
  10 0.06633%  :        sksec = skb->lsm_security;
 468 3.104%    :        if (sksec && sksec->magic == DSI_MAGIC) {
               :                goto out;
               :        }
               :
               :        sksec = (sk_buff_security_t *) get_sk_buff_memory(skb);
   3 0.0199%   :        if (!sksec) {
  38 0.2521%   :                rc = -ENOMEM;
               :                goto out;
  10 0.06633%  :        }
               :        memset(sksec, 0, sizeof (sk_buff_security_t));
  44 0.2919%   :        sksec->magic = DSI_MAGIC;
  32 0.2123%   :        sksec->skb = skb;
  45 0.2985%   :        sksec->sid = DSI_SID_NORMAL;
  31 0.2056%   :        skb->lsm_security = sksec;
               :
               :      out:
               :
 146 0.9685%   :        return rc;
               :
  98 0.6501%   :}

Here, the function is credited with 1,882 samples, but the annotations below do not account for this. This is usually because of inline functions - the compiler marks such code with debug entries for the inline function definition, and this is where op_to_source annotates such samples. In the case above, memset is the most likely candidate for this problem. Examining the mixed source/assembly output can help identify such results.

Furthermore, for some languages the compiler can implicitly generate functions, such as default copy constructors. Such functions are labelled by the compiler as having a line number of 0, which means the source annotation can be confusing.

3.4. Inaccuracy in line number information

Depending on your compiler you can fall into the following problem:

struct big_object { int a[500]; };

int main()
{
	big_object a, b;
	for (int i = 0 ; i != 1000 * 1000; ++i)
		b = a;
	return 0;
}

Compiled with gcc 3.0.4 the annotated source is clearly inaccurate:

               :int main()
               :{  /* main total: 7871 100% */
               :        big_object a, b;
               :        for (int i = 0 ; i != 1000 * 1000; ++i)
               :                b = a;
 7871 100%     :        return 0;
               :}

The problem here is distinct from the IRQ latency problem: the debug line number information is simply not precise enough. Again, looking at the output of op_to_source -as can help.

               :int main()
               :{
               :        big_object a, b;
               :        for (int i = 0 ; i != 1000 * 1000; ++i)
               : 80484c0:       push   %ebp
               : 80484c1:       mov    %esp,%ebp
               : 80484c3:       sub    $0xfac,%esp
               : 80484c9:       push   %edi
               : 80484ca:       push   %esi
               : 80484cb:       push   %ebx
               :                b = a;
               : 80484cc:       lea    0xfffff060(%ebp),%edx
               : 80484d2:       lea    0xfffff830(%ebp),%eax
               : 80484d8:       mov    $0xf423f,%ebx
               : 80484dd:       lea    0x0(%esi),%esi
               :        return 0;
    3 0.03811% : 80484e0:       mov    %edx,%edi
               : 80484e2:       mov    %eax,%esi
    1 0.0127%  : 80484e4:       cld
    8 0.1016%  : 80484e5:       mov    $0x1f4,%ecx
 7850 99.73%   : 80484ea:       repz movsl %ds:(%esi),%es:(%edi)
    9 0.1143%  : 80484ec:       dec    %ebx
               : 80484ed:       jns    80484e0
               : 80484ef:       xor    %eax,%eax
               : 80484f1:       pop    %ebx
               : 80484f2:       pop    %esi
               : 80484f3:       pop    %edi
               : 80484f4:       leave
               : 80484f5:       ret

So here it's clear that the copying is correctly credited with all of the samples, but the line number information is misplaced. objdump -dS exposes the same problem. Note that maintaining accurate debug information for compilers when optimizing is difficult, so this problem is not surprising. The accuracy of the debug information also depends on the binutils version used; some BFD library versions contain a work-around for known problems of gcc, and some others do not. This is unfortunate, but we must live with it, since profiling is pointless when you disable optimisation (which would give better debugging entries).

4. Assembly functions

Often the assembler cannot generate debug information automatically. This means that you cannot get a source report unless you manually define the necessary debug information; read your assembler documentation for how you might do that. The only debug information currently needed by OProfile is the line-number/filename-to-VMA association. When profiling assembly without debug information you can always get a report for symbols, and optionally for VMAs, through oprofpp -l or oprofpp -L, but this works only for symbols with the right attributes. For gas you can get this with:

.globl foo
	.type	foo,@function

whilst for nasm you must use:

	  GLOBAL foo:function		; [1]

Note that OProfile does not need the global attribute, only the function attribute.

5. Other discrepancies

Another cause of apparent problems is the hidden cost of instructions. A very common example is two memory reads, one that hits the L1 cache and one that must go to main memory: the second read is likely to accumulate more samples. There are many other causes of hidden instruction costs. A non-exhaustive list: mis-predicted branches, TLB misses, partial register stalls, partial register dependencies, memory mismatch stalls, re-executed µops. If you want to write programs at the assembly level, be sure to take a look at the Intel and AMD documentation at http://developer.intel.com/ and http://www.amd.com/products/cpg/athlon/techdocs/.

Chapter 6. Profiling overhead

One of the major design criteria for OProfile was low overhead. In many cases profiling is hardly noticeable in terms of overhead (I regularly leave it turned on all the time). It achieves this by judicious use of kernel-side data structures to reduce the collection overhead to a bare runtime minimum. There are several things that unfortunately complicate the issue, so there are cases where the overhead is noticeable.

The worst-case scenario is one with many short-lived processes, as seen in a kernel compile, for instance. Even in this worst case the overhead is low compared to other profilers; only very detailed profiling of these workloads has an overhead higher than 5%. Actual performance data is presented in CVS. In fact, most situations involve far fewer processes, leading to far better performance.

Some graphs of performance characteristics of OProfile are available on the website - see Section 3, “Internet resources”.

Chapter 7. Acknowledgments

Thanks to (in no particular order) : Arjan van de Ven, Rik van Riel, Juan Quintela, Philippe Elie, Phillipp Rumpf, Tigran Aivazian, Alex Brown, Alisdair Rawsthorne, Bob Montgomery, Ray Bryant, H.J. Lu, Jeff Esper, Will Cohen, Graydon Hoare, Cliff Woolley, Alex Tsariounov, Al Stone, Randolph Chung, Anton Blanchard, Richard Henderson, Andries Brouwer, Bryan Rittmeyer, Richard Reich (rreich@rdrtech.com), Dave Jones, Charles Filtness; and finally Pulp, for "Intro".