OProfile
Many processors include dedicated performance monitoring hardware. This hardware makes it possible to detect when certain events happen (such as the requested data not being in cache). The hardware normally takes the form of one or more counters that are incremented each time an event takes place. When the counter value overflows, an interrupt is generated, making it possible to control the amount of detail (and therefore, overhead) produced by performance monitoring.
OProfile uses this hardware (or a timer-based substitute in cases where performance monitoring hardware is not present) to collect samples of performance-related data each time a counter generates an interrupt. These samples are periodically written out to disk; later, the data they contain can be used to generate reports on system-level and application-level performance.
Be aware of the following limitations when using OProfile:
- Use of shared libraries: Samples for code in shared libraries are not attributed to the particular application unless the --separate=library option is used.
- Performance monitoring samples are inexact: When a performance monitoring register triggers a sample, the interrupt is not precise in the way that, for example, a divide-by-zero exception is. Due to the out-of-order execution of instructions by the processor, the sample may be recorded on a nearby instruction.
- opreport does not associate samples for inline functions properly: opreport uses a simple address range mechanism to determine which function an address is in. Inline function samples are not attributed to the inline function but rather to the function the inline function was inserted into.
- OProfile accumulates data from multiple runs: OProfile is a system-wide profiler and expects processes to start up and shut down multiple times. Thus, samples from multiple runs accumulate. Use the command opcontrol --reset to clear out the samples from previous runs.
- Hardware performance counters do not work on guest virtual machines: Because the hardware performance counters are not available on virtual systems, you need to use the timer mode. Enter the command opcontrol --deinit, and then execute modprobe oprofile timer=1 to enable the timer mode.
- Non-CPU-limited performance problems: OProfile is oriented to finding problems with CPU-limited processes. OProfile does not identify processes that are asleep because they are waiting on locks or for some other event to occur (for example, an I/O device to finish an operation).
Overview of Tools
OProfile Commands provides a brief overview of the most commonly used tools provided with the oprofile package.
| Command | Description |
|---|---|
| ophelp | Displays available events for the system’s processor along with a brief description of each. |
| opimport | Converts sample database files from a foreign binary format to the native format for the system. Only use this tool when analyzing a sample database from a different architecture. |
| opannotate | Creates annotated source for an executable if the application was compiled with debugging symbols. See Using opannotate for details. |
| opcontrol | Configures what data is collected. See Configuring OProfile Using Legacy Mode for details. |
| operf | Recommended tool to be used in place of opcontrol for profiling. See Using operf for details. For differences between operf and opcontrol see operf vs. opcontrol. |
| opreport | Retrieves profile data. See Using opreport for details. |
| oprofiled | Runs as a daemon to periodically write sample data to disk. |
operf vs. opcontrol
There are two mutually exclusive methods for collecting profiling data with OProfile. You can either use the newer and preferred operf or the opcontrol tool.
operf is the recommended mode for profiling. The operf tool uses the Linux Performance Events Subsystem and therefore does not require the oprofile kernel driver. It lets you target your profiling more precisely, at a single process or system-wide, and allows OProfile to coexist better with other tools that use the performance monitoring hardware on your system. Unlike opcontrol, it can be used without root privileges. However, system-wide profiling with the --system-wide option does require root authority.
With operf, there is no initial setup needed. You can invoke operf with command-line options to specify your profiling settings. After that, you can run the OProfile post-processing tools described in Analyzing the Data. See Using operf for further information.
Legacy mode consists of the opcontrol shell script, the oprofiled daemon, and several post-processing tools. The opcontrol command is used for configuring, starting, and stopping a profiling session. An OProfile kernel driver, usually built as a kernel module, is used for collecting samples, which are then recorded into sample files by oprofiled. You can use legacy mode only if you have root privileges. In certain cases, such as when you need to sample areas where interrupt requests (IRQs) are disabled, legacy mode is the better alternative.
Before OProfile can be run in legacy mode, it must be configured as shown in Configuring OProfile Using Legacy Mode. These settings are then applied when starting OProfile (Starting and Stopping OProfile Using Legacy Mode).
Using operf
operf is the recommended profiling mode that does not require initial setup before starting. All settings are specified as command-line options and there is no separate command to start the profiling process. To stop operf, press Ctrl+C. The typical operf command syntax looks as follows:
operf options range command args
Replace options with the desired command-line options to specify your profiling settings. The full set of options is described in the operf(1) manual page. Replace range with one of the following:
--system-wide - this setting allows for global profiling, see Using operf in System-wide Mode
--pid=PID - this is to profile a running application, where PID is the process ID of the process you want to profile.
With command and args, you can define a specific command or application to be profiled, and also the input arguments that this command or application requires. Either command, --pid or --system-wide is required, but these cannot be used simultaneously.
When you invoke operf on the command line without setting the range option, data is also collected for the child processes of the profiled command.
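The rule that exactly one of command, --pid, or --system-wide must be given can be sketched as a small wrapper. This is only an illustration: the function name operf_cmdline and the mode keywords are hypothetical, and the function prints the command line rather than running operf.

```shell
#!/bin/sh
# Hypothetical helper: print an operf command line for exactly one range.
# Usage: operf_cmdline system
#        operf_cmdline pid PID
#        operf_cmdline cmd COMMAND [ARGS...]
operf_cmdline() {
    mode=$1
    [ $# -gt 0 ] && shift
    case $mode in
        system)
            [ $# -eq 0 ] || { echo "system mode takes no further arguments" >&2; return 1; }
            echo "operf --system-wide" ;;
        pid)
            [ $# -eq 1 ] || { echo "pid mode needs exactly one PID" >&2; return 1; }
            echo "operf --pid=$1" ;;
        cmd)
            [ $# -ge 1 ] || { echo "cmd mode needs a command" >&2; return 1; }
            echo "operf $*" ;;
        *)
            echo "usage: operf_cmdline system|pid PID|cmd COMMAND [ARGS...]" >&2
            return 1 ;;
    esac
}

# ./my_app and its arguments are placeholders for the program to profile.
operf_cmdline cmd ./my_app --input data.txt
```

Because each mode is a separate branch, the three ranges cannot be combined by accident.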
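|
Using operf in System-wide Mode
To run operf --system-wide, you need root authority. When you finish profiling, you can stop operf with Ctrl+C.
If you run operf --system-wide as a background process (with &), stop it in a controlled manner so that the collected profile data can be processed. For this purpose, use:
kill -SIGINT operf-PID
When running operf --system-wide, it is recommended that your current working directory is /root or a subdirectory of /root so that sample data files are not stored in locations accessible by regular users. |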
Specifying the Kernel
To monitor the kernel, execute the following command:
operf --vmlinux=vmlinux_path
With this option, you can specify a path to a vmlinux file that matches the running kernel. Kernel samples will be attributed to this binary, allowing post-processing tools to attribute samples to the appropriate kernel symbols. If this option is not specified, all kernel samples will be attributed to a pseudo binary named "no-vmlinux".
Setting Events to Monitor
Most processors contain counters, which are used by OProfile to monitor specific events. As shown in OProfile Processors and Counters, the number of counters available depends on the processor.
The events for each counter can be configured via the command line or with a graphical interface. For more information on the graphical interface, see Graphical Interface. If the counter cannot be set to a specific event, an error message is displayed.
|
Older Processors and operf
Some older processor models are not supported by the underlying Linux Performance Events Subsystem of the kernel and are therefore not supported by operf. If you receive the message "Your kernel's Performance Events Subsystem does not support your processor type" when attempting to use operf, try profiling with opcontrol to see whether your processor type is supported by OProfile’s legacy mode. |
|
Using operf on Virtual Systems
Since hardware performance counters are not available on guest virtual machines, you have to enable timer mode to use operf on virtual systems. To do so, type as root:
opcontrol --deinit
modprobe oprofile timer=1
|
To set the event for each configurable counter via the command line, use:
operf --events=event1,event2…
Here, pass a comma-separated list of event specifications for profiling. Each event specification is a colon-separated list of attributes in the following form:
event-name:sample-rate:unit-mask:kernel:user
Event Specifications summarizes these options. The last three values are optional; if you omit them, they are set to their default values. Note that certain events require a unit mask.
| Specification | Description |
|---|---|
| event-name | The exact symbolic event name taken from ophelp |
| sample-rate | The number of events to wait before sampling again. The smaller the count, the more frequent the samples. For events that do not happen frequently, a lower count may be needed to capture a statistically significant number of event instances. On the other hand, sampling too frequently can overload the system. By default, OProfile uses a time-based event set, which creates a sample every 100,000 clock cycles per processor. |
| unit-mask | Unit masks, which further define the event, are listed in ophelp. You can insert either a hexadecimal value, beginning with "0x", or a string that matches the first word of the unit mask description in ophelp. Definition by name is valid only for unit masks having "extra:" parameters, as shown by the output of ophelp. This type of unit mask cannot be defined with a hexadecimal value. Note that on certain architectures, there can be multiple unit masks with the same hexadecimal value. In that case they have to be specified by their names only. |
| kernel | Specifies whether to profile kernel code (insert 0 or 1 (default)) |
| user | Specifies whether to profile user-space code (insert 0 or 1 (default)) |
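As a minimal sketch of how the colon-separated attributes fit together, the following helper joins them into one event specification. The function name event_spec is hypothetical, and the unit mask 0x00 in the usage line is a placeholder rather than a value taken from ophelp.

```shell
#!/bin/sh
# Hypothetical helper: join the given attributes into one event specification.
# Trailing optional attributes (unit-mask, kernel, user) can simply be omitted.
event_spec() {
    if [ $# -lt 2 ] || [ $# -gt 5 ]; then
        echo "usage: event_spec NAME SAMPLE_RATE [UNIT_MASK [KERNEL [USER]]]" >&2
        return 1
    fi
    spec=$1
    shift
    for attr in "$@"; do
        spec="$spec:$attr"
    done
    echo "$spec"
}

# Sample user-space code only: kernel=0, user=1.
# The unit mask 0x00 is a placeholder; check ophelp for the masks your event accepts.
echo "operf --events=$(event_spec CPU_CLK_UNHALTED 100000 0x00 0 1)"
```

The helper only formats text; whether a given event, rate, or unit mask is valid still depends on the processor and must be checked against ophelp.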
The events available vary depending on the processor type. When no event specification is given, the default event for the running processor type will be used for profiling. See Default Events for a list of these default events. To determine the events available for profiling, use the ophelp command.
ophelp
Categorization of Samples
The --separate-thread option categorizes samples by thread group ID (tgid) and thread ID (tid). This is useful for seeing per-thread samples in multi-threaded applications. When used in conjunction with the --system-wide option, --separate-thread is also useful for seeing per-process (i.e., per-thread group) samples for the case where multiple processes are executing the same program during a profiling run.
The --separate-cpu option categorizes samples by CPU.
Configuring OProfile Using Legacy Mode
Before OProfile can be run in legacy mode, it must be configured. At a minimum, selecting to monitor the kernel (or selecting not to monitor the kernel) is required. The following sections describe how to use the opcontrol utility to configure OProfile. As the opcontrol commands are executed, the setup options are saved to the /root/.oprofile/daemonrc file.
Specifying the Kernel
First, configure whether OProfile should monitor the kernel. This is the only configuration option that is required before starting OProfile. All others are optional.
To monitor the kernel, execute the following command as root:
~]# opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux
|
Install the debuginfo package
In order to monitor the kernel, the debuginfo package which contains the uncompressed kernel must be installed. |
To configure OProfile not to monitor the kernel, execute the following command as root:
~]# opcontrol --setup --no-vmlinux
This command also loads the oprofile kernel module, if it is not already loaded, and creates the /dev/oprofile/ directory, if it does not already exist. See Understanding the /dev/oprofile/ directory for details about this directory.
Setting whether samples should be collected within the kernel only changes what data is collected, not how or where the collected data is stored. To generate different sample files for the kernel and application libraries, see Separating Kernel and User-space Profiles.
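The choice between the two setup commands above can be sketched as follows. This is an illustrative helper under the assumption that the vmlinux file lives at the debuginfo path shown earlier; the function name kernel_setup_cmd is hypothetical, and it prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print the opcontrol setup command for a given vmlinux path.
# Falls back to --no-vmlinux when the file is not readable (for example, when the
# kernel debuginfo package is not installed).
kernel_setup_cmd() {
    vmlinux=$1
    if [ -r "$vmlinux" ]; then
        echo "opcontrol --setup --vmlinux=$vmlinux"
    else
        echo "opcontrol --setup --no-vmlinux"
    fi
}

kernel_setup_cmd "/usr/lib/debug/lib/modules/$(uname -r)/vmlinux"
```

Run the printed command as root to apply the setting.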
Setting Events to Monitor
Most processors contain counters, which are used by OProfile to monitor specific events. As shown in OProfile Processors and Counters, the number of counters available depends on the processor.
| Processor | cpu_type | Number of Counters |
|---|---|---|
| AMD64 | x86-64/hammer | 4 |
| AMD Family 10h | x86-64/family10 | 4 |
| AMD Family 11h | x86-64/family11 | 4 |
| AMD Family 12h | x86-64/family12 | 4 |
| AMD Family 14h | x86-64/family14 | 4 |
| AMD Family 15h | x86-64/family15 | 6 |
| Applied Micro X-Gene | arm/armv8-xgene | 4 |
| ARM Cortex A53 | arm/armv8-ca53 | 6 |
| ARM Cortex A57 | arm/armv8-ca57 | 6 |
| IBM eServer System i and IBM eServer System | timer | 1 |
| IBM POWER4 | ppc64/power4 | 8 |
| IBM POWER5 | ppc64/power5 | 6 |
| IBM PowerPC 970 | ppc64/970 | 8 |
| IBM PowerPC 970MP | ppc64/970MP | 8 |
| IBM POWER5+ | ppc64/power5+ | 6 |
| IBM POWER5++ | ppc64/power5++ | 6 |
| IBM POWER6 | ppc64/power6 | 6 |
| IBM POWER7 | ppc64/power7 | 6 |
| IBM POWER8 | ppc64/power8 | 8 |
| IBM S/390 and IBM System | timer | 1 |
| Intel Core i7 | i386/core_i7 | 4 |
| Intel Nehalem microarchitecture | i386/nehalem | 4 |
| Intel Westmere microarchitecture | i386/westmere | 4 |
| Intel Haswell microarchitecture (non-hyper-threaded) | i386/haswell | 8 |
| Intel Haswell microarchitecture (hyper-threaded) | i386/haswell-ht | 4 |
| Intel Ivy Bridge microarchitecture (non-hyper-threaded) | i386/ivybridge | 8 |
| Intel Ivy Bridge microarchitecture (hyper-threaded) | i386/ivybridge-ht | 4 |
| Intel Sandy Bridge microarchitecture (non-hyper-threaded) | i386/sandybridge | 8 |
| Intel Sandy Bridge microarchitecture (hyper-threaded) | i386/sandybridge-ht | 4 |
| Intel Broadwell microarchitecture (non-hyper-threaded) | i386/broadwell | 8 |
| Intel Broadwell microarchitecture (hyper-threaded) | i386/broadwell-ht | 4 |
| Intel Silvermont microarchitecture | i386/silvermont | 2 |
| TIMER_INT | timer | 1 |
Use OProfile Processors and Counters to determine the number of events that can be monitored simultaneously for your CPU type. If the processor does not have supported performance monitoring hardware, the timer is used as the processor type.
If timer is used, events cannot be set for any processor because the hardware does not have support for hardware performance counters. Instead, the timer interrupt is used for profiling.
If timer is not used as the processor type, the events monitored can be changed, and counter 0 for the processor is set to a time-based event by default. If more than one counter exists on the processor, the counters other than 0 are not set to an event by default. The default events monitored are shown in Default Events.
| Processor | Default Event for Counter | Description |
|---|---|---|
| AMD Athlon and AMD64 | CPU_CLK_UNHALTED | The processor’s clock is not halted |
| AMD Family 10h, AMD Family 11h, AMD Family 12h | CPU_CLK_UNHALTED | The processor’s clock is not halted |
| AMD Family 14h, AMD Family 15h | CPU_CLK_UNHALTED | The processor’s clock is not halted |
| Applied Micro X-Gene | CPU_CYCLES | Processor Cycles |
| ARM Cortex A53 | CPU_CYCLES | Processor Cycles |
| ARM Cortex A57 | CPU_CYCLES | Processor Cycles |
| IBM POWER4 | CYCLES | Processor Cycles |
| IBM POWER5 | CYCLES | Processor Cycles |
| IBM POWER8 | CYCLES | Processor Cycles |
| IBM PowerPC 970 | CYCLES | Processor Cycles |
| Intel Core i7 | CPU_CLK_UNHALTED | The processor’s clock is not halted |
| Intel Nehalem microarchitecture | CPU_CLK_UNHALTED | The processor’s clock is not halted |
| Intel Pentium 4 (hyper-threaded and non-hyper-threaded) | GLOBAL_POWER_EVENTS | The time during which the processor is not stopped |
| Intel Westmere microarchitecture | CPU_CLK_UNHALTED | The processor’s clock is not halted |
| Intel Broadwell microarchitecture | CPU_CLK_UNHALTED | The processor’s clock is not halted |
| Intel Silvermont microarchitecture | CPU_CLK_UNHALTED | The processor’s clock is not halted |
| TIMER_INT | (none) | Sample for each timer interrupt |
The number of events that can be monitored at one time is determined by the number of counters for the processor. However, it is not a one-to-one correlation; on some processors, certain events must be mapped to specific counters. To determine the number of counters available, execute the following command:
~]# ls -d /dev/oprofile/[0-9]*
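To turn that listing into a number, the numbered directories can be counted. The following is a small sketch; the function name count_counters is hypothetical, and it assumes that each counter appears as a numbered directory, as the ls pattern above implies.

```shell
#!/bin/sh
# Hypothetical helper: count the numbered counter directories under an
# oprofilefs mount point (default /dev/oprofile). Prints 0 if none exist.
count_counters() {
    dir=${1:-/dev/oprofile}
    n=0
    for entry in "$dir"/[0-9]*; do
        # An unmatched glob stays literal and fails the -d test.
        [ -d "$entry" ] && n=$((n + 1))
    done
    echo "$n"
}

count_counters
```

A result of 0 typically means that the oprofile module is not loaded or that timer mode is in use.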
The events available vary depending on the processor type. To determine the events available for profiling, execute the following command as root (the list is specific to the system’s processor type):
~]# ophelp
|
Make sure that OProfile is configured
Unless OProfile is properly configured, ophelp fails with the following error message:
Unable to open cpu_type file for reading
Make sure you have done opcontrol --init
cpu_type 'unset' is not valid
you should upgrade oprofile or force the use of timer mode
To configure OProfile, follow the instructions in Configuring OProfile Using Legacy Mode. |
The events for each counter can be configured via the command line or with a graphical interface. For more information on the graphical interface, see Graphical Interface. If the counter cannot be set to a specific event, an error message is displayed.
To set the event for each configurable counter via the command line, use opcontrol:
~]# opcontrol --event=event-name:sample-rate
Replace event-name with the exact name of the event from ophelp, and replace sample-rate with the number of events between samples.
Sampling Rate
By default, a time-based event set is selected. It creates a sample every 100,000 clock cycles per processor. If the timer interrupt is used, the timer is set to the respective rate and is not user-settable. If the cpu_type is not timer, each event can have a sampling rate set for it. The sampling rate is the number of events between each sample snapshot.
When setting the event for the counter, a sample rate can also be specified:
~]# opcontrol --event=event-name:sample-rate
Replace sample-rate with the number of events to wait before sampling again. The smaller the count, the more frequent the samples. For events that do not happen frequently, a lower count may be needed to capture the event instances.
|
Sampling too frequently can overload the system
Be extremely careful when setting sampling rates. Sampling too frequently can overload the system, causing the system to appear frozen or causing the system to actually freeze. |
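As a rough worked example of how the sample rate translates into load, the number of samples per second for a cycle-based event is approximately the clock frequency divided by the sample rate. The 2 GHz clock below is an assumed value for illustration, and the function name samples_per_second is hypothetical.

```shell
#!/bin/sh
# Estimate samples per second per processor for a cycle-based event:
#   samples/s = clock frequency (Hz) / sample-rate (events between samples)
samples_per_second() {
    clock_hz=$1
    sample_rate=$2
    if [ "$sample_rate" -le 0 ]; then
        echo "sample rate must be positive" >&2
        return 1
    fi
    echo $((clock_hz / sample_rate))
}

# Assumed 2 GHz clock with the default rate of 100,000 cycles.
samples_per_second 2000000000 100000    # prints 20000
# The same clock with a rate of 1,000 cycles gives 100 times as many samples.
samples_per_second 2000000000 1000      # prints 2000000
```

Each sample costs an interrupt, so lowering the sample rate by a factor of 100 raises the interrupt load by roughly the same factor.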
Unit Masks
Some user performance monitoring events may also require unit masks to further define the event.
Unit masks for each event are listed with the ophelp command. The values for each unit mask are listed in hexadecimal format. To specify more than one unit mask, combine the hexadecimal values using a bitwise OR operation.
~]# opcontrol --event=event-name:sample-rate:unit-mask
Note that on certain architectures, there can be multiple unit masks with the same hexadecimal value. In that case they have to be specified by their names only.
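Combining hexadecimal unit masks with a bitwise OR can be done directly in shell arithmetic. The sketch below uses a hypothetical function name, combine_unit_masks, and made-up mask values; real values come from ophelp.

```shell
#!/bin/sh
# Hypothetical helper: OR together hexadecimal unit-mask values and print
# the result in the 0x-prefixed form used on the command line.
combine_unit_masks() {
    mask=0
    for value in "$@"; do
        mask=$(( mask | $value ))
    done
    printf '0x%02x\n' "$mask"
}

# 0x01 and 0x04 are illustrative values, not masks of a specific event.
combine_unit_masks 0x01 0x04    # prints 0x05
```

The printed value can then be used as the unit-mask field of the event specification.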
Separating Kernel and User-space Profiles
By default, kernel mode and user mode information is gathered for each event. To configure OProfile to ignore events in kernel mode for a specific counter, execute the following command:
~]# opcontrol --event=event-name:sample-rate:unit-mask:0
Execute the following command to start profiling kernel mode for the counter again:
~]# opcontrol --event=event-name:sample-rate:unit-mask:1
To configure OProfile to ignore events in user mode for a specific counter, execute the following command:
~]# opcontrol --event=event-name:sample-rate:unit-mask:1:0
Execute the following command to start profiling user mode for the counter again:
~]# opcontrol --event=event-name:sample-rate:unit-mask:1:1
When the OProfile daemon writes the profile data to sample files, it can separate the kernel and library profile data into separate sample files. To configure how the daemon writes to sample files, execute the following command as root:
~]# opcontrol --separate=choice
The choice argument can be one of the following:
- none: Do not separate the profiles (default).
- library: Generate per-application profiles for libraries.
- kernel: Generate per-application profiles for the kernel and kernel modules.
- all: Generate per-application profiles for libraries and per-application profiles for the kernel and kernel modules.
If --separate=library is used, the sample file name includes the name of the executable as well as the name of the library.
|
Restart the OProfile profiler
These configuration changes will take effect when the OProfile profiler is restarted. |
Starting and Stopping OProfile Using Legacy Mode
To start monitoring the system with OProfile, execute the following command as root:
~]# opcontrol --start
Output similar to the following is displayed:
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
The settings in /root/.oprofile/daemonrc are used.
The OProfile daemon, oprofiled, is started; it periodically writes the sample data to the /var/lib/oprofile/samples/ directory. The log file for the daemon is located at /var/lib/oprofile/oprofiled.log.
|
Disable the nmi_watchdog registers
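On a Fedora 29 system, the nmi_watchdog registers with the perf subsystem. Due to this, the perf subsystem grabs control of the performance counter registers at boot time, blocking OProfile from working.
To resolve this, either boot with the nmi_watchdog=0 kernel parameter set, or run the following command as root to disable nmi_watchdog at run time:
~]# echo 0 > /proc/sys/kernel/nmi_watchdog
To re-enable nmi_watchdog, use the following command as root:
~]# echo 1 > /proc/sys/kernel/nmi_watchdog |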
To stop the profiler, execute the following command as root:
~]# opcontrol --shutdown
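Put together, one legacy-mode profiling run might follow the sequence sketched below. The wrapper name legacy_session_plan is hypothetical, and it only prints the commands so that it can be tried without root privileges; the opcontrol options are the ones described in this chapter.

```shell
#!/bin/sh
# Hypothetical wrapper: print the opcontrol commands for one legacy-mode session.
# It only echoes the commands, so it is safe to run without root privileges.
legacy_session_plan() {
    workload=$*
    echo "opcontrol --reset"       # discard samples from earlier runs
    echo "opcontrol --start"       # start oprofiled and begin sampling
    echo "$workload"               # the program being profiled
    echo "opcontrol --shutdown"    # stop the profiler
}

# ./my_app and its arguments are placeholders for the program to profile.
legacy_session_plan ./my_app --input data.txt
```

Resetting first keeps samples from earlier runs out of the new profile, since OProfile otherwise accumulates them.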
Saving Data in Legacy Mode
Sometimes it is useful to save samples at a specific time. For example, when profiling an executable, it may be useful to gather different samples based on different input data sets. If the number of events to be monitored exceeds the number of counters available for the processor, multiple runs of OProfile can be used to collect data, saving the sample data to different files each time.
To save the current set of sample files, execute the following command, replacing name with a unique descriptive name for the current session:
~]# opcontrol --save=name
The command creates the directory /var/lib/oprofile/samples/name/ and the current sample files are copied to it.
To specify the session directory to hold the sample data, use the --session-dir option. If not specified, the data is saved in the oprofile_data/ directory on the current path.
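When there are more events than counters, the events have to be divided across several runs. The following sketch splits a list of events into groups of at most a given size, one group per line. The function name batch_events and the event names are hypothetical, and the sketch ignores the fact that on some processors certain events must be mapped to specific counters.

```shell
#!/bin/sh
# Hypothetical helper: split a list of events into groups no larger than the
# number of available counters, one space-separated group per line.
batch_events() {
    counters=$1
    shift
    if [ "$counters" -lt 1 ]; then
        echo "number of counters must be at least 1" >&2
        return 1
    fi
    group=
    count=0
    for event in "$@"; do
        if [ "$count" -eq "$counters" ]; then
            echo "$group"
            group=
            count=0
        fi
        group=${group:+$group }$event
        count=$((count + 1))
    done
    [ -n "$group" ] && echo "$group"
    return 0
}

# EVENT_A..EVENT_C are placeholders; take real event names from ophelp.
batch_events 2 EVENT_A:100000 EVENT_B:100000 EVENT_C:100000
```

Each output line corresponds to one profiling run, whose samples can then be stored under a separate name with opcontrol --save.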
Analyzing the Data
The same OProfile post-processing tools are used whether you collect your profile with operf or opcontrol in legacy mode.
By default, operf stores the profiling data in the current_dir/oprofile_data/ directory. You can change to a different location with the --session-dir option. The usual post-profiling analysis tools such as opreport and opannotate can be used to generate profile reports. These tools search for samples in current_dir/oprofile_data/ first. If this directory does not exist, the analysis tools use the standard session directory of /var/lib/oprofile/. Statistics, such as total samples received and lost samples, are written to the session_dir/samples/operf.log file.
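The lookup order described above can be imitated in a few lines, which is handy for checking which samples a report would be based on. This is a sketch only: the function name find_session_dir is hypothetical, and the real tools perform this search internally.

```shell
#!/bin/sh
# Hypothetical helper: mimic the lookup order described above. Prefer
# oprofile_data/ under the given directory (default: the current directory),
# otherwise fall back to the standard session directory.
find_session_dir() {
    base=${1:-.}
    if [ -d "$base/oprofile_data" ]; then
        echo "$base/oprofile_data"
    else
        echo "/var/lib/oprofile"
    fi
}

find_session_dir
```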
When using legacy mode, the OProfile daemon, oprofiled, periodically collects the samples and writes them to the /var/lib/oprofile/samples/ directory. Before reading the data, make sure all data has been written to th