Wednesday, November 16, 2011

Tracing an MPI application

While preparing labs for the Parallel Systems course at Polytechnique, I had the chance to refresh my MPI knowledge. One of the activities consists of comparing communication patterns in an MPI application. For this purpose, the MPI program can be linked with a special library that records all communications. Jumpshot is the graphical viewer for the resulting trace. I wanted to compare the traces obtained from MPE with a kernel trace.

First, some technical details on how to instrument an MPI program, because it's not so obvious after all and documentation is sparse. I tried MPICH and OpenMPI as MPI implementations. I would recommend MPICH because it seems to provide better error messages, which are very valuable for understanding failures. In addition, on Fedora, the MPE library is built into the MPICH package, which is not the case for OpenMPI. On Ubuntu, the MPE library must be downloaded and installed by hand [1] because there is no binary package for it. As for the autoconf setup of the project, I reused the m4 macro cs_mpi.m4 from Code Saturne [2] with the following options to configure. By linking with llmpe, the library will output files suitable for Jumpshot.

CC=mpicc LDFLAGS="-L/usr/local/lib" LIBS="-llmpe -lmpe" ./configure

The experiment consists of a small MPI application that exchanges grid boundaries in a 1D Cartesian decomposition using blocking send and receive. Four processes are launched on the same host. The first figure shows a zoomed view of the execution and of the messages exchanged between processes over time. The second figure shows a statistics view for an interval, in which the time spent in each state is summed. At a glance, one can see the relative share of time taken by communication, which is very handy for understanding performance.
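For reference, here is a minimal sketch of this kind of boundary exchange (not the actual lab code; it uses MPI_Sendrecv, which is blocking but sidesteps the send/receive ordering pitfalls of plain MPI_Send/MPI_Recv):

#include <mpi.h>
#include <stdio.h>

#define N 1024  /* local grid size, arbitrary for this sketch */

int main(int argc, char **argv)
{
    int rank, size, left, right;
    double grid[N + 2];  /* local cells plus two ghost cells */
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 1D, non-periodic Cartesian decomposition. */
    int dims[1] = { size }, periods[1] = { 0 };
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_shift(cart, 0, 1, &left, &right);  /* MPI_PROC_NULL at the edges */

    for (int i = 0; i < N + 2; i++)
        grid[i] = rank;

    /* Exchange boundary cells with both neighbours (blocking). */
    MPI_Sendrecv(&grid[N], 1, MPI_DOUBLE, right, 0,
                 &grid[0], 1, MPI_DOUBLE, left, 0, cart, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&grid[1], 1, MPI_DOUBLE, left, 1,
                 &grid[N + 1], 1, MPI_DOUBLE, right, 1, cart, MPI_STATUS_IGNORE);

    printf("rank %d: left ghost=%g right ghost=%g\n", rank, grid[0], grid[N + 1]);
    MPI_Finalize();
    return 0;
}

When the program is built with the configure line above, the llmpe wrapper intercepts these MPI calls through the profiling interface and logs them to a file that Jumpshot can open.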

The same application has been traced with the LTTng kernel tracer. While the program is computing and exchanging data with its neighbors, no system calls are performed to transfer the data. In fact, between processes on the same host, MPICH uses shared memory for communication, obviously to achieve better performance. This example shows that it may not be possible to recover the relationship between processes that communicate through shared memory. One thing to note is that numerous polls on file descriptors are performed, probably for synchronization. Maybe the relationship can be recovered from there.

Jumpshot Timeline view (green: recv, blue: send)

Jumpshot Histogram view (green: recv, blue: send)

LTTv control flow view (green: userspace)

Would it be possible to trace shared memory accesses directly? A recent kernel feature allows tracing memory-mapped I/O (MMIOTRACE). It works by clearing the flag that marks a page as present in memory, so that a page fault is triggered each time the page is accessed. The page fault handler is instrumented to record those events. This technique may be a hint on how to recover the shared memory process relationship. More on this later.

In conclusion, MPI instrumentation with MPE helps to understand the runtime performance behaviour of an MPI application. The same results can't be obtained with the current kernel tracing infrastructure found in LTTng.

[1] ftp://ftp.mcs.anl.gov/pub/mpi/mpe/mpe2.tar.gz
[2] http://research.edf.com/research-and-the-scientific-community/software/code-saturne/introduction-code-saturne-80058.html

Tuesday, July 26, 2011

Common Trace Format Bitfields

The Common Trace Format (CTF) will be the default trace format of the upcoming LTTng 2.0 release. This new trace format is able to write data at the bit level to optimize the space usage of each field. Such sub-byte packing of bits is called a bitfield.

The endianness of a bitfield follows the native endianness of the architecture where the trace is obtained. Big endian (BE) and little endian (LE) are supported. But how are the bits written into the trace? According to the documentation in include/babeltrace/bitfield.h:

The inside of a bitfield is from high bits to low bits.

On big endian, bytes are placed from most significant to least significant.
Also, consecutive bitfields are placed from higher to lower bits.

On little endian, bytes are placed from the least significant to the most
significant. Also, consecutive bitfields are placed from lower bits to higher
bits.

To make sure I understood correctly, I created an example based on the current implementation in babeltrace. The structure has three fields; each field is followed by its length in bits.

struct x {
    unsigned int a : 4;
    unsigned int b : 32;
    unsigned int c : 4;
};

The total size of the structure is 40 bits, or 5 bytes. For the example, the following values are assigned to the fields so they can be spotted easily:

x.a = 0x0000000E;
x.b = 0xA1B2C3D4;
x.c = 0x0000000F;

Then, writing those values into a byte array in big and little endian gives the results in the tables below. The binary and hexadecimal representations of consecutive bytes are displayed.


Big endian
0b11101010 0xEA
0b00011011 0x1B
0b00101100 0x2C
0b00111101 0x3D
0b01001111 0x4F
0b00000000 0x00
[...]

Little endian
0b01001110 0x4E
0b00111101 0x3D
0b00101100 0x2C
0b00011011 0x1B
0b11111010 0xFA
0b00000000 0x00
[...]

Big endian is simple because the bitfield is written sequentially. The first byte, 0xEA, starts with the four bits of the a field, followed by the first four bits of the b field. Notice that, as a consequence, all subsequent bytes of the b field are shifted. The last byte starts with the last four bits of the b field, and the c field fills the rest of it. All bytes that follow are untouched.
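To convince yourself of the big endian layout, here is a small standalone sketch (not the babeltrace implementation) that packs the three fields from high bits to low bits and prints the resulting bytes:

#include <stdio.h>
#include <stdint.h>

/* Pack 'len' bits of 'value' into 'buf', most significant bit first,
 * starting at bit position '*bitpos' (the big endian layout described above). */
static void pack_be(uint8_t *buf, unsigned *bitpos, uint64_t value, unsigned len)
{
    for (int i = (int) len - 1; i >= 0; i--) {
        unsigned bit = (value >> i) & 1;
        buf[*bitpos / 8] |= bit << (7 - (*bitpos % 8));
        (*bitpos)++;
    }
}

int main(void)
{
    uint8_t buf[6] = { 0 };
    unsigned pos = 0;

    pack_be(buf, &pos, 0xE, 4);          /* a */
    pack_be(buf, &pos, 0xA1B2C3D4, 32);  /* b */
    pack_be(buf, &pos, 0xF, 4);          /* c */

    for (int i = 0; i < 6; i++)
        printf("0x%02X ", buf[i]);       /* prints 0xEA 0x1B 0x2C 0x3D 0x4F 0x00 */
    printf("\n");
    return 0;
}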

As for little endian, the first byte ends with the a field. Then the b field is written, but its most significant bits end up in the lower half of the last byte. Notice that the bytes are written backwards, also shifted by four bits. The third field, c, is placed in the upper half of that last byte.

Happy hacking!

Thursday, June 16, 2011

Summary of Linux Symposium 2011

The Linux Symposium took place in Ottawa once again in 2011. I went there for a talk about recovering system metrics from a kernel trace. There were many interesting talks; here is a summary.

Hitoshi Mitake explained scalability issues for virtualization on multi-core architectures with real-time constraints. He mentioned that when one of the virtual CPUs is interrupted, another virtual CPU may try to enter a section protected by a spin lock, which decreases performance under load. An alternative scheduling of virtual CPUs is proposed to avoid such situations.

Sergey Blagodurov talked about scheduling on NUMA multicore architectures. I learned about the numactl system calls that can be used at the userspace level to move processes and memory between domains. He was developing scheduler algorithms to optimize performance on this kind of architecture. Since the Linux scheduler can migrate a process but not its related memory pages, performance decreases because inter-domain communication increases and is much slower. One of his goals was to increase memory locality for frequently used pages.

One talk was given by Jon C Masters about the Tilera architecture, which packs up to 100 cores on a single chip! This piece of hardware is undoubtedly very special, and I would love to get one on my desk! I wonder how the Linux kernel can scale up to that number of cores.

Trinity, a system call fuzzer, was presented by Dave Jones. It's basically a tool that performs system calls with randomized arguments, while still targeting edge cases. They found and fixed around 20 weird bugs this way. A code coverage report could be useful in the future.

What about redundant data on a disk? There was a talk about block device deduplication by Kuniyasu Suzaki. The principle is that two identical blocks should not be stored twice. Since it operates at the block level, mileage varies between file systems. The presenter mainly showed how much space can be recovered depending on block size and file system type. I'm still curious about the performance impact and CPU overhead of performing deduplication online.

The last talk I wanted to mention was given by Alexandre Lissy, about verification of the Linux kernel. Static checks like Coccinelle are available right now by running make coccicheck in your Linux source tree. While model checking the entire kernel would be great to get more guarantees about its behavior, doing so with a SAT model checker is another story, because of the size and complexity of the Linux source tree; it is quite an undertaking. Maybe an approach like SLAM from Microsoft, which validates interface usage, could be used for Linux drivers too?

This Linux Symposium was an opportunity to meet a lot of people involved in the kernel community and to share ideas. Here are a few links related to the conference.

Saturday, May 21, 2011

Performance analysis with kernel trace

In this post, I want to show some progress I have made toward providing performance analysis tools for developers and system administrators. The idea is to answer questions like:

  • What is the cause of latency in the system?
  • How many resources does a task require?
  • What is the relationship between processes?
  • What is the critical path of a task?
  • Which subsystem is the bottleneck of the system?

To answer such questions, a kernel trace is an excellent source of system performance data. It has low overhead and provides precise timing information about events occurring on the system. But getting an overview of the system behavior by looking at each individual event is a challenge. We need an analysis toolbox. In the rest of this post, four tools are presented with which you can experiment yourself. A summary example shows the integration of the various analyses made with flightbox, the prototype implementation.

Resource usage recovery

The basic metric needed to perform various other analyses is resource usage. CPU usage recovery has already been presented in a previous post. The same kind of analysis can be done with memory, disk and network usage.

In the case of CPU usage, kernel scheduling events indicate which process is driving the CPU. The CPU usage of a process is thus the sum of its scheduled time. To display it on a chart, CPU usage intervals are decomposed into fixed time ranges. An average usage is then easy to compute, and the precision can be increased by reducing the duration of the time interval. The technique is presented in the following figure.


This technique is known as geometric integration. The computation behaves somewhat like a low-pass filter, smoothing the raw CPU usage and making it easier to understand. The size of the resulting data is proportional to the trace duration times the desired precision times the number of processes, no matter the trace size. Consequently, the CPU usage of a large trace can be summarized with a small array of doubles.
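Here is a minimal sketch of that decomposition (hypothetical types, not the flightbox code): each scheduled interval is spread over fixed-size buckets, so each bucket ends up holding the average CPU usage for its time range.

#include <stdio.h>

/* A span during which one process was scheduled on a CPU, in seconds. */
struct sched_interval {
    double start;
    double end;
};

/* Spread each scheduled interval over fixed-size buckets. */
static void accumulate_usage(const struct sched_interval *iv, int n,
                             double bucket_len, double *buckets, int nbuckets)
{
    for (int i = 0; i < n; i++) {
        for (int b = 0; b < nbuckets; b++) {
            double b_start = b * bucket_len;
            double b_end = b_start + bucket_len;
            double lo = iv[i].start > b_start ? iv[i].start : b_start;
            double hi = iv[i].end < b_end ? iv[i].end : b_end;
            if (hi > lo)
                buckets[b] += (hi - lo) / bucket_len;  /* fraction of the bucket */
        }
    }
}

int main(void)
{
    struct sched_interval ivs[] = { { 0.02, 0.07 }, { 0.12, 0.30 } };
    double usage[4] = { 0 };  /* four buckets of 100 ms */

    accumulate_usage(ivs, 2, 0.1, usage, 4);
    for (int b = 0; b < 4; b++)
        printf("bucket %d: %.0f%%\n", b, usage[b] * 100);  /* 50, 80, 100, 0 */
    return 0;
}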

Blocking analysis

A process is blocked when a resource is not available and it can't make progress. For example, when a process performs a read on a socket while the data has not yet been received, the process blocks and the scheduler removes the waiting task from the run queue. Blocking can only occur in kernel mode. The main source of blocking is system calls, but it can also occur during a page fault for which the page is swapped out, for example. The following picture shows a typical example of blocking. Colors match the legend in LTTV: green is for running in user mode, blue is for running in a system call and red is for blocked on I/O. Finally, orange represents the interval during which the data is ready but the process is not yet scheduled.
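As an illustration of these states, here is a tiny sketch (the record layout is hypothetical, not the actual analysis code) that derives the red and orange durations from the relevant event timestamps:

#include <stdio.h>

/* One blocking episode, reconstructed from scheduler and wake-up events. */
struct blocking {
    double block_ts;   /* task leaves the run queue (e.g. read() with no data) */
    double wakeup_ts;  /* the data is ready and the task is woken up */
    double sched_ts;   /* the scheduler finally puts the task back on a CPU */
};

int main(void)
{
    struct blocking b = { 10.000, 10.101, 10.104 };  /* made-up timestamps */

    printf("blocked on I/O (red)             : %.3f s\n", b.wakeup_ts - b.block_ts);
    printf("ready, not yet scheduled (orange): %.3f s\n", b.sched_ts - b.wakeup_ts);
    return 0;
}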


Not all blocking is necessarily bad. Waiting for user input, or a server waiting for a request to arrive, is not important. We are mostly interested in blocking on the critical path of an application, meaning blocking that induces processing delay from the user's point of view.

While a task is blocked, a subtask can perform some work. This subtask can also block; thus blocking among tasks is hierarchical.

Inter-process relationship


Under Linux, processes can cooperate through various inter-process communication (IPC) mechanisms, namely:

  • Socket
  • Pipe
  • Fifo
  • Mutex
  • Signal
  • Shared memory
  • File lock

Those mechanisms make it possible to share resources and to send messages and data between processes. Sockets are well instrumented in the Linux kernel, hence it's possible to recover the link between processes that use a socket to communicate. It works at the host level, but given traces from many hosts, it would be possible to recover links between distributed components. For example, we could recover the process relationships for an httpd client, as shown in the following figure.


Resource usage accounting


The idea behind resource usage accounting is to associate the resource usage of a server with the client that makes the request. This analysis is able to compute the cost of a request across distributed processes as if it were performed by a single thread. To perform this analysis, the resource usage, inter-process relationship and blocking analyses are required.

For example, say that an HTTP client performs a request on a web server. The resulting web page is returned to the client. While waiting for the web page, the HTTP client blocks. By attributing to the client the CPU usage of the httpd thread that processes the request while the client is blocked, the resource usage is effectively accounted to the right process. This is shown in the following figure.
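Here is a minimal sketch of that accounting step, assuming the client's blocked window and the server's CPU intervals have already been recovered (types and numbers are made up, not the flightbox API): the server's CPU time that overlaps the client's wait is charged to the client.

#include <stdio.h>

/* A time interval, in seconds. */
struct interval {
    double start;
    double end;
};

/* Sum the portion of the server's CPU intervals that overlaps the window
 * during which the client was blocked waiting for the reply. */
static double account_to_client(struct interval blocked,
                                const struct interval *server_cpu, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        double lo = server_cpu[i].start > blocked.start ? server_cpu[i].start : blocked.start;
        double hi = server_cpu[i].end < blocked.end ? server_cpu[i].end : blocked.end;
        if (hi > lo)
            total += hi - lo;
    }
    return total;
}

int main(void)
{
    struct interval blocked = { 1.000, 1.101 };  /* client waits ~101 ms */
    struct interval srv[] = { { 1.005, 1.050 }, { 1.055, 1.100 } };

    printf("accounted to client: %.3f s\n", account_to_client(blocked, srv, 2));
    return 0;
}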


Example


In the following example, we will use a special RPC client and server, srvhog and clihog, from workload-kit. The binary srvhog supports two commands, sleep and hog. Each operation takes 100 ms. The sleep command is implemented with a timer and doesn't consume CPU, while the hog command performs computation that lasts for the whole duration. A calibration procedure is performed when srvhog starts to ensure that the hog computation duration is close to 100 ms. Communication between client and server is done over a simple TCP socket.

The following picture shows the CPU usage of the traced system, a host with two cores, for the hog scenario. The overall CPU usage is shown in blue, while the CPU usage of srvhog is in red. The calibration phase of srvhog lasts about one second and is followed by a burst of about 100 ms.


The second analysis shows the blocking of clihog. It waits in the read system call for 101 ms, which corresponds to the computation time of the server before the reply is sent. The other blockings are very small and negligible.


By analyzing the socket operations of srvhog and clihog, the following graph can be computed. It shows that the client performs a connection to the server.


Finally, by computing resource accounting, the following report is obtained. It shows that the CPU usage of srvhog while processing the request should really be accounted to clihog. The accounted CPU usage is almost 90 ms, which is close to the expected 100 ms of waiting time. The exact CPU usage is closer to 100 ms; the error can be explained by the loss of precision when computing the average CPU usage. To increase precision, smaller CPU usage intervals can be used.

Report for task pid=2668 cmd=/usr/local/bin/clihog

File descriptor blocking summary
FD                                       N    Sum (ms)
------------------------------------------------------
132.207.224.50:9876                      2     101,225

CPU accounting
PID                              Self (ms)    Sub (ms)  Total (ms)
------------------------------------------------------------------
2668 clihog                          2,675      89,797      92,471
 \_ 2666 srvhog                     89,797       0,000      89,797

Conclusion


Kernel analysis goes well beyond kernel developers. Kernel traces can be very useful to system administrators and application developers, provided there are simple analysis tools that abstract trace events and produce meaningful results for users.

Try it for yourself: see flightbox and workload-kit on GitHub.

More to come, stay tuned!

Tuesday, May 10, 2011

Tracing Workshop '11

I attended the Tracing Workshop held at the École Polytechnique de Montréal on May 9-10, 2011; here are some highlights of the event. Partners presented their challenges, followed by current research conducted in the field at several Canadian universities.

Partners

Partners mentioned their interest in security, debugging hardware, real-time systems, embedded systems and large-scale distributed infrastructure.

On the security side, the goal would be to increase the accuracy of intrusion detection. It's reported that simple attacks often work, like sending an email with a link that loads some browser extension or another web site, up to obtaining the user's credentials.

The need to model normal behavior and to alert when the system goes outside of it, without false positives, was mentioned a lot.

In the field of large-scale infrastructure, there is a need to understand the system-wide behavior of a distributed system at runtime. This is at the heart of quickly finding hard-to-reproduce problems, including kernel and application crashes and performance slowdowns.

Monitoring a system modifies its behavior and the measurement itself, even in the computing domain. In the embedded and real-time field, the first challenge is to put an upper bound on the tracing impact, to make sure timing constraints are respected even when tracing is enabled.

Research

Sebastian Fischmeister from the University of Waterloo presented research about using a sampling approach to tracing for real-time systems. He presented an approach that adds a hardware instruction for tracing, which would execute only if the timing requirements are met.

Ashvin Goel from the University of Toronto talked about increasing security by recording all system calls to allow offline analysis. The data is indexed to allow fast queries.

Greg Frank from Carleton University presented a performance model based on traces. Execution paths at the system level can be modeled by a graph, and the graph can be annotated with resource consumption.

Del Mayers from the University of Victoria presented a reverse engineering tool based on tracing. By running the program with the desired functionality, then repeating the scenario without using that particular feature, it's possible to do a set difference on the called methods and isolate, with a low noise level, the classes and methods that implement this particular functionality.

Wednesday, February 23, 2011

Managing Apache config with Augeas

Release 0.8.0 of Augeas includes a special lens for managing Apache web server configuration files. First, we will see how the lens maps sections and directives to the tree. Then, we will review some recipes to modify the configuration and show the results.

Quick start

The Apache configuration has two main components. The first is the section, which is similar to an XML element. Sections can be nested and contain directives. The second is the directive, which is composed of a name and positional arguments.

As you may know, Apache is very modular. A module can define its own directives, and Apache will run the module's callback to parse the arguments. This is reflected in the way the configuration is handled in Augeas. The lens is mostly generic and allows arbitrary sections and directives, with arbitrary arguments. Apache allows case-insensitive sections and directives, but the current version of the lens enforces that opening and closing tags have the same case. I recommend standardizing the casing of the configuration for simplicity. Now, let's look at a small example with one section and one directive.

<VirtualHost *:80>
  ServerAdmin foo@example.org
</VirtualHost>

This configuration produces the following tree.

{ "VirtualHost"
    { "arg" = "*:80" }
    { "directive" = "ServerAdmin"
      { "arg" = "foo@example.org" }
    }
  }

Let's say we want to modify the email address of the server admin; we can do a simple set:
augtool> set "/VirtualHost/*[self::directive='ServerAdmin']/arg" "bar@example.org"
To select the ServerAdmin directive, we search for it as a child of VirtualHost. We then set the first argument of the matching directive to a new value. Here is the resulting diff from the original file. Note that if there is more than one VirtualHost entry in the file, a simple set will fail; a set multiple (setm) is required to change them all at once.

$ diff -u httpd.conf httpd.conf.augnew 
--- httpd.conf 2011-02-22 22:03:34.000000000 -0500
+++ httpd.conf.augnew 2011-02-22 22:04:34.844099001 -0500
@@ -1,3 +1,3 @@
 <VirtualHost *:80>
-  ServerAdmin foo@example.org
+  ServerAdmin bar@example.org
 </VirtualHost>
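The same edit can also be scripted against the Augeas C API (libaugeas). Here is a minimal sketch, assuming the Ubuntu default site file used later in this post; compile with -laugeas.

#include <augeas.h>
#include <stdio.h>

int main(void)
{
    augeas *aug = aug_init(NULL, NULL, AUG_NONE);
    if (!aug)
        return 1;

    /* Same edit as above: point ServerAdmin to the new address. */
    if (aug_set(aug,
                "/files/etc/apache2/sites-available/default/VirtualHost"
                "/*[self::directive='ServerAdmin']/arg",
                "bar@example.org") < 0 || aug_save(aug) < 0) {
        fprintf(stderr, "edit failed\n");
        aug_close(aug);
        return 1;
    }

    aug_close(aug);
    return 0;
}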

Tree structure

There are four special things to know about the tree structure of the Apache configuration.

  • Section names are labels of the tree; they appear on the path.
  • Directive names are held in the value of a "directive" node.
  • Arguments of either sections or directives are held in the values of "arg" nodes.
  • Argument nodes must come before any other children.

To demonstrate this, let's create a configuration from scratch and watch it being built at each step in augtool. First, launch augtool. I suggest using the -n option to save to a new file.
$ sudo augtool -n
The variable $conf holds the path to the target empty file; for example, under Ubuntu it would be:
augtool> defvar conf /files/etc/apache2/sites-available/foo

Step 1: Create the VirtualHost section

The clear command empties the value of a node, creating the node if it doesn't exist yet.
Command:
augtool> clear $conf/VirtualHost
Result:
<VirtualHost>
</VirtualHost>

Step 2: Set the argument of the VirtualHost section.

Command:
augtool> set $conf/VirtualHost/arg "172.16.0.1:80"
Result:
<VirtualHost 172.16.0.1:80>
</VirtualHost>

Step 3: Create the ServerAdmin directive under VirtualHost.

Command:
augtool> set $conf/VirtualHost/directive "ServerAdmin"
Result:
<VirtualHost 172.16.0.1:80>
ServerAdmin
</VirtualHost>

Step 4: Set the ServerAdmin first argument to configure the email.

Command:
augtool> set $conf/VirtualHost/*[self::directive='ServerAdmin']/arg "baz@example.org"
Result:
<VirtualHost 172.16.0.1:80>
ServerAdmin baz@example.org
</VirtualHost>

The order matters

What if we perform step 3 before step 2? Here is the result:
augtool> set $conf/VirtualHost/directive "ServerAdmin"
augtool> set $conf/VirtualHost/arg "172.16.0.1:80"
augtool> print $conf
/files/etc/apache2/sites-available/foo
/files/etc/apache2/sites-available/foo/VirtualHost
/files/etc/apache2/sites-available/foo/VirtualHost/directive = "ServerAdmin"
/files/etc/apache2/sites-available/foo/VirtualHost/arg = "172.16.0.1:80"
augtool> save
Saving failed
augtool> print /augeas/files/etc/apache2/sites-available/foo/error/message
/augeas/files/etc/apache2/sites-available/foo/error/message =

Failed to match ({ /arg/ = /[...]/ }({ /arg/ = /[...]/ })*)?(<<rec>>
| { /directive/ = /[...]/ } | { /#comment/ = /[...]/ } | { })*  

with tree

{ \"directive\" = \"ServerAdmin\" } { \"arg\" = \"172.16.0.1:80\" }"

I reformatted the error message to highlight why it failed. The lens requires "arg" nodes to appear before "directive", "#comment" and empty nodes. In this case, the "directive" node comes before the "arg" node, which doesn't match the tree structure. So, how do we make sure we insert a node before any other? Here is the twist: first we remove the arg node, then we create it at the right place.

augtool> rm $conf/VirtualHost/arg
rm : $conf/VirtualHost/arg 1
augtool> ins arg before  $conf/VirtualHost/*[1]
augtool> set $conf/VirtualHost/arg "172.16.0.1:80"
augtool> save
Saved 1 file(s)

And the corresponding result is the one expected. The predicate [1] always selects the first node in a node set and is equivalent to [position()=1]. There is one gotcha: if there are no children under VirtualHost, no node will match and the insert will fail. In that case, a simple set is all that is required.

Also, take care to set a value on "arg" nodes! Otherwise, Augeas uses the section lens to render them. Here is the result of an "arg" node with a null value.

<VirtualHost>
<arg>
</arg>
</VirtualHost>

Use cases

Modify document root

The first use case is modifying the document root in the default configuration of Apache under Ubuntu. The default document root is "/var/www", but say we want to change it to "/srv/www". We have to modify the DocumentRoot directive and one Directory section. I'm using the match command to verify that the right node is selected before applying changes. (Sorry for the long lines...)

augtool> match $conf2/VirtualHost/*[self::directive="DocumentRoot"]
/files/etc/apache2/sites-available/default/VirtualHost/directive[2] = DocumentRoot
augtool> set $conf2/VirtualHost/*[self::directive="DocumentRoot"]/arg /srv/www
augtool> match $conf2/VirtualHost/Directory/arg[. = "/var/www/"]
/files/etc/apache2/sites-available/default/VirtualHost/Directory[2]/arg = /var/www/
augtool> set $conf2/VirtualHost/Directory/arg[. = "/var/www/"] "/srv/www/"
augtool> save
Saved 1 file(s)

Here is the resulting diff.

--- /etc/apache2/sites-available/default 2011-02-22 23:30:48.000000000 -0500
+++ /etc/apache2/sites-available/default.augnew 2011-02-22 23:48:10.000000000 -0500
@@ -1,12 +1,12 @@
 
  ServerAdmin webmaster@localhost
 
- DocumentRoot /var/www
+ DocumentRoot /srv/www
  <Directory>
   Options FollowSymLinks
   AllowOverride None
  </Directory>
- <Directory /var/www/>
+ <Directory /srv/www/>
   Options Indexes FollowSymLinks MultiViews
   AllowOverride None
   Order allow,deny

Changing permissions


On Ubuntu, Apache is configured to serve its documentation at http://localhost/doc/. We will share this documentation with everybody by changing the Order and Allow from directives and deleting the Deny directive of the doc directory. We use the regexp function to select the node with or without a leading slash and with or without quotes. We remove all arguments of "Allow from" first to make sure the resulting arguments are sane.

augtool> match $conf2/VirtualHost/Directory[arg =~ regexp(".*/usr/share/doc.*")]/
/files/etc/apache2/sites-available/default/VirtualHost/Directory[4] = (none)
augtool> defvar dir $conf2/VirtualHost/Directory[arg =~ regexp(".*/usr/share/doc.*")]/
augtool> rm $dir/*[self::directive="Allow"]/arg
augtool> set $dir/*[self::directive="Allow"]/arg[1] from
augtool> set $dir/*[self::directive="Allow"]/arg[2] all
augtool> set $dir/*[self::directive="Order"]/arg[1] "allow,deny"
augtool> rm $dir/*[self::directive="Deny"]
rm : $dir/*[self::directive="Deny"] 3
augtool> save
Saved 1 file(s)

In this example, the predicate "*[self::directive='Allow']" can be rewritten as "directive[ . = 'Allow']"; both are equivalent.

Here is the resulting diff.
--- /etc/apache2/sites-available/default 2011-02-22 23:30:48.000000000 -0500
+++ /etc/apache2/sites-available/default.augnew 2011-02-23 00:10:14.000000000 -0500
@@ -33,9 +33,8 @@
     <Directory "/usr/share/doc">
         Options Indexes MultiViews FollowSymLinks
         AllowOverride None
-        Order deny,allow
-        Deny from all
-        Allow from 127.0.0.0/255.0.0.0 ::1/128
+        Order allow,deny
+        Allow from all
     </directory>

Conclusion

Augeas is a very powerful and precise tool for modifying Apache configuration files. We can make just about any change with a very simple tree structure. As always, don't forget to reload the Apache configuration for the changes to take effect.

Thanks to David Lutterkort, who designed the amazing recursive lens that the Apache lens uses and gave feedback on the square lens, and to Raphaël Pinson for the feedback, for stress testing the lens on more than 2000 Apache configurations, and for the help with this blog post.

Happy hacking!

Thursday, January 27, 2011

Visiting Event Tracing for Windows

In previous posts, we discussed Linux kernel tracing with LTTng. On the Microsoft side, Event Tracing for Windows (ETW) allows gathering kernel and user space traces. It's advertised as a low-impact, lock-less, per-CPU-buffer tracer, with a live mode and a circular buffer option. Quite interesting.
Here is a quick introduction to gathering a simple trace on Windows 7 Professional and to the different analyses we can get from it.

Installation

Obviously, you will need a Windows 7 Professional host. I installed it in a Linux KVM virtual machine, with 20 GB of storage and two emulated CPUs, with all the default options.
First, the .NET Framework 4 has to be installed; it is required by the tracing tools bundled with the Windows SDK.
In summary, you need to install:
  • dotNetFx40_Full_x86_x64
  • winsdk_web
This installs the required tools for tracing: xperf, xperfview, tracelog and tracerpt.

Get a trace

To start tracing, open a console in administrator mode, as shown in the following screenshot.


We create a trace directory for our experiments, then start and stop a simple trace with tracelog, as shown in the following screenshot. While tracing, you can start programs to create some activity on the system. I started IE and did some random browsing.
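The exact session is in the screenshot; from a console opened as administrator, it looks roughly like this (the directory name is arbitrary, and tracelog targets the NT Kernel Logger when no session name is given):

C:\> md C:\trace
C:\> cd C:\trace
C:\trace>tracelog -f kernel.etl -start
C:\trace>tracelog -stop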

The trace file output is set with the "-f" option. If a lot of events occur, some of them can be lost; if that happens, increase the buffer size with the "-b" option. Other options enable extra trace events; refer to the help for more information. To give an idea of the disk requirements, I got a trace of about 20 MB for a minute of tracing under light load. The trace file can therefore become very big.


If the user doesn't have enough privileges for tracing, the following error message is raised. Make sure the console is launched as administrator.

C:\trace>tracelog -f kernel.etl -start
Setting log file to: C:\trace\kernel.etl
Could not start logger: NT Kernel Logger
Operation Status: 5L
Access is denied.



Analyze the trace

Launch xperfview to start the Windows Performance Analyzer, then load the trace kernel.etl obtained previously. Here is what it looks like.
Disk access graph and CPU usage according to time and process in xperfview.

Process life graph and page faults by process according to time in xperfview.
Offset of disk access to detect system-wide bad seek behavior.

Summary of events count obtained from tracerpt
What I like about xperf is the ability to isolate resource usage per process. Zooming into a trace is also intuitive: just select with the mouse an interval that seems interesting. Trace events are completely abstracted into charts and statistics, and these views are appropriate for understanding overall system performance. Also, when the executable symbols are loaded, the call trace is available, which can be very handy for software optimization.
The tracerpt utility generates reports from a trace file: one is a count of all events, and the HTML report gives statistics for the whole trace by subsystem. Here is an example of report.html generated with tracerpt.

Friday, January 7, 2011

Tracing a Linux cluster

Gathering traces from multiple machines and then showing them all at once is a real challenge, because each computer has its own independent clock. This means that two events occurring at the same time on two computers have a low probability of having the same timestamp in the trace.

The Linux Trace Toolkit has a synchronization feature to adjust trace timestamps to a reference trace. It works by matching sent and received network packets found in the traces. Two parameters are computed: an offset, but also a drift, which is the rate of clock change.
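As a rough illustration (not the exact lttv code), applying these two factors to a timestamp is a simple linear transformation:

/* Map a timestamp from a traced node's clock onto the reference
 * trace's clock using the computed drift and offset. */
static double sync_timestamp(double t, double drift, double offset)
{
    return drift * t + offset;
}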

To experiment with it, I built a virtual Ubuntu cluster of 10 nodes installed with LTTng. The virtual machines send each other TCP packets with the small utility pingpong while tracing is enabled. All traces are then copied to the host for analysis. Here are the results:

$ lttv -m sync_chain_batch --sync --sync-stats -t trace1 -t trace2 [...]
[...]
Resulting synchronization factors:
 trace 0 drift= 1 offset= 6.80407e+09 (2.551684) start time= 2433.182802117
 trace 1 drift= 1 offset= 4.12691e+09 (1.547687) start time= 2433.254700670
 trace 2 drift= 0.999999 offset= 0 (0.000000) start time= 2433.229372444
 trace 3 drift= 1 offset= 1.53079e+09 (0.574083) start time= 2433.366687088
 trace 4 drift= 1 offset= 2.31812e+10 (8.693471) start time= 2433.217954184
 trace 5 drift= 1 offset= 1.48957e+10 (5.586227) start time= 2433.280525047
 trace 6 drift= 1 offset= 1.76688e+10 (6.626211) start time= 2433.291125350
 trace 7 drift= 1 offset= 2.0463e+10 (7.674116) start time= 2433.325424020
 trace 8 drift= 0.999999 offset= 1.2206e+10 (4.577534) start time= 2433.204361289
 trace 9 drift= 1 offset= 9.55947e+09 (3.585022) start time= 2433.293578996
[...]

Trace 2 has been selected as the reference trace. Inside the parentheses, we see the offset between the current trace and the reference trace. The delay between traces matches the delay between the virtual machine startups. When loading these traces into lttv with the --sync option, all trace events align perfectly, which is not the case without the sync option.

If you want to know more about trace synchronization, I recommend the paper Accurate Offline Synchronization of Distributed Traces Using Kernel-Level Events by Benjamin Poirier.

One simple note if you want to do this analysis: you need the extended network trace events. To enable them, arm ltt with

$ sudo ltt-armall -n

I found it really interesting, because it makes it possible to see the cluster-wide system state.