First, some technical details on how to instrument an MPI program, because it is not so obvious after all and the documentation is sparse. I tried both MPICH and OpenMPI as the MPI implementation. I would recommend MPICH because it seems to provide better error messages, which are very valuable for understanding failures. In addition, on Fedora the MPE library is included in the MPICH package, which is not the case for OpenMPI. On Ubuntu, the MPE library must be downloaded and installed by hand because there is no binary package for it. As for the project's autoconf setup, I reused the m4 macro cs_mpi.m4 from Code Saturne and ran configure with the following options. By linking against liblmpe (the -llmpe flag), the program writes a log file that Jumpshot can display.
CC=mpicc LDFLAGS="-L/usr/local/lib" LIBS="-llmpe -lmpe" ./configure
The experiment consists of a small MPI application that exchanges grid boundaries in a 1D Cartesian decomposition using blocking send and receive. Four processes are launched on the same host. The first figure shows a zoomed view of the execution and of the messages exchanged between processes over time. The second figure shows a statistics view for an interval, in which the time spent in each state is summed. At a glance, one can see the relative time taken by communication, which is very handy for understanding performance.
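The statistics view described above boils down to a simple aggregation. As a minimal sketch (not Jumpshot's actual code), assuming the trace has already been reduced to a list of (state, start, end) intervals, the per-state totals for a time window could be computed as follows. The `interval` struct, the state names and both functions are hypothetical, introduced only for illustration.

```c
#include <stddef.h>

/* Hypothetical state identifiers; a real trace defines its own. */
enum state { STATE_COMPUTE = 0, STATE_SEND, STATE_RECV, STATE_COUNT };

/* One traced interval: a process stayed in `state` from `start` to `end`. */
struct interval {
    enum state state;
    double start;   /* seconds */
    double end;     /* seconds, end >= start */
};

/*
 * Sum, per state, the time spent inside the window [t0, t1].
 * Intervals that straddle a window edge are clipped to the window.
 * totals[] must have STATE_COUNT entries; it is reset before summing.
 */
void sum_states(const struct interval *iv, size_t n,
                double t0, double t1, double totals[STATE_COUNT])
{
    for (int s = 0; s < STATE_COUNT; s++)
        totals[s] = 0.0;

    for (size_t i = 0; i < n; i++) {
        double lo = iv[i].start > t0 ? iv[i].start : t0;
        double hi = iv[i].end < t1 ? iv[i].end : t1;
        if (hi > lo)
            totals[iv[i].state] += hi - lo;
    }
}

/* Fraction of the summed time spent communicating (send + recv). */
double comm_fraction(const double totals[STATE_COUNT])
{
    double total = 0.0;
    for (int s = 0; s < STATE_COUNT; s++)
        total += totals[s];
    if (total == 0.0)
        return 0.0;
    return (totals[STATE_SEND] + totals[STATE_RECV]) / total;
}
```

For example, with a process that computes from 0 s to 4 s, sends from 4 s to 5 s and receives from 5 s to 8 s, the window [2 s, 6 s] yields 2 s of compute, 1 s of send and 1 s of receive, so half of the window is spent communicating.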
The same application has been traced with the LTTng kernel tracer. While the program is computing and exchanging data with its neighbors, no system calls are performed to transfer the data. In fact, between processes on the same host, MPICH uses shared memory for communication, presumably to achieve better performance. This example shows that it may not be possible to recover the relationship between processes that communicate through shared memory from a kernel trace alone. One thing to note is that numerous polls on file descriptors are performed, probably for synchronization. Maybe the relationship can be recovered from there.
|Jumpshot Timeline view (green: recv, blue: send)|
|Jumpshot Histogram view (green: recv, blue: send)|
|LTTv control flow view (green: userspace)|
Would it be possible to trace shared memory accesses directly? A recent kernel feature allows tracing memory-mapped I/O (MMIOTRACE). It works by clearing the flag that indicates a page is present in memory, so that a page fault is triggered each time the page is accessed. The page fault handler is instrumented to record those events. This technique may be a hint on how to recover the relationship between processes that share memory. More on this later.
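To make the idea concrete, here is a user-space analogue of that trick. It is only a sketch under stated assumptions, not how MMIOTRACE itself is implemented: instead of touching the page tables, it removes access rights with mprotect() and counts the resulting SIGSEGV faults. It assumes Linux, where calling mprotect() from a signal handler works in practice although POSIX does not list it as async-signal-safe. The function names `trace_setup`, `trace_arm` and `trace_fault_count` are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *traced_page;                 /* start of the watched page */
static size_t traced_len;                 /* its length (one page here) */
static volatile sig_atomic_t fault_count; /* accesses recorded so far */

/* Fault handler: record the access, then restore access rights so that
 * the faulting instruction succeeds when it is restarted. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)traced_page;

    if (addr < base || addr >= base + traced_len) {
        /* Not our page: fall back to the default action (crash). */
        signal(sig, SIG_DFL);
        return;
    }
    fault_count++;
    /* Assumption: mprotect() is usable here on Linux. */
    mprotect(traced_page, traced_len, PROT_READ | PROT_WRITE);
}

/* Map one inaccessible anonymous page and install the handler.
 * Returns the page address, or NULL on error. */
char *trace_setup(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0 ||
        sigaction(SIGBUS, &sa, NULL) != 0)
        return NULL;

    traced_len = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, traced_len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    traced_page = p;
    return traced_page;
}

/* Re-arm the trap: the next access to the page faults again. */
int trace_arm(void)
{
    return mprotect(traced_page, traced_len, PROT_NONE);
}

/* Number of faults recorded on the watched page. */
int trace_fault_count(void)
{
    return (int)fault_count;
}
```

In this sketch only the first access after each trace_arm() call is recorded, because the handler leaves the page accessible. Catching every access would require re-arming the trap automatically after each faulting instruction, which this example does not attempt.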
In conclusion, MPI instrumentation with MPE helps to understand the runtime performance behaviour of an MPI application. The same results can't be obtained with the current kernel tracing infrastructure found in LTTng.