
Sunday, October 19, 2014

LinuxCon 2014, Düsseldorf

Distributed Active Path from Kernel Trace

I presented recent work on recovering the active path of a distributed program using only a kernel trace. The result is a time view of the execution across computers. One of the most interesting results is the Django Poll three-tier app, including an Apache web server and a PostgreSQL database. The client is simulated by scripting with Mechanize. The screenshot shows the execution of the POST, where a large sync is visible near the end of the sequence, necessary for the update SQL statement.




Declarative views from user-space traces with Trace Compass

Geneviève Bastien presented how to use LTTng-UST and Trace Compass to create a time view of the state of an MPI computation. The MPI program simulates an unbalanced computation, where green represents work and red represents waiting at the barrier. Multiple cycles are shown, and all the views are synchronized, which is handy for analysing the application at multiple levels at once. In the following figure, we see below that the wait at the barrier (in red) is actually implemented as a busy loop calling the non-blocking poll() system call, shown on top. We observed this behavior with the OpenMPI library, and we think it is intended to reduce latency at the expense of power and resource efficiency.

Below: MPI imbalance application state (green: compute, red: wait). Above: corresponding kernel trace, including a zoom on the busy wait.

Preemption analysis across host and virtual machines

Execution in virtual machines is subject to higher levels of variation in response time. Geneviève Bastien presented the virtual machine analysis, which shows preemption crossing virtual machine boundaries, the research result of former student Mohamad Gebai. A process in a VM may appear to run while it is not, because the process representing its virtual CPU on the host is preempted. The host and the VMs are traced at the same time, and the traces are synchronized. The global view of the resources makes it possible to pinpoint the preemption cause for a given task, whether the other task is local, on the host, or inside another virtual machine. The following figure shows qemu processes on the host and traces from within the virtual machines. We clearly see a gettimeofday() that is much longer than one would expect (13 ms instead of ~200 ns), and the reason is that another process, in another VM, was running during that time; the preemption cause is shown below.

Total view of the system, crossing virtual machine boundaries.


ftrace

Steven Rostedt from Red Hat presented the inner workings of ftrace. It is a framework to hook into each function in the kernel. The kernel is compiled with the gcc -pg option, which adds a call to mcount() at the beginning of each function, but these calls are replaced by nops. The space left in each function header is patched at runtime to jump to the instrumentation. This framework is used by kprobes, but the ftrace backend can also be used directly.

folly

Ben Maurer from Facebook presented efficiency and scalability issues they face in their environment. Performance in accepting connections is critical. He mentioned that getting the next available file descriptor is a linear search that hurts performance. He also showed a method to decrease the number of scheduling switches of workers by replacing the pthread condition wait, improving load distribution among threads. I liked his quantitative approach, always evaluating both kernel and user-space solutions to the same problem.

Cycle accurate profiling using CoreSight hardware trace

Pawel Moll from ARM presented the CoreSight instruction tracing capabilities. The processor records the branches taken and the number of cycles taken by each instruction. He demonstrated a prototype of a cycle-accurate (i.e. exact cycle count) profiler. The trace decoding is somewhat cumbersome, because the code memory must be reconstructed offline to resolve address symbols, and the complete toolchain requires proper integration, but it was still an impressive demonstration.

Thanks to NVIDIA!

I also got a Jetson TK1 board from NVIDIA. I will be able to test heterogeneous trace experiments (x86, ARM), a very common case for mobile and cloud computing. I will also be able to test CoreSight tracing. Thanks NVIDIA!

Thursday, December 5, 2013

System-level profiling of APT

APT is the package manager of Debian-based distributions. Understanding where time is spent installing a package is difficult, because there are numerous processes involved. We discuss the limitations of profiler tools and present a system-level trace analysis that highlights waiting between tasks.

Callgrind is a profiler that helps find hotspots in an application. One limitation of a profiler is that time spent waiting is not taken into account, for instance waiting on I/O, resources or other tasks. This situation is shown in Figure 1, where we traced the command "sudo apt-get install tree --yes". The actual computation done in the apt-get process accounts for 63% of the elapsed time, while waiting accounts for about 37%. The callgrind results cover only instructions executed by the application, not the waiting time.

Figure 1: State of apt-get process according to time.

How can we effectively understand what the application was waiting for? To answer this, we performed a waiting dependency analysis on all tasks on the system. The sched_wakeup event indicates the waiting source. Figure 2 shows the resulting waiting dependencies among tasks for the command.

Figure 2: Overview of apt-get waiting dependencies according to time.
The analysis resolves waiting time recursively. In this run, the apt-get process waits for dpkg, which accounts for about 20% of the execution time. Figure 3 shows the I/O wait that occurs in dpkg, up to the kernel thread responsible for the file system journal of the partition on which the I/O is performed.

Figure 3: I/O wait of dpkg according to time.
Another interesting part of the execution is the update of the man database performed by mandb. Figure 4 shows that about 200 threads are spawned. In this case, the overhead of starting the threads and setting up communication pipes is high compared to the amount of actual work; consequently, the running time could be greatly improved by using a thread pool instead.

Figure 4: Execution of mandb according to time.
We located the code responsible for creating all these threads and processes using the perf callchain. It seems related to how the libpipeline library works. Here is an example of an interesting call stack:

7ffff729a30b     __execvpe (/lib/x86_64-linux-gnu/libc-2.17.so)
7ffff75a4de4     pipecmd_exec (/usr/lib/x86_64-linux-gnu/libpipeline.so.1.2.2)
409813           [unknown] (/usr/local/bin/mandb)
7ffff75a4e59     pipecmd_exec (/usr/lib/x86_64-linux-gnu/libpipeline.so.1.2.2)
7ffff75a662a     pipeline_start (/usr/lib/x86_64-linux-gnu/libpipeline.so.1.2.2)
408d96           find_name (/usr/local/bin/mandb)
40449c           test_manfile (/usr/local/bin/mandb)
404708           testmandirs (/usr/local/bin/mandb)
405035           update_db (/usr/local/bin/mandb)
40a1c6           mandb (/usr/local/bin/mandb)
40a4fb           process_manpath (/usr/local/bin/mandb)
403777           main (/usr/local/bin/mandb)
7ffff71f9ea5     __libc_start_main (/lib/x86_64-linux-gnu/libc-2.17.so)
403b09           _start (/usr/local/bin/mandb)

At the end of the apt-get execution, there is a sleep of 500 ms, shown in blue in Figure 2. Here is the location of this nanosleep in the code, again found with the perf callchain feature:

ffffffff816cc0c0 __schedule ([kernel.kallsyms])
ffffffff816cc799 schedule ([kernel.kallsyms])
ffffffff816cb66c do_nanosleep ([kernel.kallsyms])
ffffffff810827f9 hrtimer_nanosleep ([kernel.kallsyms])
ffffffff8108292e sys_nanosleep ([kernel.kallsyms])
ffffffff816d616f tracesys ([kernel.kallsyms])
7ffff72978c0     __GI___libc_nanosleep (/lib/x86_64-linux-gnu/libc-2.17.so)
7ffff7b8ac95     pkgDPkgPM::DoTerminalPty(int) (/usr/lib/x86_64-linux-gnu/libapt-pkg.so.4.12.0)
7ffff7b93720     pkgDPkgPM::Go(int) (/usr/lib/x86_64-linux-gnu/libapt-pkg.so.4.12.0)
7ffff7b26585     pkgPackageManager::DoInstallPostFork(int) (/usr/lib/x86_64-linux-gnu/libapt-pkg.so.4.12.0)
415a1b           [unknown] (/usr/bin/apt-get)
418dfc           [unknown] (/usr/bin/apt-get)
7ffff7af96c2     CommandLine::DispatchArg(CommandLine::Dispatch*, bool) (/usr/lib/x86_64-linux-gnu/libapt-pkg.so.4.12.0)
40a2ec           [unknown] (/usr/bin/apt-get)

Here is the detail of pkgDPkgPM::DoTerminalPty() function from apt/apt-pkg/deb/dpkgpm.cc:

/*
 * read the terminal pty and write log
 */
void pkgDPkgPM::DoTerminalPty(int master)
{
   unsigned char term_buf[1024] = {0,0, };

   ssize_t len=read(master, term_buf, sizeof(term_buf));
   if(len == -1 && errno == EIO)
   {
      // this happens when the child is about to exit, we
      // give it time to actually exit, otherwise we run
      // into a race so we sleep for half a second.
      struct timespec sleepfor = { 0, 500000000 };
      nanosleep(&sleepfor, NULL);
      return;
   }
   if(len <= 0)
      return;
   FileFd::Write(1, term_buf, len);
   if(d->term_out)
      fwrite(term_buf, len, sizeof(char), d->term_out);
}

The comment mentions a possible race, but the caller could be fixed to account for this situation. IMHO, this function should not sleep if an error occurs while reading the file descriptor; it should instead return immediately.

This was a small example of what can be achieved with a waiting dependency analysis of a kernel trace. The analysis plug-in will be available soon in Eclipse Linux Tools, and it uses standard tracepoints from the Linux kernel, recorded with LTTng. I hope this information will help the community improve the performance of installing packages with APT.

Cheers!

The APT licence plate! (thanks to Yannick)


Friday, January 7, 2011

Tracing a linux cluster

Gathering traces from multiple machines and then showing them all at once is a real challenge, because each computer has its own independent clock. This means that two events occurring at the same time on two computers are unlikely to have the same timestamp in their traces.

The Linux Trace Toolkit has a synchronization feature to adjust trace timestamps to a reference trace. It works by matching sent and received network packets found in the traces. Two parameters are computed: an offset, but also a drift, which is the rate of clock change.

To experiment with it, I built a virtual Ubuntu cluster of 10 nodes with LTTng installed. The virtual machines send each other TCP packets with the small pingpong utility while tracing is enabled. All traces are then copied to the host for analysis. Here are the results:

$ lttv -m sync_chain_batch --sync --sync-stats -t trace1 -t trace2 [...]
[...]
Resulting synchronization factors:
 trace 0 drift= 1 offset= 6.80407e+09 (2.551684) start time= 2433.182802117
 trace 1 drift= 1 offset= 4.12691e+09 (1.547687) start time= 2433.254700670
 trace 2 drift= 0.999999 offset= 0 (0.000000) start time= 2433.229372444
 trace 3 drift= 1 offset= 1.53079e+09 (0.574083) start time= 2433.366687088
 trace 4 drift= 1 offset= 2.31812e+10 (8.693471) start time= 2433.217954184
 trace 5 drift= 1 offset= 1.48957e+10 (5.586227) start time= 2433.280525047
 trace 6 drift= 1 offset= 1.76688e+10 (6.626211) start time= 2433.291125350
 trace 7 drift= 1 offset= 2.0463e+10 (7.674116) start time= 2433.325424020
 trace 8 drift= 0.999999 offset= 1.2206e+10 (4.577534) start time= 2433.204361289
 trace 9 drift= 1 offset= 9.55947e+09 (3.585022) start time= 2433.293578996
[...]

Trace 2 has been selected as the reference trace. The value in parentheses is the offset between the current trace and the reference trace, and the delays between traces match the delays between the virtual machine startups. When loading these traces into lttv with the --sync option, all trace events align perfectly, which is not the case without the sync option.

If you want to know more about trace synchronization, I recommend the paper Accurate Offline Synchronization of Distributed Traces Using Kernel-Level Events from Benjamin Poirier.

One note: to run this analysis, you need the extended network trace events. To enable them, arm LTT with:

$ sudo ltt-armall -n

I found it really interesting, because it allows seeing the cluster-wide system state.

Thursday, December 9, 2010

CPU history from kernel trace

Everybody is familiar with the CPU usage history, for example in the GNOME monitor, where we see the average usage over time.


The GNOME monitor polls /proc/stat at regular intervals to compute the average usage. This file is updated by the kernel; the values come from the sampling the kernel does to account for the various processes.

How could we compute the CPU usage history from an actual kernel trace? We only have to look at the sched_schedule events. Such an event informs us that, at a certain point in time, the kernel switched the task running on a given CPU. If the new task has PID zero, we know it is the kernel's idle thread. By splitting running and idle time into fixed intervals, we can compute the average CPU usage.

Here is a small example done on my dual-core computer. First, we see the execution of three cpuburn processes started one second apart. We can see the result in the control flow viewer in LTTV, and the resulting CPU usage view.



We see a peak at the beginning of the trace. This is likely related to the metadata flushing when a trace is started. The two steps that follow correspond to the cpuburn processes starting, and we see that once the third process is started, the CPU usage stays at 100%.

For testing, I added scripts to generate these traces automatically on your computer. They are written in Java, and you will need some extra packages to run them; see the README file. Get the code on github:

https://github.com/giraldeau/flightbox

Happy hacking!

Friday, November 12, 2010

Introduction to Linux Tracing Toolkit

The Linux Trace Toolkit next generation (LTTng) is absolutely amazing. In this introduction, we will look at how to interpret the kernel trace of a small program by comparing it to strace. If you want to install the components and get a trace on your own machine, you can get packages for the patched kernel and the other tools for Ubuntu Maverick from LTTng's PPA on Launchpad. You will also need build-essential to compile the small C executable, plus strace and hexdump.

Here is the small program. It opens a file and writes an integer into it. Then, it adds numbers in a for loop.
#include <stdio.h>
#include <limits.h>   /* INT_MAX */

int main(void) {
    int i = 3;
    FILE *fd = fopen("test.out", "w");
    fwrite(&i, sizeof(int), 1, fd);
    fclose(fd);
    /* volatile keeps the compiler from optimizing the loop away */
    volatile long a = 0;
    int x;
    for (x = 0; x < INT_MAX; x++) {
        a++;
    }
    return 0;
}

Compile it, test it and run it under strace:
$ gcc -o usecase usecase.c
$ ./usecase
$ strace -o usecase.strace ./usecase
Ok. Now, let's run the program under tracing, but first, we have to enable all tracepoints in the kernel.
$ sudo ltt-armall
To get a tight window around the command, I'm using the small script trace-cmd.sh (see the end of the post). The script simply starts tracing, runs the program and stops the trace. Invoke it like this (don't forget to make the script executable):
./trace-cmd.sh ./usecase
Now, we are ready to launch lttv-gui, the utility to look at the trace. To load a trace, select File->Add trace, then select the "trace-cmd" directory created in the same directory as the trace-cmd.sh script. Then, scroll down to the "./usecase" process in the control flow viewer, in the middle of the window. Trace events are displayed under the graph. I will present three phases of the process: first, the creation and file access, then the computation phase, and finally the process exit. The trace is rotated to make it more readable. Event names and comments are embedded in the graph; strace events are in red, while the related code is in blue.



In addition to system calls, we can look at the paging of an application and see process starvation. This information is not shown by strace. Another cool side effect is that, contrary to strace, the running program can't detect that it is being traced, so we can trace viruses that would otherwise quit before they can be analyzed. Tracing reports much more information about kernel internals, which helps a lot in understanding what's going on. Event timestamps of the kernel trace are in nanoseconds, so the precision is really high, and since the tracepoints are static (simple function calls compiled right into the kernel), the performance impact is very low, below 3% for the core tracepoints! I'm running LTTng in the background on my system in flight recorder mode without noticeable impact.

I will be working on advanced trace analysis for the next three years or so for my PhD. Since LTTng has all the features I need for tracing, I will use this tool. My first goal is to create a CPU usage view. Say something was eating your CPU a minute ago; with kernel tracing you would be able to go into the "flight recorder" history, see what was running at that time, and understand what was going on.

You can help me in two ways. First, by submitting your ideas about cool trace analyses we could perform and/or that you need. Second, you can write to Linus Torvalds and ask him to merge LTTng into mainline. LTTng is the right stuff we need for tracing, and it's a bit of a shame that it is not merged yet while other operating systems have had this for a decade.

The next post will be about TMF, an Eclipse plugin for trace analysis! Stay tuned!

Script trace-cmd.sh:
#!/bin/sh

if [ $# -eq 0 ]; then
    echo "missing command argument"
    exit 1
fi

cmd="$@"

name="cmd"
dir="$(pwd)/trace-$name"
sudo rm -rf "$dir"
sudo lttctl -o channel.all.bufnum=8 -C -w "$dir" "$name"
echo "executing $cmd..."
$cmd
echo "return code: $?"
sudo lttctl -D "$name"