jeudi 5 décembre 2013

System-level profiling of APT

APT is the package manager of Debian based distributions. Understanding where the time is spent installing a package is difficult, because there are numerous process involved. We discuss limitations of profiler tools and present system-level trace analysis that highlight waiting between tasks.

Callgrind is a profiler that helps to find hotspot in the application. One limitation of a profiler is that the time spent waiting is not taken into account, for instance I/O, resources or other tasks. This situation is shown in figure 1, where we traced the command "sudo apt-get install tree --yes". The actual computation done in the apt-get process accounts for 63% of the elapsed time, while waiting time is about 37%. The callgrind results covers only instructions from the application, not the waiting time.

Figure 1: State of apt-get process according to time.

How to effectively understand what the application was waiting for? For this matter, we performed a waiting dependency analysis on all tasks on the system. The event sched_wakeup indicates the waiting source. Figure 2 shows the result of waiting dependencies among tasks for the command.

Figure 2: Overview of  apt-get waiting dependencies according to time.
The analysis resolves waiting time recursively. In this run, apt-get process is waiting for dpkg, and it accounts for about 20% of the execution time. Figure 3 shows the wait I/O that occurs in dpkg, up to the kernel thread responsible for the file system journal of the partition on which the I/O is performed.

Figure 3: I/O wait of dpkg according to time.
Another interesting part of the execution is the update of man database performed by mandb. Figure 4 shows that about 200 threads are spawned. In this case, the overhead of starting the threads and setup communication pipe is high compared to the amount of actual work, in consequence running time could be greatly improved by using a thread pool instead.

Figure 4: Execution of mandb according to time.
We located the code responsible for creating all these threads and processes using perf callchain. It seems to be related to how the library libpipeline is working. Here is an example of an interesting call stack:

7ffff729a30b     __execvpe (/lib/x86_64-linux-gnu/
7ffff75a4de4     pipecmd_exec (/usr/lib/x86_64-linux-gnu/
409813           [unknown] (/usr/local/bin/mandb)
7ffff75a4e59     pipecmd_exec (/usr/lib/x86_64-linux-gnu/
7ffff75a662a     pipeline_start (/usr/lib/x86_64-linux-gnu/
408d96           find_name (/usr/local/bin/mandb)
40449c           test_manfile (/usr/local/bin/mandb)
404708           testmandirs (/usr/local/bin/mandb)
405035           update_db (/usr/local/bin/mandb)
40a1c6           mandb (/usr/local/bin/mandb)
40a4fb           process_manpath (/usr/local/bin/mandb)
403777           main (/usr/local/bin/mandb)
7ffff71f9ea5     __libc_start_main (/lib/x86_64-linux-gnu/
403b09           _start (/usr/local/bin/mandb)

At the end of apt-get execution, there is a sleep of 500 ms, shown in blue in the figure 2. Here is the location in the code of this nanosleep, again using perf callchain feature: 

ffffffff816cc0c0 __schedule ([kernel.kallsyms])
ffffffff816cc799 schedule ([kernel.kallsyms])
ffffffff816cb66c do_nanosleep ([kernel.kallsyms])
ffffffff810827f9 hrtimer_nanosleep ([kernel.kallsyms])
ffffffff8108292e sys_nanosleep ([kernel.kallsyms])
ffffffff816d616f tracesys ([kernel.kallsyms])
7ffff72978c0     __GI___libc_nanosleep (/lib/x86_64-linux-gnu/
7ffff7b8ac95     pkgDPkgPM::DoTerminalPty(int) (/usr/lib/x86_64-linux-gnu/
7ffff7b93720     pkgDPkgPM::Go(int) (/usr/lib/x86_64-linux-gnu/
7ffff7b26585     pkgPackageManager::DoInstallPostFork(int) (/usr/lib/x86_64-linux-gnu/
415a1b           [unknown] (/usr/bin/apt-get)
418dfc           [unknown] (/usr/bin/apt-get)
7ffff7af96c2     CommandLine::DispatchArg(CommandLine::Dispatch*, bool) (/usr/lib/x86_64-linux-gnu/
40a2ec           [unknown] (/usr/bin/apt-get)

Here is the detail of pkgDPkgPM::DoTerminalPty() function from apt/apt-pkg/deb/

 * read the terminal pty and write log
void pkgDPkgPM::DoTerminalPty(int master)
   unsigned char term_buf[1024] = {0,0, };

   ssize_t len=read(master, term_buf, sizeof(term_buf));
   if(len == -1 && errno == EIO)
      // this happens when the child is about to exit, we
      // give it time to actually exit, otherwise we run
      // into a race so we sleep for half a second.
      struct timespec sleepfor = { 0, 500000000 };
      nanosleep(&sleepfor, NULL);
   if(len <= 0)
   FileFd::Write(1, term_buf, len);
      fwrite(term_buf, len, sizeof(char), d->term_out);

The comment mentions a possible race, but the caller may be fixed to account for this situation. IMHO, this function should not sleep if an error occurs while reading the file descriptor, and should instead return immediately.

This was a small example of what can be achieved with the waiting dependency analysis from a kernel trace. The analysis plug-in will be available soon in Eclipse Linux Tools, and uses standard tracepoints from the Linux kernel, recorded with LTTng. I hope that this information will help the community to improve performance of installing packages with APT. 


The APT licence plate! (thanks to Yannick)

vendredi 7 juin 2013

How slow is Java reflection?

I use Java reflection for trace analysis, as a way to call hooks when specific events are encountered. I wanted to know the overheard related to using such dynamic invocation. First, here is a small example of a trace event handler.

public class FooTraceEventHandler extends TraceEventHandlerBase {

  public FooTraceEventHandler() {
    this.hooks.add(new TraceHook("sched_switch"));

  public void handle_sched_switch(TraceReader reader,
                                  CtfTmfEvent event) {
    // Process event

By convention, the function "handle_sched_switch" is called only with an instance of "sched_switch" event as argument. The magic happens by first getting a reference the method by it's name (done once), and then call it dynamically later in the processing path, similar to the following small example.

method = dummy.getClass().getMethod("foo", (Class[]) null);

To compare dynamic invocation to compile-time method call, I did a trivial benchmark program. Here are the results for 10 billion calls.

StaticMethodBenchmark     9,653000
DynamicMethodBenchmark    47,625000

This experiment shows that dynamic invocation is roughly 5x slower than compile-time method call. Dynamic invocations is also weaker, because a missing method for the corresponding hook will be detected only at runtime.

There are benefits to using hooks. The test for the event type is done only once for all handlers. It avoid quite a lot of repetition. Also, it allows to list events required for a specific analysis, and then enable only this subset in the tracing session, reducing the runtime overhead and the trace size compared to enabling all events. So, we need to preserve these properties, but with compile-time invocations.

mardi 4 juin 2013

How-to setup LTTng on Fedora 19

Here is how to install lttng including kernel and userspace tracing on Fedora 19 beta.

Instructions uses sudo. To configure sudo on your fedora system, use this command as root, and replace  [username] with your username, then logout and login again:

usermod [username] -a -G wheel

1 Build and install kernel modules

1.1 Perform any pending updates:

sudo yum update

If the kernel is updated, reboot the machine.

1.2 Install development tools:

sudo yum group install "Development tools"

1.3 Fix kernel headers location

At the time of writing this How-to, the version of the kernel-devel package doesn't exactly matches the running kernel (uname -r). We worked around this issue by copying headers in the directory matching the version of the running kernel. The version numbers are likely to change (and this issue may be solved), so adapt it your situation.

First, check if this applies to your system:

rpm -qa | grep kernel-devel

uname -r

If versions do not match, you can fix (hack) it like this:

sudo su -
cd /usr/src/kernels
cp -a 3.9.4-300.fc19.x86_64 3.9.4-301.fc19.x86_64

1.4 Download lttng-modules

We need to install lttng-modules 2.2.0-rc2 (or later) for compatibility with linux 3.9.4.

tar -xjvf lttng-modules-2.2.0-rc2.tar.bz2
cd lttng-modules-2.2.0-rc2/
KERNELDIR=/usr/src/kernels/$(uname -r) make
sudo KERNELDIR=/usr/src/kernels/$(uname -r) make modules_install
sudo depmod -a
sudo modprobe lttng-tracer

The module lttng_tracer should be loaded and visible with lsmod. 

Step 2: Install lttng packages

sudo yum install lttng-tools lttng-ust babeltrace lttng-ust-devel

Now, reboot the machine. The service lttng-sessiond should be started automatically, and all required modules should be loaded. Here is some checks that can be done:

$ lsmod | grep lttng_tracer
lttng_tracer          558145  24 lttng_ring_buffer_client_discard,lttng_probe_jbd2,lttng_probe_kmem,lttng_probe_napi,lttng_probe_scsi,lttng_probe_sock,lttng_ring_buffer_client_mmap_overwrite,lttng_probe_statedump,lttng_ring_buffer_metadata_client,lttng_ring_buffer_client_mmap_discard,lttng_probe_irq,lttng_probe_kvm,lttng_probe_net,lttng_probe_skb,lttng_probe_udp,lttng_ring_buffer_metadata_mmap_client,lttng_probe_module,lttng_probe_signal,lttng_probe_vmscan,lttng_probe_block,lttng_ring_buffer_client_overwrite,lttng_probe_power,lttng_probe_sched,lttng_probe_timer
lttng_lib_ring_buffer    50443  7 lttng_ring_buffer_client_discard,lttng_ring_buffer_client_mmap_overwrite,lttng_ring_buffer_metadata_client,lttng_ring_buffer_client_mmap_discard,lttng_tracer,lttng_ring_buffer_metadata_mmap_client,lttng_ring_buffer_client_overwrite
lttng_statedump        30537  1 lttng_tracer
lttng_ftrace           13097  1 lttng_tracer
lttng_kprobes          12911  1 lttng_tracer
lttng_kretprobes       13224  1 lttng_tracer

$ sudo service lttng-sessiond status
Redirecting to /bin/systemctl status  lttng-sessiond.service
lttng-sessiond.service - LTTng 2.x central tracing registry session daemon
   Loaded: loaded (/usr/lib/systemd/system/lttng-sessiond.service; enabled)
   Active: active (running) since Tue 2013-06-04 14:48:53 EDT; 1min 8s ago
  Process: 385 ExecStart=/usr/bin/lttng-sessiond -d (code=exited, status=0/SUCCESS)
 Main PID: 418 (lttng-sessiond)
   CGroup: name=systemd:/system/lttng-sessiond.service
           └─418 /usr/bin/lttng-sessiond -d

Step 3: Make traces with workload-kit

One way to make traces is to use workload-kit. Here is how to install it from sources.
tar -xjvf workload-kit-0.1.13.tar.bz2 
cd workload-kit-0.1.13/
./configure --disable-cputemp
make -j12 
sudo make install
Then, the tool lttng-simple can be used to generate a UST trace or a kernel trace for instance:
lttng-simple -k -s -- sleep 1
babeltrace sleep-k/

lttng-simple -u -- /usr/local/share/workload-kit/scripts/ust/wk-heartbeat
babeltrace wk-heartbeat-u/
Happy tracing!

vendredi 31 mai 2013

Tracing userspace memory allocation

You may wonder what are the actual memory allocations in a program according to time. Here is a method to trace these operations with the LTTng User Space Tracer (UST). Events are recorded by overloading malloc() and free() with a library that is preloaded with LD_PRELOAD. This library is called "". To easy the actual tracing session setup, this feature has been added to the lttng-simple utility of workload-kit v0.1.13. By specifying the option "--enable-libc-wrapper", then the environment is set automatically. Here is quick demo to look at the memory allocations of gnome-calculator. (output of babeltrace is simplified for clarity)

// trace the program
$ lttng-simple -u --enable-libc-wrapper -- gnome-calculator

// display events
$ babeltrace gnome-calculator-u/
[10:48:49.808255177] malloc: { size = 40, ptr = 0xCA8220 }
[10:48:49.808256389] free: { ptr = 0xCA8220 }

// count events and look for lost events
$ javeltrace analyze -i ctf gnome-calculator-u/...
number of streams                             1
number of packets                         1,058
number of events                        181,879
number of CPUs                                8
number of lost events                         0

This experiment shows that there are about 182 thousands memory allocation events to start a calculator!

Checking for lost events with javeltrace is recommended. The default buffer size in UST is quite small, and in consequence lost events occurs even with very light load, like starting the calculator with memory allocation tracing. The tool lttng-simple setups buffers of 16MB to make it safer.

The next step is of course to display this data according to time and compare it to physical pages allocated by the operating system with a kernel trace.

Happy hacking!

jeudi 28 mars 2013

Augedit prototype

Augeas is a great tool for string to tree transformation. I created a GUI to make it visual and more interactive. The prototype app is called Augedit. It displays the Augeas tree on the left side, and on the the string representation on the right side. By selecting a node in the tree, the label and the value matched are highlighted in the text view. Here is an example with a trivial key/value lens, one record per line.

Behind the scene, it uses span information of the selected path to recover the file content and highlight the key and the value. The app is in Vala and uses Gtk toolkit.

The prototype as of now is only a viewer. This could be extended in many ways:

  • Edition capabilities, where the tree and the text are synchronized as the user types
  • Providing a view to query and display results
  • XPath expression builder, so that they can be validated and then used in scripts
  • Highlight the whole match for a sub-tree  The information is available from the span, but I haven't been able to set alpha channel for colors in Gtk.TextView. Using pango/cairo/webkit based widget to display the text may be the solution.
Vala is quite nice, I like the idea of having bare metal binaries and the integration with Augeas with the bindings works great, documentation on is excellent. However, GCC complains because of the C code generated and it's a bit annoying  part of it is my fault (using deprecated stuff, like Gtk.Box() for instance). For more code examples, I'm using yorba projects as reference, like geary and shotwell. Finally, debugging is not as convenient, because of the additional layer of code generation.

The code is on github:

Have fun!

Presenter assists

I was a volunteer at EclipseCon 2013. One of the task consists to help the speaker keep on time, by displaying cardboards showing the remaining time of the session. For fun, I and François Rajotte made a small QML app for that purpose (and most of all to experiment with QML). The app is based on a timer, with color transition when remaining time goes under some threshold. It is really hacky, but it works!


  • Start or reset the timer: click anywhere in the window
  • Full screen: press "F"
  • Quit: press "Q"

We definitely should be able to set the start time (not all sessions have the same length) and the list of threshold with custom colors should be customizable. It requires an additional UI component, and those settings should be saved.

As for QML, it's quite cool. The rendering is flicker free and it's quite easy to make animations. We had one issue with keyboard events. We extended a QAction and overloaded the method eventFilter to get key events. The problem is the method always returned "true", as if we should stop forwarding the event to other event filters. It turns out that it was preventing QML to behave properly, because its handler runs after our own. Be aware!

The code is on github: