Showing posts with label linux.

Sunday, October 19, 2014

LinuxCon 2014, Düsseldorf

Distributed Active Path from Kernel Trace

I presented recent work on recovering the active path of a distributed program using only a kernel trace. The result is a time view of the execution across computers. One of the most interesting results is the Django Poll three-tier app, including an Apache web server and a PostgreSQL database. The client is simulated by scripting with Mechanize. The screenshot shows the execution of the POST, where a large sync is visible near the end of the sequence, required by the SQL UPDATE statement.




Declarative views from user-space traces with Trace Compass

Geneviève Bastien presented how to use LTTng-UST and Trace Compass to create a time view of the state of an MPI computation. The MPI program simulates an unbalanced computation, where green represents work and red represents waiting at the barrier. Multiple cycles are shown, and all the views are synchronized, which is handy for analyzing the application at multiple levels at once. In the following figure, we see below that the wait at the barrier (in red) is actually implemented as a busy loop calling the non-blocking poll() system call, shown on top. We observed this behavior with the Open MPI library, and we think it is intended to reduce latency at the expense of power and resource efficiency.

Below: MPI imbalanced application state (green: compute, red: wait). Above: the corresponding kernel trace, including a zoom on the busy wait.

Preemption analysis across host and virtual machines

Execution in virtual machines is subject to higher levels of variation in response time. Geneviève Bastien presented the virtual machine analysis, which shows preemption crossing virtual machine boundaries, a research result of former student Mohamad Gebai. A process in a VM may appear to run while it does not, because the host process representing its virtual CPU is preempted. The host and the VMs are traced at the same time, and the traces are synchronized. The global view of the resources makes it possible to pinpoint the preemption cause for a given task, whether the other task is local, on the host, or inside another virtual machine. The following figure shows the qemu processes on the host and the traces recorded inside the virtual machines. We clearly see a gettimeofday() that is much longer than one would expect (13 ms instead of ~200 ns); the reason is that another process, in another VM, was running during that time, and the preemption cause is shown below.

Total view of the system, crossing virtual machine boundaries.


ftrace

Steven Rostedt from Red Hat presented the inner workings of ftrace. It is a framework to hook into every function in the kernel. The kernel is compiled with the gcc -pg option, which adds a call to mcount() at the beginning of each function, but these calls are replaced by nops. The space left in each function header is patched at runtime to jump to the instrumentation. This framework is used by kprobes, but the ftrace backend can also be used directly.

folly

Ben Maurer from Facebook presented efficiency and scalability issues they face in their environment. Performance in accepting connections is critical. He mentioned that getting the next available file descriptor is a linear search, which hurts performance. He also showed a method to decrease the number of scheduling switches of workers by replacing the pthread condition wait, improving load distribution among threads. I liked his quantitative approach, always evaluating both kernel and user-space solutions to the same problem.

Cycle accurate profiling using CoreSight hardware trace

Pawel Moll from ARM presented the CoreSight instruction tracing capabilities. The processor records the branches taken and the number of cycles taken by each instruction. He demonstrated a prototype of a cycle-accurate (i.e. exact cycle count) profiler. The trace decoding is somewhat cumbersome, because the code memory must be reconstructed offline to resolve address symbols, and the complete toolchain requires proper integration, but it was still an impressive demonstration.

Thanks to NVIDIA!

I also got a Jetson TK1 big.LITTLE board from NVIDIA. I will be able to test heterogeneous trace experiments (x86, ARM), a very common case for mobile and cloud computing. I will also be able to test CoreSight tracing. Thanks NVIDIA!

Thursday, June 16, 2011

Summary of Linux Symposium 2011

The Linux Symposium took place in Ottawa once again in 2011. I went there to give a talk about recovering system metrics from a kernel trace. There were interesting talks; here is a summary.

Hitoshi Mitake explained scalability issues for virtualization on multi-core architectures with real-time constraints. He mentioned that when one of the virtual CPUs is interrupted, another virtual CPU may try to enter a section protected by a spin lock, which decreases performance under load. An alternative scheduling of virtual CPUs is proposed to avoid such situations.

Sergey Blagodurov talked about scheduling on NUMA multicore architectures. I learned about numactl and its underlying system calls, which can be used from user space to move processes and memory between domains. He was developing scheduler algorithms to optimize performance on this architecture. Since the Linux scheduler can migrate a process, but not its memory pages, performance decreases because inter-domain communication increases and is much slower. One of his goals was to increase memory locality for frequently used pages.

Jon C Masters gave a talk about the Tilera architecture, which packs up to 100 cores on a single chip! This piece of hardware is undoubtedly very special, and I would love to get one on my desk! I wonder how the Linux kernel can scale up to that number of cores.

Trinity, a system call fuzzer, was presented by Dave Jones. It's basically a tool that performs system calls with randomized arguments, while still targeting edge cases. They found and fixed around 20 weird bugs this way. A code coverage report could be useful in the future.

What about redundant data on a disk? There was a talk about block device deduplication by Kuniyasu Suzaki. The principle is that two identical blocks should not be saved twice. Since it operates at the block level, mileage varies between file systems. The presenter mainly showed how much space can be recovered depending on block size and file system type. I'm still curious about the performance impact and CPU overhead of performing deduplication online.

The last talk I want to mention was given by Alexandre Lissy, about verification of the Linux kernel. Static checks like Coccinelle are available right now by running make coccicheck in your Linux source tree. While model checking the entire kernel would be great to get more guarantees about its correct behavior, the feasibility of doing so with a SAT-based model checker is another story, given the size and complexity of the Linux source tree; it is quite a challenging undertaking. Maybe an approach like Microsoft's SLAM, which validates interface usage, could be used for Linux drivers too?

This Linux Symposium was an opportunity to meet many people involved in the kernel community and share ideas. Here are a few links related to the conference.

Friday, January 7, 2011

Tracing a Linux cluster

Gathering traces from multiple machines and then showing them all at once is a real challenge, because each computer has its own independent clock. This means that two events occurring at the same time on two computers are unlikely to have the same timestamp in the trace.

The Linux Trace Toolkit has a synchronization feature to adjust trace timestamps to a reference trace. It works by matching sent and received network packets found in the traces. Two parameters are computed: an offset, but also a drift, which is the rate of clock change.

To experiment with it, I set up a virtual Ubuntu cluster of 10 nodes with LTTng installed. The virtual machines send each other TCP packets with the small pingpong utility while tracing is enabled. All traces are then copied to the host for analysis. Here are the results:

$ lttv -m sync_chain_batch --sync --sync-stats -t trace1 -t trace2 [...]
[...]
Resulting synchronization factors:
 trace 0 drift= 1 offset= 6.80407e+09 (2.551684) start time= 2433.182802117
 trace 1 drift= 1 offset= 4.12691e+09 (1.547687) start time= 2433.254700670
 trace 2 drift= 0.999999 offset= 0 (0.000000) start time= 2433.229372444
 trace 3 drift= 1 offset= 1.53079e+09 (0.574083) start time= 2433.366687088
 trace 4 drift= 1 offset= 2.31812e+10 (8.693471) start time= 2433.217954184
 trace 5 drift= 1 offset= 1.48957e+10 (5.586227) start time= 2433.280525047
 trace 6 drift= 1 offset= 1.76688e+10 (6.626211) start time= 2433.291125350
 trace 7 drift= 1 offset= 2.0463e+10 (7.674116) start time= 2433.325424020
 trace 8 drift= 0.999999 offset= 1.2206e+10 (4.577534) start time= 2433.204361289
 trace 9 drift= 1 offset= 9.55947e+09 (3.585022) start time= 2433.293578996
[...]

Trace 2 has been selected as the reference trace. Inside the parentheses we see the offset, in seconds, between the current trace and the reference trace. The delay between traces matches the delay between the virtual machine startups. When loading these traces into lttv with the --sync option, all trace events align perfectly, which is not the case without it.

If you want to know more about trace synchronization, I recommend the paper "Accurate Offline Synchronization of Distributed Traces Using Kernel-Level Events" by Benjamin Poirier.

One quick note if you want to do this analysis: you need the extended network trace events. To enable them, arm ltt with

$ sudo ltt-armall -n

I found it really interesting, because it makes it possible to see the cluster-wide system state.

Thursday, December 9, 2010

CPU history from kernel trace

Everybody is familiar with the CPU usage history, for example in the GNOME System Monitor. We see the average usage over time.


The GNOME monitor polls /proc/stat at regular intervals to compute the average usage. This file is updated by the kernel; the values come from the sampling the kernel performs to account for the various processes.

How could we compute the CPU usage history from an actual kernel trace? We only have to look at the sched_schedule events. Such an event informs us that, at a certain point in time, the kernel switched the task running on a given CPU. If the new task has PID zero, we know it's the kernel's idle thread. By splitting running and idle time into fixed intervals, we can compute the average CPU usage.

Here is a small example done on my dual-core computer. First, we see the execution of three cpuburn processes started one second apart. We can see the result in the control flow viewer in LTTV, and the resulting CPU usage view.



We see a peak at the beginning of the trace. This is likely related to the metadata flushing when a trace is started. The two steps afterward correspond to each cpuburn process, and we see that when the third process is started, the CPU usage stays at 100%.

For testing, I added scripts to generate those traces automatically on your computer. The code is written in Java, and you will need some extra packages to run it; see the README file. Get the code on GitHub:

https://github.com/giraldeau/flightbox

Happy hacking!

Friday, November 12, 2010

Introduction to the Linux Trace Toolkit

The Linux Trace Toolkit next generation (LTTng) is absolutely amazing. In this introduction, we will look at how to interpret the kernel trace of a small program by comparing it to strace. If you want to install the components and record a trace on your own machine, you can get packages for the patched kernel and the other tools for Ubuntu Maverick from LTTng's PPA on Launchpad. You will also need build-essential to compile the small C executable, plus strace and hexdump.

Here is the small program. It opens a file and writes an integer to it. Then it increments a counter in a for loop.
#include <stdio.h>
#include <limits.h>

int main(void) {
    int i = 3;
    /* open a file and write an integer to it */
    FILE *fd = fopen("test.out", "w");
    if (fd == NULL)
        return 1;
    fwrite(&i, sizeof(int), 1, fd);
    fclose(fd);
    /* busy loop; volatile keeps the compiler from optimizing it away */
    volatile long a = 0;
    int x;
    for (x = 0; x < INT_MAX; x++) {
        a++;
    }
    return 0;
}

Compile it, test it and run it under strace:
$ gcc -o usecase usecase.c
$ ./usecase
$ strace -o usecase.strace ./usecase
OK. Now let's run the program under tracing, but first we have to enable all the trace points in the kernel.
$ sudo ltt-armall
To get a tight window around the command, I'm using the small script trace-cmd.sh (see the end of the post). The script simply starts tracing, runs the program, and stops the trace. Invoke it like this (don't forget to make the script executable):
./trace-cmd.sh ./usecase
Now we are ready to launch lttv-gui, the utility to look at the trace. To load a trace, select File->Add trace, then select the "trace-cmd" directory created in the same directory as the trace-cmd.sh script. Then scroll down to the "./usecase" process in the control flow viewer, in the middle of the window. Trace events are displayed under the graph. I will present three phases of the process: first the creation and file access, then the computation phase, and finally the process exit. The trace is rotated to make it more readable. Event names and comments are embedded in the graph. strace events are in red, while the related code is in blue.



In addition to system calls, we can look at the paging of an application and see process starvation. This information is not shown by strace. Another cool side effect is that, contrary to strace, the running program can't detect that it is being traced, so we can trace viruses that would otherwise quit before they can be analyzed. Tracing reports much more information about kernel internals, which helps a lot in understanding what's going on. Event timestamps of the kernel trace are in nanoseconds, so the precision is really high, and since trace points are static (simple function calls compiled right into the kernel), the performance impact is ridiculously low, below 3% for the core trace points! I'm running LTTng in the background on my system in flight recorder mode, without noticeable impact.

I will be working on advanced trace analysis for the next three years or so for my PhD. Since LTTng has all the features I need for tracing, I will use this tool. My first goal is to create a CPU usage view. Say that something was eating your CPU a minute ago; with kernel tracing you would be able to go back in the "flight recorder" history, see what was running at that time, and understand what was going on.

You can help me in two ways. First, by submitting your ideas about cool trace analyses we could perform and/or that you need. Second, you can write to Linus Torvalds and ask him to merge LTTng into mainline. LTTng is the right stuff we need for tracing, and it's a bit of a shame that it is not merged yet while other operating systems have had this for a decade.

The next post will be about TMF, an Eclipse plugin for trace analysis! Stay tuned!

Script trace-cmd.sh:
#!/bin/sh

if [ $# -eq 0 ]; then
    echo "missing command argument"
    exit 1
fi

cmd="$@"

name="cmd"
dir="$(pwd)/trace-$name"
sudo rm -rf "$dir"
sudo lttctl -o channel.all.bufnum=8 -C -w "$dir" "$name"
echo "executing $cmd..."
$cmd
echo "return code: $?"
sudo lttctl -D "$name"