Wednesday, March 28, 2012

Paradyn Week 2012 summary

Paradyn Week 2012 was held at the University of Maryland. I went there to see what other people are doing in debugging and runtime software monitoring through modification of executable binaries, and to compare that approach with our current tracing methods.

Dyninst is a set of tools to modify assembly code. It's like a Swiss Army knife for binaries: it can disassemble them, analyse instructions, recover control flow, and then insert or delete code. It generates code for x86, amd64, ppc32 and ppc64, and a few other exotic architectures.

The first presentation was about data flow analysis. The most interesting part, from my point of view, was the liveness analysis. It computes the set of registers read or written in a function. While tracing, only the subset of registers that actually contain values is saved. The performance benefit has not yet been evaluated, but the hypothesis is that saving fewer registers at each tracepoint lowers the overhead. However, since compilers try to use as many registers as possible, and since x86 has very few general-purpose registers, the number of live registers is probably, on average, close to the entire set.
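To make the idea concrete, here is a minimal sketch of a classic backward liveness analysis over a toy control flow graph. It is not Dyninst's API or implementation: the instruction format (a pair of written and read registers), the block names and the function names are all hypothetical, chosen only for illustration.

```python
def block_use_def(instrs):
    """Return (use, defs) for one basic block.

    instrs is a list of (written_registers, read_registers) pairs.
    'use' holds registers read before being written in the block.
    """
    use, defs = set(), set()
    for written, read in instrs:
        use |= set(read) - defs
        defs |= set(written)
    return use, defs


def liveness(blocks, succ):
    """Iterate the backward data flow equations to a fixed point.

    live_out[b] = union of live_in[s] for each successor s of b
    live_in[b]  = use[b] | (live_out[b] - defs[b])
    """
    use_def = {name: block_use_def(instrs) for name, instrs in blocks.items()}
    live_in = {name: set() for name in blocks}
    live_out = {name: set() for name in blocks}
    changed = True
    while changed:
        changed = False
        for name in blocks:
            out = set()
            for s in succ.get(name, []):
                out |= live_in[s]
            use, defs = use_def[name]
            new_in = use | (out - defs)
            if out != live_out[name] or new_in != live_in[name]:
                live_out[name], live_in[name] = out, new_in
                changed = True
    return live_in, live_out


# Toy function: rax = f(rdi); rbx = g(rax); loop: rax = rax + rbx; return rax
blocks = {
    "entry": [(["rax"], ["rdi"]), (["rbx"], ["rax"])],
    "loop": [(["rax"], ["rax", "rbx"])],
    "exit": [([], ["rax"])],
}
succ = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out = liveness(blocks, succ)
```

In this toy example, a tracepoint placed at the start of the entry block would only need to save rdi, while one placed inside the loop would need both rax and rbx.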

Then, there was a talk by Josh Stone from Red Hat about integrating Dyninst with SystemTap. The core idea is to compile STP scripts as shared libraries and use them to instrument userspace applications. The current implementation is able to connect static tracepoint probes to handler functions. For this proof of concept, the handler prints the function name to the console.

There was a talk about debugging at extreme scale. The scale considered is 100k nodes and 1M cores. The idea is to script the debugging phase instead of trying to use an interactive debugger. The debugging script is added to the cluster scheduler queue to run later. MRNet, a tree network overlay library, is used to control the debugger. A benchmark shows that setting a breakpoint on 1M cores takes about 200ms! MRNet is also used to gather results with a reduction algorithm, which keeps the collection scalable, instead of simply concatenating the results.
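The difference between reduction and concatenation can be sketched as follows. This is only an illustration of the principle, not MRNet's API: the function name, the fan-out of 32 and the stack trace strings are assumptions made for the example. Each level of the tree merges the counters of its children, so the root receives one entry per distinct result rather than one entry per core.

```python
from collections import Counter


def reduce_tree(leaf_values, fanout=32):
    """Merge per-leaf results level by level, as an overlay tree would.

    Each leaf reports one value (for example a stack trace). Every internal
    node sums the counters of up to 'fanout' children and forwards the result.
    """
    level = [Counter([value]) for value in leaf_values]
    while len(level) > 1:
        level = [
            sum(level[i:i + fanout], Counter())
            for i in range(0, len(level), fanout)
        ]
    return level[0] if level else Counter()


# Hypothetical per-core results: most cores compute, a few wait in a barrier.
traces = ["main>compute"] * 9990 + ["main>MPI_Barrier"] * 10
summary = reduce_tree(traces)
```

With concatenation the root would receive 10,000 strings here; with the reduction it receives two entries with their counts.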

The self-propelled instrumentation presentation was about instrumenting distributed applications. It follows the control path and instruments the application on the fly. All the instrumentation is in userspace, using executable patching. If I understood correctly, when a client connects to a server on a different machine, a background ssh connection is made to the other host, the instrumentation is injected into the peer process, and the program continues. The instrumentation itself is a callback attached to a function. For the demo, the callback simply called printf.

Besides that, people were very welcoming, and it was a great pleasure to share ideas. Cherry blossoms are everywhere here in Washington and College Park; we should definitely have more in Montreal!

Tuesday, January 24, 2012

TCP socket blocking behavior

Here are some experimental results on the blocking behavior of TCP applications. Blocking is a special state in which a process is waiting for I/O and is removed from the scheduler queue. When the data is available, the process is woken up. In distributed applications, network latency is likely to cause blocking. To understand the behavior of such applications, I ran a small experiment with netcat on Linux. System events are recorded with LTTng and network events are recorded with tcpdump. To simulate network latency, the Linux traffic shaper is used. Here is the script used as the test case.

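#!/bin/sh
# Add 100 ms of delay to every packet on the loopback interface
tc qdisc add dev lo root netem delay 100ms
# Start the server in the background
netcat -l localhost 8765 > /dev/null &
# Send a short string to the server
echo "lttng" | netcat localhost 8765
# Remove the delay
tc qdisc del dev lo root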

Here, the traffic shaping command sets the latency of packets to 100ms on the loopback interface. The first netcat process is configured to listen on port 8765. The server exits immediately when a message is received. Next, the netcat client is spawned and a small string is transferred. The last command removes the traffic shaping configuration. The Linux kernel used is 2.6.38 on x86_64.

Tracing this script reveals three main blocking periods, as reported by the blocking analysis module of flightbox.

# server process
Blocking report for task /bin/netcat [7495]
Start            Duration (ms)        Syscall Wakeup
8073461215283          300,873     sys_accept SOFTIRQ
# client process
Blocking report for task /bin/netcat [7497]
Start            Duration (ms)        Syscall Wakeup
8073461820815          200,155     sys_select SOFTIRQ
8073662056512          200,135       sys_poll SOFTIRQ

The next figure shows the blocking periods occurring in the system, including the messages sent at each step.



The server process creates a new AF_INET socket, binds it and starts to listen on the selected port. It then calls accept to wait for an incoming connection. From the trace, we observe that sys_accept blocks for about 300ms. This delay is very close to three times the network latency and is also almost equal to the process duration. Once the accept returns, the read on the socket doesn't block and the process exits.

The client creates the socket, then performs a sys_connect to the server. The connect returns EINPROGRESS immediately, without blocking. The next step is a sys_select, in which the client blocks for about twice the network latency. When the select returns, the actual message is sent to the server without blocking and the socket is closed. Finally, the client waits on sys_poll for about twice the network latency.
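The client sequence described above (non-blocking connect, select, send, poll) can be reproduced with a short Python sketch. It mirrors the system calls seen in the trace and is not taken from netcat's source; the function names, the ephemeral port, the timeouts and the in-process server thread are assumptions made to keep the example self-contained. Note that select.poll is not available on every platform.

```python
import errno
import select
import socket
import threading


def run_server(listener, received):
    # accept() blocks until a connection has completed its handshake.
    conn, _ = listener.accept()
    with conn:
        received.append(conn.recv(64))  # read the message, then close


def run_client(port, payload):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setblocking(False)
    # Non-blocking connect: returns at once, usually with EINPROGRESS.
    rc = s.connect_ex(("127.0.0.1", port))
    assert rc in (0, errno.EINPROGRESS)
    # The wait for the handshake is deferred to select(): the socket
    # becomes writable once the connection is established.
    select.select([], [s], [], 5.0)
    assert s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0
    s.sendall(payload)
    s.shutdown(socket.SHUT_WR)  # send our FIN
    # Wait for the peer to close its side of the connection.
    poller = select.poll()
    poller.register(s, select.POLLIN)
    poller.poll(5000)
    tail = s.recv(64)  # b"" once the peer has closed
    s.close()
    return rc, tail


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))  # let the kernel pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    received = []
    server = threading.Thread(target=run_server, args=(listener, received))
    server.start()
    rc, tail = run_client(port, b"lttng\n")
    server.join()
    listener.close()
    return received, rc, tail
```

Running demo() under the same netem delay should show the client spending its time in select and poll rather than in connect.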

From these observations, the sys_accept performed by the server blocks until the final handshake ACK is received. Hence, an unfinished handshake is completely hidden from the application and handled at the OS level. By the time the read is done, the data is already buffered, so the read doesn't block in this case. On the client side, the connect system call returns in an optimistic fashion. The wait for the socket to be ready is deferred to a select system call. This may allow many connects to be performed simultaneously. As for the poll, it may be related to the tear-down procedure of the socket, waiting for the final FIN from the server.
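The measured durations are consistent with a simple timeline in which each packet takes one network delay to cross the loopback interface. The sketch below works through that arithmetic. It assumes that the server is already waiting in accept when the client sends its SYN, that processing time is negligible, and that the poll wait really is the wait for the server's FIN, which is only the hypothesis stated above. The function and key names are made up for the example.

```python
def expected_blocking(one_way_ms):
    """Estimate blocking durations (ms) from the one-way packet delay."""
    d = one_way_ms
    syn_sent = 0                          # client connect() returns, SYN leaves
    syn_at_server = syn_sent + d          # SYN reaches the server
    synack_at_client = syn_at_server + d  # SYN-ACK arrives: select() wakes up
    ack_at_server = synack_at_client + d  # final ACK arrives: accept() wakes up
    fin_at_client = ack_at_server + d     # server exits, its FIN reaches the client
    return {
        "sys_accept": ack_at_server - syn_sent,
        "sys_select": synack_at_client - syn_sent,
        "sys_poll": fin_at_client - synack_at_client,
    }


estimate = expected_blocking(100)
# estimate == {"sys_accept": 300, "sys_select": 200, "sys_poll": 200}
```

These values are close to the 300,873 ms, 200,155 ms and 200,135 ms reported in the trace.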

With the traffic shaper used here, the delay between consecutive packets is preserved. For example, the client sends three packets in a short burst when the sys_select returns. Hence, this experiment may not highlight all possible blocking periods. An alternative to the traffic shaper is the iptables NFQUEUE target. It allows each packet to be forwarded to userspace for arbitrary processing. More on this in the next post.