mardi 24 janvier 2012

TCP socket blocking behavior

Here are some experiments results of blocking behavior of TCP applications. Blocking is a special state in which a process is waiting for I/O and removed from the scheduler queue. When the data is available, the process is woken up. In the case of distributed applications, the network latency is likely to cause blocking. To understand the behavior of such applications, a small experiment is made with netcat on Linux. System events are recorded with LTTng and network events are recorded with tcpdump. To simulate the network latency, the Linux traffic shaper is used. Here is the script used as the test case.

#!/bin/sh
tc qdisc add dev lo root netem delay 100ms
netcat -l localhost 8765 > /dev/null &;
echo "lttng" | netcat localhost 8765
tc qdisc del dev lo root

Here, the traffic shapping command sets the latency of packets to 100ms on the loop-back interface. The first netcat process is configured to listen on port 8765. The server exits immediately when a message is received. Next, the netcat client is spawn and a small string is transfered. The last command removes the traffic shapping configuration. The Linux kernel used is 2.6.38 on x86_64.

Tracing this script reveal three main blockings, as reported by the blocking analysis module of flightbox.

# server process
Blocking report for task /bin/netcat [7495]
Start            Duration (ms)        Syscall Wakeup
8073461215283          300,873     sys_accept SOFTIRQ
# client process
Blocking report for task /bin/netcat [7497]
Start            Duration (ms)        Syscall Wakeup
8073461820815          200,155     sys_select SOFTIRQ
8073662056512          200,135       sys_poll SOFTIRQ

The next figure shows blockings occurring in the system, including messages sent at each steps.



The server process create a new AF_INET socket, binds it and start to listen on the selected port. The accept is then performed for an incoming connection. From the trace, we observe that sys_accept blocks for about 300ms. This delay is very close to three times the network latency and is also almost equal to the process duration. Once the accept returns, the read on the socket doesn't block and the process exits.

In the case of the client, it creates the socket, then perform a sys_connect to the server. The connect returns immediately the value EINPROGRESS without blocking. The next step performed is a sys_connect, in which the client blocks for about two times the network latency. When the select returns, the actual message is sent to the server without blocking and the socket is closed. Finally, the client waits on sys_poll for about twice the network delay.

From this observation, the sys_accept performed by the server blocks until the final handshake ACK is received. Hence, unfinished handshake is completely hidden from the application and handled at the OS level. When the read is done, the data is already buffered, such that the read doesn't block in this case. In the case of the client, the connect system call returns in an optimistic fashion. The wait for the socket to be ready is differed to a select system call. This may allow many connect to be performed simultaneously. As for the poll, this may be related to the tear-down procedure of the socket, waiting for the final FIN from the server.

In the case of the traffic shaper used, the delay between consecutive packets is preserved. For example, the client sends three packets when the sys_select returns in a short burst. Hence, this experiment may not highlight all possible blockings. An alternative to traffic shaper is the iptables NFQUEUE. It allow to forward each packet in userspace for arbitrary processing. More on this in the next blog.