tag:blogger.com,1999:blog-18645097927353232182024-03-05T16:28:29.995-08:00MultivaxThe meaning of life is forty twoFrancis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.comBlogger52125tag:blogger.com,1999:blog-1864509792735323218.post-47577258122507214702016-05-10T14:23:00.001-07:002016-05-10T14:23:21.945-07:00Function graph tracing with LTTngFTrace has the ability to hook function entry and return in the kernel. I did a small prototype demonstrating the possibility of recording these events with LTTng, and viewing the result with TraceCompass.<br />
<br />
The idea is to register callbacks using register_ftrace_graph(). This symbol is not exported, therefore we use kallsyms to lookup the address at runtime. The callbacks contains themselves a tracepoint that records entries and returns. The kernel module implementation <a href="https://github.com/giraldeau/lttng-modules/blob/fgraph/probes/lttng-fgraph.c" target="_blank">lttng-fgraph.c</a> is quite simple. It limits the tracing to the normal kernel mode (no interrupt) and the function to hook is hardcoded (do_sys_open()) for prototyping purpose.<br />
<br />
The implementation of the analysis module in TraceCompass is relatively simple (<span style="background-color: white; color: #795da3; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; line-height: 16.8px; white-space: pre;"><a href="https://github.com/giraldeau/tracecompass/blob/fgraph/lttng/org.eclipse.tracecompass.lttng2.kernel.core/src/org/eclipse/tracecompass/internal/lttng2/kernel/core/ftrace/LttngFtraceCallStackProvider.java" target="_blank">LttngFtraceCallStackProvider.java</a></span>). In fact, it consists mostly to define entry and return events, and how to decode the event payload. In order to show function names instead of addresses, I implemented a symbol provider for kallsyms. It walk the trace directory to find the kallsyms file and parse it to map addresses to symbols. The user just needs to copy /proc/kallsyms below the trace directory, like this:<br />
<br />
mkdir my-trace/kernel/syms<br />
sudo cat /proc/kallsyms > my-trace/kernel/syms/kallsyms<br />
<br />
Sudo (root) is required, because symbol addresses are usually hidden by default. The kallsyms file must be within its own directory, otherwise the trace parser will try to open the kallsyms file as a trace stream and will report an error.<br />
<br />
Here is the result of some open system call on the system. Because the views are synchronized, the open system call interval of the processes shown in the control flow view overlap the call stack graph. It looks great, isn't?<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjW80wWXcsyAeUErVq3KRajusaGsISmzs_d7iwsvgHLyJuHsevnGDBwtpsWjyBr-Ei8RInUKqex62gKBnY46MWztsxVW4C8u8g5qcg2qRbftgsIn-aSQGNUIRha1FywE4i-tHYE1ZxJZ8/s1600/ftrace.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="372" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjW80wWXcsyAeUErVq3KRajusaGsISmzs_d7iwsvgHLyJuHsevnGDBwtpsWjyBr-Ei8RInUKqex62gKBnY46MWztsxVW4C8u8g5qcg2qRbftgsIn-aSQGNUIRha1FywE4i-tHYE1ZxJZ8/s640/ftrace.png" width="640" /></a></div>
<br />
However, the overhead of function tracing is somewhat high. For each function call, about 300ns are spent writing trace events, which may represent more time than the processing done in the function itself. To address the situation, FTrace has the ability to record an event only of the delay inside the function is greater than a given threshold. The call graph is no longer the exact representation of the execution, but it contributes to reduce the overhead and the trace size by not recording short duration functions, while still identifying slow functions.<br />
<br />
At the moment, FTrace does not exports the registration functions, because external users were not intended. Only one set of hooks can be defined at a time, so one cannot use ftrace and lttng at the same time. In the future, that could be a nice extension to FTrace to make it more general and allow other tracers to use the function graph. For instance, kprobe could get callbacks not only on function entry and return, but also on each of its children.Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-54443225164581717152016-02-24T14:34:00.002-08:002016-02-24T14:34:43.750-08:00Real-life real-time problemI installed Linux RT on the Jetson TK1 last month. Since then, cyclictest is running to test that latency is always correct. The maximum wake-up latency is usually ~100us, and it can run like that for a long time. However, on rare occasion, a latency of ~8ms occurs when running for a long period of time (about one week). This kind of problem is tricky to find without proper tools. Here is how I found the problem.<br />
<br />
I configured LTTng kernel and userspace tracing to continuously trace the system. The kernel trace records scheduling events and system calls while the userspace trace indicates the location of the missed deadline. When a high latency is detected, the trace is stopped and a snapshot of the trace buffers are written on the disk. I let the system running with tracing enabled for a couple of days, and then a snapshot was created, indicating that the problem occurred.<br />
<br />
The following figure shows the result of the traces loaded in TraceCompass. We can see this bug is the result of interactions between three tasks, namely ktimersoftd/2, irq/154-hpd and cyclictest. The cyclictest wake-up signal is delayed for a long time. This signal puts back the task in the running queue, and it is delivered by ktimersoftd/2. The ktimersoftd task has lower priority than irq/154-hpd that happens to run for a long time. Even though the cyclictest has higher priority than the IRQ thread, as long as cyclictest is in the wait queue, the priority has no effect. Priority inheritance does not work here because we don't know in advance that ktimersoftd will actually wake-up a high priority task.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdLVaCrQBamTB2kQ4zzlTAdY81ZH-ihoq8l8JXweESw4DVnlnpDxxTVat2qq1Tmmy5jq2PkKOuki-zcbJ8tzPlUJqiu0HjzcMiPFxmPe_P4Tq852Y1yuTNh_PYQHXpkUcWtb64vFDYvEM/s1600/bug.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdLVaCrQBamTB2kQ4zzlTAdY81ZH-ihoq8l8JXweESw4DVnlnpDxxTVat2qq1Tmmy5jq2PkKOuki-zcbJ8tzPlUJqiu0HjzcMiPFxmPe_P4Tq852Y1yuTNh_PYQHXpkUcWtb64vFDYvEM/s640/bug.png" width="640" /></a></div>
<br />
Obviously, the ktimersoftd threads should have greater priority than other IRQ threads as they are required by cyclictest to meet its deadline. There are few questions left: how the default priorities are defined, what is actually doing the irq/164-hpd task, and does increasing ktimersoftd priority will fix the issue? More investigation is required.<br />
<br />
In the meantime, if you want to look for yourself, the trace is <a href="http://secretaire.dorsal.polymtl.ca/~fgiraldeau/traces/cyclictest-20160224.tar.gz" target="_blank">here</a>. Cheers!Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com3tag:blogger.com,1999:blog-1864509792735323218.post-55186278278083308772016-02-01T07:12:00.000-08:002016-02-01T07:12:08.248-08:00Real-time setup on Jetson TK1The Linux kernel is a popular operating system for real-time applications. Here is a summary of experimentations with the real-time kernel on the Jetson TK1 board from NVIDIA.<br />
<h2>
Setup the board</h2>
<div>
The Jetson TK1 from an internal memory. NVIDIA supplies a script to "flash" the device with the file system root and the kernel. I use the manual method from the <a href="https://developer.nvidia.com/linux-tegra-r214" target="_blank">NVIDIA Jetson TK1 documentation</a>. Download the <a href="http://developer.download.nvidia.com/embedded/L4T/r21_Release_v4.0/Tegra124_Linux_R21.4.0_armhf.tbz2" target="_blank">Driver Package</a> and the <a href="http://developer.download.nvidia.com/embedded/L4T/r21_Release_v4.0/Tegra_Linux_Sample-Root-Filesystem_R21.4.0_armhf.tbz2" target="_blank">Sample Root File System</a>, and follow the <a href="http://developer.download.nvidia.com/embedded/L4T/r21_Release_v4.0/l4t_quick_start_guide.txt" target="_blank">Quick Start Guide</a>. You should have a working system after these steps. </div>
<div>
<br /></div>
<div>
I tried to use the JetPack installer to flash the device, but it requires Ubuntu 14.04, the procedure failed and it was re-downloading the huge files, so I don't recommend it. Also, I tried the procedure to use the SD card instead of the internal memory of the board and that failed too, so I stick with flashing the internal memory. </div>
<h2>
Compile linux-rt</h2>
<div>
Clone Linux-RT from <a href="https://git.kernel.org/cgit/linux/kernel/git/rt/linux-rt-devel.git/" target="_blank">git.kernel.org</a>:<br />
<br />
<pre>git clone https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git</pre>
<pre>cd linux-rt-devel</pre>
<pre>git checkout linux-4.4.y-rt-rebase</pre>
<br />
While it downloads, install the cross-compiler for ARM (the "hf" in arm abi hf stands for hard-float, meaning that the compiler will output code optimized for on-chip floating point arithmetic):
<br />
<br />
<pre>sudo apt-get install gcc-arm-linux-gnueabihf</pre>
<br />
Configure the kernel. When cross-compiling, the ARCH and CROSS_COMPILE variables must be set.<br />
<br />
<ul>
<li>Under "Kernel Features -> Preemption Model", select the preemption level (i.e. PREEMPT_RT_BASE)</li>
<li>Under "Kernel hacking" uncheck "Debug preemptible kernel" (known to cause slowdown)</li>
</ul>
<br />
<pre># update configuration</pre>
<pre><pre>make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- tegra_defconfig</pre>
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- menuconfig
# compile
make -j12 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- </pre>
<pre><pre>make -j12 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- dtbs</pre>
# install the modules into the rootfs (adjust according to your setup)
sudo make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- INSTALL_MOD_PATH=../Linux_for_Tegra/rootfs/ modules_install</pre>
<br />
There should be a directory created under Linux_for_Tegra/rootfs/lib/modules/<br />
The last step consists to flash the device with the fresh kernel. However, the flash.sh script dit not work correctly for me. The options "-K <kernel>" and "-d <dtbfile>" do not configure the rootfs correctly. This step must be done manually and the settings are not overwriten by the flash script. Beware that the script apply_binaries.sh will copy the zImage from the kernel directory. To avoid using the previous kernel by mistake, replace the default zImage </dtbfile></kernel>in the kernel directory with your own.<br />
<br />
The Device Tree Blob (DTB) describes the hardware addresses that are specific to the device. Previously, one had to write these memory locations in C, whereas it is done declaratively and converted to a binary format. In my understanding, it simplifies the support of a wide range of devices. I were not able to boot the board without the proper DTB file set in the bootloader.<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-size: 14.6667px; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">sudo cp arch/arm/boot/zImage ../Linux_for_Tegra/rootfs/boot/</span></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-size: 14.6667px; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">sudo cp arch/arm/boot/zImage ../Linux_for_Tegra/kernel/</span></span></div>
<span style="font-family: "courier new" , "courier" , monospace;"><span id="docs-internal-guid-e2655c2c-7a02-b37e-e251-2ac5ab7ce911"></span></span><br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-size: 14.6667px; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">sudo cp arch/arm/boot/dts/tegra124-jetson-tk1.dtb ../Linux_for_Tegra/rootfs/boot/</span></span></div>
<h2>
Setup the bootloader</h2>
Edit the bootloader configuration file <span style="font-family: "arial"; font-size: 14.6667px; line-height: 20.24px; white-space: pre-wrap;">Linux_for_Tegra/rootfs/boot/extlinux/extlinux.conf to set the FDT field with the path of the Device Tree Blob (DTB) (do not change the other fields):</span><br />
<span style="font-family: "arial"; font-size: 14.6667px; line-height: 20.24px; white-space: pre-wrap;"><br /></span>
<span style="font-size: 14.6667px; line-height: 20.24px; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">MENU TITLE Jetson-TK1 eMMC boot options
LABEL primary
MENU LABEL primary kernel
LINUX /boot/zImage
FDT /boot/tegra124-jetson-tk1.dtb</span></span><span style="font-family: "arial"; font-size: 14.6667px; line-height: 20.24px; white-space: pre-wrap;">
</span><br />
<div>
<span style="font-family: "arial"; font-size: 14.6667px; line-height: 20.24px; white-space: pre-wrap;"><br /></span></div>
Then, put the device in recovery mode (hold the recovery button while pressing reset, the board should be displayed in dmesg) and execute the usual flash command:<br />
<br />
<pre>cd Linux_for_Tegra/</pre>
<pre>./flash.sh jetson-tk1 mmcblk0p1</pre>
<br />
After the board reboots, it should run the real-time kernel. The board has a serial port and spawning minicom with a serial-to-usb device allowed to see the boot logs. This is required because the screen is blank at this early boot stage.<br />
<br />
I still have an issue with the CONFIG_PREEMPT_RT_FULL. Without the serial cable, it would be not possible to diagnose the problem.<br />
<br />
<span id="docs-internal-guid-e2655c2c-7a1a-7c7a-6de2-2579bf4e39f0"><span style="font-family: "courier new" , "courier" , monospace;"><span style="font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;">[ 5.170764] Unable to handle kernel paging request at virtual address ffefe574</span><span style="font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;">[ 5.186139] kernel BUG at kernel/locking/rtmutex.c:1011!</span><span style="font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;">[ 5.981440] Fixing recursive fault but reboot is needed!</span></span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;"><br /></span></span>
<br />
<span style="font-family: Courier New, Courier, monospace;"><span style="font-family: inherit; font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;">I use instead CONFIG_PREEMPT_RT_BASE, but I don't know exactly what is the difference between the full RT config, this is something to investigate.</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-size: 14.6667px; vertical-align: baseline; white-space: pre-wrap;"><br /></span></span></div>
Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-54754785556024395452015-12-04T14:51:00.000-08:002015-12-10T11:50:01.307-08:00MachineKit LTTng kernel trace How does MachineKit execution look like at the system level? Here are some observations done using an LTTng kernel trace on Linux preempt-rt 4.1.13-rt15 on a 4 cores machine. The workload traced is the abs.0 unit test. This test spawns halcmd, simulating the position control of a motor (or something). Here is the command:<br />
<br />
cd machinekit/<br />
. scripts/rip-environment<br />
cd tests/abs.0/<br />
halcmd -v test.hal<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
The overall command takes 3.3s to complete. Let's check what's in the trace.<br />
<br />
First, the rtapi processing takes about 600ms to execute and is shown in Figure 1. One of the related process, named fast:0, seems to do the actual real-time job and is shown in Figure 2. This process sleeps periodically using the system call clock_nanosleep(). I observed a wake-up latency of about 30us and the thread itself runs for about 2-3us in user-space before returning to sleep.<br />
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp1RWq-6CPzPFg1wAlSLXQ3f5qjQbJd8AwsqfwrZUvI84vLn6wFfC5k6b5DcbvXI-wRBCgZv4vIQH4ZxZh7vpqHo_3AjWclGSMzFyXqKVTSYnXY18PNq_l6Saqv1UHJqBp3pfWskeaSrE/s1600/Screenshot+from+2015-12-04+17-13-39.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="106" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp1RWq-6CPzPFg1wAlSLXQ3f5qjQbJd8AwsqfwrZUvI84vLn6wFfC5k6b5DcbvXI-wRBCgZv4vIQH4ZxZh7vpqHo_3AjWclGSMzFyXqKVTSYnXY18PNq_l6Saqv1UHJqBp3pfWskeaSrE/s320/Screenshot+from+2015-12-04+17-13-39.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1: Overview of the rtapi execution.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4BVBipxmUShRyZ1JYijboHmwrnBDrSZSsE4UcsxoM-Ffl-AToMZxmPx8Xdo-D5GkyFaNDiIbsaUmVKPVDTilB4y-XTLosIE6HFXjIZhNcYvEPr7W3mFCqbzQ7-UxrEGnSJSjeSgLtlLE/s1600/Screenshot+from+2015-12-04+17-10-54.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="15" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4BVBipxmUShRyZ1JYijboHmwrnBDrSZSsE4UcsxoM-Ffl-AToMZxmPx8Xdo-D5GkyFaNDiIbsaUmVKPVDTilB4y-XTLosIE6HFXjIZhNcYvEPr7W3mFCqbzQ7-UxrEGnSJSjeSgLtlLE/s320/Screenshot+from+2015-12-04+17-10-54.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2: Real-time process with period of about 300us.</td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: left;">
The scripts and utilities to manage the test environment behave quite differently than the RT process. In particular, there is a bunch of processes that interact for very short durations, namely halcmd and rtapi, as shown in Figure 3. They perform sendto() and recvfrom() system calls, probably for round-trip communication. Maybe using shared memory could help here to streamline the communication. There is numerous sleep of 10ms to 200ms performed by halcmd in midway. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmGTP8s7meFhyl9nNyGPAt3_J8735zMrdcB_jtbgfDLIa0xZmTzHo_gnYPUzJsKj6qNmjNxqTznQbTiM-VwrkQep3yMGdwxWyrvH8i57Pcp3tNtEV-gf7FE9D3gkXl1tboIuylVDYOGpo/s1600/mk.tests.abs.0.1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="92" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmGTP8s7meFhyl9nNyGPAt3_J8735zMrdcB_jtbgfDLIa0xZmTzHo_gnYPUzJsKj6qNmjNxqTznQbTiM-VwrkQep3yMGdwxWyrvH8i57Pcp3tNtEV-gf7FE9D3gkXl1tboIuylVDYOGpo/s320/mk.tests.abs.0.1.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3: Interactions between halcmd and rtapi.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: start;">
There is a pattern of fork-exec-wait done by the script realtime shown in Figure 4. This script spawns inivar, flavor and halrun executables, together running for more than 400ms. Using programmatic API instead of spawning executables would make the processing more efficient. </div>
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisJVhtd9m0I0X1408LRXEmhyphenhyphennDL63v6Y9KetXifGurC2ZMPmfpPW1iqHVuBCXbgo4vWAECkzRoEDwdh8mh8nY5dCGGoIoZ8d3vYxx0MZnMLxterUG6vWW2vkkB8N7CbZM8MyTj26sbSpo/s1600/mk.tests.abs.0.2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="92" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisJVhtd9m0I0X1408LRXEmhyphenhyphennDL63v6Y9KetXifGurC2ZMPmfpPW1iqHVuBCXbgo4vWAECkzRoEDwdh8mh8nY5dCGGoIoZ8d3vYxx0MZnMLxterUG6vWW2vkkB8N7CbZM8MyTj26sbSpo/s320/mk.tests.abs.0.2.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 4: Repeated fork-exec-wait pattern.</td></tr>
</tbody></table>
Finally, there is a giant 2.5s sleep at the end of the test, shown in Figure 5. It represents therefore 75% of the test execution time. This kind of sleep is usually done for quick-n-dirty synchronization, but should be replaced by wait for the actual event required to execute as quickly as possible. This way, the time to run the unit tests can be decreased significantly, possibly representing greater productivity for developers.<br />
<div>
<br /></div>
<div>
Of course, if we had a user-space trace in addition to the kernel trace, we could have a greater insight of the internal state of machinekit. But even without, we can observe interesting behavior. The next step will be to actually plug and control some actual hardware in real-time.</div>
Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-35771839301168933482015-04-02T14:04:00.001-07:002015-04-02T14:34:48.131-07:00Why malware detection based on syscalls n-grams is unlikely to work<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In the field of computer security, it was suggested to use </span><a href="http://en.wikipedia.org/wiki/N-gram" style="line-height: 22.0799999237061px; text-decoration: none;"><span style="color: #1155cc; font-family: Arial; font-size: 15px; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">n-grams</span></a><span style="color: black; font-family: Arial; font-size: 15px; vertical-align: baseline; white-space: pre-wrap;"> of</span><span style="color: black; font-family: Arial; font-size: 15px; vertical-align: baseline; white-space: pre-wrap;"> system calls </span><span style="color: black; font-family: Arial; font-size: 15px; vertical-align: baseline; white-space: pre-wrap;">to detect malware running on a computer. Let's demonstrate that there is no relationship between n-gram counts and malicious behavior.</span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">False-negative example. The first trace is a weather app that opens a config file, then performs some remote procedure call to get the data, and then closes file descriptors. The other is the same sequence, but for a malicious app that gained root privilege by some means, and sends the content of the password file.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Trace A1:</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">open(“app.conf”); read(); connect(); send(); recv(); close(); close()</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Trace A2:</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">open(“/etc/passwd”); read(); connect(); send(); recv(); close(); close()</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The probability of each n-gram is trivially the same for the same sequence. Therefore, the algorithm fails to detect the malware.</span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">False-positive example. Both sequences are functionally equivalent and non-malicious, but the second trace reduces the maximum number of opened file descriptor at one time.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Trace B1: </span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">open(); open(); open(); read(); read(); read(); close(); close(); close()</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Trace B2: </span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span class="Apple-tab-span" style="white-space: pre;"> </span></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">open(); read(); close(); open(); read(); close(); open(); read(); close()</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Let’s use Python to generate all n-grams of size 3.</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">ngrams = [x for x in itertools.product(['open','read','close'],['open','read','close'],['open','read','close'])]</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> B1 B2</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> (B2 sym diff B1)</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> open open open 1 0 \empty</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> open open read 1 0 \empty</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> open open close 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> open read open 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> open read read 1 0 \empty</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> open read close 0 3 \empty</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> open close open 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> open close read 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> open close close 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> read open open 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> read open read 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> read open close 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> read read open 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> read read read 1 0 \empty</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> read read close 1 0 \empty</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> read close open 0 2 \empty</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> read close read 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> read close close 1 0 \empty</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">close open open 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">close open read 0 2 \empty</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">close open close 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">close read open 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">close read read 0 0 </span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">close read close 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">close close open 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">close close read 0 0</span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span><span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">close close close 1 0 \empty</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: 'Courier New'; font-size: 11px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">total 7 7</span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The n-gram counts for both sets are totally disjoints. Therefore, the algorithm will wrongly detect the trace B2 as malware code, while it is in fact the result of a simple optimization.</span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Moreover, a malicious app can trick the detection by calling dummy system calls randomly, which will swamp the potentially mallicious sequences. Because each n-gram probability will be non-statistically significant, the algorithm will fail to identify the change of behavior, and classify it as malware.</span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Therefore, we conclude that this technique is unlikely to </span><span style="font-family: Arial; font-size: 15px; line-height: 1.38; white-space: pre-wrap;">provide any benefits to identify malware automatically. </span></div>
Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-371268210281427302014-12-02T14:20:00.003-08:002014-12-05T13:09:03.752-08:00Full stack Python tracingThe project <a href="https://github.com/giraldeau/python-profile-ust" target="_blank">profile-python-ust</a> allows to record function call and return from Python interpreted code using LTTng-UST.
As an example, consider this foo.py code (calling itself bidon.py that just print something on the console):
<br />
<pre><code>
</code></pre>
<pre><code>import bidon</code></pre>
<pre><code>
def baz():
bidon.bidon()
def bar():
baz()
def foo():
bar()
if __name__=='__main__':
foo()
</code></pre>
The trace produced is:
<br />
<pre><code>
python:call: { cpu_id = 7 }, { co_name = "<module>" }
python:call: { cpu_id = 7 }, { co_name = "foo" }
python:call: { cpu_id = 7 }, { co_name = "bar" }
python:call: { cpu_id = 7 }, { co_name = "baz" }
python:return: { cpu_id = 7 }, { }
python:return: { cpu_id = 7 }, { }
python:return: { cpu_id = 7 }, { }
python:return: { cpu_id = 7 }, { }
</module></code></pre>
However, the tracing hooks into CPython are not called for external C modules. Therefore, we loose track of what is going on. For instance, <a href="http://cython.org/">Cython</a> compiles Python code as a native shared library for greater efficiency.
To overcome this limitation, I compiled the Cython foo.pyx code using GCC function instrumentation. Here is the setup.py file (The same code gets compiled twice, with and without instrumentation):
<br />
<pre><code>
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
ext = [
Extension('foo1', ['foo1.pyx'],
extra_compile_args = ['-g'],
),
Extension('foo2', ['foo2.pyx'],
extra_compile_args = ['-finstrument-functions', '-g'],
),
]
setup(
ext_modules = cythonize(ext),
)
</code></pre>
<br />
And then I invoked both modules foo1.foo() and foo2.foo(), with the lttng-ust-cyg-profile instrumentation, that hooks on the GCC instrumentation. I also enabled kernel tracing, just for fun. Both traces (kernel and user space) are loaded in an <a href="http://projects.eclipse.org/projects/tools.tracecompass">Trace Compass</a> experiment.<br />
<br />
We see in the screenshot below the users space stack according to time for the foo1 and foo2 executions. In the first execution, we see the entry point of inside foo1.so and the invocation of Python code from the library, but function calls inside foo1.so are not visible. In the second execution, both the Python and the GCC instrumentation are available, and the complete callstack is visible. The higlighted time range is about 6us and corresponds to the write() system call related to the print() on the console. We can see the system call duration and the blocking in the control flow view.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMeOpx7VIUTCnlHTbNgBRKYTOJwKrFB6ISM53Ha9hnEgDl6gsoRc4i8NsiYRUthKd8rhfuJqZA7KYEkNQUcfZZw3uR5JoUzy_b9_ur-5wda12jrrETi4X5AeA4ctkfGkPppYrgVtqwxCI/s1600/python-profile-ust-with-kernel.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMeOpx7VIUTCnlHTbNgBRKYTOJwKrFB6ISM53Ha9hnEgDl6gsoRc4i8NsiYRUthKd8rhfuJqZA7KYEkNQUcfZZw3uR5JoUzy_b9_ur-5wda12jrrETi4X5AeA4ctkfGkPppYrgVtqwxCI/s1600/python-profile-ust-with-kernel.png" height="339" width="640" /></a></div>
<br />
Unfortunately, symbols can't be resolved, because base address library loading can't be enabled at this time with Python instrumentation, due to this <a href="https://bugs.lttng.org/issues/865" target="_blank">bug</a>.<br />
<br />
This was an example how three sources of events finally create a complete picture of the execution!<br />
<br />
EDIT:<br />
<br />
A Cython module can call profile hook of the main interpreter. Call cython with the profile option, and compile the resulting code with -DCYTHON_TRACE=1. This can be done in setuptools with:<br />
<br />
<pre><code>
cythonize(ext_trace, compiler_directives={'profile': True}),
</code></pre>
<br />
Thanks to Stefan Behnel for this tip!Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-50797619843709531882014-11-19T10:00:00.003-08:002014-11-19T13:06:27.056-08:00How Linux changes the CPU FrequencyThe <i>ondemand</i> governor adjust the processor frequency making a trade-off between power and performance, and is the implementation of <a href="http://en.wikipedia.org/wiki/Dynamic_frequency_scaling" target="_blank">Dynamic Frequency Scaling</a> under Linux. However, it was also causing unexpected delays in my performance benchmarks. The following figure shows the delay density of RPC requests for the CPU frequency governor <i>ondemand</i> and <i>performance</i>. Delay variation is lower when the CPU frequency is fixed.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBcmdO12KMkC6yzEnrDuWKWsDkoOLwvuJSQlYRO2C53AB1VfkhRd1WRNZFhrd4tTre1bQESMGFhRfmiThhP4nidq5Z7D7staDhg7v1Y2jTbS1CTorJQf_oXgu8vchD6Db5u6o8b7sYhr0/s1600/request-delay-density.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBcmdO12KMkC6yzEnrDuWKWsDkoOLwvuJSQlYRO2C53AB1VfkhRd1WRNZFhrd4tTre1bQESMGFhRfmiThhP4nidq5Z7D7staDhg7v1Y2jTbS1CTorJQf_oXgu8vchD6Db5u6o8b7sYhr0/s1600/request-delay-density.png" height="247" width="400" /></a></div>
<br />
I wanted to visualize how the frequency changes according to time and scheduling. I used LTTng to record <span style="font-family: Courier New, Courier, monospace; font-size: x-small;">power_cpu_frequency</span> and <span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sched_switch</span> events. I loaded them in Trace Compass, and defined the analysis declaratively using XML (available <a href="https://github.com/giraldeau/workload-kit/blob/master/fsa/power-cpu-frequency.xml" target="_blank">here</a>). The CPU frequency is shown in shades of red (stronger means higher frequency) and in green whether the CPU is busy or idle. (you guessed correctly, holiday season is coming). We see in the following figure that it takes some time for the frequency to ramp-up.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjv7lRxIz2dmFhs062ggz8aS7yWZ4NA3zN_32UXC8nyzYc0g4h8Ct1lUamYi_CCvM-0p-1hLLen8IVeHS42oQnD8B9DS7BdpHQRsa_mTq0uCdQDUAbYLX_bGtnHSUNvZxz49ARmm3obAVA/s1600/cpufreq.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjv7lRxIz2dmFhs062ggz8aS7yWZ4NA3zN_32UXC8nyzYc0g4h8Ct1lUamYi_CCvM-0p-1hLLen8IVeHS42oQnD8B9DS7BdpHQRsa_mTq0uCdQDUAbYLX_bGtnHSUNvZxz49ARmm3obAVA/s1600/cpufreq.png" height="136" width="640" /></a></div>
<br />
<br />
In fact, changing the CPU frequency is equivalent to distorts the elapsed time of the execution. Timestamps in the trace are in nanoseconds instead of clock cycles, but to compare executions, the base should be clock cycles instead. We could scale the time using the frequency events to approximate the time in processor cycles.<br />
<br />
The task kworker, woken-up from a timer softirq, changes the CPU frequency. The kworker thread runs with a varying period, typically between 4ms and 20ms. The algorithm seems to take into account the average CPU usage, because if the CPU does rapid context switch between idle and busy, the frequency never reaches the maximum.<br />
<br />
Therefore, a task performing frequent blocking/running cycles may run faster if it shares the CPU with another task, because the average CPU usage will increase and frequency will be set accordingly. We could also estimate the relative power cost of a segment of computation with great precision.<br />
<br />
In conclusion, disabling frequency scaling for benchmarks should reduce variability. Here is how to do it on recent Linux kernels:<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">set_scaling_gov() {</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> gov=${1-performance}</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> for i in $(ls -1 /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor); do</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> echo ${gov} | sudo tee $i > /dev/null</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> done</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">}</span><br />
<br />
Here is the view for the whole trace. Happy hacking!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqFH_3wG8MAnFOvTslxzwoTjZl0UT5BcT84ZPHuYznzTuhXTdu0WlK7TRHrjMXL2wVoqWhdxyyznvo5PSDyg-bdXqESnFRCA3SY2tJMad7Zin4q-dArPAZypeFfqxVStBbbpJwBEAyMR4/s1600/Screenshot+from+2014-11-19+16:03:58.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqFH_3wG8MAnFOvTslxzwoTjZl0UT5BcT84ZPHuYznzTuhXTdu0WlK7TRHrjMXL2wVoqWhdxyyznvo5PSDyg-bdXqESnFRCA3SY2tJMad7Zin4q-dArPAZypeFfqxVStBbbpJwBEAyMR4/s1600/Screenshot+from+2014-11-19+16:03:58.png" height="280" width="640" /></a></div>
<br />Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-60596805746271755442014-10-19T05:51:00.000-07:002014-10-19T05:51:04.282-07:00LinuxCon 2014, Düsseldorf<div style="font-family: Helvetica; font-size: 11px;">
<h2>
<span style="letter-spacing: 0.0px;">Distributed Active Path from Kernel Trace</span></h2>
<span style="letter-spacing: 0.0px;">I presented recent work about the recovery of active path of a distributed program, only using a kernel trace. The result is a time-view of the execution across computers. One of the most interesting result is the Django Poll three-tiers app, including an Apache web server and a PostgreSQL database. Client is simulated by scripting with Mechanize. The screen shot shows the execution for the POST, where a large sync is visible near the end of the sequence, necessary for the update SQL statement. </span></div>
<div style="font-family: Helvetica; font-size: 11px; min-height: 13px;">
<span style="letter-spacing: 0.0px;"></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw4z5zVh0Zb_RfJUtk9B9sHXRU6qdlOmyPZrWES9qTDuYkFJUlTRGScvEJrj_J9vJ6ofg1iKY-TrnRjz7XJ-W0rE6DI7j864H2VWebZPrDPviLhX4plB4azygz4xqMX9H8W2WkHFR995w/s1600/django1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw4z5zVh0Zb_RfJUtk9B9sHXRU6qdlOmyPZrWES9qTDuYkFJUlTRGScvEJrj_J9vJ6ofg1iKY-TrnRjz7XJ-W0rE6DI7j864H2VWebZPrDPviLhX4plB4azygz4xqMX9H8W2WkHFR995w/s1600/django1.png" height="228" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4muGTPH_WB1QWJeJyZ45xVggC7BrtGYHKJRV8R6KeRIPW6vSkOlZehthopLiSx8yI_yREDt8DRqhr1CYS5hHi2hqtTc0VUKIGn5BnTnukcQe_oUyOhWi5nku_hGb-fq-G1y5ogAOO33s/s1600/django2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4muGTPH_WB1QWJeJyZ45xVggC7BrtGYHKJRV8R6KeRIPW6vSkOlZehthopLiSx8yI_yREDt8DRqhr1CYS5hHi2hqtTc0VUKIGn5BnTnukcQe_oUyOhWi5nku_hGb-fq-G1y5ogAOO33s/s1600/django2.png" height="240" width="320" /></a></div>
<br />
<br /></div>
<div style="font-family: Helvetica; font-size: 11px;">
<h2>
Declarative views from user-space trace with Trace Compas</h2>
<span style="letter-spacing: 0.0px;">Geneviève Bastien presented how to use LTTng-UST and Trace Compas to create a time view of the state of an MPI computation. The MPI program simulates unbalanced computation, where the green represent work, and red is wait at the barrier. Multiple cycles are shown, and all the views are synchronized, which is handy to analyse the application at multiple level at once. In the following figure, we see below that the wait at the barrier (in red) is actually implemented using a busy loop calling non-blocking poll() system call, shown on top. We observed this behavior with the OpenMPI library, and we think it is intended to reduce the latency at the expense of power and resource efficiency.</span><br />
<span style="letter-spacing: 0.0px;"><br /></span>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnIRj6-hCsCqozQ-L6iapXYlr0-c0z6XkYCYPUIRj5yvIamFebxGaS2o3Wb_8mbhQIopfzztrdYTbRs5Za_ho5CnIJnDxFW62A8TSxkqREiDLaoDp6-hFD7KqPI7Omdrn5GscFpAduCYw/s1600/mpi-imbalance.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnIRj6-hCsCqozQ-L6iapXYlr0-c0z6XkYCYPUIRj5yvIamFebxGaS2o3Wb_8mbhQIopfzztrdYTbRs5Za_ho5CnIJnDxFW62A8TSxkqREiDLaoDp6-hFD7KqPI7Omdrn5GscFpAduCYw/s1600/mpi-imbalance.png" height="113" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Below: MPI imbalance application state (green: compute, red: wait). Above: corresponding kernel trace, including a zoom on the busy wait.</td></tr>
</tbody></table>
<h2>
<span style="letter-spacing: 0.0px;">Preemption analysis across host and virtual machines</span></h2>
</div>
<div style="font-family: Helvetica; font-size: 11px;">
<span style="letter-spacing: 0.0px;">Execution in virtual machines is subject to higher level of variation in response time. Geneviève Bastien presented the virtual machine analysis, that shows the preemption crossing virtual machines boundaries, the research result of former student Mohamad Gebai. A process in the VM may appear to run while it is not, because the process representing its virtual CPU on the host is be preempted. The host and the VMs are traced at the same time, and traces are synchronized. The global view of the resources allows to pinpoint preemption cause given a task, whenever the other task is local, on the host or inside another virtual machine. The following figure shows qemu processes on the host and traces from within the host. We clearly see a gettimeofday() that is much longer that one would expect (13ms instead of ~200ns), and the reason is that there is another process, in another VM, that was running for that time, and the preemption cause is shown below.</span><br />
<span style="letter-spacing: 0.0px;"><br /></span>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggbmDLd_pdIQHICGpMmDVaDAQ75ck1Hm33Xwfk-n9aPS5h9O7f3CbZAUDPELIE8iVwWR4hFNRulDDYIbjNWG8QdZg619Oxz82IcBNHYK2RzRfxXVqtm6856735JallTKZ74sl3fEbgDRA/s1600/Screenshot+from+2014-10-19+08:35:35.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggbmDLd_pdIQHICGpMmDVaDAQ75ck1Hm33Xwfk-n9aPS5h9O7f3CbZAUDPELIE8iVwWR4hFNRulDDYIbjNWG8QdZg619Oxz82IcBNHYK2RzRfxXVqtm6856735JallTKZ74sl3fEbgDRA/s1600/Screenshot+from+2014-10-19+08:35:35.png" height="290" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Total view of the system, crossing virtual machine boundaries.</td></tr>
</tbody></table>
<span style="letter-spacing: 0.0px;"><br /></span></div>
<div style="font-family: Helvetica; font-size: 11px; min-height: 13px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 11px;">
<h2>
<span style="letter-spacing: 0.0px;">ftrace</span></h2>
</div>
<div style="font-family: Helvetica; font-size: 11px;">
<span style="letter-spacing: 0.0px;">Sten Rosted from RedHat presented the inner working of ftrace. It is a framework to hook into each function in the kernel. The kernel is compiled with gcc -pg option, that adds a call to mcount() at the beginning of each function, but this call is replaced by nops. The space left in each function header is patched at runtime to jump to the instrumentation. This framework is used by kprobe, but the ftrace backend can be used directly instead.</span></div>
<div style="font-family: Helvetica; font-size: 11px; min-height: 13px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 11px;">
<h2>
<span style="letter-spacing: 0.0px;">folly</span></h2>
</div>
<div style="font-family: Helvetica; font-size: 11px;">
<span style="letter-spacing: 0.0px;">Ben Maurer from Facebook presented efficiency and scalability issues they face in their environment. Performance in accepting connections is critical. He mentioned that getting the next available file descriptor is a linear search that hurts the performance. He also shows a method to decrease the number of scheduling switch of workers by replacing pthread condition wait, and improving load distribution among threads. I liked his quantitative approach, always evaluating both kernel and user-space solutions to the same problem.</span></div>
<div style="font-family: Helvetica; font-size: 11px; min-height: 13px;">
<span style="letter-spacing: 0.0px;"></span><br />
<h2>
Cycle accurate profiling using CoreSight hardware trace</h2>
</div>
<div style="font-family: Helvetica; font-size: 11px;">
<span style="letter-spacing: 0.0px;">Pawel Moll from ARM presented the CoreSight instruction tracing capabilities. The processor records branch taken and the number of cycles taken by each instructions. He demonstrated a prototype of cycle accurate (i.e. exact cycle count) profiler. The trace decoding is somewhat cumbersome, because the code memory must be reconstructed offline to resolve address symbols, and the complete toolchain requires proper integration, but still an impressive demonstration.</span></div>
<div style="font-family: Helvetica; font-size: 11px; min-height: 13px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<h2>
<span style="font-weight: normal;"><span style="font-family: inherit;">Thanks to NVIDIA!</span></span></h2>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;">I also got a JETSON TK1 big.LITTLE board from NVIDIA. I will be able to test heterogenous traces experiments (x86, arm), a very common case for mobile and cloud computing. I will also be able to test the CoreSight tracing. Thanks NVIDIA!</span></div>
<div>
<span style="letter-spacing: 0.0px;"><br /></span></div>
Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-75651606184219617202014-10-08T14:51:00.000-07:002014-10-08T14:51:00.572-07:00Ubuntu login analysisI'm tracing a lot these days, and in the viewer, I need to scroll a lot to get to the interesting stuff in the control flow view. Today, I checked all these processes for fun. In fact, it's related to landscape triggered on Ubuntu login. To display the message about system usage and updates (which I think is great), landcape calls numerous sub-programs, for a total of about 250ms on my system. It's something one should keep in mind if SSH commands are repeated in a tight loop: the maximum rate is about 4 commands per seconds. Hopefully, <a href="http://www.fabfile.org/" target="_blank">Fabric</a> reuses SSH connection! ;-)<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKEkyYzxFcyp3ZaZVRR64L-zwdjyRYN3OpUt6jrfY1kCMhIrIofD8C_7Jo8tPq7oP2EFQNioklmGkfjejukWJ9-0408MnQicc1nsZWa2L6vMnHj5KGYDxKjTyfuLEi7j-f4ANc3x9K9rc/s1600/login-prompt.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKEkyYzxFcyp3ZaZVRR64L-zwdjyRYN3OpUt6jrfY1kCMhIrIofD8C_7Jo8tPq7oP2EFQNioklmGkfjejukWJ9-0408MnQicc1nsZWa2L6vMnHj5KGYDxKjTyfuLEi7j-f4ANc3x9K9rc/s1600/login-prompt.png" height="257" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Ubuntu login greeting</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUnLTGTEddtTHz3xW0KRcUbyhEPAKjJnAHUoVBBeTs4c-4aiQWJxmz5_ay6iGrNcT27oqqaZ1-e-wXuL9ueYVXxUTTavMVrV2OROuqHA4BNCxKMwzWUELQiezTbxDQJj1HPLX-kLoKn50/s1600/landscape-login.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUnLTGTEddtTHz3xW0KRcUbyhEPAKjJnAHUoVBBeTs4c-4aiQWJxmz5_ay6iGrNcT27oqqaZ1-e-wXuL9ueYVXxUTTavMVrV2OROuqHA4BNCxKMwzWUELQiezTbxDQJj1HPLX-kLoKn50/s1600/landscape-login.png" height="250" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The active path of the computation of the login greeting</td></tr>
</tbody></table>
<br />
<br />Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-91543167735594513972014-08-06T08:01:00.001-07:002014-08-06T08:01:34.816-07:00Pinpoint memory usage of Java program with Eclipse Memory AnalyzerMy Java program was using a lot of memory. An object was holding a reference on a large temporary data structure, even after it was out of scope. It is indeed difficult to understand what prevent the garbage collection to do its job. Here is how I found out.<br />
<br />
<br />
<ul>
<li>Download and extract <a href="http://www.eclipse.org/mat/" target="_blank">Eclipse Memory Analyzer</a> (MAT).</li>
<li>Download the <a href="https://github.com/giraldeau/org.eclipse.linuxtools/blob/graph/lttng/org.eclipse.linuxtools.tmf.analysis.graph.core.tests/src/org/eclipse/linuxtools/tmf/analysis/graph/core/tests/staging/HeapDump.java" target="_blank">HeapDump.java</a> class and insert a call to HeapDump.dumpHeap() after the point your think the memory should be released.</li>
<li>Run the instrumented program. The resulting file size may be rather large (seems to be the hole VM memory), so make sure you have enough free space.</li>
<li>Open it in MAT. (Notice: I was experiencing a hang bug on Ubuntu 14.04, dismiss the initial report dialog and do not generate HTML reports, it seems to trigger the bug). Click on Dominator Tree.</li>
<li>Once you found the offending object, right click on it and select Path to GC Roots > with all references. This will display the references that prevent the GC to free this object.</li>
<li>Fix the code and you can start over to verify that the problem is indeed fixed ;-)</li>
</ul>
<br />
<br />
The following screenshots shows an example of the tool usage for my problem. The class TmfNetworkEventsMatching matches send to receive network events in a trace. Unmatched packets are kept temporarily in memory. However, the old unmatched packets were never freed. In the Dominator Tree, we see that TmfNetworkEventsMatching is 40 bytes (Shallow Heap), but holds around 1.2MB of objects (Retained Heap). The Path to GC Roots highlights that a reference of TmfNetworkEventMatching is kept inside TmfRequestExecutor. In this case, a missing call to dispose() was preventing the release of resources. Other resources are auto-closable, except for one of them, which needs manual call to dispose(). Thank you MAT!<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmAjGcXWk6RMWtXTNYfEgBKhDr8dr1fHJ6D6qwyK4CO9Q2Euh5AskcJmMXhQkigEX8XKd0Sf6DZtQjHiym8jp42UjT7gtEKBWPoaCpJELVeBDEtdtV0ipwoMwWaP4GiWn4BMDh4EUhfjU/s1600/Screenshot+from+2014-08-06+10:43:09.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmAjGcXWk6RMWtXTNYfEgBKhDr8dr1fHJ6D6qwyK4CO9Q2Euh5AskcJmMXhQkigEX8XKd0Sf6DZtQjHiym8jp42UjT7gtEKBWPoaCpJELVeBDEtdtV0ipwoMwWaP4GiWn4BMDh4EUhfjU/s1600/Screenshot+from+2014-08-06+10:43:09.png" height="395" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Dominator Tree</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyJm0SopGUjIMpFUVxody326U3Uhfi-29v5MTvaKqvOMPA4GihsY3zClzEM4DD4h4z4sn9300RQHetKfF5EPP0Ws8sq9PlkLev6jQ8nsW_YnBWuyC8tNioFCRB64Pxz4U21QdztdkHJDw/s1600/Screenshot+from+2014-08-06+10:43:28.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyJm0SopGUjIMpFUVxody326U3Uhfi-29v5MTvaKqvOMPA4GihsY3zClzEM4DD4h4z4sn9300RQHetKfF5EPP0Ws8sq9PlkLev6jQ8nsW_YnBWuyC8tNioFCRB64Pxz4U21QdztdkHJDw/s1600/Screenshot+from+2014-08-06+10:43:28.png" height="396" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Path to GC Roots</td></tr>
</tbody></table>
<br />Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-58205766735181467002014-05-29T15:03:00.001-07:002014-05-29T15:03:18.198-07:00Performance comparison of CTF trace readersI'm implementing a kernel trace analysis using the Java CTF reader, and I wanted to know if I would increase the performance by going with the babeltrace reader in C. Here are my findings.<br />
<br />
One of the expansive step of trace analysis is the actual trace reading. It consists to read the binary trace and retrieve the timestamps, the event type and the value of each fields. The event is then added to a priority queue for reading multiple streams in order. We compare three trace readers, namely the Babeltrace reader 1.2 (C implementation, dummy output format), the Java CTF reader and the Eclipse TMFEventRequest (which uses the Java CTF reader as back-end). We compare the trace reading in cache cold conditions (drop_caches prior to reading) to when the trace is resident in page cache. I ensure that the lazy loading of fields values are effectively loaded in all cases, but without printing the content. The trace size is 855MB and contains 22 million kernel events from a Django web app benchmark (recorded with lttng), and it takes between 20s and 80s to process the trace. The drive used is an SSD Crucial M500 1TB, and the host is an i7-4770 with 32GB of RAM. The following figure shows the performance results in thousands of events per second.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiU1_Y1E51P3EjZZlmvl9K71tUQQa5FwiKYpO3ZXMlajZwCv0uXu2YNPv2IHd5nC-vgfJH-mTG5bCEELSoVpJ4rNXAK88yAC5IPB6sbsZYTHTw2MJcGtMORWp3XCUG5Waa3xAqqhoSWVM/s1600/perf-cmp.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiU1_Y1E51P3EjZZlmvl9K71tUQQa5FwiKYpO3ZXMlajZwCv0uXu2YNPv2IHd5nC-vgfJH-mTG5bCEELSoVpJ4rNXAK88yAC5IPB6sbsZYTHTw2MJcGtMORWp3XCUG5Waa3xAqqhoSWVM/s1600/perf-cmp.png" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
We observe that babeltrace is roughly 2 times faster than the CTFTraceReader in Java, and about 3.5 times faster than TmfEventRequest. I measured also the CPU usage, and the processing is mostly serial and single threaded in every cases, so babeltrace seems effectively more efficient.<br />
<br />
The other important observation is that trace parsing is CPU bound, not I/O bound. Reading a trace from a hard drive is certainly I/O bound, but it's not the case with SSD. The difference between cache cold and cache hot is between 2% and 5% either in Java or in C. It means we could probably speed-up the reading using parallel threads.<br />
<br />
In conclusion, both the C and the Java libraries provides acceptable performance level for CTF trace reading. For high-performance processing, the babeltrace library is preferable. The CTFTraceReader provides nonetheless a good performance, and may be a compromise between the rapid development and runtime efficiency. If abstracting the trace source is a requirement, then TMF is what you need, but you now know the cost of this additional flexibility. Yours to choose!<br />
<br />
Notice: the TMFEventRequest can be done either with ExecutionType.FOREGROUND or ExecutionType.BACKGROUND. The background parameter introduce delays (sleeps) and reduces the throughput. Use the foreground parameter to reduce processing time and improve UI interactivity, but uses background for low priority processing.Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-49870583909673047952013-12-05T11:01:00.003-08:002013-12-05T11:53:57.383-08:00System-level profiling of APTAPT is the package manager of Debian based distributions. Understanding where the time is spent installing a package is difficult, because there are numerous process involved. We discuss limitations of profiler tools and present system-level trace analysis that highlight waiting between tasks.<br />
<br />
Callgrind is a profiler that helps to find hotspot in the application. One limitation of a profiler is that the time spent waiting is not taken into account, for instance I/O, resources or other tasks. This situation is shown in figure 1, where we traced the command "sudo apt-get install tree --yes". The actual computation done in the apt-get process accounts for 63% of the elapsed time, while waiting time is about 37%. The callgrind results covers only instructions from the application, not the waiting time.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrVlUWD34Qr11pEVrd3OEOmMV-pyf8F0jXCwFQRFhahmhnyKVgSwfh0JmZVUFcT0-XE8ieR6pfkYehY8FmuhDfrC0y4UqvT-MtEqfnto2MWmWkymrK7UhH08Y_7P0nL1E4iD0q6Jqri3Q/s1600/apt-overview.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="356" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrVlUWD34Qr11pEVrd3OEOmMV-pyf8F0jXCwFQRFhahmhnyKVgSwfh0JmZVUFcT0-XE8ieR6pfkYehY8FmuhDfrC0y4UqvT-MtEqfnto2MWmWkymrK7UhH08Y_7P0nL1E4iD0q6Jqri3Q/s640/apt-overview.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1: State of apt-get process according to time.</td></tr>
</tbody></table>
<br />
How to effectively understand what the application was waiting for? For this matter, we performed a waiting dependency analysis on all tasks on the system. The event sched_wakeup indicates the waiting source. Figure 2 shows the result of waiting dependencies among tasks for the command.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjS6-GpMuE1B-9S7lIFcjUKJJiToKdoDaBSKpjgJiU6jYU5NXIu_JCxfUGrJqN2ViIBAwbGmU4HXNDNH-Rpny9TXcRMDWGQSTX6EleEqtTOoC2rIUCpLsRcJDeoZE2jZtM1VUpu2PJVovo/s1600/apt-get-graph-0.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjS6-GpMuE1B-9S7lIFcjUKJJiToKdoDaBSKpjgJiU6jYU5NXIu_JCxfUGrJqN2ViIBAwbGmU4HXNDNH-Rpny9TXcRMDWGQSTX6EleEqtTOoC2rIUCpLsRcJDeoZE2jZtM1VUpu2PJVovo/s640/apt-get-graph-0.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 13px;">Figure 2: Overview of apt-get waiting dependencies according to time.</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
The analysis resolves waiting time recursively. In this run, apt-get process is waiting for dpkg, and it accounts for about 20% of the execution time. Figure 3 shows the wait I/O that occurs in dpkg, up to the kernel thread responsible for the file system journal of the partition on which the I/O is performed.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoTYGaB5NtedlSr73zr5Qfx_Vi8oif8KQXhhIChrLlYfT3E5Zu15W-Ys2VbngdmcJVnBFvI89gvOD7uIBkNUF-eFswjtVBNiegKmBglt8SfuIIFT3_BFhB1QFY66xIYphx0MU9S9OWzTM/s1600/apt-get-graph-1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="106" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoTYGaB5NtedlSr73zr5Qfx_Vi8oif8KQXhhIChrLlYfT3E5Zu15W-Ys2VbngdmcJVnBFvI89gvOD7uIBkNUF-eFswjtVBNiegKmBglt8SfuIIFT3_BFhB1QFY66xIYphx0MU9S9OWzTM/s640/apt-get-graph-1.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3: I/O wait of dpkg according to time.</td></tr>
</tbody></table>
Another interesting part of the execution is the update of man database performed by mandb. Figure 4 shows that about 200 threads are spawned. In this case, the overhead of starting the threads and setup communication pipe is high compared to the amount of actual work, in consequence running time could be greatly improved by using a thread pool instead.<br />
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0l0yWPCWMhyphenhyphenVe7YUgmbAg63jsvQKo-Gn5kuEfAXULxrxL4mJwddaHqdcwtifKKqncA3oVbGecCNp8t6eOMlIi7FOe04522aKLQ5GzHmJr1jtjklEuATNdxT7t3VlrcQfT3BMOEZ97Mpc/s1600/apt2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0l0yWPCWMhyphenhyphenVe7YUgmbAg63jsvQKo-Gn5kuEfAXULxrxL4mJwddaHqdcwtifKKqncA3oVbGecCNp8t6eOMlIi7FOe04522aKLQ5GzHmJr1jtjklEuATNdxT7t3VlrcQfT3BMOEZ97Mpc/s640/apt2.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 4: Execution of mandb according to time.</td></tr>
</tbody></table>
We located the code responsible for creating all these threads and processes using perf callchain. It seems to be related to how the library libpipeline is working. Here is an example of an interesting call stack:<br />
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">7ffff729a30b __execvpe (/lib/x86_64-linux-gnu/libc-2.17.so)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">7ffff75a4de4 pipecmd_exec (/usr/lib/x86_64-linux-gnu/libpipeline.so.1.2.2)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">409813 [unknown] (/usr/local/bin/mandb)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">7ffff75a4e59 pipecmd_exec (/usr/lib/x86_64-linux-gnu/libpipeline.so.1.2.2)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">7ffff75a662a pipeline_start (/usr/lib/x86_64-linux-gnu/libpipeline.so.1.2.2)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">408d96 find_name (/usr/local/bin/mandb)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">40449c test_manfile (/usr/local/bin/mandb)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">404708 testmandirs (/usr/local/bin/mandb)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">405035 update_db (/usr/local/bin/mandb)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">40a1c6 mandb (/usr/local/bin/mandb)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">40a4fb process_manpath (/usr/local/bin/mandb)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">403777 main (/usr/local/bin/mandb)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">7ffff71f9ea5 __libc_start_main (/lib/x86_64-linux-gnu/libc-2.17.so)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">403b09 _start (/usr/local/bin/mandb)</span></div>
</div>
<div>
<br /></div>
<div>
At the end of apt-get execution, there is a sleep of 500 ms, shown in blue in the figure 2. Here is the location in the code of this nanosleep, again using perf callchain feature: </div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">ffffffff816cc0c0 __schedule ([kernel.kallsyms])</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">ffffffff816cc799 schedule ([kernel.kallsyms])</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">ffffffff816cb66c do_nanosleep ([kernel.kallsyms])</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">ffffffff810827f9 hrtimer_nanosleep ([kernel.kallsyms])</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">ffffffff8108292e sys_nanosleep ([kernel.kallsyms])</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">ffffffff816d616f tracesys ([kernel.kallsyms])</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">7ffff72978c0 __GI___libc_nanosleep (/lib/x86_64-linux-gnu/libc-2.17.so)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">7ffff7b8ac95 pkgDPkgPM::DoTerminalPty(int) (/usr/lib/x86_64-linux-gnu/libapt-pkg.so.4.12.0)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">7ffff7b93720 pkgDPkgPM::Go(int) (/usr/lib/x86_64-linux-gnu/libapt-pkg.so.4.12.0)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">7ffff7b26585 pkgPackageManager::DoInstallPostFork(int) (/usr/lib/x86_64-linux-gnu/libapt-pkg.so.4.12.0)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">415a1b [unknown] (/usr/bin/apt-get)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">418dfc [unknown] (/usr/bin/apt-get)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">7ffff7af96c2 CommandLine::DispatchArg(CommandLine::Dispatch*, bool) (/usr/lib/x86_64-linux-gnu/libapt-pkg.so.4.12.0)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">40a2ec [unknown] (/usr/bin/apt-get)</span></div>
</div>
<div>
<br /></div>
<div>
Here is the detail of pkgDPkgPM::DoTerminalPty() function from apt/apt-pkg/deb/dpkgpm.cc:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">/*</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> * read the terminal pty and write log</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> */</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">void pkgDPkgPM::DoTerminalPty(int master)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">{</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> unsigned char term_buf[1024] = {0,0, };</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> ssize_t len=read(master, term_buf, sizeof(term_buf));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> if(len == -1 && errno == EIO)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> // this happens when the child is about to exit, we</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> // give it time to actually exit, otherwise we run</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> // into a race so we sleep for half a second.</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> struct timespec sleepfor = { 0, 500000000 };</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> nanosleep(&sleepfor, NULL);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> return;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> if(len <= 0)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> return;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> FileFd::Write(1, term_buf, len);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> if(d->term_out)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> fwrite(term_buf, len, sizeof(char), d->term_out);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">}</span></div>
</div>
<div>
<br /></div>
<div>
The comment mentions a possible race, but the caller may be fixed to account for this situation. IMHO, this function should not sleep if an error occurs while reading the file descriptor, and should instead return immediately.</div>
<div>
<br /></div>
<div>
This was a small example of what can be achieved with the waiting dependency analysis from a kernel trace. The analysis plug-in will be available soon in Eclipse Linux Tools, and uses standard tracepoints from the Linux kernel, recorded with LTTng. I hope that this information will help the community to improve performance of installing packages with APT. </div>
<div>
<br /></div>
<div>
Cheers!</div>
<div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSP_BDoG85KeIUGaaNMb7eF-2szCJcLejbrlaRQtTCm8hpsoq7UlwVyHwUKGP2nSkci2oo8mYejC4UshXs-F51cXMsiamir7zVg01SIJjO0lDRmwM-QEkehxDVwCfSeb1gz-BBAjjo9mg/s1600/apt-plate.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="193" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSP_BDoG85KeIUGaaNMb7eF-2szCJcLejbrlaRQtTCm8hpsoq7UlwVyHwUKGP2nSkci2oo8mYejC4UshXs-F51cXMsiamir7zVg01SIJjO0lDRmwM-QEkehxDVwCfSeb1gz-BBAjjo9mg/s320/apt-plate.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The APT licence plate! (thanks to Yannick)</td></tr>
</tbody></table>
<br /></div>
<div>
<br /></div>
Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com1tag:blogger.com,1999:blog-1864509792735323218.post-19646921401908815342013-06-07T08:08:00.001-07:002013-06-07T08:08:04.197-07:00How slow is Java reflection?I use Java reflection for trace analysis, as a way to call hooks when specific events are encountered. I wanted to know the overheard related to using such dynamic invocation. First, here is a small example of a trace event handler.<br />
<br />
<pre>public class FooTraceEventHandler extends TraceEventHandlerBase {
public FooTraceEventHandler() {
super();
this.hooks.add(new TraceHook("sched_switch"));
}
public void handle_sched_switch(TraceReader reader,
CtfTmfEvent event) {
// Process event
}
}
</pre>
<br />
By convention, the function "handle_sched_switch" is called only with an instance of "sched_switch" event as argument. The magic happens by first getting a reference the method by it's name (done once), and then call it dynamically later in the processing path, similar to the following small example.<br />
<br />
<pre>method = dummy.getClass().getMethod("foo", (Class[]) null);
...
method.invoke(dummy);
</pre>
<br />
To compare dynamic invocation to compile-time method call, I did a trivial <a href="https://github.com/giraldeau/aka/blob/master/org.eclipse.linuxtools.lttng2.kernel.aka/src/org/lttng/studio/utils/MethodBench.java" target="_blank">benchmark program</a>. Here are the results for 10 billion calls.
<br />
<br />
<pre>StaticMethodBenchmark 9,653000
DynamicMethodBenchmark 47,625000
</pre>
<br />
This experiment shows that dynamic invocation is roughly 5x slower than compile-time method call. Dynamic invocations is also weaker, because a missing method for the corresponding hook will be detected only at runtime.<br />
<br />
There are benefits to using hooks. The test for the event type is done only once for all handlers. It avoid quite a lot of repetition. Also, it allows to list events required for a specific analysis, and then enable only this subset in the tracing session, reducing the runtime overhead and the trace size compared to enabling all events. So, we need to preserve these properties, but with compile-time invocations.Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-81471395928292397692013-06-04T11:58:00.001-07:002013-06-04T11:58:57.053-07:00How-to setup LTTng on Fedora 19Here is how to install lttng including kernel and userspace tracing on Fedora 19 beta.<br />
<br />
Instructions uses sudo. To configure sudo on your fedora system, use this command as root, and replace [username] with your username, then logout and login again:</username><br />
<br />
<pre>usermod [username] -a -G wheel</username></pre>
<h3>
<br /></h3>
<h3>
1 Build and install kernel modules</h3>
<h4>
<br /></h4>
<h4>
1.1 Perform any pending updates:</h4>
<div>
<pre>sudo yum update</pre>
</div>
<br />
If the kernel is updated, reboot the machine.<br />
<br />
<h4>
1.2 Install development tools:</h4>
<pre>sudo yum group install "Development tools"</pre>
<pre>
</pre>
<h4>
1.3 Fix kernel headers location</h4>
<div>
At the time of writing this How-to, the version of the kernel-devel package doesn't exactly matches the running kernel (uname -r). We worked around this issue by copying headers in the directory matching the version of the running kernel. The version numbers are likely to change (and this issue may be solved), so adapt it your situation.<br />
<br />
First, check if this applies to your system:<br />
<br /></div>
<pre>rpm -qa | grep kernel-devel</pre>
<pre>kernel-devel-3.9.4-300.fc19.x86_64
</pre>
<pre></pre>
<pre>uname -r
3.9.4-301.fc19.x86_64</pre>
<br />
If versions do not match, you can fix (hack) it like this:
<br />
<br />
<pre>sudo su -
cd /usr/src/kernels
cp -a 3.9.4-300.fc19.x86_64 3.9.4-301.fc19.x86_64</pre>
<h4>
<br /></h4>
<h4>
1.4 Download lttng-modules</h4>
<div>
We need to install lttng-modules 2.2.0-rc2 (or later) for compatibility with linux 3.9.4.<br />
<br /></div>
<pre>wget http://lttng.org/files/lttng-modules/lttng-modules-2.2.0-rc2.tar.bz2
tar -xjvf lttng-modules-2.2.0-rc2.tar.bz2
cd lttng-modules-2.2.0-rc2/
KERNELDIR=/usr/src/kernels/$(uname -r) make
sudo KERNELDIR=/usr/src/kernels/$(uname -r) make modules_install
sudo depmod -a
sudo modprobe lttng-tracer
</pre>
<div>
<br />
The module lttng_tracer should be loaded and visible with lsmod. </div>
<h3>
<br /></h3>
<h3>
Step 2: Install lttng packages</h3>
<pre>sudo yum install lttng-tools lttng-ust babeltrace lttng-ust-devel</pre>
<br />
Now, reboot the machine. The service lttng-sessiond should be started automatically, and all required modules should be loaded. Here is some checks that can be done:<br />
<br />
<pre>$ lsmod | grep lttng_tracer
lttng_tracer 558145 24 lttng_ring_buffer_client_discard,lttng_probe_jbd2,lttng_probe_kmem,lttng_probe_napi,lttng_probe_scsi,lttng_probe_sock,lttng_ring_buffer_client_mmap_overwrite,lttng_probe_statedump,lttng_ring_buffer_metadata_client,lttng_ring_buffer_client_mmap_discard,lttng_probe_irq,lttng_probe_kvm,lttng_probe_net,lttng_probe_skb,lttng_probe_udp,lttng_ring_buffer_metadata_mmap_client,lttng_probe_module,lttng_probe_signal,lttng_probe_vmscan,lttng_probe_block,lttng_ring_buffer_client_overwrite,lttng_probe_power,lttng_probe_sched,lttng_probe_timer
lttng_lib_ring_buffer 50443 7 lttng_ring_buffer_client_discard,lttng_ring_buffer_client_mmap_overwrite,lttng_ring_buffer_metadata_client,lttng_ring_buffer_client_mmap_discard,lttng_tracer,lttng_ring_buffer_metadata_mmap_client,lttng_ring_buffer_client_overwrite
lttng_statedump 30537 1 lttng_tracer
lttng_ftrace 13097 1 lttng_tracer
lttng_kprobes 12911 1 lttng_tracer
lttng_kretprobes 13224 1 lttng_tracer
$ sudo service lttng-sessiond status
Redirecting to /bin/systemctl status lttng-sessiond.service
lttng-sessiond.service - LTTng 2.x central tracing registry session daemon
Loaded: loaded (/usr/lib/systemd/system/lttng-sessiond.service; enabled)
Active: active (running) since Tue 2013-06-04 14:48:53 EDT; 1min 8s ago
Process: 385 ExecStart=/usr/bin/lttng-sessiond -d (code=exited, status=0/SUCCESS)
Main PID: 418 (lttng-sessiond)
CGroup: name=systemd:/system/lttng-sessiond.service
└─418 /usr/bin/lttng-sessiond -d
</pre>
<br />
<h3>
Step 3: Make traces with workload-kit</h3>
<div>
One way to make traces is to use workload-kit. Here is how to install it from sources.</div>
<pre>wget http://secretaire.dorsal.polymtl.ca/~fgiraldeau/workload-kit/workload-kit-0.1.13.tar.bz2
tar -xjvf workload-kit-0.1.13.tar.bz2
cd workload-kit-0.1.13/
./configure --disable-cputemp
make -j12
sudo make install
</pre>
Then, the tool lttng-simple can be used to generate a UST trace or a kernel trace for instance:
<br />
<pre>lttng-simple -k -s -- sleep 1
babeltrace sleep-k/
lttng-simple -u -- /usr/local/share/workload-kit/scripts/ust/wk-heartbeat
babeltrace wk-heartbeat-u/
</pre>
Happy tracing!Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-35158998719550792102013-05-31T08:22:00.001-07:002013-05-31T08:22:55.025-07:00Tracing userspace memory allocationYou may wonder what are the actual memory allocations in a program according to time. Here is a method to trace these operations with the LTTng User Space Tracer (UST). Events are recorded by overloading malloc() and free() with a library that is preloaded with LD_PRELOAD. This library is called "liblttng-ust-libc-wrapper.so". To easy the actual tracing session setup, this feature has been added to the lttng-simple utility of workload-kit v0.1.13. By specifying the option "--enable-libc-wrapper", then the environment is set automatically. Here is quick demo to look at the memory allocations of gnome-calculator. (output of babeltrace is simplified for clarity)<br />
<br />
<pre>// trace the program
$ lttng-simple -u --enable-libc-wrapper -- gnome-calculator
// display events
$ babeltrace gnome-calculator-u/
[10:48:49.808255177] malloc: { size = 40, ptr = 0xCA8220 }
[10:48:49.808256389] free: { ptr = 0xCA8220 }
...
// count events and look for lost events
$ javeltrace analyze -i ctf gnome-calculator-u/...
number of streams 1
number of packets 1,058
number of events 181,879
number of CPUs 8
number of lost events 0
</pre>
<br />
This experiment shows that there are about 182 <i>thousands</i> memory allocation events to start a calculator!<br />
<br />
Checking for lost events with javeltrace is recommended. The default buffer size in UST is quite small, and in consequence lost events occurs even with very light load, like starting the calculator with memory allocation tracing. The tool lttng-simple setups buffers of 16MB to make it safer.<br />
<br />
The next step is of course to display this data according to time and compare it to physical pages allocated by the operating system with a kernel trace.<br />
<br />
Happy hacking!Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-10931584180368994592013-03-28T09:00:00.002-07:002013-03-28T09:00:37.460-07:00Augedit prototypeAugeas is a great tool for string to tree transformation. I created a GUI to make it visual and more interactive. The prototype app is called Augedit. It displays the Augeas tree on the left side, and on the the string representation on the right side. By selecting a node in the tree, the label and the value matched are highlighted in the text view. Here is an example with a trivial key/value lens, one record per line.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyolGUsgQo3vcisyZno9rrjMWm-mWELzNmNU5raTxvRjdZRmnU_EdvRHqdUZUkmyAC-Bmit2iWwzXvfuDJ7sgDm-dIgTHKtMYBVbhW26MjGSfuILkwRzmoWLAIYDJRgXbUzjrr-afI0Uc/s1600/Capture+du+2013-03-28+11:29:57.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyolGUsgQo3vcisyZno9rrjMWm-mWELzNmNU5raTxvRjdZRmnU_EdvRHqdUZUkmyAC-Bmit2iWwzXvfuDJ7sgDm-dIgTHKtMYBVbhW26MjGSfuILkwRzmoWLAIYDJRgXbUzjrr-afI0Uc/s320/Capture+du+2013-03-28+11:29:57.png" width="320" /></a></div>
<br />
Behind the scene, it uses span information of the selected path to recover the file content and highlight the key and the value. The app is in Vala and uses Gtk toolkit.<br />
<br />
The prototype as of now is only a viewer. This could be extended in many ways:<br />
<br />
<ul>
<li>Edition capabilities, where the tree and the text are synchronized as the user types</li>
<li>Providing a view to query and display results</li>
<li>XPath expression builder, so that they can be validated and then used in scripts</li>
<li>Highlight the whole match for a sub-tree The information is available from the span, but I haven't been able to set alpha channel for colors in Gtk.TextView. Using pango/cairo/webkit based widget to display the text may be the solution.</li>
</ul>
<div>
Vala is quite nice, I like the idea of having bare metal binaries and the integration with Augeas with the bindings works great, documentation on valadoc.org is excellent. However, GCC complains because of the C code generated and it's a bit annoying part of it is my fault (using deprecated stuff, like Gtk.Box() for instance). For more code examples, I'm using yorba projects as reference, like geary and shotwell. Finally, debugging is not as convenient, because of the additional layer of code generation.</div>
<div>
<br /></div>
<div>
The code is on github: <a href="https://github.com/giraldeau/augedit">https://github.com/giraldeau/augedit</a></div>
<div>
<br /></div>
<div>
Have fun!</div>
<div>
<br /></div>
Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-83888852711185795582013-03-28T07:24:00.000-07:002013-03-28T07:24:25.609-07:00Presenter assistsI was a volunteer at EclipseCon 2013. One of the task consists to help the speaker keep on time, by displaying cardboards showing the remaining time of the session. For fun, I and François Rajotte made a small QML app for that purpose (and most of all to experiment with QML). The app is based on a timer, with color transition when remaining time goes under some threshold. It is really hacky, but it works!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkcp3A_LJIdamidHyUBRbShquKfI03q1nRVLiU1BfIHP9EgJr8Ax1sItn35K6kSUc_lm2zsxKlluP1RDtqaS8iBQ-d0-MDaFOqABaVjUVsUFO7CCXcsDyyyZE-4Q87KR27xfOXxidA23Q/s1600/Capture+du+2013-03-28+10:05:34.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="160" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkcp3A_LJIdamidHyUBRbShquKfI03q1nRVLiU1BfIHP9EgJr8Ax1sItn35K6kSUc_lm2zsxKlluP1RDtqaS8iBQ-d0-MDaFOqABaVjUVsUFO7CCXcsDyyyZE-4Q87KR27xfOXxidA23Q/s320/Capture+du+2013-03-28+10:05:34.png" width="320" /></a></div>
<br />
Usage:<br />
<br />
<ul>
<li>Start or reset the timer: click anywhere in the window</li>
<li>Full screen: press "F"</li>
<li>Quit: press "Q"</li>
</ul>
<br />
We definitely should be able to set the start time (not all sessions have the same length) and the list of threshold with custom colors should be customizable. It requires an additional UI component, and those settings should be saved.<br />
<br />
As for QML, it's quite cool. The rendering is flicker free and it's quite easy to make animations. We had one issue with keyboard events. We extended a QAction and overloaded the method eventFilter to get key events. The problem is the method always returned "true", as if we should stop forwarding the event to other event filters. It turns out that it was preventing QML to behave properly, because its handler runs after our own. Be aware!<br />
<br />
The code is on github: <a href="https://github.com/giraldeau/presenter-assists">https://github.com/giraldeau/presenter-assists</a>Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-31754623185571123052012-03-28T07:03:00.001-07:002012-03-28T07:20:23.910-07:00Paradyn Week 2012 summary<a href="http://www.dyninst.org/meetings/2012" target="_blank">Paradyn Week 2012</a> was held at the <a href="http://www.umd.edu/" target="_blank">University of Maryland</a>. I went there to see what other folks were doing in the area of debugging and runtime software monitoring with executable binaries modification. I wanted to compare that approach with our current tracing methods.<br />
<br />
Dyninst is a set of tools to modify assembly code. It's like a swiss knife for binaries. It can disassemble them, analyse instructions, recover control flow, and then insert or delete code. It generates code for x86, amd64, ppc32 and ppc64 and few other exotic architectures.<br />
<br />
The first presentation was about data flow analysis. The most interesting part in my point of view was the liveness analysis. It computes the set of registers read or written in a function. While tracing, only the subset of registers that actualy contains values are saved. The performance benefit has not yet been evaluated, but the hypothesis is that if fewer registers are saved at each tracepoint, then it will lower the overhead. But, since compilers tries to use as much registers as possible and that under x86 there are very few general purpose registers, probably the number of live registers, on average, is close to the entire set of registers.<br />
<br />
Then, there were a talk of Josh Stone from Redhat about integrating Dyninst with SystemTap. The core idea is to use STP scripts and compile them as shared library to instrument userspace applications. The current implementation is able to connect static tracepoints probes to handler functions. For this proof of concept, the handler prints the function name to the console.<br />
<br />
There were a talk about debugging at extreme scale. The scale considered is 100k nodes and 1M cores. The idea is to script the debugging phase instead of trying to use interactive debugger. The debugging script is added to the cluster scheduler queue to run later. MRNet, a tree network overlay library, is used to control the debugger. A benchmark shows that setting a breakpoint on 1M cores takes about 200ms! It's also used to gather results with a reduction algorithm to make it also scalable, instead of using only concatenation of results.<br />
<br />
The self-propelled instrumentation presentation was about instrumenting distributed applications. It follows the control path and instrument on the fly the application. All the instrumentation is in userspace, using executable patching. If I understood well, when a client connects to a server on a different machine, then a background ssh connection is made to the other host, the instrumentation is injected to the peer process, and the program continues. The instrumentation itself is a callback that is attached to a function. For the demo, the function was performing printf.<br />
<br />
Besides that, people were so welcoming, and it was a great pleasure to share ideas. Cherry blossom are everywhere here in Washington and College Park, we should definetely have more in Montreal!Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-32417857340941349372012-01-24T20:29:00.000-08:002012-02-02T10:58:07.254-08:00TCP socket blocking behaviorHere are some experiments results of blocking behavior of TCP applications. Blocking is a special state in which a process is waiting for I/O and removed from the scheduler queue. When the data is available, the process is woken up. In the case of distributed applications, the network latency is likely to cause blocking. To understand the behavior of such applications, a small experiment is made with netcat on Linux. System events are recorded with LTTng and network events are recorded with tcpdump. To simulate the network latency, the Linux traffic shaper is used. Here is the script used as the test case.<br />
<br />
<pre>#!/bin/sh
tc qdisc add dev lo root netem delay 100ms
netcat -l localhost 8765 > /dev/null &;
echo "lttng" | netcat localhost 8765
tc qdisc del dev lo root
</pre><br />
Here, the traffic shapping command sets the latency of packets to 100ms on the loop-back interface. The first netcat process is configured to listen on port 8765. The server exits immediately when a message is received. Next, the netcat client is spawn and a small string is transfered. The last command removes the traffic shapping configuration. The Linux kernel used is 2.6.38 on x86_64.<br />
<br />
Tracing this script reveal three main blockings, as reported by the blocking analysis module of flightbox.<br />
<br />
<pre># server process
Blocking report for task /bin/netcat [7495]
Start Duration (ms) Syscall Wakeup
8073461215283 300,873 sys_accept SOFTIRQ</pre><pre># client process
Blocking report for task /bin/netcat [7497]
Start Duration (ms) Syscall Wakeup
8073461820815 200,155 sys_select SOFTIRQ
8073662056512 200,135 sys_poll SOFTIRQ
</pre><br />
The next figure shows blockings occurring in the system, including messages sent at each steps.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6nP-J32LcjKXtnD4lyFZh7alQ2DvMgnHfkScr1x7bqK86FJTbN0POgZXdwu-qNnvQKdfzAe41itDzDzbli5oYKc6Ftdh1HYTAzPvlPvcz0dvcpEMhgb4w87TeHTZGd54ph096x_vYTVE/s1600/tcp-delay.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6nP-J32LcjKXtnD4lyFZh7alQ2DvMgnHfkScr1x7bqK86FJTbN0POgZXdwu-qNnvQKdfzAe41itDzDzbli5oYKc6Ftdh1HYTAzPvlPvcz0dvcpEMhgb4w87TeHTZGd54ph096x_vYTVE/s1600/tcp-delay.png" /></a></div><br />
<br />
The server process create a new AF_INET socket, binds it and start to listen on the selected port. The accept is then performed for an incoming connection. From the trace, we observe that sys_accept blocks for about 300ms. This delay is very close to three times the network latency and is also almost equal to the process duration. Once the accept returns, the read on the socket doesn't block and the process exits.<br />
<br />
In the case of the client, it creates the socket, then perform a sys_connect to the server. The connect returns immediately the value EINPROGRESS without blocking. The next step performed is a sys_connect, in which the client blocks for about two times the network latency. When the select returns, the actual message is sent to the server without blocking and the socket is closed. Finally, the client waits on sys_poll for about twice the network delay.<br />
<br />
From this observation, the sys_accept performed by the server blocks until the final handshake ACK is received. Hence, unfinished handshake is completely hidden from the application and handled at the OS level. When the read is done, the data is already buffered, such that the read doesn't block in this case. In the case of the client, the connect system call returns in an optimistic fashion. The wait for the socket to be ready is differed to a select system call. This may allow many connect to be performed simultaneously. As for the poll, this may be related to the tear-down procedure of the socket, waiting for the final FIN from the server.<br />
<br />
In the case of the traffic shaper used, the delay between consecutive packets is preserved. For example, the client sends three packets when the sys_select returns in a short burst. Hence, this experiment may not highlight all possible blockings. An alternative to traffic shaper is the iptables NFQUEUE. It allow to forward each packet in userspace for arbitrary processing. More on this in the next blog.Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-44440269033121778252011-11-16T08:28:00.000-08:002011-11-16T08:30:36.492-08:00Tracing MPI applicationWhile preparing labs for the Parallel Systems course at Polytechnique, I had the chance to refresh my MPI knowledge. One of the activity consist to compare communication patterns in a MPI application. For this purpose, the MPI library can be linked with special library to record all communication. Jumpshot is the graphical viewer for the trace. I wanted to compared traces obtained from MPE and compare it to a kernel trace.<br />
<br />
First, some technical details on how to instrument MPI program, because it's not so obvious after all and documentation is sparse. I tried MPICH and OpenMPI as MPI implementation. I would recommend MPICH because it seems to provides better error message, which are very valuable to understand failures. In addition, on Fedora, MPE library is built-in in the MPICH package, which is not the case for OpenMPI. On Ubuntu, MPE library must be downloaded and installed by hand [1] because there is no binary package for it. As for the autoconf for the project, I reused m4 macro cs_mpi.m4 from Code Saturne [2] with the following options to configure. By linking to llmpe, the library will ouput suitable file for Jumpshot. <br />
<br />
<pre>CC=mpicc LDFLAGS="-L/usr/local/lib" LIBS="-llmpe -lmpe" ./configure
</pre><br />
The experiment consists in a small MPI application that exchange grid boundaries in 1D cartesian decomposition using blocking send and receive. 4 process are launched on the same host. The first figure shows a zoomed view of execution and messages exchanged between process according to time. The second figure shows a statistic view for an interval, in which the time spent in each state is summed. In one look, one can see what is the relative taken for the communication, which is very handy to understand performance.<br />
<br />
The same application has been traced with the LTTng kernel tracer. While the program is computing and exchanging data with neighbors, no system calls are performed to transfer the data. In fact, between process on the same hosts, MPICH uses shared memory for communication, obviously to achieve better performance. This example shows that it may not be possible to recover relationship between processes that communicates through shared memory. One thing to note is that numerous polls on file descriptors are performed, probably for synchronization. Maybe the relationship can be recovered from there. <br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrE6aFq_lz_BN8UIgS6orBKqskQvZF4cJnReMbzRUC9nqr7lJjVwyuhhTUv2cLzCgU94_HL38O785VB-GL6JDatH3iX9RjYjKwRVNF45tlckndwa-sckeQaskpLcL6L73zid-Hq7fGbtA/s1600/Screenshot-TimeLine-jumpshot.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="255" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrE6aFq_lz_BN8UIgS6orBKqskQvZF4cJnReMbzRUC9nqr7lJjVwyuhhTUv2cLzCgU94_HL38O785VB-GL6JDatH3iX9RjYjKwRVNF45tlckndwa-sckeQaskpLcL6L73zid-Hq7fGbtA/s320/Screenshot-TimeLine-jumpshot.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Jumshot Timeline view (green: recv, blue: send)</td><td class="tr-caption" style="text-align: center;"><br />
</td></tr>
</tbody></table><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVyOjuByOFA-OjoOub0h1tVQ_ksXSJwEfwCwtPk48RM30iO2Yv6xhss_58wCeSezJvtw-xyg_N2QD32ZjLukGXVrxTp5NVTdBgqOT966ffETeX7NBuiCu8uOoTlUGPoC4yuRqnRI8Jb5Q/s1600/Screenshot-Histogram-jumpshot.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="260" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVyOjuByOFA-OjoOub0h1tVQ_ksXSJwEfwCwtPk48RM30iO2Yv6xhss_58wCeSezJvtw-xyg_N2QD32ZjLukGXVrxTp5NVTdBgqOT966ffETeX7NBuiCu8uOoTlUGPoC4yuRqnRI8Jb5Q/s320/Screenshot-Histogram-jumpshot.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Jumshot Histogram view (green: recv, blue: send)</td></tr>
</tbody></table><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQXtJ-DRMg4NP_iN6pigXsFu-zRlufSUDQYSvbaP7kEg7Gi4VgCELBgu3FXlkFhZUW02VvrS7doZ2HvDANpK-ji2_zs0z17viP9oBM4yu7p2Ny1cX_dDGSnrO975__kL7uz-VWE65Uz7o/s1600/MPI-kernel.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="74" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQXtJ-DRMg4NP_iN6pigXsFu-zRlufSUDQYSvbaP7kEg7Gi4VgCELBgu3FXlkFhZUW02VvrS7doZ2HvDANpK-ji2_zs0z17viP9oBM4yu7p2Ny1cX_dDGSnrO975__kL7uz-VWE65Uz7o/s640/MPI-kernel.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">LTTv control flow view (green: userspace)</td></tr>
</tbody></table><br />
Would it be possible to trace shared memory accesses directly? A recent kernel feature allows to trace memory map I/O (MMIOTRACE). It works by resetting the flag that indicates a page is in memory, such that a page fault is triggered each time a page is accessed. The page fault handler is instrumented to record those events. This technique may be a hint on how to recover shared memory process relationship. More on this later.<br />
<br />
In conclusion, MPI instrumentation with MPE helps to understand runtime performance behaviour of MPI application. The same results can't be obtained with the current kernel tracing infrastructure found in LTTng. <br />
<br />
[1] <a href="ftp://ftp.mcs.anl.gov/pub/mpi/mpe/mpe2.tar.gz">ftp://ftp.mcs.anl.gov/pub/mpi/mpe/mpe2.tar.gz</a><br />
[2] <a href="http://research.edf.com/research-and-the-scientific-community/software/code-saturne/introduction-code-saturne-80058.html">http://research.edf.com/research-and-the-scientific-community/software/code-saturne/introduction-code-saturne-80058.html</a>Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-11292015300736432972011-07-26T08:59:00.000-07:002011-07-26T08:59:23.892-07:00Common Trace Format BitfieldsThe <a href="http://http://git.efficios.com/?p=ctf.git;a=blob_plain;f=common-trace-format-specification.txt;hb=HEAD">Common Trace Format</a> (CTF) will be the default format of the next release of LTTng 2.0. This new trace format is able to write data at the bit level to optimize space usage of each field. Such sub-byte packed bits is called a bitfield. <br />
<br />
Endianness of the bitfield follows the native endianness of the architecture where the trace is obtained. Big endian (BE) and little endian (LE) are supported. But how does the bits are written into the trace? According to the documentation in include/babeltrace/bitfield.h: <br />
<br />
<pre style="font-size:8pt;">The inside of a bitfield is from high bits to low bits.
On big endian, bytes are places from most significant to less significant.
Also, consecutive bitfields are placed from higher to lower bits.
On little endian, bytes are placed from the less significant to the most
significant. Also, consecutive bitfields are placed from lower bits to higher
bits.
</pre><br />
To make sure I understood correctly, I created an example based on the current implementation in babeltrace. The structure has three fields. Each field is followed by it's length in bits.<br />
<br />
<pre>struct x {
unsigned int a : 4;
unsigned int b : 32;
unsigned int c : 4;
}
</pre><br />
The total size of the structure is 40 bits or 5 bytes. For the example, following values are assigned to the fields to be able to spot them easily:<br />
<br />
<pre>x.a = 0x0000000E;
x.b = 0xA1B2C3D4;
x.c = 0x0000000F;
</pre><br />
Then, writing those values in a byte array in big and little endian gives the results in the table below. The binary and hexadecimal reprensentation of consecutive bytes are displayed.<br />
<br />
<table><tr> <td><br />
<pre>Big endian
0b11101010 0xEA
0b00011011 0x1B
0b00101100 0x2C
0b00111101 0x3D
0b01001111 0x4F
0b00000000 0x00
[...]
</pre></td>
<td><br />
<br />
</td>
<td><br />
<pre>Little endian
0b01001110 0x4E
0b00111101 0x3D
0b00101100 0x2C
0b00011011 0x1B
0b11111010 0xFA
0b00000000 0x00
[...]
</pre></td> </tr>
</table><br />
Big endian is simple, because the bitfield is written sequentially. The first byte 0xEA starts with the four bits if the <i>a</i> field, followed by the first four bits of the <i>b</i> field. Notice that in consequence all subsequent bytes of the <i>b</i> field are shifted. The last byte starts with the last four bits of the <i>b</i> field. The <i>c</i> field fills the rest of the last byte. All bytes that follow are untouched.<br />
<br />
As to little endian, the first byte ends with the <i>a</i> field. Then, the <i>b</i> field is written, but the most significant bit is found on the second half of the last byte. Notice that tbe bytes and written backwards, also shifted by four bits. The third field <i>c</i> is placed in the upper half of the last byte.<br />
<br />
Happy hacking!Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-37321975139568423732011-06-16T18:39:00.001-07:002011-06-16T18:42:51.684-07:00Summary of Linux Symposium 2011Linux Symposium took place in Ottawa once again in 2011. I went there for a talk about recovering system metrics from kernel trace. There were interesting talks, here is a summary. <br />
<br />
Hitoshi Mitake explained scalability issues for virtualization on multi-core architecture with real-time constraints. He mentioned that when one of the virtual CPU is interrupted, another virtual CPU may tries to enter section protected by a spin lock, which decrease performance under load. Alternative scheduling of virtual CPU is proposed to avoid such situation.<br />
<br />
Sergey Blagodurov talked about scheduling in NUMA multicore architecture. I learned the numactl system calls that can be used at the userspace level to move process and memory between domains. He was developing scheduler algorithms to optimize performance on this architecture. Since the Linux scheduler can migrate a process, but not the related memory pages, then performance is decreased because the inter-domain communication is increased and is much slower. One of his goal was to increase memory locality for pages that were used frequently.<br />
<br />
One talk has been given about Tilera architecture by Jon C Masters, that includes on a single chip up to 100 cores! This piece of hardware is undoubtedly very special, and I would love to get one on my desk! I wonder how does the Linux kernel can scale up to that number of cores.<br />
<br />
Trinity, which is a system call fuzzer has been presented by Dave Jones. It's basically a tool that perform system calls with randomized arguments, but still looking at edge cases. They found and fixed around 20 weired bugs this way. Code coverage report could be useful in future.<br />
<br />
What about redundant data on a disk? There were a talk about block device deduplication by Kuniyasu Suzaki. The principle is that two identical blocks should not be saved twice. Since it operates at the block level, mileage vary between file systems. The presenter showed principally how much space can be recovered according to block size and file system type. I'm still curious what is the performance impact and CPU overhead of performing deduplication online. <br />
<br />
The last talk I wanted to mention is the one given by Alexandre Lissy, about verifications of the Linux kernel. Static checks like Coccinelle are available right now by running make coccicheck in your linux base source tree. While model checking on the entire kernel would be great to get more guarantees about it's well behavior, the feasibility of it with SAT model checker is another story, because of the size and complexity of the Linux source tree, which is quite an undertaking challenge. Maybe an approach like SLAM from Microsoft, that validates interface usage, could be use for Linux drivers too?<br />
<br />
This Linux Symposium was the opportunity to meet a lot of people involved in the kernel community and share ideas. Here are few links related to the conference.<br />
<ul><li><a href="http://research.microsoft.com/en-us/projects/slam/">http://research.microsoft.com/en-us/projects/slam/</a></li>
<li><a href="http://codemonkey.org.uk/projects/trinity/">http://codemonkey.org.uk/projects/trinity/</a></li>
<li><a href="http://oss.sgi.com/projects/libnuma/">http://oss.sgi.com/projects/libnuma/</a></li>
<li><a href="http://www.dcl.info.waseda.ac.jp/">http://www.dcl.info.waseda.ac.jp/</a></li>
<li><a href="http://www.opendedup.org/">http://www.opendedup.org/</a></li>
<li><a href="http://www.tilera.com/">http://www.tilera.com/</a></li>
<li><a href="http://www.linuxsymposium.org/2011/schedule.php">http://www.linuxsymposium.org/2011/schedule.php</a></li>
</ul>Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-69175865236715768382011-05-21T14:58:00.000-07:002011-05-21T15:00:38.742-07:00Performance analysis with kernel traceIn this blog, I want to show some progress I made toward providing performance analysis tools for developpers and system administrators. The idea is to answer questions like: <br />
<br />
<ul><li>What is the cause of lattency in the system?</li>
<li>How much resource a task require?</li>
<li>What is the relationship between process?</li>
<li>What is the critical path of a task?</li>
<li>Which subsystem is the bottleneck of the system?</li>
</ul><br />
To answer such questions, a kernel trace is an excellent source of system performance data. It has low overhead and provide precise timings information about events occuring on the system. But, get an overview of a system behavior by looking at each individual events is a challenge. We need an analysis toolbox. In the rest of this blog, Four tools are presented with which you can experiement yourself. A summary example is presented to show the integration of various analysis made with flightbox, the prototype implementation.<br />
<br />
<h2>Resource usage recovery</h2>The basic metric needed to perform various other analysis is resource usage. CPU usage recovery has already been presented from previous post. The same kind of analysis can be done with memory, disk and network usage. <br />
<br />
In the case of CPU usage, kernel scheduling events indicates the process that drives the CPU. The CPU usage of a process is thus the sum of scheduled time. To display on a chart, CPU usage intervals are decomposed in fixed time range. An average usage is then easy to compute, and the precision can be increased by reducing the duration of time interval. The technique is presented on the following figure. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5dYhBv4zJ3R5RCb_v-n_IbjmJ9rc2DBh_6boRoo0Nae751D9pAy6tc8W2y-naS_83v3TmjP6qmlkpRTPmF1yrTglGdmVUfsCLdR4OqYzycPMVRkyIwxDjdYx73-LO4udioFKMKDASg7k/s1600/cpu-usage-rastering.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="162" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5dYhBv4zJ3R5RCb_v-n_IbjmJ9rc2DBh_6boRoo0Nae751D9pAy6tc8W2y-naS_83v3TmjP6qmlkpRTPmF1yrTglGdmVUfsCLdR4OqYzycPMVRkyIwxDjdYx73-LO4udioFKMKDASg7k/s320/cpu-usage-rastering.png" /></a></div><br />
This technique is known as geometric integration. This computation behaves somewhat like a low-pass filter, by smoothing the raw CPU usage, making it easier to understand. The size of the resulting data is proportional to the trace duration times the desired precision times the number of process, no matter the trace size. In consequence, the CPU usage of a large trace can be summarized with a small array of doubles. <br />
<br />
<h2>Blocking analysis</h2>A process is blocked when a resource is not available and it can't progress. For example, we a process perform a read on a socket while the data is not yet received, the process block and the sceduler remove the waiting task from the running queue. Blocking can only occur in kernel mode. The main cause of blocking is in system calls. It can also occur during a page fault for which the page is swapped for example. The following picture shows typical example of blocking. Colors are matching legend in LTTV, green is for running in user mode, blue is for running in system call and red is for blocked on IO. Finally, orange represent the interval while the data is ready, but the process is not yet scheduled.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZqfd4hh9TwKyhPylArQ1anUUHjHDOj5Sd02dC0ljKXkBwyU44ki9FYT6b_ejdRf6xcVYhq_Gf9T6UQh1bGcj1U0kWoeEO1EwSUP8IC4xpWp8-r3cCQrIEU3RFzcQ67ZiP6fEyp-FQu_k/s1600/blocking-sequence.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="80" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZqfd4hh9TwKyhPylArQ1anUUHjHDOj5Sd02dC0ljKXkBwyU44ki9FYT6b_ejdRf6xcVYhq_Gf9T6UQh1bGcj1U0kWoeEO1EwSUP8IC4xpWp8-r3cCQrIEU3RFzcQ67ZiP6fEyp-FQu_k/s320/blocking-sequence.png" /></a></div><br />
Some blocking are not necesserely bad. Waiting for user input or a server that waits for a request to come is not important. We are mostly interested on blocking on the critical path of an application, meaning blocking that induce processing delay from the user's point of view.<br />
<br />
While a task is blocked, a subtask can perform some work. This subtask can also block, thus blocking among tasks is hierarchical.<br />
<br />
<h2>Inter-process relationship</h2><br />
Under Linux, process can cooperate with various Inter-Process Communication mechanism, namely: <br />
<br />
<ul><li>Socket</li>
<li>Pipe</li>
<li>Fifo</li>
<li>Mutex</li>
<li>Signal</li>
<li>Shared memory</li>
<li>File lock</li>
</ul><br />
Those mechanism make possible to share resource, send message and data between process. Sockets are well instrumented in the Linux kernel. Hence, it's possible to recover the link between process that use socket to communicate. It works at the host level, but provided traces from many hosts, it would be possible to recover links between distributed components. For example, we could recover process relationship for an httpd client, shown at the following figure.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbgMZ7XIPLNzQwY4ZuwzZbntcwYDT5db92GXPkfHtzeLnpwop9Db_GyMLchjf8o8QPBtiEUUv3CY40vWGbInRyLEqcKqj8jJjO8yDJtiLBeEVNSSCTZGpnBTadzybqoct0gZ6UIqzrUf4/s1600/http-process-relationship.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbgMZ7XIPLNzQwY4ZuwzZbntcwYDT5db92GXPkfHtzeLnpwop9Db_GyMLchjf8o8QPBtiEUUv3CY40vWGbInRyLEqcKqj8jJjO8yDJtiLBeEVNSSCTZGpnBTadzybqoct0gZ6UIqzrUf4/s320/http-process-relationship.png" /></a></div><br />
<h2>Resource usage accounting</h2><br />
The idea behing resource usage accounting is to associate resource usage of a server to the client that does the request. This analysis is able to compute the cost of some request accros distributed process as if it was performed by only one thread. To perform this analysis, ressource usage, inter-process relationship and blocking analysis are required.<br />
<br />
For example, say that an http client perform a request on a web server. The resulting web page is returned to the client. While waiting for the web page, the http client blocks. By associating the CPU usage of the httpd thread that process the request of the client while it's blocked, the resource usage is effectively accounted to the right process. This is shown by the following figure.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbofQEwQZSId8gOu_90AfoINsSVUfK-AndqgSX0UYi4Na-zZVF-HN81q92T4ZyYkFJXNyrE7qSbo_3xnIM9MREhQP7UBnOkbM7RBdmpBxShLrDlnCqTUULCsuUkBISf0DXtAd2D5AjEbM/s1600/resource-accounting.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="91" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbofQEwQZSId8gOu_90AfoINsSVUfK-AndqgSX0UYi4Na-zZVF-HN81q92T4ZyYkFJXNyrE7qSbo_3xnIM9MREhQP7UBnOkbM7RBdmpBxShLrDlnCqTUULCsuUkBISf0DXtAd2D5AjEbM/s320/resource-accounting.png" /></a></div><br />
<h2>Example</h2><br />
In the following example, we will use a special RPC client and server, srvhog and clihog from workload-kit. The binary srvhog has two commands, sleep and hog. Each operation takes 100ms. Sleep command is implemented with a timer and doesn't consume CPU, while hog command perform computation that last for the duration. A calibration procedure is performed when srvhog starts to ensure that the hog computation duration is close to 100ms. Communication between client and server is done by a simple TCP socket.<br />
<br />
The following picture shows the CPU usage of the traced system, a host that has two cores, for the hog scenario. The overall CPU usage is shown in blue, while the CPU usage of srvhog is red. The calibration phase of srvhog last for about one second followed by a burst of about 100ms. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhalCQPCIBZ9KYEIOUPv0A25_D5rn_fPh6ZEHPrCtUvoVJBm2TBYMYofcIvPA9ju8oW64mOD44DRZADFSjF9NVhOoCWtq7oTupYwRrSZsuHf3oC6XA7vPDQTC3L35wthOm0f_Z8hTVqejM/s1600/Screenshot-Flightbox+.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="265" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhalCQPCIBZ9KYEIOUPv0A25_D5rn_fPh6ZEHPrCtUvoVJBm2TBYMYofcIvPA9ju8oW64mOD44DRZADFSjF9NVhOoCWtq7oTupYwRrSZsuHf3oC6XA7vPDQTC3L35wthOm0f_Z8hTVqejM/s320/Screenshot-Flightbox+.png" /></a></div><br />
The second analysis shows blocking of clihog. It waits on system call read for 101ms, that correspond to the computation time of the server before reply is sent. Other blocking are very small and negligeable.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikGyRV6UE_pt8eekHQuQ2fQixw7VqMdUnwJtgK3dGo41dCxUaLete4vt6JniAN79_-bu7opdOhtqu3-vRHrY9sZyEjuFvSeusLjvfAStqmPZSxBsa3lbBhpjnR3qovuKYQPz4tP-i_9Yw/s1600/Screenshot-Flightbox+-1.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="128" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikGyRV6UE_pt8eekHQuQ2fQixw7VqMdUnwJtgK3dGo41dCxUaLete4vt6JniAN79_-bu7opdOhtqu3-vRHrY9sZyEjuFvSeusLjvfAStqmPZSxBsa3lbBhpjnR3qovuKYQPz4tP-i_9Yw/s320/Screenshot-Flightbox+-1.png" /></a></div><br />
By analyzing socket operations of srvhog and clihog, the following graph can be computed. It shows that the client perform a connection to the server. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIdxONq0p87f6dh5V7o-7UYUKjk0ItePp_oB0ANFyYKpIRLWoA3Q4CZzYeSd0sNl_ZCzj-kz6sUNIwSMhXDf_l3toiYG3yZGQZ0Ogs_COgKOQhEG0nqMGnzJK_l7r0uDsa-U_yuvp47oM/s1600/rpc-hog-100ms-task-graph.dot.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="160" width="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIdxONq0p87f6dh5V7o-7UYUKjk0ItePp_oB0ANFyYKpIRLWoA3Q4CZzYeSd0sNl_ZCzj-kz6sUNIwSMhXDf_l3toiYG3yZGQZ0Ogs_COgKOQhEG0nqMGnzJK_l7r0uDsa-U_yuvp47oM/s320/rpc-hog-100ms-task-graph.dot.png" /></a></div><br />
Finally, by computing resource accounting, the following report is obtained, showing that the CPU usage of srvhog while processing the request should be really accounted to clihog. CPU usage is almost 90ms, which is close to the expected 100ms and waiting time. The exact CPU usage is closer to 100ms and the error can be explained by the lost of precision while computing average CPU usage. To increase precision, smaller CPU usage intervals can be used.<br />
<br />
<pre style="font-size: 8pt">Report for task pid=2668 cmd=/usr/local/bin/clihog
File descriptor blocking summary
FD N Sum (ms)
------------------------------------------------------
132.207.224.50:9876 2 101,225
CPU accounting
PID Self (ms) Sub (ms) Total (ms)
------------------------------------------------------------------
2668 clihog 2,675 89,797 92,471
\_ 2666 srvhog 89,797 0,000 89,797
</pre><br />
<h2>Conclusion</h2><br />
Kernel analysis goes well beyong kernel developpers. Kernel trace can be very useful to system administrators and application developpers provided simple analysis tools that abstract trace events and produce meaningfull results to users. <br />
<br />
Try it for yourself, see <a href="https://github.com/giraldeau/flightbox">flightbox</a> and <a href="https://github.com/giraldeau/workload-kit">workload-kit</a> on github. <br />
<br />
More to come, stay tuned!Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-36121528822344530852011-05-10T13:03:00.001-07:002011-05-24T13:36:40.847-07:00Tracing Workshop '11I assisted to the Tracing Workshop held at the École Polytechnique de Montréal on 9-10 may 2011, here are some highlights of the event. Partners presented their challenges followed by current research conducted in the fields from many Canadian universities. <br />
<br />
<h2>Partners</h2>Partners mentioned the interest in security, debugging hardware, real-time systems, embedded systems and large scale distributed infrastructure. <br />
<br />
On the security aspects, the goal would be to increase the accuracy of detecting intrusion. It's reported that simple attacks are often working, like sending an email with a link that loads some browser extension or other web site, up to get the user credentials.<br />
<br />
The need to model normal behavior and alert when the system is going outside of it without false positive has been mentioned a lot.<br />
<br />
In the field of large scale infrastructure, there are needs for understanding the system wide behavior of a distributed system at runtime. This is at the heart of finding problems fast on hard to reproduce problems, including kernel, and apps crash and performance slowdowns.<br />
<br />
Monitoring a system modify the system behavior and the measure itself, even in the computer domain. In the embedded and real-time, first there is the challenge to put an upper bound to tracing impact, to make sure to respect timing constraints, even in the case tracing is enabled.<br />
<br />
<h2>Research</h2>Sebastian Fischmeister from the University of Waterloo presented research about using a sampling approach to tracing for real-time systems. He presented an approach to add hardware instruction for tracing, that would run it only if the timing requirements would be met.<br />
<br />
Ashvim Goel from the University of Toronto talked about increasing security by recording all system calls to allow off-line analysis. Data is indexed to allow fast queries.<br />
<br />
Greg Frank from Carleton University presented performance model based on traces. Execution paths at the system level can be modeled by a graph. Graph can be annotated with resource consumption.<br />
<br />
Del Mayers from the University of Victoria presented a reverse engineering tool using tracing. By running the program with the desired functionality, then repeating the scenario without using that particular activity, it's possible to do a difference set operation on called methods and isolate with low noise level the classes and methods that implements this particular functionality.Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0tag:blogger.com,1999:blog-1864509792735323218.post-71397026308224403832011-02-23T07:37:00.000-08:002011-02-23T08:47:05.103-08:00Managing Apache config with AugeasRelease 0.8.0 of Augeas includes a special lens for managing Apache web server configuration files. First, we will see how the lens maps sections and directives to the tree. Then, we will review some recepies to modify the configuration and show results.<br />
<br />
<h2>Quick start</h2>Apache configuration has two main components. The first component is section, which is similar to XML elements. Sections can be nested and contain directives. Directives are composed of a name and positional arguments.<br />
<br />
As you may know, Apache is very modular. A module can define its own directives and Apache will run the module's callback to parse arguments. This is reflected in the way the configuration is handled in Augeas. The lens is mostly generic and allows arbitrary sections and directives, with arbitrary arguments. Apache allow case insensitive sections are directive. The actual version enforce that open and close tags to be the same case. I recommend to standardize the casing of the configuration for simplicity. Now, let's look at a small example, with one section and one directive. <br />
<br />
<pre style="font-size: 8pt;"><VirtualHost *:80>
ServerAdmin foo@example.org
</VirtualHost>
</pre><br />
This configuration produces the following tree.<br />
<br />
<pre style="font-size: 8pt;">{ "VirtualHost"
{ "arg" = "*:80" }
{ "directive" = "ServerAdmin"
{ "arg" = "foo@example.org" }
}
}
</pre><br />
Let's say we want to modify the email address of the server admin, we can do a simple set:<br />
<pre style="font-size: 8pt;">augtool> set "/VirtualHost/*[self::directive='ServerAdmin']/arg" "bar@example.com"
</pre>To select the ServerAdmin directive, we need to search for it as a child of VirtualHost. We then set the first argument of the corresponding directive to a new value. Here is the resulting diff from the original file. If there are more than one VirtualHost entries in the file, then a simple set will fail, a set multiple (setm) is required to change them all at once.<br />
<br />
<pre style="font-size: 8pt;">$ diff -u httpd.conf httpd.conf.augnew
--- httpd.conf 2011-02-22 22:03:34.000000000 -0500
+++ httpd.conf.augnew 2011-02-22 22:04:34.844099001 -0500
@@ -1,3 +1,3 @@
<VirtualHost *:80>
- ServerAdmin foo@example.org
+ ServerAdmin bar@example.org
</VirtualHost>
</pre><br />
<h2>Tree structure</h2>There are 4 special things to know about the tree structure of apache configuration.<br />
<br />
<ul><li>Section names are labels of the tree, they appear on the path.</li>
<li>Directives are held in value of a "directive" node.</li>
<li>Arguments of either sections or directives are held in values of "arg" node.</li>
<li>Arguments nodes must be before any other children.</li>
</ul><br />
To demonstrate this, let's create a configuration from scratch, and see it build at each step in augtool. First, launch augtool. I suggest to use the -n option to save to a new file. <br />
<pre style="font-size: 8pt;">$ sudo augtool -n
</pre>The variable $conf holds the path to the target empty file, for example under Ubuntu it would be:<br />
<pre style="font-size: 8pt;">augtool> defvar conf /files/etc/apache2/sites-available/foo
</pre><br />
<h3>Step 1: create VirtualHost section</h3>The clear command empty the value of a node and it create it if the node doesn't exists yet.<br />
Command:<br />
<pre style="font-size: 8pt;">augtool> clear $conf/VirtualHost
</pre>Result:<br />
<pre style="font-size: 8pt;"><VirtualHost>
</VirtualHost>
</pre><br />
<h3>Step 2: Set the argument of VirtualHost section.</h3>Command:<br />
<pre style="font-size: 8pt;">augtool> set $conf/VirtualHost/arg "172.16.0.1:80"
</pre>Result:<br />
<pre style="font-size: 8pt;"><VirtualHost 172.16.0.1:80>
</VirtualHost>
</pre><br />
<h3>Step 3: Create the ServerAdmin directive under VirtualHost.</h3>Command:<br />
<pre style="font-size: 8pt;">augtool> set $conf/VirtualHost/directive "ServerAdmin"
</pre>Result:<br />
<pre style="font-size: 8pt;"><VirtualHost 172.16.0.1:80>
ServerAdmin
</VirtualHost>
</pre><br />
<h3>Step 4: Set the ServerAdmin first argument to configure the email.</h3>Command:<br />
<pre style="font-size: 8pt;">augtool> set $conf/VirtualHost/*[self::directive='ServerAdmin']/arg "baz@example.org"
</pre>Result:<br />
<pre style="font-size: 8pt;"><VirtualHost 172.16.0.1:80>
ServerAdmin baz@example.org
</VirtualHost>
</pre><br />
<h2>The order matters</h2>What if we perform step 3 before step 2? Here is the result:<br />
<pre style="font-size: 8pt;">augtool> set $conf/VirtualHost/directive "ServerAdmin"
augtool> set $conf/VirtualHost/arg "172.16.0.1:80"
augtool> print $conf
/files/etc/apache2/sites-available/foo
/files/etc/apache2/sites-available/foo/VirtualHost
/files/etc/apache2/sites-available/foo/VirtualHost/directive = "ServerAdmin"
/files/etc/apache2/sites-available/foo/VirtualHost/arg = "172.16.0.1:80"
augtool> save
Saving failed
augtool> print /augeas/files/etc/apache2/sites-available/foo/error/message =
Failed to match ({ /arg/ = /[...]/ }({ /arg/ = /[...]/ })*)?(<<rec>>
| { /directive/ = /[...]/ } | { /#comment/ = /[...]/ } | { })*
with tree
{ \"directive\" = \"ServerAdmin\" } { \"arg\" = \"172.16.0.1:80\" }"
</pre><br />
I reformated the error message to highlight why it failed. The lens require "arg" nodes to appear before "directive", "#comment" and empty nodes. In this case, the node "directive" is before the "arg" node, and it doesn't match the tree structure. So, how to make sure that we insert nodes before any other? Here is the twist. First, we will remove the arg node and create it at the right place.<br />
<br />
<pre style="font-size: 8pt;">augtool> rm $conf/VirtualHost/arg
rm : $conf/VirtualHost/arg 1
augtool> ins arg before $conf/VirtualHost/*[1]
augtool> set $conf/VirtualHost/arg "baz@example.org"
augtool> save
Saved 1 file(s)
</pre><br />
And the corresponding result is the one expected. The predicate [1] will always select the first node in a node set and is equivalent to [position()=1]. There is one gotcha, that is if there are no children under VirtualHost, no node will match and the insert will fail. In this case, a simple set is required.<br />
<br />
Also, take care to set a value to "arg" nodes! Otherwise, Augeas use the section name lens to render it. Here is the result of having an "arg" node with a null value.<br />
<br />
<pre style="font-size: 8pt;"><VirtualHost>
<arg>
</arg>
</VirtualHost>
</pre><br />
<h2>Use cases</h2><h3>Modify document root</h3>The first use case I show is to modify document root in the default configuration of Apache under Ubuntu. The default document root is "/var/www", but say we want to change it to "/srv/www". We have to modify the DocumentRoot directive and one Directory section. I'm using the match command to verify that the right node is selected before applying changes. (sorry for long lines...)<br />
<br />
<pre style="font-size: 8pt;">augtool> match $conf2/VirtualHost/*[self::directive="DocumentRoot"]
/files/etc/apache2/sites-available/default/VirtualHost/directive[2] = DocumentRoot
augtool> set $conf2/VirtualHost/*[self::directive="DocumentRoot"]/arg /srv/www
augtool> match $conf2/VirtualHost/Directory/arg[. = "/var/www/"]
/files/etc/apache2/sites-available/default/VirtualHost/Directory[2]/arg = /var/www/
augtool> set $conf2/VirtualHost/Directory/arg[. = "/var/www/"] "/srv/www/"
augtoo> save
Saved 1 file(s)
</pre><br />
Here is the resulting diff. <br />
<br />
<pre style="font-size: 8pt;">--- /etc/apache2/sites-available/default 2011-02-22 23:30:48.000000000 -0500
+++ /etc/apache2/sites-available/default.augnew 2011-02-22 23:48:10.000000000 -0500
@@ -1,12 +1,12 @@
<virtualhost *:80="">
ServerAdmin webmaster@localhost
- DocumentRoot /var/www
+ DocumentRoot /srv/www
<Directory>
Options FollowSymLinks
AllowOverride None
</Directory>
- <Directory /var/www/>
+ <Directory /srv/www/>
Options Indexes FollowSymLinks MultiViews
AllowOverride None
Order allow,deny
</virtualhost></pre><br />
<h3>Changing permissions</h3><br />
On Ubuntu, Apache is configured to serve documentation on http://localhost/doc/. We will share this documentation with everybody by changing Order and Allow from directive and delete Deny directive of the doc directory. We are using the regexp function to select the node with or without leading slash and with or without quotes. We remove all arguments of "Allow from" to make sure that arguments are sane. <br />
<br />
<pre style="font-size: 8pt;">augtool> match $conf2/VirtualHost/Directory[arg =~ regexp(".*/usr/share/doc.*")]/
/files/etc/apache2/sites-available/default/VirtualHost/Directory[4] = (none)
augtool> defvar dir $conf2/VirtualHost/Directory[arg =~ regexp(".*/usr/share/doc.*")]/
augtool> rm $dir/*[self::directive="Allow"]/arg
augtool> set $dir/*[self::directive="Allow"]/arg[1] from
augtool> set $dir/*[self::directive="Allow"]/arg[2] all
augtool> set $dir/*[self::directive="Order"]/arg[1] "allow,deny"
augtool> rm $dir/*[self::directive="Deny"]
rm : $dir/*[self::directive="Deny"] 3
augtool> save
Saved 1 file(s)
</pre><br />
In this example, the predicate "*[self::directive='Allow']" can be rewritten "directive[ . = 'Allow']", both are equivalent.<br />
<br />
Here is the resulting diff.<br />
<pre style="font-size: 8pt;">--- /etc/apache2/sites-available/default 2011-02-22 23:30:48.000000000 -0500
+++ /etc/apache2/sites-available/default.augnew 2011-02-23 00:10:14.000000000 -0500
@@ -33,9 +33,8 @@
<Directory "/usr/share/doc">
Options Indexes MultiViews FollowSymLinks
AllowOverride None
- Order deny,allow
- Deny from all
- Allow from 127.0.0.0/255.0.0.0 ::1/128
+ Order allow,deny
+ Allow from all
</directory>
</pre><br />
<h2>Conclusion</h2>Augeas is a very powerful and precise tool to modify Apache configuration file. We can do about any changes with a very simple tree structure. As always, don't forget to reload Apache configuration for changes to take effect. <br />
<br />
Thanks to David Lutterkort, who designed the amazing recursive lens that Apache lens uses and feedback for the square lens, and Raphaël Pinson for the feedback and stress test of the lens on more than 2000 Apache configurations and the help for this blog post. <br />
<br />
Happy hacking!Francis Giraldeauhttp://www.blogger.com/profile/01004841549682103364noreply@blogger.com0