Tuesday, May 10, 2016

Function graph tracing with LTTng

FTrace has the ability to hook function entry and return in the kernel. I did a small prototype demonstrating the possibility of recording these events with LTTng, and viewing the result with TraceCompass.

The idea is to register callbacks using register_ftrace_graph(). This symbol is not exported, so we use kallsyms to look up its address at runtime. The callbacks themselves contain tracepoints that record entries and returns. The kernel module implementation lttng-fgraph.c is quite simple. It limits tracing to normal kernel mode (no interrupts), and the function to hook is hardcoded (do_sys_open()) for prototyping purposes.

The implementation of the analysis module in TraceCompass is relatively simple (LttngFtraceCallStackProvider.java). In fact, it consists mostly of defining the entry and return events and how to decode the event payload. In order to show function names instead of addresses, I implemented a symbol provider for kallsyms. It walks the trace directory to find the kallsyms file and parses it to map addresses to symbols. The user just needs to copy /proc/kallsyms below the trace directory, like this:

mkdir my-trace/kernel/syms
sudo cat /proc/kallsyms > my-trace/kernel/syms/kallsyms

Sudo (root) is required, because symbol addresses are usually hidden by default. The kallsyms file must be within its own directory, otherwise the trace parser will try to open the kallsyms file as a trace stream and will report an error.
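For illustration, here is a minimal sketch of the kind of address-to-symbol resolution the symbol provider performs, written in Python rather than the actual TraceCompass Java code (the file path is an assumption matching the layout above):

import bisect

# Parse a kallsyms dump into sorted, parallel address/name lists.
def load_kallsyms(path):
    addrs, names = [], []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3:
                addrs.append(int(fields[0], 16))
                names.append(fields[2])
    pairs = sorted(zip(addrs, names))
    return [a for a, _ in pairs], [n for _, n in pairs]

# Map an address to the closest preceding symbol.
def resolve(addrs, names, address):
    i = bisect.bisect_right(addrs, address) - 1
    return names[i] if i >= 0 else hex(address)

addrs, names = load_kallsyms('my-trace/kernel/syms/kallsyms')
print(resolve(addrs, names, addrs[100] + 4))  # resolves to the symbol containing that address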

Here is the result for some open system calls on the system. Because the views are synchronized, the open system call intervals of the processes shown in the control flow view overlap with the call stack graph. It looks great, doesn't it?


However, the overhead of function tracing is somewhat high. For each function call, about 300ns are spent writing trace events, which may represent more time than the processing done in the function itself. To address this, FTrace can record an event only if the time spent inside the function is greater than a given threshold. The call graph is then no longer an exact representation of the execution, but skipping short-duration functions reduces the overhead and the trace size while still identifying slow functions.
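The same duration filter can also be applied in post-processing. Here is a minimal sketch of the idea, assuming a stream of hypothetical (kind, timestamp_ns, address) tuples decoded from the entry/return events (illustration only, not the FTrace or LTTng implementation):

# Keep only calls whose duration exceeds a threshold (illustration only).
def filter_slow_calls(events, threshold_ns=10000):
    stack = []
    for kind, ts, addr in events:
        if kind == 'entry':
            stack.append((addr, ts))
        elif kind == 'return' and stack:
            addr, entry_ts = stack.pop()
            if ts - entry_ts >= threshold_ns:
                # Report the function address with its entry and exit timestamps.
                yield addr, entry_ts, ts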

At the moment, FTrace does not export the registration functions, because external users were not intended. Only one set of hooks can be defined at a time, so one cannot use FTrace and LTTng function graph tracing at the same time. In the future, it could be a nice extension to FTrace to make it more general and allow other tracers to use the function graph. For instance, kprobes could get callbacks not only on function entry and return, but also on each of the function's children.

Wednesday, February 24, 2016

Real-life real-time problem

I installed Linux RT on the Jetson TK1 last month. Since then, cyclictest has been running to check that the latency stays within bounds. The maximum wake-up latency is usually ~100us, and it can run like that for a long time. However, on rare occasions, a latency of ~8ms occurs after running for a long period of time (about one week). This kind of problem is tricky to find without the proper tools. Here is how I found it.

I configured LTTng kernel and userspace tracing to continuously trace the system. The kernel trace records scheduling events and system calls, while the userspace trace indicates the location of the missed deadline. When a high latency is detected, tracing is stopped and a snapshot of the trace buffers is written to disk. I let the system run with tracing enabled for a couple of days, and then a snapshot was created, indicating that the problem had occurred.

The following figure shows the traces loaded in TraceCompass. We can see that this bug results from interactions between three tasks, namely ktimersoftd/2, irq/154-hpd and cyclictest. The cyclictest wake-up signal is delayed for a long time. This signal, which puts the task back in the run queue, is delivered by ktimersoftd/2. The ktimersoftd task has a lower priority than irq/154-hpd, which happens to run for a long time. Even though cyclictest has a higher priority than the IRQ thread, the priority has no effect as long as cyclictest is in the wait queue. Priority inheritance does not help here because we don't know in advance that ktimersoftd will wake up a high-priority task.


Obviously, the ktimersoftd threads should have a greater priority than the other IRQ threads, since cyclictest depends on them to meet its deadline. A few questions are left: how are the default priorities defined, what is the irq/154-hpd task actually doing, and will increasing the ktimersoftd priority fix the issue? More investigation is required.

In the meantime, if you want to look for yourself, the trace is here. Cheers!

Monday, February 1, 2016

Real-time setup on Jetson TK1

The Linux kernel is a popular operating system for real-time applications. Here is a summary of my experiments with the real-time kernel on the Jetson TK1 board from NVIDIA.

Set up the board

The Jetson TK1 boots from internal memory. NVIDIA supplies a script to "flash" the device with the root file system and the kernel. I used the manual method from the NVIDIA Jetson TK1 documentation: download the Driver Package and the Sample Root File System, and follow the Quick Start Guide. You should have a working system after these steps.

I tried to use the JetPack installer to flash the device, but it requires Ubuntu 14.04; the procedure failed and kept re-downloading the huge files, so I don't recommend it. I also tried the procedure to use an SD card instead of the board's internal memory, and that failed too, so I stuck with flashing the internal memory.

Compile linux-rt

Clone Linux-RT from git.kernel.org:

git clone https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
cd linux-rt-devel
git checkout linux-4.4.y-rt-rebase

While it downloads, install the cross-compiler for ARM (the "hf" in gnueabihf stands for hard-float, meaning that the compiler outputs code that uses the on-chip floating-point unit):

sudo apt-get install gcc-arm-linux-gnueabihf

Configure the kernel. When cross-compiling, the ARCH and CROSS_COMPILE variables must be set.

  • Under "Kernel Features ->  Preemption Model", select the preemption level (i.e. PREEMPT_RT_BASE)
  • Under "Kernel hacking" uncheck "Debug preemptible kernel" (known to cause slowdown)

# update configuration
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- tegra_defconfig
make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- menuconfig

# compile
make -j12 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf-
make -j12 ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- dtbs

# install the modules into the rootfs (adjust according to your setup)
sudo make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- INSTALL_MOD_PATH=../Linux_for_Tegra/rootfs/ modules_install

There should be a directory created under Linux_for_Tegra/rootfs/lib/modules/.
The last step consists of flashing the device with the fresh kernel. However, the flash.sh script did not work correctly for me: the "-K" and "-d" options do not configure the rootfs correctly. This step must be done manually, and the settings are not overwritten by the flash script. Beware that the apply_binaries.sh script will copy the zImage from the kernel directory. To avoid using the previous kernel by mistake, replace the default zImage in the kernel directory with your own.

The Device Tree Blob (DTB) describes the hardware addresses that are specific to the device. Previously, one had to write these memory locations in C, whereas now they are described declaratively and compiled to a binary format. In my understanding, this simplifies supporting a wide range of devices. I was not able to boot the board without the proper DTB file set in the bootloader.


sudo cp arch/arm/boot/zImage ../Linux_for_Tegra/rootfs/boot/
sudo cp arch/arm/boot/zImage ../Linux_for_Tegra/kernel/

sudo cp arch/arm/boot/dts/tegra124-jetson-tk1.dtb ../Linux_for_Tegra/rootfs/boot/

Set up the bootloader

Edit the bootloader configuration file Linux_for_Tegra/rootfs/boot/extlinux/extlinux.conf to set the FDT field to the path of the Device Tree Blob (do not change the other fields):

MENU TITLE Jetson-TK1 eMMC boot options

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/zImage
      FDT /boot/tegra124-jetson-tk1.dtb

Then, put the device in recovery mode (hold the recovery button while pressing reset; the board should show up in dmesg) and execute the usual flash command:

cd Linux_for_Tegra/
./flash.sh jetson-tk1 mmcblk0p1

After the board reboots, it should run the real-time kernel. The board has a serial port, and running minicom through a serial-to-USB adapter allowed me to see the boot logs. This is required because the screen stays blank at this early boot stage.

I still have an issue with CONFIG_PREEMPT_RT_FULL. Without the serial cable, it would not have been possible to diagnose the problem:

[    5.170764] Unable to handle kernel paging request at virtual address ffefe574
[    5.186139] kernel BUG at kernel/locking/rtmutex.c:1011!
[    5.981440] Fixing recursive fault but reboot is needed!



I use CONFIG_PREEMPT_RT_BASE instead, but I don't know exactly what the difference is compared to the full RT configuration; this is something to investigate.

Friday, December 4, 2015

MachineKit LTTng kernel trace

What does MachineKit execution look like at the system level? Here are some observations made using an LTTng kernel trace on Linux preempt-rt 4.1.13-rt15 on a 4-core machine. The workload traced is the abs.0 unit test. This test spawns halcmd, simulating the position control of a motor (or something like that). Here is the command:

cd machinekit/
. scripts/rip-environment
cd tests/abs.0/
halcmd -v test.hal

The overall command takes 3.3s to complete. Let's check what's in the trace.

First, the rtapi processing takes about 600ms to execute and is shown in Figure 1. One of the related processes, named fast:0, seems to do the actual real-time job and is shown in Figure 2. This process sleeps periodically using the clock_nanosleep() system call. I observed a wake-up latency of about 30us, and the thread itself runs for about 2-3us in user space before going back to sleep.


Figure 1: Overview of the rtapi execution.
Figure 2: Real-time process with period of about 300us.
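For reference, the ~30us wake-up latency mentioned above can be measured from the kernel trace as the time between a task's sched_wakeup event and the sched_switch event that schedules it in. Here is a minimal sketch of that computation, assuming hypothetical (name, timestamp_ns, fields) tuples decoded from the trace (not the actual analysis code):

# Compute wake-up latencies for one task from sched events (illustration only).
def wakeup_latencies(events, task='fast:0'):
    pending = None
    for name, ts, fields in events:
        if name == 'sched_wakeup' and fields.get('comm') == task:
            pending = ts
        elif name == 'sched_switch' and fields.get('next_comm') == task:
            if pending is not None:
                yield ts - pending  # latency in nanoseconds
                pending = None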

The scripts and utilities that manage the test environment behave quite differently from the RT process. In particular, there is a bunch of processes that interact for very short durations, namely halcmd and rtapi, as shown in Figure 3. They perform sendto() and recvfrom() system calls, probably for round-trip communication. Maybe using shared memory could help here to streamline the communication. There are also numerous sleeps of 10ms to 200ms performed by halcmd along the way.

Figure 3: Interactions between halcmd and rtapi.
There is a pattern of fork-exec-wait done by the realtime script, shown in Figure 4. This script spawns the inivar, flavor and halrun executables, which together run for more than 400ms. Using a programmatic API instead of spawning executables would make this processing more efficient.


Figure 4: Repeated fork-exec-wait pattern.
Finally, there is a giant 2.5s sleep at the end of the test, shown in Figure 5. It therefore represents 75% of the test execution time. This kind of sleep is usually done for quick-and-dirty synchronization, but it should be replaced by waiting for the actual event of interest, so the test finishes as quickly as possible. This way, the time to run the unit tests could be decreased significantly, which would mean greater productivity for developers.

Of course, if we had a userspace trace in addition to the kernel trace, we would have greater insight into the internal state of MachineKit. But even without it, we can observe interesting behavior. The next step will be to actually plug in and control some real hardware in real time.

Thursday, April 2, 2015

Why malware detection based on syscall n-grams is unlikely to work

In the field of computer security, it has been suggested to use n-grams of system calls to detect malware running on a computer. Let's demonstrate that there is no relationship between n-gram counts and malicious behavior.

False-negative example. The first trace is from a weather app that opens a config file, performs a remote procedure call to get the data, and then closes its file descriptors. The second is the same sequence, but from a malicious app that gained root privileges by some means and sends the content of the password file.


Trace A1: open(“app.conf”); read(); connect(); send(); recv(); close(); close()
Trace A2: open(“/etc/passwd”); read(); connect(); send(); recv(); close(); close()


The probability of each n-gram is trivially the same, since both traces contain the same system call sequence. Therefore, the algorithm fails to detect the malware.

False-positive example. Both sequences are functionally equivalent and non-malicious, but the second trace reduces the maximum number of file descriptors open at any one time.


Trace B1: open(); open(); open(); read(); read(); read(); close(); close(); close()
Trace B2: open(); read(); close(); open(); read(); close(); open(); read(); close()


Let’s use Python to generate all n-grams of size 3.


import itertools

ngrams = list(itertools.product(['open', 'read', 'close'], repeat=3))
n-gram               B1  B2  in symmetric difference
open  open  open      1   0  yes
open  open  read      1   0  yes
open  open  close     0   0
open  read  open      0   0
open  read  read      1   0  yes
open  read  close     0   3  yes
open  close open      0   0
open  close read      0   0
open  close close     0   0
read  open  open      0   0
read  open  read      0   0
read  open  close     0   0
read  read  open      0   0
read  read  read      1   0  yes
read  read  close     1   0  yes
read  close open      0   2  yes
read  close read      0   0
read  close close     1   0  yes
close open  open      0   0
close open  read      0   2  yes
close open  close     0   0
close read  open      0   0
close read  read      0   0
close read  close     0   0
close close open      0   0
close close read      0   0
close close close     1   0  yes
total                 7   7


The sets of n-grams observed in the two traces are completely disjoint. Therefore, the algorithm will wrongly flag trace B2 as malicious, while it is in fact the result of a simple optimization.
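For reference, here is a short sketch that reproduces these counts (plain Python, for illustration):

from collections import Counter

# Count the sliding-window n-grams of a system call sequence.
def ngram_counts(trace, n=3):
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))

b1 = ['open', 'open', 'open', 'read', 'read', 'read', 'close', 'close', 'close']
b2 = ['open', 'read', 'close'] * 3

c1, c2 = ngram_counts(b1), ngram_counts(b2)
print(sum(c1.values()), sum(c2.values()))  # 7 7
print(set(c1) & set(c2))                   # set(): the observed n-grams are disjoint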

Moreover, a malicious app can evade detection by randomly issuing dummy system calls, which swamp the potentially malicious sequences. Because each n-gram probability becomes statistically insignificant, the algorithm will fail to identify the change of behavior and will not classify the app as malware.

Therefore, we conclude that this technique is unlikely to provide any benefit for identifying malware automatically.

Tuesday, December 2, 2014

Full stack Python tracing

The project profile-python-ust allows recording function calls and returns from interpreted Python code using LTTng-UST. As an example, consider this foo.py code (which itself calls bidon.py, a module that just prints something on the console):

import bidon

def baz():
    bidon.bidon()

def bar():
    baz()

def foo():
    bar()

if __name__=='__main__':
    foo()
The trace produced is:

python:call: { cpu_id = 7 }, { co_name = "" }
python:call: { cpu_id = 7 }, { co_name = "foo" }
python:call: { cpu_id = 7 }, { co_name = "bar" }
python:call: { cpu_id = 7 }, { co_name = "baz" }
python:return: { cpu_id = 7 }, { }
python:return: { cpu_id = 7 }, { }
python:return: { cpu_id = 7 }, { }
python:return: { cpu_id = 7 }, { }
However, the tracing hooks into CPython are not called for external C modules, so we lose track of what is going on. For instance, Cython compiles Python code into a native shared library for greater efficiency. To overcome this limitation, I compiled the Cython foo.pyx code with GCC function instrumentation. Here is the setup.py file (the same code gets compiled twice, with and without instrumentation):

from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension

ext = [
    Extension('foo1', ['foo1.pyx'],
              extra_compile_args = ['-g'],
    ),
    Extension('foo2', ['foo2.pyx'],
              extra_compile_args = ['-finstrument-functions', '-g'],
    ),
]

setup(
    ext_modules = cythonize(ext),
)

I then invoked both modules, foo1.foo() and foo2.foo(), with the lttng-ust-cyg-profile instrumentation, which hooks into the GCC instrumentation. I also enabled kernel tracing, just for fun. Both traces (kernel and user space) are loaded into a Trace Compass experiment.

The screenshot below shows the user-space call stack over time for the foo1 and foo2 executions. In the first execution, we see the entry point inside foo1.so and the invocation of Python code from the library, but function calls inside foo1.so are not visible. In the second execution, both the Python and the GCC instrumentation are available, and the complete call stack is visible. The highlighted time range is about 6us and corresponds to the write() system call related to the print() on the console. We can see the system call duration and the blocking in the control flow view.


Unfortunately, symbols can't be resolved, because tracking the library load base addresses can't be enabled together with the Python instrumentation at this time, due to this bug.

This was an example of how three sources of events together create a complete picture of the execution!

EDIT:

A Cython module can call the profiling hooks of the main interpreter. Call cython with the profile directive enabled and compile the resulting code with -DCYTHON_TRACE=1. This can be done in setuptools with:


 cythonize(ext_trace, compiler_directives={'profile': True}),

Thanks to Stefan Behnel for this tip!

Wednesday, November 19, 2014

How Linux changes the CPU Frequency

The ondemand governor adjusts the processor frequency to make a trade-off between power and performance; it is the implementation of dynamic frequency scaling under Linux. However, it was also causing unexpected delays in my performance benchmarks. The following figure shows the delay density of RPC requests for the ondemand and performance CPU frequency governors. The delay variation is lower when the CPU frequency is fixed.

I wanted to visualize how the frequency changes over time and with scheduling. I used LTTng to record power_cpu_frequency and sched_switch events. I loaded them into Trace Compass and defined the analysis declaratively using XML (available here). The CPU frequency is shown in shades of red (a more intense red means a higher frequency), and green shows whether the CPU is busy or idle (you guessed correctly, the holiday season is coming). We see in the following figure that it takes some time for the frequency to ramp up.



In fact, changing the CPU frequency amounts to distorting the elapsed time of the execution. Timestamps in the trace are in nanoseconds rather than clock cycles, but to compare executions, the base should really be clock cycles. We could scale the time using the frequency events to approximate the time in processor cycles.
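A minimal sketch of that scaling, assuming a per-CPU list of hypothetical (timestamp_ns, frequency_khz) change events that covers the interval of interest (illustration only):

# Approximate elapsed clock cycles over [start_ns, end_ns] by integrating
# the piecewise-constant CPU frequency (illustration only).
def elapsed_cycles(freq_events, start_ns, end_ns):
    cycles = 0.0
    for i, (ts, khz) in enumerate(freq_events):
        next_ts = freq_events[i + 1][0] if i + 1 < len(freq_events) else end_ns
        lo, hi = max(ts, start_ns), min(next_ts, end_ns)
        if hi > lo:
            cycles += (hi - lo) * khz / 1e6  # ns * kHz -> cycles
    return cycles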

The kworker task, woken up from a timer softirq, is the one that changes the CPU frequency. The kworker thread runs with a varying period, typically between 4ms and 20ms. The algorithm seems to take the average CPU usage into account, because if the CPU rapidly switches between idle and busy, the frequency never reaches the maximum.

Therefore, a task performing frequent blocking/running cycles may run faster if it shares the CPU with another task, because the average CPU usage will increase and the frequency will be set accordingly. We could also estimate the relative power cost of a segment of computation with good precision.

In conclusion, disabling frequency scaling for benchmarks should reduce variability. Here is how to do it on recent Linux kernels:

set_scaling_gov() {
        gov=${1-performance}
        for i in $(ls -1 /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor); do
                echo ${gov} | sudo tee $i > /dev/null
        done
}

Here is the view for the whole trace. Happy hacking!