mercredi 24 février 2016

Real-life real-time problem

I installed Linux RT on the Jetson TK1 last month. Since then, cyclictest is running to test that latency is always correct. The maximum wake-up latency is usually ~100us, and it can run like that for a long time. However, on rare occasion, a latency of ~8ms occurs when running for a long period of time (about one week). This kind of problem is tricky to find without proper tools. Here is how I found the problem.

I configured LTTng kernel and userspace tracing to continuously trace the system. The kernel trace records scheduling events and system calls while the userspace trace indicates the location of the missed deadline. When a high latency is detected, the trace is stopped and a snapshot of the trace buffers are written on the disk. I let the system running with tracing enabled for a couple of days, and then a snapshot was created, indicating that the problem occurred.

The following figure shows the result of the traces loaded in TraceCompass. We can see this bug is the result of interactions between three tasks, namely ktimersoftd/2, irq/154-hpd and cyclictest. The cyclictest wake-up signal is delayed for a long time. This signal puts back the task in the running queue, and it is delivered by ktimersoftd/2. The ktimersoftd task has lower priority than irq/154-hpd that happens to run for a long time. Even though the cyclictest has higher priority than the IRQ thread, as long as cyclictest is in the wait queue, the priority has no effect. Priority inheritance does not work here because we don't know in advance that ktimersoftd will actually wake-up a high priority task.

Obviously, the ktimersoftd threads should have greater priority than other IRQ threads as they are required by cyclictest to meet its deadline. There are few questions left: how the default priorities are defined, what is actually doing the irq/164-hpd task, and does increasing ktimersoftd priority will fix the issue? More investigation is required.

In the meantime, if you want to look for yourself, the trace is here. Cheers!

3 commentaires:

Unknown a dit…

Can you add more information about the kernel version you are testing against ?

Unknown a dit…

From the trace metadata:
hostname = "tegra-ubuntu";
domain = "kernel";
sysname = "Linux";
kernel_release = "4.4.0-rt3";
kernel_version = "#4 SMP PREEMPT RT Tue Feb 9 10:52:38 EST 2016";
tracer_name = "lttng-modules";
tracer_major = 2;
tracer_minor = 8;
tracer_patchlevel = 0;

Julien Desfossez a dit…

If we look at the trace, it's interesting because we see that ktimersoftd/2 changes priority constantly, between -2 and -100, but I think we don't have enough events to understand why it changes like that.
Another intesting thing from this trace is that the scheduling latency is not to wakeup cyclictest, but to wakeup the process that wakes up cyclictest, so filtering on scheduling latencies of cyclictest does not show anything.