vendredi 7 janvier 2011

Tracing a linux cluster

Gather traces from multiple machines and then showing them all at once is a real challenge, because each computer has it's own independent clock. It means that two events occurring at the same time on two computers have a low probability to have the same time stamp in the trace.

Linux Trace Toolkit have a synchronization feature to adjust trace time stamps to a reference trace. It works by matching sent and received network packets found in traces. Two parameters are computed: an offset, but also a drift, that is the rate of clock change.

To experiment it, I did a virtual Ubuntu cluster of 10 nodes installed with LTTng. Virtual machines are sending each other TCP packets with the small utility pingpong while tracing is enabled. All traces are then sync to the host for analysis. Here are the results:

$ lttv -m sync_chain_batch --sync --sync-stats -t trace1 -t trace2 [...]
Resulting synchronization factors:
 trace 0 drift= 1 offset= 6.80407e+09 (2.551684) start time= 2433.182802117
 trace 1 drift= 1 offset= 4.12691e+09 (1.547687) start time= 2433.254700670
 trace 2 drift= 0.999999 offset= 0 (0.000000) start time= 2433.229372444
 trace 3 drift= 1 offset= 1.53079e+09 (0.574083) start time= 2433.366687088
 trace 4 drift= 1 offset= 2.31812e+10 (8.693471) start time= 2433.217954184
 trace 5 drift= 1 offset= 1.48957e+10 (5.586227) start time= 2433.280525047
 trace 6 drift= 1 offset= 1.76688e+10 (6.626211) start time= 2433.291125350
 trace 7 drift= 1 offset= 2.0463e+10 (7.674116) start time= 2433.325424020
 trace 8 drift= 0.999999 offset= 1.2206e+10 (4.577534) start time= 2433.204361289
 trace 9 drift= 1 offset= 9.55947e+09 (3.585022) start time= 2433.293578996

The trace 2 has been selected as the reference trace. We see inside parenthesis the offset between the current trace and the reference trace. We see a delay between each trace that match the delay between virtual machines startup. When loading these traces into lttv with --sync option, all traces events align perfectly, which is not the case without the sync option.

If you want to know more about trace synchronization, I recommend the paper Accurate Offline Synchronization of Distributed Traces Using Kernel-Level Events from Benjamin Poirier.

One simple note if you want to do this analysis, you need extended network trace events. To enable them, arm ltt with

$ sudo ltt-armall -n

I found it really interesting, because it allows to see cluster-wide system state.

Aucun commentaire: