Wait Analysis of Distributed Systems Using Kernel Tracing

The core of this paper is root-cause analysis of waiting in distributed systems: what a task waited on, and how many times it waited.

ANALYSIS ARCHITECTURE

Operating System Trace of a Distributed System

Four task states are considered:

  • running
  • preempted
  • interrupted
  • blocked

This section describes the tracing process and lists the tracepoints used (Table 1). Tracing is done with LTTng, which records events at a set of static tracepoints.
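To make the four states concrete, here is a minimal sketch of how they could be inferred from scheduler and interrupt events. It is not the paper's implementation. The event names follow the Linux `sched_switch`, `irq_handler_entry` and `irq_handler_exit` tracepoints, which are not necessarily the ones in Table 1. The dictionary layout, the `classify` helper and the single-CPU event stream are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    PREEMPTED = "preempted"
    INTERRUPTED = "interrupted"
    BLOCKED = "blocked"


def classify(events, tid):
    """Return closed [(start, end, State)] intervals for thread `tid`.

    `events` is a time-ordered list of dicts (hypothetical layout) that
    all come from one CPU. The last, still-open state is not emitted.
    """
    intervals = []
    state, since = None, None

    def switch_to(new_state, ts):
        nonlocal state, since
        if state is not None and ts > since:
            intervals.append((since, ts, state))
        state, since = new_state, ts

    for ev in events:
        name, ts = ev["name"], ev["ts"]
        if name == "sched_switch":
            if ev["next_tid"] == tid:
                switch_to(State.RUNNING, ts)
            elif ev["prev_tid"] == tid:
                # prev_state == 0 means the task was still runnable when it
                # lost the CPU, so it was preempted; otherwise it blocked.
                if ev["prev_state"] == 0:
                    switch_to(State.PREEMPTED, ts)
                else:
                    switch_to(State.BLOCKED, ts)
        elif name == "irq_handler_entry" and state is State.RUNNING:
            switch_to(State.INTERRUPTED, ts)
        elif name == "irq_handler_exit" and state is State.INTERRUPTED:
            switch_to(State.RUNNING, ts)
    return intervals
```

Counting the `BLOCKED` intervals in the result gives the number of waits, and summing their lengths gives the total wait time. The cause of each wait is what the later graph analysis recovers.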

Trace Synchronization

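One challenge in analyzing distributed events is the absence of a global clock. This section reviews several clock synchronization methods; the authors appear to settle on the convex hull algorithm.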
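The idea behind message-based synchronization can be sketched with a deliberately simplified, offset-only model. This is not the convex hull algorithm itself, which also estimates clock drift. The sketch only shows how matched send/receive timestamps bound the offset between two clocks, since a packet cannot be received before it is sent. The function names and tuple layouts are assumptions.

```python
def estimate_offset(a_to_b, b_to_a):
    """Estimate theta such that clock_B ~= clock_A + theta.

    a_to_b: [(send_ts_on_A, recv_ts_on_B), ...]
    b_to_a: [(send_ts_on_B, recv_ts_on_A), ...]
    Returns (theta, (lower_bound, upper_bound)).
    """
    # A -> B: recv_B - theta >= send_A, so theta <= recv_B - send_A.
    upper = min(recv - send for send, recv in a_to_b)
    # B -> A: recv_A >= send_B - theta, so theta >= send_B - recv_A.
    lower = max(send - recv for send, recv in b_to_a)
    if lower > upper:
        raise ValueError("timestamps do not fit a constant-offset model")
    # Any value in [lower, upper] preserves causality; take the midpoint.
    return (lower + upper) / 2, (lower, upper)


def to_clock_a(ts_on_b, theta):
    """Convert a timestamp taken on host B to host A's time base."""
    return ts_on_b - theta
```

For example, with a true offset of 100 time units:

```python
theta, bounds = estimate_offset(
    a_to_b=[(10, 112), (20, 121)],
    b_to_a=[(130, 33), (140, 41)],
)
# theta == 100.0, bounds == (99, 101)
```

The tighter the fastest packets in each direction, the narrower the interval. A drift-aware method replaces the two scalar bounds with two lines bounding the point clouds.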

Trace Analysis

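Recovering wait causes requires efficient navigation between related events. This is achieved through a directed acyclic graph (DAG) built from the synchronized traces; wait causes are then recovered by traversing the graph backward. Two algorithms are given.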

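Algorithm 1 details the conversion from trace to graph. The main principle is to dispatch on the type of each event recorded in the trace and handle each type with its own rule.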
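A minimal sketch of such an event-type dispatch is given below. It is not Algorithm 1 from the paper: the event types (`start`, `block`, `wakeup`, `exit`), the dictionary fields and the output layout are all hypothetical, chosen to show the shape of the construction. Each task gets a chain of horizontal edges labeled with its state, and each wake-up adds a vertical link from the waker to the woken task.

```python
from collections import defaultdict


def build_graph(events):
    """Build a simplified execution graph from trace events.

    Returns (edges, links):
      edges[task] = [(start, end, state), ...]   horizontal edges
      links       = [(src_task, dst_task, ts)]   vertical wake-up edges
    """
    edges = defaultdict(list)
    links = []
    current = {}  # task -> (state, since)

    def transition(task, new_state, ts):
        # Close the edge for the task's previous state, then open a new one.
        if task in current:
            state, since = current[task]
            if ts > since:
                edges[task].append((since, ts, state))
        current[task] = (new_state, ts)

    for ev in sorted(events, key=lambda e: e["ts"]):
        kind, ts = ev["type"], ev["ts"]
        if kind == "start":
            transition(ev["task"], "running", ts)
        elif kind == "block":
            transition(ev["task"], "blocked", ts)
        elif kind == "wakeup":
            src, dst = ev["source"], ev["target"]
            if src in current:
                # Split the waker's edge so the link has a vertex to attach to.
                transition(src, current[src][0], ts)
            transition(dst, "running", ts)
            links.append((src, dst, ts))
        elif kind == "exit":
            transition(ev["task"], "exited", ts)
    return dict(edges), links
```

Because events are processed in timestamp order and links only point from a waker to a task woken at the same instant, edges never go back in time, which is what keeps the result acyclic.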

We define the active path of execution as the execution path where all blocking edges are substituted by their corresponding subtask. The algorithm is shown in Algorithm 2 and works as follows. The states of the main task are iterated forward, and visited edges are appended to the active path. If a blocked state is found, the incoming wake-up edge is followed, and the backward iteration starts. In the backward direction, the visited edges are prepended to a local path. If an incoming packet is found, its source is followed backward. If a blocking edge is found while iterating backward, this procedure is repeated recursively. The backward iteration stops when the beginning of the blocking interval is reached; the accumulated path is then appended to the result and the forward iteration resumes.
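The procedure can be sketched as follows, under stated assumptions. This is an illustration, not the paper's Algorithm 2. The graph is represented as `edges[task] = [(start, end, state), ...]` plus `links = [(src_task, dst_task, ts)]`, a hypothetical layout. Incoming packets are treated the same way as wake-up links, and a blocked edge with no recorded waker is kept in the path as is.

```python
def active_path(edges, links, main):
    """Return the active path of `main` as [(task, start, end, state), ...]."""
    # (woken task, wake-up time) -> task that caused the wake-up
    waker = {(dst, ts): src for src, dst, ts in links}

    def backward(task, lo, hi):
        """Walk `task` backward from `hi` to `lo`, prepending visited edges."""
        local = []
        for s0, e0, state in reversed(edges.get(task, [])):
            s, e = max(s0, lo), min(e0, hi)  # clip to the blocking interval
            if s >= e:
                continue  # edge lies outside the interval
            src = None
            if state == "blocked" and e0 <= hi:
                src = waker.get((task, e0))
            if src is None:
                segment = [(task, s, e, state)]
            else:
                segment = backward(src, s, e)  # nested wait: recurse
            local = segment + local  # prepend
        return local

    path = []
    for s, e, state in edges.get(main, []):  # forward iteration
        src = waker.get((main, e)) if state == "blocked" else None
        if src is None:
            path.append((main, s, e, state))
        else:
            path.extend(backward(src, s, e))
    return path
```

A small usage example, where `main` waits on `worker`, which in turn waits on `disk`:

```python
edges = {
    "main": [(0, 3, "running"), (3, 10, "blocked"), (10, 12, "running")],
    "worker": [(1, 4, "running"), (4, 8, "blocked"), (8, 10, "running")],
    "disk": [(2, 8, "running")],
}
links = [("worker", "main", 10), ("disk", "worker", 8)]
for segment in active_path(edges, links, "main"):
    print(segment)
```

The blocked interval of `main` between 3 and 10 is replaced by `worker` (3 to 4), `disk` (4 to 8) and `worker` again (8 to 10), which shows that most of the wait is attributable to `disk`.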