How to diagnose nanosecond network latencies in rich end-host stacks

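Low-latency network stacks have brought network latency inside end-hosts down to the microsecond scale. However, end-host profilers have high overhead, so they are useful only for confirming a hypothesis, not for diagnosing a problem in the first place.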

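This paper shows how to build a latency diagnosis tool with full-stack coverage and low overhead that can identify, rather than merely confirm, sources of end-host latency. Its measurement approach reconciles CPU and NIC hardware profiling traces across multiple time domains (network and CPU) to reconstruct the lifetimes of network messages inside the end-host with nanosecond precision. It uncovers unexpected sources of latency in both kernel and user-space stacks.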

Our investigation finds that existing end-host profilers fall short for network latency diagnosis for three reasons.

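First, existing profilers fail to capture latency deviations added by the NIC, between the point at which messages enter (or leave) the NIC and the point at which they are received by (or leave) the driver.
Second, their high overhead severely perturbs the latency distribution, drowning out the root cause being sought.
Third, existing profilers are too heavyweight to apply to the whole stack. A developer diagnosing network latency must already have guessed where to look before such a profiler becomes useful: the user identifies the parts of the stack that might add latency deviation, then uses the profiler to measure latency in those parts. As a result, latency sources in unexamined parts of the stack go undiscovered, and important interactions between different parts of the stack go unnoticed [46,68].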

Challenges and key ideas

Four goals

Full coverage
Automatic
Nanosecond precision
Low overhead

Profiling network-message lifetimes

Network messages are processed in the NIC and on one or more CPU cores. Capturing message lifetimes across these devices is challenging for two reasons.

Challenge 1. NIC, CPU-profiling, and software timestamps come from clocks that change independently.

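CPU profiling uses a hardware clock; the NIC clock is kept in sync through a software mechanism such as phc2sys; the software clock is continuously adjusted and synchronized by protocols such as PTP.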

An ideal solution would be to modify time synchronization protocols to expose clock changes to profilers like perf, but that would require changes across the whole software stack.

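So we use a simpler but more expensive solution. After a time synchronization protocol changes the software clock, the kernel recomputes the conversion between the CPU profiling clock and the software clock. By exposing these conversion parameters to user space through the virtual dynamic shared object (vDSO) mechanism, we can poll them from a user thread during profiling. The goal is to capture every change to the software clock so that the most accurate clock mapping is available at every point in time.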
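As a rough illustration of how polled conversion parameters could be applied offline, the sketch below keeps a time-ordered list of mapping snapshots and converts each CPU profiling timestamp with the snapshot that was in effect at that moment. The `ClockMapping` fields and the mult/shift fixed-point form are assumptions made for illustration; the notes do not specify the exact parameters that are exposed through the vDSO.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockMapping:
    """One polled snapshot of conversion parameters (hypothetical field names)."""
    cycle_start: int  # CPU profiling clock value at which this mapping took effect
    base_ns: int      # software-clock time in ns at cycle_start
    mult: int         # fixed-point multiplier
    shift: int        # fixed-point shift


def to_software_ns(mappings, cycles):
    """Convert a CPU profiling timestamp with the mapping in effect at that time.

    `mappings` must be sorted by cycle_start.
    """
    starts = [m.cycle_start for m in mappings]
    i = bisect_right(starts, cycles) - 1
    if i < 0:
        raise ValueError("timestamp precedes the first recorded mapping")
    m = mappings[i]
    return m.base_ns + (((cycles - m.cycle_start) * m.mult) >> m.shift)


# Two snapshots: 0.5 ns per cycle, with the software clock stepped by +10 ns
# at cycle 2000 (made-up numbers).
mappings = [
    ClockMapping(cycle_start=0, base_ns=1_000, mult=1 << 9, shift=10),
    ClockMapping(cycle_start=2_000, base_ns=2_010, mult=1 << 9, shift=10),
]
```

The more often the parameters are polled, the fewer timestamps get converted with a stale mapping, which is why the approach trades extra polling cost for accuracy.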

Challenge 2. The host software stack has no support for tracking network message lifetimes.

Software profiling tools can only track network messages within the network stack

CPU profiling tracks all system activity but not the message lifetimes within which such system activity occurs

The solution: for each message (for example, M2), we try to obtain three timestamps:

A NIC timestamp (M2(n))
A core handover timestamp (M2(h))
An application timestamp (M2(a))

NIC hardware timestamps and application software timestamps are taken at send/receive time (M(n) and M(a)). Messages sent from end-hosts are ordered by application timestamps; messages received at end-hosts are ordered by NIC timestamps.

M(h) is taken at points where messages cross software-processing and core boundaries, for example sock_def_readable.
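To make the three timestamps concrete, here is a minimal sketch that splits a received message's lifetime into a NIC-to-handover segment and a handover-to-application segment. It assumes all three timestamps have already been mapped onto one common clock; the `MessageStamps` type and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MessageStamps:
    name: str
    nic_ns: Optional[int]       # M(n), NIC hardware timestamp
    handover_ns: Optional[int]  # M(h), core handover timestamp
    app_ns: Optional[int]       # M(a), application software timestamp


def receive_segments(m):
    """Split a received message's lifetime at the handover point."""
    if None in (m.nic_ns, m.handover_ns, m.app_ns):
        return None  # incomplete: not every timestamp can always be obtained
    return {
        "nic_to_handover": m.handover_ns - m.nic_ns,
        "handover_to_app": m.app_ns - m.handover_ns,
        "total": m.app_ns - m.nic_ns,
    }


def order_received(messages):
    """Received messages are ordered by their NIC timestamp."""
    return sorted((m for m in messages if m.nic_ns is not None),
                  key=lambda m: m.nic_ns)


m2 = MessageStamps("M2", nic_ns=100, handover_ns=350, app_ns=900)
m1 = MessageStamps("M1", nic_ns=40, handover_ns=90, app_ns=200)
```

For sent messages the same idea applies in the opposite direction, with ordering by the application timestamp instead.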

Diagnosing high message latencies

Even a short run produces a lot of data. To make it manageable, the tool must be able to identify anomalous system activity. Such anomalies include functions that take longer and functions that occur more frequently.

Challenge 3. The same latency deviation may be caused by several anomalies.

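To reduce the ambiguity introduced by nesting, we report only those anomalous functions that cannot be explained by their nested functions.

To decide whether an anomalous function is explained by nested anomalies, we use a heuristic. If the nested anomalies together account for more than 80% of the parent function's latency deviation (see Section 5.2 for the rationale), we conclude that they explain the parent anomaly and omit the parent as a cause of the deviation.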
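The 80% heuristic can be sketched as a walk over a tree of anomalous functions. This is one reading of the rule, not the tool's actual code: every anomalous function is reported unless its directly nested anomalies account for more than 80% of its deviation. The `Anomaly` type and the function names in the example are hypothetical, and deviations are assumed to be positive.

```python
from dataclasses import dataclass, field

EXPLAINED_FRACTION = 0.8  # threshold stated in the notes


@dataclass
class Anomaly:
    function: str
    deviation_ns: float                          # latency deviation of this function
    nested: list = field(default_factory=list)   # anomalous functions it calls


def unexplained(anomaly):
    """Return names of anomalous functions not explained by their nested anomalies."""
    reported = []
    nested_total = sum(child.deviation_ns for child in anomaly.nested)
    explained = bool(anomaly.nested) and (
        nested_total > EXPLAINED_FRACTION * anomaly.deviation_ns
    )
    if not explained:
        reported.append(anomaly.function)
    for child in anomaly.nested:
        reported.extend(unexplained(child))
    return reported


# The child accounts for 900 of 1000 ns, so the parent is omitted.
explained_case = Anomaly("parent_fn", 1000, [Anomaly("child_fn", 900)])
# The child accounts for only 300 of 1000 ns, so the parent is still reported.
partial_case = Anomaly("parent_fn", 1000, [Anomaly("child_fn", 300)])
```

In the first case only `child_fn` is reported; in the second both functions are, since the child does not explain most of the parent's deviation.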

Design and implementation

The framework consists of two parts: the Profiling System and Nsight Analysis.

Profiling System

This is in turn split into two parts: the CPU profiler collects information from call to return; the shim collects metadata, including:

Timestamps: recv, send, sk_data_ready
Core number: collected at the kernel-user space boundary
Socket file descriptor: used to detect head-of-line blocking

The shim is implemented with LD_PRELOAD.

CPU profiling uses perf on Intel PT, with a modified version of perf.

Nsight Analysis

Six categories of anomalies are defined:

functions that take longer
functions that are called more frequently
unexpected functions
entire program context unexpected (as a result of scheduling decisions or interrupts)
absence of system activity: the CPU is waiting on something, for example a memory read caused by a cache miss
cross message interference in the network stack

Four stages

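Find unexpected program context; these are mainly OS and application anomalies
Find gaps in system activity, identifying the cause from the last activity before the gap
Find functions that are slower, more frequently called, or unexpected; these are mainly error-handling code paths (such as TCP retransmission) or application bottlenecks
Find overlap between messages
- head-of-line blocking between messages of the same application
- cross-app network interference between messages of different applications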
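The overlap stage can be sketched as an interval check over reconstructed message lifetimes. This is a simplified illustration: the `Lifetime` type is hypothetical, and an overlap only marks a candidate for head-of-line blocking or cross-app interference, not a confirmed cause.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Lifetime:
    msg: str
    app: str
    start_ns: int
    end_ns: int


def classify_overlaps(lifetimes):
    """Label each pair of overlapping message lifetimes by application relationship."""
    results = []
    ordered = sorted(lifetimes, key=lambda l: l.start_ns)
    for a, b in combinations(ordered, 2):
        # a starts no later than b, so they overlap if b starts before a ends.
        if b.start_ns < a.end_ns:
            kind = ("head-of-line blocking" if a.app == b.app
                    else "cross-app interference")
            results.append((a.msg, b.msg, kind))
    return results


lifetimes = [
    Lifetime("M1", "A", 0, 100),
    Lifetime("M2", "A", 50, 150),
    Lifetime("M3", "B", 120, 200),
    Lifetime("M4", "B", 300, 400),
]
```

The pairwise check is quadratic in the number of messages, which is acceptable for a sketch; a sweep over sorted start times would scale better on long traces.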

Attribution: a single cause is enough

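For program-context and cross-app interference anomalies, no further analysis is needed, because they are generally caused by scheduling
When a child cause replaces a parent cause: when it accounts for more than 80% of the time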

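Notes:

The Intel PT batch is 322 ns, which can cause under-reporting or bogus gaps, so gaps below this are not considered
Functions are not analyzed at fine granularity; instead they are grouped into categories, and this process is partly automated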
- The process of categorization is partly automated. For example, process contexts are automatically derived from perf. We categorize 1350 functions in Linux by hand, using the contextual information from their names in a user configurable CSV file. We also introduce head-of-line blocking as a category. To do so, we automate summarization of all functions that are executed when processing messages with minimum latency. When these functions occur more frequently or their latencies deviate (for example, NIC interrupt processing takes longer) in message lifetimes of the same application, NSight shows head-of-line blocking as one of the root causes. We verify the latency deviations attributed to head-of-line blocking by cross-referencing message lifetimes to identify overlaps.
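The gap stage combined with the 322 ns batching note can be sketched as follows. This assumes a flat, non-overlapping list of function executions on one core, sorted by start time; the event tuple layout and the function names are hypothetical. Gaps shorter than the batch size are dropped, and each remaining gap is attributed to the last activity before it.

```python
INTEL_PT_BATCH_NS = 322  # batch size from the notes; shorter gaps may be bogus


def find_gaps(events, min_gap_ns=INTEL_PT_BATCH_NS):
    """Return (last_function_before_gap, gap_start_ns, gap_length_ns) tuples.

    `events` is a list of (function, start_ns, end_ns), sorted by start_ns,
    with no nesting or overlap between entries.
    """
    gaps = []
    for (prev_fn, _, prev_end), (_, next_start, _) in zip(events, events[1:]):
        gap = next_start - prev_end
        if gap >= min_gap_ns:
            gaps.append((prev_fn, prev_end, gap))
    return gaps


events = [
    ("fn_a", 0, 100),
    ("fn_b", 150, 300),    # 50 ns after fn_a: below the threshold, ignored
    ("fn_c", 1000, 1100),  # 700 ns after fn_b: reported, attributed to fn_b
]
```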