Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering

This paper uses prediction error to indicate whether a trace is "interesting". A large prediction error means the model cannot predict the trace well, so the trace is interesting. Tail-based sampling is then performed on this basis.

To reduce computational and storage overheads, most tracing frameworks do not keep all traces, but sample them uniformly at random. However, uniform random sampling inevitably captures redundant, common-case execution traces.

This paper aims to bias trace sampling toward interesting traces. To that end, the authors address the following high-level challenges:

  • We cannot rely on manually engineered features.
  • Sampling decisions must be robust to noise and heterogeneity in the trace data.
  • Sampling techniques face strict operational requirements to be useful in practice, for example:
    • sampling decisions must be fast;
    • techniques must scale to a large volume of traces;
    • they must operate online over a continuous stream of traces.

Overall, Sifter maintains a low-dimensional probabilistic model of execution paths in the distributed system. The model is kept low-dimensional precisely to force it to find an approximate representation.

Motivation

Distributed Tracing Background

What is useful here is that it gives a definition:

A key design goal for distributed tracing tools is to trace production workloads – that is, to produce, collect, and store traces at large request volume, with reasonable overheads. In the extreme, tracing tools can capture an end-to-end execution trace for every request in the entire production workload.

Tracing Pipeline and Overheads

This section mainly describes the three sources of overhead:

  1. Runtime overhead to generate trace data and to propagate trace contexts alongside requests.
  2. Network overhead to route trace data to tracing backends.
  3. Computational cost to process a trace: constructing its abstract representation and applying feature-extraction functions.

Reducing Overheads by Sampling

The figure here is decent.

  • Head-Based Sampling. Used by both Dapper and Canopy. However, the set of sampled traces contains mostly common-case execution paths, with a lot of overlap and redundant information.
  • Tail-Based Sampling. This part contains one of the paper's core claims: although the runtime overheads of distributed tracing are non-zero, they are not dominant compared to the subsequent trace processing, querying, and storage costs.

This paper builds on the core idea of tail-based sampling.

Limitations of Existing Approaches

Feature Engineering

Existing approaches rely on manual feature engineering, which is naturally limited to only those features that developers can predict will be interesting a priori, and those that can be easily written as feature-extraction functions. The paper argues that, ideally, a tail-based sampler should not require developers to explicitly specify features on which to bias the sampling decision: the sampling decision should be made directly on the underlying trace data, and features should be learned rather than engineered.

Operational Requirements

Trace sampling must produce its results online, but existing general-purpose machine learning and graph mining algorithms are offline: they run slowly and incur large computational costs.

Scalability must also be considered.

Goals

The key here is an expression: the goal is to output a sampling function. Additionally, the algorithm should run in O(1) time per trace and keep only constant-size state.
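The requirement can be sketched as a type signature (the names and types below are our placeholders, not the paper's notation):

```python
from typing import Any, Callable

Trace = Any  # placeholder; a trace is a DAG of events

# The goal as a type: a sampler consumes one trace at a time and returns
# a keep/discard decision, in O(1) time per trace, with state that does
# not grow with the length of the trace stream.
Sampler = Callable[[Trace], bool]

def keep_all(trace: Trace) -> bool:
    # Trivial example satisfying the interface (but not the bias goal):
    # it is O(1) and stateless, yet keeps every common-case trace.
    return True
```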

Challenges

A Primer on Traces

Some definitions:

  • Events: Events are the core building block of a trace.
  • Relationships: Events alone tell some of the story, but the most important information in a trace is the causal ordering between events.
  • Traces: A trace combines events and their causal ordering into a DAG.

Heterogeneous Event Annotations

For trace sampling, incorporating event annotations is not easy, since it would involve natural language processing. The paper finds a way around this: label each event with (for example) its source file and line number. This kind of label is called an origin.
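A minimal sketch of deriving such an origin label, assuming the instrumentation can see the frame where the event was logged (the helper name is ours, not the paper's):

```python
import inspect

def origin_label(frame) -> str:
    """Hypothetical helper: derive an event's 'origin' label from the
    source location where the event was logged. Using file:line as the
    label sidesteps NLP on free-form annotation text."""
    info = inspect.getframeinfo(frame)
    return f"{info.filename}:{info.lineno}"

# Usage: tag an event at its log site.
label = origin_label(inspect.currentframe())
```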

Incorporating Structure

Not only labels but also structure must be taken into account. Structure, however, can cause the amount of state to explode; Sifter tackles this based on two intuitions.

First, although the total number of paths through traces is intractably large, the number of "high-signal" paths is likely manageable. For example, the same subsequence of events often appears in exactly the same order, because there are no branch points between consecutive events. Although we cannot predict which paths will be useful, we can apply efficient dimensionality-reduction techniques.

Second, the authors argue that useful path-based features are more likely to be short than long.

Approximate Matching

Exact matching does not work: a model that overfits to exact paths cannot cope with the presence of noise.

Cross-Component Tracing

  • Different system components may incorporate tracing with varying levels of detail.
  • Because microservices have complex dependencies, and the same service may be invoked from different call sites, two traces may look different at a high level while actually being the same.
  • Change over time. As the name suggests: systems evolve, so trace patterns shift.

Design

Dimensionality Reduction

The user first specifies a fixed N, and the framework extracts all happens-before paths of length N from the traces. Figure 3 illustrates the whole process.
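A toy sketch of the path extraction, assuming a trace is represented as an adjacency map from each event's origin label to its causal successors (this representation is ours, for illustration only):

```python
def length_n_paths(edges, n):
    """Enumerate all happens-before paths of exactly n events in a trace
    DAG. `edges` maps an event label to the labels of its direct
    successors; the paper's traces carry richer event data."""
    paths = []

    def walk(path):
        if len(path) == n:
            paths.append(tuple(path))
            return
        for nxt in edges.get(path[-1], []):
            walk(path + [nxt])

    for start in edges:
        walk([start])
    return paths

# Toy trace: a -> b -> d and a -> c -> d (a fork that rejoins).
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
length_n_paths(edges, 3)  # [('a', 'b', 'd'), ('a', 'c', 'd')]
```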

The input consists of the N-1 labels that remain after removing the label at position N/2 (rounded up). For example, with N = 3, the labels at positions 1 and 3 are kept as input. Each input label is one-hot encoded: the position corresponding to the label is set to 1.

The first layer of the network maps each one-hot encoding to a P-dimensional vector (P is also fixed in advance, e.g. 10). The purpose of this step is dimensionality reduction.

The second layer concatenates these vectors and applies a softmax to predict the output label, i.e. the held-out label at position N/2 (rounded up), also in one-hot form.
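The two layers can be sketched in numpy (untrained and randomly initialized; the shapes follow the description above, but all names and sizes here are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

V, N, P = 50, 3, 10  # vocab of origin labels, path length, embedding dim

# Layer 1: shared embedding table, one P-dim vector per origin label.
# Looking up a row is equivalent to multiplying by a one-hot vector.
E = rng.normal(0, 0.1, size=(V, P))
# Layer 2: maps the (N-1) concatenated embeddings to logits over labels.
W = rng.normal(0, 0.1, size=((N - 1) * P, V))

def predict_middle(context_ids):
    """Given the N-1 context label ids (the middle one held out),
    return a softmax distribution over the held-out label."""
    h = np.concatenate([E[i] for i in context_ids])  # dimensionality reduction
    logits = h @ W
    z = np.exp(logits - logits.max())
    return z / z.sum()

def prediction_loss(context_ids, middle_id):
    """Cross-entropy of the true middle label: Sifter's signal for
    how 'surprising' (interesting) this path is."""
    return -np.log(predict_middle(context_ids)[middle_id])
```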

Sampling Using Prediction Error

Sifter uses this prediction error as its core sampling signal. A high prediction error means the trace cannot be predicted well and is therefore an edge case.

Calculating Sampling Probability

Sifter maintains a window of length k (yes, k is also predefined) holding the prediction losses of the k most recent traces. A sampling probability is then computed from the current loss relative to these past losses.
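A sketch of the sliding window, with one plausible probability rule (normalizing the current loss against the window total). The paper defines the exact formula, so treat this as illustrative only:

```python
from collections import deque

class LossWindow:
    """Sliding window of the k most recent prediction losses. The
    probability rule here (current loss over window total) is one
    plausible instantiation of 'bias toward relatively high loss',
    not the paper's exact formula."""

    def __init__(self, k):
        self.losses = deque(maxlen=k)  # old losses fall off automatically

    def sampling_probability(self, loss):
        self.losses.append(loss)
        total = sum(self.losses)
        return loss / total if total > 0 else 1.0

# A common-case loss gets a low probability; an outlier gets a high one.
w = LossWindow(k=4)
for l in [1.0, 1.0, 1.0, 1.0]:
    p_common = w.sampling_probability(l)
p_outlier = w.sampling_probability(9.0)
```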

Sifter Workflow

This section is just the workflow. One useful cited fact: typical tracing backends aggregate events in memory for a short period of time.

Scalability

The main point is that the model stores very little state. That is unsurprising, since it is just two layers plus a softmax.
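To see why the state is constant-size, count the parameters: the embedding table plus the second-layer weights, independent of how many traces have been processed (the sizes below are invented for illustration):

```python
def sifter_state_size(vocab, n, p):
    """Number of parameters Sifter must store: one p-dim embedding per
    origin label, plus second-layer weights mapping the (n-1)
    concatenated embeddings to logits over the vocabulary. Constant in
    the number of traces processed."""
    embedding = vocab * p
    output = (n - 1) * p * vocab
    return embedding + output

sifter_state_size(vocab=1000, n=3, p=10)  # 30000 parameters
```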