Performance troubleshooting in data centers: an annotated bibliography?

Posted on 2022-09-19 | Category: Papers

This paper is primarily a survey; its classification scheme is a useful reference.

Note: according to the paper, detailed diagnosis of root causes usually takes place after alarms are raised by anomaly detection.

Detection

Reactive
- metric violation
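  - threshold-based: the simplest approach; it only compares a metric against a threshold
  - modeling-based
    - machine learning models
    - time-series models
    - queuing models
  - correlation-based: analyzes historical data
- outlier detection: based on information theory and probability theory
- post-detection: reduces false positives or ranks the detected anomalies
Proactive: mainly predicts upcoming problems from the system's current state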
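
To make the metric-violation categories above concrete, here is a minimal Python sketch, not taken from the paper. It contrasts a threshold-based check with a very simple modeling-based check that treats the mean and standard deviation of a sliding window as the "model". The function names, the window size, the factor `k`, and the latency values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def threshold_violations(samples, limit):
    """Threshold-based: return indices of samples above a fixed limit."""
    return [i for i, value in enumerate(samples) if value > limit]


def zscore_violations(samples, window=5, k=3.0):
    """Simple modeling-based check (illustrative assumption, not the paper's method).

    A sample is flagged when it deviates by more than k standard deviations
    from the mean of the `window` samples that precede it.
    """
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(samples[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged


# Hypothetical request latencies in milliseconds, with one spike at index 6.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]

print(threshold_violations(latency_ms, limit=200))
print(zscore_violations(latency_ms))
```

The threshold check needs a hand-picked limit, while the windowed check adapts to the recent baseline; that trade-off is roughly what separates the two categories.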

Diagnosis

dependency inference
- 根据检测的level分类
  - machine level
  - request level
  - process/thread level
- active probing
correlation analysis: mostly uses machine learning methods
similarity analysis: typically compares against a peer machine or a peer software component
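
As a rough illustration of similarity analysis, the sketch below compares one metric across peer machines and reports the machines that deviate strongly from the group. It uses the median and the median absolute deviation (MAD) as the similarity measure, which is an assumption made here for simplicity and not a method attributed to the paper; the machine names and CPU values are hypothetical.

```python
from statistics import median


def peer_outliers(metric_by_machine, k=3.0):
    """Return machines whose metric is far from the peer median.

    Distance is measured in units of the median absolute deviation (MAD).
    If all peers agree exactly (MAD == 0), any machine that differs from
    the median is reported.
    """
    values = list(metric_by_machine.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return [name for name, v in metric_by_machine.items() if v != center]
    return [
        name
        for name, v in metric_by_machine.items()
        if abs(v - center) / mad > k
    ]


# Hypothetical CPU utilization (%) of machines running the same service.
cpu_percent = {"m1": 41.0, "m2": 43.0, "m3": 40.0, "m4": 42.0, "m5": 95.0}

print(peer_outliers(cpu_percent))
```

The underlying idea is that peers running the same workload should behave alike, so the machine that stands apart is a reasonable first suspect.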

supporting infra

Recovery

Research new designs that support fault tolerance
Restart
re-schedule / re-execution
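
Re-execution can be sketched as a bounded retry loop. The helper below is a hypothetical illustration, not something described in the paper: `run_with_reexecution`, `max_attempts`, and the flaky task are all made-up names, and a real scheduler would typically also move the task to a different machine.

```python
def run_with_reexecution(task, max_attempts=3):
    """Run task(attempt) until it succeeds or max_attempts is reached.

    `task` is any callable that takes the attempt number (starting at 1)
    and raises an exception on failure.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except Exception as error:
            # Remember the failure and re-execute the task.
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def flaky_task(attempt):
    """Hypothetical task that fails on its first two attempts."""
    if attempt < 3:
        raise TimeoutError(f"attempt {attempt} timed out")
    return f"succeeded on attempt {attempt}"


print(run_with_reexecution(flaky_task))
```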