Performance troubleshooting in data centers: an annotated bibliography?

This paper is mainly a survey; its classification scheme is a useful reference.

Note: according to the paper, detailed diagnosis of root causes usually takes place after alarms are raised by anomaly detection.

Detection

  • Reactive
    • metric violation
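      • threshold-based: the simplest approach; it only requires comparing a metric against a threshold
      • modeling-based
        • machine learning models
        • time-series models
        • queuing models
      • correlation-based: analyzes historical data
    • outlier detection: based on information and probability theories
    • post-detection: reduces false positives or ranks the detected anomalies
  • Proactive: mainly predicts problems from the current state of the system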
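As a rough illustration of the two simplest reactive approaches listed above, the sketch below contrasts a fixed-threshold check with a very small time-series model (a rolling mean and standard deviation). The function names, the window size, the factor `k`, and the sample latency values are all assumptions made for this example; they are not taken from the paper.

```python
from statistics import mean, stdev


def threshold_violations(samples, limit):
    """Threshold-based detection: indices of samples above a fixed limit."""
    return [i for i, x in enumerate(samples) if x > limit]


def rolling_zscore_violations(samples, window=5, k=3.0):
    """Modeling-based detection (toy time-series model).

    A sample is flagged when it deviates by more than k standard
    deviations from the mean of the preceding `window` samples.
    """
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if samples[i] != mu:
                flagged.append(i)
        elif abs(samples[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged


# Hypothetical request latencies in milliseconds, with one spike at index 7.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]

print(threshold_violations(latency_ms, limit=50))
print(rolling_zscore_violations(latency_ms))
```

The threshold check needs a hand-picked limit per metric, while the rolling model derives its expectation from recent history, which is the trade-off the classification hints at.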

Diagnosis

  • dependency inference
    • classified by the level at which detection operates
      • machine level
      • request level
      • process/thread level
    • active probing
  • correlation analysis: mostly uses machine learning methods
  • similarity analysis: typically compares a component against peer machines or peer software components
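A minimal sketch of the peer-comparison idea behind similarity analysis: machines doing similar work should report similar metric values, so the one that strays far from the group is a diagnosis candidate. The median / median-absolute-deviation rule, the factor `k`, and the machine names and CPU values are assumptions chosen for illustration, not a method described in the paper.

```python
from statistics import median


def deviating_peers(metric_by_machine, k=3.0):
    """Return machines whose metric is far from the peer median.

    Distance is measured in units of the median absolute deviation (MAD),
    which stays stable even when one peer is badly off.
    """
    values = list(metric_by_machine.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    suspects = []
    for name, value in metric_by_machine.items():
        if mad == 0:
            # All peers agree on the median: any difference is suspicious.
            if value != med:
                suspects.append(name)
        elif abs(value - med) / mad > k:
            suspects.append(name)
    return sorted(suspects)


# Hypothetical CPU utilization of five peer machines running the same service.
cpu_util = {"m1": 0.42, "m2": 0.40, "m3": 0.45, "m4": 0.95, "m5": 0.41}

print(deviating_peers(cpu_util))
```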

Supporting infrastructure

Recovery

  • researching new designs that support fault tolerance
  • restart
  • re-schedule / re-execution
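The re-schedule / re-execution strategy can be sketched as a loop that retries a failed task on the next available worker. Everything here (the `run_with_reexecution` helper, the worker names, the attempt limit, and the simulated failure) is hypothetical and only meant to make the idea concrete.

```python
def run_with_reexecution(task, workers, max_attempts=3):
    """Run `task` on successive workers until one attempt succeeds.

    Returns a (worker, result) pair. Raises RuntimeError if every
    attempt fails, listing the errors that were seen.
    """
    errors = []
    for attempt in range(max_attempts):
        # Re-schedule: move on to the next worker, wrapping around if needed.
        worker = workers[attempt % len(workers)]
        try:
            return worker, task(worker)
        except Exception as exc:
            errors.append((worker, exc))
    raise RuntimeError(f"task failed after {max_attempts} attempts: {errors}")


def flaky_task(worker):
    """Simulated task that always fails on worker "w1"."""
    if worker == "w1":
        raise ConnectionError("w1 is unresponsive")
    return f"done on {worker}"


result = run_with_reexecution(flaky_task, ["w1", "w2"])
print(result)
```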