This paper is mainly a survey; its classification scheme is worth using as a reference.
Note: according to the paper, detailed diagnosis of root-causes usually takes place after alarms are raised by anomaly detection.
Detection
- Reactive
- metric violation
- threshold-based: the simplest approach; it only requires comparing a metric against a threshold
- modeling-based
- machine learning models
- time-series models
- queuing models
- correlation-based: analyzes historical data
- outlier detection:information and probability theories
- post-detection: reduces false positives or ranks the detected anomalies
- metric violation
- Proactive: mainly predicts anomalies from the current state of the system
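To make the reactive categories concrete, here is a minimal sketch of a threshold-based check next to a very simple statistical check on a time series. It is not taken from the survey: the function names, the window size, and the z-score limit are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def threshold_violations(samples, limit):
    """Threshold-based: return indices of samples above a fixed limit."""
    return [i for i, value in enumerate(samples) if value > limit]


def zscore_outliers(samples, window=5, z_limit=3.0):
    """Flag points that deviate strongly from the preceding window.

    Hypothetical stand-in for a time-series model: each point is compared
    with the mean and standard deviation of the `window` points before it.
    """
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat windows to avoid dividing by zero.
        if sigma > 0 and abs(samples[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged


# Made-up latency samples (ms) with one obvious spike at index 6.
latencies = [100, 102, 98, 101, 99, 100, 250, 101]
print(threshold_violations(latencies, limit=200))
print(zscore_outliers(latencies))
```

The threshold check needs a hand-picked limit, while the windowed check derives its baseline from recent data; that difference is roughly what separates threshold-based from modeling-based detection in the outline above.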
Diagnosis
- dependency inference
- classified by the level at which detection is performed
- machine level
- request level
- process/thread level
- active probing
- classified by the level at which detection is performed
- correlation analysis: mostly uses machine learning methods
- similarity analysis: usually compares against peer machines or peer software components
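As an illustration of similarity analysis, the sketch below compares one metric across peer machines and reports those far from the group median. This is an assumed, simplified scheme (median absolute deviation with an arbitrary tolerance), not the method of any particular system covered by the survey; `deviating_peers` and the machine names are hypothetical.

```python
from statistics import median


def deviating_peers(metric_by_machine, tolerance=3.0):
    """Return machines whose metric is far from the peer median.

    Distance is measured in units of the median absolute deviation (MAD),
    which is less sensitive to the outlier itself than a standard deviation.
    """
    values = list(metric_by_machine.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # All peers (or most of them) agree exactly; flag any that differ.
        return [name for name, v in metric_by_machine.items() if v != center]
    return [
        name
        for name, v in metric_by_machine.items()
        if abs(v - center) / mad > tolerance
    ]


# Made-up CPU utilization of four machines running the same workload.
cpu = {"m1": 0.50, "m2": 0.52, "m3": 0.49, "m4": 0.95}
print(deviating_peers(cpu))
```

The underlying assumption is that peers running the same workload should behave alike, so the machine that differs from the group is the likely culprit.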
Supporting infra
Recovery
- research on new designs that support fault tolerance
- restart
- re-schedule / re-execution
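A minimal sketch of the re-execution idea, assuming a task is a plain Python callable and that any exception counts as a failure. The helper name `run_with_reexecution` and the attempt limit are hypothetical; real schedulers would also move the task to a different machine.

```python
def run_with_reexecution(task, max_attempts=3):
    """Call task() until it succeeds or the attempts are exhausted.

    Returns a (result, attempts_used) pair on success.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as error:  # any failure triggers a re-execution
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


# Made-up task that fails twice with a transient error, then succeeds.
state = {"calls": 0}


def flaky_task():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ValueError("transient failure")
    return "done"


print(run_with_reexecution(flaky_task))
```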