This paper is excellent.
Main contributions
- SVM-based detection and localization of SLO violations
- RL-based mitigation of SLO violations
- An online training and performance anomaly injection framework
Open source, built on Kubernetes: https://gitlab.engr.illinois.edu/DEPEND/firm.git
Background and Characterization
Definition
- service dependency graph
- execution history graph
- critical path
Insights
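- The critical path itself changes over time; it is dynamic
- A microservice with high latency is not necessarily the root cause of SLO violations
- Existing mitigation methods are scale-out and scale-up, but both rely on static policies and are therefore a poor fit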
The FIRM Framework
Tracing Coordinator
Uses an OpenTracing-based approach.
Metrics collected:
cAdvisor / Prometheus
- cpu_usage_seconds_total
- memory_usage_bytes
- fs_write/read_seconds
- fs_usage_bytes
- network_transmit/receive_bytes_total
- processes
Linux perf
- offcore_response.*.llc_hit/miss.local_DRAM
- offcore_response.*.llc_hit/miss.remote_DRAM
Critical Path Extractor
The critical path is the longest path in a directed acyclic graph. Three types of workflow are defined:
- parallel
- sequential
- background: does not return values to the parent span
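The longest-path idea can be illustrated with a minimal sketch. This is only the generic longest-path computation on a weighted DAG, not FIRM's actual extraction algorithm over spans; the function name `longest_path`, the edge-weight representation, and the service names are assumptions made for illustration.

```python
from collections import defaultdict, deque


def longest_path(edges):
    """Return (total_weight, path) of the longest path in a weighted DAG.

    `edges` maps (u, v) -> latency contributed by the call from u to v.
    Hypothetical representation, chosen only for this sketch.
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for (u, v), w in edges.items():
        graph[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0.0, []

    # Kahn's algorithm: produce a topological order of the nodes.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in graph[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")

    # Relax edges in topological order, keeping the best predecessor.
    best = {n: 0.0 for n in nodes}
    parent = {n: None for n in nodes}
    for u in order:
        for v, w in graph[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                parent[v] = u

    # Walk back from the node with the largest accumulated weight.
    node = max(nodes, key=lambda n: best[n])
    total = best[node]
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return total, path[::-1]


example_edges = {
    ("frontend", "auth"): 5,
    ("frontend", "search"): 12,
    ("auth", "db"): 4,
    ("search", "db"): 3,
}
print(longest_path(example_edges))
```

In this made-up graph the path through `search` (12 + 3) is longer than the path through `auth` (5 + 4), so it is reported as the critical path.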
Critical Component Extractor
Extracts two key metrics:
- Per-CP Variability: Relative Importance
- With T_{CP} the latency of the critical path and T_{i} the latency of instance i, the relative importance of instance i is the Pearson correlation coefficient PCC(T_{CP}, T_{i})
- Per-Instance Variability: Congestion Intensity
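- T_{99}/T_{50}, the ratio of the 99th-percentile latency to the median latency
Both metrics are fed into an SVM, which dynamically selects the critical components.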
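A minimal sketch of how the two features could be computed from latency samples before being handed to a classifier. The function names, the nearest-rank percentile method, and the sample data are assumptions for illustration; the SVM itself is not shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series: treated as uncorrelated in this sketch
    return cov / math.sqrt(var_x * var_y)


def percentile(values, p):
    """Nearest-rank percentile for 0 < p <= 100 (one of several conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def congestion_intensity(latencies):
    """T_99 / T_50 of one instance's latency samples."""
    return percentile(latencies, 99) / percentile(latencies, 50)


def instance_features(cp_latencies, instance_latencies):
    """Feature pair (relative importance, congestion intensity) for one instance."""
    return (
        pearson(cp_latencies, instance_latencies),
        congestion_intensity(instance_latencies),
    )


# Made-up samples: the instance latency tracks the critical-path latency exactly.
cp = [10.0, 20.0, 30.0, 40.0]
inst = [1.0, 2.0, 3.0, 4.0]
print(instance_features(cp, inst))
```

A relative importance close to 1 means the instance's latency moves together with the critical path's latency; a congestion intensity well above 1 means the instance has a heavy latency tail.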
SLO Violation Mitigation Using RL
The RL Primer and Why RL sections are well written.
Action Execution
- CPU Actions use cgroups
- Memory Actions use Intel MBA and Intel CAT
- I/O Actions use blkio in cgroups
- Network Actions use Hierarchical Token Bucket
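To make the CPU action concrete, here is a minimal sketch of setting a CPU limit through the cgroup filesystem. It assumes the cgroup v2 `cpu.max` interface, whose content has the form `<quota> <period>` in microseconds (or `max <period>` for no limit). The notes only say "cgroups", so the version, the helper names, and the directory layout are assumptions, not FIRM's actual implementation.

```python
from pathlib import Path


def cpu_max_value(cpu_limit, period_us=100_000):
    """Build the content of a cgroup v2 `cpu.max` file.

    `cpu_limit` is a number of CPUs (for example 1.5), or None for no limit.
    """
    if cpu_limit is None:
        return f"max {period_us}"
    if cpu_limit <= 0:
        raise ValueError("cpu_limit must be positive")
    quota_us = int(cpu_limit * period_us)
    return f"{quota_us} {period_us}"


def apply_cpu_limit(cgroup_dir, cpu_limit, period_us=100_000):
    """Write the limit into `<cgroup_dir>/cpu.max` and return what was written.

    `cgroup_dir` is a hypothetical path to the container's cgroup directory;
    writing to a real cgroup normally requires root privileges.
    """
    value = cpu_max_value(cpu_limit, period_us)
    (Path(cgroup_dir) / "cpu.max").write_text(value + "\n")
    return value


print(cpu_max_value(1.5))   # quota of 150000 us per 100000 us period
print(cpu_max_value(None))  # no limit
```

Scaling a container up or down then amounts to rewriting this file with a larger or smaller quota, with no container restart.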
Performance Anomaly Injector
Uses existing, off-the-shelf benchmark tools.