FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices

The paper is very well written.

Main contributions

  • SVM-based detection and localization of SLO violations
  • RL-based mitigation of SLO violations
  • Online training and a performance anomaly injection framework

Open source and built on K8s: https://gitlab.engr.illinois.edu/DEPEND/firm.git

Background and Characterization

Definition

  1. service dependency graph
  2. execution history graph
  3. critical path

Insights

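  1. The behavior of the critical path itself changes over time; it is dynamic.
  2. A microservice with high latency is not necessarily the root cause of SLO violations.
  3. Existing mitigation methods are scale out and scale up, but both rely on static policies, which makes them a poor fit.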

The FIRM Framework

Tracing Coordinator

Uses an OpenTracing-based approach.

Selected metrics:

cAdvisor / Prometheus

  • cpu_usage_seconds_total
  • memory_usage_bytes
  • fs_write/read_seconds
  • fs_usage_bytes
  • network_transmit/receive_bytes_total
  • processes

Linux perf

  • offcore_response.*.llc_hit/miss.local_DRAM
  • offcore_response.*.llc_hit/miss.remote_DRAM

Critical Path Extractor

This is simply the longest path in a directed acyclic graph. Three types of workflow are defined:

  • parallel
  • sequential
  • background: spans that do not return values to the parent span
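
The longest-path idea can be sketched as follows. This is a minimal illustration, not FIRM's actual extractor: it assumes each node carries a single latency value and that dependencies are given as (parent, child) pairs. The function name `critical_path` and the service names in the example are hypothetical.

```python
from collections import defaultdict, deque


def critical_path(edges, latency):
    """Return (total_latency, path) for the heaviest path in a DAG.

    edges:   iterable of (parent, child) pairs (assumed dependency edges)
    latency: dict mapping every node to its own non-negative latency
    """
    children = defaultdict(list)
    indegree = {node: 0 for node in latency}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Kahn's algorithm: process nodes in topological order.
    queue = deque(n for n, d in indegree.items() if d == 0)
    best = dict(latency)               # heaviest path weight ending at each node
    prev = {n: None for n in latency}  # predecessor on that heaviest path
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            candidate = best[node] + latency[child]
            if candidate > best[child]:
                best[child] = candidate
                prev[child] = node
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if visited != len(latency):
        raise ValueError("graph contains a cycle")

    # Walk back from the node with the heaviest accumulated latency.
    end = max(best, key=best.get)
    total = best[end]
    path = []
    while end is not None:
        path.append(end)
        end = prev[end]
    return total, path[::-1]


# Hypothetical example: latencies in milliseconds.
latency = {"frontend": 5, "auth": 3, "search": 10, "db": 7, "cache": 2}
edges = [("frontend", "auth"), ("frontend", "search"),
         ("search", "db"), ("search", "cache")]
print(critical_path(edges, latency))
```

Parallel branches are handled by taking the maximum over predecessors, and sequential calls become chains of edges. Background spans would have to be dropped from `edges` before calling the function, since they do not block the parent.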

Critical Component Extractor

Two key metrics are extracted:

  • Per-CP Variability: Relative Importance
    • With T_{CP} the latency of the critical path and T_{i} the latency of instance i, the relative importance of instance i is the Pearson correlation coefficient PCC(T_{CP}, T_{i}).
  • Per-Instance Variability: Congestion Intensity
    • Simply T_{99}/T_{50}, the ratio of the 99th-percentile latency to the median latency

These two metrics are fed into an SVM, which dynamically selects the critical components.
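
A minimal sketch of how the two features could be computed from latency samples. It assumes per-request latencies are available as plain lists. The helper names (`pearson`, `percentile`, `congestion_intensity`) are hypothetical, and the nearest-rank percentile is one choice among several. The SVM itself is not shown; the two values would form the feature vector passed to it.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def congestion_intensity(latencies):
    """T_99 / T_50 for one instance's latency samples."""
    return percentile(latencies, 99) / percentile(latencies, 50)


# Hypothetical samples: critical-path latency and one instance's latency
# over the same five requests.
t_cp = [20.0, 22.0, 35.0, 21.0, 40.0]
t_i = [5.0, 6.0, 18.0, 5.5, 22.0]
features = (pearson(t_cp, t_i), congestion_intensity(t_i))
print(features)
```

A relative importance close to 1 means the instance's latency moves together with the critical-path latency. A congestion intensity well above 1 means the instance has a heavy latency tail.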

SLO Violation Mitigation Using RL

The RL Primer and Why RL sections are well explained.

Action Execution

  • CPU Actions use cgroups
  • Memory Actions use Intel MBA and Intel CAT
  • I/O Actions use blkio in cgroups
  • Network Actions use Hierarchical Token Bucket
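
To illustrate the rate-limiting idea behind the network action, here is a minimal single token bucket. This is only the basic building block. Hierarchical Token Bucket as implemented in Linux traffic control adds a class hierarchy and bandwidth borrowing, which this sketch does not model. The class name and parameters are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Single token bucket: `rate` tokens per second, bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens (e.g. bytes) refilled per second
        self.capacity = capacity  # maximum number of stored tokens
        self.tokens = capacity    # start with a full bucket
        self.last = 0.0           # time of the previous refill, in seconds

    def allow(self, size, now):
        """Return True if `size` tokens can be spent at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.last = now
        # Refill according to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


# Hypothetical limit: 100 bytes/s with a 200-byte burst allowance.
bucket = TokenBucket(rate=100, capacity=200)
print(bucket.allow(150, now=0.0))  # fits in the initial burst
print(bucket.allow(100, now=0.0))  # only 50 tokens left at this point
print(bucket.allow(100, now=1.0))  # one second of refill has been added
```

Raising or lowering `rate` is the kind of knob a resource-management action would adjust.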

Performance Anomaly Injector

Uses existing off-the-shelf benchmark tools.