Empowering Azure Storage with RDMA

This paper describes Microsoft's experience deploying intra-region RDMA. Today, around 70% of traffic in Azure is RDMA, and intra-region RDMA is supported in all Azure public regions. According to the introduction, and unlike prior work, Microsoft uses RDMA for both storage frontend traffic (between compute VMs and storage clusters) and backend traffic (within a storage cluster).

Background

Network Architecture of an Azure Region

Azure's switches currently fall into four tiers: tier 0 (T0), tier 1 (T1), tier 2 (T2), and regional hub (RH). The network uses external BGP (eBGP) for routing and equal-cost multi-path (ECMP) for load balancing.
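
ECMP spreads flows across equal-cost next hops, typically by hashing packet header fields so that all packets of one flow take the same path. The following is a minimal sketch of that idea, assuming a hash over the 5-tuple. The function name, the hash choice, and the switch names are illustrative assumptions, not details from the paper.

```python
import hashlib

def ecmp_next_hop(flow, next_hops):
    """Pick one equal-cost next hop from a hash of the flow's 5-tuple.

    Illustrative only: real switches use vendor-specific hash functions.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]

# Hypothetical flow: (src IP, dst IP, IP protocol, src port, dst port)
flow = ("10.0.0.1", "10.0.1.7", 17, 49152, 4791)
uplinks = ["T1-a", "T1-b", "T1-c", "T1-d"]  # hypothetical T1 switches above one T0
chosen = ecmp_next_hop(flow, uplinks)
```

Because the choice depends only on the flow's header fields, packets of the same flow are not reordered across paths, while different flows are spread over the available uplinks.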

The paper defines the following four units:

  • Rack: a T0 switch and the servers connected to it.
  • Cluster: a set of racks connected to the same set of T1 switches.
  • Datacenter: a set of clusters connected to the same set of T2 switches.
  • Region: datacenters connected to the same set of RH switches. In contrast with short links (several to hundreds of meters) in datacenters, T2 and RH switches are connected by long-haul links whose lengths can be as long as tens of kilometers.
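
The four units nest inside each other, so the location of two servers determines the highest switch tier their traffic has to reach. A small sketch of that reasoning follows; the `ServerLocation` type and `highest_tier` function are hypothetical names, and the model assumes traffic always turns around at the lowest tier the two servers share.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServerLocation:
    region: str
    datacenter: str
    cluster: str
    rack: str

def highest_tier(a: ServerLocation, b: ServerLocation) -> str:
    """Return the highest switch tier on the path between two servers."""
    if a.region != b.region:
        raise ValueError("servers are in different regions")
    if a.datacenter != b.datacenter:
        return "RH"  # crosses the long-haul T2-RH links
    if a.cluster != b.cluster:
        return "T2"
    if a.rack != b.rack:
        return "T1"
    return "T0"

vm = ServerLocation("region-1", "dc-1", "compute-1", "rack-3")
disk = ServerLocation("region-1", "dc-2", "storage-1", "rack-9")
tier = highest_tier(vm, disk)
```

In this hypothetical example the VM and the storage server sit in different datacenters, so their traffic goes up to the RH tier.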

Two points are worth noting:

  1. Because of the long-haul links between T2 and RH switches, RTT varies widely: it is a few microseconds within a datacenter, but can reach 2 milliseconds within a region.
  2. Azure uses pizza box switches for T0 and T1, and chassis switches for T2 and RH.
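
The gap between microseconds and milliseconds follows largely from link length. Here is a back-of-the-envelope sketch, assuming light travels through fiber at roughly 2 x 10^8 m/s and ignoring switching and queueing delay. The distances are hypothetical examples, not measurements from the paper.

```python
FIBER_SPEED_M_PER_S = 2.0e8  # assumed: roughly two thirds of the speed of light in vacuum

def propagation_rtt_us(one_way_km: float) -> float:
    """Round-trip propagation delay in microseconds over one_way_km of fiber."""
    one_way_s = one_way_km * 1000 / FIBER_SPEED_M_PER_S
    return 2 * one_way_s * 1e6

intra_dc = propagation_rtt_us(0.3)  # hypothetical 300 m path inside a datacenter
cross_dc = propagation_rtt_us(100)  # hypothetical 100 km of long-haul fiber one way
```

Under these assumptions, 300 m of fiber contributes about 3 microseconds of round-trip delay and 100 km about 1 millisecond, which matches the orders of magnitude quoted above.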

High Level Architecture of Azure Storage

Azure disaggregates compute and storage resources for cost savings and auto-scaling. As a result, Azure currently has two main types of clusters: compute and storage. VMs are created in compute clusters, but the actual storage of Virtual Hard Disks (VHDs) resides in storage clusters.

Azure storage consists of three layers: the frontend layer, the partition layer, and the stream layer.

The stream layer is an append-only distributed file system. It stores bits on disks and replicates them for durability, but it does not understand higher-level storage abstractions. The daemon process in the stream layer is called the Extent Node (EN).

The partition layer understands the different storage abstractions, manages partitions of all the data objects in a storage cluster, and stores object data on top of the stream layer. The daemon process in the partition layer is called the Partition Server (PS).

The frontend layer handles authentication and forwards each request to the corresponding PS. The daemon process in the frontend layer is called the frontend server.
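
The division of labor between the three layers can be sketched as a toy in-memory model. This is only an illustration under strong assumptions: the class and method names, the token check, and the CRC-based routing rule are all hypothetical and do not reflect Azure's actual interfaces.

```python
import zlib

class ExtentNode:
    """Stream-layer daemon (EN): appends opaque bytes, unaware of objects."""
    def __init__(self):
        self.log = []

    def append(self, data: bytes) -> int:
        self.log.append(data)
        return len(self.log) - 1  # position of the appended record

    def read(self, position: int) -> bytes:
        return self.log[position]

class PartitionServer:
    """Partition-layer daemon (PS): maps object names to stream records."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.index = {}  # object name -> record position on each replica

    def put(self, name: str, data: bytes) -> None:
        # Replicate for durability by appending to every EN.
        self.index[name] = [en.append(data) for en in self.replicas]

    def get(self, name: str) -> bytes:
        return self.replicas[0].read(self.index[name][0])

class FrontendServer:
    """Frontend daemon: checks the caller, then forwards to the right PS."""
    def __init__(self, partition_servers, allowed_tokens):
        self.partition_servers = partition_servers
        self.allowed_tokens = set(allowed_tokens)

    def _route(self, name: str) -> PartitionServer:
        # Hypothetical routing rule: stable hash of the object name.
        index = zlib.crc32(name.encode()) % len(self.partition_servers)
        return self.partition_servers[index]

    def put(self, token: str, name: str, data: bytes) -> None:
        self._check(token)
        self._route(name).put(name, data)

    def get(self, token: str, name: str) -> bytes:
        self._check(token)
        return self._route(name).get(name)

    def _check(self, token: str) -> None:
        if token not in self.allowed_tokens:
            raise PermissionError("request rejected by frontend")

ens = [ExtentNode() for _ in range(3)]
frontend = FrontendServer(
    [PartitionServer(ens), PartitionServer(ens)], allowed_tokens=["secret"]
)
frontend.put("secret", "vhd-001/page-0", b"disk bytes")
data = frontend.get("secret", "vhd-001/page-0")
```

Note that only the partition server knows object names; the extent nodes see nothing but appended byte records, which mirrors the separation described above.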