This paper describes Microsoft's experience deploying intra-region RDMA. Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. According to the introduction, unlike previous work, Microsoft uses RDMA for both storage frontend traffic (between compute VMs and storage clusters) and backend traffic (within a storage cluster).
Background
Network Architecture of an Azure Region
Azure's switches currently fall into four tiers: tier 0 (T0), tier 1 (T1), tier 2 (T2), and regional hub (RH). The network uses external BGP (eBGP) for routing and equal-cost multi-path (ECMP) for load balancing.
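As a rough illustration of what ECMP load balancing means, the sketch below hashes a flow's 5-tuple to pick one of several equal-cost next hops, so all packets of a flow follow the same path. The function name, the switch names, and the use of SHA-256 are assumptions made for the example; real switches use their own hardware hash functions, and the notes do not describe Azure's.

```python
import hashlib

def ecmp_next_hop(five_tuple, next_hops):
    """Pick one equal-cost next hop by hashing the flow's 5-tuple.

    Hypothetical helper: the hash function and field encoding are
    illustrative, not what any particular switch implements.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

# Illustrative flow: src IP, dst IP, protocol number, src port, dst port.
flow = ("10.0.0.1", "10.0.1.2", 17, 49152, 4791)
uplinks = ["T1-a", "T1-b", "T1-c", "T1-d"]  # made-up T1 switch names

chosen = ecmp_next_hop(flow, uplinks)
```

Because the choice depends only on the 5-tuple, packets of one flow stay in order on one path, while different flows spread across the available uplinks.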
The paper defines the following four units:
- Rack: a T0 switch and the servers connected to it.
- Cluster: a set of racks connected to the same set of T1 switches.
- Datacenter: a set of clusters connected to the same set of T2 switches.
- Region: datacenters connected to the same set of RH switches. In contrast with short links (several to hundreds of meters) in datacenters, T2 and RH switches are connected by long-haul links whose lengths can be as long as tens of kilometers.
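The four units nest inside each other, so the highest switch tier that traffic between two servers must cross follows directly from where their locations first differ. A minimal sketch of that reasoning, where `ServerLocation` and `highest_tier_crossed` are hypothetical names introduced for this example:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServerLocation:
    """Hypothetical location record; field values are arbitrary labels."""
    region: str
    datacenter: str
    cluster: str
    rack: str

def highest_tier_crossed(a, b):
    """Return the highest switch tier on a path between two servers.

    Compares locations top-down, following the unit definitions above.
    Returns None when the servers are in different regions, which is
    outside the intra-region scope of these notes.
    """
    if a.region != b.region:
        return None
    if a.datacenter != b.datacenter:
        return "RH"   # different datacenters share only RH switches
    if a.cluster != b.cluster:
        return "T2"   # different clusters share only T2 switches
    if a.rack != b.rack:
        return "T1"   # different racks share only T1 switches
    return "T0"       # same rack: traffic stays under one T0 switch

vm = ServerLocation("region-1", "dc-1", "compute-1", "rack-7")
disk = ServerLocation("region-1", "dc-2", "storage-3", "rack-2")
tier = highest_tier_crossed(vm, disk)
```

In this made-up example the VM and its storage server sit in different datacenters, so the path goes through an RH switch and over the long-haul links.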
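Two points are worth noting:
- Because of the long-haul links between T2 and RH switches, the base RTT ranges from a few microseconds within a datacenter to as much as 2 milliseconds across datacenters within a region.
- Azure uses pizza box switches for T0 and T1, and chassis switches for T2 and RH.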
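To see why link length matters so much for RTT, here is a back-of-the-envelope sketch of propagation delay alone. It assumes light travels through fiber at roughly 2 × 10^8 m/s (about two thirds of its speed in vacuum); the path lengths are illustrative and not taken from the paper, and queuing, switching, and host delays are ignored.

```python
# Assumption: approximate signal speed in optical fiber.
FIBER_SPEED_M_PER_S = 2.0e8

def propagation_rtt_us(one_way_path_m):
    """Round-trip propagation delay in microseconds for a given one-way path length."""
    return 2 * one_way_path_m / FIBER_SPEED_M_PER_S * 1e6

# Illustrative path lengths, not measurements.
short_path_rtt = propagation_rtt_us(100)      # about 100 m inside a datacenter
long_path_rtt = propagation_rtt_us(50_000)    # about 50 km across long-haul links
```

Under these assumptions the 100 m path contributes about 1 microsecond of RTT and the 50 km path about 500 microseconds, which is the same order-of-magnitude gap the notes describe between intra-datacenter and intra-region RTT.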
High Level Architecture of Azure Storage
For cost savings and auto-scaling, Azure disaggregates compute and storage resources. As a result, there are currently two main types of clusters in Azure: compute and storage. VMs are created in compute clusters but the actual storage of Virtual Hard Disks (VHDs) resides in storage clusters.
Azure storage consists of three layers: the frontend layer, the partition layer, and the stream layer.
The stream layer is an append-only distributed file system. It stores bits on disk and replicates them for durability, but does not understand higher-level storage abstractions. The daemon process of the stream layer is called an Extent Node (EN).
The partition layer understands different storage abstractions, manages partitions of all the data objects in a storage cluster, and stores object data on top of the stream layer. The daemon process of the partition layer is called a Partition Server (PS).
The frontend layer mainly handles authorization and forwards each request to the corresponding PS. The daemon process of the frontend layer is called a frontend server.
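The division of labor between the three layers can be sketched as a toy in-memory model. Everything here is an assumption made for illustration: the class and method names, the replication count of 3, the CRC-based routing, and the account allow-list are not Azure's actual design, which the notes do not detail.

```python
import zlib

class ExtentNode:
    """Stream-layer daemon (EN): an append-only log of opaque bytes."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def append(self, data):
        self.log.append(data)
        return len(self.log) - 1  # offset of the new record

class StreamLayer:
    """Appends every record to all replicas; knows nothing about objects."""
    def __init__(self, extent_nodes):
        self.extent_nodes = extent_nodes

    def append(self, data):
        offsets = [en.append(data) for en in self.extent_nodes]
        return offsets[0]  # replicas stay in lockstep in this toy model

    def read(self, offset):
        return self.extent_nodes[0].log[offset]

class PartitionServer:
    """Partition-layer daemon (PS): maps object names to stream offsets."""
    def __init__(self, stream):
        self.stream = stream
        self.index = {}

    def put(self, name, data):
        self.index[name] = self.stream.append(data)

    def get(self, name):
        return self.stream.read(self.index[name])

class FrontendServer:
    """Frontend daemon: checks authorization, then forwards to a PS."""
    def __init__(self, partition_servers, allowed_accounts):
        self.partition_servers = partition_servers
        self.allowed_accounts = set(allowed_accounts)

    def _route(self, name):
        # Assumed routing rule: hash the object name onto a PS.
        index = zlib.crc32(name.encode()) % len(self.partition_servers)
        return self.partition_servers[index]

    def put(self, account, name, data):
        if account not in self.allowed_accounts:
            raise PermissionError(account)
        self._route(name).put(name, data)

    def get(self, account, name):
        if account not in self.allowed_accounts:
            raise PermissionError(account)
        return self._route(name).get(name)

nodes = [ExtentNode(f"en-{i}") for i in range(3)]  # 3 replicas is an assumption
stream = StreamLayer(nodes)
frontend = FrontendServer([PartitionServer(stream), PartitionServer(stream)], ["alice"])
frontend.put("alice", "disk0/page42", b"payload")
```

The point of the sketch is the separation of concerns: only the frontend knows about accounts, only the PS knows object names, and the ENs see nothing but appended bytes.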