In our upcoming SoCC 2018 paper, we present Kurma, our fast and accurate load balancer for geo-distributed storage systems. By decomposing end-to-end request completion time into base propagation latency, network congestion, and service time distribution, Kurma accurately estimates the rate of SLO violations for requests redirected across geo-distributed datacenters. By operating at the granularity of seconds, Kurma reduces SLO violations by a factor of up to 3 or reduces the cost of running the service by up to 17%.
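To make the idea of estimating SLO violations from separate latency components concrete, here is a minimal sketch. It is not Kurma's actual model: the function name `slo_violation_rate`, the assumption that congestion delay and service time are independent, and all the numbers are ours, chosen only for illustration. It treats the base propagation latency as a constant and combines empirical samples of the other two components to estimate the fraction of requests that would miss the SLO.

```python
from itertools import product


def slo_violation_rate(base_rtt_ms, congestion_ms, service_ms, slo_ms):
    """Estimate P(base RTT + congestion delay + service time > SLO).

    congestion_ms and service_ms are empirical samples. This sketch assumes
    they are independent, so every pairing of the two is equally likely.
    """
    if not congestion_ms or not service_ms:
        raise ValueError("need at least one sample of each component")
    pairs = product(congestion_ms, service_ms)
    violations = sum(1 for c, s in pairs if base_rtt_ms + c + s > slo_ms)
    return violations / (len(congestion_ms) * len(service_ms))


# Made-up measurements, in milliseconds (illustrative only).
service = [4.0, 5.0, 6.0, 9.0, 20.0]
congestion = [0.0, 0.5, 1.0, 8.0]

# Serving locally: negligible propagation delay, no cross-datacenter congestion.
local = slo_violation_rate(0.5, [0.0], service, slo_ms=15.0)
# Redirecting to a neighboring datacenter: higher base RTT plus congestion.
remote = slo_violation_rate(6.0, congestion, service, slo_ms=15.0)
```

With these made-up samples, the estimated violation rate is higher for the redirected requests than for the local ones, which is the kind of comparison a load balancer would weigh against the load at each datacenter before redirecting traffic.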
Credits
This is joint work with Kirill Bogdanov (KTH Royal Institute of Technology), Waleed Reda (Université catholique de Louvain/KTH Royal Institute of Technology), Gerald Q. Maguire Jr. (KTH Royal Institute of Technology), Dejan Kostic (KTH Royal Institute of Technology), and Marco Canini (KAUST). The full abstract is as follows:
Abstract
The increasing density of globally distributed datacenters reduces the network latency between neighboring datacenters and allows replicated services deployed across neighboring locations to share workload when necessary, without violating strict Service Level Objectives (SLOs).
We present Kurma, a practical implementation of a fast and accurate load balancer for geo-distributed storage systems. At run-time, Kurma integrates network latency and service time distributions to accurately estimate the rate of SLO violations for requests redirected across geo-distributed datacenters. Using these estimates, Kurma solves a decentralized rate-based performance model enabling fast load balancing (in the order of seconds) while taming global SLO violations. We integrate Kurma with Cassandra, a popular storage system. Using real-world traces along with a geo-distributed deployment across Amazon EC2, we demonstrate Kurma’s ability to effectively share load among datacenters while reducing SLO violations by up to a factor of 3 in high load settings or reducing the cost of running the service by up to 17%.