Project Title
Increasing the Spatial Correlation of Logical Units of Data to Enable an Ultra-Low Latency Internet (ULTRA)
Three-part public summary (current as of January 2025):
Summary of the context and overall objectives of the project
A number of time-critical societal applications deployed in the cloud infrastructure have to provide high reliability and throughput, along with guaranteed low latency for delivering data. This low-latency guarantee is sorely lacking today: the so-called tail latency of the slowest responses in popular cloud services is several orders of magnitude longer than the median response time. This issue renders the existing cloud infrastructures unsuitable for time-critical societal applications.
This project has produced the knowledge needed to build the tools that are key to dramatically reducing the latency of important societal services, including cloud services used by large numbers of people on a daily basis. The overarching goal of this project was to enforce a large degree of correlation in the data requests (logical units of data), both temporally (across time) and spatially (across the servers' memory hierarchy, since server work units require such correlation to achieve high cache hit rates). As a result, the logical units of data can be processed at almost the maximum processing speed of the cloud servers.
Main results achieved so far
In CacheDirector [EuroSys 2019], we show how we unlocked a performance-enhancing feature that had existed in Intel® processors for almost a decade. Our results show that placing incoming packets in the so-called slice of the fast (last-level) cache memory that is closest to the CPU core that will process the packet can reduce tail latencies in latency-critical Network Function Virtualization (NFV) service chains by 21.5%. In our USENIX ATC 2020 paper, we demonstrate that optimizing DDIO (a form of direct cache access) can reduce the latency of I/O-intensive network functions running at 100 Gbps by up to ~30%.
Our technique for load balancing within a single machine, called RSS++ [CoNEXT 2019], incurs up to 14x lower 95th-percentile tail latency and orders of magnitude fewer packet drops than RSS under high CPU utilization. Our Cheetah [NSDI 2020] work is a load balancer that supports uniform load distribution and guaranteed per-connection consistency (PCC) while being scalable, memory efficient, resilient to clogging attacks, and fast at processing packets.
Our Metron journal article [ACM TOCS 2021] expands upon the preparatory work for this project (published at NSDI 2018) by making it possible to accommodate blackbox network functions.
PacketMill [ASPLOS 2021] grinds the whole packet-processing stack, increasing throughput by up to 36.4 Gbps (70%) and reducing latency by up to 101 µs (28%); it enables nontrivial packet processing (e.g., a router) at ≈100 Gbps using only one processing core, even when new packets arrive more than 10x faster than main-memory access times.
In our FAJITA paper [CoNEXT 2024], we showed that a commodity server running a chain of stateful network functions can process more than 170 M packets per second (the equivalent of 1.4 Tbps if average-sized payloads are stored in a disaggregated fashion, as in our earlier Ribosome work [NSDI 2023]). An interesting and perhaps unexpected finding is that, unless the number of so-called “elephant flows” is very small, spreading incoming packets among the cores using plain Receive Side Scaling (RSS) outperforms existing approaches that perform fine-grained flow accounting and load balancing.
Our Packet Order Matters [NSDI 2022] paper (which won the Community Award) shows a surprising result: deliberately delaying packets can improve the performance of backend servers (e.g., those used for Network Function Virtualization) by up to about a factor of 2! This applies to both throughput and latency (including the time spent in our Reframer).
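To make the counterintuitive idea concrete, here is a minimal sketch of a Reframer-like stage, not the published implementation: packets are held for a short window and then released grouped by flow, so the backend sees longer same-flow bursts. The Packet type, flow_id() helper, and the fixed window are illustrative assumptions.

```cpp
// Hedged sketch of the "delay to reorder" idea: buffer briefly, flush by flow.
#include <chrono>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Packet { uint64_t flow; /* payload omitted */ };              // placeholder type
static uint64_t flow_id(const Packet& p) { return p.flow; }          // placeholder helper

class Reframer {
public:
    explicit Reframer(std::chrono::microseconds window) : window_(window) {}

    // Buffer an incoming packet instead of forwarding it immediately.
    void enqueue(Packet p, std::chrono::steady_clock::time_point now) {
        if (buffered_.empty()) deadline_ = now + window_;
        buffered_[flow_id(p)].push_back(std::move(p));
    }

    // Once the window expires, emit all buffered packets flow by flow:
    // packets of the same flow leave back to back, which raises cache and
    // flow-table locality at the receiving server.
    std::vector<Packet> flush_if_due(std::chrono::steady_clock::time_point now) {
        std::vector<Packet> out;
        if (buffered_.empty() || now < deadline_) return out;
        for (auto& [flow, pkts] : buffered_)
            for (auto& p : pkts) out.push_back(std::move(p));
        buffered_.clear();
        return out;
    }

private:
    std::chrono::microseconds window_;
    std::chrono::steady_clock::time_point deadline_{};
    std::unordered_map<uint64_t, std::vector<Packet>> buffered_;
};
```

The small added delay is repaid because the downstream server processes each flow's packets with warm caches, improving both throughput and end-to-end latency.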
In our SoCC 2018 paper, we present Kurma, our fast and accurate load balancer for geo-distributed storage systems, and demonstrate Kurma's ability to effectively share load among datacenters, reducing SLO violations by up to a factor of 3 under high load, or reducing the cost of running the service by up to 17%.
We have also worked on low-latency distributed file systems for the next generation of durable storage. In joint work with researchers from around the world, led by Waleed Reda, we built the Assise distributed file system, which colocates computation with persistent memory module (PMM) storage. Our evaluation shows that Assise improves write latency by up to 22x, throughput by up to 56x, and fail-over time by up to 103x, and scales up to 6x better than its counterparts. Our follow-on LineFS paper [SOSP 2021] (which won a Best Paper Award) offloads CPU-intensive tasks to a SmartNIC (BlueField-1 in our case), yielding roughly 80% performance improvements across the board. Our RedN [NSDI 2022] paper shows a surprising result: Remote Direct Memory Access (RDMA), as implemented in widely deployed RDMA Network Interface Cards, is Turing complete. We leverage this finding to reduce the tail latency of services running on busy servers by 35x!
Progress beyond the state of the art
In our EuroSys 2019 paper, we designed and implemented CacheDirector, a network I/O solution that implements slice-aware memory management by carefully mapping the first 64 bytes of a packet (containing the packet's header) to the LLC slice that is closest to the associated processing core. In the follow-on work published in our USENIX ATC 2020 paper, we extensively studied the characteristics of DDIO in different scenarios, identified its shortcomings, and demonstrated the necessity and benefits of bypassing the cache while receiving packets at 200 Gbps.
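The following is a conceptual sketch of slice-aware buffer placement, not CacheDirector's actual code. Intel does not document the hash that maps an address to an LLC slice, so slice_of() below is a stand-in for a reverse-engineered mapping, and the core-to-slice proximity rule is an assumption. The idea: pre-sort header buffers into per-slice pools and allocate each packet's 64-byte header area from the pool whose addresses map to the slice closest to the core that will process it.

```cpp
// Hedged sketch: per-slice free lists of 64 B packet-header buffers.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kSlices = 8;  // assumption: 8 LLC slices on this CPU

// Placeholder for the undocumented address-to-slice hash; real systems use
// a reverse-engineered hash of the physical address. NOT the real mapping.
inline int slice_of(uintptr_t addr) { return static_cast<int>((addr >> 6) % kSlices); }

struct SlicePools {
    std::vector<void*> free_list[kSlices];

    // Pre-sort a buffer arena into per-slice pools, one 64 B chunk at a time.
    void populate(char* arena, size_t bytes) {
        for (size_t off = 0; off + 64 <= bytes; off += 64) {
            uintptr_t addr = reinterpret_cast<uintptr_t>(arena + off);
            free_list[slice_of(addr)].push_back(arena + off);
        }
    }

    // Hand out a header buffer residing in the slice nearest to `core`, so
    // the DMA'd header lands in the closest slice before the core reads it.
    void* alloc_header_for_core(int core) {
        int slice = core % kSlices;  // assumption: core i is nearest to slice i
        auto& fl = free_list[slice];
        if (fl.empty()) return nullptr;
        void* buf = fl.back();
        fl.pop_back();
        return buf;
    }
};
```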
In our work on intra-server load balancing, RSS++ [CoNEXT 2019], we proposed a new load-balancing technique that dynamically (several times per second) modifies the receive-side scaling (RSS) indirection table to spread the load across the CPU cores in a more optimal way. In our follow-on work on Cheetah [NSDI 2020], we designed a datacenter load balancer (LB) that guarantees per-connection consistency for any realizable load-balancing mechanism. Cheetah carefully encodes the connection-to-server mapping into packet headers so as to guarantee levels of resilience that are no worse (and in some cases even stronger) than those of existing stateless and stateful LBs, respectively.
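A minimal sketch of the RSS++-style rebalancing step follows; it is illustrative only. It assumes a 128-entry indirection table and per-bucket packet counters maintained elsewhere, and it greedily moves one hot bucket per round, whereas the real system solves the assignment more carefully and also migrates the associated flow state.

```cpp
// Hedged sketch: periodically re-point RSS buckets from hot to cool cores.
#include <algorithm>
#include <array>
#include <cstdint>

constexpr int kBuckets = 128;  // assumption: table size
constexpr int kCores   = 16;   // assumption: core count

// table[b] = core currently serving RSS bucket b; load[b] = packets seen.
void rebalance(std::array<int, kBuckets>& table,
               const std::array<uint64_t, kBuckets>& load) {
    // Aggregate per-core load from per-bucket counters.
    std::array<uint64_t, kCores> core_load{};
    for (int b = 0; b < kBuckets; ++b) core_load[table[b]] += load[b];

    // Pick the hottest and coolest cores.
    int hot  = std::max_element(core_load.begin(), core_load.end()) - core_load.begin();
    int cool = std::min_element(core_load.begin(), core_load.end()) - core_load.begin();

    // Move the hottest core's most loaded bucket to the coolest core; the
    // updated table would then be written to the NIC (e.g., via ethtool -X).
    int victim = -1;
    uint64_t best = 0;
    for (int b = 0; b < kBuckets; ++b)
        if (table[b] == hot && load[b] > best) { best = load[b]; victim = b; }
    if (victim >= 0) table[victim] = cool;
}
```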
Our PacketMill paper [ASPLOS 2021] introduces a new model to efficiently manage packet metadata and employs code-optimization techniques to better utilize commodity hardware.
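The following is a simplified illustration of the metadata idea, not PacketMill's actual API: rather than converting driver descriptors into a heavyweight, general-purpose metadata object, the receive path fills a lean, application-specific structure directly, touching only the fields the service chain actually uses. All names here are hypothetical.

```cpp
// Hedged sketch: lean, chain-specific packet metadata filled in place.
#include <cstdint>

// One small struct containing only what this particular chain needs,
// instead of a generic many-field descriptor converted after the fact.
struct LeanMeta {
    uint8_t* data;      // pointer to the packet bytes
    uint16_t length;    // frame length
    uint16_t vlan_tci;  // the only offload field this chain reads
};

// Hypothetical driver hook: the receive path writes the application's own
// structure directly, avoiding a copy through an intermediate format.
inline void fill_meta(LeanMeta& m, uint8_t* frame, uint16_t len, uint16_t vlan) {
    m.data = frame;
    m.length = len;
    m.vlan_tci = vlan;
}
```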
In FAJITA [CoNEXT 2024], we (i) propose an optimized processing pipeline for stateful network functions that minimizes memory accesses and overcomes the overheads of accessing shared data structures while ensuring efficient batch processing at every stage of the pipeline, and (ii) provide a performant architecture for deploying service chains that contain stateful network functions with different state granularities.
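To illustrate the batch-processing idea in a hedged form (the names below are placeholders, not FAJITA's API): for each batch, the pipeline first resolves and prefetches the per-flow state of every packet, then runs the stateful work over the batch, so the state lookups overlap instead of each packet stalling on a cache miss.

```cpp
// Hedged sketch: two-stage batch processing with state prefetching.
#include <cstdint>
#include <vector>

struct Pkt { uint64_t flow; };          // placeholder packet type
struct FlowState { uint64_t counter; }; // placeholder per-flow state

// Toy stand-in for a real flow-state table lookup.
FlowState* lookup(uint64_t flow) {
    static FlowState table[1024];
    return &table[flow % 1024];
}

void process_batch(std::vector<Pkt>& batch) {
    std::vector<FlowState*> state(batch.size());

    // Stage 1: resolve and prefetch all state for the batch.
    for (size_t i = 0; i < batch.size(); ++i) {
        state[i] = lookup(batch[i].flow);
        __builtin_prefetch(state[i]);   // GCC/Clang prefetch builtin
    }

    // Stage 2: the stateful work now hits warm cache lines.
    for (size_t i = 0; i < batch.size(); ++i)
        state[i]->counter++;
}
```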
In the context of our geo-distributed load-balancing system Kurma [SoCC 2018], we decouple request completion time into (i) service time, (ii) base network propagation delay, and (iii) residual WAN latency caused by network congestion. At run time, Kurma tracks changes in each component independently and thus accurately estimates the rate of SLO violations among geo-distributed datacenters from the rate of incoming requests.
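A back-of-the-envelope sketch of this decomposition follows (illustrative only, not Kurma's code): the residual WAN latency is what remains of a measured completion time after subtracting the service time and the base propagation delay, and each component can then be tracked on its own.

```cpp
// Hedged sketch: decomposing one request-completion sample.
struct Sample {
    double response_ms;  // end-to-end completion time
    double service_ms;   // time spent at the storage backend
    double base_rtt_ms;  // propagation delay for the chosen datacenter pair
};

// residual = response - service - base propagation; clamp measurement noise,
// since congestion can only add latency on top of the other two components.
inline double residual_wan_ms(const Sample& s) {
    double r = s.response_ms - s.service_ms - s.base_rtt_ms;
    return r > 0.0 ? r : 0.0;
}
```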
Assise [OSDI 2020] is a blueprint for distributed file systems built on the next generation of durable storage. We designed CC-NVM, the first persistent and available distributed cache-coherence layer. In the LineFS work [SOSP 2021], we decompose DFS operations into execution stages that can be offloaded to a parallel data-path execution pipeline on the SmartNIC. In our RedN paper [NSDI 2022], we present a principled, practical approach to implementing complex RDMA offloads without requiring any hardware modifications. We then use self-modifying RDMA chains to lift the existing RDMA verbs interface to a Turing-complete set of programming abstractions.
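Below is a heavily simplified sketch of chaining RDMA work requests with libibverbs (real API, toy usage; buffer registration, connection setup, and error handling are omitted). RedN goes further than plain chaining: it posts WRITEs whose payloads patch later, already-posted work-queue entries, and that self-modification step is only described in a comment here, not implemented.

```cpp
// Hedged sketch: a two-element RDMA work-request chain via ibv_post_send.
#include <infiniband/verbs.h>
#include <cstdint>

void post_chain(ibv_qp* qp, ibv_mr* mr, char* buf,
                uint64_t raddr1, uint32_t rkey1,
                uint64_t raddr2, uint32_t rkey2) {
    ibv_sge sge1 = {};
    sge1.addr   = reinterpret_cast<uintptr_t>(buf);
    sge1.length = 64;
    sge1.lkey   = mr->lkey;
    ibv_sge sge2 = sge1;

    ibv_send_wr wr2 = {};
    wr2.opcode              = IBV_WR_RDMA_WRITE;
    wr2.sg_list             = &sge2;
    wr2.num_sge             = 1;
    wr2.send_flags          = IBV_SEND_SIGNALED;
    wr2.wr.rdma.remote_addr = raddr2;
    wr2.wr.rdma.rkey        = rkey2;

    ibv_send_wr wr1 = {};
    wr1.opcode              = IBV_WR_RDMA_WRITE;
    wr1.sg_list             = &sge1;
    wr1.num_sge             = 1;
    wr1.next                = &wr2;    // the NIC executes wr1, then wr2
    wr1.wr.rdma.remote_addr = raddr1;  // in RedN, a WRITE like this can target the
    wr1.wr.rdma.rkey        = rkey1;   // memory holding a later WQE, rewriting it
                                       // in flight (the self-modifying step)
    ibv_send_wr* bad = nullptr;
    ibv_post_send(qp, &wr1, &bad);     // error handling omitted for brevity
}
```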
Submitted Abstract
The cloud computing infrastructure that logically centralizes data storage and computation for many different actors is a prime example of a key societal system. A number of time-critical applications deployed in the cloud infrastructure have to provide high reliability and throughput, along with guaranteed low latency for delivering data. This low-latency guarantee is sorely lacking today, with the so-called tail latency of the slowest responses in popular cloud services being several orders of magnitude longer than the median response times. Unfortunately, simply using a network with ample bandwidth does not guarantee low latency because of congestion at the intra- and inter-datacenter levels and server overloads. All of these problems currently render the existing cloud infrastructures unsuitable for time-critical societal applications. The reasons for unpredictable delays across the Internet and within the cloud infrastructure are numerous, but two key culprits are: 1) slow memory subsystems that limit server effectiveness, and 2) excess buffering in the Internet that further limits the correlation of data requests. The aim of this project is to dramatically change the way data flows across the Internet, such that the data is more highly correlated when it is to be processed at the servers. As a result, the logical units of data will be processed at almost the maximum processing speed of the cloud servers. By doing so, we will achieve an ultra-low-latency Internet. This project will produce the tools and knowledge that will be key to dramatically reducing the latency of key societal services; these include cloud services used by a large number of users on a daily basis.