
Our USENIX ATC paper presentation “Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks”

At USENIX ATC 2020, Alireza presented our paper titled “Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks”. Full materials (video, slides, PDF) are available at the USENIX site. The paper’s abstract is below. This is joint work with Alireza Farshin, Amir Roozbeh, Gerald Q. Maguire Jr., and Dejan Kostić.

Memory access is the major bottleneck in realizing multi-hundred-gigabit networks with commodity hardware, hence it is essential to make good use of cache memory, a faster but smaller memory closer to the processor. Our goal is to study the impact of cache management on the performance of I/O intensive applications. Specifically, this paper looks at one of the bottlenecks in packet processing, i.e., direct cache access (DCA). We systematically studied the current implementation of DCA in Intel® processors, particularly Data Direct I/O technology (DDIO), which directly transfers data between I/O devices and the processor’s cache. Our empirical study enables system designers/developers to optimize DDIO-enabled systems for I/O intensive applications. We demonstrate that optimizing DDIO could reduce the latency of I/O intensive network functions running at 100 Gbps by up to ~30%. Moreover, we show that DDIO causes a 30% increase in tail latencies when processing packets at 200 Gbps, hence it is crucial to selectively inject data into the cache or to explicitly bypass it.
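Although the abstract does not spell out the mechanism, DDIO’s footprint in the LLC is governed by a model-specific register that selects which cache ways inbound I/O writes may fill. Below is a minimal sketch, not the paper’s artifact, of how one can inspect such a knob on Linux; the register address 0xC8B (often referred to as IIO LLC WAYS) is an assumption that holds only on some Intel Xeon generations, and the sketch requires the msr kernel module (modprobe msr) plus root privileges.

/* Minimal sketch: read the IIO LLC WAYS register through Linux's msr
 * driver. The MSR address (0xC8B) and its layout are model-specific
 * assumptions; consult your CPU's documentation before relying on it. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IIO_LLC_WAYS 0xC8B /* assumed MSR address; varies by CPU model */

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/cpu/0/msr (is the msr module loaded?)");
        return 1;
    }

    uint64_t val;
    /* Each set bit marks one LLC way that DDIO is allowed to fill. */
    if (pread(fd, &val, sizeof(val), IIO_LLC_WAYS) != sizeof(val)) {
        perror("pread");
        close(fd);
        return 1;
    }
    printf("IIO LLC WAYS bitmask: 0x%" PRIx64 "\n", val);
    close(fd);
    return 0;
}

Writing a different bitmask back (with pwrite on the same offset) changes how many LLC ways DDIO may occupy, which is the kind of tuning explored in this space.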

RSS++ Presentation at CoNEXT 2019 in Orlando, Florida

Tom presented our RSS++ paper at CoNEXT 2019, held in December 2019 in Orlando, Florida. Here is our video of the talk (also available directly at this YouTube link). More details about the paper are in our previous post.

Our CoNEXT 2019 paper “RSS++: load and state-aware receive side scaling”

While the current literature typically focuses on load-balancing among multiple servers, in our upcoming CoNEXT 2019 paper, we demonstrate the importance of load-balancing within a single machine (potentially with hundreds of CPU cores). In this context, we propose a new load-balancing technique (RSS++) that dynamically modifies the receive side scaling (RSS) indirection table to spread the load across the CPU cores more evenly. RSS++ incurs up to 14x lower 95th percentile tail latency and orders of magnitude fewer packet drops compared to RSS under high CPU utilization. RSS++ allows higher CPU utilization and dynamic scaling of the number of allocated CPU cores to accommodate the input load, while avoiding the typical 25% over-provisioning. RSS++ has been implemented for both (i) DPDK and (ii) the Linux kernel. Additionally, we implement a new state migration technique, which facilitates sharding and reduces contention between CPU cores accessing per-flow data. RSS++ keeps flow state in groups that can be migrated at once, leading to 20% higher efficiency than a state-of-the-art shared flow table.

This is joint work with Tom Barbette, Georgios P. Katsikas, Gerald Q. Maguire Jr., and Dejan Kostić.
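The mechanism RSS++ builds on is the NIC’s RSS indirection table (RETA), which maps hash buckets to receive queues. The sketch below, which is not the released RSS++ implementation, shows how such a table can be rewritten at run time with DPDK; it assumes an already-initialized port and DPDK 21.11+ identifiers (rte_eth_dev_rss_reta_update, RTE_ETH_RETA_GROUP_SIZE; older releases spell the constant RTE_RETA_GROUP_SIZE).

#include <string.h>
#include <rte_ethdev.h>

/* Rewrite the whole indirection table so bucket i is served by
 * queue_of_bucket[i]. A balancer in the spirit of RSS++ would recompute
 * this array periodically from per-core load measurements. reta_size
 * must match dev_info.reta_size and be a multiple of 64. */
static int remap_reta(uint16_t port_id, const uint16_t *queue_of_bucket,
                      uint16_t reta_size)
{
    struct rte_eth_rss_reta_entry64 reta_conf[reta_size / RTE_ETH_RETA_GROUP_SIZE];

    memset(reta_conf, 0, sizeof(reta_conf));
    for (uint16_t i = 0; i < reta_size; i++) {
        /* Mark every entry for update and point the bucket at the queue
         * (and hence the core) chosen by the load-balancing policy. */
        reta_conf[i / RTE_ETH_RETA_GROUP_SIZE].mask |=
            1ULL << (i % RTE_ETH_RETA_GROUP_SIZE);
        reta_conf[i / RTE_ETH_RETA_GROUP_SIZE].reta[i % RTE_ETH_RETA_GROUP_SIZE] =
            queue_of_bucket[i];
    }
    return rte_eth_dev_rss_reta_update(port_id, reta_conf, reta_size);
}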

Our EuroSys 2019 Paper “Make the Most out of Last Level Cache in Intel Processors”

In our upcoming EuroSys 2019 paper, we exploit the characteristics of non-uniform cache architecture (NUCA) in recent Intel processors to introduce a new memory management scheme, i.e., slice-aware memory management. We believe that we are the first to: (i) take a step toward using the current hardware more efficiently in this manner, and (ii) advocate taking advantage of NUCA characteristics in the LLC, allowing networking applications to benefit from them. In addition, we propose CacheDirector, a network I/O solution which extends Data Direct I/O (DDIO) and places the packet’s header in the slice of the LLC that is closest to the relevant processing core. Our results show that CacheDirector can reduce tail latencies in latency-critical Network Function Virtualization (NFV) service chains by 21.5%. Furthermore, our work demonstrates that optimizing computer systems to exploit even nanosecond-scale improvements can have a substantial impact on the performance of networking applications.
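To make the slice-aware idea concrete, here is a minimal sketch, not the CacheDirector implementation: it scans cache-line-aligned offsets inside a buffer and picks one that maps to the desired LLC slice. The real slice mapping is an undocumented hash of the physical address that must be reverse-engineered per CPU model, so slice_of() below is a hypothetical placeholder, and the slice count is an assumption.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_SLICES 8   /* assumption: one LLC slice per core */
#define CACHELINE 64

/* Hypothetical stand-in for the reverse-engineered slice hash; the real
 * function operates on physical addresses, not the virtual ones used
 * here, so this is illustrative only. */
static unsigned slice_of(uintptr_t addr)
{
    return (unsigned)((addr >> 6) % NUM_SLICES); /* placeholder only */
}

/* Scan cache-line-aligned offsets inside the buffer until one maps to
 * the slice closest to the target core; put the packet header there. */
static void *slice_aware_header(void *buf, size_t len, unsigned target_slice)
{
    for (size_t off = 0; off + CACHELINE <= len; off += CACHELINE) {
        uintptr_t addr = (uintptr_t)buf + off;
        if (slice_of(addr) == target_slice)
            return (void *)addr;
    }
    return buf; /* fall back to the start of the buffer */
}

int main(void)
{
    void *buf = aligned_alloc(CACHELINE, 4096);
    if (!buf)
        return 1;
    void *hdr = slice_aware_header(buf, 4096, 3);
    printf("header placed at offset %zu\n",
           (size_t)((char *)hdr - (char *)buf));
    free(buf);
    return 0;
}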

Mission

This blog highlights the contributions made by a group of faculty, researchers, and doctoral students working on various aspects of Networked Systems. For open positions, please consult our Projects page.