We are happy to announce that on May 30, 2022, Waleed Reda successfully defended his PhD thesis at both KTH and UC Louvain! In the beginning, Marco Canini and Dejan Kostic co-advised Waleed equally; by defense time, Waleed’s advisors were Dejan Kostic and Marco Chiesa at KTH, and Peter van Roy at UC Louvain. Adam Morrison was a superb opponent at the defense. Waleed’s thesis (the first to come out of this ERC project) is available online:
Accelerating Distributed Storage in Heterogeneous Settings
Here’s the Zoom screenshot from this hybrid defense:
Our upcoming NSDI 2022 paper Packet Order Matters shows a surprising result: deliberately delaying packets can improve the performance of backend servers (e.g., those used for Network Function Virtualization) by up to about a factor of 2! This holds for both throughput and latency (including the time packets spend in our Reframer). We show three different scenarios in which Reframer can be deployed. Source code is available here.
Below is the presentation at NSDI 2022:
This is joint work with:
Hamid Ghasemirahni, Tom Barbette, Georgios P. Katsikas, Alireza Farshin, Amir Roozbeh, Massimo Girondi, Marco Chiesa, Gerald Q. Maguire Jr., and Dejan Kostić.
The full abstract is below:
Data centers increasingly deploy commodity servers with high-speed network interfaces to enable low-latency communication. However, achieving low latency at high data rates crucially depends on how the incoming traffic interacts with the system’s caches. When packets that need to be processed in the same way are consecutive, i.e., exhibit high temporal and spatial locality, caches deliver great benefits.
In this paper, we systematically study the impact of temporal and spatial traffic locality on the performance of commodity servers equipped with high-speed network interfaces. Our results show that (i) the performance of a variety of widely deployed applications degrades substantially with even the slightest lack of traffic locality, and (ii) a traffic trace from our organization reveals poor traffic locality, as networking protocols, drivers, and the underlying switching/routing fabric spread packets out in time (reducing locality). To address these issues, we built Reframer, a software solution that deliberately delays packets and reorders them to increase traffic locality. Despite introducing μs-scale delays of some packets, we show that Reframer increases the throughput of a network service chain by up to 84% and reduces the flow completion time of a web server by 11% while improving its throughput by 20%.
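To make the mechanism concrete, here is a minimal Python sketch of the core idea as we read it from the abstract: hold incoming packets for a short window, bucket them by flow, and release each flow’s packets back-to-back. The packet fields, class name, and timeout value are illustrative assumptions, not the actual Reframer implementation (see the source code linked above for that):

```python
# Hedged sketch of the Reframer idea: buffer packets briefly, group them by
# flow, and flush each flow's packets back-to-back so that downstream network
# functions see better temporal/spatial locality. Illustrative only.
from collections import defaultdict

FLUSH_TIMEOUT_US = 20  # hypothetical hold time; the paper reports us-scale delays

class ReframerSketch:
    def __init__(self):
        self.buckets = defaultdict(list)  # flow key -> buffered packets
        self.oldest_us = None             # arrival time of the oldest buffered packet

    def flow_key(self, pkt):
        # 5-tuple; 'pkt' is assumed to expose these fields
        return (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.proto)

    def enqueue(self, pkt, now_us):
        if self.oldest_us is None:
            self.oldest_us = now_us
        self.buckets[self.flow_key(pkt)].append(pkt)

    def maybe_flush(self, now_us):
        # Once the oldest packet has waited long enough, emit everything
        # bucket-by-bucket: packets of the same flow leave consecutively.
        if self.oldest_us is not None and now_us - self.oldest_us >= FLUSH_TIMEOUT_US:
            out = [p for bucket in self.buckets.values() for p in bucket]
            self.buckets.clear()
            self.oldest_us = None
            return out
        return []
```

The short, bounded delay is the whole trade-off: a few microseconds of extra queuing buys consecutive same-flow packets, which is what lets the downstream caches deliver the throughput and latency gains reported above.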
We are very happy to announce that our LineFS paper was among the three papers that won the Best Paper Award at SOSP 2021!
LineFS builds upon our previous work on Assise [OSDI ’20] by offloading CPU-intensive tasks to a SmartNIC (BlueField-1 in our case), improving performance by up to about 80% across the board.
Jongyul’s presentation is already available:
This is joint work with:
Jongyul Kim (KAIST), Insu Jang (University of Michigan), Waleed Reda (KTH Royal Institute of Technology / Université catholique de Louvain), Jaeseong Im (KAIST), Marco Canini (KAUST), Dejan Kostić (KTH Royal Institute of Technology), Youngjin Kwon (KAIST), Simon Peter (The University of Texas at Austin), and Emmett Witchel (The University of Texas at Austin / Katana Graph).
The full abstract is as follows:
In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly a burden to application performance. CPU and memory interference cause degraded and unstable application and storage performance, in particular for operation latency. Recent client-local DFSes for persistent memory (PM) accelerate this trend. DFS offload to SmartNICs is a promising solution to these problems, but it is challenging to fit the complex demands of a DFS onto simple SmartNIC processors located across PCIe.
We present LineFS, a SmartNIC-offloaded, high-performance DFS with support for client-local PM. To fully leverage the SmartNIC architecture, we decompose DFS operations into execution stages that can be offloaded to a parallel data-path execution pipeline on the SmartNIC. LineFS offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management to a SmartNIC.
We implement LineFS on the Mellanox BlueField SmartNIC and compare it to Assise, a state-of-the-art PM DFS. LineFS improves latency in LevelDB up to 80% and throughput in Filebench up to 79%, while providing extended DFS availability during host system failures.
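For intuition about this stage decomposition, below is a hedged Python sketch of a data-path pipeline in which each CPU-intensive stage runs as its own worker connected by queues, so that (for example) compression of one block overlaps with replication of another. The stage names and bodies are illustrative stand-ins, not LineFS code, which runs its stages on the SmartNIC’s processors:

```python
# Hedged sketch of pipeline-parallel stage execution: each stage is a worker
# thread; queues connect stages so different blocks are processed in different
# stages concurrently. Stage bodies are stubs, not LineFS's actual tasks.
import queue
import threading
import zlib

def make_stage(fn, inq, outq):
    def loop():
        while True:
            item = inq.get()
            if item is None:       # shutdown sentinel: pass it downstream
                outq.put(None)
                return
            outq.put(fn(item))
    threading.Thread(target=loop, daemon=True).start()

compress_q, replicate_q, publish_q, done_q = (queue.Queue() for _ in range(4))

make_stage(zlib.compress, compress_q, replicate_q)   # compression stage
make_stage(lambda blk: blk, replicate_q, publish_q)  # replication stage (stub)
make_stage(lambda blk: blk, publish_q, done_q)       # publication stage (stub)

for i in range(3):  # feed a few log blocks through the pipeline
    compress_q.put((b"log-block-%d " % i) * 100)
compress_q.put(None)

while (blk := done_q.get()) is not None:
    print("published", len(blk), "compressed bytes")
```

Decomposing operations this way is what lets the SmartNIC’s comparatively weak cores keep up: no single core executes a whole DFS operation, and the stages proceed in parallel.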
We are happy to announce that Hamid Ghasemirahni successfully defended his licentiate thesis (the licentiate is a KTH degree halfway to a PhD)! Marco Chiesa has done an excellent job as a co-advisor, and we are once again very grateful to Prof. Gerald Q. Maguire Jr. for his key insights. Prof. Al Davis was a superb opponent at the licentiate seminar. Hamid’s thesis (hopefully one of many to come in this project) is available online:
Packet Order Matters!: Improving Application Performance by Deliberately Delaying Packets
We couldn’t take the obligatory hallway shot, so we faked the gift giving over Zoom:
ASPLOS ’21 will feature Alireza’s presentation of our paper titled “PacketMill: Toward Per-Core 100-Gbps Networking”. This is joint work with Alireza Farshin, Tom Barbette, Amir Roozbeh, Gerald Q. Maguire Jr., and Dejan Kostić.
The full abstract (with the video and more resources below):
We present PacketMill, a system for optimizing software packet processing, which (i) introduces a new model to efficiently manage packet metadata and (ii) employs code-optimization techniques to better utilize commodity hardware. PacketMill grinds the whole packet processing stack, from the high-level network function configuration file to the low-level userspace network (specifically DPDK) drivers, to mitigate inefficiencies and produce a customized binary for a given network function. Our evaluation results show that PacketMill increases throughput (up to 36.4 Gbps – 70%) & reduces latency (up to 101 µs – 28%) and enables nontrivial packet processing (e.g., router) at ≈100 Gbps, when new packets arrive >10× faster than main memory access times, while using only one processing core.
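To illustrate the metadata-management idea as we understand it from the abstract, the hypothetical Python sketch below contrasts a baseline receive path that converts driver metadata through a generic container with a customized path in which the driver fills the framework’s own structure in place, skipping a conversion per packet. All names here are made up for illustration; PacketMill itself operates on DPDK’s C data structures:

```python
# Speculative sketch: generic metadata conversion vs. filling the packet
# processing framework's own metadata structure in place. Illustrative names.
from dataclasses import dataclass

@dataclass
class FrameworkPacket:   # the framework's native per-packet metadata
    data: bytes = b""
    length: int = 0

def rx_copying(raw, length):
    # Baseline: the driver fills a generic descriptor, and the framework then
    # copies the fields into its own structure (an extra pass per packet).
    generic = {"data": raw, "len": length}
    pkt = FrameworkPacket()
    pkt.data, pkt.length = generic["data"], generic["len"]
    return pkt

def rx_in_place(raw, length, pool):
    # Customized path: the driver writes directly into a framework-owned
    # buffer, so the intermediate conversion disappears entirely.
    pkt = pool.pop()
    pkt.data, pkt.length = raw, length
    return pkt

pool = [FrameworkPacket() for _ in range(4)]
print(rx_in_place(b"\x00" * 64, 64, pool))
```

Eliminating per-packet conversions like this matters precisely in the regime the abstract describes, where new packets arrive faster than main memory can be touched.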
PacketMill Webpage: https://packetmill.io/
PacketMill Paper: https://packetmill.io/docs/packetmill-asplos21.pdf
PacketMill source code: https://github.com/aliireza/packetmill
PacketMill Slides with English transcripts: https://people.kth.se/~farshin/documents/packetmill-asplos21-slides.pdf