ULTRA ERC

Presentation at CoNEXT ’24: “FAJITA: Stateful Packet Processing at 100 Million pps”

Dejan Kostic, 2024-12-11

At CoNEXT ’24, Mariano presented our FAJITA paper. This work shows that a commodity server running a chain of stateful network functions can process more than 170 M packets per second (the equivalent of 1.4 Tbps if payloads are stored in a disaggregated fashion, as in our earlier Ribosome work [NSDI ’23])! Another interesting and perhaps unexpected result is that, unless the number of so-called “elephant flows” is very small, spreading incoming packets among the cores using plain Receive Side Scaling (RSS) outperforms existing approaches that perform fine-grained flow accounting and load balancing. This happens because the possible gains are dwarfed by slowdowns in accessing memory.
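To make the RSS point concrete, here is a minimal sketch of hash-based packet spreading. It is not FAJITA's code: real NICs typically compute a Toeplitz hash over the flow tuple and use an indirection table, while this sketch substitutes CRC32 and a plain modulo. The function names and the traffic mix are assumptions chosen for illustration only.

```python
import zlib
from collections import Counter

def rss_core(flow, num_cores):
    """Map a flow 5-tuple to a core index.

    CRC32 is a stand-in (assumption) for the hash a NIC would compute.
    """
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_cores

def spread(packets, num_cores):
    """Count packets per core, given one flow 5-tuple per packet."""
    load = Counter({core: 0 for core in range(num_cores)})
    for flow in packets:
        load[rss_core(flow, num_cores)] += 1
    return load

# Hypothetical traffic: 1000 "mice" flows of 10 packets each,
# plus a single "elephant" flow of 5000 packets.
mice = [("10.0.0.1", "10.0.1.1", 1024 + i, 80, "tcp") for i in range(1000)]
elephant = ("10.0.0.2", "10.0.1.1", 4242, 443, "tcp")
packets = [flow for flow in mice for _ in range(10)] + [elephant] * 5000

load = spread(packets, num_cores=16)
for core in sorted(load):
    print(f"core {core:2d}: {load[core]} packets")
```

Every packet of a flow lands on the same core, so per-flow state needs no cross-core sharing, but a single elephant flow can overload one core. With many elephants the hash tends to even things out, which fits the observation above that plain RSS does well unless elephant flows are very few.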

This is joint work with Hamid Ghasemirahni, Alireza Farshin (now at Nvidia), Mariano Scazzariello (now at RISE), Gerald Q. Maguire Jr., Dejan Kostić, and Marco Chiesa.

Our recording of Mariano’s talk is below:


Download File: https://www.kth.se/blogs/ultra/files/2024/12/Fajita-small2.mp4?_=1


Data centers increasingly utilize commodity servers to deploy low-latency Network Functions (NFs). However, the emergence of multi-hundred-gigabit-per-second network interface cards (NICs) has drastically increased the performance expected from commodity servers. Additionally, recently introduced systems that store packet payloads in temporary off-CPU locations (e.g., programmable switches, NICs, and RDMA servers) further increase the load on NF servers, making packet processing even more challenging.

This paper demonstrates existing bottlenecks and challenges of state-of-the-art stateful packet processing frameworks and proposes a system, called FAJITA, to tackle these challenges and accelerate stateful packet processing on commodity hardware. FAJITA proposes an optimized processing pipeline for stateful network functions to minimize memory accesses and overcome the overheads of accessing shared data structures while ensuring efficient batch processing at every stage of the pipeline. Furthermore, FAJITA provides a performant architecture to deploy high-performance network function service chains containing stateful elements with different state granularities. FAJITA improves the throughput and latency of high-speed stateful network functions by ~2.43× compared to the most performant state-of-the-art solutions, enabling commodity hardware to process up to ~178 Million 64-B packets per second (pps) using 16 cores.
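A quick back-of-the-envelope calculation helps put the ~178 Mpps figure in perspective. The sketch below assumes standard Ethernet framing, where each frame occupies an extra 20 bytes on the wire (preamble, start-of-frame delimiter, and inter-frame gap), and assumes the load is spread evenly over the 16 cores. Neither assumption comes from the paper's abstract.

```python
PPS = 178e6               # packets per second, from the abstract
FRAME_BYTES = 64          # packet size, from the abstract
WIRE_OVERHEAD_BYTES = 20  # assumption: 7 B preamble + 1 B SFD + 12 B inter-frame gap
CORES = 16                # from the abstract

def wire_gbps(pps, frame_bytes, overhead=WIRE_OVERHEAD_BYTES):
    """Link bandwidth in Gbps consumed by `pps` frames of `frame_bytes` each."""
    return pps * (frame_bytes + overhead) * 8 / 1e9

def ns_per_packet_per_core(pps, cores):
    """Average time budget per packet on one core, assuming an even split."""
    return cores / pps * 1e9

print(f"wire rate: {wire_gbps(PPS, FRAME_BYTES):.1f} Gbps")
print(f"per-core budget: {ns_per_packet_per_core(PPS, CORES):.1f} ns per packet")
```

Under these assumptions the rate corresponds to roughly 120 Gbps of minimum-size frames and leaves each core about 90 ns per packet, which is on the order of a single DRAM access. That is why minimizing memory accesses matters so much at this scale.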

Hamid Ghasemirahni’s PhD Defense

Dejan Kostic, 2024-11-19, 2024-11-20

We are very happy to announce that Hamid Ghasemirahni successfully defended his PhD thesis (second and final one in the ERC ULTRA project) on November 18, 2024! Marco Chiesa has done a superb job as a co-advisor, and we are very grateful to Prof. Gerald Q. Maguire Jr. for his stellar insights (as usual). Gábor Rétvári was the opponent at the defense, while Paris Carbone served as the Chair. Hamid’s thesis is available online:

Realizing High-Performance Stateful Network Function Chains on Commodity Hardware: Improving Packet Processing Frameworks by Minimizing Memory Access Overheads

In short, this thesis contains the work on Reframer, which shows the surprising result that deliberately delaying packets can improve the performance of backend servers (e.g., those used for Network Function Virtualization) by up to about a factor of 2. It also includes FAJITA, which shows that a commodity server running a chain of stateful network functions can process more than 170 M packets per second (the equivalent of 1.4 Tbps if payloads are stored in a disaggregated fashion, as in our earlier Ribosome work [NSDI ’23])!

A few images from the defense and the celebration are below.

Hamid presenting during the defense (image taken by Dejan Kostic).

Paris congratulates Hamid on the successfully defended PhD thesis (image taken by Dejan Kostic).

Dejan hands the traditional gift to Hamid (image taken by Voravit Tanyingyong).

Group image with colleagues and Dejan (image taken by Voravit Tanyingyong).

Group image (image taken by Voravit Tanyingyong).

Massimo Girondi’s Licentiate Defense

Dejan Kostic, 2024-04-10, 2024-05-31

We are happy to announce that Massimo Girondi successfully defended his licentiate thesis (a licentiate is a degree at KTH halfway to a PhD)! Marco Chiesa has done an excellent job as a co-advisor and, as is customary, we are very grateful to Prof. Gerald Q. Maguire Jr. for his key insights. Giuseppe Siracusano was a superb opponent at the licentiate seminar, with Amir Payberah as the examiner. Massimo’s thesis (the second licentiate thesis of this project) is available online:

“Toward Highly-efficient GPU-centric Networking”

https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-344316

A few shots from the celebration are below.

Group shot of Networked Systems Laboratory members (Massimo is beneath the KTH logo). Image taken by Voravit Tanyingyong

Dejan highlighting Massimo’s work (image taken by Marco Spanghero)

Dejan hands the gift to Massimo a few weeks later in the hallway that Massimo chose for the shot. Definitely looks better than the opposite side we used in the past! (image taken by Voravit Tanyingyong)

Our recent IOTLB wall article

Dejan Kostic, 2023-05-17

Can networking applications achieve suitable performance with IOMMU at high rates? Our recent PeerJ CS article answers this question by characterizing the performance implications of IOMMU and its cache (IOTLB) on recent Intel Xeon Scalable and AMD EPYC processors at 200 Gbps. Our study shows that enabling IOMMU at high rates can result in a throughput drop of up to 20 percent due to excessive IOTLB misses. Moreover, we present potential mitigation techniques that recover the throughput lost to the “IOTLB wall” by using hugepage-backed buffers in the Linux kernel. This is joint work with Alireza Farshin (KTH), Luigi Rizzo (Google), Khaled Elmeleegy (Google), and Dejan Kostic (KTH). Follow the links for PDF and code.
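The intuition behind hugepage-backed buffers can be sketched with simple arithmetic: the IOTLB caches one translation per page, so larger pages mean far fewer distinct translations for the same amount of buffer memory. The 64 MiB buffer pool below is a hypothetical figure chosen for illustration, not a number from the article, and the sketch ignores any details of how a given IOMMU organizes its cache.

```python
KIB = 1024
MIB = 1024 ** 2

def pages_needed(buffer_bytes, page_bytes):
    """Number of pages (and thus distinct translations) covering the buffer."""
    return -(-buffer_bytes // page_bytes)  # ceiling division

buffer_bytes = 64 * MIB  # hypothetical packet-buffer pool size

pages_4k = pages_needed(buffer_bytes, 4 * KIB)
pages_2m = pages_needed(buffer_bytes, 2 * MIB)

print(f"4 KiB pages: {pages_4k} translations")
print(f"2 MiB pages: {pages_2m} translations")
print(f"reduction:   {pages_4k // pages_2m}x")
```

For this pool size, 2 MiB hugepages cut the number of translations by a factor of 512, so the working set is much more likely to fit in a small translation cache.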

RedN presentation at NSDI ’22

Dejan Kostic, 2023-05-17

At NSDI ’22, Waleed presented our RedN paper, which shows a surprising result, namely that Remote Direct Memory Access (RDMA), as implemented in widely deployed RDMA Network Interface Cards, is Turing complete. We leverage this finding to reduce the tail latency of services running on busy servers by 35x! The full abstract is below. This is joint work with Waleed Reda, Marco Canini (KAUST), Dejan Kostić, and Simon Peter (UW).

It is becoming increasingly popular for distributed systems to exploit offload to reduce load on the CPU. Remote Direct Memory Access (RDMA) offload, in particular, has become popular. However, RDMA still requires CPU intervention for complex offloads that go beyond simple remote memory access. As such, the offload potential is limited and RDMA-based systems usually have to work around such limitations.

We present RedN, a principled, practical approach to implementing complex RDMA offloads, without requiring any hardware modifications. Using self-modifying RDMA chains, we lift the existing RDMA verbs interface to a Turing complete set of programming abstractions. We explore what is possible in terms of offload complexity and performance with a commodity RDMA NIC. We show how to integrate these RDMA chains into applications, such as the Memcached key-value store, allowing us to offload complex tasks such as key lookups. RedN can reduce the latency of key-value get operations by up to 2.6× compared to state-of-the-art KV designs that use one-sided RDMA primitives (e.g., FaRM-KV), as well as traditional RPC-over-RDMA approaches. Moreover, compared to these baselines, RedN provides performance isolation and, in the presence of contention, can reduce latency by up to 35× while providing applications with failure resiliency to OS and process crashes.