Skip to content

Our ASPLOS ’21 Paper: “PacketMill: Toward Per-Core 100-Gbps Networking”

ASPLOS ’21 will feature Alireza’s presentation of our paper titled “PacketMill: Toward Per-Core 100-Gbps Networking”. This is joint work with Alireza Farshin, Tom Barbette, Amir Roozbeh, Gerald Q. Maguire Jr., and Dejan Kostić.

The full abstract (with the video and more resources below):

We present PacketMill , a system for optimizing software packet processing, which (i) introduces a new model to effjciently manage packet metadata and (ii) employs code-optimization techniques to better utilize commodity hardware. PacketMill grinds the whole packet processing stack, from the high-level network function confjguration fjle to the low-level userspace network (specifjcally DPDK) drivers, to mitigate ineffjciencies and produce a customized binary for a given network function. Our evaluation results show that PacketMill increases throughput (up to 36.4Gbps – 70%) & reduces latency (up to 101µs – 28%) and enables nontrivial packet processing (e.g., router) at ≈100Gbps , when new packets arrive > 10 × faster than main memory access times, while using only one processing core

PacketMill Webpage: https://packetmill.io/

PacketMill Paper: https://packetmill.io/docs/packetmill-asplos21.pdf
PacketMill source code: https://github.com/aliireza/packetmill
PacketMill Slides with English transcripts: https://people.kth.se/~farshin/documents/packetmill-asplos21-slides.pdf

Our OSDI 2020 Paper “Assise: Performance and Availability via Client-local NVM in a Distributed File System”

At USENIX OSDI 2020, Waleed presented our paper titled “Assise: Performance and Availability via Client-local NVM in a Distributed File System”. The slides and video are available at the USENIX site. Alternatively, the PDF is available here, while video is available below:

This is joint work with researchers spread all over our planet: Thomas E. Anderson (University of Washington), Marco Canini (KAUST) Jongyul Kim (KAIST), Dejan Kostić (KTH Royal Institute of Technology), Youngjin Kwon (KAIST), Simon Peter (The University of Texas at Austin), Waleed Reda (KTH Royal Institute of Technology and Université catholique de Louvain), Henry N. Schuh (University of Washington), and Emmett Witchel (The University of Texas at Austin)

The full abstract is as follows:

The adoption of low latency persistent memory modules (PMMs) upends the long-established model of remote storage for distributed file systems. Instead, by colocating computation with PMM storage, we can provide applications with much higher IO performance, sub-second application failover, and strong consistency. To demonstrate this, we built the Assise distributed file system, based on a persistent, replicated coherence protocol that manages client-local PMM as a linearizable and crash-recoverable cache between applications and slower (and possibly remote) storage. Assise maximizes locality for all file IO by carrying out IO on process-local, socket-local, and client-local PMM whenever possible. Assise minimizes coherence overhead by maintaining consistency at IO operation granularity, rather than at fixed block sizes.

We compare Assise to Ceph/BlueStore, NFS, and Octopus on a cluster with Intel Optane DC PMMs and SSDs for common cloud applications and benchmarks, such as LevelDB, Postfix, and FileBench. We find that Assise improves write latency up to 22x, throughput up to 56x, fail-over time up to 103x, and scales up to 6x better than its counterparts, while providing stronger consistency semantics.

Our Cheetah paper at NSDI 2020: “A High-Speed Load-Balancer Design with Guaranteed Per-Connection-Consistency”

Tom will present our Cheetah paper at NSDI 2020 this February in Santa Clara, CA. This work on load-balancing across multiple servers comes right after our RSS++ paper on intra-datacenter load balancing. The Cheetah paper is available here. This is joint work with Tom Barbette, Chen Tang, Haoran Yao, Dejan Kostić, Gerald Q. Maguire Jr., Panagiotis Papadimitratos, and Marco Chiesa.

The abstract is below:

Large service providers use load balancers to dispatch millions of incoming connections per second towards thousands of servers. There are two basic yet critical requirements for a load balancer: uniform load distribution of the incoming connections across the servers and per-connection-consistency (PCC), i.e., the ability to map packets belonging to the same connection to the same server even in the presence of changes in the number of active servers and load balancers. Yet, meeting both these requirements at the same time has been an elusive goal. Today’s load balancers minimize PCC violations at the price of non-uniform load distribution.

This paper presents Cheetah, a load balancer that supports uniform load distribution and PCC while being scalable, memory efficient, resilient to clogging attacks, and fast at processing packets. The Cheetah LB design guarantees PCC for any realizable server selection load balancing mechanism and can be deployed in both a stateless and stateful manner, depending on the operational needs. We implemented Cheetah on both a software and a Tofino-based hardware switch. Our evaluation shows that a stateless version of Cheetah guarantees PCC, has negligible packet processing overheads, and can support load balancing mechanisms that reduce the flow completion time by a factor of 2-x.

Marco’s presentation at RIPE about unpredictable Internet clouds

At the RIPE meeting in October 2019, our colleague Marco Chiesa presented some of our findings from the study on the effects of recent traffic engineering trends in public Internet. The details are in the RIPE blog post. An excerpt of our findings is shown below.

TCP RTT latency measured between two Amazon VMs deployed in Oregon and Virginia. The black dotted vertical lines highlight moments when RTT latency changed and indicate potential TE activity (such as path changes).

Our USENIX ATC paper presentation “Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks”

At USENIX ATC 2020, Alireza presented our paper titled “Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks”.  Full materials (video, slides, PDF) are available at the USENIX site. The paper abstract is below. This is joint work with Alireza Farshin, Amir Roozbeh, Gerald Q. Maguire Jr., and Dejan Kostić.

Memory access is the major bottleneck in realizing multi-hundred-gigabit networks with commodity hardware, hence it is essential to make good use of cache memory that is a faster, but smaller memory closer to the processor. Our goal is to study the impact of cache management on the performance of I/O intensive applications. Specifically, this paper looks at one of the bottlenecks in packet processing, i.e., direct cache access (DCA). We systematically studied the current implementation of DCA in Intel ® processors, particularly Data Direct I/O technology (DDIO), which directly transfers data between I/O devices and the processor’s cache. Our empirical study enables system designers/developers to optimize DDIO-enabled systems for I/O intensive applications. We demonstrate that optimizing DDIO could reduce the latency of I/O intensive network functions running at 100Gbps by up to ~30%. Moreover, we show that DDIO causes a 30% increase in tail latencies when processing packets at 200Gbps , hence it is crucial to selectively inject data into the cache or to explicitly bypass it.