On March 26, 2019 in Dresden, Alireza Farshin presented our EuroSys 2019 paper on unlocking a performance-enhancing feature that has existed in Intel processors for almost a decade. The video of the talk and the slides are now available.
CPUs typically have cache memory, which speeds up access to the most commonly used data and thus largely masks the long latency of main memory (DRAM). Over the past decade, core counts (the basic building blocks of a CPU that can operate independently) have steadily increased, and for the last nine years the largest, last-level cache of Intel processors has been split into "slices," each of which is physically faster to access from the core to which it is attached than from other cores.

Our work first showed that reading data from the nearest slice is 20% faster on Intel's Haswell CPUs and 40% faster on the newer Skylake architecture. We then showed that a large fraction of these potential gains can be realized when the application carefully places its working set (its most commonly used data) in the slices nearest to the cores that will process the data. To showcase the benefits of our approach on real applications, we built a transparent software layer called CacheDirector, which reduced the tail latency (processing time at the 99th percentile) by 21% for packets going through a service chain running at 100 Gbps. Handling traffic at such high speeds is vital for meeting increasing network demands.
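To make the placement idea concrete, here is a minimal C sketch, not the paper's actual CacheDirector implementation. It picks an offset inside a buffer so that hot data (for example a packet header) would land in the core's local LLC slice. The slice_of() hash, the slice count, and the local slice index are all placeholder assumptions: the real address-to-slice mapping is Intel's undocumented complex-addressing function over physical addresses and must be measured or reverse-engineered per CPU model.

```c
/*
 * Sketch of slice-aware data placement. slice_of() is a toy stand-in
 * for Intel's undocumented hash that maps a physical address to an
 * LLC slice; real code would also need the physical (not virtual)
 * address, e.g. via hugepages.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_SLICES 8          /* assumption: 8 LLC slices on this CPU */
#define CACHE_LINE 64

/* Placeholder hash: which slice does this address map to? */
static unsigned slice_of(uintptr_t addr)
{
    return (unsigned)((addr >> 6) % NUM_SLICES);  /* cache-line granular */
}

/* Return the first cache-line-aligned offset within the buffer that
 * maps to the slice closest to the running core, or -1 if none. */
static long find_local_offset(uintptr_t buf, size_t len, unsigned local_slice)
{
    for (size_t off = 0; off + CACHE_LINE <= len; off += CACHE_LINE)
        if (slice_of(buf + off) == local_slice)
            return (long)off;
    return -1;
}

int main(void)
{
    static char buf[4096] __attribute__((aligned(CACHE_LINE)));
    unsigned local_slice = 3;  /* assumption: slice attached to our core */

    long off = find_local_offset((uintptr_t)buf, sizeof(buf), local_slice);
    printf("place hot data at offset %ld to hit the local slice\n", off);
    return 0;
}
```

In a real system this offset selection would happen where buffers are handed out to the network stack, which is why the application itself can remain unchanged, in line with CacheDirector being a transparent software layer.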
Our work offers a clear benefit: it increases performance "for free," or reduces energy consumption for the same amount of work, even in finely tuned systems. Many applications can benefit from our contribution with relatively small changes, and emerging latency-sensitive applications will also benefit from a higher probability of receiving responses with predictable latency.