All right, so good afternoon, everyone. My name is Don Moon, I'm from SK Hynix, and we have Daniel. My name's Daniel, I'm from Intel, along with my colleague Sounak. Yeah, they're both from Intel, and we're going to talk about CXL memory expansion in software caches. In particular, we worked together to improve Meta's Cachelib with SK Hynix CXL memory expanders and Intel's Data Streaming Accelerator. In this presentation, we're going to share some of the insights and lessons we learned through this collaboration.
All right, so the question we're trying to address together is how to integrate new CXL memory devices into existing key-value caching software, either to maximize performance or to reduce TCO. When it comes to CXL memory expansion, there are two key aspects you need to consider. The first is CXL usage: what are you going to use it for, bandwidth expansion or capacity expansion? The other aspect is the control layer: where are you going to control this memory tiering, at the application level or at the operating-system level? So four combinations are possible out of these two aspects. The focus of this presentation is, number one, bandwidth expansion through OS-level control, and number two, capacity expansion through application-level memory tiering. We're going to talk about both in detail.
But before that, let's recap SK Hynix's CXL 2.0 memory expansion solution real quick. It comprises two components: hardware and software. The CXL 2.0 memory expander comes in an E3.S form factor and currently supports a PCIe Gen 5 x8 interface. Its performance metrics are summarized in the table on the right. As you can see, its latency is around 250 nanoseconds, about 100 to 150 nanoseconds higher than DDR5 latency according to our lab measurements. Its bidirectional bandwidth is up to 27 gigabytes per second, which is comparable to one channel of DDR bandwidth. Speaking of the software development kit, the Heterogeneous Memory Software Development Kit, HMSDK in short, we recently open-sourced it and it's available online. It consists of tools and APIs that facilitate CXL memory expansion. We offer explicit APIs for developers to allocate specific data structures to either DRAM or CXL memory, and we also support OS-level control based on bandwidth-aware page interleaving, which I will talk about in detail later. So basically, like I briefly mentioned, CXL memory can be deployed in two ways: as bandwidth memory for bandwidth expansion, as capacity memory for capacity expansion, or both.
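To make the explicit-placement idea concrete, here is a minimal sketch using plain libnuma rather than HMSDK's own APIs (which we are not reproducing here); the node IDs and allocation sizes are assumptions for illustration only.

```cpp
// Illustrative only: pin latency-sensitive metadata to the DRAM node and large,
// colder payload buffers to the CXL-backed node. Node IDs are assumptions.
#include <numa.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) { std::fprintf(stderr, "no NUMA support\n"); return 1; }

    const int dram_node = 0;  // assumed DRAM-backed NUMA node
    const int cxl_node  = 2;  // assumed CXL-expander NUMA node

    // Hash index (hot, latency-sensitive) goes to DRAM...
    void* index = numa_alloc_onnode(64u << 20, dram_node);
    // ...while a large payload arena can live on the CXL expander.
    void* payload = numa_alloc_onnode(1u << 30, cxl_node);
    if (!index || !payload) return 1;

    std::memset(index, 0, 64u << 20);   // touch pages so they are actually faulted in
    std::memset(payload, 0, 1u << 30);

    numa_free(index, 64u << 20);
    numa_free(payload, 1u << 30);
    return 0;
}
```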
So let's talk about these concepts a little more in depth. Bandwidth memory expansion treats CXL memory the same as DRAM and basically leverages CXL lanes as additional DDR memory channels. OS-level interleaving across the DDR and CXL interfaces is required, and it has to be done smartly. In most cases, the objective is to gain higher performance through total system memory bandwidth expansion. Capacity memory expansion, on the other hand, explicitly treats CXL memory as a slower memory tier. And like I mentioned at the beginning, we can do that memory tiering either at the OS level or at the application level. In most cases, the objective of capacity memory expansion is to achieve performance comparable to a DRAM-only system through hot/cold data tiering.
All right, now a little bit about our target application software, key-value caches. On the right side, we have a typical deployment in a data center environment. We have a web service serving requests from users, and there's a back-end database from which the web service fetches data. Of course, this database is slow, so you can place key-value caches in the middle, between the web service and the database, to reduce response time and increase web service throughput. Among the many open-source key-value caches available, we chose Cachelib. Cachelib is an open-source key-value cache developed by Meta, offered as a C++ library. For detailed information, please refer to Cachelib's official website.
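As a quick illustration of that look-aside deployment, here is a toy sketch of the control flow; the Cache and Database types below are hypothetical stand-ins, not Cachelib's actual API.

```cpp
// Toy look-aside caching flow: check the cache first, fall back to the slow
// back end on a miss, then populate the cache so later requests are fast.
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

struct Cache {  // hypothetical stand-in for a key-value cache such as Cachelib
    std::unordered_map<std::string, std::string> store;
    std::optional<std::string> get(const std::string& key) {
        auto it = store.find(key);
        if (it == store.end()) return std::nullopt;
        return it->second;
    }
    void put(const std::string& key, const std::string& value) { store[key] = value; }
};

struct Database {  // hypothetical stand-in for the slow back-end database
    std::string query(const std::string& key) { return "value-for-" + key; }
};

std::string fetch(Cache& cache, Database& db, const std::string& key) {
    if (auto hit = cache.get(key)) return *hit;   // fast path: served from the cache
    std::string value = db.query(key);            // miss: go to the slow back end
    cache.put(key, value);                        // populate for future requests
    return value;
}

int main() {
    Cache cache; Database db;
    std::cout << fetch(cache, db, "user:42") << "\n";  // miss -> database
    std::cout << fetch(cache, db, "user:42") << "\n";  // hit  -> cache
}
```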
So far, we've covered all the background needed to understand what we're doing. Let's talk about the first approach we mentioned: CXL memory bandwidth expansion through OS-level control, or HMSDK in our case. To make a long story short, what we did is as follows. We ran Cachelib on the HMSDK kernel with the bandwidth-aware page-interleaving feature enabled. What happens underneath is that when a Cachelib instance is initiated and up and running, the HMSDK kernel first scans the underlying memory configuration. In this case, we assume there are four memory nodes: node 0 being the DRAM node, and nodes 2, 4, and 6 being CXL memory expanders. While it scans these underlying memory devices, it measures the bandwidth of the individual NUMA nodes to calculate the optimal page-interleaving ratios among them. When a memory request actually happens, the HMSDK kernel intercepts that request and allocates memory across these four NUMA nodes according to the pre-calculated interleaving ratios, at page granularity. It turns out this simple approach works pretty well, especially for bandwidth-hungry yet latency-insensitive workloads.
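To give a feel for the ratio calculation, here is a rough user-space sketch of turning measured per-node bandwidths into interleave weights; the bandwidth figures are made-up placeholders, and HMSDK's actual in-kernel logic differs in detail.

```cpp
// Bandwidth-aware interleave weights: normalize each node's measured bandwidth
// into a small integer weight, then hand out pages round-robin in proportion
// to those weights. Bandwidth numbers below are placeholders, not measurements.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    struct Node { int id; double gbps; };
    // node 0: assumed 8-DIMM DRAM node; nodes 2/4/6: assumed CXL expanders.
    std::vector<Node> nodes = {{0, 250.0}, {2, 27.0}, {4, 27.0}, {6, 27.0}};

    double min_bw = nodes[0].gbps;
    for (const auto& n : nodes) min_bw = std::min(min_bw, n.gbps);

    // Integer weights proportional to bandwidth, roughly 9:1:1:1 here.
    std::vector<int> weights;
    for (const auto& n : nodes)
        weights.push_back(static_cast<int>(n.gbps / min_bw + 0.5));

    // Pages would then be distributed in proportion to these weights.
    const int total = std::accumulate(weights.begin(), weights.end(), 0);
    for (std::size_t i = 0; i < nodes.size(); ++i)
        std::printf("node %d: weight %d (~%d%% of pages)\n",
                    nodes[i].id, weights[i], 100 * weights[i] / total);
    return 0;
}
```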
All right, to demonstrate the effect of this approach, we conducted an experiment. Again, we ran Cachelib on the HMSDK kernel with the bandwidth-aware page-interleaving feature on, on a server with a one-socket CPU, eight DIMMs, and two CXL expanders. For the workload, we chose a CDN (content delivery network) workload, because it's bandwidth-hungry yet not that latency-sensitive. As a baseline, we ran Cachelib on a DRAM-only server. The graph in the middle shows the performance comparison of those two configurations. As you can see, CXL bandwidth-expanded Cachelib shows about 15% to 20% improvement in cache get throughput. The graph on the right shows the P50 and P99 latencies of the two configurations, and again, CXL bandwidth-expanded Cachelib shows about a 10% to 50% reduction in latency. This is because the memory traffic is now spread over four memory NUMA nodes instead of a single DRAM node. So this demonstrates the potential benefits of CXL memory bandwidth expansion. Now I'm going to hand this over to Daniel, who's going to talk about application-level memory tiering.
Great. Okay, there we go. So what we just talked about was an OS-level approach to tiering. What we did in Cachelib is explicitly add memory tiering support. The way it works is that each tier has its own slab memory allocator bound to a particular set of NUMA nodes. Each tier also has its own eviction list, so items are evicted from tier one to tier two according to some configurable eviction policy; we use LRU by default, but other policies are supported. Then, to reduce the overhead of this eviction, we perform it in the background: we added background threads that move items transparently among the tiers, so you don't know whether your data resides in DRAM or CXL, and we can do this in batches to reduce the overhead. Our fork is available on GitHub, and we use the develop branch as our active branch. The biggest overhead we found is moving this data among the tiers, and since we do it in batches, we saw a great opportunity to leverage some Intel-specific hardware that Sounak is going to talk about.
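Here is a toy sketch of the idea of a background thread draining batches from the fast tier's LRU tail into the slower tier. It is a simplification built on assumed types (Item, Tier), not the multi-tier Cachelib fork's actual internals.

```cpp
// Toy two-tier background batch eviction: a background thread moves the coldest
// items from the fast tier's LRU tail into the slow tier in batches, so
// foreground allocations rarely block on eviction.
#include <atomic>
#include <chrono>
#include <cstddef>
#include <list>
#include <mutex>
#include <thread>
#include <vector>

struct Item { int key; std::vector<char> payload; };

struct Tier {
    std::list<Item> lru;    // front = hottest, back = coldest (eviction candidates)
    std::mutex lock;
    std::size_t capacity;   // target item count before the evictor kicks in
};

// Move up to `batch` cold items from the fast tier to the slow tier.
void evict_batch(Tier& fast, Tier& slow, std::size_t batch) {
    std::vector<Item> moved;
    {
        std::lock_guard<std::mutex> g(fast.lock);
        while (moved.size() < batch && fast.lru.size() > fast.capacity) {
            moved.push_back(std::move(fast.lru.back()));
            fast.lru.pop_back();
        }
    }
    // The payload moves happen outside the fast tier's lock; this is the copy
    // step that could be offloaded to DSA, as discussed next.
    std::lock_guard<std::mutex> g(slow.lock);
    for (auto& it : moved) slow.lru.push_front(std::move(it));
}

void background_evictor(Tier& fast, Tier& slow, std::atomic<bool>& stop) {
    while (!stop.load()) {
        evict_batch(fast, slow, /*batch=*/64);
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

int main() {
    Tier fast{{}, {}, /*capacity=*/16};
    Tier slow{{}, {}, /*capacity=*/1024};
    std::atomic<bool> stop{false};
    std::thread bg(background_evictor, std::ref(fast), std::ref(slow), std::ref(stop));

    for (int k = 0; k < 1000; ++k) {   // foreground inserts into the fast tier
        std::lock_guard<std::mutex> g(fast.lock);
        fast.lru.push_front(Item{k, std::vector<char>(4096)});
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    stop = true;
    bg.join();
}
```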
Thanks, Daniel. All right, I'm going to focus on the Intel Data Streaming Accelerator, or DSA as it's more commonly known, and how we're leveraging it to improve allocation tail latency and CPU usage for multi-tier Cachelib using background batch eviction. For those who aren't familiar with DSA, it's one of the accelerators available on fourth-generation Intel Xeon, Sapphire Rapids, and it's also on Sierra Forest. We're focusing on off-CPU batch data migration to offset some of the overheads we were facing with CPU-based memory copies. There are three main ways we perform this eviction process. One is the pure CPU-based mechanism, which is our baseline. Another is DSA offload, doing the entire background batch eviction through DSA. And the third, which looks to be the best option at this point, is a hybrid approach: for example, if the batch size is 100, we split it, offload a certain portion to DSA, and while DSA is completing that data move, we complete the rest of the batch using a CPU-based memmove. That option reduces the overall tail latency of the background eviction process itself, and we're actually seeing good results, as I'm going to show in the next slides.
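A rough sketch of that hybrid split is shown below, assuming Intel's DML C++ API (dml::submit, dml::mem_move, dml::make_view) as the route to DSA; if your DML version's API differs, read it as pseudocode for the pattern. The 50/50 split is an arbitrary placeholder, and destination buffers are assumed to be pre-sized to match the sources.

```cpp
// Hybrid batch move: part of the batch is submitted to DSA asynchronously while
// the CPU memcpy's the remainder, then we wait for the offloaded completions.
#include <dml/dml.hpp>   // Intel DML (Data Mover Library); assumed available
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Buffer = std::vector<std::uint8_t>;

void hybrid_batch_move(std::vector<Buffer>& src, std::vector<Buffer>& dst,
                       double dsa_fraction = 0.5) {
    const std::size_t n = src.size();
    const std::size_t dsa_count = static_cast<std::size_t>(n * dsa_fraction);

    // 1) Kick off the DSA portion asynchronously via the hardware path.
    using Handler = decltype(dml::submit<dml::hardware>(
        dml::mem_move, dml::make_view(src[0]), dml::make_view(dst[0])));
    std::vector<Handler> handlers;
    handlers.reserve(dsa_count);
    for (std::size_t i = 0; i < dsa_count; ++i) {
        handlers.push_back(dml::submit<dml::hardware>(
            dml::mem_move, dml::make_view(src[i]), dml::make_view(dst[i])));
    }

    // 2) Meanwhile, the CPU copies the rest of the batch.
    for (std::size_t i = dsa_count; i < n; ++i) {
        std::memcpy(dst[i].data(), src[i].data(), src[i].size());
    }

    // 3) Wait for the offloaded moves; fall back to a CPU copy on failure.
    for (std::size_t i = 0; i < dsa_count; ++i) {
        auto result = handlers[i].get();
        if (result.status != dml::status_code::ok) {
            std::memcpy(dst[i].data(), src[i].data(), src[i].size());
        }
    }
}
```

In practice the split ratio would be tuned so that the CPU portion finishes in roughly the time the DSA portion takes, which is what keeps the overall tail latency of the eviction low.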
All right, a brief mention of the configuration we're using. The system we tested on is a two-socket Sapphire Rapids, each socket with 48 cores, with 16 32 GB DDR5 DIMMs and CentOS with a 5.15 kernel. The BKC is a kernel internal to Intel; we have tested with later kernels, but that's not part of this presentation. The other thing I should mention is that we're using the CDN workload for testing. The reason we're focusing on CDN is that it has a bimodal distribution with fairly large item sizes, and in general, large item sizes suit DSA offload better than CPU-based data movement. So we're focusing on CDN for this presentation.
This is a brief overview of some of the experiments we've run so far, on a 32 GB cache setup with 16 gigabytes on the near DRAM tier and 16 gigabytes on the far memory tier. In this case, far memory is DDR on the remote socket; we don't have the CXL hardware currently, so for these tests we used remote NUMA to mimic CXL-based memory tiering. If you look at the data, we're seeing nearly a 10% reduction in CPU usage compared to the non-DSA baseline. And on allocation tail latency, shown as set p90 and set p99 in the diagram, we're seeing significant improvement: the P90 is reduced by nearly one third, and the P99 by nearly two thirds. I'll pass it back to Daniel.
Thanks, Sounak. So that experiment used remote DRAM to emulate CXL, but we did actually have a chance, with our colleagues at SK Hynix, to do some experiments on capacity memory expansion. In this case, we looked at two questions. First, to what extent could you use CXL memory to replace DRAM while maintaining the same performance? And second, how much additional throughput can be gained by adding CXL memory to your original DRAM configuration? We used Meta's open-source KV cache traces, and we have two scenarios. Case one is a cache without an NVMe device: you can configure Cachelib to use an NVMe device to store even more data, but here we take that out just to measure raw performance. We also limit the overall throughput in order to simulate the network I/O bottleneck, since in real deployments most caches are bound by network throughput. In case two, we add the NVMe device in to see, if we were to increase the byte-addressable memory, how much performance improvement we'd get by offloading data that would otherwise sit on the NVMe device onto CXL memory.
So in this case, we're comparing a couple of different configurations. The first is application-level control, our explicit memory tiering. The next is OS-level control, using HMSDK to decide which pages are allocated to CXL memory and which to DRAM. The point is that explicit memory tiering gives the best performance because of the software optimizations we mentioned before: we can do the data migration in the background, and we can use DSA to reduce the software overhead of moving data among the tiers. We also show a CXL-only and a DRAM-only cache as baselines: DRAM-only is the black line, and the yellow line is CXL-only. Application-level control actually outperforms the DRAM-only cache on some of the P99 latencies, because we're able to evict items in the background and keep allocation slots available, which also reduces some internal lock contention that we found.
In our next case, we added CXL memory to the DRAM configuration. What we found is that as the in-memory cache hit ratio improves with larger cache sizes, going from 2 gigabytes of DRAM to 16 gigabytes, we significantly improve throughput over our baseline 32 gigabyte DRAM cache. In this case, we're able to outperform OS-level HMSDK because we get better DRAM bandwidth utilization: the most popular objects are placed in DRAM by the LRU ordering, and only the tail, the least popular objects, are hit in CXL. So explicit memory tiering gives better DRAM bandwidth utilization than OS-level control. And as we increase cache capacity, more data is served from the CXL device rather than the NVMe device.
And as our call to action: given these CXL devices and the heterogeneous memory systems we now have, how do we enable them, and which approach should we take, OS-level or something application-specific? What we found is that key-value caches, especially something open source like Cachelib, are strong candidates for evaluating these multi-tier DRAM-CXL systems because they respond well to additional cache capacity: we can increase the cache capacity and increase the hit ratio. We demonstrated that our application-level tiering outperforms the OS-level solution, but that requires application-specific changes. We also showed earlier that HMSDK can be used for bandwidth expansion; for caching workloads with large objects, interleaving works well to increase your overall system bandwidth. Finally, our software is under active development. We have our fork, Intel's multi-tier Cachelib, where we work on the develop branch, and we're actively making pull requests toward Meta's Cachelib, so it's slowly going upstream. There's also SK Hynix's HMSDK and Meta's Cachelib, just for your information.
And I'd like to invite everybody, if you're more interested, we have our demo set up in the Experience Center. Come check us out. And with that, we'll have some disclaimers. And any questions? Thank you.
One question. I'm curious whether this was going through switches, or whether there were additional latencies, essentially.
That's interesting. One thing we could do: our CXL memory device is directly attached to the NUMA node we're running on, so we're getting about 250 nanoseconds of latency. If you were to go across NUMA nodes, I think that jumps up to around 400, for example. We could run some experiments like that to see how much impact there is on performance. Is that what you're asking? Yeah. We could do that. Yeah. Absolutely.
So you tried large chunk sizes of data. I know CDN has bigger item sizes. Are you expecting the main CXL use case to be large chunk size data access, or can it also support--
Actually, for the capacity memory expansion experiments, those were smaller objects. When we switch to larger objects, we see a heavier increase in bandwidth as well, and the system becomes bandwidth bound. That's why we took these two different workloads, two approaches: one latency-bound, one bandwidth-bound.