Hello, and welcome to the talk. The topic of this presentation is CXL-enabled heterogeneous active memory tiering. My name is Bhushan Chitlur. I'm a data center systems architect at Intel, and I focus on next-generation data center architectures that include heterogeneous accelerators, memory, and intelligent fabric. This is a natural follow-on to the prior talk, which focused on pooling; this presentation hopefully motivates people to look beyond pooling to achieve better TCO.
As many speakers have noted today, there is a well-recognized problem: memory is one of the most constrained resources in the data center, and yet it is also one of the most underutilized. A lot of analysis has been done by universities as well as industry, and one such analysis was published by Meta in 2022 in a really good paper called TMO, Transparent Memory Offloading; I captured some aspects of that paper on the left of this slide. They pointed to two things. The first is this notion of underutilized memory: when an application or a VM requests a certain amount of memory, it does not actually use the whole amount. On average only about 50 to 60% is used, and the remaining roughly 40% is stranded memory. As the prior speaker said, stranded memory is not used by the application that requested it, nor is it available for other applications to improve their own performance, so it is a completely wasted resource. The second aspect highlighted in the paper is cost. As a percentage of the overall infrastructure cost, system memory is about a third, and it is expected to grow going forward. You can think of that as a huge problem, but also as an opportunity. The paper also looked at different types of memory, including compressed memory and SSDs. The SSD cost line is particularly interesting: it is essentially flat across multiple generations and stays below 3% of the overall cost. So the thinking is: can we leverage cheaper forms of memory and still achieve acceptable performance, in order to reach a better TCO? That is the problem statement of this presentation. On the right-hand side you see the classical memory pyramid: you want hot pages mapped to tiers of memory close to the CPU and cold pages mapped to tiers further away.
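Taken together, those two figures already suggest the size of the opportunity. As a rough, purely illustrative back-of-envelope calculation (mine, not a number from the paper):

```latex
% Illustrative only: ~40% of provisioned DRAM stranded, DRAM ~1/3 of infra cost
\[
  \text{stranded memory as a share of total infrastructure cost}
  \;\approx\; 0.40 \times \tfrac{1}{3} \;\approx\; 13\%
\]
```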
Given this background, I want to distinguish between two kinds of disaggregated memory topologies: a centralized disaggregated pooling topology, which the prior speaker covered extensively, and a disaggregated tiering topology. Disaggregated memory pooling is a great start. In the past couple of years it has gone from appearing on many slides to real implementations, which is great progress. In this topology you have a centralized pooled memory controller hooked up to multiple nodes. That controller manages the pool of memory connected behind it, allocates it, onlines and offlines it, and enables efficient sharing of the tier 2 memory shown in the left-hand picture. This addresses some of the underutilization problem, but it does not fully address TCO: it lowers TCO to some extent, but we are still using the same expensive memory we would use in the local tiers. The right-hand picture shows the use of different, potentially cheaper forms of memory while still achieving acceptable performance. This is a memory tiering picture: tier 1 is the DDR connected directly to the CPU, and tier 2 is CXL-connected memory behind a disaggregated memory controller. That disaggregated memory controller can implement DDR, HBM, memory-semantic SSDs, or even ordinary SSDs with a smart front end, and it can also have remote connectivity to other memory nodes, or to other compute nodes that contribute memory to the overall pool.
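On a Linux host, CXL-attached tier 2 memory typically appears as a CPU-less NUMA node. The sketch below uses libnuma to enumerate the configured nodes and flag the ones that have memory but no local CPUs; it assumes libnuma is installed and that the platform exposes the tier 2 memory this way, which is common but not universal.

```c
// Minimal sketch (assumes libnuma, compile with -lnuma): list NUMA nodes and
// flag CPU-less ones, which is how CXL-attached tier 2 memory commonly shows
// up on Linux.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();
    struct bitmask *cpus = numa_allocate_cpumask();

    for (int node = 0; node < nodes; node++) {
        long long free_bytes = 0;
        long long size = numa_node_size64(node, &free_bytes);
        numa_node_to_cpus(node, cpus);            /* CPUs local to this node */
        int ncpus = (int)numa_bitmask_weight(cpus);

        printf("node %d: %lld MiB total, %lld MiB free, %d CPUs%s\n",
               node, size >> 20, free_bytes >> 20, ncpus,
               (ncpus == 0 && size > 0) ? "  <-- likely CXL / tier 2" : "");
    }

    numa_free_cpumask(cpus);
    return 0;
}
```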
The point here is that the DDR in this tier 2 does not have to be the latest and greatest. You could use an older generation of DDR, or even SSDs with the smarts in front of them, and still achieve acceptable performance. We have built multiple prototypes to demonstrate the viability of this, and we see great promise in using this kind of heterogeneous memory for this use case. Even though the tier 2 memory is physically local to the node, it is shared among all the compute entities on that node: if you have a GPU or another accelerator, it can use that memory as part of system memory too. All the memory sitting behind the CXL disaggregated memory controller shows up as a single system-memory NUMA node, whether it is implemented with DDR, HBM, SSDs, or memory reached over the network. It is that concept that lets you seamlessly integrate different heterogeneous memory technologies without the application or the host having to know the exact implementation. Given this topology, let's now dig one step deeper.
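Because tier 2 is just a NUMA node to the host, an application can place a buffer there with ordinary NUMA APIs and then use plain loads and stores, with no knowledge of whether DDR, HBM, or an SSD with a cache in front backs it. Below is a minimal libnuma sketch; TIER2_NODE is a placeholder for whatever node id the previous listing identifies on your own system.

```c
// Minimal sketch (assumes libnuma, compile with -lnuma): allocate a buffer on
// the CXL tier 2 NUMA node and use it like ordinary memory. TIER2_NODE is a
// hypothetical node id.
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define TIER2_NODE 2                    /* hypothetical CXL tier 2 node id */

int main(void) {
    if (numa_available() < 0)
        return 1;

    size_t len = 64UL << 20;            /* 64 MiB */
    char *buf = numa_alloc_onnode(len, TIER2_NODE);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    /* Plain loads/stores: the backing media (DDR, HBM, SSD behind a cache)
       is invisible to the application. */
    memset(buf, 0xA5, len);
    printf("first byte on tier 2: 0x%02x\n", (unsigned char)buf[0]);

    numa_free(buf, len);
    return 0;
}
```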
So what is active memory tiering? Active memory tiering is the ability to use cheaper, slower memory augmented with accelerator technologies that offset the slowness. The claim is that you can use slower memory, but if you put the right smart front end in front of it, you can still achieve acceptable performance. You can view this architecture as three layers. The first is the memory-side layer, almost a physical layer, that connects to the memory technologies themselves: DDR, HBM, or SSD. Because these are slower technologies, you need something in front of them to offset that slowness. In the case of SSDs, for example, you of course need an NVMe host controller to talk to the SSD. But an SSD is a block device, and a CPU makes requests at cache-line granularity, so you need functionality that converts a cache-line request into a block request. And because SSDs have tens of microseconds of latency, you cannot expose that directly to an application; that simply does not work. So you build a caching layer in front of the SSD: it reads a block, caches it up front, and gives you order-of-magnitude better latencies for sequential accesses and even strided patterns. This is something we have built and demonstrated with internal prototypes, with some really encouraging results: even though SSDs have roughly 10 microsecond latencies, when you run industry-standard benchmarks like MLC and other memory characterization workloads, you see under half a microsecond of latency, somewhere in the 500 nanosecond range, which makes this a very practical way to implement tier 2 memory. You could also use DDR or HBM itself as a cache for the SSDs, or build an SRAM-based cache in front, to hide some of that latency; the same applies to the remote connectivity path. The key is that the memory-side implementation is completely transparent to the host. There may be an NVMe host controller talking to SSDs, a DDR controller, an HBM controller, but none of that is visible to the host: it all shows up as a single pool of memory exposed via CXL semantics and used by an application as allocations from a single NUMA node. Once you have built this layer, which provides cache-line granularity accesses and manages the heterogeneity of the memory technologies inside the disaggregated memory controller device itself, you can layer infrastructure services on top of it. The prior speaker mentioned a few of those, including memory health monitoring and detection. In addition, you can do hot and cold page detection, which lets you seamlessly migrate pages between the hot memory tier and the colder tier 2 memory.
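To make the front-end idea concrete, here is a conceptual software model, in C, of the cache-line-to-block translation with a small direct-mapped block cache in front of the SSD. This is only a sketch of the technique, not the actual device logic; the sizes and the nvme_read_block() stub are hypothetical stand-ins.

```c
// Conceptual model (not the real device logic): turn 64 B cache-line requests
// into 4 KiB block requests, with a small direct-mapped block cache in front
// of the SSD to hide its ~10 us access latency.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHELINE_SIZE 64
#define BLOCK_SIZE     4096
#define CACHE_BLOCKS   1024                     /* 4 MiB of block cache */

struct cache_entry {
    uint64_t block_no;                          /* which SSD block is resident */
    int      valid;
    uint8_t  data[BLOCK_SIZE];
};

static struct cache_entry cache[CACHE_BLOCKS];

/* Stand-in for the NVMe host controller path: a real device would issue a
   ~10 us NVMe block read here; this stub just fills a recognizable pattern. */
static void nvme_read_block(uint64_t block_no, uint8_t *dst)
{
    memset(dst, (int)(block_no & 0xFF), BLOCK_SIZE);
}

/* Service a 64-byte cache-line read at byte address 'addr'. */
static void read_cacheline(uint64_t addr, uint8_t out[CACHELINE_SIZE])
{
    uint64_t line   = addr & ~(uint64_t)(CACHELINE_SIZE - 1); /* align down  */
    uint64_t block  = line / BLOCK_SIZE;        /* cache line -> block       */
    uint64_t offset = line % BLOCK_SIZE;        /* byte offset within block  */
    struct cache_entry *e = &cache[block % CACHE_BLOCKS];

    if (!e->valid || e->block_no != block) {    /* miss: slow SSD block read */
        nvme_read_block(block, e->data);
        e->block_no = block;
        e->valid = 1;
    }
    /* Hit (or just-filled block): served at DRAM/SRAM speed. Sequential and
       strided patterns mostly hit, which is where the ~500 ns average
       latencies quoted above come from. */
    memcpy(out, &e->data[offset], CACHELINE_SIZE);
}

int main(void)
{
    uint8_t line[CACHELINE_SIZE];
    read_cacheline(0x0000, line);               /* miss: fetches block 0     */
    read_cacheline(0x0040, line);               /* hit: same 4 KiB block     */
    printf("byte 0 of second line: 0x%02x\n", line[0]);
    return 0;
}
```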
You could also implement data deduplication, compression and decompression, encryption, and even data replication, all happening seamlessly on the device itself, as well as workload-specific accelerators for AI or databases. If you layer the heterogeneous memory together with these accelerator technologies, all built as a single CXL-connected device, then you will have some software touch points for the accelerator components, but none for the heterogeneous memory itself. We think this kind of topology is very well suited to driving TCO down by an order of magnitude, and we have some point data to support that. Part of the motivation of this talk is to get more people interested in exploring these topologies, which can reduce TCO substantially even when compared to memory pooling, which is itself a great first step.
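The hot and cold page detection mentioned above ultimately results in pages moving between NUMA tiers. Whether that migration is driven by the device, the kernel, or user space is an implementation choice; as one illustration of the host-side mechanism only, the sketch below uses Linux's move_pages(2) (via libnuma's numaif.h) to demote a single page to a hypothetical CXL tier 2 node. This is not the scheme described in the talk, just an example of the underlying mechanism.

```c
// Illustrative host-side demotion of one "cold" page to the CXL tier 2 NUMA
// node using move_pages(2). TIER2_NODE is a hypothetical node id; real tiering
// software would batch many pages chosen by hot/cold detection.
// Compile with -lnuma.
#include <numaif.h>      /* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TIER2_NODE 2     /* hypothetical CXL tier 2 node id */

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);

    /* A page-aligned buffer standing in for a page that hot/cold detection
       has classified as cold. */
    void *page = NULL;
    if (posix_memalign(&page, (size_t)page_size, (size_t)page_size) != 0)
        return 1;
    memset(page, 0, (size_t)page_size);  /* fault it in on the local tier */

    void *pages[1]  = { page };
    int   nodes[1]  = { TIER2_NODE };    /* destination tier */
    int   status[1] = { -1 };

    /* pid 0 = this process; MPOL_MF_MOVE migrates only pages exclusive to it. */
    long rc = move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
    if (rc < 0)
        perror("move_pages");
    else
        printf("page status after demotion: %d (node id, or negative errno)\n",
               status[0]);

    free(page);
    return 0;
}
```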
Given an architecture this broad, with so many different touch points, what kind of platform can support the implementation? We propose an FPGA-based platform. FPGAs are well suited to this class of application because they support multiple I/O and memory standards: DDR, HBM, and the ability to implement NVMe host controllers on the device itself. Our latest Agilex family of FPGAs supports CXL x16 as well as high-speed Ethernet. So this kind of device is well suited not just from an interface perspective; it can also implement the acceleration technologies required, and you can customize them and evolve a portfolio of your own custom solutions to meet a particular class of workloads.
In addition to the Intel offering, there is a rich ecosystem of partners building equivalent cards that cater to this class of applications. You can partner with them, start with an evaluation, and go all the way to deployment at scale. We are very excited about this ecosystem and about where this domain is heading.
So, the last slide. In summary, we think this is the best possible time to explore heterogeneous memory tiering. There is a great value proposition and a great opportunity to lower the overall TCO. We think FPGAs are a great device not just for evaluating this kind of platform, but also for deploying it at scale, and many of the technologies you would need are already available, which gives you a good head start on evaluating this kind of architecture. There is a demo by UIUC in conjunction with Intel that offloads memory-intensive kernel features to a CXL Type 2 device; I encourage everyone to go and speak to the folks there, it is an excellent demo. There are links here to the Agilex 7, our latest family of FPGAs, and also a link to the CXL IP for those of you who are interested. We also have Pekon Gupta here in the audience: if anybody is interested in learning more about the partner ecosystem, or needs more information on our boards, our partner boards, or anything else, please feel free to speak to Pekon. He is from our business unit and can help facilitate those conversations going forward. Thank you.