Hi everybody, this is Prakash, here with Chris, and hopefully you guys are almost ready to retire for the day. We're going to be talking about a question that has been asked a lot: CXL was a very exciting technology, so where are all the products? What we're going to try to explain is why it takes time to deploy these things, especially in a hyperscale kind of environment.
So first of all, I want to start by talking about why we think CXL is an important technology; this has been brought up a number of times, so I'll go through it quickly. Basically, the problem in front of us is that memory is a large and growing portion of a server's total cost of ownership (TCO). Because the cost per gigabyte has become essentially flat, every time the machine scales up, the memory cost scales up along with it. So it's a very big piece of server cost, and we want to find ways to reduce it. The second part is that DRAM chips are, by their nature, the most predominant silicon on a server, so they contribute heavily to the carbon footprint of that server, and our sustainability objectives push us to reduce that as much as possible. So how does CXL help with this? Summarizing a number of papers presented by different hyperscalers: across many of our workloads, a majority of our servers see a large portion of memory sitting inactive, not accessed for a significant period of time. For minutes at a stretch, up to half the memory in a server can consist of pages that have not been accessed, which means these are really cold pages and they don't necessarily need to be in a high-performance tier. The other observation is that we have a lot of DDR4 memory from servers that were deployed in the past. Those DIMMs are being decommissioned and could be put to good use, except that the CPUs of today don't support DDR4. That's where CXL comes in: CXL is a standard interface available on all CPU vendors' current offerings, and it gives us a convenient point to attach these reused DDR4 DIMMs.
So, I want to talk about the challenges of taking that concept and actually deploying it at scale. This is roughly circa last year, when we first made an attempt to take CXL to production. Chris was at Meta at the time, so he has seen a lot of this journey as well. On the right-hand side, you see the server, which has eight channels of DDR5, and in the front of it there's a module with a CXL ASIC and four DDR4 DIMMs on it. What we were trying to do was take a 256-gigabyte machine, which would normally have been built with eight 32-gigabyte DDR5 DIMMs, and split it so that a quarter of the memory moved over to CXL: eight 24-gigabyte DDR5 DIMMs to get to 192 gigabytes, plus a 64-gigabyte module over CXL using four 16-gigabyte DDR4 DIMMs. Naively, you would expect that because the DDR4 DIMMs were "for free," you'd get one-fourth of the memory for free. But reality, obviously, is a much harder customer. What we learned was, number one, that 24-gigabyte DIMMs are actually more expensive on a dollar-per-gigabyte basis than 32-gigabyte DIMMs, because they're not mainline; they use 24-gigabit chips, and so on. The other part was that single-rank DIMMs actually perform worse than dual-rank DIMMs from a bus-utilization perspective, and the overheads we had from the cards, the controller, and the DDR4 DIMM recertification costs all add up and neutralize some of the value proposition. In addition, we had schedule challenges because the controllers were first generation and had critical bugs that caused deployment to be delayed. The net of it is that we were seeing a marginal business-case improvement, but the downside risks to deployment were pretty high, so we decided not to take this to production. We did use it to do application-level characterization, and I'll talk about that in a couple of slides.
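To make that arithmetic concrete, here is a minimal sketch that just recomputes the capacity split described above. The DIMM counts and sizes come from the talk; the dollar-per-gigabyte figures and the CXL overhead adder are hypothetical placeholders chosen only to show the shape of the trade-off, not real pricing.

```python
# Illustrative arithmetic only: DIMM counts/sizes come from the talk;
# all dollar figures below are made-up placeholders, not real pricing.
baseline = {"ddr5_dimms": 8, "ddr5_gb": 32}                 # 8 x 32 GB = 256 GB all-DDR5
split    = {"ddr5_dimms": 8, "ddr5_gb": 24,                 # 8 x 24 GB = 192 GB DDR5
            "ddr4_dimms": 4, "ddr4_gb": 16}                 # + 4 x 16 GB reused DDR4 over CXL

# Hypothetical cost assumptions: 24 GB DIMMs cost more per GB than 32 GB ones,
# reused DDR4 is cheap but not free (recertification), and the CXL module adds a fixed cost.
price_per_gb = {"ddr5_32": 3.0, "ddr5_24": 3.4, "ddr4_16": 0.5}
cxl_overhead = 60.0   # controller + card + power, per module (placeholder)

def capacity_gb(cfg):
    return (cfg.get("ddr5_dimms", 0) * cfg.get("ddr5_gb", 0)
            + cfg.get("ddr4_dimms", 0) * cfg.get("ddr4_gb", 0))

baseline_cost = baseline["ddr5_dimms"] * baseline["ddr5_gb"] * price_per_gb["ddr5_32"]
split_cost = (split["ddr5_dimms"] * split["ddr5_gb"] * price_per_gb["ddr5_24"]
              + split["ddr4_dimms"] * split["ddr4_gb"] * price_per_gb["ddr4_16"]
              + cxl_overhead)

print(f"baseline : {capacity_gb(baseline)} GB, ${baseline_cost:.0f}")
print(f"CXL split: {capacity_gb(split)} GB, ${split_cost:.0f} "
      f"({(split_cost / baseline_cost - 1) * 100:+.1f}% cost vs. baseline)")
```

With these placeholder numbers the split comes out only a few percent cheaper than the all-DDR5 baseline, which is the "marginal business case" pattern described above.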
So, based on what we learned there, we decided to approach it a bit differently for our second attempt. Here we decided to target memory-heavy applications, which constitute about twenty to thirty percent of our entire fleet. In this particular example, shown on the right, it's meant to be a one-terabyte-capacity server. It has 12 channels of DDR5, with the CPU at the top of the picture, and in the front you can see two CXL ASICs, similar to the one on the previous slide, each with four DDR4 DIMMs. In this configuration we wanted to get to a terabyte. If you used only DDR5, that would be roughly twelve 96-gigabyte DDR5 DIMMs. So we made an alternate configuration where we used 64-gigabyte DDR5 DIMMs for main memory and 32-gigabyte DDR4 DIMMs behind CXL to get to a similar capacity. The main reason for going this route was to reduce the overhead you have for CXL: we combined the two ASICs onto one single card, and the power delivery and so on could also be combined to reduce cost. So this is all well and good, and things looked promising, but this is also not an easy thing to take to production. The reason is that we developed our tiered-memory solution for the first-generation system, which was designed for 256 gigabytes. Now you've quadrupled the amount of memory in the system and also quadrupled the amount of memory behind CXL, so the tiered-memory stack has to evolve so that it doesn't burn too many CPU cycles monitoring which pages are hot versus cold and moving them to the appropriate tier. Next, we have a lot of workloads, and deploying this widely, so that all workloads can use the full complement of memory without performance problems, requires us to test those workloads independently, and that requires hundreds of systems. So, in order to actually validate the performance hypothesis, we need to deploy in large quantities. The next thing is that the complexity of the system has grown: now you have multiple CXL ASICs, and you have to make sure that memory pages are allocated evenly among them, so there are several BIOS and performance knobs that need to be tuned to get to something deployable. Finally, and I think Chris will cover this in great detail, we want the same kind of telemetry, performance monitoring, and health monitoring for CXL that we have for native DDR5. That stack has been taking a while to build up; a lot of vendors are working on it, but we need a unified solution so that, operationally, DDR4 behind CXL looks exactly like DDR5.
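To make the tiering discussion a bit more concrete, here is a minimal observation sketch, not the tiered-memory stack described in the talk: on a Linux host where the CXL expander typically shows up as CPU-less NUMA nodes, per-node capacity and usage can be read from standard sysfs files, which is the kind of signal a tiering daemon would start from. The sysfs paths are standard; the "far tier equals CPU-less node" heuristic is an assumption.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: report per-NUMA-node capacity and usage so you can
see how much memory sits on CXL-backed (CPU-less) nodes versus direct-attached
DDR5 nodes. Assumes a Linux host where CXL memory is onlined as NUMA nodes."""

import glob
import re
from pathlib import Path

def node_meminfo(node_dir):
    # Each node exposes /sys/devices/system/node/nodeN/meminfo with lines like
    # "Node 0 MemTotal:       131590120 kB"
    info = {}
    for line in Path(node_dir, "meminfo").read_text().splitlines():
        m = re.match(r"Node\s+\d+\s+(\w+):\s+(\d+)\s+kB", line)
        if m:
            info[m.group(1)] = int(m.group(2)) * 1024  # bytes
    return info

for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = Path(node_dir).name
    # Heuristic: CPU-less nodes (empty cpulist) are usually the CXL/far tier.
    cpuless = Path(node_dir, "cpulist").read_text().strip() == ""
    mem = node_meminfo(node_dir)
    total = mem.get("MemTotal", 0) / 2**30
    free = mem.get("MemFree", 0) / 2**30
    tier = "far (CXL?)" if cpuless else "near (DDR5)"
    print(f"{node}: {tier:11s} total={total:7.1f} GiB used={total - free:7.1f} GiB")

# Kernel-driven demotion of cold pages to the far tier, where supported, is
# toggled via /sys/kernel/mm/numa/demotion_enabled.
```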
So now I'll come back to the case study on application performance. This was done with the 256-gigabyte machine I mentioned before, and we had built hundreds of these systems to make sure we could validate our performance hypothesis. We used two different workloads: one was a key-value cache workload, and the other was a graph database cache workload. Both applications use roughly 244 gigabytes of the 256; the rest is used by the system. These workloads are not particularly bandwidth-intensive, but they are very capacity-sensitive, so the hypothesis was that we should be able to meet the same performance with a portion of the memory replaced by CXL memory. If you look at the bar charts on the right, the blue bars are the system without CXL, which is the baseline, and the red bars are the CXL-enabled system. The leftmost bar shows the throughput of the machine, and you can see that CXL and non-CXL are very close to each other. The second shows the latency observed by the workload; this is a user-facing latency, and there is a slight increase with the CXL-enabled system, but it's well within the latency guarantees that are needed. The rightmost bar shows the power of the system. As you'd expect, there's a slight increase in power because of the CXL controller and the added DDR4 DIMMs, but the increase is not as big as it might seem, because some of the DDR5 power is no longer needed. That is also reflected in the bandwidth numbers in the third bar, which show that DDR5 bandwidth dropped a little because some of the traffic spilled over to CXL. The net of this is that there's not much performance degradation in a CXL-based configuration, and there's a lot of opportunity to save cost if we can replace some of this memory with reused DDR4. With that, I'll hand it over to Chris to talk about the deployment-at-scale challenges.
Thank you, Prakash. OK. So, with any new technology, there's always the need to build not just the technology components themselves, but also the ecosystem that goes around them. At the end of the day, it's not enough to just build a chip and the firmware that runs on it; you have to be able to integrate that into the overall system, and that involves a number of pieces to make the overall solution work. What I'm showing here is the need for application integration and application performance, which is what Prakash just talked us through. There's a need for the right level of RAS, that is, reliability, availability, and serviceability, and the right integration of those RAS capabilities. You have to have the security capabilities. You have to have in-band system management and telemetry. You have to have out-of-band system management and telemetry. And of course, these devices all need to interoperate with the CPUs themselves, as well as with the DIMMs located behind the controllers. It takes several years to develop the aggregate solution stack I'm showing here on the right. I'm going to do my best to sprint through this in about seven minutes or so, so bear with me a little bit. OK, let's take each of these one at a time. Prakash talked about the applications already.
So let me dive head-first into RAS. Obviously, with RAS you could spend multiple sessions just on this one topic, as it's incredibly deep, as many of you know, so I'm just going to hit the highlights. From a requirements perspective, depending on your system implementation, there are multiple methods of notification: generally firmware first, OS first, or out-of-band. Because there is diversity in how people build these systems, as a builder of this type of silicon solution you really end up having to support all of them, because different customers have different requirements in this space. Then, overall, there's the requirement that if I'm going to put memory behind CXL, I want it to behave the same; I want it to have the same level of reliability as the memory attached directly to the CPU itself. So you need server-grade memory RAS as well. And because you're now using CXL to connect to that memory, you need the same level of RAS capability on the link itself. Fundamentally, at the end of the day, you also have to be able to test all of this capability, which means you have to be able to inject errors so that you can validate all of these error flows and make sure everything actually works the way you expect.
So now that we understand the requirements, let's talk about where we are with the solutions. What I'm focused on here is two specific aspects: first, an OS-first notification mechanism, and second, one example of memory RAS behind the controller actually working. At the top, we have injected an uncorrectable DRAM error into the memory behind the CXL controller. The CXL controller traps that, logs it as an error within the controller itself, and then propagates that event record up to the OS. That's the second step you see here: you can actually see the event getting logged into dmesg in the operating system. In this case, because it's an uncorrectable error, you're going to log not only the uncorrectable error but also a poison event, and you can see that both of those events have occurred and have been logged in dmesg. Finally, because that's not necessarily enough for most people to isolate the error, you also have to be able to say exactly where the error occurred: on this particular DIMM, this particular rank, this bank, and so forth. That's what I'm showing at the bottom: the CXL event record, as defined in the CXL spec, reporting those additional details. So now we've shown the OS-first mechanism working and memory RAS behind the controllers working. What's next?
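As an illustration of what that OS-first check can look like from the host side, here is a minimal sketch that scans the kernel log for CXL event and poison messages after an injected error. The grep patterns are assumptions, since the exact message text varies by kernel version; production monitoring would consume the event records or trace events directly rather than grepping dmesg.

```python
#!/usr/bin/env python3
"""Minimal sketch: look for CXL memory-error and poison messages in the kernel
ring buffer after an injected uncorrectable error. Message text varies by
kernel version, so the regexes below are illustrative, not a stable interface."""

import re
import subprocess

# Read the kernel log (may require elevated privileges on hosts with dmesg_restrict).
log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout

patterns = {
    "cxl_event": re.compile(r"cxl.*(event|error)", re.IGNORECASE),
    "poison": re.compile(r"poison", re.IGNORECASE),
    "memory_failure": re.compile(r"Memory failure", re.IGNORECASE),
}

for line in log.splitlines():
    for name, pat in patterns.items():
        if pat.search(line):
            print(f"[{name}] {line.strip()}")
            break
```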
So let's look at the firmware first piece of things. In this case, we're injecting a CXL.io error - specifically in the downstream direction, in other words, from the CPU to the CXL controller side of things. So the controller has trapped that message. It has logged it as a non-fatal error. It has logged the address as well. And then it has propagated this back to the system firmware. And that's what you can see in step two here: This is the BIOS UART log that shows that that error has been captured by the system firmware. The system firmware will then propagate this error to the operating system if necessary. In this case, that AER is now logged in dmesg as well. OK, so now we've talked a little bit about the link RAS part of things, as well as firmware first. And we've shown that we were able to do that as well.
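As one hedged illustration of confirming a non-fatal CXL.io error from the host side (this is an assumption about a convenient observation point, not the firmware-first path itself), recent Linux kernels expose per-device PCIe AER counters in sysfs; a quick scan for non-zero non-fatal counters might look like this:

```python
#!/usr/bin/env python3
"""Illustrative sketch: print any non-zero non-fatal AER counters that the
kernel exposes per PCIe device, one way to see that an injected CXL.io
non-fatal error was observed by the host."""

import glob
from pathlib import Path

for counter_file in glob.glob("/sys/bus/pci/devices/*/aer_dev_nonfatal"):
    dev = Path(counter_file).parent.name  # PCI BDF, e.g. 0000:17:00.0
    lines = Path(counter_file).read_text().splitlines()
    # Each line is "<error name> <count>"; keep only non-zero counters.
    nonzero = [l for l in lines
               if l.split() and l.split()[-1].isdigit() and int(l.split()[-1]) > 0]
    if nonzero:
        print(dev)
        for l in nonzero:
            print("   ", l)
```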
OK, so here's another incredibly deep topic that I'm going to spend maybe two minutes on. The point is that security is incredibly important to most customers these days, and it's essential to have a full security solution when you deploy these types of things, especially with CXL memory. I won't go through all the requirements here, but what I want you to take away is what we have implemented so far: we have Secure Boot, Secure Firmware Update, IDE, and Memory Encryption enabled, and all of that is working today. You can see a screenshot at the bottom right showing the chip coming up in the correct secure mode. The other point I want to make is that you also have to make sure your firmware has been audited and is sufficiently secure, and the OCP SAFE process developed here in OCP is one excellent way of doing that; it's a process we have completed.
All right, let's jump into in-band management real quick. There's a bunch of things you need, of course: the telemetry, the health, the performance, the event records, and so forth. What I'm showing on the right is that we have this running on both Windows and Linux, using the open source Linux cxl CLI utility to pull this information over the standardized interface, and it's running not only on multiple operating systems but with all the major CPUs, including Arm options.
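As a hedged example of that in-band path, a small wrapper around the open source cxl CLI (from the ndctl project) might look like the sketch below. The exact flags and JSON field names depend on the installed ndctl version, so treat the field names as assumptions rather than a stable contract.

```python
#!/usr/bin/env python3
"""Hedged sketch: call the Linux `cxl` CLI and summarize each CXL memory
device. JSON field names and flag availability vary by ndctl version."""

import json
import subprocess

# `cxl list -M` emits memory devices as JSON; `-H` adds health info on
# versions that support it.
out = subprocess.run(["cxl", "list", "-M", "-H"],
                     capture_output=True, text=True, check=True).stdout
devices = json.loads(out) if out.strip() else []
if isinstance(devices, dict):   # a single device may be emitted as one object
    devices = [devices]

for dev in devices:
    name = dev.get("memdev", "unknown")
    ram_gib = dev.get("ram_size", 0) / 2**30
    health = dev.get("health", {})  # present only when health reporting is available
    print(f"{name}: volatile capacity ~{ram_gib:.0f} GiB, "
          f"health fields reported: {sorted(health.keys()) or 'none'}")
```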
Then, on the out-of-band side of things, you also have to have this working with the BMC. What I'm showing here are screenshots of OpenBMC running, doing MCTP discovery to the device itself and reporting the DIMM FRU information.
Then you also have to make sure you're correctly talking to all the CPUs: you pass the compliance tests, you work with the platform firmware, all of your initialization flows work, and so forth. I'm happy to say that we have our controller working with every major CPU supplier on all of the latest generations, including Granite Rapids and AMD Turin platforms, as well as a number of custom Arm processors.
DIMMs are incredibly difficult to actually work with and get the full interoperability working behind the controllers. There's a ton of coverage that has to be worked through here as well. Really what I want to say is that we have this worked out with all three of the major memory suppliers today. This is working. This is robust. We have all the memory patterns working. The training sequences are solid. Again, another piece that we have been able to complete.
So really what I want to focus on here at the end is that if you want to deploy CXL memory at scale, you're going to need to do it with a holistic solution. It needs to be fully integrated into the platform, into the system, in order to actually deploy this at scale. So we've talked about the application performance, the RAS, the security, the in-band management, out-of-band management, CPU interop, DIMM interop. And we have a complete solution stack - everything from the silicon to the hardware to the entire stack of software required to go deploy this at scale.
I'm going to skip this in the interest of time real quick, but obviously there's a lot more that we need to do in terms of CXL in the future.
But what I want to leave you guys with - if there's anything you take away from this entire conversation - is that CXL is here. It is implemented in real production-capable hardware with the entire ecosystem and the software stack required to do it. It is performant. It is secure. It is reliable. And I would encourage you to come talk to us more to learn more. I'd point you to the booth, but naturally the expo hall is closed. So come talk to me afterwards if you'd like to discuss it some more. With that, thank you guys very much for your time.