YouTube: https://www.youtube.com/watch?v=uDQfwzMl54A
Text:
All right. Thank you all for coming today. Thank you for coming to our CXL forum. If all you saw was the name of this talk, "Endless Memory," you would probably say that's a mighty bold claim. But hopefully I can convince you with just a quick three or four minute demo that that is really the future ahead of us. It's rather interesting. But before we get there, I always like to start my talks with a bit of a joke.
So about 20 years ago, a meme started on the internet when someone on a forum asked where they could download some more RAM, around the 2004 time frame. And of course, the internet being what it is, someone decided to go off and make a website where you could ostensibly download some more RAM. And if you did that, you would get forwarded to Mr. Astley's music video, as one does on the internet.
And 20 years later, interestingly enough, CXL comes along and says, "You know what? That's not a bad idea. Let's do that."
So let's do that. What I'm going to show you here, it's a little hard to see at the bottom, but this is the front end for our memory composer system, our sample front end anyway. And what I'm going to show you is what it looks like to have a cluster hooked up to an endless memory service. Down at the bottom, there's a hard-to-see terminal that shows the status of memory on each server. If you come by our booth later, I can show this to you live, attached to the real hardware and actually working. But let's give it a play. So here, we're doing a rough overview of the service itself. Our composer is monitoring two hosts on a cluster, and those hosts are hooked up to one multi-headed single logical device with 128 gigabytes of memory capacity. The cluster's total memory right now is 2.3 terabytes. And what I want to do next is just go ahead and download 32 gigabytes of memory into one of the hosts.
So let's go ahead and do that. We'll put in a little bit of our information to pay for it. And there at the bottom, you can see our memory increased from zero to 32. It might be a little hard to see, but we're about to jump to another view that shows that. And this is the device view. So there, it's a little difficult to see, but 16 two-gigabyte blocks were added to that first host. And here on the side, we have a few buttons where I can adjust how memory is mapped between these two hosts. So with a click of a button, I'm able to add two-gigabyte blocks of memory to each of these systems and reduce them at a moment's notice. And while this is kind of a neat gimmick, it would be nice to be able to do this automatically, right? So what we're going to do now is apply a profile.
So this profile, for example, says if we exceed 60% memory usage, we're going to start adding additional capacity to avoid out-of-memory errors. And if we drop below 50% memory usage, we will start to release some of that memory back into the pool for other hosts to use.
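As an illustration, here is a minimal sketch of what such a profile might look like in code. The 60% and 50% thresholds and the 2-gigabyte block size come from the talk; the class, field, and function names are hypothetical, not the composer's actual API.

```python
from dataclasses import dataclass

@dataclass
class MemoryProfile:
    expand_above: float = 0.60   # add capacity once usage exceeds 60%
    release_below: float = 0.50  # hand capacity back once usage drops below 50%
    block_gib: int = 2           # the pool is carved into 2 GiB blocks

def decide(profile: MemoryProfile, used_bytes: int, total_bytes: int) -> str:
    """Return the action the composer should take for one host."""
    usage = used_bytes / total_bytes
    if usage > profile.expand_above:
        return "expand"   # request another block from the pool
    if usage < profile.release_below:
        return "release"  # offer a block back to the pool
    return "hold"
```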
So what we're going to do here is switch our service from manual mode, which we were just looking at, into automatic mode. And we will go back to our device view. Back in our device view, we can see the pool has rebalanced itself to add a little bit of memory to each of these hosts. And we'll auto-balance it to give each host 32 gigabytes as a baseline, just to be equitable. And we'll jump over to one of the host views to show you what we can see and what we can monitor from that viewpoint.
So this first host right now has 180 gigabytes of memory available to it. Some of that is CXL. Some of it is DRAM. There's about 12 gigabytes of memory actually being used. And there is 96 gigabytes of CXL memory available to the system with 32 gigabytes of that mapped.
What I'm going to do here at the bottom is launch a program that's going to eat up about 150 gigabytes of memory. And we'll start to see that central graph increase its usage.
So we're starting to see we're breaking that 120 gigabyte line. And you'll see we just automatically added memory to the system, about 20 gigabytes or so.
And as that memory usage goes up, we'll continue adding additional memory from the pool. This is very similar to what you would see on a typical job: memory usage will increase.
Then eventually, we'll hit that 150-gigabyte usage mark, we'll start spilling a little bit of our memory usage into the actual CXL memory, and then our program will finish.
And our memory usage will drop back down to that 12 gigabytes.
At this point, the service will start releasing that memory slowly back into the pool bit by bit. We do this slow back off to avoid any thrashing issues, right? So we don't want to actually hurt the performance of our program.
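A sketch of that slow back-off is below; it assumes a hypothetical composer call that returns one 2-gigabyte block at a time, and the interval and function names are illustrative only.

```python
import time

def gradual_release(get_usage, release_one_block, low_water=0.50, interval_s=30.0):
    """Release surplus capacity one block per interval instead of all at once."""
    # Re-check usage between releases so a workload that ramps back up
    # is not starved; this staging is the back-off that avoids thrashing.
    while get_usage() < low_water:
        release_one_block()
        time.sleep(interval_s)
```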
If we go back to our dashboard, we can actually monitor that usage. You can see just that run of that demo. We were able to see where the memory usage spiked. We're planning on building a lot of these monitoring tools for a cluster.
It was two Sapphire Rapids servers hooked up to an SK hynix Niagara pooled memory system. And we'll talk a little bit more about that in just a moment. We are also working towards getting the same demo up and running with a memory expander connected to an XConn switch, which we expect will be available the next time we run demos.
So let's talk a little bit about memory pooling and fabric management in general.
So as Charles said in the prior talk, there is an out-of-memory problem that most clusters run into to some extent. And this can really show up in a few different ways, but the two main ways are data spillage and memory stranding. Data spillage is when your memory usage exceeds the amount of RAM available and you utilize swap, which means you go to disk and basically page memory in and out of disk. We waste time, we waste money, we waste power when this happens. The performance of your program goes down, and nobody likes that. So that's one way that out of memory shows up. The other way it shows up is memory stranding, which is when we defend ourselves against that spillage problem by jamming as much memory as possible into every single chassis in order to avoid those conditions. And what that shows up as is wasted money on unused RAM, so you just have dark silicon in a lot of these servers. And we've seen that DRAM can exceed 50% of the cost of a standard server. So elastic memory, as I demonstrated, can actually cure these problems. Memory stranding is actually a symptom, not the problem itself. The actual problem is avoiding out-of-memory errors and avoiding data spillage, as we'll see in a moment.
So let's talk about what we can accomplish on CXL 2.0 versus CXL 3.0. On CXL 2.0, we have just a few hardware capabilities available to us: memory expanders, single-switch topologies, and multi-headed SLDs. The way these look, multi-headed SLDs just have multiple hosts directly connected to the device with no switch in between, so it's a rather rudimentary way to break up memory. And then single-level switching, where we can allow reconfiguration of entire devices across a CXL network. Eventually, we'll have CXL 3.0 show up, where we're going to have CXL fabrics, if you will: complex, multi-switch topologies; multi-headed devices or multi-logical devices, where the device itself looks like one piece of memory but can be chopped up dynamically; and dynamic capacity devices, where the device itself provides a service where you can request additional memory and the kernel will map that for you. But in the meantime, we believe that we can backport some of these features to CXL 2.0 devices, like dynamic capacity, as I just demonstrated.
So there's a few ways to accomplish pooling. One of them is the logical device pooling system, and this is kind of a semi-static environment under CXL 2.0. The idea is the orchestrator provisions memory before boot time, and then the hosts themselves boot and map that provisioned memory. On a system like this, reprovisioning memory is not as dynamic as we would like. We would like to be able to do it without shutting the hosts down, but on the current set of hardware that isn't possible, so a reboot is usually required to do this type of provisioning on LD pooling systems.
Now, what we just accomplished was pre-dynamic-capacity dynamic pooling. The way this works, and the demonstration I showed you was on a multi-headed single logical device, is that each host is attached to the device and sees the entire range of memory available. The way this shows up to an end user in Linux is as a NUMA node, and that NUMA node will see a variety of memory blocks behind it that are represented by basically two-gigabyte areas of memory on that device. What we use to accomplish exclusivity is a little bit of software that coordinates with our composer and enforces that only one host can use any given memory block at any given time. We can actually also use this exact same mechanism, although in DAX mode rather than NUMA node mode, to accomplish shared memory as well, which is what Charles will talk about in a moment. So this is on CXL 2.0, pre-dynamic capacity.
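To make the memory-block mechanism concrete, here is a rough sketch of how a host-side agent could online or offline one of those blocks through the standard Linux memory-hotplug sysfs interface. Only the sysfs paths are standard kernel interfaces; the block numbering and the coordination with the composer are assumptions for illustration.

```python
from pathlib import Path

SYS_MEMORY = Path("/sys/devices/system/memory")

def block_size_bytes() -> int:
    # The kernel reports the hotplug block size in hex, e.g. 80000000 for 2 GiB.
    return int((SYS_MEMORY / "block_size_bytes").read_text().strip(), 16)

def online_block(block_id: int) -> None:
    # "online_movable" keeps the block removable so it can later be
    # handed back to the pool; requires root.
    (SYS_MEMORY / f"memory{block_id}" / "state").write_text("online_movable")

def offline_block(block_id: int) -> None:
    (SYS_MEMORY / f"memory{block_id}" / "state").write_text("offline")
```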
Post-dynamic capacity in CXL 3.0, what this will look like is actually incredibly similar. The only difference will be those memory blocks, instead of being fully mapped for the entire device, will be dynamic. So your NUMA nodes on each system will be sparsely populated, and the multi-headed device or the switch will have a fabric manager logical device that can be utilized by an orchestrator to add and remove capacity as needed. But to the end user, this will more or less be transparent. It just looks like memory. It's a NUMA node that has memory behind it. That number goes up, the number goes down. It's usable at runtime, and they won't have to make any major changes.
But there is software needed, and this is our forward-looking plan. So dynamic expansion triggers, right? So if we want to have this elastic memory service and we want to dynamically allocate memory back and forth, we need reasons why we should do that. I demonstrated one example of that, some simple endpoint monitoring. What's our memory usage? If we exceed a certain percentage, allocate more memory. But we can also do things like scheduler integration, right? So if we're about to launch a large VM on a system and we know that that VM is going to consume more resources, we can dynamically add that before that consumption occurs to avoid any out-of-memory problems.
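As a sketch of that scheduler-integration idea, the snippet below reserves capacity from a hypothetical composer client before a large VM launches, then returns it if the launch fails. None of these names correspond to a real API; they only show the ordering of the calls.

```python
def launch_large_vm(composer, host_id, vm_memory_gib, start_vm):
    """Expand the host's capacity before the VM's memory pressure arrives."""
    composer.add_capacity(host_id, vm_memory_gib)          # hypothetical composer call
    try:
        return start_vm()                                  # caller-supplied launch function
    except Exception:
        composer.release_capacity(host_id, vm_memory_gib)  # roll back on failure
        raise
```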
There's going to need to be some level of coordination that's built. That's the orchestrator and the fabric management that you'll hear a lot about throughout the talks today. And we'll need some level of data services. So there is always going to be a concern about performance. So we'll be implementing some tiering services on top of our pooling and our dynamic capacity systems, as we're going to talk about later this afternoon, memory sharing services, and then starting to get into real functional things like snapshot and live migration and things like this. So software really is key to making CXL a success, and that's where we are looking.
So most folks always have the question of performance, right? We're putting this memory out on the PCIe bus. What's the performance going to be? It's higher latency, it's lower bandwidth, and I have major concerns with that, being in HPC. What does that realistically look like?
So CXL traditionally has been scrutinized heavily against DRAM latency. So DRAM is all the way at the top of this pyramid. It's getting anywhere from 80 to 140 nanoseconds of latency. And CXL memory ostensibly gets anywhere from 170 to 600 nanoseconds of latency, depending on how far away you are, multiple switch hops, what have you. And we think this is slightly unfair, because we think the primary use case is going to be avoiding I/O when at all possible, so avoiding going to the network and avoiding going to storage. So let's contextualize the problem against I/O or swap for just a moment.
So there's a few different problems that we can look at for I/O. One of them is data spill, which I've mentioned before, and that is when our data usage exceeds the DRAM capacity available. That can happen, for example, in an HPC cluster with this idea of inference skew. Inference skew is when one host in a cluster experiences higher memory usage or higher load than the rest of the cluster, either due to an unexpected job being scheduled or the load balancer just making a poor decision, for example. Typically, this type of skew is temporary in nature, so the host needs a temporary boost of memory just to get over the hump, and then that memory can go away. But there's also the problem of inelastic workloads. Some workloads simply cannot swap their memory to disk, either due to major overhead problems or because they just were not created with that in mind. With these inelastic workloads, you run into out-of-memory errors and you simply fail to complete your job. Memory stranding, as I mentioned before, is actually a symptom of trying to head off all of these problems in a static environment, which is what we're in today.
So that's what we're going to try and solve. I wanted to talk about an example benchmark. We ran the CloudSuite 3 in-memory analytics benchmark, which is a Spark workload. The data set in this case is relatively small: 40 gigabytes, with 36 to 38 of those gigabytes being quote-unquote "hot," which means they're actively being used by the system. So we can expect rather rapid degradation of performance if we utilize any amount of swap. The runtime is in the range of 5 to 10 minutes. Our hypothesis is that by utilizing CXL capacity and heading off any out-of-memory errors, we can retain most of the performance of a benchmark like this.
Some environmental information. We ran this on the same exact hardware that we demonstrated the dynamic pooling demo on earlier in this talk: the Sapphire Rapids system connected to the SK hynix Niagara pooled memory system with 128 gigabytes. There are a few artificial constraints. One, we bound the software to a single socket just for consistency's sake, and in fact, it got better performance that way. Two, we utilized memory hotplug to reduce the overall capacity of the system down to 64 gigabytes per socket. But we did this along 2-gigabyte or 64-gigabyte alignment, which made sure that we used at least 2 gigabytes per DIMM, so we were getting the maximum amount of bandwidth on the local socket while artificially constraining the total capacity so that we could force spillage to disk or spillage to CXL. And then we utilized a memory hog program in the background to increase that spillage by 1 gigabyte per test run so we could see where that falloff in performance occurs.
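For reference, here is a minimal stand-in for the kind of background memory-hog program described here (not the actual tool used in the benchmark): it maps N gigabytes of anonymous memory and touches every page so the allocation is truly resident, then holds it for a while.

```python
import mmap
import time

PAGE = 4096

def hog(gigabytes: int, hold_seconds: float = 600.0) -> None:
    size = gigabytes * 1024**3
    buf = mmap.mmap(-1, size)             # anonymous mapping, not file-backed
    for offset in range(0, size, PAGE):   # touch every page to force residency
        buf[offset] = 1
    time.sleep(hold_seconds)              # hold the memory to sustain the pressure
    buf.close()

if __name__ == "__main__":
    hog(1)  # grow the resident set by 1 GB, matching the per-run increments above
```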
And these are the basic results, right, as one would expect. The DRAM baseline is the theoretical baseline: if we can fit all of our hot memory into DRAM, we get the best possible performance. That's 100%. If we incur 10% spillage to swap or NVMe, we start to see degradation in performance. That's how we know where our hot memory limit is, that 36 to 38 gigabytes. At just 8 to 10% spillage, we already see a degradation of about 15 to 20%. And as soon as we hit 13 to 20% spillage, we start to go asymptotic, which is what one would expect when utilizing swap in a workload like this. But if we replace that swap usage with CXL memory, we see a much more linear and much more predictable decline that caps out around 30%, compared to the roughly 325% with swap. So we can actually reclaim 200% or more of that performance loss by simply adding, in this case, about 4 to 8 gigabytes of memory. If we scale that up to a full 2-terabyte server, we're talking about 200 to 500 gigabytes of memory added temporarily to avoid this problem and then backing off. But I've left out one key piece of information here: I haven't actually told you anything about the Niagara pooled memory system. What are its latencies? What is its bandwidth?
That's because I wanted to show you the general value of CXL first and then tell you that the performance of this system is relatively low, because it's an FPGA development system. The latency is very high, about 600 nanoseconds, and the bandwidth is very low, about 5 gigabytes a second. It's running on PCIe Gen 4, it's using a x8 link, and it's using DDR4 rather than DDR5, and because it's an FPGA, it's got this high latency and low bandwidth. That means that this 30% overhead at 20% spillage is going to melt away to a certain extent when this device shows up in ASIC form, and that's going to push that curve much further off to the right in this graph.
So as for that concern about the latencies: we have consistently seen that having the memory available, even at higher latency, is much, much more important than not having the memory at all. So to me, this is a huge win in terms of being able to retain or extend performance.
So summarizing, spillage and stranding are really two sides of the same coin. Stranding is the symptom of trying to avoid I/O and spillage is lost performance. Both of these represent lost money, either in time or unused resources. Elastic memory can cure both of these problems.
Just a quick case study on pricing of a cluster that utilizes a pooled device like this. We took a look at a typical ML pre-processing workload; in this case, it was a four-way data partition that would use a standard four-host cluster. The minimum memory usage for this workload was 512 gigabytes per host, but any given host could skew up to about 1.5 terabytes of memory usage, and we observed that up to two of these hosts would experience that high level of memory usage. So typically, to defend against this, what you would do is create a cluster of four hosts with 1.5 terabytes in each of those hosts, and the cost breakdown looks like that cluster costing about $130,000. You could also go the opposite route: instead of adding additional memory per host by increasing your DIMM capacity, simply add additional nodes and go up to a 12-node cluster instead. And this actually balloons the cost even further, to about $192,000. With a device like the Niagara pooled memory system, we estimate that you could reduce each of these hosts down to 512 gigabytes and place two terabytes in the pooled memory system, capable of handing that memory out to the hosts as they need it, which reduces your overall memory cost by two terabytes' worth of memory. There is an additional savings because you could utilize DDR4 instead of DDR5 without losing much in terms of performance. We estimate that the savings could be 30% to 35% or more overall, including power savings, because you can power some of these things off at the same time.
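A quick back-of-the-envelope check of the DRAM totals in that case study (the host counts and capacities are from the talk; the dollar figures are the talk's own estimates and are not derived here):

```python
baseline_gb = 4 * 1536          # four hosts with 1.5 TB each          -> 6144 GB of DRAM
pooled_gb   = 4 * 512 + 2048    # four 512 GB hosts plus a 2 TB pool   -> 4096 GB of DRAM
print(baseline_gb - pooled_gb)  # 2048 GB, the "two terabytes' worth" of memory saved
```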
So just to finish up here: software is key to getting there. From here, it's going to require tight collaboration between software and hardware vendors. There's going to be a need for host monitors, orchestrators, tiering services, shared memory services, all of those things. So we're working towards that integrated, co-engineered software and hardware solution with our hardware partners. If you don't see your name up on the slide, please come talk with us. We want to work with you; reach out and ride the CXL wave together with us. So that is my talk. I'm happy to answer any questions unless we need to run to the next talk.
I'm not a user of CXL or any memory, I'm just an analyst. But I wonder: I saw the Kubernetes, KVM, and Docker logos, but not VMware. Does it also work with VMware, with ESXi hosts?
It should. It very much should. It looks like memory, so it should work with VMware, no question. Probably we just didn't have enough space to add another logo.
Okay, thank you.