YouTube: https://www.youtube.com/watch?v=e9-iK3WtqWU
Text:
Hi, I'm Bill Gervasi. I'm with Wolley, and my buddy San is over in Taiwan. He's probably still asleep right now, so I'll try to do my best covering his technical part of this presentation. Let's go ahead and move on. What I'm going to talk about today is a concept that we call NVMe over CXL. For those of you who have been looking at how storage has evolved, this is a way that we can correct some of the limitations of what's called the CMB, or controller memory buffer.
So, I'm going to start with this idea: what millennium are we in? With this picture right here, you see a CPU with memory on a memory channel. You see I/O on an I/O channel. And this is how we built systems, literally, for decades. There was no way to blur these.
And so, CXL is more than just another I/O bus. I think people really miss the significance of the fact that CXL does something we have not had before, and that is the ability to merge I/O and memory over a common protocol. The significance of this is something I'm going to show you, because what this allows is the virtualization of memory and I/O in ways we've not been able to accomplish in the past, when they were kept separate on physical devices and the CPU had to do all of the transitions in between.
So, this is what we're looking at: NVMe over CXL takes NAND and DRAM, puts them behind a common interface, and presents them to the system. That doesn't seem too dramatic, but I'm going to show you that it's a lot more significant than you might initially give it credit for.
So here's what the picture looks like: You have a common controller. You address it using the CXL.io protocol when you want to do NVMe functions, and NVMe, which is what? Non-Volatile Memory Express. That translates NAND accesses into DRAM accesses. Now, normally this would go across the PCIe bus over to memory attached to the CPU. Or, in today's modern CXL world, it could be transferred over to an external CXL memory module. But what I'm going to show you is that that's highly wasteful, and we're in an environment where we need to start being a lot more conservative about where we spend bits and where we spend power. What we can do now is the translation from the NAND over to the DRAM in the standard NVMe 4-kilobyte blocks, but when you want to get the data, you use the CXL.mem protocol over that same physical set of wires to access 64 bytes at a time, only the data you need.
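To make that concrete, here is a minimal sketch, under stated assumptions, of the two access patterns just described. This is not Wolley's implementation: the staging helper, the mmap'd HDM window, and the offsets are hypothetical placeholders. The point is only that the conventional path moves a whole 4 KB block across the fabric, while the CXL.mem path moves a single 64-byte cache line.

```c
/* Hedged sketch of the two access patterns described above; the helper and
 * the HDM window are hypothetical, not Wolley's implementation. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096   /* standard NVMe logical block */
#define CACHELINE  64     /* one CPU cache line / one CXL.mem transfer */

/* Conventional path: the whole 4 KB block is moved across the fabric into
 * host DRAM, even if only ~100 bytes of it will ever be used. */
static void read_conventional(int nvme_fd, off_t block_offset, void *dst)
{
    pread(nvme_fd, dst, BLOCK_SIZE, block_offset);   /* 4096 bytes move */
}

/* Hypothetical helper: an NVMe command issued over CXL.io that stages the
 * block into the DRAM sitting on the SSD controller (the HDM). */
static void stage_block_into_hdm(int nvme_fd, off_t block_offset)
{
    (void)nvme_fd;
    (void)block_offset;   /* placeholder for the sketch */
}

/* NVMe-over-CXL path: stage the block device-side, then let the host load
 * only the 64-byte line it needs, via CXL.mem, over the same wires. */
static void read_via_cxl_mem(int nvme_fd, off_t block_offset,
                             const uint8_t *hdm_window,  /* mmap'd HDM */
                             size_t line_index, void *dst)
{
    stage_block_into_hdm(nvme_fd, block_offset);
    /* Only 64 bytes cross the fabric to the CPU, not 4096. */
    memcpy(dst, hdm_window + line_index * CACHELINE, CACHELINE);
}
```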
And this turns out to be really significant. So let's review what this means. The way that we can position this is that NVMe is just a cache management protocol, and literally that's all it is today as well. So that's not different.
What we can do, though, and this is an extension that we are now proposing to the NVMe organization, is update the CMB part of the specification to define that the CMB is located in CXL space in the HDM, the host-managed device memory, and to locate that HDM on the same module behind the common interface. This has an interesting characteristic: in a data center today, when you want to get a storage resource and a memory resource, those two things can be on different ports and behind different controllers, and you can get into race conditions where you literally just cannot line them up easily. With them behind a common interface, you can allocate the storage and the memory at the same time.
Significant. Now, the processor only grabs the flits that it needs, a flit being the flow control unit of the CXL protocol: the 64-byte transfer that fills a cache line in the CPU. No CPU main memory is needed. So now you've eliminated the need for the CPU to stage this data in its local DDR, and you've made the processor more efficient as well.
Now, why would you do this? Well, we've done an analysis of many, many applications, and we came to the same conclusion across a broad spectrum, from databases to compression algorithms and even just disk management. What we find is that, on average, out of every 4-kilobyte block, only about 100 bytes are used. That means the efficiency, data used versus data accessed, is only about 3%.
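As a quick sanity check on that number, taking the stated figures of roughly 100 useful bytes out of a 4 KiB block:

\[
\frac{100}{4096} \approx 2.4\%\ (\text{rounded up to about } 3\%),
\]

which is the flip side of the roughly 97% of fabric traffic he says can be eliminated a little later in the talk.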
So now that we've virtualized this transfer mechanism, we can do something really interesting: we can take that DRAM, the HDM over on the SSD controller, and create virtual HDM out of it. In other words, we can present that small block of DRAM as a memory resource the size of the NAND, and this is all using the NVMe protocol as the cache management. The full combination.
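To illustrate the idea, and only as a toy model rather than Wolley's controller logic, here is a direct-mapped sketch of a small DRAM window fronting a much larger NAND address space, with NVMe-style 4 KB block fills acting as the cache management. All sizes and helper names are invented for the sketch.

```c
/* Illustrative model only, not Wolley's controller logic: a tiny
 * direct-mapped "virtual HDM" in which a small DRAM window fronts a much
 * larger NAND address space, with 4 KB block fills as the cache management. */
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE  4096ULL
#define DRAM_BLOCKS 16ULL     /* the real, small DRAM window (64 KB here) */

struct vhdm {
    uint8_t  dram[DRAM_BLOCKS][BLOCK_SIZE];  /* physical DRAM on the controller */
    uint64_t tag[DRAM_BLOCKS];               /* which NAND block occupies each slot */
    bool     valid[DRAM_BLOCKS];
};

/* Hypothetical fill: in the real system this would be an NVMe command over
 * CXL.io moving a block from NAND into the controller's DRAM. */
void nvme_fill(uint64_t nand_block, uint8_t *dst);

/* Return a pointer to the 64-byte line containing `addr` in the (virtually
 * huge) memory space; a whole 4 KB block moves only on a miss, and it moves
 * device-side, never across the fabric to the host. */
uint8_t *vhdm_line(struct vhdm *v, uint64_t addr)
{
    uint64_t block = addr / BLOCK_SIZE;
    uint64_t slot  = block % DRAM_BLOCKS;    /* direct-mapped for simplicity */

    if (!v->valid[slot] || v->tag[slot] != block) {
        nvme_fill(block, v->dram[slot]);     /* cache fill via NVMe semantics */
        v->tag[slot]   = block;
        v->valid[slot] = true;
    }
    /* The host only ever pulls 64 bytes of this block over CXL.mem. */
    return &v->dram[slot][(addr % BLOCK_SIZE) & ~63ULL];
}
```

The host sees the full NAND-sized address range, but only 64-byte lines ever need to travel to the CPU, and the 4 KB fill on a miss happens entirely on the device side.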
So what problems are we solving? The first one is that AI is hitting the wall. There's this roofline model that is common in analyzing AI performance, and it says your AI performance dies when you run out of memory; from that point forward, you're just not going to squeeze out more. And as the AI algorithms get more and more complex, they need more memory. So the first thing this solves is simply moving the roofline: put more memory in place, and your AI is going to run faster.
The second is that bit I told you about, the efficiency of the fabric. If you can eliminate 97% of the traffic between the SSD and the CPU, you're also eliminating 97% of the traffic that would be burdening the CXL fabric. And if you have a lot of SSDs out there, each of them contributing all that traffic, and you can localize that traffic, you can improve the efficiency of the overall system architecture as well. I have some data to back this up that I'll be showing you.
The next one is persistent memory. Persistent memory has kind of a bad reputation in the industry, and maybe that's because NVDIMM-Ns were at least three times more costly than equivalent DRAM resources. But now we have a different situation, where we can put persistence on this controller and it will be cheaper than the DRAM, because now we can make this thing virtualized. So it looks like, say, an 8-terabyte DRAM behind a 64-gigabyte window, and with data persistence, you only need to worry about saving the blocks that are in the DRAM on power fail. So now we can really start thinking about eliminating checkpointing, which is 7% to 8% of data center power.
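Continuing the hypothetical virtual-HDM sketch from earlier, the power-fail handling described here reduces to flushing only the blocks currently resident in the small DRAM window, not the whole multi-terabyte virtual footprint:

```c
/* Sketch only, reusing the hypothetical struct vhdm above: on power fail,
 * only resident DRAM blocks need to reach NAND; the rest of the virtual
 * footprint was never cached, so there is nothing to save. */
void nvme_writeback(uint64_t nand_block, const uint8_t *src);  /* hypothetical */

void vhdm_powerfail_flush(struct vhdm *v)
{
    for (uint64_t slot = 0; slot < DRAM_BLOCKS; slot++) {
        if (v->valid[slot])
            nvme_writeback(v->tag[slot], v->dram[slot]);
    }
}
```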
But the bottom line here is that our approach is always to let the host decide when data needs to be moved. This is different from some other approaches in the industry, such as semantic SSDs, where the controller is making those choices about caching algorithms. We want those decisions to go up to the host and stay in the host.
And I'll show you where it sits: right underneath all of your standard APIs. This is the other maxim of this approach: don't change the application software. If you need to change the application software, you're going to fail. There's just too much application software out there that either talks to an NVMe device through file system accesses, or, for applications that have moved to DAX, through direct access; and many applications are still just using direct memory mapping. HDM maps these resources into the application memory space. There's also the standard mechanism for backing up persistent memory on power fail and then restoring it, to improve the efficiency of your system restarts. And then there are the growing CXL 3.1 mechanisms for pooling and sharing, which allow all of these resources to be mixed and matched in your larger data systems. But again, the key is to use the existing APIs and hide the differences in the hardware behind those APIs.
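As one small illustration of that "no application changes" point, a process can reach device-exposed memory with nothing more exotic than the POSIX calls it already uses. This is a generic sketch; the /dev/dax0.0 device node and the 2 MiB mapping size are assumptions for illustration, not anything specific to this product:

```c
/* Minimal sketch: mapping device-exposed memory with standard POSIX mmap().
 * The device path and mapping size are illustrative assumptions. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/dax0.0", O_RDWR);   /* hypothetical DAX device node */
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 2UL << 20;                 /* 2 MiB, a typical DAX alignment */
    uint8_t *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    mem[0] = 42;   /* plain loads and stores; no new API for the application */

    munmap(mem, len);
    close(fd);
    return 0;
}
```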
So, this raises an interesting new picture. This is not something you've seen before; these modes were all independent. Like I said before, if you had an NVDIMM-N and you wanted to do the backup and restore, you had to do it for the whole module. That's no longer a requirement. Now you can allocate just one block out of the RAM space as persistent, and that's the only region that gets saved and restored. You can create virtual windows into the system so that that little bit of RAM looks like the much bigger NAND. Sometimes you just want to grab chunks of that memory and use it directly, and not even bother having a NAND image of it; that's what I call volatile HDM. That's fine too. You don't want to have stranded resources, and this approach allows you to take whatever memory we put on this NVMe over CXL controller and make it visible to the system as just usable free memory. And then finally, there is the CMB, that controller memory buffer, which maps into the NAND. You now have all of these interfaces supported simultaneously. And of course, the bottom line is that this gives you a lower cost per bit, because you're now using NAND more efficiently as a resource, and NAND, as we all know, is much cheaper than DRAM.
There's a little bit more detail on that, and I won't go into the picture beyond noting that in these architectures, the orange blocks that I've highlighted here are the places where we're going to put our driver changes to allow all of these applications to work independently. And we'll be putting these up on GitHub as part of this overall project.
At the Future of Memory and Storage, that thing that used to be Flash Memory Summit, we showed a demo of this technology. So here's just a picture of the demo and some of the stats, and I won't go into that, but I do want to at least highlight that the one thing we focused on for this year's FMS was demonstrating this virtual HDM mode that I've been talking about. So let's take a look and see how that played out.
Here, what we see is that the top two lines on the graph are just native SSD; in other words, your standard off-the-shelf SSD working on an idle machine, and then working on a busy machine. When there's a lot of traffic and everybody is vying for the resources in the fabric, what you see is this massive drop-off, 53% in availability, in that environment. That's because you're moving all these 4-kilobyte blocks around. If you only go after the 100 bytes that you really need, you can see that the penalty drops to only 25% once you add the NVMe over CXL capabilities. So, the end result here: roughly a 2.8X reduced impact on read performance.
The next test compares this on an unmodified Redis in-memory key-value store. And here, what you see is that we make three transitions. If you're just running this in a certain amount of DRAM, from 8 to 16 to 32 gigabytes, you get better performance as you step up; that's with the DRAM attached directly to the CPU. If you go to HDM, in other words, put a CXL memory module in place and run that same application, you can see that you lose some performance, and that's due to the additional latency of CXL. However, now add that NAND behind the DRAM window so that you get this virtual NAND-sized footprint for the memory, and you can see that you're getting significantly higher performance, not just compared to the CXL memory module, but much higher than DRAM on its own as well. So you can see from this graph that you're getting four times the memory capacity, and that increase in memory capacity is giving you a doubling of performance, but the expansion is coming from NAND technology instead of DRAM. So we can do this with a 90% reduction in cost.
The final test result I wanted to share with you is in-memory compression. The P20 through P100 are various data sets that represent the amount of compression that is possible with those data sets. And you can see that when you start increasing the amount of SSD, or NAND, behind the NVMe over CXL controller, you're hitting great performance numbers. So again, what you're getting here is a 4X improvement in memory compression, and it's giving you essentially the same performance that you would have gotten out of DRAM alone, but at a much lower cost point. And you can do more compression, because you have more memory to work in. So that's pretty much it.
The summary of this talk, then, is that CXL allows this concept of resource virtualization, which is a lot more significant than I think the industry gives it credit for. Our focus is on the virtualization of memory and storage, where we leverage both the CXL.io and CXL.mem protocols to give you something that's greater than either one alone. With this multi-mode interface, we can support many types of interfaces, file I/O, memory I/O, and persistence, and they're all available simultaneously. You don't have to choose at the component level; you choose at the application level. We always let the host direct where the data is supposed to be moved, because we recognize the host has better visibility into what it needs. This virtual HDM, then, as you've seen from the test results, looks pretty promising for improving the performance of all the memory-hungry applications, and we know there are a lot more coming. The reduced fabric traffic is going to be a big win for the data centers that are struggling to move resources around: if you have all of your storage on one port and all of your data on another port, you're constantly moving data through the fabric. We can simplify that by putting them behind the same port. And finally, it's all about cost. NAND is a lot cheaper than DRAM; let's see what we can do to take advantage of that and offer a low-cost solution that gives good system performance.
With that, I'd like to thank you for your time. My contact information is here, and so is San's. Go ahead, feel free to grab this information, follow up with questions, and I'll be able to support you from a distance as well. Thanks again for your time.