YouTube: https://www.youtube.com/watch?v=3jfIjsRVWas
Text:
I'm Michael Abraham, as Frank mentioned, and I've been working on CXL for the last several years. I've enjoyed watching it grow from a spec to actually working on getting some products out into the market and shipping. We're going to spend a little bit of time talking about what Micron is doing today on the memory module front, and we'll have a little time for questions at the end.
Today's focus is to look at the datacenter challenges that CXL is addressing, from a memory module perspective. We recognize, given Arvind's presentation, that the industry is taking a multifaceted approach to this. While that's true, we want to make sure we get products moving quickly into the industry, really prove the value proposition of CXL, and see how it transitions and disrupts datacenters going forward. We'll spend a little bit of time talking about the memory-storage hierarchy and about the memory module Micron has been working on recently. Then we'll look at two use cases we have measurements on: one is database-related, TPC-H, and the other is around AI inferencing. Then we'll talk about how to get involved.
When we look at where we're headed with the datacenter, AI is definitely on everybody's mind, along with a lot of the databases we see out there. There are several options for when we run out of memory. When we run out of memory, we essentially need higher capacity for in-memory databases, some software-as-a-service, and AI training and inference. We're seeing quite a wide variety of ways to get that memory into the system, into the datacenters. The result, though, is fundamentally higher capacity demand. A lot of times we just can't get enough, so we'll add a second CPU to a server to make that happen, or we scale out and add multiple servers to meet the memory demand. The question is, is there a way we can provide that memory capacity, and especially, as core counts grow, have more memory per core? Bandwidth is the other aspect of what we're seeing here. We're seeing a lot of real-time data analytics that require quite a bit more bandwidth, especially as we have more processor cores; essentially, we need more bandwidth per core. The question is, how do you get that? We're scaling DRAM speeds as fast as we can, of course, but there is also the possibility of using CXL to help provide that additional bandwidth. Then finally, lower datacenter TCO. If we look at where things are headed in the datacenter, memory is often handled in a binary way: we might add a lot more DRAM, but we typically optimize that for every single socket. To avoid stranded memory, or having too much memory per server for some applications, CXL can provide a tailored amount of additional memory per server, almost surgically addressing the need for extra capacity or bandwidth based on what the workloads require.
If we compare how the hierarchy works between memory and storage, at the very top we have our absolute fastest, lowest-latency memory. HBM is becoming especially prevalent in the AI space; it provides an incredible amount of bandwidth because of where it sits, right next to the compute engines. DDR memory is the workhorse of the industry overall, I'd say. Even with HBM and DDR, we're seeing additional demand for higher bandwidth and higher capacity. By the time we get to storage, though, we've got a little bit of a problem. As we move down this hierarchy, we're essentially increasing latency. We do have extra capacity, but it takes extra time to get that data up next to the CPU in the memory space.
That's one of the benefits CXL provides: it helps bridge that gap. It provides additional memory capacity, it's byte addressable, and it also provides bandwidth expansion. One of the benefits is that you have to go to storage less often to feed your CPU the data it needs, and as you have more CPU-accessible memory, that can actually speed up workloads, including the applications we run on top.
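As a concrete illustration, not part of the talk itself: on a Linux host, a CXL memory expander is commonly onlined as system RAM on a CPU-less NUMA node, so it is byte addressable by ordinary applications. The sketch below assumes the standard sysfs layout and simply lists the NUMA nodes, flagging CPU-less nodes as likely CXL-attached memory (a heuristic, not a guarantee; exact behavior depends on BIOS and kernel configuration).

```python
# Minimal sketch: enumerate NUMA nodes on Linux and flag CPU-less nodes,
# which is how an onlined CXL memory expander typically appears.
# Paths are standard sysfs; platform behavior may vary.
from pathlib import Path

def list_numa_nodes():
    nodes = []
    for node_dir in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        cpulist = (node_dir / "cpulist").read_text().strip()
        meminfo = (node_dir / "meminfo").read_text()
        # First line looks like: "Node 1 MemTotal:  263859388 kB"
        total_kb = int(meminfo.splitlines()[0].split()[-2])
        nodes.append({
            "node": node_id,
            "cpus": cpulist,                       # empty string => CPU-less node
            "mem_gib": round(total_kb / (1024 ** 2), 1),
            "likely_cxl_expander": cpulist == "",  # heuristic only
        })
    return nodes

if __name__ == "__main__":
    for n in list_numa_nodes():
        print(n)
```

From there, allocations can be bound or interleaved onto that node with standard tools such as numactl or the libnuma API, without changing the application itself.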
For Micron, we introduced the CZ120 this last July-August timeframe. It is our first foray into the CXL market. We've done this with two capacities of product: one with 128 gigabytes of additional capacity per module, and the other with 256 gigabytes per module. When you put eight of those into a server, with the 256-gigabyte module we have the ability to support an additional two terabytes of incremental memory capacity. Speaking of the TCO and workload aspect, one of the additional benefits is that you put in as much as you need, up to that maximum. This is potentially per socket; if you have a multi-socket server, you could potentially add even more memory per server.

On the bandwidth side, measured with the MLC tool that Intel provides, on a 12-channel server with 4800 megatransfers-per-second RDIMMs, we're seeing a 34 percent increase in server memory bandwidth. This particular product uses PCIe Gen 5, currently the latest and greatest, to provide that extra bandwidth. The 36 gigabytes per second is pretty close to the theoretical bandwidth we can effectively achieve for a 70/30 workload, 70 percent read, 30 percent write. Then in terms of form factors, how do we get this into servers and make use of it? There are a couple of ways to do that. We've seen some add-in card deployments, but for easy deployment and for hot-plug operations in particular, we're using the E3.S 2T form factor defined by SNIA. Unlike an NVMe SSD, which typically uses a four-lane (x4) PCIe interface, we're using an eight-lane (x8) interface that doubles the bandwidth on top of that, which is well suited for the almost 10x bandwidth we see from a memory module as opposed to an SSD.
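As a rough back-of-the-envelope check on that 36 GB/s figure (my own arithmetic and assumptions, not Micron's methodology), the sketch below estimates the link-level ceiling of a PCIe Gen 5 x8 module for a 70/30 read/write mix. Read data returns upstream while write data flows downstream, so a mixed workload can exceed the roughly 31.5 GB/s single-direction limit; CXL flit and protocol overhead, ignored here, pulls the real ceiling back down.

```python
# Back-of-the-envelope sketch (assumptions, not a measured result):
# estimate the link-level ceiling of a PCIe Gen 5 x8 CXL module for a
# 70% read / 30% write traffic mix. Reads return data device -> host and
# writes send data host -> device, so total throughput is limited by
# whichever direction saturates first. CXL flit/protocol overhead is
# ignored, so the real achievable number sits below this ceiling.

GT_PER_S = 32.0          # PCIe Gen5 raw rate per lane
ENCODING = 128 / 130     # 128b/130b encoding efficiency
LANES = 8

per_direction_gbs = GT_PER_S * ENCODING * LANES / 8   # GB/s, one direction
read_frac, write_frac = 0.7, 0.3

ceiling = min(per_direction_gbs / read_frac, per_direction_gbs / write_frac)

print(f"per-direction link bandwidth ~{per_direction_gbs:.1f} GB/s")
print(f"70/30 mixed-traffic ceiling  ~{ceiling:.1f} GB/s (before CXL overhead)")
# The ~36 GB/s measured with Intel MLC sits below this ceiling, which is
# consistent with protocol and controller overhead.
```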
How does this actually impact real-world performance? If we use TPC-H, which basically measures Microsoft SQL Server performance, we're seeing that as we increase the number of streams of operation, compared to a DRAM-only solution, that's a server with 12 RDIMMs at 64 gigabytes each, we're showing a 23 percent improvement just for a single stream. Then as we add more streams, from 8 to 20, we're seeing a significantly higher multiplier that goes along with that. Basically, with eight streams, we almost double the number of queries we're able to complete in the same amount of time. For 20 streams, we're over 200 percent better. If you look at it a slightly different way, we're all about memory at Micron. We looked at both the 64-gigabyte and the brand-new 96-gigabyte RDIMMs that we announced just this last year. If we look at the DRAM-only performance with the 64-gigabyte RDIMMs, we do see, of course, that we recommend the 96-gigabyte RDIMM for greater performance. Part of this is that we're able to put more data into memory even from a DRAM-only standpoint, which reduces the need to go to disk and shows an overall application performance increase. But even then, when we add CXL to both the 64-gigabyte and the 96-gigabyte RDIMM configurations, we see continued improvement in the overall operation, the number of queries handled in the same amount of time. You see that this actually scales fairly nicely: somewhere in that 16-to-20-stream range, we're at least a 60 percent improvement even with the smaller RDIMM size, and a little better with the bigger one. We seem to maintain that fairly well through 24 streams. When we get beyond 24, we show the really good performance against the 96-gigabyte RDIMMs.
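To make the capacity argument concrete (illustrative round numbers, not measured values from the talk), a simple effective-access-time model shows why fitting more of the TPC-H working set in DRAM plus CXL memory pays off: average access time drops sharply as the buffer-pool hit rate rises and fewer reads spill to NVMe.

```python
# Illustrative sketch with assumed round numbers, not measurements:
# why holding more of the working set in memory helps a database workload.
# Average access time falls sharply as the buffer-pool hit rate rises,
# which is what extra CXL capacity buys when the dataset no longer spills
# to storage as often.

T_MEM_NS = 300          # assumed in-memory access (DRAM or CXL-attached)
T_DISK_NS = 100_000     # assumed NVMe read service time

def avg_access_ns(hit_rate):
    """Expected time per data access for a given buffer-pool hit rate."""
    return hit_rate * T_MEM_NS + (1 - hit_rate) * T_DISK_NS

for hit_rate in (0.90, 0.95, 0.99, 0.999):
    print(f"hit rate {hit_rate:.1%}: ~{avg_access_ns(hit_rate):,.0f} ns per access")
```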
Looking at the AI side of the house, there are a couple of things to take away from this one. We have two ways to look at it. If we look at max bandwidth, AI is oftentimes very much a bandwidth-limited thing. If you have a DRAM-only solution, typically you're filling it with a bunch of RDIMMs. Then we have the ability to add CXL directly to that processor as well. What we're seeing, as we look at read percentages anywhere from 0 to 100 percent, is that we have a beautiful little sweet spot in that 50-80 percent read range, which is great. In some cases we're showing almost a 50 percent improvement, and a 30-40 percent improvement across this window, from a raw bandwidth standpoint. Then what does this actually do when we translate it into a final workload with a large language model using Llama? On runtime, we're seeing that we're able to reduce the runtime of a query. In this particular case, that translates into an increase in the number of tokens per second that can be processed. In this particular example, comparing DRAM only against DRAM plus CXL memory modules, we're seeing about an 11-13 percent increase at the application level after it's all done. And this is based on a fairly small language model compared to what could be possible. We've got a lot of work going on internally, and also with many of our partners, on improving the overall ecosystem around CXL. We have additional workloads that we're evaluating in-house and working through the details.
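One small piece of arithmetic worth making explicit: the 11-13 percent figure is Micron's measurement, but the mapping from a shorter per-query runtime to a higher tokens-per-second rate is just math, sketched below with assumed reduction values.

```python
# Sketch of the runtime-to-throughput relationship implied in the talk.
# The reduction values below are assumed inputs for illustration; only the
# formula is the point: tokens/sec scales with 1 / runtime.

def throughput_gain(runtime_reduction_frac):
    """Fractional tokens/sec gain for a given fractional runtime reduction."""
    return 1 / (1 - runtime_reduction_frac) - 1

for reduction in (0.05, 0.10, 0.12):
    print(f"{reduction:.0%} shorter runtime -> "
          f"{throughput_gain(reduction):.1%} more tokens/sec")
```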
We'll have more to announce going forward. But to get involved, we would love for you to have an opportunity to take a look at the Micron CZ120 modules, use them in-house, put them on your workloads, and get everything ready for deployment in the future. At micron.com/cxl you have an opportunity to look at our memory modules. You can get the documentation around them, including data sheets, thermal models, electrical and signal integrity models, and other details, and we would be happy to work with you on that. We encourage you to go to micron.com/cxl, look at our technology enablement program, TEP, and have an opportunity to collaborate on what you're working on.