Hello! Good afternoon. I'm Michael Ocampo from Astera Labs, Ecosystem Alliance Manager. And we have...
Hi, I'm Ahmed Medihoub. I'm the product manager for our CXL memory controllers.
All right, thanks for coming. Our topic for today is optimizing AI inferencing with CXL memory.
So, as we were putting together this presentation, I actually put a prompt into Llama 3, Meta's model: "Improving GPU Utilization with CXL." And this is the actual first image that popped up, which is pretty interesting because it shows a bunch of cables and servers in a rack. That's fitting, because in our booth, Astera Labs B13, we are actually showing CXL as well as PCIe cabling, both active electrical cabling and active optical cabling. So I thought it was a very appropriate picture to include in the intro slides. But the core topic is really: what is the value of CXL in the context of AI inferencing? We'll talk about how the memory requirements of AI inferencing servers are evolving, and how we are optimizing them. Then Ahmed is going to cover CXL benefits for DLRM, which is another popular AI inferencing workload for a lot of big hyperscale companies, particularly for the ad-revenue business. And at the end of the talk, we'll have some calls to action.
So a lot of people talk about AI training, but I think the focus has really shifted to AI inferencing. ChatGPT comes to mind, but there are other LLM inferencing workloads that are similar, GPT-like applications, including OPT (Open Pre-trained Transformer), Llama, Mistral, et cetera. One of the things you find as you look at this chart on the bottom left, which is from last year's OCP presentation by Dan Rabinowitz, is that AI inferencing is very memory capacity and bandwidth constrained, as well as very latency sensitive. It's actually quite interesting when you look at how ChatGPT and similar services are being used. For example, my son will say, "Hey, I want to create a new story." He becomes the main character in the story, and his stuffed animal becomes his friend in the story. So it's generating a lot of information, tons of tokens that have to be remembered and kept in memory as he goes to sleep. As you can imagine, it takes a while for an eight-year-old to go to sleep, so the story keeps going on and on. I thought this was an interesting story to tell, because that is what we call KV cache, and it requires a ton of memory. Every service has a different context window, and the context window is limited because it comes at a cost: the cost of memory. Some servers may only have a very limited amount of memory; 768 gigabytes is pretty typical. So I think that's one of the main drivers here, not just for CXL, but for memory innovation in general. Another thing to note, on the bottom right: roughly ten novels is approximately a million tokens, and that would take roughly one terabyte of memory. So that definitely exceeds your typical server memory footprint.
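To put a formula behind that napkin math, here is a minimal sketch of how KV-cache size scales with context length. The layer count, hidden size, and precisions are assumptions chosen for illustration (a hypothetical 66B-class model), not measurements from our setup; depending on the model and precision, a million-token context lands on the order of a terabyte or more, which is the point being made above.

```python
# A rough KV-cache sizing formula for a standard transformer: two tensors
# (K and V) per layer, each hidden_size wide, per token. The model dimensions
# below are illustrative assumptions, not the exact models discussed in the talk.
def kv_cache_bytes(num_tokens, num_layers, hidden_size, bytes_per_elem):
    return 2 * num_layers * hidden_size * bytes_per_elem * num_tokens

tokens = 1_000_000  # roughly "ten novels" of context
for precision, nbytes in [("FP16", 2), ("FP8", 1)]:
    terabytes = kv_cache_bytes(tokens, num_layers=64, hidden_size=9216,
                               bytes_per_elem=nbytes) / 1e12
    print(f"{precision}: ~{terabytes:.1f} TB of KV cache for {tokens:,} tokens")
```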
So, if you look at the evolution of what we're seeing, in the past and moving forward: NVMe is a very popular caching medium, and a lot of the computing is done by GPUs. On the left-hand side, we tried to make a very simple illustration of how data movement happens in the context of KV cache. As you ask it a question, where does that data go? Is it sitting in memory? If it doesn't fit in local memory, it has to be stored somewhere, and today that's happening on NVMe. Then you have to feed that information to the GPU to efficiently process this request of telling a story. On the right, we wanted to see what would happen if we replaced that cache with a much faster cache: cache-line I/O versus the 4K read/write operations you see with NVMe.
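As a small illustration of that granularity gap (and only that; latency and queuing effects are not modeled here), the sketch below compares the 64-byte cache-line accesses used for CXL-attached DRAM with the 4 KiB block reads NVMe performs even when only a single cache line is actually needed.

```python
# Access-granularity comparison: NVMe serves block I/O (4 KiB at a time),
# while CXL-attached DRAM is accessed with ordinary 64-byte cache-line
# loads and stores.
CACHE_LINE = 64    # bytes per CPU cache line
NVME_BLOCK = 4096  # bytes per NVMe read/write block

def nvme_read_amplification(bytes_needed):
    # Touching even a single cache line still costs a full 4 KiB block read.
    blocks = -(-bytes_needed // NVME_BLOCK)  # ceiling division
    return (blocks * NVME_BLOCK) / bytes_needed

print(f"Fetching one {CACHE_LINE}-byte cache line from NVMe reads "
      f"{nvme_read_amplification(CACHE_LINE):.0f}x more data than needed")
```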
So we did that test. We have a GPU server from Supermicro based on Genoa, the fourth-generation EPYC processor, and what we actually tested is only one socket, two L40S GPUs, and two Leos. For the baseline, we didn't include the CXL devices; we just wanted to see what would happen with an NVMe cache and two NVIDIA L40S GPUs. What we saw was a definitely slower time to insight due to the NVMe cache, high CPU utilization, and therefore very limited concurrent LLM instances per server. Then, when we tested this with two of our Leo chips, our CXL memory controller, with four DDR5-5600 DIMMs, we saw 40% faster time to insight, 40% lower CPU utilization per query, and the ability to increase the concurrent LLM instances per server.
So here's the raw data. This is a characterization of the workload over time: the Y-axis is GPU utilization and the X-axis is time. If you just look at the orange line, you can see it finishes after 700 seconds, and the GPU utilization climbs to around 180%. Mind you, there are two GPUs, so full utilization would effectively be 200%. That's actually what you see with CXL: it goes up to about 160%, then up to 200%, and it finishes at 400 seconds. This is based on FlexGen, an LLM engine. FlexGen is actually a research project from Carnegie Mellon, Berkeley, Stanford, Yandex, and various other researchers, and it was quite compelling: their whole goal was to see how to get more utilization out of these GPUs, so they created this project, and that's what we used for our proof of concept here. An interesting thing to note is that the model size itself is 122 gigabytes, so what was driving the memory utilization is actually the KV cache I mentioned earlier. The run parameters are a prompt length of 512, which is kind of like your question, GEN_LENGTH of 8, GPU_BATCH_SIZE of 24, and NUM_BATCHES of 12, so it's basically asked to do that job 12 times. FlexGen has an algorithm to realign the data placement and increase throughput as best it can with the resources given, regardless of whether it's NVMe, CXL, or local memory. It aggregates all those resources, and that's how we got this result.
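For reference, the arithmetic below simply restates those run parameters and the two completion times read off the chart; the token counts follow from the parameters, and the "roughly 40%" figure is just the ratio of 400 to 700 seconds.

```python
# Throughput comparison derived from the FlexGen run parameters and the two
# completion times on the chart (about 700 s with NVMe, about 400 s with CXL).
PROMPT_LEN = 512      # prompt length per sequence
GEN_LEN = 8           # generated tokens per sequence
GPU_BATCH_SIZE = 24
NUM_BATCHES = 12

sequences = GPU_BATCH_SIZE * NUM_BATCHES   # 288 prompts per run
generated_tokens = sequences * GEN_LEN     # 2,304 generated tokens per run

for label, seconds in [("NVMe cache", 700), ("CXL memory", 400)]:
    print(f"{label}: ~{generated_tokens / seconds:.1f} generated tokens/s "
          f"(finished in ~{seconds} s)")

print(f"Time to insight improves by ~{(700 - 400) / 700:.0%} with CXL")
```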
On the flip side of the coin for this workload, if you look at the CPU utilization, it's quite high without CXL. Again, the reason is that NVMe is not only slower, it also congests the CPU, because there are a lot of delays and waits around the load/store operations. With CXL, you can see the utilization is quite low, about 25%, compared to 65% at its peak without CXL.
So I started theory-crafting based on this information. Let's look at this one bar at a time. The purple bars represent the actual memory requirement for the model, the OPT-66B model, plus the KV cache; combined, it's about one terabyte. So, to make the math easier, one instance uses approximately a terabyte of memory, and you see all the different tiers stacked up: green, blue, and gray. The gray represents NVMe SSD. Even though the SSD has a large capacity, not much of it is used, because FlexGen is going to use the fastest tiers available: the local memory plus the CXL memory, or rather, in this first case, just local memory, since there was no CXL. That NVMe usage is what was driving CPU utilization to 65%. So without CXL, you can see the NVMe was used, whereas with one instance with CXL, CPU utilization drops to about 25% and there's no NVMe usage; everything is basically loaded in memory and processed efficiently. If we were to double this up and run two instances, based on what you see in the first bar chart, you would have a CPU bottleneck as you scale up, because you have two KV caches to fulfill for those instances, based on that story being told. And if you were to scale this up to four, you definitely couldn't do that without CXL; but with CXL, we theorize that you'd be able to run four instances with some overhead to spare. There's some more work to do, but essentially that's what we observed, and we plan on running these tests in the future on future platforms and with more resources.
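A minimal sketch of that scaling argument, using only the peak per-instance CPU utilization figures from the previous slide (about 65% without CXL, about 25% with CXL) and ignoring every other bottleneck; it is a simplification for illustration, not a capacity model.

```python
# Per-instance peak CPU utilization observed in the single-instance runs,
# used as a rough proxy for how many concurrent instances fit before the
# CPU saturates.
CPU_PER_INSTANCE = {"without CXL": 65, "with CXL": 25}  # percent per instance

for config, pct in CPU_PER_INSTANCE.items():
    max_instances = 100 // pct
    print(f"{config}: ~{pct}% CPU per instance -> roughly {max_instances} "
          f"concurrent LLM instance(s) before the CPU saturates")
```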
So that concludes my segment. I just have a few wrap-ups here, and then I'll pass it to Ahmed. What we saw from my segment was 40% faster time to insight with CXL, 40% lower CPU utilization per query, and therefore the ability to run more LLM instances on one server.
Thank you, Michael. So another example of these AI inferencing workloads is DLRMs, or deep learning recommendation models, typically used in many applications, as Michael mentioned at the very beginning: Netflix, Google, and others. Now, some of the challenges these applications have, going from left to right, are huge data sets and huge matrix tables; partially utilized cores due to data overhead; with distributed computing, making the right data available to the right node is orchestration intensive; and since some of these DLRMs are still run on GPUs in many cases, that can be power intensive. Perhaps the largest area of optimization that we can impact with CXL is the data movement, where data is pulled out of the database and prepped into a format that the model can consume. We see that this large percentage of time spent on data movement and transformation can be significantly alleviated simply by adding more memory.
So doubling the memory in a system effectively allows a larger dataset to be cached, which inherently shows a significant improvement in performance.
We ran this benchmark in partnership with AMD on a Supermicro A+ system with one AMD 5th-generation EPYC processor, twelve 96 GB DIMMs directly attached to the CPU, and sixteen 64 GB DIMMs CXL-attached, connected via four Astera Labs Leo CXL memory controllers, for a total system memory capacity of about 2 TB (1 TB directly attached, 1 TB CXL-attached). With about 60% more memory channels added and 133% higher bandwidth, we see a little over a 70% increase in performance for these workloads.
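For clarity, the capacity arithmetic behind the "about 2 TB" figure is reproduced below.

```python
# Memory capacity of the DLRM benchmark system described above.
direct_gb = 12 * 96  # twelve 96 GB DIMMs directly attached to the CPU
cxl_gb = 16 * 64     # sixteen 64 GB DIMMs behind four Leo CXL controllers
print(f"Direct-attached: {direct_gb} GB, CXL-attached: {cxl_gb} GB, "
      f"total = {(direct_gb + cxl_gb) / 1024:.1f} TB")
```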
And as we showcased AI inference running on CPUs with high memory capacity and bandwidth in the previous slides, we see that servers such as this one, the Lenovo SR860 V3, showcased here on the top left and also in our booth, which were traditionally designed for large in-memory database applications, are now being considered for new applications such as inferencing. This particular server features four of the latest Intel Xeon processors with 64 direct-attached DDR5 DIMM slots and 64 CXL-attached DIMM slots via 16 Leo CXL memory controllers, about 4 Leos per socket. That is a total of 128 DIMM slots for the whole system, and when populated with 128 GB DIMMs, we are looking at 16 TB of memory in a single server.
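And the same kind of arithmetic for the SR860 V3 configuration just described:

```python
# DIMM-slot math for the four-socket Lenovo SR860 V3 configuration above.
sockets = 4
direct_slots, cxl_slots = 64, 64   # per system; the CXL slots sit behind 16 Leos
dimm_gb = 128                      # populated with 128 GB DIMMs

total_slots = direct_slots + cxl_slots
print(f"{total_slots} DIMM slots ({total_slots // sockets} per socket) "
      f"x {dimm_gb} GB = {total_slots * dimm_gb // 1024} TB per server")
```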
So, as final remarks, I would like to invite you to check out our website and all the products we have in the booth and here, but also to get involved in the OCP CMS project to drive innovation. See the CXL technology in action in our booth at B13, provide input on new CXL hardware designs at the CMS table in the OCP Innovation Village, and for any other questions, you can contact me or Michael directly. And Michael, you can close.
Yeah, definitely get involved. As mentioned, this stuff is awesome; it's in our booth, B13. We also have a mock-up of what we think next-generation AI systems will look like. It actually features all of our silicon: everything from our Leo CXL controller, our Aries 6 retimer, and our Taurus Ethernet cabling, to our new product, our Scorpio PCIe switch. And then over in the Inventec booth, I think it's actually B8, not B9, there is a CXL box with 96 DDR5-4800 DIMMs. So thank you to them for sharing that with the world, and thank you for joining us for our talk. That concludes our presentation. Thank you.
Thanks, everyone.
Thank you for your presentation; this is very interesting work. As I remember, NVIDIA's platforms currently have no plan to support the CXL protocol, yet you have CXL memory devices in the data path, and the GPU speaks the PCIe protocol. So my question is: is there some translation from CXL to PCIe?
Yeah, so I think this is a question on the FlexGen data, right? That's a good question. Effectively, yes: even though the GPU doesn't support CXL, what you're seeing is that the CPU is able to pass information more quickly because of CXL's cache-line performance, versus the 4K random reads or block operations on the NVMe cache. So that's the reason for the performance gain. We're not proposing or simulating some new GPU; we used the L40S GPU from NVIDIA, two of them in fact, and then fed them as fast as possible.
So, to quickly summarize what Michael said and add a little bit to it: the coherence agent in the CPU is the one responsible for giving the GPU access to system memory. The GPU is connected via PCIe, Leo is connected via CXL, but the whole memory is made available through the CPU. Does that answer it?
Thank you.