Thanks, Frank. So, interestingly, yesterday I was speaking at the Future Technologies Symposium, and it was all about short-reach optical interconnects, and I was one of the few folks actually talking about CXL. Here today, I'm in an all-CXL forum, and I'm the only guy talking about optics. So this will be an interesting change of pace for those of you here listening. I'm going to talk about optical CXL for large-scale memory pooling. My name is Ron Swartzentruber. I'm Director of Engineering at Lightelligence, responsible for the optical fabric development.
Our talk is going to focus first on the memory-centric shift in the data center, the growth of large language models, and the need for optical CXL. In this group, the case for CXL is clear, but why do we need optical CXL? We actually asked ourselves that question and set out to prove that we did in fact need it. I'll show you that, along with a case study, which is the work we did using OPT inference.
All right, so to kick things off, I think what we're starting to see is that compute is no longer the dominant resource in the data center. It's memory, and access to memory, that has become the dominant constraint. And what we're finding, talking to data center architects, is that applications now define which machines they run on. That didn't happen five or ten years ago; this is a new phenomenon. With that new paradigm, you can compose and design completely different data center architectures, and that is what we call disaggregation.
Now, the trend in large language models: they're not getting any smaller, right? And as that trend continues, the need for disaggregated memory architectures will grow. We need to be able to process these models at a high rate, but the models are large; they can't all reside in server memory anymore. Furthermore, it's not just the memory capacity and bandwidth that are the issue, it's the latency. What you can see is that CXL memory sits nicely at one NUMA hop away from system memory, around 170 nanoseconds, as Microchip mentioned earlier today. Compare that to network-attached memory or an SSD at several microseconds away. These large language models need low-latency communication, and to combine that low latency with longer reach, what's needed is optics, because as you move your memory further and further away, you need optics to extend that reach.
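As a concrete illustration of that latency hierarchy (not part of the talk's demo), here is a minimal pointer-chasing probe. Run it once bound to local DRAM and once bound to the NUMA node backed by CXL memory; the node IDs below are hypothetical, and in Python the absolute numbers include heavy interpreter overhead, so only the difference between the two runs is meaningful.

```python
# latency_probe.py -- rough, relative memory-latency probe (illustrative only).
# Run once bound to local DRAM and once bound to the CXL-backed NUMA node, e.g.:
#   numactl --membind=0 python3 latency_probe.py    # local DRAM
#   numactl --membind=2 python3 latency_probe.py    # hypothetical CXL node id
import random
import time

def chase(num_entries: int = 1 << 24, iters: int = 2_000_000) -> float:
    """Average nanoseconds per dependent load through one random cycle.

    Dependent loads defeat the hardware prefetcher, so the difference between
    the two NUMA bindings roughly reflects the extra latency of the farther
    memory tier. Allocates several hundred MB so accesses miss the caches.
    """
    perm = list(range(num_entries))
    random.shuffle(perm)
    chain = [0] * num_entries
    for i in range(num_entries):
        chain[perm[i]] = perm[(i + 1) % num_entries]  # one cycle over all slots
    idx = 0
    start = time.perf_counter()
    for _ in range(iters):
        idx = chain[idx]  # each access depends on the previous result
    return (time.perf_counter() - start) / iters * 1e9

if __name__ == "__main__":
    print(f"~{chase():.0f} ns per access (includes Python interpreter overhead)")
```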
In the data center today, RDMA is the dominant interconnect protocol for remote memory applications. The problem is the latency that's incurred: there's FEC, and there's going through the NIC. CXL, as we've talked about, is low latency. So what's needed for CXL to match RDMA's reach is an optical interconnect carrying CXL.
I don't think I need to dwell on this slide. The wide adoption of CXL is pretty obvious in the industry; there are 250 member companies. And the real advantage, as we all know, is that CXL adds memory and cache coherency protocols to the ubiquitous PCIe fabric.
All right, so why do we need optical CXL? What this slide shows is that the signal loss over copper is extremely high. What you'll see from TE and Molex and others is that these Gen 5 copper cables are quite bulky, and they only travel a few meters; I think I've seen a maximum of five meters, with a couple of retimers on each end. Compare that to the loss through an optical cable, which is extremely low, so 100 meters is no problem. Practically, 10 meters would be sufficient in many cases, and 30 meters would let you travel across the data center. So that's the loss. And then, obviously, the cable cross-sectional area is much smaller with fiber optics, the bend radius is no longer such an issue, and you can certainly improve the amount of traffic you can send per square centimeter.
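For a rough sense of scale (these are representative figures, not numbers from the talk): twinax used in Gen 5 copper cables loses on the order of a couple of dB per meter at the ~16 GHz Nyquist frequency, while OM3 multimode fiber loses roughly 3 dB per kilometer at 850 nm, so even a 100 m optical run costs a fraction of a dB:

$$
\underbrace{5\,\text{m} \times \sim\!2\,\tfrac{\text{dB}}{\text{m}}}_{\text{copper cable}} \approx 10\,\text{dB}
\qquad\text{vs.}\qquad
\underbrace{100\,\text{m} \times \sim\!0.003\,\tfrac{\text{dB}}{\text{m}}}_{\text{OM3 fiber}} \approx 0.3\,\text{dB}
$$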
So simply put, optical CXL is needed to break through the rack. Copper is limited to single servers, single racks. If you want to go multiple racks or you want to go across the data center, you need optical CXL. And optical CXL will give you that low latency, high bandwidth data center reach.
OK, so let's get into proving why we needed optical CXL in the first place. We did a case study using large language model inference; by the way, we demonstrated this case study at Flash Memory Summit, and we won the Best of Show award for our efforts. Let me describe what it is. On the left-hand side, you have an AMD Genoa server, which is CXL 1.1 compliant. In that server is an NVIDIA A10 GPU that's processing the large language model, along with a Photowave card, which carries PCIe Gen 5 x16 over optics. It's connected to another Photowave card acting as the endpoint device, which is connected to an FPGA and two CXL memory expanders. It's a very simple memory expansion box. The reason we built it ourselves is that such boxes didn't exist at the time; I've now seen a few from SK hynix and MSI here at the show, but at the time, we didn't have those. So that's the demo setup.
So then what we did... actually, one more thing I'd like to comment on. What we set out to prove was this: if we ran our large language model from the CXL memory drive, compared to the solid-state drive located in the same server, would the application run faster or slower? That's the trade-off: put the large language model 30 meters away across the fiber on the memory expansion box, versus put it inside the server; which one is going to win? About the large language model: we chose OPT-66B because it essentially fit within one Samsung 128-gigabyte memory expander; as you can see, it's 122 gigabytes. And the workload we gave it was news text summarization, which I'll describe in a bit.
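The talk does not say which inference stack the demo used. As one way to reproduce this kind of setup with off-the-shelf tools, here is a hypothetical sketch using Hugging Face Transformers with Accelerate offloading, where host-side weights are bound to the CXL-backed NUMA node via numactl. The node ID and the 20 GiB / 128 GiB memory split are assumptions, not the demo's configuration.

```python
# opt_cxl_offload.py -- hypothetical reproduction sketch (not the demo's actual stack).
# Assumes the CXL expander appears as a CPU-less NUMA node; bind host allocations to it:
#   numactl --membind=<cxl_node_id> python3 opt_cxl_offload.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/opt-66b"   # ~122 GB of weights per the talk; fits one 128 GB expander

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# device_map="auto" keeps what fits on the A10 GPU and offloads the remaining
# layers to host memory, which the numactl binding above places on the CXL node.
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "128GiB"},  # assumed split, not the demo's
)

article = "..."  # a few paragraphs of news text
prompt = f"Summarize the following article in a few sentences:\n\n{article}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```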
So, the results. What you can see on this chart, first of all, is that with the NVMe disk we got roughly 1.9 tokens per second and a decode latency of 338 seconds. Compare that to the CXL memory, which is just a NUMA hop away plus time of flight: about two and a half times better, 4.8 tokens per second, and the latency is much lower, 138 seconds to run the workload. Now, comparing CXL memory to system memory: obviously, if you put the model in system memory, that's going to be your best performance, but CXL memory was only about 70% of that, so it didn't do too badly. One of the reasons we're partnering with MemVerge is this next data point, where they did an excellent job with their memory manager: we put 60% of the model on the CXL drive and 40% in system memory. MemVerge has that tool; it's completely behind the curtain to us. But as you can see, their results were just about the same as system memory. What happens there is they put the hot pages in system memory and the cold pages in CXL memory, and as you can see, there's very little degradation in performance.
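For reference, the throughput and latency figures quoted above work out to roughly the same factor:

$$
\frac{4.8\ \text{tokens/s}}{1.9\ \text{tokens/s}} \approx 2.5\times
\qquad
\frac{338\ \text{s}}{138\ \text{s}} \approx 2.4\times
$$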
So here is the workload. As I said, it's news text summarization: you get a couple of paragraphs of news text, and you summarize it into a few sentences. Imagine you're the 6 PM news anchor trying to summarize a bunch of news; this is what the AI model does. It's pretty cool.
Some additional stats. What you can see in the big circle here is the improvement in decode throughput: roughly two and a half times better for the CXL memory in blue versus the NVMe disk in orange. Among the other factors, GPU utilization is much higher with CXL, which is good; we want our GPU to be fully utilized. The reason it's not so high with NVMe is that the GPU is just waiting for data to come back. CPU utilization is lower with CXL memory, which is also good, because now I have more of my server to use for other jobs. And of course, the CXL memory is fully utilized. The chart itself is interesting. The first ramp reflects the fact that part of the model is cached in GPU memory; there's a small amount of GPU memory used here, and that first ramp, where the NVMe and the CXL memory are basically ramping up together, shows the model being cached. However, the problem with the SSD is that once it needs to go out to the NVMe disk, performance drops dramatically. That's why you see the drop in performance when going to disk, whereas the CXL memory stays pretty much flat. This is essentially a time graph as the model is processed; that's what the performance looks like.
All right, so in summary, CXL memory is efficient and beneficial for large language model offloading. There's similar performance, as I mentioned, compared to system memory, which is great, a 2.4x advantage over disk, and an improved TCO; we calculated just about a 2x improvement in TCO.
So here are the products we've launched. First is a low-profile PCIe card, Gen 5 x16. What you can see on that card are two optical modules that we designed; each of those modules is x8, so you can run one x16, two x8, or four x4 using bifurcation. We also came out with an OCP 3.0 card, the standard NIC form factor that customers have requested, as well as an active optical cable; we've come out with CDFP and QSFP-DD, and we're also entertaining other customer requests. Now, the difference between those products is that the card contains a retimer, in this case the Montage retimer, so it does add a little bit of latency: even in the low-latency CXL mode, roughly 20 nanoseconds through the card. The active optical cable, by contrast, is purely linear; you can think of it as analogous to LPO Ethernet transceivers, a linear PCIe transceiver, so that latency is dramatically lower, right around a nanosecond. The reason we wanted to do clock retiming on the cards is that we never know what kind of motherboard we're going to get, and whether or not linear drive would be acceptable. But for customer applications where we've already done the signal integrity analysis, the active optical cable is great because it's much lower latency and lower cost.
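As a rough added-latency budget for an assumed 10 m link with one device at each end (the per-device figures are the ones quoted above; fiber propagation of about 5 ns per meter is a standard figure, not from the talk):

$$
\text{retimed cards: } 2 \times 20\,\text{ns} + 10\,\text{m} \times 5\,\tfrac{\text{ns}}{\text{m}} \approx 90\,\text{ns}
\qquad
\text{linear AOC: } 2 \times 1\,\text{ns} + 50\,\text{ns} \approx 52\,\text{ns}
$$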
And that's it. So here's information if you want to learn more. Our booth has been torn down, but you can come talk to us after if you have any questions. But I think we may have time for some questions here.
Hi, so how far in your example are the two ends of the link?
Yeah, so they can be up to 100 meters. Practically speaking with Gen 5, PCI-SIG recommends a 64 nanosecond maximum, which is right around 12 meters. But they don't preclude you from going longer. So typically our customers are asking for 5 meters, 10 meters. I think one customer did ask for 30 meters.
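That distance follows from propagation delay in fiber: with a group index around 1.47, light covers roughly one meter every 5 ns, so a 64 ns budget corresponds to about 13 m (a back-of-the-envelope conversion, not a PCI-SIG calculation):

$$
\frac{64\,\text{ns}}{\approx 4.9\,\text{ns/m}} \approx 13\,\text{m}
$$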
So what's limiting you physically from going further?
Nothing really. I mean, it's multimode fiber, OM3. It's just that 100 meters is all we've tested.
All right, thank you.
Any other questions? Yeah.
You touched on some results related to that LLM model and the speedup with respect to the division between CXL and socket memory. Are there any more results like that, particularly in the context of databases?
Yeah, so that was the only workload we ran. But we are actually doing more work on databases, so maybe I can talk to you afterwards and share some of those results under NDA.
Yeah, a question about your business. There has been a big wave around big data, CXL, and disaggregation. Do you think CXL will have a big impact on accelerating your business?
Yeah, I think CXL is going to play a dominant role there because it adds the memory and cache coherency functions. I think it's really the only protocol that has the ability to do memory pooling in the data center. In fact, I asked the panel yesterday what they thought was going to be the dominant optical memory interconnect in the data center, and Andy Bechtolsheim said he thought it would be maybe around Gen 7 over optics. So I think we're right on track there.
I have a question. In your demo, you showed a point-to-point link, and you mentioned that in the next generation you might do some kind of switching. Can you elaborate a little further? Are you doing level 1, level 2, those types of things?
Yeah, exactly. We do have plans to do memory pooling. We didn't have a CXL switch at the time, so we just did memory expansion. But it's absolutely in our plans to acquire some CXL switches and memory appliances so we can further showcase what we have here.
Thank you.
OK, Ron, thanks.