YouTube:https://www.youtube.com/watch?v=JU3gyDcBa-Q
Text:
Thanks, Frank. So again, my name is Ron Swartzentruber. I'm Director of Engineering at Lightelligence, and I'm going to talk about optical CXL for disaggregated compute architectures. As Frank mentioned, what we do at Lightelligence is all things photonic. We do photonic computing, photonic network-on-chip, and we also have a line of optical CXL products that provide low-latency, high-bandwidth, long-distance connectivity for CXL. We're first on the market with Gen 5 products, supporting PCIe Gen 5 and CXL 2.0.
Today's presentation covers the memory-centric shift in the data center. We're going to look at large language model growth and discuss the need for optical CXL: why do we need this technology? Then we'll show you a case study. At Lightelligence, we needed to prove to ourselves that this was a worthwhile product, something with real market fit that was actually needed, so we performed a case study with OPT inference, and we'll show you those results during this presentation.
First, a little bit about disaggregation and why it matters. What we're seeing is that compute is no longer the dominant resource in the data center. It's memory, and access to memory, that has become the dominant criterion for how data centers are designed. Furthermore, applications are now able to define the machines they run on, allowing a greater degree of freedom. That's what's taking place in the data center today, and data center architects are changing the way they compose their systems.
What compounds things is that the size of large language models continues to grow, as you can see in the chart on the left. Today's models, with 300 billion parameters or so, need to be coherently connected to large arrays of compute, and as the models keep growing they put pressure on memory bandwidth. The chart on the right looks at latency. What's also needed to process these large language models is low-latency CPU- or GPU-to-memory connectivity, so CXL is actually very well poised to serve them, whether for training or inference. As you can see, CXL memory is roughly one NUMA hop away, a couple hundred nanoseconds, much closer than your SSD, which is tens of microseconds away. We're also finding that there's a limited amount of space in your server and in your rack, even when one server only connects to memory within the same rack, so companies are finding that they need to go to optics if they want to span multiple racks. That's why optical interconnects are required as these models grow and as they connect to more and more GPUs.
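To make the "one NUMA hop" point concrete, here is a minimal sketch (mine, not from the talk) of how a host-attached CXL memory expander typically shows up on Linux: as a CPU-less NUMA node whose reported distance from the local node is larger than local DRAM but still in a different class from any storage tier. The node layout and distances it prints depend entirely on the system.

```python
# Minimal sketch: on Linux, a CXL memory expander usually appears as a
# CPU-less NUMA node. Its SLIT distance from the local node hints at the
# extra latency of the "one NUMA hop" mentioned above.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    distances = (node / "distance").read_text().split()
    has_cpus = (node / "cpulist").read_text().strip() != ""
    kind = "CPU node" if has_cpus else "memory-only node (possibly CXL)"
    print(f"{node.name}: {kind}, distances to all nodes = {distances}")
```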
What we're finding today is that the dominant protocol used in the data center is RDMA, but the challenge with RDMA is the latency penalty it incurs. What's needed is a low-latency remote memory interconnect such as CXL. CXL rides on PCI Express, which is the dominant load/store interconnect used in the data center today. You can see the comparison here: roughly hundreds of nanoseconds for CXL compared to hundreds of microseconds for RDMA.
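To illustrate the load/store point, here is a hedged sketch, not from the talk: when a CXL memory expander is exposed in device-DAX mode (the /dev/dax0.0 path and mapping size below are assumptions), an application can simply mmap it and issue ordinary loads and stores, with no verbs, queue pairs, or completion polling as RDMA would require.

```python
# Sketch only: plain load/store access to CXL-attached memory exposed as a
# device-DAX character device. The device path and size are assumptions.
import mmap
import os

DAX_PATH = "/dev/dax0.0"    # hypothetical devdax node backed by the CXL expander
MAP_SIZE = 2 * 1024 * 1024  # map a single 2 MiB region for illustration

fd = os.open(DAX_PATH, os.O_RDWR)
try:
    buf = mmap.mmap(fd, MAP_SIZE)
    buf[0:4] = b"CXL!"      # an ordinary store: no RDMA verbs, no completion queue
    print(buf[0:4])         # an ordinary load from the (optically attached) memory
    buf.close()
finally:
    os.close(fd)
```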
You've heard a lot about CXL today, so I don't really need to discuss the 200-some member companies listed here. The important thing CXL adds, as you know, is the cache coherency and memory protocols on top of the PCIe interconnect. It also gives us hierarchical switching and, in the later revisions of the standard, CPU-to-CPU and GPU-to-GPU connectivity. So it's now a full-featured interconnect that can be used for memory disaggregation.
OK, so that's CXL, but why do we need optical CXL? Quite simply, the signal loss in copper is extremely high, even over short distances. Electrical CXL, or electrical Ethernet for that matter, can only scale a few meters, as you can see in the chart on the left. With optical CXL, however, you can span 10 to 100 meters with minimal loss; you're essentially limited only by the speed of light. Furthermore, the cross-sectional area of fiber is dramatically smaller than that of copper. If you've ever looked inside your server, you'll see large MCIO cables all over the place. Fiber cables are much smaller and also offer a longer reach.
So our product extends the reach of the CXL interconnect, allowing these applications to span multiple racks, really across the data center. You're no longer confined to a single rack of equipment; you can span the data center with a low-latency, high-bandwidth, cache-coherent interconnect.
OK, now let's talk about our case study, which uses inference. This slide shows the demo we prepared for Flash Memory Summit, where we won the Best of Show award last August. On the left you can see a standard Supermicro server with an AMD Genoa CXL 1.1 CPU. That server also held an NVIDIA A10 GPU, which ran the large language model inference. On the right, we built a memory box containing two Micron CXL memory expanders, as listed there. So it was a fairly simple demo. Interconnecting the server and the memory box were two PhotoWave cards, which are low-profile PCIe CEM form factor cards, connected together with two fiber optic cables. That was the demo setup.
The large language model we chose was OPT-66B, which is a model that can be used for news text summarization. We chose it because of its size: it could fit on a single expander. We had 256-gigabyte expanders, but it could also fit on a 128-gigabyte expander. So OPT-66B is what we chose.
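As a rough back-of-the-envelope check (mine, not the speaker's), the resident size of the weights is just the parameter count times the bytes per parameter, so whether OPT-66B fits on a 128-gigabyte expander depends on the precision it is served at; the talk does not state the precision used in the demo.

```python
# Back-of-the-envelope weight footprint for OPT-66B at different precisions.
# Illustrative only; the serving precision used in the demo was not stated.
PARAMS = 66e9
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{name:9s}: ~{gib:6.1f} GiB of weights")
# fp32 is roughly 246 GiB, fp16/bf16 roughly 123 GiB, int8 roughly 61 GiB,
# so a half-precision copy of the weights would fit within 128 GiB.
```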
The results you can see here were collected running news text summarization inference on that model. The GPU ran the model, while the model itself was located on the CXL memory expander, separated by 10 meters of fiber optic cable. The important result is that the decode throughput when running the model in CXL memory was roughly 4.8 tokens per second. Compare that to 1.9 tokens per second when the model was loaded from a solid state drive located directly in the server: running the model in CXL memory gave roughly 2.5x higher throughput than running it off the disk. You can also see that the decode latency on CXL memory was about a third of the SSD case. So we got the results we were expecting, but we were actually quite surprised that we more than doubled the throughput compared to SSD.
This shows exactly what was happening in the model. We take a text input, as you can see here, so here's a large clip of news that comes in over the wire, and the AI model summarizes it into a few sentences. That's essentially what was going on under the hood for this AI model. It's quite interesting.
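For readers who want to see what such a summarization loop looks like in code, here is a minimal sketch using the Hugging Face transformers library. It is not Lightelligence's demo code; the prompt format and generation settings are assumptions, and a smaller OPT checkpoint stands in for OPT-66B so the sketch can run on a single commodity GPU.

```python
# Minimal sketch of news-summarization inference with an OPT model via prompting.
# Not the demo's actual code; model size, prompt, and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/opt-1.3b"  # stand-in for OPT-66B so this fits on one GPU

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")

article = "...long news article text..."   # the raw news clip coming in over the wire
prompt = article + "\n\nTL;DR:"            # OPT is a causal LM, so summarize via prompting

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=80, do_sample=False)
summary = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary)
```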
Here are some more results. The graph on the lower left shows the time horizon, in seconds, as the model runs. At startup, you can see that CXL and disk are roughly the same. About a third to halfway through, though, the GPU runs out of its cache memory and has to start fetching from the solid state drive, and that's where the throughput drops dramatically. With CXL memory, because the whole model resides in that memory, no such buffering is required, and the decode throughput remains high throughout the inference. Some other things to point out: GPU utilization was much better with CXL memory, because the latency penalty is gone, and CPU utilization was in fact lower. In the solid state drive case, the CPU is continually performing memory management functions, and that's not required here. We were also using MemVerge software, which provided an added benefit on this inference.
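The demo used MemVerge software for the memory tiering. As a rough, generic illustration of the same offloading idea (not their stack, and the memory budgets below are assumptions), the transformers and accelerate libraries let you cap GPU memory and spill the remaining weights into host memory, which on a system like the one in the demo would land in the CXL-expanded capacity.

```python
# Generic sketch of weight offload to host memory (which CXL expands); this is
# not the MemVerge-based setup from the demo. Requires the accelerate package.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-66b",
    torch_dtype=torch.float16,
    device_map="auto",                        # let accelerate place layers
    max_memory={0: "20GiB", "cpu": "200GiB"}, # small GPU budget, large host budget
)
# Layers that do not fit in the 20 GiB GPU budget stay in host memory and are
# brought onto the GPU as they are needed during decode.
```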
A few things to note in the summary results. First, we found that CXL memory offloading is efficient and beneficial. It's a technique you can employ to reduce the amount of on-server system memory and offload to lower-cost memory, potentially in a remote location. We measured a 2.5x performance advantage compared to SSD. Interestingly, as you saw in the previous chart, the performance was comparable to pure system memory, maybe 75% of it. So if you could in fact load your large language model into system memory, which generally you can't because these models are so large, you would see performance similar to pure system memory. We also calculated a 1.9x total cost of ownership improvement, and the reason is that we're able to use less expensive GPUs and still provide similar throughput. All right.
I'd like to introduce some of the PhotoWave products here. As I mentioned, we have the low-profile PCIe card that was used in the demonstration. We also have an OCP 3.0 form factor card, your NIC type of form factor, as well as active optical cables, which come in a variety of connector types. We're developing QSFP-DD and CDFP at the moment, and we're taking customer requests for other connector options. The other form factor being discussed in the PCI-SIG optical working group, of course, is OSFP. Some of the features I mentioned before: CXL 2.0, PCIe Gen 5, 16 lanes. We do support sideband signals over optics, with a separate fiber for the sideband signals. On that fiber we can transfer many signals using our proprietary encoding scheme: SMBus, PCIe reset, presence detect, wake, clock request, whatever is required, and of course the reference clock if you want to run in common clock mode. All of those signals are available to you. The PCIe card and the OCP card have a retiming function built in, which performs jitter reduction and signal integrity cleanup. The active optical cables do not; they're more like a linear drive, or direct drive, so the latency on the active optical cable is much lower, right around 1 nanosecond plus your time of flight. If you're interested in hearing more about these products, please contact [email protected].