YouTube:https://www.youtube.com/watch?v=JU3gyDcBa-Q
Text:
Thanks, Frank. So again, my name is Ron Swartzentruber. I'm Director of Engineering at Lightelligence, and I'm going to talk about optical CXL for disaggregated compute architectures. As Frank mentioned, what we do at Lightelligence is all things photonic. We do photonic computing, photonic network-on-chip, and we also have a line of optical CXL products that provide low-latency, high-bandwidth, long-distance connectivity for CXL. We're first on the market with Gen 5 products, supporting PCIe Gen 5 and CXL 2.0.
Today's presentation covers the memory-centric shift in the data center. We're going to look at large language model growth and discuss the need for optical CXL: why do we need this technology? Then we'll show you a case study. At Lightelligence, we needed to prove to ourselves that this was a worthwhile product, something with real market fit that was actually needed, so we performed a case study with OPT inference, and we'll show you those results during this presentation.
First, a little bit about disaggregation and why it matters. What we're seeing is that compute is no longer the dominant resource in the data center. It's memory, and access to memory, that has become the dominant criterion for how data centers are designed. Furthermore, applications are now able to define the machines they run on, allowing a greater degree of freedom. That's what's taking place in the data center today, and data center architects are changing the way they compose their systems.
What compounds things is that the size of large language models continues to grow, as you can see in the chart on the left. Today's models, with 300 billion parameters or so, need to be coherently connected to large arrays of compute, and as the models keep growing they put pressure on memory bandwidth. The chart on the right looks at latency. What's also needed to process these large language models is low-latency CPU- or GPU-to-memory connectivity, so CXL is actually very well poised to serve them, whether for training or inference. As you can see, CXL memory is roughly one NUMA hop away, a couple hundred nanoseconds, much closer than your SSD, which is tens of microseconds away. We're also finding that there's a limited amount of space in your server and in your rack, even when one server only connects to memory within the same rack, so companies are finding that they need to go to optics if they want to span multiple racks. That's why optical interconnects are required as these models grow and as they connect to more and more GPUs.
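To make the "one NUMA hop" point concrete, here is a minimal sketch (mine, not from the talk) of how a host-attached CXL memory expander typically shows up on Linux: as a CPU-less NUMA node whose reported distance from the local node is larger than local DRAM but still in a different class from any storage tier. The node layout and distances it prints depend entirely on the system.

```python
# Minimal sketch: on Linux, a CXL memory expander usually appears as a
# CPU-less NUMA node. Its SLIT distance from the local node hints at the
# extra latency of the "one NUMA hop" mentioned above.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    distances = (node / "distance").read_text().split()
    has_cpus = (node / "cpulist").read_text().strip() != ""
    kind = "CPU node" if has_cpus else "memory-only node (possibly CXL)"
    print(f"{node.name}: {kind}, distances to all nodes = {distances}")
```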
What we're finding today is that the dominant protocol used in the data center is RDMA, but the challenge with RDMA is the latency penalty it incurs. What's needed is a low-latency remote memory interconnect such as CXL. CXL rides on PCI Express, which is the dominant load/store interconnect used in the data center today. You can see the comparison here: roughly hundreds of nanoseconds for CXL compared to hundreds of microseconds for RDMA.
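To illustrate the load/store point, here is a hedged sketch, not from the talk: when a CXL memory expander is exposed in device-DAX mode (the /dev/dax0.0 path and mapping size below are assumptions), an application can simply mmap it and issue ordinary loads and stores, with no verbs, queue pairs, or completion polling as RDMA would require.

```python
# Sketch only: plain load/store access to CXL-attached memory exposed as a
# device-DAX character device. The device path and size are assumptions.
import mmap
import os

DAX_PATH = "/dev/dax0.0"    # hypothetical devdax node backed by the CXL expander
MAP_SIZE = 2 * 1024 * 1024  # map a single 2 MiB region for illustration

fd = os.open(DAX_PATH, os.O_RDWR)
try:
    buf = mmap.mmap(fd, MAP_SIZE)
    buf[0:4] = b"CXL!"      # an ordinary store: no RDMA verbs, no completion queue
    print(buf[0:4])         # an ordinary load from the (optically attached) memory
    buf.close()
finally:
    os.close(fd)
```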
You've heard a lot about CXL today, so I don't really need to discuss the 200-some member companies listed here. The important thing CXL adds, as you know, is the cache coherency and memory protocols on top of the PCIe interconnect. It also gives us hierarchical switching and, in the later revisions of the standard, CPU-to-CPU and GPU-to-GPU connectivity. So it's now a full-featured interconnect that can be used for memory disaggregation.
OK, so that's CXL, but why do we need optical CXL? Quite simply, the signal loss in copper is extremely high, even over short distances. Electrical CXL, or electrical Ethernet for that matter, can only scale a few meters, as you can see in the chart on the left. With optical CXL, however, you can span 10 to 100 meters with minimal loss; you're essentially limited only by the speed of light. Furthermore, the cross-sectional area of fiber is dramatically smaller than that of copper. If you've ever looked inside your server, you'll see large MCIO cables all over the place. Fiber cables are much smaller and also offer a longer reach.
So our product extends the reach of the CXL interconnect, allowing these applications to span multiple racks, really across the data center. You're no longer confined to a single rack of equipment; you can span the data center with a low-latency, high-bandwidth, cache-coherent interconnect.
OK, now let's talk about our case study, which uses inference. This slide shows the demo we prepared for Flash Memory Summit, where we won the Best of Show award last August. On the left you can see a standard Supermicro server with an AMD Genoa CXL 1.1 CPU. That server also held an NVIDIA A10 GPU, which ran the large language model inference. On the right, we built a memory box containing two Micron CXL memory expanders, as listed there. So it was a fairly simple demo. Interconnecting the server and the memory box were two PhotoWave cards, which are low-profile PCIe CEM form factor cards, connected together with two fiber optic cables. That was the demo setup.
The large language model we chose was OPT-66B, which is a model that can be used for news text summarization. We chose it because of its size: it could fit on a single expander. We had 256-gigabyte expanders, but it could also fit on a 128-gigabyte expander. So OPT-66B is what we chose.
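As a rough back-of-the-envelope check (mine, not the speaker's), the resident size of the weights is just the parameter count times the bytes per parameter, so whether OPT-66B fits on a 128-gigabyte expander depends on the precision it is served at; the talk does not state the precision used in the demo.

```python
# Back-of-the-envelope weight footprint for OPT-66B at different precisions.
# Illustrative only; the serving precision used in the demo was not stated.
PARAMS = 66e9
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{name:9s}: ~{gib:6.1f} GiB of weights")
# fp32 is roughly 246 GiB, fp16/bf16 roughly 123 GiB, int8 roughly 61 GiB,
# so a half-precision copy of the weights would fit within 128 GiB.
```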
The results you can see here were collected running news text summarization inference on that model. The GPU ran the model, while the model itself was located on the CXL memory expander, separated by 10 meters of fiber optic cable. The important result is that the decode throughput when running the model in CXL memory was roughly 4.8 tokens per second. Compare that to 1.9 tokens per second when the model was loaded from a solid state drive located directly in the server: running the model in CXL memory gave roughly 2.5x higher throughput than running it off the disk. You can also see that the decode latency on CXL memory was about a third of the SSD case. So we got the results we were expecting, but we were actually quite surprised that we more than doubled the throughput compared to SSD.
This shows exactly what was happening in the model. We take a text input, as you can see here, so here's a large clip of news that comes in over the wire, and the AI model summarizes it into a few sentences. That's essentially what was going on under the hood for this AI model. It's quite interesting.
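For readers who want to see what such a summarization loop looks like in code, here is a minimal sketch using the Hugging Face transformers library. It is not Lightelligence's demo code; the prompt format and generation settings are assumptions, and a smaller OPT checkpoint stands in for OPT-66B so the sketch can run on a single commodity GPU.

```python
# Minimal sketch of news-summarization inference with an OPT model via prompting.
# Not the demo's actual code; model size, prompt, and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/opt-1.3b"  # stand-in for OPT-66B so this fits on one GPU

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")

article = "...long news article text..."   # the raw news clip coming in over the wire
prompt = article + "\n\nTL;DR:"            # OPT is a causal LM, so summarize via prompting

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=80, do_sample=False)
summary = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(summary)
```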
Here are some more results. The graph on the lower left shows the time horizon, in seconds, as the model runs. At startup, you can see that CXL and disk are roughly the same. About a third to halfway through, though, the GPU runs out of its cache memory and has to start fetching from the solid state drive, and that's where the throughput drops dramatically. With CXL memory, because the whole model resides in that memory, no such buffering is required, and the decode throughput remains high throughout the inference. Some other things to point out: GPU utilization was much better with CXL memory, because the latency penalty is gone, and CPU utilization was in fact lower. In the solid state drive case, the CPU is continually performing memory management functions, and that's not required here. We were also using MemVerge software, which provided an added benefit on this inference.
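The demo used MemVerge software for the memory tiering. As a rough, generic illustration of the same offloading idea (not their stack, and the memory budgets below are assumptions), the transformers and accelerate libraries let you cap GPU memory and spill the remaining weights into host memory, which on a system like the one in the demo would land in the CXL-expanded capacity.

```python
# Generic sketch of weight offload to host memory (which CXL expands); this is
# not the MemVerge-based setup from the demo. Requires the accelerate package.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-66b",
    torch_dtype=torch.float16,
    device_map="auto",                        # let accelerate place layers
    max_memory={0: "20GiB", "cpu": "200GiB"}, # small GPU budget, large host budget
)
# Layers that do not fit in the 20 GiB GPU budget stay in host memory and are
# brought onto the GPU as they are needed during decode.
```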
A few things to note in the summary results. First, we found that CXL memory offloading is efficient and beneficial. It's a technique you can employ to reduce the amount of on-server system memory and offload to lower-cost memory, potentially in a remote location. We measured a 2.5x performance advantage compared to SSD. Interestingly, as you saw in the previous chart, the performance was comparable to pure system memory, maybe 75% of it. So if you could in fact load your large language model into system memory, which generally you can't because these models are so large, you would see performance similar to pure system memory. We also calculated a 1.9x total cost of ownership improvement, and the reason is that we're able to use less expensive GPUs and still provide similar throughput. All right.
I'd like to introduce some of the PhotoWave products here. As I mentioned, we have the low-profile PCIe card that was used in the demonstration. We also have an OCP 3.0 form factor card, your NIC type of form factor, as well as active optical cables, which come in a variety of connector types. We're developing QSFP-DD and CDFP at the moment, and we're taking customer requests for other connector options. The other form factor being discussed in the PCI-SIG optical working group, of course, is OSFP. Some of the features I mentioned before: CXL 2.0, PCIe Gen 5, 16 lanes. We do support sideband signals over optics, with a separate fiber for the sideband signals. On that fiber we can transfer many signals using our proprietary encoding scheme: SMBus, PCIe reset, presence detect, wake, clock request, whatever is required, and of course the reference clock if you want to run in common clock mode. All of those signals are available to you. The PCIe card and the OCP card have a retiming function built in, which performs jitter reduction and signal integrity cleanup. The active optical cables do not; they're more like a linear drive, or direct drive, so the latency on the active optical cable is much lower, right around 1 nanosecond plus your time of flight. If you're interested in hearing more about these products, please contact [email protected].