YouTube: https://www.youtube.com/watch?v=UP1a-OUuLSE
Text:
I'm here to talk about memory fabrics. At GigaIO, we're doing memory fabrics today with PCIe, and we look forward to the evolution of that into CXL. We do composable systems, both hardware and software. So let's get into it.
To recap a couple of slides we presented at SC23: here are the different components you might find in a server or a cluster: storage, network and NIC cards, accelerators, CPUs, and memory. You can see those blocks there. Legacy networks like InfiniBand or Ethernet have no problem with that; they've been doing storage and more for a while. At GigaIO, we call our fabric FabreX, and today it handles all of these components, including memory. Memory over the fabric will get even faster, with lower latency, as new components, devices, and equipment come on the market.
Another slide from SC23, the memory tier. I'm sure you've all seen this. Down at the bottom of the tier you see hard drives and storage SSDs, which are slow, with high latency. At the top you have cache and main memory, right on the CPU or the motherboard. In the middle there's disaggregated memory, which FabreX supports today, and then CXL memory, both external to the server and inside the server. Those are new classes of higher-latency memory; they're not quite as fast as DDR, but they'll open up new use cases for memory. There's a lot of discussion about buying less memory by pooling it instead of stranding it on servers and other devices, and we hope disaggregated memory makes that more efficient. But history tells us, as you can see with AI large language models, that workloads just keep consuming more memory, so that memory purchase curve will continue to rise. There's a lot of work to be done here, especially as CXL memory comes out, to build production-ready solutions, and making the fabric robust and reliable is something we work on every day. A lot of experimentation and research has started; even today we've gotten our hands on large numbers of accelerators so that we can test larger fabrics in our labs and see how they perform. Let me get into that a little bit.
Just backing up to AI: of course it's driving enormous amounts of data for analysis, which creates lots of challenges for compute and storage clusters. What people need today are flexible, easy-to-upgrade architectures. And low utilization of very expensive accelerators and memory pools is just not good, right? They consume a lot of power and they're very expensive, so you want to make sure they're being utilized. Having a composable or disaggregated fabric helps with that.
Let me back up a little and explain what composable means. I'm sure a lot of you know this, but as an intro, it's all about whether the devices are inside or outside the server. We call it converged when all the components are inside the server. On the left you can see the server: it has CPU, memory, NVMe storage, and accelerators; typically a GPU server. As you disaggregate, taking the components out, that's called composition: you're composing these components in different boxes over a fabric. The server keeps its memory; the DRAM might be 24 sticks. JBOF, just a bunch of flash, is NVMe storage. Those have been out for many years and are a known commodity, and NVMe storage can tolerate higher latency. JBOG, just a bunch of GPUs, is a product we offer today: a pooling appliance that holds all the accelerators. Think of them as an NVIDIA H100 or an AMD MI210; these are PCIe cards, and we can fit eight of them in that box. What's nice about having them in a separate box, as opposed to crammed into a GPU server (and I used to work on GPU servers), is being able to access those components easily: there are trays to pull out the accelerators. And then JBOM, just a bunch of memory; that's coming soon, but it will take some time, could be this year or next year, before we really see those, and other suppliers may have been talking about this earlier. It'll take time for those to emerge on the market, to see what CXL memory will look like and have it work reliably over the fabric. There's some work to do on server BIOS, the Linux OS, things like that. If you remember back to NVMe, it took some time to mature, and the same thing will happen with CXL memory.
Here are the building blocks, the different components we offer as products. At the very top you have the fabric switch; this is one piece of GigaIO gear that's very important. It's a PCIe switch, and it also handles the fabric management, so it connects to all these different components. JBOF, just a bunch of flash, we call the storage pooling appliance; JBOG, just a bunch of GPUs, we call the accelerator pooling appliance; and at some point in the future we'll have a memory pooling appliance. I didn't show the fabric card, but the cabling is there. The fabric card looks like a typical PCIe card; it slots into the server to provide the PCIe cabling up to the switch, which then distributes to these different pooling appliances. Below that you see optimized servers. You can use off-the-shelf servers, with a limited number of GPUs you might be able to compose, but we've worked with partners like Dell and Supermicro, among others, to come up with engineered solutions, optimizing the BIOS so it can handle large numbers of accelerators. As an example, with NVIDIA we can do 24 V100s, and H100s as well, and with AMD, a close partner of ours, we're getting up to 32 of their MI210 GPUs.
So let's build a rack. In what we call this stack, a half rack, 24 rack units high, exactly half the size of a 48U rack, we can fit 32 GPUs; you could call it an appliance. Down at the bottom of the stack you'll see a single server, shown here as a Supermicro; it could be a Dell. Then there are the fabric switches; the count depends on the topology, and we have different topology configurations. For this one we engineered the solution with three fabric switches and four of those accelerator pooling appliances. On that single server, when you look in lspci or in PyTorch, you just see a bunch of GPUs. Again, I used to work on GPU servers, and typically in a 4U you can get 8 or 10 GPUs, no problem. But to connect those GPU servers you have to use InfiniBand or Ethernet, which often adds a protocol-translation tax because of all the overhead. Because this is native PCIe, it's much faster, and when CXL comes out it will be even faster, so these accelerators will have lower latency. We call this the easy button for AI workloads: it's one server, you just put your software on there, and all the GPUs show up as if they were in a single server. To me, this is really a killer app.
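As a rough illustration of the software view, here's a minimal PyTorch sketch, assuming the fabric has already attached the composed accelerators to this host and the standard NVIDIA driver stack is installed; the composed GPUs simply enumerate as local CUDA devices:

```python
# Minimal sketch: composed GPUs enumerate like local devices to PyTorch.
# Assumes the fabric has already attached the accelerators to this host
# and the usual NVIDIA driver/CUDA stack is in place.
import torch

print(f"Visible GPUs: {torch.cuda.device_count()}")  # e.g. 32 on a fully composed stack

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")

# From here, ordinary multi-GPU code (DistributedDataParallel and so on)
# runs unchanged, because the composition happens below the CUDA driver.
```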
So what is a memory fabric, a GigaIO memory fabric? When you think about fabrics, it's any-to-any: all these components can talk to each other, as opposed to a network, where you might have limitations on who talks to whom. On a fabric, or in a cluster, everybody talks to everybody. On the left I have the upstream host servers, shown as a Dell and a Supermicro. They'll have memory on them: could be 1.5 terabytes of DRAM on one, 3 terabytes on the other. Then it goes up to our switch with cabling, down to the storage pooling appliance, and to those four accelerator pooling appliances, getting up to 32 GPUs. GPUs like the H100 have 80 gigabytes of memory each, so if you add that up, it's about 2.6 terabytes. You have terabytes of memory on both sides here. And again, as we add the CXL memory pooling appliance, depending on the quantities you can fit in a box, you're talking about terabytes more of memory. What's nice is that this memory can be accessed by devices across the fabric.
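Just to show where that "about 2.6 terabytes" comes from, a quick back-of-the-envelope calculation using the figures above (32 GPUs at 80 GB each, plus the 1.5 TB and 3 TB host servers):

```python
# Back-of-the-envelope memory totals for the 32-GPU stack described above.
gpus = 32
hbm_per_gpu_gb = 80                      # e.g. NVIDIA H100

gpu_side_gb = gpus * hbm_per_gpu_gb
print(f"Accelerator side: {gpu_side_gb} GB = {gpu_side_gb / 1000:.2f} TB")   # 2560 GB, about 2.6 TB

host_dram_tb = 1.5 + 3.0                 # the two upstream host servers
print(f"Host side: {host_dram_tb} TB of DRAM")                               # 4.5 TB
# A future CXL memory pooling appliance would add terabytes more on the same fabric.
```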
So let's talk about software. The Linux software libraries we're using today are listed in the middle: NVMe-oF, MPI (Message Passing Interface), IP networking over TCP/IP, and, very important for AI workloads, GPUDirect RDMA. These GPUs are all talking to each other and sharing memory over the FabreX memory fabric, using Linux software libraries that are available today. Eddie, the next slide.
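As one concrete example of how those libraries fit together, here's a minimal sketch, assuming mpi4py built against a CUDA-aware MPI and CuPy for the GPU buffers; with that setup, a device-resident buffer can be handed straight to MPI and travel on the GPUDirect RDMA path instead of bouncing through host memory:

```python
# Minimal sketch: CUDA-aware MPI moving GPU-resident data between two ranks,
# the kind of GPU-to-GPU traffic GPUDirect RDMA accelerates over the fabric.
# Assumes mpi4py built against a CUDA-aware MPI and one visible GPU per rank.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = cp.full(1 << 20, rank, dtype=cp.float32)   # 1M floats living in GPU memory

if rank == 0:
    comm.Send(buf, dest=1, tag=0)                # device buffer passed directly to MPI
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)
    print("rank 1 received, first element:", float(buf[0]))   # 0.0, written by rank 0
```

Run it with something like `mpirun -n 2 python gpu_send.py` (the script name is just for illustration).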
Eddie, there's a question for you, so let me pass it along: is the FabreX switch built over an Ethernet switch?
It's built over a PCIe switch. If you were to look inside these boxes, the fabric switch at the top of the rack, and sometimes the servers themselves, you'd find PCIe switch chips, which typically come from Microchip or Broadcom. Same thing with the pooling appliances: they almost always have a PCIe switch chip in them.
Let's talk a little about how CXL will apply to the FabreX memory fabric. What's really important is that those PCIe switch chips, at the end of this year or next year, start to come out with CXL 3.0 support, which allows for fabrics and for devices talking to each other. The fabric will move from PCIe to CXL, and we start to enable that lower-latency memory and devices talking to each other, which will benefit the fabric. The servers available today, AMD Genoa and Intel Sapphire Rapids, can do CXL 1.1, but next year we hope to see CXL 3.x on newer servers, which will again allow the CXL fabric to start to emerge. What's important here, shown in the middle, is cache coherency. The hard part is keeping the data in memory all synced up, so you need a cache coherency protocol, which CXL gives you, so that the bits look the same at any of these devices. We'll talk a little more about that.
So, the need for coherent sharing: the path forward is CXL. With the CXL.mem, CXL.io, and CXL.cache protocols you've heard about, the coherency protocol is key, and that's what I was just mentioning. Any device that reads on the fabric sees the latest value: somebody writes it, and everybody else sees the update. Today the fabric is all done with PCIe, and CXL is built over PCIe. Since that's the backbone of the server and all its components, as the CPU and PCIe switch vendors start to add CXL, we get that lower-latency memory while still avoiding the Ethernet and InfiniBand protocol-translation tax.
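CXL.cache provides that coherency in hardware, across CPUs and devices, but as a loose software analogy of the "somebody writes, everybody sees the latest value" behavior, here's a small sketch with two processes sharing one memory region:

```python
# Loose software analogy only: two processes share one memory region, and a
# read observes the most recent write. CXL.cache makes that guarantee in
# hardware for devices on the fabric, with no software copies involved.
from multiprocessing import Process, shared_memory
import struct
import time

def writer(name):
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[:8] = struct.pack("q", 42)           # somebody writes it...
    shm.close()

def reader(name):
    shm = shared_memory.SharedMemory(name=name)
    time.sleep(0.1)                              # crude ordering for the demo
    print("reader sees:", struct.unpack("q", shm.buf[:8])[0])   # ...and sees 42
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=8)
    procs = [Process(target=writer, args=(shm.name,)),
             Process(target=reader, args=(shm.name,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    shm.close()
    shm.unlink()
```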
There are some challenges to overcome before we get there, but we can get there from here. Today, the maximum number of devices on a PCIe fabric is 256 per host, and depending on how many bus IDs you consume, that might translate to 80 nodes or some other number of devices (there's a rough sketch of the bus-number math below). Again, we're able to do 32 GPUs today and expect to do more as we roll out new servers. With CXL 3.1 and port-based routing, we should be able to offer up to 4K nodes and devices, which really starts to mean larger fabrics and larger clusters. So what are the use cases or workflows that take advantage of that composed memory? We have to get into the lab and see what works with these new memory devices as they come out on the CXL marketplace. As for optical: copper is no problem at 2 or 3 meters, which gets you just a few racks, and with 256 devices that's doable within a few racks. But when you get to 4,000, thousands and thousands of accelerators, you definitely need optical just to cross the rows of racks. So there's a need for optical in the larger fabric, and that's something we're working on. That was quick, but that's all I had. Any more questions?
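To make that 256 bus-number budget concrete, here's the rough sketch mentioned above; the three-buses-per-device figure is an illustrative assumption, since each composed endpoint sits behind switch ports that also consume bus numbers out of the same 8-bit space:

```python
# Rough, illustrative bus-number math for a PCIe fabric behind one host.
# The per-device cost is an assumption for illustration, not a GigaIO figure.
PCIE_BUS_NUMBERS = 256        # 8-bit bus field in one host's PCIe domain
buses_per_device = 3          # assumed: upstream port + downstream port + endpoint

devices = PCIE_BUS_NUMBERS // buses_per_device
print(f"Roughly {devices} composable devices per host")   # ~85, in line with the ~80 above

# CXL 3.x port-based routing replaces this single flat bus space with
# fabric-level addressing, which is what opens the door to ~4K nodes and devices.
```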