YouTube: https://www.youtube.com/watch?v=UP1a-OUuLSE
Text:
I'm here to talk about memory fabrics. At GigaIO, we're doing memory fabrics today with PCIe, and we look forward to the evolution of that into CXL. We do composable systems, both hardware and software. So let's get into it.
To recap a couple of slides we presented at SC23: here are the different components you might find in a server or a cluster: storage, network and NIC cards, accelerators, CPUs, and memory. You can see those blocks there. Legacy networks like InfiniBand or Ethernet have no problem with that; they've been doing storage and more for a while. At GigaIO, we call our fabric FabreX, and today it handles all of these components, including memory. Memory over the fabric will get even faster, with lower latency, as new components, devices, and equipment come on the market.
Another slide from SC23, the memory tier. I'm sure you've all seen this. Down at the bottom of the tier you see hard drives and storage SSDs, which are slow, with high latency. At the top you have cache and main memory, right on the CPU or the motherboard. In the middle there's disaggregated memory, which FabreX supports today, and then CXL memory, both external to the server and inside the server. Those are new classes of higher-latency memory; they're not quite as fast as DDR, but they'll open up new use cases for memory. There's a lot of discussion about buying less memory by pooling it instead of stranding it on servers and other devices, and we hope disaggregated memory makes that more efficient. But history tells us, as you can see with AI large language models, that workloads just keep consuming more memory, so that memory purchase curve will continue to rise. There's a lot of work to be done here, especially as CXL memory comes out, to build production-ready solutions, and making the fabric robust and reliable is something we work on every day. A lot of experimentation and research has started; even today we've gotten our hands on large numbers of accelerators so that we can test larger fabrics in our labs and see how they perform. Let me get into that a little bit.
Just backing up to AI: of course it's driving enormous amounts of data for analysis, which creates lots of challenges for compute and storage clusters. What people need today are flexible, easy-to-upgrade architectures. And low utilization of very expensive accelerators and memory pools is just not good, right? They consume a lot of power and they're very expensive, so you want to make sure they're being utilized. Having a composable or disaggregated fabric helps with that.
Let me back up a little and explain what composable means. I'm sure a lot of you know this, but as an intro, it's all about whether the devices are inside or outside the server. We call it converged when all the components are inside the server. On the left you can see the server: it has CPU, memory, NVMe storage, and accelerators; typically a GPU server. As you disaggregate, taking the components out, that's called composition: you're composing these components in different boxes over a fabric. The server keeps its memory; the DRAM might be 24 sticks. JBOF, just a bunch of flash, is NVMe storage. Those have been out for many years and are a known commodity, and NVMe storage can tolerate higher latency. JBOG, just a bunch of GPUs, is a product we offer today: a pooling appliance that holds all the accelerators. Think of them as an NVIDIA H100 or an AMD MI210; these are PCIe cards, and we can fit eight of them in that box. What's nice about having them in a separate box, as opposed to crammed into a GPU server (and I used to work on GPU servers), is being able to access those components easily: there are trays to pull out the accelerators. And then JBOM, just a bunch of memory; that's coming soon, but it will take some time, could be this year or next year, before we really see those, and other suppliers may have been talking about this earlier. It'll take time for those to emerge on the market, to see what CXL memory will look like and have it work reliably over the fabric. There's some work to do on server BIOS, the Linux OS, things like that. If you remember back to NVMe, it took some time to mature, and the same thing will happen with CXL memory.
Here are the building blocks, the different components we offer as products. At the very top you have the fabric switch; this is one piece of GigaIO gear that's very important. It's a PCIe switch, and it also handles the fabric management, so it connects to all these different components. JBOF, just a bunch of flash, we call the storage pooling appliance; JBOG, just a bunch of GPUs, we call the accelerator pooling appliance; and at some point in the future we'll have a memory pooling appliance. I didn't show the fabric card, but the cabling is there. The fabric card looks like a typical PCIe card; it slots into the server to provide the PCIe cabling up to the switch, which then distributes to these different pooling appliances. Below that you see optimized servers. You can use off-the-shelf servers, with a limited number of GPUs you might be able to compose, but we've worked with partners like Dell and Supermicro, among others, to come up with engineered solutions, optimizing the BIOS so it can handle large numbers of accelerators. As an example, with NVIDIA we can do 24 V100s, and H100s as well, and with AMD, a close partner of ours, we're getting up to 32 of their MI210 GPUs.
So let's build a rack. In what we call this stack, a half rack, 24 rack units high, exactly half the size of a 48U rack, we can fit 32 GPUs; you could call it an appliance. Down at the bottom of the stack you'll see a single server, shown here as a Supermicro; it could be a Dell. Then there are the fabric switches; the count depends on the topology, and we have different topology configurations. For this one we engineered the solution with three fabric switches and four of those accelerator pooling appliances. On that single server, when you look in lspci or in PyTorch, you just see a bunch of GPUs. Again, I used to work on GPU servers, and typically in a 4U you can get 8 or 10 GPUs, no problem. But to connect those GPU servers you have to use InfiniBand or Ethernet, which often adds a protocol-translation tax because of all the overhead. Because this is native PCIe, it's much faster, and when CXL comes out it will be even faster, so these accelerators will have lower latency. We call this the easy button for AI workloads: it's one server, you just put your software on there, and all the GPUs show up as if they were in a single server. To me, this is really a killer app.
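As a rough illustration of the software view, here's a minimal PyTorch sketch, assuming the fabric has already attached the composed accelerators to this host and the standard NVIDIA driver stack is installed; the composed GPUs simply enumerate as local CUDA devices:

```python
# Minimal sketch: composed GPUs enumerate like local devices to PyTorch.
# Assumes the fabric has already attached the accelerators to this host
# and the usual NVIDIA driver/CUDA stack is in place.
import torch

print(f"Visible GPUs: {torch.cuda.device_count()}")  # e.g. 32 on a fully composed stack

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")

# From here, ordinary multi-GPU code (DistributedDataParallel and so on)
# runs unchanged, because the composition happens below the CUDA driver.
```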
So what is a memory fabric, a GigaIO memory fabric? When you think about fabrics, it's any-to-any: all these components can talk to each other, as opposed to a network, where you might have limitations on who talks to whom. On a fabric, or in a cluster, everybody talks to everybody. On the left I have the upstream host servers, shown as a Dell and a Supermicro. They'll have memory on them: could be 1.5 terabytes of DRAM on one, 3 terabytes on the other. Then it goes up to our switch with cabling, down to the storage pooling appliance, and to those four accelerator pooling appliances, getting up to 32 GPUs. GPUs like the H100 have 80 gigabytes of memory each, so if you add that up, it's about 2.6 terabytes. You have terabytes of memory on both sides here. And again, as we add the CXL memory pooling appliance, depending on the quantities you can fit in a box, you're talking about terabytes more of memory. What's nice is that this memory can be accessed by devices across the fabric.
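Just to show where that "about 2.6 terabytes" comes from, a quick back-of-the-envelope calculation using the figures above (32 GPUs at 80 GB each, plus the 1.5 TB and 3 TB host servers):

```python
# Back-of-the-envelope memory totals for the 32-GPU stack described above.
gpus = 32
hbm_per_gpu_gb = 80                      # e.g. NVIDIA H100

gpu_side_gb = gpus * hbm_per_gpu_gb
print(f"Accelerator side: {gpu_side_gb} GB = {gpu_side_gb / 1000:.2f} TB")   # 2560 GB, about 2.6 TB

host_dram_tb = 1.5 + 3.0                 # the two upstream host servers
print(f"Host side: {host_dram_tb} TB of DRAM")                               # 4.5 TB
# A future CXL memory pooling appliance would add terabytes more on the same fabric.
```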
So let's talk about software. The Linux software libraries we're using today are listed in the middle: NVMe-oF, MPI (Message Passing Interface), IP networking over TCP/IP, and, very important for AI workloads, GPUDirect RDMA. These GPUs are all talking to each other and sharing memory over the FabreX memory fabric, using Linux software libraries that are available today. Eddie, the next slide.
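As one concrete example of how those libraries fit together, here's a minimal sketch, assuming mpi4py built against a CUDA-aware MPI and CuPy for the GPU buffers; with that setup, a device-resident buffer can be handed straight to MPI and travel on the GPUDirect RDMA path instead of bouncing through host memory:

```python
# Minimal sketch: CUDA-aware MPI moving GPU-resident data between two ranks,
# the kind of GPU-to-GPU traffic GPUDirect RDMA accelerates over the fabric.
# Assumes mpi4py built against a CUDA-aware MPI and one visible GPU per rank.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = cp.full(1 << 20, rank, dtype=cp.float32)   # 1M floats living in GPU memory

if rank == 0:
    comm.Send(buf, dest=1, tag=0)                # device buffer passed directly to MPI
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)
    print("rank 1 received, first element:", float(buf[0]))   # 0.0, written by rank 0
```

Run it with something like `mpirun -n 2 python gpu_send.py` (the script name is just for illustration).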
Eddie, there's a question for you, so let me pass it along: is the FabreX switch built over an Ethernet switch?
It's built over a PCIe switch. If you were to look inside these boxes, the fabric switch at the top of the rack, and sometimes the servers themselves, you'd find PCIe switch chips, which typically come from Microchip or Broadcom. Same thing with the pooling appliances: they almost always have a PCIe switch chip in them.
Let's talk a little about how CXL will apply to the FabreX memory fabric. What's really important is that those PCIe switch chips, at the end of this year or next year, start to come out with CXL 3.0 support, which allows for fabrics and for devices talking to each other. The fabric will move from PCIe to CXL, and we start to enable that lower-latency memory and devices talking to each other, which will benefit the fabric. The servers available today, AMD Genoa and Intel Sapphire Rapids, can do CXL 1.1, but next year we hope to see CXL 3.x on newer servers, which will again allow the CXL fabric to start to emerge. What's important here, shown in the middle, is cache coherency. The hard part is keeping the data in memory all synced up, so you need a cache coherency protocol, which CXL gives you, so that the bits look the same at any of these devices. We'll talk a little more about that.
So, the need for coherent sharing: the path forward is CXL. With the CXL.mem, CXL.io, and CXL.cache protocols you've heard about, the coherency protocol is key, and that's what I was just mentioning. Any device that reads on the fabric sees the latest value: somebody writes it, and everybody else sees the update. Today the fabric is all done with PCIe, and CXL is built over PCIe. Since that's the backbone of the server and all its components, as the CPU and PCIe switch vendors start to add CXL, we get that lower-latency memory while still avoiding the Ethernet and InfiniBand protocol-translation tax.
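CXL.cache provides that coherency in hardware, across CPUs and devices, but as a loose software analogy of the "somebody writes, everybody sees the latest value" behavior, here's a small sketch with two processes sharing one memory region:

```python
# Loose software analogy only: two processes share one memory region, and a
# read observes the most recent write. CXL.cache makes that guarantee in
# hardware for devices on the fabric, with no software copies involved.
from multiprocessing import Process, shared_memory
import struct
import time

def writer(name):
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[:8] = struct.pack("q", 42)           # somebody writes it...
    shm.close()

def reader(name):
    shm = shared_memory.SharedMemory(name=name)
    time.sleep(0.1)                              # crude ordering for the demo
    print("reader sees:", struct.unpack("q", shm.buf[:8])[0])   # ...and sees 42
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=8)
    procs = [Process(target=writer, args=(shm.name,)),
             Process(target=reader, args=(shm.name,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    shm.close()
    shm.unlink()
```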
There are some challenges to overcome before we get there, but we can get there from here. Today, the maximum number of devices on a PCIe fabric is 256 per host, and depending on how many bus IDs you consume, that might translate to 80 nodes or some other number of devices (there's a rough sketch of the bus-number math below). Again, we're able to do 32 GPUs today and expect to do more as we roll out new servers. With CXL 3.1 and port-based routing, we should be able to offer up to 4K nodes and devices, which really starts to mean larger fabrics and larger clusters. So what are the use cases or workflows that take advantage of that composed memory? We have to get into the lab and see what works with these new memory devices as they come out on the CXL marketplace. As for optical: copper is no problem at 2 or 3 meters, which gets you just a few racks, and with 256 devices that's doable within a few racks. But when you get to 4,000, thousands and thousands of accelerators, you definitely need optical just to cross the rows of racks. So there's a need for optical in the larger fabric, and that's something we're working on. That was quick, but that's all I had. Any more questions?
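To make that 256 bus-number budget concrete, here's the rough sketch mentioned above; the three-buses-per-device figure is an illustrative assumption, since each composed endpoint sits behind switch ports that also consume bus numbers out of the same 8-bit space:

```python
# Rough, illustrative bus-number math for a PCIe fabric behind one host.
# The per-device cost is an assumption for illustration, not a GigaIO figure.
PCIE_BUS_NUMBERS = 256        # 8-bit bus field in one host's PCIe domain
buses_per_device = 3          # assumed: upstream port + downstream port + endpoint

devices = PCIE_BUS_NUMBERS // buses_per_device
print(f"Roughly {devices} composable devices per host")   # ~85, in line with the ~80 above

# CXL 3.x port-based routing replaces this single flat bus space with
# fabric-level addressing, which is what opens the door to ~4K nodes and devices.
```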