YouTube: https://www.youtube.com/watch?v=YdJWhqeT5DM
Text:
Okay, so I guess I will talk about CXL memory as well, but I'm going to actually bring a twist to this that you have not heard about yet. In fact, I may respectfully disagree with some of the statements that have been made about how the system is supposed to operate.
So a little bit about us. We are a company that has been around for about three years. We are still pre-silicon; we're working very diligently on it. The company was formed by putting together people from various walks of life, from various departments, products, and projects. We have between us people who built the TPU side at Google, people who built server systems like UCS, myself, who worked at SGI building their high-end graphics systems before they ceased to exist, and people who built much of the cloud-scale software that runs almost all the services you have all used.
This is our rogues' gallery. I'll not spend too much time on it; I only mention it to set the stage that we looked at the problem a bit more holistically, rather than taking a relatively narrow view of what memory means to a system.
So I'll walk you through a little bit of history. And bear with me. It's data that you've probably seen before, but I want you to look at it a little bit differently.
This graph is a rather famous graph. It was actually generated by John McCalpin and has been used as the reference for understanding the relationship between flops and memory across the board. The upper two lines are trend lines showing the relationship between the growth of flops and latency, for network and for memory. The bottom two lines show the same for bandwidth relative to flops. Now, the interesting thing you should all walk away with from this graph is that you cannot make memory fast enough. It's just not going to happen. And this is a trend from 2000 to 2023. But there is one thing that's very interesting here: one of these lines does a very interesting thing in the space between 2017 and 2023, and we'll talk about it.
In 2000, if you look just at bandwidth for a second, the ratio of bandwidth between I/O and memory was 1:13. This was SGI's Origin 2000, using NUMA as its core backbone. In 2012, that ratio had shrunk to 1:2. This was UPI versus the 40 gigabit Ethernet that existed at the time. In 2023, that is not true anymore; in fact, the graph has flipped. Network bandwidth has now caught up with memory bandwidth. And the fundamental premise that I want you to walk away with, and we can argue whether you agree or not, is that when you want to add memory capacity now, it no longer needs to be close. I assert that it can be added through the I/O layer, and it can be far away, because the bandwidth is already there. And we'll obviously examine this in greater detail.

As Charles said earlier, the biggest memory driver of the day is machine learning. And machine learning is really a problem of finding a pattern through a mass of data. So really, you're not talking gigabytes, you're not even talking terabytes; you're talking petabytes most of the time. This graph shows the same thing for flops. The line on the top shows the growth of flops, including GPUs, TPUs, and every other accelerator that we could find and plot on the graph, and the bottom two lines show the progression of capacity across PCIe and the various DDR technologies. What is also self-evident from this graph is that the flops are racing away; capacity is not catching up. And the assertion still is that, unless you want to hurt flops, adding memory capacity through I/O is indistinguishable from adding it locally, because anything you do that is one tier away, like CXL, is going to lower your flop rate. It's just going to affect performance. There are no two ways around that.

And the important part, where networking has arrived today and why this is an inflection point, is that an 800 gig network, a single channel of 800 gigabits, which is 100 gigabytes per second, is about twice the fastest DDR5 channel you can find. The memory wall, as Charles also mentioned, is real. The capacity ask is growing at 240x every two years. We are not talking about a couple more ports of DRAM solving this problem; we are talking about growth that is so exponential that any kind of locality is not going to work out. So the fundamental point I want to make here is that the growth is exponential, and we have to think in rather big ways about how we are going to address this memory growth problem. Everything that was said until now is required and essential and foundational. But the way we wanted to look at this problem was: what is the big envelope, the top-line problem we need to solve? And then break it down into the lower-level pieces. Again, since machine learning is the thing that's driving memory, this is an interesting data point: GPU memory has grown 2x every two years. So it's losing at a rate of 120x every two years.
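For concreteness, the arithmetic behind those two claims, bandwidth parity and the widening capacity gap, works out roughly as below; the DDR5-6400 channel figure is my own illustrative assumption, not a number from the talk.

```latex
\[
\underbrace{800\ \mathrm{Gb/s}}_{\text{one network channel}}
  = 100\ \mathrm{GB/s}
  \approx 2 \times \underbrace{51.2\ \mathrm{GB/s}}_{\text{one DDR5-6400 channel}}
\]
\[
\frac{\text{capacity demand growth}}{\text{GPU memory growth}}
  = \frac{240\times \text{ per 2 yr}}{2\times \text{ per 2 yr}}
  = 120\times \text{ per 2 yr}
  \quad\Rightarrow\quad
  \text{GPU-local memory falls behind by } \sim\!120\times \text{ every two years.}
\]
```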
So this is a graph that you will see, I suspect, eight to ten times today. But we have a slight twist on it: the last column. And I'm claiming this taxonomy for myself. So if you think about the layering of memory, think of it as L1, L2, L3, L4, L5. L1 is your CPU's local memory. There's nothing you can do there; that's the fastest layer, and you don't want to get inside it. L2 is your NUMA-coherent domain. You're going to get some memory from there and use it. And notice, I say that the software programmer's view of that could be dereferencing a pointer, but almost always, if you care about performance, it's going to be a high-performance memcpy; you want it in your local MMU and in your local caches. Greg talked about this quite a bit. Think of L3 as what we call near-far memory: it's somehow a hop away, a switch layer. And you can see numbers of 170 to 250 nanoseconds if you're uncontended; if you're on a switched fabric, 300 to 400 nanoseconds is what you should expect. But there can be real problems here. A thing that everybody should think about is: if I have 1,000 consumers of memory and I have a gigahertz clock, how long is your last thread going to have to wait? It's physics, right? At that point, you can't just come around and say, "Oh, no, no, don't worry about it, it's fast." It's going to take time. And how you organize this memory, how you set up lanes, and how you parallelize are going to be the things that define whether you have predictable performance on the load-store path or not.
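A back-of-the-envelope sketch of that last-thread question, using the latency numbers from the slide; the one-request-per-cycle service rate is an assumption I'm adding purely for illustration.

```python
# If a single memory target can retire one request per cycle at 1 GHz,
# the last of 1,000 queued requesters waits ~1,000 cycles before the
# uncontended fabric latency is even counted. Numbers are illustrative.
CLOCK_HZ = 1e9                 # assumed 1 GHz service clock
CONSUMERS = 1_000              # threads hitting the same memory target
UNCONTENDED_NS = 350           # mid-range switched-fabric latency from the talk

queueing_ns = CONSUMERS / CLOCK_HZ * 1e9       # ~1,000 ns of pure queueing
worst_case_ns = UNCONTENDED_NS + queueing_ns

print(f"queueing delay  : {queueing_ns:.0f} ns")
print(f"last thread sees: {worst_case_ns:.0f} ns")
# Parallel lanes divide the queueing term by roughly the lane count, which
# is why the talk stresses how you organize lanes and parallelism.
```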
And then there are the other layers that we talked about. Like I said, if you think about that L4 layer, and you realize that that memory is available at 100 gigabytes per second, you can now make the statement that what I really want to do is optimize the access pattern at every layer. I want L1 to be the fastest it can possibly be. I want L2 and L3 to be the fastest they can possibly be, but not such that the performance of L1 is impacted. I want the L1 layer to be the place where the hottest data lives. And then I want L3, L4, and L5 to be the hierarchy that allows me to go and attack the problems that are not a few gigabytes, that are not a few terabytes. Because at the end of the day, if the problem is a petabyte-sized problem, I have to solve it somewhere. It doesn't help me if I can access two terabytes really fast but after that I'm back to the old regime. You have to solve the continuum. And the purple border shows the layers we spend most of our time focusing on, because we are not a CPU company. And so the question became: what is the right way to layer memory and networking such that you can address all of those layers simultaneously?
So let's examine some history as well.
So the picture on the left, for which it turns out I was the last technical leader, is SGI's NUMA system based on the IA-64 CPU line. You might recognize exactly what it did: this is exactly where the CXL memory system is trying to get to. And to Olivier's point, once you get type 1 and type 2 added, this is sort of the holy grail: a bunch of NUMA-connected nodes, memory fully distributed, I/O fully distributed, and everything can talk to everything else. This machine, by the way, is circa 2003. But what ended up winning was the picture on the right. That's a Google data center with a 40 or 100 gig network, and it was built to be completely sharded, highly fault-tolerant, and extremely resilient, with an almost maniacal approach to sharing nothing. And there was a reason for those two to coexist. And when we get to how we think the solution needs to be built, we'll talk about how these two sides need to be balanced.
Let's talk about actual applications again. Look at GPUs, which are, again, the largest driver of demand. The way people are building GPU systems today is that they are actually over-provisioning cores. The way you solve the memory problem is to say: I will use N GPUs such that the HBM capacity of those GPUs can completely absorb the problem, and I will keep cores dark. The other way people have solved the problem is to attach CXL memory to the CPU, as I think Greg was showing on the right-hand side of this picture. But that has its problems as well, because now the link that connects the external world to the internal world is a bottleneck, and the memory controller, which is not particularly optimized for this kind of balancing, has to deal with the disparity between DDR and CXL DRAM. In addition, as you can see in the picture, the NIC, or rather the network, and the accelerators have second-class access to that CXL memory if it's attached this way. Clearly, you can attach it behind a CXL switch as a first-class device, but that creates a new kind of problem, and we won't really get into that today, though we can talk about it offline if anybody is interested. The bottom line in all cases, and this is the part where I agree with the previous statements, but in a slightly different way, is that over-provisioning of memory has been used to offset relatively poor I/O designs. So I would restate that statement: instead of the solution being to avoid I/O, do I/O correctly, such that it cooperates with memory in the right way.
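To make the over-provisioning point concrete, here is a small sizing sketch; every number in it is an assumption of mine for illustration, not a figure from the talk.

```python
import math

# Illustrative sizing: pick the GPU count by HBM capacity, then see how
# much compute that drags in. All numbers are assumptions for the example.
MODEL_PLUS_STATE_GB = 4_000      # working set that must live in HBM
HBM_PER_GPU_GB = 80              # e.g. an 80 GB accelerator
GPUS_FOR_COMPUTE = 16            # what the math itself actually needs

gpus_for_memory = math.ceil(MODEL_PLUS_STATE_GB / HBM_PER_GPU_GB)   # 50
dark_fraction = 1 - GPUS_FOR_COMPUTE / gpus_for_memory

print(f"GPUs bought for capacity : {gpus_for_memory}")
print(f"GPUs needed for compute  : {GPUS_FOR_COMPUTE}")
print(f"compute left dark        : {dark_fraction:.0%}")   # ~68% dark
```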
So far, we've talked about all the problems. Clearly, no one wants to hear that. So let's talk about what the solutions are.
So this is how we envision it. We are a silicon and software company, and we are building a device that fundamentally looks like this. Obviously, that's a cartoon, but what the device does is allow you to move memory in a host of different ways, and it integrates the networking aspect into it. Without waiting for CXL 3.0, and even compared with CXL 3.0, which is still a home-node-driven data movement model, it allows you to break away from that: move memory independently between the devices connected to it, and use CXL as a way to create coherent access if that's what you want to do. If you don't want to do that, it allows you to do standard DMA-type operations behind the scenes and fill in those L3 and L4 layers on the fly, as much as you want. And most importantly, it allows the network and everything else attached to it to talk to it as if they were pages of memory. This is rather important, because if you imagine a scenario, or a manifestation, like this, it solves many of the problems that plague common system architectures being built today. We've addressed the bottleneck issue. We've created a massive switching fabric that gives you the bandwidth and the capacity that you need. And as you'll see later in the talk, the network is integrated into the system and completely transparent, because pages are all that anybody sees; your endpoints don't even know that they are sometimes talking to the network. By building that model into the system, by baking it in, we can create hierarchies and access patterns that are otherwise not possible. So the device, we call it the accelerated compute fabric.
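As a purely hypothetical illustration of what "endpoints only see pages" could look like from software, here is a minimal sketch that maps a far-memory region into a process as ordinary bytes. The device path, the region size, and the idea that the fabric appears as a mappable device node are my assumptions for the sketch, not the vendor's actual interface.

```python
import mmap
import os

# Hypothetical: the fabric exposes a far-memory region as a device node.
# Path and size below are illustrative assumptions only.
FAR_MEM_PATH = "/dev/acf_farmem0"   # hypothetical device node
REGION_BYTES = 1 << 30              # map 1 GiB of far memory

fd = os.open(FAR_MEM_PATH, os.O_RDWR)
region = mmap.mmap(fd, REGION_BYTES)

# To the application this is just bytes in its address space; whether a
# given page is local DRAM, CXL-attached DRAM, or memory across the
# network would be decided by the fabric, not by this code.
region[0:11] = b"hello pages"
print(bytes(region[0:11]))

region.close()
os.close(fd)
```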
And the two important things to note about it are that it has no proprietary interfaces, all the interfaces are completely standard and open, and that if you have a device that meets those standards, we can talk to it. So I'll pause for a second, let that sink in, and then show you this picture.
So if you have been around data centers, you should recognize this picture. It's a standard Clos topology. This is how every one of the world's largest hyperscaler infrastructures looks: multi-node, multi-homed. The lowest level is your compute elements. The layer in the middle is typically where services are served from. And the entire capacity of the cluster is the capacity of the system. Like I was saying earlier, by integrating networking into the memory hierarchy natively, we allow you to build a system that looks like that. Tier 1 is all your compute elements. Tier 2 now allows you to supply memory as a service; you know where your firewall and your load balancer used to live? Now you can have memory as a service there. And Tier 3 allows us to effectively use RDMA-type technologies to make the memory capacity of your system the memory capacity of your entire cluster. This is how you solve the petabyte problem, because if you're trying to put a petabyte into a single server, that's not going to work out, and if you try to solve it any other way, you're going to have to deal with the cliff. So our answer is: build the hierarchy, make it seamless, make it all about memory, and supply all the things you need to make that seamless hierarchy work.
So let's look a little bit, obviously theoretically, at how having access to the full hierarchy could really affect outcomes.
A very big problem today, thanks to the rise of ChatGPT, is inference at scale. And inference at scale is really all about maximizing the number of users you can keep running and managing the token and KV caches associated with those users. If you think about the problem in a very simplistic way, where I pick the server on which I'm going to do all the caching, I may leave capacity stranded, because I may have picked all the wrong servers for all the wrong users at any one instant in time. We assert that if you build your system this way, and if you remember the hierarchy, the GPUs, I guess I can't really point, are the L1s and L2s, because you're going to use their HBMs to keep the current context hot. Your locally attached, CXL-attached memory can be your L3, or it can be CXL DRAM attached to us, and your L4 and L5 can go out to the network. What this allows you to do is keep your context caches always in L4 and L5 and bring them into the GPU on demand as the user warms up. By doing that, and our simulation data is shown in the graphs on the right, you get to a point where your utilization, the number of GPUs you need to satisfy your SLA, shown by the red line, shifts dramatically, because all we have effectively done is say the memory is now in the network; we'll bring it close to you when you need it, and then you can use it. It's a relatively simple thing, but when you have an 800 gigabit network and multiple lanes of those, you can do it. It's a couple of microseconds against a couple-of-seconds inference task, so it just works out. The net result for the numbers we analyzed is effectively a 50% reduction in GPUs, which is a pretty dramatic reduction in cost, power, and heat.
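A minimal sketch of the tiering idea described above: per-user KV caches park in far memory (L4/L5) and are pulled into GPU HBM only while that user's context is active. The class, the slot counts, and the in-process dictionaries standing in for HBM and far memory are my assumptions for illustration; this is not the vendor's software.

```python
from collections import OrderedDict

class TieredKVCache:
    """Keep hot per-user KV caches in 'HBM'; spill the rest to far memory."""

    def __init__(self, hbm_slots: int):
        self.hbm_slots = hbm_slots            # how many user contexts fit in HBM
        self.hbm = OrderedDict()              # user_id -> kv blob (hot tier)
        self.far = {}                         # user_id -> kv blob (L4/L5 tier)

    def get(self, user_id):
        """Return the user's KV cache, promoting it to HBM on demand."""
        if user_id in self.hbm:
            self.hbm.move_to_end(user_id)     # mark as most recently used
            return self.hbm[user_id]
        blob = self.far.pop(user_id, b"")     # "network fetch" of the cold cache
        self._admit(user_id, blob)
        return blob

    def put(self, user_id, blob):
        self._admit(user_id, blob)

    def _admit(self, user_id, blob):
        self.hbm[user_id] = blob
        self.hbm.move_to_end(user_id)
        while len(self.hbm) > self.hbm_slots: # evict LRU users to far memory
            cold_user, cold_blob = self.hbm.popitem(last=False)
            self.far[cold_user] = cold_blob   # microseconds over 800G links vs. seconds of inference

cache = TieredKVCache(hbm_slots=2)
cache.put("alice", b"kv-a"); cache.put("bob", b"kv-b"); cache.put("carol", b"kv-c")
assert "alice" in cache.far                   # alice spilled to the far tier
cache.get("alice")                            # warmed back into HBM on demand
```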
You have a question? Okay.
So, one question on this graph: for gen-AI-based inference with large language models, there are really two use cases, right? I mean, one is more like a chatbot, which is latency driven, and the other is document processing.
Sure, offline, yeah.
So, of course, one of the challenges with it is that it's sequential processing, token by token, which for the latency case becomes a pure bandwidth problem. As you point out, the way we solve this today is basically by tightly clustering GPUs. So it appears to me that with your approach, how do you solve the memory tiering and how do you solve the bandwidth? Because you still have a limited set of PCIe lanes per CPU.
Well, it depends, right? The benefit here is that you can actually do the sharding the way you want to do it. This may be a longer conversation and maybe worth taking offline, but the general statement is this: if you think of this as a uniform template of horizontal scaling, you decide the GPU size such that your HBMs fit your active context, because no matter what you do, you have to put that there. But your per-user token caches you can move around whenever you want. Your model can also be shared and replicated. We have a few magic tricks we can do here, and as I said, that may be a good offline conversation to have, but there are things we can do that can dramatically improve your net GPU utilization, and your exposed communication time, or your setup time, can be dramatically reduced. And that really is what it comes down to. That 50% number comes from, as the graphs we're showing also say, getting much better utilization out of the same GPUs simply by-- you got it? Okay. And I'm happy to talk later as well.
Okay. And I know we are running behind on time, so I'm going to zoom through a few of these things. This graph, hopefully, is super non-controversial: memory is faster than storage. I think we've established that in a few talks today already. However, it is interesting to note that I always thought the Optane business went away because there wasn't much user demand. That actually is not the case. We've learned over the years that there was actually a pretty big demand for it; there were some other reasons that made it go away. But Optane was the thing that people were looking at as the way to expand memory. So what can we do that's different?
Let's look at two specific use cases. This one, hopefully, is fairly familiar to everybody: Redis, your hash-based object store or object database that is pretty popular today. I think the text might be really hard to see, but a very typical Redis scaling design works like this: you take the hash and create a bunch of shards across nodes, also known as computers. Each of the shards could be a container on its own, and then you just stripe them out. And performance is now a function of how big the memory capacity of that node is. Now, we talk about sharing, and there were a bunch of very interesting and super exciting demos earlier, but sharing in this kind of context is problematic, because it is not statistically highly predictable, and you could end up with severe contention, which creates a new layer of problems. But if you look inside, what are we saying? Each of those nodes is just some compute and some memory. Our answer is: build it this way. Put the compute on a CXL peer link, put the memory on a CXL peer link. This is no different from all the pictures that Greg was showing earlier, with one difference: the orange lines.
The orange lines allow you to move the memory out to the network, which allows you to refactor and reframe the problem: I can now have a configurable node size, I can have a configurable shard size, and this can move dynamically. And the in-memory capacity is now the capacity of your entire data center. It's not the size of the node, it's not the two terabytes you plugged into that particular rack; it's wherever you can find it. Because if you're doing a Redis query, and you were really only looking for memory capacity and the ability to pull it in quickly, we can do that over those multiple 800 gigabit lines.
And we can feed in very large contexts in time, such that your shards are satisfied in time.
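A toy sketch of the refactoring described above: the shard map still comes from a hash, but each shard points at a memory pool whose size and location are configurable rather than being pinned to one server's DIMMs. The class names, the pool abstraction, and the capacities are assumptions for illustration.

```python
import hashlib

class MemoryPool:
    """Stand-in for a slice of fabric-attached memory (local, CXL, or remote)."""
    def __init__(self, name: str, capacity_gb: int):
        self.name, self.capacity_gb, self.data = name, capacity_gb, {}

class ShardedStore:
    """Hash-sharded key/value store whose shards live on configurable pools."""
    def __init__(self, pools):
        self.pools = pools                       # shard placement, re-sizable at will

    def _shard(self, key: str) -> MemoryPool:
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        return self.pools[h % len(self.pools)]   # classic hash sharding

    def set(self, key, value):
        self._shard(key).data[key] = value

    def get(self, key):
        return self._shard(key).data.get(key)

# Shards can mix a node-local pool with pools served over the fabric;
# capacity is whatever the cluster has, not what one box holds.
store = ShardedStore([
    MemoryPool("local-dram", 256),
    MemoryPool("cxl-pool-a", 2_048),
    MemoryPool("rack-far-mem", 65_536),
])
store.set("user:42", {"cart": ["widget"]})
print(store.get("user:42"))
```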
Another example we'll talk about is memcached. Memcached, again, is hopefully a fairly well understood problem: storage is the system of record, so to say, and the whole point is to provide smaller, faster caches out of a server's memory. The replication and sharding responsibility is given to the storage layer, and the servers just do the front-end presentation. On our system, you can re-imagine the implementation to look like this. And this looks just like memcached at this point, right? You have the storage up top, you have the CPUs in front, which are your web servers, and you have this very high-speed network sitting behind it. But now, imagine I did that: the memcached tier is now completely separated from the servers. The servers you provision for your compute or query-rate requirement, and the memcached tier you provision for the memory that you need. Since memcached is really a latency ask, by exploiting the full hierarchy, L4 and L5 all the way through L3, L2, and L1, you can provide the latency guarantees that you need while providing memcached services that run fundamentally out of memory. The fact that there is storage is sort of incidental in this case; you could actually eliminate the disk completely. For people, again, who are running petabyte memcached installations, this can be a pretty dramatic difference.
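To make the separation concrete, here is a small sizing sketch under assumed numbers: front-end servers are counted from the query rate alone, and the memory tier from the cache footprint alone, instead of coupling the two in one box. Every number is an assumption of mine, not a measurement from the talk.

```python
import math

# Illustrative capacity planning for the split design above.
TARGET_QPS = 5_000_000          # aggregate cache query rate
QPS_PER_FRONTEND = 250_000      # what one web/memcached front-end can serve
CACHE_FOOTPRINT_TB = 1_000      # ~1 PB of cached objects
TB_PER_MEMORY_NODE = 32         # fabric-attached memory per memory node

frontends = math.ceil(TARGET_QPS / QPS_PER_FRONTEND)               # sized by compute
memory_nodes = math.ceil(CACHE_FOOTPRINT_TB / TB_PER_MEMORY_NODE)  # sized by capacity

print(f"front-ends (query rate) : {frontends}")     # 20
print(f"memory nodes (capacity) : {memory_nodes}")  # 32
# In a converged design you would buy enough identical boxes to cover the
# larger of the two dimensions; here each axis scales independently.
```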
A couple of points in closing. If you think of the system this way, it is actually an orthogonal and complementary axis of growth to the axis of CXL switches and CXL technology, which will take its time to get there; it's not going to be a snap of the fingers and it all just works tomorrow. While we make that work well, and while that allows us to build better L1, L2, and L3 layers, we can already put L4 and L5 layers into service and get around the data-set explosion that we are all having to live with today. This sort of system architecture applies to HFT and grid-computing-type applications. It allows you to build effectively multi-socket systems without having to build multi-socket systems. And it eliminates a significant amount of the I/O, memory, and NUMA locality binding that you have to deal with today, which is not going to be solved by just expanding CXL memory and leaving it there. And yeah, we can talk more about how it also helps with other accelerator devices like FPGAs and data-transformation-type logic, which is becoming more and more popular in today's data centers and systems.
More information can be found here. I'm around; if you have questions, come find me and I'm happy to talk more. That's all I had. Thank you for listening, and happy to be here. And thank you for the forum, by the way, Charles.