All right. So, switching gears, Samir Rajadnya from Microsoft and I are going to talk about our hyperscaler use cases. We're going to come back to the questions, "Why do we need this? How do we think we can use it?" or at least share some of our current thoughts. Let's see where it goes from there. Right, Samir?
So, let me start off with some of the ways Meta looks at this. I'm going to connect what Meta does to why we are trying to do what we are doing today. We are trying to connect people, give them new experiences, and connect them with their family and friends. Meta has been using AI for a long time; this is not the first time we are using it. But now, generative AI is putting significant weight on the services that we run. How do we make sure that people can use our services in a better way, and that content is used properly without harmful effects on them? And what has changed significantly in the last few years is the overall effort around our open LLM models, Llama.
And gen AI today runs on LLMs. Gen AI growth has been phenomenal. Just to give an example, ChatGPT reached 100 million users in approximately two months, whereas it took Instagram and Facebook roughly two and a half to four and a half years to reach that kind of number. That is the kind of adoption we are seeing for gen AI right now. Likewise, model growth continues to be exceptional, fueled by transformers. To train Llama 2, we required approximately 2K to 4K GPUs at the time.
As we look into the future, this growth continues to be very steep. We don't know exactly where the return on investment drops off or where the diminishing returns lie, but right now it is quite phenomenal. We know that we can achieve significant gains as we move from text to speech, to videos, to images, to audio, all of which brings it closer and closer to the physical experiences that Meta can offer. To keep pace with this kind of development, we really want to make sure we know how to build the systems. How do we provide the exceptional performance that is required? But we also need to consider the speed at which AI use cases are expanding. That is the challenge in front of us from the infrastructure perspective.
AI cluster sizes continue to grow; this is not proprietary knowledge. Mark has talked about it, Microsoft has talked about it, NVIDIA has talked about it, and everybody has shared how their AI infrastructure is scaling. So it's quite clear the infrastructure continues to grow really, really fast. And we want to make sure that we can build systems that are optimized, that deliver the performance, but at the same time are well balanced from a solution perspective. So let me pause and come back to this: if you want to run these rapidly expanding use cases on infrastructure that is growing just as rapidly, what does that mean for the infrastructure? That is what most of us in this room are probably going to think about.
Hyperscalers have been running large numbers of systems and large applications in the data center for a long time, but there are fundamental differences. General-purpose computing applications are very CPU-centric. They scale out significantly, with mostly stateless applications running quite independently. Failures are acceptable; failures are factored into the design. We scale performance by adding more and more systems to the clusters. Accelerator-based applications, which AI applications are, are different in that even when they run on multiple systems, it is a single job that runs across all of them and works together. A single failure somewhere in that cluster can bring the whole job, which may have been running for weeks or months, back to the start. This is where the reliability challenge comes in: the systems we build, from the GPUs or accelerators, memory, and network to the way the whole cluster is built and the way power is delivered, have to make sure that this large cluster keeps running until the job completes.
There is a further challenge for us as we delve into these systems. As we know for hyperscalers, which we have talked about in other sessions, we want to make sure the systems are optimized for TCO. For a system delivering certain functionality, the kind of compute it should have, the kind of networking, how much memory, and what storage, all of this composed together creates a SKU that is optimized for that use case. The challenge for AI systems is that if you look at the requirements for the different use cases, whether that is recommendation training or inference, GenAI training or inference, or within those the prefill and decode phases, and layer the requirements for the different system components over each other, you get this complex spider chart. That chart pushes the envelope for different functionalities or components in the system to extremes, which makes our job of deciding what to put in very challenging. The kind of clusters we build, and the kind of functionality we need to put into accelerators nowadays, pushes the envelope on power. How many systems you put together, how much bandwidth each requires, all of this pushes different envelopes, and you can see in the expo hall how those pressures are driving the designs.
We will focus only on memory today, though, even if power, cooling, and rack composition are very interesting problems as well.
Coming back to the memory requirements: even though we can't run the whole job on a single GPU or accelerator, let's focus on a single accelerator first. On each accelerator, we need an amount of memory that can keep the FLOPS fed. That means the memory bandwidth needs to be very high, and so does the capacity, so that I can really fit the model on that GPU, avoid stranding the GPU, and have enough memory capacity. These are the two things people have already talked about: the AI memory wall, from both a bandwidth perspective and a capacity perspective.
Bandwidth, of course, is required. This is where HBM comes into the picture. As we move from HBM3 to HBM3E to HBM4 and onwards, each generation addresses how HBM can continue to provide the memory bandwidth. Capacity also continues to grow, as HBM capacity grows with a higher number of stacks and as core die density continues to increase with each memory node. As we start increasing the memory, though, we have to keep in mind that power is an important aspect, and so is reliability. As we said, each node or component that fails in the cluster can bring the whole job down. So reliability is a very important aspect, as is power density.
Coming to capacity: as we look at the models, we talked about how the Llama models keep increasing in size, and that translates directly into how much memory we need. If you think about it, the amount of capacity we can put inside a package, alongside the GPU silicon, is limited, because that is the memory providing the highest bandwidth. So HBM, the tier 1 memory, may not be enough for many of the use cases, which means we need to figure out how to provide additional memory. I'm just using the name "Memlink," which simply says there is additional memory somehow connected to this accelerator, outside the accelerator, that is tiered and completes the solution. Now, the important thing here is that AI jobs are such that you can actually use tiering. You can decide where the data is placed. And even if that memory has a somewhat higher latency than HBM, we can make sure the data is fetched at the right time. So AI use cases, whether I'm placing embeddings or activations, can use tiered memory in a way that lets us scale beyond the existing tier 1 solutions.
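To make the capacity gap concrete, here is a rough back-of-envelope sketch; the model sizes, bytes per parameter, and per-accelerator HBM capacity are illustrative assumptions, not figures from the talk.

```python
# Back-of-envelope: why HBM (tier 1) alone runs short on capacity.
# All numbers below are illustrative assumptions, not figures from the talk.

def model_weight_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Weight footprint only (e.g., BF16 = 2 bytes/param); KV cache,
    activations, and optimizer state add substantially more on top."""
    return params_billion * 1e9 * bytes_per_param / 1e9

HBM_PER_ACCEL_GB = 192  # assumed tier-1 capacity of a single accelerator

for params_b in (70, 405):  # example model sizes, in billions of parameters
    need_gb = model_weight_gb(params_b)
    accels = -(-need_gb // HBM_PER_ACCEL_GB)  # ceiling division
    print(f"{params_b}B params ~ {need_gb:.0f} GB of weights -> "
          f"at least {int(accels)} accelerators' worth of HBM, or a tier-2 spill")
```

Even for weights alone, the larger models overflow a single accelerator's HBM, which is exactly where a tiered "Memlink" style capacity extension becomes interesting.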
So now, thinking about memory expansion: when I call it node-native, think of a node as a single system or single rack unit that has one CPU and a bunch of accelerators, and I'm going to add more memory to it. I have a couple of ways to do that; you can look at the GB200 or Grace solutions out on the expo floor. One way is to expand the memory for the accelerator through the host, going over whatever bus you have, whether that is NVLink-C2C for NVIDIA's Grace solution, PCIe, potentially CXL, or another fabric such as Infinity Fabric; any of these lets you reach the host, and the host is where the memory is attached. The other way is to put that memory directly on the accelerator. It could be DDR, LPDDR, or even CXL, if the interconnect coming off the accelerator allows you to attach such memory. So you have two ways of getting more memory into the node: go through the host, or attach it directly to the accelerator.
Let's talk in a little more detail. On the interconnect: if the memory is reached through the host, you want enough bandwidth to access that capacity. The host's capacity becomes something I use from the GPU as tier 2 memory, but to access it you need a certain amount of bandwidth. A rule of thumb: if you are using tier 2 memory, you want somewhere between 1/8th and 1/10th of your tier 1 memory bandwidth going to the tier 2. Use cases vary. You may be able to do with less if the use case permits; as we said earlier, different use cases have different requirements. If you are very capacity-bound rather than bandwidth-bound, the bandwidth requirement can drop. But if you are trying to build a platform that addresses multiple use cases, you tend to make sure you can provide the high bandwidth, so such high speeds may be required. This is critically important if you are going to use CXL or PCIe: you want enough lanes to provide the bandwidth you need. You could also have DDR or LPDDR coming directly off the accelerator. We want the solution to take RAS into consideration, along with energy efficiency from a picojoule-per-bit perspective. In general, the way to build the solution is to require enough bandwidth, which means enough channels if it is DDR or LPDDR, at the right speed, and with the right capacity for the models. The memory modules themselves also matter as we look toward solutions that are more power-optimized: how the module should be defined, what capacity it provides, and how reliability is going to be guaranteed, especially if you are considering LPDDR. LPDDR comes from the mobile space and faces different expectations in the server space. We want those solutions to be baked for capacity as well as reliability and, most importantly, serviceability. These are all the considerations that go into node-native Memlink for us.
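To put the rule of thumb in perspective, here is a minimal sketch; the HBM and per-link bandwidth figures are assumed rather than quoted in the talk, and protocol overheads are ignored.

```python
# Back-of-envelope: what the 1/8th-to-1/10th rule of thumb implies for the
# interconnect, and roughly how many x16 CXL/PCIe-class links that takes.
# HBM and link bandwidths are assumed values; protocol overhead is ignored.

HBM_BW_GBPS = 8000            # assumed aggregate tier-1 bandwidth per accelerator (GB/s)
TIER2_RATIOS = (1 / 8, 1 / 10)

LINK_BW_GBPS = {              # assumed raw per-direction bandwidth of one x16 link
    "PCIe Gen5 / CXL 2.x x16": 64,
    "PCIe Gen6 / CXL 3.x x16": 128,
}

for ratio in TIER2_RATIOS:
    target = HBM_BW_GBPS * ratio
    print(f"tier-2 target at 1/{int(1 / ratio)} of HBM: {target:.0f} GB/s")
    for name, bw in LINK_BW_GBPS.items():
        links = -(-target // bw)  # ceiling division
        print(f"  needs ~{int(links)} x {name}")
```

Under these assumptions, even the looser 1/10th target implies many x16 links' worth of bandwidth, which is the lane-count pressure the talk is pointing at.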
You could also take the memory and attach it across a fabric; a classic example exists today. NVLink lets you connect multiple GPUs over an NV switch, and that is effectively fabric-attached memory: you are accessing remote HBM over the fabric. One could also consider putting memory onto the fabric itself. The use cases are evolving here: embedding or activation offload, KV cache, ephemeral checkpoints. I think we'll have a little more discussion on this, but the questions to ask are: what is the interconnect, and what is the memory? Is it a memory-semantic interconnect, whether CXL, NVLink, or UALink, or a message-passing interconnect like Ethernet, or something in between? Overall, the expectation is that you want high bandwidth and the kind of low-latency characteristics that enable this connectivity to memory. So we have all these options, and they are all evolving. As I said, AI use cases are growing very rapidly; what used to be a two-to-three-year cycle for new innovation or new platforms has come down to a year, or sometimes less. So we need a lot of community effort to make sure these problems are solved together. Let me hand it to Samir to go into the next level of detail on what these high-level requirements mean.
Hi everyone, good morning. I have been attending these OCP meetings over the last two years, and it's a very good collaborative environment where you can test new ideas. Some of these ideas are highly complex, so we need the whole ecosystem to come together. With that, and considering what Manoj said so far, and the previous team who talked about composable memory: the point I'm trying to make is that when we come to AI, there are a lot of similarities we can draw from the CPU server into the AI infrastructure. The good thing about the CPU infrastructure is that it is fairly stable, whereas AI is changing so fast that if you want to throw in another variable, people may ask, "How many things are we changing?" So, with that in mind, let's talk about secondary memory.
The point here is that we are not going to replace HBM; that will always be the primary memory. The question is how we connect this secondary memory and what its characteristics should be. Manoj's slide talked about the key priorities. For me, bandwidth is extremely important; capacity comes immediately after that, and then latency and cost. Whereas if you look at a CPU server, the priorities are different: you are more sensitive to latency, to cost, and then bandwidth, and capacity may not be that important.
Now, when we are talking about secondary memory, you have to look at the HBM roadmap and where it is going. There is HBM3, HBM4 is coming, and HBM keeps adding more stacks and increasing bandwidth. But when you add the secondary memory, you have to keep certain ratios in mind: the secondary memory's bandwidth has to be some fraction n of the HBM bandwidth, B_secondary = n x B_HBM. If I go too low, this memory will be useless. The numbers I have heard from academia and people in the industry are that n should be somewhere around 1/3, 1/2, or even 1. So if I take the collective bandwidth of all the HBM components connected to my compute die, and I want to put a secondary memory behind it, its bandwidth has to land somewhere in that range. That puts more pressure on the interface bandwidth.
Let's see what is there today. There is CXL, which we are all aware of, and CXL is moving to gen 6, which doubles the bandwidth. There is also a feature in CXL where you can take x16 links and put them together to make a 2x16 connection; that doubles the CXL bandwidth further. And if you are still not satisfied for your use case and need more bandwidth, there are other options coming up, such as UALink. Now, when we look at the bandwidth and the memory, there are two axes: distance and bandwidth. The trade-off is always that if you go with a shorter distance and less bandwidth, copper is your medium, but if you want a larger distance and more bandwidth, you have to go with optical. So that's another thing to keep in mind: how do you want to connect this secondary memory?
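As a hedged illustration of what those doublings buy, the sketch below estimates raw per-direction x16 bandwidth across PHY generations and the effect of aggregating two x16 links; the figures are ballpark, not vendor-quoted, and ignore encoding and protocol overhead.

```python
# Ballpark raw per-direction bandwidth of an x16 link across PHY generations,
# plus the effect of aggregating two x16 links into one logical connection.
# Encoding and protocol overheads are ignored; figures are not vendor-quoted.

GT_PER_LANE = {
    "Gen5 PHY (CXL 2.x era)": 32,
    "Gen6 PHY (CXL 3.x era)": 64,
    "Gen7 PHY": 128,
}
LANES = 16

for gen, gt_s in GT_PER_LANE.items():
    x16_gbps = gt_s * LANES / 8  # GT/s per lane -> approximate GB/s for x16
    print(f"{gen}: x16 ~ {x16_gbps:.0f} GB/s, "
          f"aggregated 2x16 ~ {2 * x16_gbps:.0f} GB/s per direction")
```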
Okay, so now there are trade-offs here. How do you want to connect this secondary memory? One option is to put it close to your compute die; I call that near the AI compute die. The second option is away from the AI compute die. Each has pros and cons, and we have to choose what suits us. Near the AI compute die, the distance is shorter, hence more bandwidth; you can live with copper interconnect and no additional cabling complexity; from the system point of view, it is cleaner. The disadvantage is that all these additional DIMMs or LPCAMMs take physical space. You may not have space on your board to connect that much memory, so you may be forced to go over a larger distance. That is one disadvantage. And what I keep hearing from my infrastructure team is that they are already power-limited in the rack, so if I can move memory into another rack, that gives me an option when I am power-limited in the rack.
And the third point is that the memory is not shared, because you are dedicating it to that one compute die. Whereas if you look at the other option, moving away from the compute die, it gives you flexibility in space, power, and cooling, and a higher-capacity option becomes possible, because you can put a lot of memory somewhere in another rack. That memory is shared, and now you have a configurable option. People talk about composable memory in the CPU server; the same concept can be carried over here. I have a use case a couple of slides from now where you will see that in AI inference there is a prompt phase and a token phase. One is more compute-bound; the other is more memory-bound. So if I can configure the memory and throw more of it at the token phase, I can solve that problem. Now, the disadvantage: a longer distance means optical, and optical is challenging; it is expensive, and there are reliability concerns. So those are the options.
So, now let's look at what I mean by near compute. Everybody has seen this; it's NVIDIA's Grace Hopper. What I mean by "near the compute die" is that the memory is close to the Hopper. Hopper is the compute here, LPDDR is the secondary memory, and it sits very close and is exclusively for that Hopper GPU. Of course, other GPUs can access it by going over NVLink, but this is what I mean by "near the compute die."
This is the other concept. Again, these ideas are more forward-looking. Here, you move the memory away from the compute die. Take this cluster of eight GPUs, each with its own HBM. Now I connect these GPUs over some interconnect, going over copper or optical depending on the distance, maybe CXL or UALink, to another entity called a Memory Services Unit. That unit sits away from the GPUs, and that is where you connect the DDR or LPDDR. Within OCP there are other companies, such as Celestial and others, who have talked about this. But this is another concept where you move memory away from the compute die. This memory is configurable, and this memory is shared, whereas in the previous case the memory was neither shared nor configurable.
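Purely as an illustration of "shared and configurable," here is a hypothetical sketch, not any real Memory Services Unit API: one far-memory pool whose capacity is partitioned across accelerators and re-partitioned as the workload phase changes.

```python
# Hypothetical sketch only; not a real Memory Services Unit interface.
# It illustrates "shared and configurable": one far-memory pool that can be
# re-partitioned across accelerators as the workload phase changes.

class FarMemoryPool:
    def __init__(self, total_gb: int):
        self.total_gb = total_gb
        self.grants = {}  # accelerator id -> GB granted

    def configure(self, grants: dict[str, int]) -> None:
        """Re-partition the pool; reject configurations that oversubscribe it."""
        if sum(grants.values()) > self.total_gb:
            raise ValueError("requested grants exceed pool capacity")
        self.grants = dict(grants)

pool = FarMemoryPool(total_gb=8192)
# Prompt-heavy phase: spread capacity evenly across eight GPUs.
pool.configure({f"gpu{i}": 1024 for i in range(8)})
# Token-heavy phase: bias capacity toward the GPUs serving decode / KV cache.
pool.configure({"gpu0": 512, "gpu1": 512, **{f"gpu{i}": 1178 for i in range(2, 8)}})
```

The point of the sketch is only the re-partitioning step: with node-native memory, that second `configure` call has no equivalent, because the capacity is soldered to one accelerator.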
So, when I throw out these ideas, I have to have a use case in mind. This is something that came out of Microsoft Research recently; the paper is called "Splitwise." If you look at the picture on the top left, there is an LLM and there is a user. You ask a question; that is the prompt. The first phase processes the prompt and generates the first token. In the next phase, the subsequent tokens are generated by looking at the KV cache. So essentially there are two phases: a prompt phase and a token phase. The point they make is that the prompt phase is more compute-intensive, and the token phase is more memory-intensive. Going back to my previous picture, you can see that now the memory is configurable. With a given set of GPUs, of course, I cannot change the compute, but I can throw more memory at the token phase, which is the memory-intensive one. So this is one of the use cases that can benefit from something like that.
If you look at the paper, there is a graph showing the memory requirement for the prompt phase and the token phase; the y-axis is the memory requirement. As batching increases, the amount of memory in the token phase really explodes. So when we talk about configurable, composable secondary memory, maybe this is one area where we can benefit.
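As a hedged sketch of that explosion, the calculation below estimates KV-cache size for an assumed grouped-query-attention model; the layer count, head counts, context length, and data type are illustrative, not taken from the Splitwise paper.

```python
# Hedged sketch of why token-phase memory grows with batching: KV-cache size
# for an assumed GQA transformer. All model parameters here are illustrative.

def kv_cache_gb(batch, seq_len, layers=80, kv_heads=8, head_dim=128, bytes_per=2):
    # 2x for keys and values, per layer, per KV head, per token, per element byte.
    return 2 * layers * kv_heads * head_dim * bytes_per * seq_len * batch / 1e9

for batch in (1, 8, 32, 64):
    print(f"batch {batch:3d}, 8K context: ~{kv_cache_gb(batch, 8192):.0f} GB of KV cache")
```

Under these assumptions the KV cache grows linearly with batch size, from a few gigabytes to well beyond a single accelerator's HBM, which is the token-phase pressure the graph in the paper shows.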
So again, the same thing on another slide: there is a prompt machine and there is a token machine, and you configure the prompt machine and the token machine by adding more memory to the token machine. You can go and look at the paper; it's online.
But there is another point I want to make at the end: what is possible today? The ideas I talked about may not be there yet, but what is available? This is the deployment model everybody is talking about: there are x number of GPUs, there is a CPU head node for so many GPUs, and there is memory sitting behind that CPU; Manoj talked about that. The previous team was talking about a memory pool. Now, if I have a memory pool and these two CPU head nodes, and the memory is shared, maybe there is a case where you can share that memory between the CPU head nodes. This is something we all need to explore further, but it is possible today, and CXL has made it possible; other interconnects will also make it possible. This is where we are today.
OK, so, call to action. Again, OCP is a very collaborative place where you can test your new ideas, so please participate. AI systems need high-performance memory components, interconnects, and system-level solutions. On memory technology, we always talk about high performance, capacity, and reliability. On interconnects, there is a lot of talk at this conference, but we can bring the interconnect along with this solution in mind: speeds, radix, connectivity, and memory systems. Friday morning is the OCP CMS call; please join and contribute. All the links are here. Thank you. We have a couple of minutes for questions.
Yes. Nice presentation. Vikram from NVIDIA Research. I have two questions. I think I'm observing conflicting directions. On one side, we want to optimize for picojoules per bit. On the other side, we want to optimize for memory bandwidth and capacity. Whether you place the memory near the compute or apart from the compute, those goals immediately conflict. So I'm trying to understand the mental picture here: how do we resolve this conflict? Because from the trend side, it's not going to work. And I'll do the second follow-up a bit later.
OK, I will go first. I totally agree with you. These are two different directions, and specifically for the far memory we have a lot of dependencies. In particular, if you want to go optical, then every interface you cross adds to the picojoules per bit and to the power. So yes, I'm fully aware of that; no argument there. But at the same time, if you cannot put that much memory near your compute die, if you are hitting a space limitation or an IO limitation on the compute die, we have to find an alternative. It may not be there today, but we have to find a solution there.
If I may just add: I think you're absolutely right. The farther you move the data, the higher the picojoules per bit. The question is where I land between my capacity needs and where that capacity has to sit. If I need capacity at higher bandwidth, the two pull closer together. If I move further away, I can add more capacity, but bandwidth is going to be compromised. So it really is use-case based. On the training side, perhaps I can address most of my capacity needs closer to the CPU. On the inference side, I have the potential for increasing that distance. Today, that is where it stands. And these use cases are evolving very fast. If you look at recommendation models versus GenAI, those requirements move so quickly, within six months, that we don't have time to develop a new technology even if we decided to start right now. So what we're seeing is that these are the options, and this is where it lands right now. Most of the use cases today are really near compute, near the die; that is where the solutions are. But the way capacity needs are increasing, we want to be ready for fabric-attached solutions or anything that lets us go further.
This gives a good segue to my second question. I'm trying to understand the balance between memory capacity and the scale you're thinking of. In one direction, you can do what some rack-scale architectures have done and get on the order of 13 terabytes of capacity, which is huge, not small, just on the HBM itself. An HBM generation changes every couple of years, or whatever the trend is, and you keep increasing it. On the other side, the beachfront you have on your chip is very limited. Now you are trying to trade off in the other direction, going far away, which is also a troubling scenario, because you don't have more beachfront to put anything else on. So where is that trade-off, really?
You bring up a great point, and one that may take longer to address, but let me touch on it. Your beachfront is limited, or your shoreline is limited if you're doing chiplets. So you are going to make a decision between how much capacity to put inside and how much to go outside for. The moment you go outside, you are trading different things: how much of my shoreline goes to the network, how much goes to the host, and now this additional consumer, which is memory. That brings us to the important point that the decision you make on interconnects matters too. Different interconnects use the beachfront differently for connecting out, which, to your point, adds not only to my beachfront usage but also to the picojoules per bit, because I'm going further. So this is the challenge. Luckily for all of us engineers, none of these problems are easy, so we are going to be at this for a long time, but this is why we want to put it out there. As Samir said, these are the solutions we need to work through and figure out what we have. These are absolute trade-offs: capacity, bandwidth, picojoules per bit, RAS. All of these are trade-offs we are making as we build. And it is only getting harder. You're trying to pack so much into a single rack, because you're limited by bandwidth reach, as we talked about, and then you get into another problem we didn't talk about, which is powering it and cooling it. More things are tied together, and it becomes even harder for builders to find solutions. Most of that power comes from one component, I know, but packing more of it inside makes it harder to cool. And we can get into the HBM challenges further as we go up from a capacity perspective.
Thank you so much.
All right, thank you. Let's have a conversation later. Yes, absolutely. So, let's just take one here, and then, yeah.
Hello from Samsung. Great talk, Samir and Manoj. My question is: yesterday I saw a talk from NVIDIA, who are trying to interface NVMe SSDs through the GPU directly. What are your thoughts on having NVMe storage connected to a GPU versus CXL-attached memory with a memory controller alongside the accelerator?
I don't think you have to look at it as one or the other. Think of CXL memory as a faster tier above storage; you always look at the memory pyramid, right? So, if you are going with that SSD approach, that's fine. But if you want to improve it further, think of CXL as another layer on top: maybe you stage data into CXL memory and then use CXL to feed your GPU.
Thank you.
Thanks.
Vishnu here from Micron. You talked about keeping HBM as the primary memory for bandwidth and capacity, and for the secondary memory we're talking about 1/10th, 1/20th, 1/30th. Based on your analysis, can you give a sweet spot for the secondary memory's bandwidth and capacity that works for most workloads?
Yeah, again, it all depends on your use case. There are many, many use cases. KV cache sharing is one use case I know of, and there are many more emerging. And again, it depends on how much you want to exploit the secondary memory. If I give you an answer, say it has to be 1/10th of HBM, well, HBM is going to HBM4, and now you're really putting yourself on the spot, because now you are talking about multiple terabytes for the secondary memory. You are limited by interconnects; you are limited by media. So this is where you have to make a decision: what is possible versus what is required. Maybe you take a step back and say 1/10th is not possible, but 1/20th or 1/30th may be possible with CXL or UALink. Maybe I cannot solve all the use cases, but I can solve a particular use case. And with a smaller KV cache rather than a bigger one, the gain may be reduced a little. So there is no ideal point. It is something we all want, but we are not going to get there in one step; we have to solve this in baby steps.
I'm with Samir here, and I'm going to take one minute and then we can go. As we talked about, look at that spider chart; that's where the challenge is. We are also looking for the answer. We have recommendation systems for training and inference; we have GenAI with its prompt and token phases. They have different requirements, and the challenge is that they push these envelopes differently, and all of us continue to look for the answer. We only know the constraints. Right now, as the previous question brought up, there is going to be a limit on how much I can put inside; not a hard limit, but it is at least becoming more and more difficult to add more stacks and more dies. So your tier 1 memory is going to get constrained. And if the tier 1 capacity increases, my bandwidth needs on tier 2 go down a little, because with caching I can start prefetching even earlier. So these will keep shifting for some time. One thing is for sure: tier 2 memory will be required, at least from the inference use case perspective. The bandwidth will ebb and flow based on how much HBM continues to grow, how many HBM stacks I can continue to integrate, and how our use cases optimize things. Samir's ballparks are absolutely right: we have seen between 1/10th and 1/30th, somewhere in between, and it varies by use case.
Thank you very much.
All right, thank you, everybody. We have to move to the next presentation. Oh, we have a break. All right, OK. So, does anybody have more questions for Samir and all the speakers who spoke before? And how long do we have a break for? Yes. So, all of us will be back at 9:45. We'll hopefully have our laptops set up by that time, so we'll start right away. All right. Thank you, guys.