All right, good afternoon. Thank you for joining us today. As Amber said, my name is Chris Petersen and this is Prakash Chauhan. We're both from Meta. I'm sure at this point you're probably tired of hearing about AI systems. If you're not, not to worry, it's only Tuesday. We'll spend a few minutes talking about some of the challenges in building AI systems and then how CXL may be able to help us with them. We have a lot of ground to cover and not much time, so I'm going to jump right into things.
So designing AI systems ultimately presents us with numerous challenges. Today we're just going to focus on three of them. First, as we've discussed in other presentations, and as you've seen some of this morning as well, in many cases we're really approaching a memory wall. This is caused by the gap between the rapid growth in computational intensity and model sizes and the DRAM density and packaging technology improvements that we continue to see. We're able to keep pace with the compute requirements, at least to some degree, but memory capacity and memory bandwidth are not scaling fast enough, especially in the context of on-package memory such as HBM. We've found a number of ways to help delay the impact of this disconnect, for example by scaling across more and more accelerators, but we do need to explore additional solutions.
The second challenge is the rapid evolution of AI models. Here we can see the volume of publications on new AI models over time, which we're using as a proxy for the rate of change of models, in addition to the rate of growth in model parameters. As you saw in some of the keynotes this morning, new model releases are happening at least every six months, and often much faster than that. As AI system development time is ultimately measured in years, this presents us with an enormous challenge: not only to develop the systems fast enough to keep pace with all of this change, but also to manufacture and actually deploy them in production. Ultimately, hardware simply can't evolve fast enough, and we can't build it fast enough. The obvious recent example that everyone's been talking about, of course, is this unexpected surge in generative AI this year.
Finally, the last challenge we'll focus on is the tight vertical integration of AI systems, as this ultimately starts to limit deployment flexibility. There are clearly a lot of benefits to tight integration, but in this context, tightly coupled memory that is integrated on accelerators results in a fixed compute-to-memory ratio that we can't adjust over the life of the deployment of the accelerators or the AI systems, which might be five years or more at this point. AI models will continue to evolve over this time frame, and it takes tremendous effort to reoptimize for all of these model changes, especially without stranding compute, memory, or both. Ultimately, utilization and efficiency will suffer.
So how do we address these challenges? We're proposing that we need to approach AI system design with flexibility as a core tenet. If we implement sufficient flexibility, we can address the increasing disconnect between memory and compute growth, develop and deploy AI systems much faster, and also improve sustainability. We haven't talked much about sustainability in this particular presentation, but you've heard a lot about it this morning, and I will point back to the traditional reduce, reuse, recycle, which we've talked about for decades. I think that approach can also be applied to AI system design, and flexibility will help us get there. But what does flexibility actually mean in this context, and how do we get there?
Flexibility ultimately involves separating systems into modular, reusable components, and then interconnecting them to still allow for efficient communication between them. As there are many types of interconnects or fabrics, we're going to classify them into three categories for AI systems. First, the node memory interconnect, which connects AI accelerators to CPUs and node-level memory. Second, the scale-up interconnect, which connects hundreds of accelerators and their memory together. And finally, the cluster-level interconnect, which is used to scale up to thousands of systems in a cluster. For this presentation, we'll focus only on the first two, and we'll explore them in more detail in the coming slides.
HBM, or other on-package memory that's tightly coupled to an accelerator, provides us with terabytes per second of memory bandwidth at a reasonable memory latency, typically on the order of about 300 nanoseconds. The memory capacity, on the other hand, is quite limited, so models often don't fit into the available HBM and we have to spill over into other tiers. If we want to expand this memory, or decouple some portion of it and replace it with something more scalable, we need to get reasonably close to these characteristics. Today, with a traditional PCIe Gen 6 x16 connection, we're limited to roughly one quarter of a terabyte per second of bandwidth, and the latency is closer to about one microsecond. There's a clear opportunity here for further improvement, as it would unlock the potential for us to access CPU-attached memory with HBM-like characteristics. While we can work around this to a limited extent today, it involves significantly more software complexity in the form of more memory copies, careful scheduling, hiding latencies, and often caching. We'd like to avoid or at least minimize some of this going forward, as it really slows down adoption, optimization, and flexibility.
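To make the "hiding latencies" workaround a bit more concrete, here is a minimal Python sketch of the kind of prefetch/double-buffering pattern being described. It is illustrative only: fetch_chunk and compute_on_chunk are hypothetical stand-ins for real transfer and kernel calls, and the sleep durations are placeholders, not measured numbers.

```python
# Minimal prefetch/double-buffering sketch: while the accelerator works on chunk i,
# the next chunk is copied over the (slower) host link in the background.
# fetch_chunk and compute_on_chunk are hypothetical stand-ins, not real APIs.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(i):
    """Simulate copying chunk i over a PCIe-class link."""
    time.sleep(0.010)  # placeholder transfer time
    return f"chunk-{i}"

def compute_on_chunk(chunk):
    """Simulate running a kernel on data already resident in accelerator memory."""
    time.sleep(0.008)  # placeholder compute time
    return len(chunk)

def run_pipeline(num_chunks):
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        in_flight = copier.submit(fetch_chunk, 0)              # prefetch the first chunk
        for i in range(num_chunks):
            chunk = in_flight.result()                         # wait for the copy of chunk i
            if i + 1 < num_chunks:
                in_flight = copier.submit(fetch_chunk, i + 1)  # start the next copy early
            results.append(compute_on_chunk(chunk))            # compute while that copy runs
    return results

if __name__ == "__main__":
    start = time.perf_counter()
    run_pipeline(8)
    print(f"pipelined: {time.perf_counter() - start:.3f}s "
          "(a fully serialized version would take roughly 8 * 0.018 = 0.144s)")
```

The point is simply that every spill out of HBM drags this kind of scheduling and copy management into the software stack, which is part of the complexity the speakers want the interconnect to remove.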
To maximize the amount of parallel computation, it's necessary to partition AI models across the accelerators. The second interconnect provides the backend communication between those accelerators for this approach to work. To get the flexibility that we desire, we'll need terabyte per second links with latency in the hundreds of nanoseconds, and load/store semantics, all while trying to keep both the hardware and software implementations as simple as possible. Finally, we need standardization in this space, as the R&D required to keep pace with this constant evolution of requirements is significant, and collaborating as an industry is really the only way that can help us scale innovation here. If we can build this and enable more flexibility, then we can start building future AI systems that can be reconfigured to solve memory-heavy workloads, compute-heavy workloads, or something else in between. I'll now turn it over to Prakash to tell us a little bit more about what needs to be improved for interconnects to actually achieve this vision.
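As a rough illustration of what rides on that scale-up interconnect, here is a small, framework-free Python sketch of column-parallel partitioning of a single layer. The helper names (matmul, shard_columns, column_parallel_forward) are made up for illustration; in a real system, the gather of the partial outputs is exactly the traffic that needs terabyte-per-second, load/store-friendly links.

```python
# Column-parallel sketch: a layer's weight matrix is split column-wise across
# "accelerators"; each computes against its shard, and the outputs are gathered.
# Pure Python with tiny matrices, purely for illustration.

def matmul(a, b):
    """Plain matrix multiply: a is (m x k), b is (k x n), both lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def shard_columns(w, num_shards):
    """Split weight matrix w (k x n) into num_shards equal column blocks."""
    step = len(w[0]) // num_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]

def column_parallel_forward(x, w, num_shards):
    """Each 'accelerator' multiplies by its shard; the gather concatenates columns."""
    partials = [matmul(x, w_shard) for w_shard in shard_columns(w, num_shards)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]                        # activations (2 x 2)
    w = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0]]    # weights (2 x 4)
    assert column_parallel_forward(x, w, num_shards=2) == matmul(x, w)
    print("sharded result matches the single-device result")
```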
Thank you, Chris. So I'd like to recap some of the important things that we've learned. One is that on-package memory is not really meeting the needs as far as model size growth is concerned. We have tens of gigabytes of HBM on package, whereas models are running into hundreds of terabytes, and the outlook is always up and to the right. So basically we need some node-level memory which has higher capacity but retains some of the good aspects of low latency and high throughput. The second part is that we need a fabric that allows us to compose an AI system, and that fabric connects CPUs, GPUs, accelerators, and memory in a collaborative fashion so that we can do shared memory and peer-to-peer transfers amongst the elements that make up the fabric. And finally, this interconnect, which we're calling scale-up CXL, needs to be an open technology so that multiple vendors and multiple players can interoperate with each other seamlessly. That allows end users like us to build large-scale AI systems that can be composed by mixing and matching different vendors and right-sizing the ratios of accelerators, general-purpose compute, and memory. This figure shows what Chris talked about: a node-level interconnect which provides access to high-bandwidth, low-latency node memory as an expansion of on-package memory, and a scale-up interconnect which is used for accelerators to communicate with each other and to centralize memory resources. We believe that CXL is the right solution here because it's an open standard, and with CXL 3.0, many of the capabilities listed here are supported. For example, at the node level, we can do low-latency and high-bandwidth transfers directly to CXL memory or to CPU-attached memory, and the load/store semantics available are very useful in that regard. At the scale-up interconnect level, CXL 3.0 allows us to scale up to 4K nodes and supports peer-to-peer transfers, hardware-based coherence with shared memory, and attaching capacity memory with GFAM (global fabric-attached memory) devices. So CXL seems like a good fit, on the face of it, as a fabric for AI.
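The "node memory as an expansion of on-package memory" idea can be pictured as a simple tiering policy. The sketch below is a toy first-fit placement, not Meta's allocator; the tier capacities and latencies are illustrative assumptions rather than figures from the talk.

```python
# Toy two-tier placement: allocations land in HBM while it has room and spill over
# to a larger, slower CXL-attached node-memory tier. All numbers are illustrative.

TIERS = [
    {"name": "hbm",             "capacity_gib": 96,   "latency_ns": 300},
    {"name": "cxl_node_memory", "capacity_gib": 1024, "latency_ns": 600},
]

def place(allocation_sizes_gib):
    """Greedy first-fit placement of allocation sizes (in GiB) across the tiers."""
    free = {t["name"]: t["capacity_gib"] for t in TIERS}
    placement = []
    for size in allocation_sizes_gib:
        for tier in TIERS:
            if free[tier["name"]] >= size:
                free[tier["name"]] -= size
                placement.append((size, tier["name"], tier["latency_ns"]))
                break
        else:
            raise MemoryError(f"no tier can hold a {size} GiB allocation")
    return placement

if __name__ == "__main__":
    # e.g. weight shards, KV cache, and embedding chunks for a hypothetical workload
    for size, tier, latency in place([64, 48, 200, 16]):
        print(f"{size:>4} GiB -> {tier} (~{latency} ns access)")
```

A real policy would of course weigh bandwidth, access frequency, and migration cost, but the shape of the problem is the same: the better the node-level interconnect, the less it costs to land in the second tier.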
However, there is one major impediment. This chart shows the comparative bandwidth of a stack of HBM versus 16 lanes of Ethernet and 16 lanes of PCI Express. Looking at it, we can see that HBM is on the order of a terabyte per second, while PCIe is down there at around 256 gigabytes per second. So CXL, which relies on PCI Express PHYs in the 2026 time frame, is very far away from what we need in order to have a scalable fabric. In addition, CXL currently is not an optically enabled standard, which makes it difficult to scale systems with reach across multiple racks.
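For reference, the gap in that chart is just back-of-the-envelope arithmetic, as in the quick calculation below (approximate; FLIT encoding and protocol overhead are ignored).

```python
# PCIe Gen6 signals at 64 GT/s per lane; a x16 link therefore moves roughly
# 64 * 16 / 8 = 128 GB/s per direction, or ~256 GB/s total, versus roughly
# 1 TB/s for an HBM stack (overheads ignored for simplicity).
lane_rate_gbps = 64                                  # PCIe Gen6 per-lane rate
lanes = 16
per_direction_gbs = lane_rate_gbps * lanes / 8       # ~128 GB/s
total_gbs = per_direction_gbs * 2                    # ~256 GB/s bidirectional
hbm_stack_gbs = 1000                                 # "order of a terabyte per second"
print(f"PCIe Gen6 x16: ~{total_gbs:.0f} GB/s, roughly "
      f"{hbm_stack_gbs / total_gbs:.1f}x short of an HBM stack")
```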
So here we have listed some improvements that we can make to CXL to make it an AI-capable fabric, broken up into what we need at the scale-up level and at the node level. At the scale-up level, as we said, we need lots of ports at high data rates, so we need to push for PHY data rates to increase to 100 to 200 gigabits per lane so that we can at least get close to what Ethernet promises. It's not 1 terabyte per second, but it's close. At the link layer, we need to natively support optical interconnects with near-package and co-packaged optics. That means the standard has to be aware of the channel characteristics that optics provides and should have the right FEC, CRC, and retry scheme to deal with the kinds of errors you see on optical links, instead of leaving that to the user. The third part is at the protocol level, where we want new features like symmetric memory access so that accelerators can directly access other accelerators' memory. While these are things we want to add so CXL fabrics can meet the performance requirements of AI, there are also some simplifications that can help reduce the overhead CXL carries today. For example, the control path for discovery and enumeration, which is very dynamic today, could be static, predefined, and essentially hard-coded by the system architecture. Also, things like link training and negotiation (how wide and how fast your link is), CEM compliance, and very small bifurcation granularities are important for general-purpose compute connecting to general-purpose devices, but they may not make much sense for an AI fabric, so we can relax some of them to get the features we really need. At the node level, we know that the CPUs will have to support backward compatibility. There, we would like to increase bandwidth by having link aggregation go beyond x16, to something like x32 or x64 virtual links, so that we can get the aggregate bandwidth. We also need to increase the data rate here as quickly as possible, though it clearly doesn't need to move as fast as the scale-up fabric.
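To see how these proposals move the needle, the same arithmetic can be applied to the lane rates and link widths mentioned above (again ignoring encoding and protocol overhead; the configurations are just the ranges called out in the talk).

```python
# Aggregate bidirectional bandwidth for a few lane-rate / width combinations:
# today's PCIe Gen6 x16, the proposed 100-200 Gb/s scale-up lanes, and
# x32/x64 link aggregation at the node level. Overheads are ignored.
def total_gbs(lane_gbps, lanes):
    return lane_gbps * lanes * 2 / 8   # both directions, 8 bits per byte

configs = [
    ("today: 64 GT/s x16",                  64, 16),
    ("scale-up: 100 Gb/s x16",             100, 16),
    ("scale-up: 200 Gb/s x16",             200, 16),
    ("node-level: 64 GT/s x32 aggregated",  64, 32),
    ("node-level: 64 GT/s x64 aggregated",  64, 64),
]
for label, rate, lanes in configs:
    print(f"{label:<36} ~{total_gbs(rate, lanes):.0f} GB/s")
```

At 200 Gb/s per lane, a x16 link lands around 800 GB/s, which is in line with the "not 1 terabyte per second, but close to it" comment above.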
So finally, a call to action for folks who are passionate about getting an open AI interconnect solution: please join the OCP Composable Memory Systems workgroup and the CXL Consortium discussions to drive the requirements for the AI-specific use cases. I would also highly encourage people in the CXL space to join the effort in developing kernel drivers, memory management, data center telemetry, fault management, and other tooling, and to contribute to those in the open-source ecosystem. And finally, it's very important that as we build out these gigawatt-sized clusters, we keep our focus on sustainability and pay attention to how much goes to waste when we don't design for flexibility and modularity. That's it, and I hope we'll have a lot of folks contribute to this effort.
Can you go back? I think it was two slides back, where you talked about what you'd like to see happen. Yeah. I noticed that latency wasn't mentioned there. Can you speak to that?
So as far as latency is concerned, we do want it to be low, but I think there is a compromise possible here. For example, let's say we want a channel with a higher error rate and we want to support it via a more powerful FEC. We can trade off some latency for that kind of data rate, because machine learning accelerators can generally tolerate latency better than general-purpose computing can.
Yes. Very good presentation. I was just wondering: you haven't talked about GPUs, but if you distill it down, GPUs in an AI environment are pretty much used like accelerators, accelerating vector and matrix operations. So do you think the relevance of the GPU, because it's programmable, will be lessened? Your whole presentation was focused on AI accelerators, so perhaps you're using GPU and AI accelerator interchangeably. Give us some sense of where you see the world two years from now. Thank you.
Sure. So, yeah, we're using AI accelerators as a generic term, right? You can say GPU, XPU, TPU, accelerator, whatever, right? The point is you have some special purpose compute engines out there that run AI functions of some kind, right? How you implement it is up to you.
Hi. This is Nilesh from Zero Point Technologies. I had a question. I've seen publications and papers from Meta and also Google talking about compression and decompression being deployed on the storage infrastructure, and it's taking up, you know, 3 to 5% of cycles. When we talk about scalable memory systems, do you see compression and decompression being deployed on the memory infrastructure as well?
That's a pretty broad question, so I'm going to answer it in the context of AI systems. So in general, a lot of the AI data isn't particularly compressible because it's already been, you know, trimmed down as much as possible. So I don't think in the context of AI systems there's a whole lot of opportunity. I think maybe in general purpose systems that's a different situation, right? It's really going to be data dependent more than anything else.