So I'll talk about the memory angle. We've heard in the previous talks from Meta, and just now about the cooling challenges in the data center. We've heard from Meta that the new models coming out need more and more weights and parameters, which means they need more memory to run. And one of the challenges we're seeing is that memory is not scaling, as we've heard from TSMC, Intel, and the other foundries. As the process node shrinks, the logic scales well, but the memory does not. If you look at this chart, for example, it shows that SRAM, which sits on the silicon as the first level of memory, has pretty much stopped scaling. So if you want to support larger models and more compute, you're going to have to do something different, because memory is not keeping up with logic capacity.
Then if you look at the AI accelerators out there, the most popular one for AI use cases right now is arguably the GPU, and it already uses chiplets in the form of HBM memory. You need the high bandwidth, you need the high capacity, and you need to be close to your compute units. That's a bit different from CPU-attached architectures, where you have multiple levels of memory hierarchy, an L1 cache, an L2 cache, and so on. That's the first image there in the middle. But even the custom accelerators coming to market are taking totally different architectural approaches to memory. Groq, for example, uses an all-SRAM architecture: they've essentially skipped the memory hierarchy and pooled a large amount of SRAM together. Something similar is happening with GPUs, where proprietary interconnect links let the GPUs share HBM memory amongst themselves.
So we are seeing these trends emerge: you need more memory, more capacity, and low latency. What's the alternative to simply putting more and more memory onto the silicon die? One sustainable approach, we think, is memory compression. What we've seen is that a lot of the data sitting in data centers carries a lot of redundancy, so you can actually compress it. Of course, we're all familiar with storage compression of files, pages, and blocks, but that operates in a higher latency domain, microseconds, or even seconds; if you're compressing huge files, you can wait a few seconds. When you talk about memory, you need to do it extremely quickly, and by quickly I mean nanoseconds, a couple of nanoseconds. So one of the things we're working on is compressed SRAM technology, where you take data at 64-byte cache-line granularity and compress it at a ratio of 2 to 4x. That means you can pack a lot more effective memory into the same physical capacity, which ties into sustainability and into scaling these emerging AI models and workloads.
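To give a flavor of why 64-byte lines compress at all, here is a minimal Python sketch. It is purely illustrative and not the actual compression IP described in the talk; it just shows how redundancy inside a line (all zeros, or words that differ from a base by small deltas) can let a 64-byte line fit in 16 bytes, i.e. a 4x ratio.

LINE_BYTES = 64
WORD_BYTES = 8  # treat the line as eight 64-bit words

def compress_line(line: bytes):
    # Return (format_tag, payload) for one 64-byte cache line.
    assert len(line) == LINE_BYTES
    words = [int.from_bytes(line[i:i + WORD_BYTES], "little")
             for i in range(0, LINE_BYTES, WORD_BYTES)]
    if all(w == 0 for w in words):
        return ("zero", b"")                      # all-zero line: metadata only
    base = words[0]
    deltas = [w - base for w in words]
    if all(-128 <= d < 128 for d in deltas):      # base + eight 1-byte deltas
        payload = base.to_bytes(WORD_BYTES, "little") + bytes(
            (d & 0xFF) for d in deltas)           # 8 + 8 = 16 bytes
        return ("base_delta8", payload)
    return ("uncompressed", line)                 # fall back to storing the raw line

line = (1000).to_bytes(8, "little") * 8           # eight identical 64-bit words
tag, payload = compress_line(line)
print(tag, len(payload), "bytes,", LINE_BYTES // len(payload), "x")

Running this prints a 4x ratio for the example line; real hardware implementations do the equivalent combinational check in a few clock cycles rather than in software.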
Now, if you look at a classic cache design, you'll find three major components: a cache controller, a data array, and a tag array. If you're going to compress this cache in real time, the key requirements are these. You need to deliver a high compression ratio, so it's compelling enough. It has to be low latency. It can't take up much area: if you're adding something in order to save space, it has to be relatively small. And it has to be transparent to the user, because you don't want to introduce new layers of software; users and data centers have already invested money and resources in their existing programming models.
So our solution is a cache compression IP block that can easily be integrated into any SoC or delivered in chiplet form: it can be instantiated on a chiplet or integrated as a separate chiplet unit. The way we do this is with minimal modification of the tag array. If you look at the upper left, we modify the tag array but leave the data array as it is. By doing that, and adding our compression and decompression accelerators in the controller, we achieve the 2 to 4x effective capacity at extremely low latency, single-digit clock cycles, which is something you can tolerate in an L3 or system-level cache, whether that sits in a chiplet or at the SoC level. One key requirement for this kind of IP block is that it has to be portable across process nodes, because different companies are on different nodes: some are on the latest 5-nanometer or 3-nanometer processes, others are elsewhere. And it has to be compact. That's the solution we've come up with.
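To make the "modify the tag array, leave the data array alone" idea concrete, here is a minimal, purely illustrative Python model. The field names and sizes are assumptions for illustration, not the actual IP: the tag entry gains a few bits of compression metadata, the data array keeps its ordinary fixed-size 64-byte entries, and two compressed lines can then share one physical entry.

from dataclasses import dataclass

LINE_BYTES = 64
SECTOR_BYTES = 16            # assumed sub-line allocation unit: 4 sectors per data entry

@dataclass
class TagEntry:
    tag: int                 # address tag, as in any ordinary cache
    valid: bool
    comp_format: int         # which compression scheme the line used (0 = stored raw)
    comp_sectors: int        # compressed size in 16-byte sectors (1..4)
    sector_offset: int       # where the line starts inside its physical data entry

@dataclass
class DataEntry:
    raw: bytearray           # unchanged 64-byte physical data entry

# Two lines that each compressed down to 2 sectors (32 bytes) can share
# one physical 64-byte data entry, doubling effective capacity for that set.
entry = DataEntry(bytearray(LINE_BYTES))
line_a = TagEntry(tag=0x1A, valid=True, comp_format=1, comp_sectors=2, sector_offset=0)
line_b = TagEntry(tag=0x2B, valid=True, comp_format=1, comp_sectors=2, sector_offset=2)
print("sectors used:", line_a.comp_sectors + line_b.comp_sectors,
      "of", LINE_BYTES // SECTOR_BYTES)

The point of the sketch is simply that the extra state lives in the (small) tag array, which is why the area overhead can stay low while effective capacity grows.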
In terms of results, with cache-line compression we are seeing ratios of anywhere from 2 to 4x across a variety of data center workloads. That includes benchmarks like SPECint, SPECfp, and Renaissance, and, since we're talking about AI, some of the machine learning performance benchmarks as well. We're seeing very efficient compression ratios, and at an extremely low number of clock cycles: this happens in single-digit clock cycles. The area is also efficient, roughly 0.1 mm² as measured at least in TSMC's 5-nanometer process. And the block operates at the line speed of the cache, which is the speed the processor expects.
So that's the first opportunity: getting more memory into the same area. Now, there's another approach. Recently we've seen a resurgence of emerging memories, and MRAM, magnetoresistive RAM, is one such technology. One of the companies we're collaborating with, NuMem, was actually featured on Meta's extended reality chip with an embedded MRAM implementation. The yellow block there is a 4 MB MRAM, and if you compare it to the SRAM tile sitting right next to it, it's about 2.5x denser than the SRAM block while providing the same 4 MB of capacity. And it does this in a non-volatile fashion. We were talking earlier about the cooling challenges, the power challenges, stranded capacity, thermals, and so on; because this is a non-volatile technology, you don't have to keep powering it just to retain the data, which provides a sustainability benefit. So that's another way to achieve higher density: a typical SRAM cell uses a six-transistor (6T) design, whereas the MRAM cell is much more compact, so you gain both area efficiency and power efficiency.
Now, since we're talking about the future, you can imagine what future technologies could look like. To achieve economies of scale, you could pair compression with emerging memories like MRAM. You get the 2 to 4x benefit from the compression IP, and if you apply it to technologies that use emerging media like MRAM, you get a combined multiplier effect, a net gain. And of course, at OCP we talk about ODSA, the Open Domain-Specific Architecture, about chiplets, and about how to make chiplets attractive enough to really take off in the market. One of the things that needs to happen is making chiplets cost-effective. If you're going to instantiate memories in chiplet form, you want to amortize the cost over a larger capacity. So instead of shipping, say, 4 MB of memory, if through compression and some of these alternative media that 4 MB effectively becomes 8 MB or 16 MB, the total cost of the chiplet is amortized over a much larger capacity, which makes the economics more attractive.
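As a rough back-of-the-envelope sketch of that multiplier and amortization argument, here is the arithmetic in Python using the figures quoted in the talk (2 to 4x compression, roughly 2.5x MRAM density versus SRAM). The chiplet cost below is a made-up placeholder, not a real price.

chiplet_cost = 100.0              # hypothetical cost units for a memory chiplet
physical_mb = 4                   # physical memory capacity on the chiplet

for compression in (2, 4):        # the 2-4x range quoted for the compression IP
    effective_mb = physical_mb * compression
    print(f"{compression}x compression: {physical_mb} MB behaves like {effective_mb} MB; "
          f"cost per effective MB drops from {chiplet_cost / physical_mb:.1f} "
          f"to {chiplet_cost / effective_mb:.1f}")

# If the same silicon area also holds ~2.5x more MRAM than SRAM, the two
# gains multiply: roughly 2.5 * (2..4) = 5..10x effective capacity per unit area.
print("combined multiplier:", 2.5 * 2, "to", 2.5 * 4, "x")

The exact numbers will depend on the workload and the media, but the shape of the argument is just this: cost per effective megabyte falls in proportion to the combined multiplier.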
Moving on, another way to innovate with chiplets is around interconnects. There are several alternatives: HBM is already in the market, UCIe is emerging as a spec, and from OCP we have the Bunch of Wires protocol to enable chiplet interconnect. One of the things we can also do in the future is mix media. With the advanced packaging technology now available, you can stack different media together, for example DRAM stacked with MRAM. These are the kinds of things we want to explore with the community: what future technology combinations could actually deliver on sustainable computing.
Going back to the GPU use cases: at GTC, in Jensen's keynote, one of the features NVIDIA pointed out was a dedicated decompression engine on the Blackwell GPU. That only goes to highlight the importance and scarcity of memory, and the need for more effective memory in these use cases to facilitate moving data across different GPU memories; that's why a decompression engine is featured on Blackwell. While this is great, it benefits only one company, because the engine sits on the Blackwell GPU SoC itself. Thinking about future possibilities, you could instead put compression and decompression engines inside the HBM memory itself. Again, we're talking about chiplets: HBM, or some of these new memory media we've been discussing, could be instantiated in chiplet form, and the compression and decompression engines could be placed inside those chiplets themselves, democratizing access so that more than just one company can benefit from these use cases.
Finally, when we talk about chiplets today, we're mostly talking about single components, and there's great momentum to standardize the interconnect technologies, which is great. But we need to get beyond point-to-point interconnects for chiplets. At OCP, several of us participate in the composable memory workgroup and similar efforts, which apply at the system-level architecture. When you think about what chiplets will require in the future, it's not just point-to-point memory connections; you're going to need composable, chiplet-style memories. And to do that, you have to go beyond the physical interconnect. Bunch of Wires or UCIe, the standard protocols, help with the physical link, but on chip you have other protocols running, for example AXI or CHI. So you have to think about how to connect two completely different chiplets, or a chiplet and an SoC, and make them work in harmony. Extending that paradigm, if you're going to interconnect multiple chiplets, how do they play well together, how do they mesh well together? That's where something like Arm's coherent mesh network comes in. Some of these technologies need to be paired so you can attach multiple processors on chip, go off chip in a chiplet scenario, and still maintain coherence with the memory, while adding accelerators like compression, or whatever other accelerators people want to put on.
So the key takeaway is that we have this compression capability today, we have emerging media coming out, and there are many possibilities in combining these technologies and extending chiplets from point-to-point interconnects to more of a mesh. My call to action is that we want to work with the OCP community to figure out the most important use cases to focus on, collaborate on test-chip implementations, and maybe even talk about TCO metrics: what is the total cost of ownership at which these kinds of solutions become attractive? And finally, with chiplets, how do we go beyond point-to-point to a scalable mesh network, a more composable SoC system architecture for memories, not just the individual components, but actually scaling out the memories themselves? With that, I'd like to open it up for any questions.