So I'll talk about the memory angle. We've heard in the previous talks from Meta, and just now about the cooling challenges in the data center. We've heard from Meta that the new models coming out need more and more weights and parameters, which means they need more memory to run. And one of the challenges we're seeing is that memory is not scaling, as we've heard from TSMC, Intel, and the other foundries. As the process node shrinks, the logic scales well, but the memory does not. If you look at this chart, for example, it shows that SRAM, which sits on the silicon as the first level of memory, has pretty much stopped scaling. So if you want to support larger models and more compute, you're going to have to do something different, because memory is not keeping up with logic capacity.
Then if you look at the AI accelerators out there, the most popular one for AI use cases right now is arguably the GPU, and it already uses chiplets in the form of HBM memory. You need the high bandwidth, you need the high capacity, and you need to be close to your compute units. That's a bit different from CPU-attached architectures, where you have multiple levels of memory hierarchy, an L1 cache, an L2 cache, and so on. That's the first image there in the middle. But even the custom accelerators coming to market are taking totally different architectural approaches to memory. Groq, for example, uses an all-SRAM architecture: they've essentially skipped the memory hierarchy and pooled a large amount of SRAM together. Something similar is happening with GPUs, where proprietary interconnect links let the GPUs share HBM memory amongst themselves.
So we are seeing these trends emerge: you need more memory, more capacity, and low latency. What's the alternative to simply putting more and more memory onto the silicon die? One sustainable approach, we think, is memory compression. What we've seen is that a lot of the data sitting in data centers carries a lot of redundancy, so you can actually compress it. Of course, we're all familiar with storage compression of files, pages, and blocks, but that operates in a higher latency domain, microseconds, or even seconds; if you're compressing huge files, you can wait a few seconds. When you talk about memory, you need to do it extremely quickly, and by quickly I mean nanoseconds, a couple of nanoseconds. So one of the things we're working on is compressed SRAM technology, where you take data at 64-byte cache-line granularity and compress it at a ratio of 2 to 4x. That means you can pack a lot more effective memory into the same physical capacity, which ties into sustainability and into scaling these emerging AI models and workloads.
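To give a flavor of why 64-byte lines compress at all, here is a minimal Python sketch. It is purely illustrative and not the actual compression IP described in the talk; it just shows how redundancy inside a line (all zeros, or words that differ from a base by small deltas) can let a 64-byte line fit in 16 bytes, i.e. a 4x ratio.

LINE_BYTES = 64
WORD_BYTES = 8  # treat the line as eight 64-bit words

def compress_line(line: bytes):
    # Return (format_tag, payload) for one 64-byte cache line.
    assert len(line) == LINE_BYTES
    words = [int.from_bytes(line[i:i + WORD_BYTES], "little")
             for i in range(0, LINE_BYTES, WORD_BYTES)]
    if all(w == 0 for w in words):
        return ("zero", b"")                      # all-zero line: metadata only
    base = words[0]
    deltas = [w - base for w in words]
    if all(-128 <= d < 128 for d in deltas):      # base + eight 1-byte deltas
        payload = base.to_bytes(WORD_BYTES, "little") + bytes(
            (d & 0xFF) for d in deltas)           # 8 + 8 = 16 bytes
        return ("base_delta8", payload)
    return ("uncompressed", line)                 # fall back to storing the raw line

line = (1000).to_bytes(8, "little") * 8           # eight identical 64-bit words
tag, payload = compress_line(line)
print(tag, len(payload), "bytes,", LINE_BYTES // len(payload), "x")

Running this prints a 4x ratio for the example line; real hardware implementations do the equivalent combinational check in a few clock cycles rather than in software.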
Now, if you look at a classic cache design, you'll find three major components: a cache controller, a data array, and a tag array. If you're going to compress this cache in real time, the key requirements are these. You need to deliver a high compression ratio, so it's compelling enough. It has to be low latency. It can't take up much area: if you're adding something in order to save space, it has to be relatively small. And it has to be transparent to the user, because you don't want to introduce new layers of software; users and data centers have already invested money and resources in their existing programming models.
So our solution is a cache compression IP block that can easily be integrated into any SoC or delivered in chiplet form: it can be instantiated on a chiplet or integrated as a separate chiplet unit. The way we do this is with minimal modification of the tag array. If you look at the upper left, we modify the tag array but leave the data array as it is. By doing that, and adding our compression and decompression accelerators in the controller, we achieve the 2 to 4x effective capacity at extremely low latency, single-digit clock cycles, which is something you can tolerate in an L3 or system-level cache, whether that sits in a chiplet or at the SoC level. One key requirement for this kind of IP block is that it has to be portable across process nodes, because different companies are on different nodes: some are on the latest 5-nanometer or 3-nanometer processes, others are elsewhere. And it has to be compact. That's the solution we've come up with.
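To make the "modify the tag array, leave the data array alone" idea concrete, here is a minimal, purely illustrative Python model. The field names and sizes are assumptions for illustration, not the actual IP: the tag entry gains a few bits of compression metadata, the data array keeps its ordinary fixed-size 64-byte entries, and two compressed lines can then share one physical entry.

from dataclasses import dataclass

LINE_BYTES = 64
SECTOR_BYTES = 16            # assumed sub-line allocation unit: 4 sectors per data entry

@dataclass
class TagEntry:
    tag: int                 # address tag, as in any ordinary cache
    valid: bool
    comp_format: int         # which compression scheme the line used (0 = stored raw)
    comp_sectors: int        # compressed size in 16-byte sectors (1..4)
    sector_offset: int       # where the line starts inside its physical data entry

@dataclass
class DataEntry:
    raw: bytearray           # unchanged 64-byte physical data entry

# Two lines that each compressed down to 2 sectors (32 bytes) can share
# one physical 64-byte data entry, doubling effective capacity for that set.
entry = DataEntry(bytearray(LINE_BYTES))
line_a = TagEntry(tag=0x1A, valid=True, comp_format=1, comp_sectors=2, sector_offset=0)
line_b = TagEntry(tag=0x2B, valid=True, comp_format=1, comp_sectors=2, sector_offset=2)
print("sectors used:", line_a.comp_sectors + line_b.comp_sectors,
      "of", LINE_BYTES // SECTOR_BYTES)

The point of the sketch is simply that the extra state lives in the (small) tag array, which is why the area overhead can stay low while effective capacity grows.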
In terms of results, with cache-line compression we are seeing ratios of anywhere from 2 to 4x across a variety of data center workloads. That includes benchmarks like SPECint, SPECfp, and Renaissance, and, since we're talking about AI, some of the machine learning performance benchmarks as well. We're seeing very efficient compression ratios, and at an extremely low number of clock cycles: this happens in single-digit clock cycles. The area is also efficient, roughly 0.1 mm² as measured at least in TSMC's 5-nanometer process. And the block operates at the line speed of the cache, which is the speed the processor expects.
So that's the first opportunity: getting more memory into the same area. Now, there's another approach. Recently we've seen a resurgence of emerging memories, and MRAM, magnetoresistive RAM, is one such technology. One of the companies we're collaborating with, NuMem, was actually featured on Meta's extended reality chip with an embedded MRAM implementation. The yellow block there is a 4 MB MRAM, and if you compare it to the SRAM tile sitting right next to it, it's about 2.5x denser than the SRAM block while providing the same 4 MB of capacity. And it does this in a non-volatile fashion. We were talking earlier about the cooling challenges, the power challenges, stranded capacity, thermals, and so on; because this is a non-volatile technology, you don't have to keep powering it just to retain the data, which provides a sustainability benefit. So that's another way to achieve higher density: a typical SRAM cell uses a six-transistor (6T) design, whereas the MRAM cell is much more compact, so you gain both area efficiency and power efficiency.
Now, since we're talking about the future, you can imagine what future technologies could look like. To achieve economies of scale, you could pair compression with emerging memories like MRAM. You get the 2 to 4x benefit from the compression IP, and if you apply it to technologies that use emerging media like MRAM, you get a combined multiplier effect, a net gain. And of course, at OCP we talk about ODSA, the Open Domain-Specific Architecture, about chiplets, and about how to make chiplets attractive enough to really take off in the market. One of the things that needs to happen is making chiplets cost-effective. If you're going to instantiate memories in chiplet form, you want to amortize the cost over a larger capacity. So instead of shipping, say, 4 MB of memory, if through compression and some of these alternative media that 4 MB effectively becomes 8 MB or 16 MB, the total cost of the chiplet is amortized over a much larger capacity, which makes the economics more attractive.
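As a rough back-of-the-envelope sketch of that multiplier and amortization argument, here is the arithmetic in Python using the figures quoted in the talk (2 to 4x compression, roughly 2.5x MRAM density versus SRAM). The chiplet cost below is a made-up placeholder, not a real price.

chiplet_cost = 100.0              # hypothetical cost units for a memory chiplet
physical_mb = 4                   # physical memory capacity on the chiplet

for compression in (2, 4):        # the 2-4x range quoted for the compression IP
    effective_mb = physical_mb * compression
    print(f"{compression}x compression: {physical_mb} MB behaves like {effective_mb} MB; "
          f"cost per effective MB drops from {chiplet_cost / physical_mb:.1f} "
          f"to {chiplet_cost / effective_mb:.1f}")

# If the same silicon area also holds ~2.5x more MRAM than SRAM, the two
# gains multiply: roughly 2.5 * (2..4) = 5..10x effective capacity per unit area.
print("combined multiplier:", 2.5 * 2, "to", 2.5 * 4, "x")

The exact numbers will depend on the workload and the media, but the shape of the argument is just this: cost per effective megabyte falls in proportion to the combined multiplier.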
Moving on, another way to innovate with chiplets is around interconnects. There are several alternatives: HBM is already in the market, UCIe is emerging as a spec, and from OCP we have the Bunch of Wires protocol to enable chiplet interconnect. One of the things we can also do in the future is mix media. With the advanced packaging technology now available, you can stack different media together, for example DRAM stacked with MRAM. These are the kinds of things we want to explore with the community: what future technology combinations could actually deliver on sustainable computing.
Going back to the GPU use cases: at GTC, in Jensen's keynote, one of the features NVIDIA pointed out was a dedicated decompression engine on the Blackwell GPU. That only goes to highlight the importance and scarcity of memory, and the need for more effective memory in these use cases to facilitate moving data across different GPU memories; that's why a decompression engine is featured on Blackwell. While this is great, it benefits only one company, because the engine sits on the Blackwell GPU SoC itself. Thinking about future possibilities, you could instead put compression and decompression engines inside the HBM memory itself. Again, we're talking about chiplets: HBM, or some of these new memory media we've been discussing, could be instantiated in chiplet form, and the compression and decompression engines could be placed inside those chiplets themselves, democratizing access so that more than just one company can benefit from these use cases.
Finally, when we talk about chiplets today, we're mostly talking about single components, and there's great momentum to standardize the interconnect technologies, which is great. But we need to get beyond point-to-point interconnects for chiplets. At OCP, several of us participate in the composable memory workgroup and similar efforts, which apply at the system-level architecture. When you think about what chiplets will require in the future, it's not just point-to-point memory connections; you're going to need composable, chiplet-style memories. And to do that, you have to go beyond the physical interconnect. Bunch of Wires or UCIe, the standard protocols, help with the physical link, but on chip you have other protocols running, for example AXI or CHI. So you have to think about how to connect two completely different chiplets, or a chiplet and an SoC, and make them work in harmony. Extending that paradigm, if you're going to interconnect multiple chiplets, how do they play well together, how do they mesh well together? That's where something like Arm's coherent mesh network comes in. Some of these technologies need to be paired so you can attach multiple processors on chip, go off chip in a chiplet scenario, and still maintain coherence with the memory, while adding accelerators like compression, or whatever other accelerators people want to put on.
So the key takeaway is that we have this compression capability today, we have emerging media coming out, and there are many possibilities in combining these technologies and extending chiplets from point-to-point interconnects to more of a mesh. My call to action is that we want to work with the OCP community to figure out the most important use cases to focus on, collaborate on test-chip implementations, and maybe even talk about TCO metrics: what is the total cost of ownership at which these kinds of solutions become attractive? And finally, with chiplets, how do we go beyond point-to-point to a scalable mesh network, a more composable SoC system architecture for memories, not just the individual components, but actually scaling out the memories themselves? With that, I'd like to open it up for any questions.