All right, good afternoon, everyone. Today we are going to talk about integrating CMS memory solutions with AI and caching services. I'll speak a little bit about Meta's FBGEMM benchmark, and Klas is going to continue with the caching services.
So we have all seen this graph, the AI memory wall. It comes from a well-known paper that has been out for a while now, so I'm not going to go deeper into it. The takeaway from this slide is that the AI memory wall is less a problem than a challenge for innovation, an invitation for new technologies to resolve the memory challenges we have today in terms of capacity, capacity expansion, and bandwidth. The paper explicitly calls out the gap between the growth in model parameters, which demands higher memory capacity and higher bandwidth, and what the hardware provides for AI solutions, whether generative AI or traditional AI. On the right side you can see that whenever capacity has increased, or capacity growth has accelerated, data scientists have delivered bigger models and been able to do more work. We don't want the hardware to be the limit, so we need to look at what the challenges are and how we can innovate in these spaces. Let's pause here for a moment on this memory challenge. What we are calling for is to leverage the AI use cases and AI benchmarks we have to validate our hardware and products, and to come up with common metrics so that, industry-wide, we can address this memory challenge.
So, we'll dive into FBGEMM, Facebook General Matrix Multiplication, a high-performance kernel library optimized for server-side inference. Why does it matter? GEMM operations, which underlie fully connected layers, are the biggest consumers of floating-point operations in AI applications; most AI computation can be translated into GEMM operations. Within FBGEMM, the split-table embedding benchmark represents batched embedding workloads, where the tables are split to represent different use cases such as recommender systems and natural language processing; these embedding lookups are the basic building blocks of those use cases. As an example, I'm not sure it's visible, but there are different kinds of embedding tables, and the table sizes and other parameters vary with the input data type. Here we take the example of a word embedding, but the input could be an image or a video as well. In this example, we have a phrase that we convert to a set of indices, and the indices are then mapped into different split tables depending on the feature-set characteristics and the vector dimension, which determines the accuracy of the representation. There are different stages in this benchmark. The first stage is initialization, where we take the complete batch of indices and also initialize the output buffer we intend to fill. The next stage is batch processing of the split-table embeddings: we map each index in the input batch to its table and to the table offset where it lives, and then retrieve those rows. That retrieval stage is called gathering the embeddings. The image on the right shows the gathering phase, where we retrieve the embeddings we want and place them into the output buffer.
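To make the gather stage concrete, here is a minimal sketch of a split-table embedding lookup in plain Python with NumPy. The table sizes, the batch contents, and the simple (table_id, row) addressing are illustrative assumptions, not the actual FBGEMM API.

```python
# Minimal sketch of a split-table embedding gather (illustrative only).
import numpy as np

DIM = 8  # embedding (vector) dimension shared by all tables

# Three "split" embedding tables with different row counts, e.g. one per feature.
tables = [
    np.random.rand(100, DIM).astype(np.float32),  # table 0
    np.random.rand(500, DIM).astype(np.float32),  # table 1
    np.random.rand(250, DIM).astype(np.float32),  # table 2
]

# A batch of lookups: (table_id, row_index) pairs derived from the input,
# e.g. the tokens of a phrase mapped to indices.
batch = [(0, 12), (1, 430), (2, 7), (1, 88)]

# Gather phase: map each index to its table, fetch the row, and write it
# into a pre-allocated output buffer.
output = np.empty((len(batch), DIM), dtype=np.float32)
for i, (table_id, row) in enumerate(batch):
    output[i] = tables[table_id][row]

print(output.shape)  # (4, 8): one embedding vector per lookup
```

In the real benchmark this happens for many batches in parallel on optimized kernels; the loop above only shows the index-to-table-offset mapping and the gather into the output buffer.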
We can go into more detail on how the embeddings are initialized and batch-processed. In the initialization phase, we have input indices, embedding tables, and an output buffer. The size of the input indices is the number of indices times the size of the data type; in this example we take 4K indices at 4 bytes each, which comes to 16 kilobytes. One more aspect is that memory is consumed at each of these stages, and memory capacity plays an important role: with on the order of a million embedding rows and a vector dimension of 4K, the tables consume around 16 gigabytes. You can imagine that as we scale to more indices and larger models, these numbers keep growing, and that is where the real challenge lies today. In the second step, we retrieve the embeddings. There are multiple sub-stages here that we have consolidated into a single table: underneath, each index is mapped to a table ID, that is, a split table, and then we calculate the table offset so that we can retrieve the embedding. As the vector dimension grows, the table row size grows with it. As a final step, we gather these embeddings into an output buffer. What we have shown here is a single operation with a single embedding and a single batch, but in reality multiple batches are processed in parallel, which increases the efficiency of the workload.
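The sizing above is straightforward arithmetic; the short sketch below reproduces it. The 64 MB output-buffer figure is an extra illustrative number implied by the same parameters, not quoted in the talk.

```python
# Back-of-the-envelope memory footprint for the example above
# (illustrative parameters; real models vary widely).
INDEX_BYTES = 4          # 4-byte (int32) indices
NUM_INDICES = 4 * 1024   # "4K indices" per batch

EMB_DIM    = 4 * 1024    # vector dimension of 4K
NUM_ROWS   = 1_000_000   # ~a million embedding rows
ELEM_BYTES = 4           # fp32 embedding elements

indices_bytes = NUM_INDICES * INDEX_BYTES           # 16,384 B  -> ~16 KB
table_bytes   = NUM_ROWS * EMB_DIM * ELEM_BYTES     # ~16.4e9 B -> ~16 GB
output_bytes  = NUM_INDICES * EMB_DIM * ELEM_BYTES  # ~67e6 B   -> ~64 MB

print(f"indices: {indices_bytes / 1024:.0f} KB")
print(f"tables:  {table_bytes / 1e9:.1f} GB")
print(f"output:  {output_bytes / 2**20:.0f} MB")
```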
Once that is done, the measurements are crucial. For the FBGEMM benchmark specifically, the critical metrics we want to measure are throughput, latency, and the overall embedding retrieval time. Throughput shows us the batch-processing efficiency, how well we are able to do. Latency is the per-embedding retrieval time. And the embedding retrieval time is, for the overall batch, how long it takes to fill the complete output buffer.
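As a rough illustration of how these three numbers relate, here is a self-contained sketch that times a toy gather loop; the workload and sizes are made up, and a real measurement would use the benchmark's own instrumentation rather than a Python timer.

```python
# Toy timing of a gather loop to show how the three metrics are derived
# (throughput, per-embedding latency, total retrieval time).
import time
import numpy as np

DIM = 8
tables = [np.random.rand(1000, DIM).astype(np.float32) for _ in range(3)]
batch = [(i % 3, i % 1000) for i in range(4096)]   # toy lookups
output = np.empty((len(batch), DIM), dtype=np.float32)

start = time.perf_counter()
for i, (table_id, row) in enumerate(batch):
    output[i] = tables[table_id][row]
retrieval_time = time.perf_counter() - start        # whole-batch retrieval time

throughput = len(batch) / retrieval_time             # embeddings per second
latency = retrieval_time / len(batch)                # average time per embedding
print(f"{throughput:,.0f} emb/s, {latency * 1e6:.2f} us per embedding")
```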
There are many key challenges here, in split-table management and in memory access patterns, but today let's focus on the scalability aspects, since we are talking about memory capacity. We already heard in the panel about the memory capacity and memory bandwidth that AI workloads are demanding. The main takeaway is that memory capacity today limits how far the batch size and the embedding size can grow. We use unified memory and various other techniques to optimize the size of the parameters we have, but at some point all of these techniques saturate. That is where the hardware needs to provide more capacity and bandwidth.
So how can we address this? There are multiple solutions emerging in the industry; nothing here is new or specific to what we are showing. This is the traditional system architecture we have today: main memory and GPU HBM memory, which store the embedding tables, indices, and output buffer across the memories, as unified memory or however we operate them. The industry is also exploring memory expansion native to the CPU, and we have been looking into this as well, along with a second-tier memory that can hold cold embedding tables so that we can still access them quickly and efficiently. I've put a star on PCIe because, as discussed, PCIe has limitations in how it can scale, so that interconnect can really be anything; we need not restrict it to PCIe. Similarly, there are new architectures coming from the accelerator and CPU vendors where the CPU and the accelerator share memory. The last two options here are crucial, because we need a scalable, efficient, long-term solution: a second-tier memory for the accelerator, regardless of the accelerator technology, that can scale and be more cost-efficient and power-efficient. That is what works out in the long term as a whole. These technologies need to be evaluated before they come to market, so that hyperscalers and other companies can start using them. That is why we are calling for leveraging the open-source benchmarks we have. As part of the CMS group, we are also going to publish white papers with recipes on how to use these benchmarks and run them for specific memory evaluation and validation, starting with FBGEMM, and we will scale out to other benchmarks as well. So this is essentially a call for help from the industry: leverage these benchmarks and provide relevant metrics that people can consume, so that time to market can be much faster. With that, I'll hand over to Klas, who will continue on caching services. Thank you.
Thank you so much. My name is Klas Moreau, I am the CEO of ZeroPoint Technologies, and I will continue along the same lines, looking into relevant benchmarks and workloads.
We will have a particular look at CacheLib. CacheLib is a Meta workload where you have several different types of caches all over the system, and it is extremely interesting to look at, both from a directly connected memory point of view and in terms of how you can expand the memory further away from there.
So the challenge is validating the theoretical gains via experimentation, and then we have a look at the particular cache benchmark that we will be using today.
So, the opportunity is the AI memory wall. We see that the cold-tier application opens the door for improved TCO. We know that the hyperscalers have a massive requirement for increased memory capacity, and here we have an opportunity, because cold data is highly compressible and cold-data bandwidth is much smaller than the system bandwidth.
Going forward from there, we also recognize that the hyperscalers are already spending significant money on software-based compression. Spending between three and five percent of CPU cycles on compression and decompression is, of course, something they would like to recover for something more useful. In response to that, a paper presented last year at OCP calls for CXL-connected memory expansion, but not just connecting more memory: connecting more memory, more efficiently. The ask is to add three DIMMs per channel and inline memory compression, in order to expand the capacity and make it possible to cache the pages that are read from memory.
I mentioned briefly that cold data is highly compressible and that cold data asks for very little extra system bandwidth.
So, the opportunity we see is to divide the DRAM tier into directly connected DRAM plus another tier beyond it that is compressed, and that could open up a significant opportunity.
The OCP paper that I mentioned was highly specific with respect to latency requirements and the particular compression algorithm, and the algorithm asked for was LZ4. Now, that is a good algorithm: it's well known, it's well tested, and it's compatible with software alternatives. Beyond that, there is also an opportunity to add potentially more latency-efficient algorithms, and that is what I'm going to talk a little bit more about here.
So this is what it would look like: you could have a compressed tier combined with an uncompressed tier, all managed by the CXL memory controller.
What we at ZeroPoint have done is to read this specification and make a product out of it. We have an OCP-compatible, hardware-accelerated CXL memory solution that does compression and decompression, compaction, and transparent memory management in a very short latency domain.
Now, these are of course just numbers, but what could the potential opportunity be of using CXL-connected memory in combination with compression? We know today that about 30% of a server is memory, and some people even say it's up to 50%. What if you could do with half that memory, or double that capacity? That could be a substantial cost saving, energy saving, and performance improvement.
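As a rough way to see what those percentages imply, here is a tiny arithmetic sketch. The server cost, the 30% memory share, and the 2x expansion factor are all illustrative assumptions; real savings depend on workload compressibility and pricing.

```python
# Rough TCO arithmetic: what a ~2x effective capacity gain could mean
# if memory is ~30% of server cost (all numbers are illustrative).
server_cost  = 10_000   # hypothetical server cost, in dollars
memory_share = 0.30     # memory as a fraction of server cost (assumed)
expansion    = 2.0      # assumed effective capacity gain from compression

memory_cost = server_cost * memory_share

# Option A: keep capacity constant and buy roughly half the DRAM.
savings_same_capacity = memory_cost * (1 - 1 / expansion)

# Option B: keep the DRAM and get ~2x the effective capacity instead.
print(f"Option A: save ${savings_same_capacity:,.0f} "
      f"({savings_same_capacity / server_cost:.0%} of server cost)")
print(f"Option B: {expansion:.0f}x effective memory capacity, same DRAM")
```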
Just to give you some numbers, we see today that we can compress on par with LZ4, but at cache-line granularity rather than 4-kilobyte granularity.
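To illustrate what the granularity difference means, here is a small sketch that compresses the same synthetic page once as a whole 4 KB block and once as sixty-four 64 B cache-line blocks with software LZ4 (using the python-lz4 package). The data is synthetic, so the ratios only show the granularity trade-off of a software codec; they say nothing about the hardware algorithm's actual ratios.

```python
# Compare LZ4 compression of one 4 KB page vs. 64 B cache-line chunks
# (synthetic data; illustrative of the granularity trade-off only).
import os
import lz4.block  # pip install lz4

PAGE, LINE = 4096, 64

# Synthetic "cold" page: mostly zeros with a little entropy sprinkled in.
page = bytearray(PAGE)
for off in range(0, PAGE, 256):
    page[off:off + 8] = os.urandom(8)
page = bytes(page)

page_compressed = lz4.block.compress(page, store_size=False)
line_compressed = [lz4.block.compress(page[i:i + LINE], store_size=False)
                   for i in range(0, PAGE, LINE)]

print(f"4 KB page:        {PAGE} -> {len(page_compressed)} bytes")
print(f"64 B cache lines: {PAGE} -> {sum(len(c) for c in line_compressed)} bytes")
```

Small blocks typically compress worse with a software codec because each block starts with an empty dictionary, which is exactly why reaching LZ4-like ratios at cache-line granularity is notable.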
So the setup we have is that, at this OCP show, we have had a demonstration of our compression technology on an FPGA running workloads, in this particular case CacheBench. I am actually missing that slide; it would have been the last one, but I don't think it's been updated. So if you had the opportunity to stop by ZeroPoint, you would have seen the demonstration. Now the show has ended, so the next opportunity you will have is at Supercomputing. There we will not be running only the FPGA, but rather a server end-to-end, running the application on the host and running the expansion board with hardware-accelerated compression with the DenseMem product. The numbers we see are an expansion factor of two to four times with a fantastically low latency number. So thank you for that, and I encourage you to either reach out to me for a demo or wait until Supercomputing. And now you have a 15-minute break. Right? Yes. So, are there any questions for me, Nara?
Yeah, that work is just beginning. Or do you have results already? Yes, we do have results on different platforms. Along with the white paper, we plan to release those normalized benchmark results as well.
Okay, and Klas, for your part, a quick question: do you need DCD in order to expose the variable capacity that results from compression?
No, we don't need DCD in the first generation, so that's something we're targeting later.