All right, folks. Thanks for staying back for this session. I'll be talking a little bit about how we at ZeroPoint are helping to streamline CXL adoption in hyperscale settings. Just a quick introduction: I do business development for ZeroPoint Technologies. We are an IP licensing company based out of Sweden.
So first, let's talk about what the challenge is before I jump into the solution. We know from papers published by all the hyperscalers, Google, Meta, that compression is taking up a significant amount of CPU cycles in the data center today. Up to almost 5% of all CPU cycles are being used for software compression and decompression in hyperscale infrastructure. That's monetizable compute being spent on what is essentially an administrative workload. So that's one challenge. To address it, recently, and by recently I mean the end of October, the OCP published the hyperscale CXL tiered memory expander spec. If you read through the spec, what's called out there is the need for hardware-accelerated, lossless memory compression. And it's really to address the challenge I mentioned: software compression is taking up precious CPU cycles.
Obviously, wherever there's a challenge, there's an opportunity. The opportunity here is for the industry to deploy not just a CXL memory tier, so you have your DRAM tier and then your CXL tier, but a third tier: a compressed memory tier that exists within the CXL device itself. If you go through the OCP spec, it explains the rationale behind this. The benefit is a reduction in total cost of ownership, because imagine you have a one-terabyte device, and through compression it is now exposed as a four-terabyte device. Your total cost of deploying CXL composable memory drops. And in terms of sustainability and efficiency, this definitely plays into that angle, because it's not sustainable to simply keep adding memory modules. If you walk around this OCP Summit, you'll see a lot of fantastic cooling solutions, and we've heard plenty of talks about how that challenge is being addressed. We're addressing it in a slightly different way: in the same physical capacity, we pack in much more. So we point to this spec as something of a game changer, because it gives CXL module developers guidance on how they should actually build the hardware. If you look at the arrow, it's pointing to the CXL controller device, and that's really where the spec is targeted.
If you do a double-click into the spec and read through what's stated there, the requirements are pretty stringent. We already know CXL is very sensitive to latency. If you look at all the talks about CXL, composable memory, and NUMA configurations for memory, latency is definitely a sensitive parameter, so you don't want to get in its way. The second thing is compression algorithms. They've been around for years; well-known industry standards like LZ4 and Zstandard have long been used for storage compression. If you look at the bright yellow chart on the right-hand side, those algorithms work wonderfully for storage workloads, but they come with a penalty. In storage, you operate at block or page sizes, and you have microseconds in which to complete these operations. But now we're talking about memory, CXL memory, where you have a few nanoseconds. If you read through the OCP spec, you have at most 90 to 150 nanoseconds to complete whatever compression or decompression you're going to do. You can still use LZ4 and Zstandard in these scenarios, but we've developed a proprietary technology that operates at cache line granularity, 64 bytes. So we're able to compress data at 64-byte cache line granularity. Why does that matter compared to those other algorithms? Mostly for the read cycle. When I'm reading data, I don't want to pull an entire page out of DRAM, find the cache line, and then decompress it. With our algorithm, you can fetch just the single cache line you care about, decompress it, and pass it on, so we keep latency extremely low, well below the levels specified in the spec. Of course, the spec also requires support for legacy algorithms like LZ4, but there's room for innovation as well, and that's why our IP block supports both. Composability is a new concept. You heard Suresh talk about emerging use cases with the OAM form factors, and there could be a need for memory attached to those accelerator units as well. There are many future use cases yet to be tested, so we want to leave the door open: in some cases you might need standard algorithms for compression ratio, and in others you might need the proprietary algorithm for the best latency.
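Just to make that read-path difference concrete, here's a little sketch of the two schemes. It's only an illustration, not our implementation; it uses zlib from the Python standard library as a stand-in codec, and the point is how much data each scheme has to decompress to serve one 64-byte read, not the ratio you'd get.

```python
# Illustration only (not ZeroPoint's implementation). zlib stands in for the
# real codec; the comparison is about how much data must be decompressed to
# serve one 64-byte read, not about compression ratio.
import os
import zlib

PAGE = 4096   # bytes per page
LINE = 64     # bytes per cache line

page = os.urandom(PAGE // 2) * 2   # a somewhat compressible 4 KB page

# Page-granularity scheme: the whole page is one compressed unit.
comp_page = zlib.compress(page)

# Cache-line-granularity scheme: each 64-byte line compressed independently.
# (Random 64-byte lines barely compress; ratio is not the point here.)
comp_lines = [zlib.compress(page[i:i + LINE]) for i in range(0, PAGE, LINE)]

def read_line_page_granular(idx: int) -> bytes:
    # Must decompress the full 4 KB page to recover a single 64-byte line.
    return zlib.decompress(comp_page)[idx * LINE:(idx + 1) * LINE]

def read_line_line_granular(idx: int) -> bytes:
    # Only the requested 64-byte line is decompressed.
    return zlib.decompress(comp_lines[idx])

assert read_line_page_granular(7) == read_line_line_granular(7)
```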
So our solution: like I said, ZeroPoint is an IP licensing company, and we have an IP solution that plugs into pretty much any ASIC device, plug and play. We've already announced our first integration with a CXL customer. If you look at the purple block there, our IP block slots right in over an AXI interface, a pretty standard system bus interface, so it just drops into your ASIC. Then, transparently, we compress and decompress data as it comes in from the host over the CXL.mem interface, just before it hits the memory. By the way, we have a demo on the show floor if you haven't already seen it; Evangelos and Dimitrios are here and can show you, and it looks much nicer in real life. What we do is dynamically adjust the memory tiers: as data comes in, we compress and decompress on the fly, and the capacity grows and shrinks dynamically without the user having to babysit or manage this third compressed tier.
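If you want a rough mental model of that dynamic sizing, here's a toy calculation, not our actual controller logic: the capacity the device can advertise to the host scales with the compression ratio observed on the data that's currently resident.

```python
# Toy model of a dynamically sized compressed tier (illustration only, not the
# actual controller logic). The advertised capacity tracks the compression
# ratio observed on the data currently stored in the device.
PHYSICAL_GB = 1024   # e.g. 1 TB of physical DRAM behind the CXL device

def exposed_capacity_gb(uncompressed_gb: float,
                        compressed_gb: float,
                        physical_gb: float = PHYSICAL_GB) -> float:
    """Rough estimate of the capacity the device can advertise to the host."""
    if compressed_gb == 0:
        return physical_gb                 # nothing stored yet: assume 1:1
    ratio = uncompressed_gb / compressed_gb
    return physical_gb * ratio             # physical space scaled by observed ratio

print(exposed_capacity_gb(400, 200))   # 2:1 so far -> ~2048 GB exposed
print(exposed_capacity_gb(400, 100))   # 4:1 so far -> ~4096 GB exposed
```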
And again, because what we deliver is soft IP, we want to democratize access to this kind of solution. In the spirit of the Open Compute Project, we want to make this IP available to any ASIC vendor on essentially any process node: whether you're developing at 7 nanometers, 5, or maybe 3, it's portable. The other thing is it's a transparent solution, meaning you don't need to change your software stack to take advantage of it. That was another motivation behind the OCP spec: preserve the software investments companies have already made rather than forcing everyone to change their software stack. So we do offer that transparency. And if you look at the size of our IP, the basic blocks are pretty minuscule, so it plugs in compactly and delivers this 2 to 4x compression in a very small area.
The other thing of interest from an end-user perspective: what they care about is, OK, this is great, you do cache-line compression, it's transparent, it's easily pluggable, but what's in it for me as an end user? Yes, I get this capacity, and what that effectively does is lower your total cost of ownership. When you look at these four bars, what we've done is characterize the total cost of ownership. We have detailed models, by the way; feel free to reach out to me and I'm happy to share them. We're also introducing some of these models in the OCP Composable Memory Solutions sub-workgroups, the workload workgroups, so if folks have different thoughts on TCO models, we'd love to hear them. With our initial model, there's a benefit just from moving to CXL, because you get rid of the stranded memory we've heard about in several talks here. With compression, you can improve the total cost of ownership further. And when you combine CXL with compression, that reduces your total cost dramatically, by up to 25%. And this is from both a CAPEX and an OPEX perspective, because if you're going to invest in CXL infrastructure, what compression lets you do is amortize that cost over a larger memory pool: a terabyte of memory effectively looks like four terabytes.
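To give you a back-of-the-envelope feel for that amortization argument, here's a tiny sketch with completely made-up numbers, not our actual TCO model: the fixed cost of the module is spread over the effective capacity the host sees rather than the physical capacity.

```python
# Back-of-the-envelope TCO illustration with made-up numbers (not ZeroPoint's
# model). A fixed module cost is amortized over effective, not physical, capacity.
def cost_per_effective_gb(dram_cost_per_gb: float,
                          physical_gb: float,
                          module_overhead: float,
                          compression_ratio: float) -> float:
    """Hardware cost divided by the capacity the host actually sees."""
    hardware_cost = dram_cost_per_gb * physical_gb + module_overhead
    effective_gb = physical_gb * compression_ratio
    return hardware_cost / effective_gb

# Hypothetical inputs: $3/GB DRAM, a 1 TB module, $500 controller/board overhead.
baseline = cost_per_effective_gb(3.0, 1024, 500, compression_ratio=1.0)
with_cmp = cost_per_effective_gb(3.0, 1024, 500, compression_ratio=2.5)
print(f"no compression:   ${baseline:.3f} per effective GB")
print(f"2.5x compression: ${with_cmp:.3f} per effective GB")
```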
Like I was saying, this is just a snapshot of the demo, and it looks much nicer if you see it in real life. So talk to these guys, we have the booth out there, and you can see a pretty cool live demo.
In terms of performance, one thing about lossless compression is that the performance depends on your incoming data. So we've characterized this IP across some of the popular data center workloads and benchmarks: SPEC, Renaissance, which is a Java-based benchmark, and of course MLPerf, because you can't have a slide here that doesn't talk about AI benchmarks. You're seeing three bars here. The purple one is the proprietary algorithm, and we also implement LZ4, the industry-standard algorithm, which we've highly optimized for performance. We provide 2 to 3x compression ratios, and it's actually pretty competitive even compared to LZ4, even though we operate only on a single cache line, which is not a lot of data to work with. When you think of compression, you're typically operating at file size, compressing PDFs or pages and blocks, but we're able to achieve these ratios at 64-byte granularity.
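And if you want a ballpark for how compressible your own data is at 64-byte granularity, here's a simple harness, again just an illustration with zlib standing in for the real algorithms, so treat the number as a rough indication only, not what the hardware IP will deliver.

```python
# Rough harness to gauge how compressible a buffer is at 64-byte granularity
# (illustration only; zlib stands in for the real algorithms, so treat the
# result as a ballpark, not what the hardware IP achieves).
import zlib

LINE = 64

def cacheline_ratio(buf: bytes) -> float:
    """Uncompressed/compressed size with every 64-byte line compressed independently."""
    total_in = total_out = 0
    for i in range(0, len(buf) - len(buf) % LINE, LINE):
        line = buf[i:i + LINE]
        total_in += len(line)
        # Never store an expanded line: fall back to the raw 64 bytes.
        total_out += min(len(line), len(zlib.compress(line)))
    return total_in / total_out if total_out else 1.0

# Example: a memory snapshot with many zero lines compresses very well.
snapshot = (b"\x00" * LINE) * 800 + bytes(range(256)) * 16
print(f"cache-line granularity ratio: {cacheline_ratio(snapshot):.2f}x")
```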
So, to summarize: this IP solution is available, it's already OCP spec compliant, it's portable across multiple process nodes, and it's a drop-in IP solution. It's also production ready as of this year, and we've had our first customer, a major manufacturer, license this IP, with more to come. The call to action here: while we've developed the IP solution, we need to work together to actually develop this third tier and deliver on the promise of CXL as a total cost of ownership value proposition. That's where we'd like to work and partner with other OCP members, especially if you're working with controller manufacturers, or if you are a controller manufacturer or memory vendor. We want to work together to address the hyperscale requirements in the OCP spec. On the software side, we've done a lot of work; we'll be upstreaming a Linux driver that supports this capability. But again, we need the community's help to provide feedback and think through the different use cases, because in the end we want what we deliver to be widely accessible. If you're not a Meta or a Google or an Azure, if you're an enterprise company that wants to leverage this, we want to enable those use cases too. There's more info in the link. So let me stop here for a quick set of questions if anyone has any.
Yeah, Alan.
So on the latency question, that's a big one, right? I mean, CXL just by itself already adds enough latency. So do you provide a bypass? If I say latency is critical for this particular application or function, can you bypass the compression algorithm and keep the latency?
Yeah, absolutely. Absolutely. So the question is, is there a way to shut this off? Yes, there will be controls provided to the hyperscalers, so they can actually target the right algorithm for the right latency. Actually, the proprietary algorithms operate in a couple of nanoseconds, not hundreds of nanoseconds. But yeah, we do have that knob.
Any other questions? We have a couple more seconds. And yeah, by the way, we'll still be around out here at the exhibit booth. So please drop by with some additional questions. Yeah, go ahead, Suresh.
So you said you support both your Zip and LZ4, right? Yeah, so the question is, do we support both algorithms?
Yes, in this IP solution, we will support the proprietary algorithm and LZ4.
But is that dynamically selectable, or do you have to power cycle?
So we have a roadmap for dynamically selecting it, but in the initial product it will be something you select at boot time. Thanks. I think with that I'm out of time, maybe? So thank you. Thanks.