All right, folks. Thanks for staying back for this session. I'll be talking a little bit about how we at ZeroPoint are helping to streamline CXL adoption in hyperscale settings. Just a quick introduction: I do business development for ZeroPoint Technologies. We are an IP licensing company based out of Sweden.
So first, let's talk about what the challenge is before I jump into the solution. We know from papers published by all the hyperscalers, Google, Meta, that compression is taking up a significant amount of CPU cycles in the data center today. Up to almost 5% of all CPU cycles are being used for software compression and decompression in hyperscale infrastructure. That's monetizable compute being spent on what is essentially an administrative workload. So that's one challenge. To address it, recently, and by recently I mean the end of October, the OCP published the hyperscale CXL tiered memory expander spec. If you read through the spec, what's called out there is the need for hardware-accelerated, lossless memory compression. And it's really to address the challenge I mentioned: software compression is taking up precious CPU cycles.
Obviously, wherever there's a challenge, there's an opportunity. The opportunity here is for the industry to deploy not just a CXL memory tier, so you have your DRAM tier and then your CXL tier, but a third tier: a compressed memory tier that exists within the CXL device itself. If you go through the OCP spec, it explains the rationale behind this. The benefit is a reduction in total cost of ownership, because imagine you have a one-terabyte device, and through compression it is now exposed as a four-terabyte device. Your total cost of deploying CXL composable memory drops. And in terms of sustainability and efficiency, this definitely plays into that angle, because it's not sustainable to simply keep adding memory modules. If you walk around this OCP Summit, you'll see a lot of fantastic cooling solutions, and we've heard plenty of talks about how that challenge is being addressed. We're addressing it in a slightly different way: in the same physical capacity, we pack in much more. So we point to this spec as something of a game changer, because it gives CXL module developers guidance on how they should actually build the hardware. If you look at the arrow, it's pointing to the CXL controller device, and that's really where the spec is targeted.
If you do a double-click into the spec and read through what's stated there, the requirements are pretty stringent. We already know CXL is very sensitive to latency. If you look at all the talks about CXL, composable memory, and NUMA configurations for memory, latency is definitely a sensitive parameter, so you don't want to get in its way. The second thing is compression algorithms. They've been around for years; well-known industry standards like LZ4 and Zstandard have long been used for storage compression. If you look at the bright yellow chart on the right-hand side, those algorithms work wonderfully for storage workloads, but they come with a penalty. In storage, you operate at block or page sizes, and you have microseconds in which to complete these operations. But now we're talking about memory, CXL memory, where you have a few nanoseconds. If you read through the OCP spec, you have at most 90 to 150 nanoseconds to complete whatever compression or decompression you're going to do. You can still use LZ4 and Zstandard in these scenarios, but we've developed a proprietary technology that operates at cache line granularity, 64 bytes. So we're able to compress data at 64-byte cache line granularity. Why does that matter compared to those other algorithms? Mostly for the read cycle. When I'm reading data, I don't want to pull an entire page out of DRAM, find the cache line, and then decompress it. With our algorithm, you can fetch just the single cache line you care about, decompress it, and pass it on, so we keep latency extremely low, well below the levels specified in the spec. Of course, the spec also requires support for legacy algorithms like LZ4, but there's room for innovation as well, and that's why our IP block supports both. Composability is a new concept. You heard Suresh talk about emerging use cases with the OAM form factors, and there could be a need for memory attached to those accelerator units as well. There are many future use cases yet to be tested, so we want to leave the door open: in some cases you might need standard algorithms for compression ratio, and in others you might need the proprietary algorithm for the best latency.
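Just to make that read-path difference concrete, here's a little sketch of the two schemes. It's only an illustration, not our implementation; it uses zlib from the Python standard library as a stand-in codec, and the point is how much data each scheme has to decompress to serve one 64-byte read, not the ratio you'd get.

```python
# Illustration only (not ZeroPoint's implementation). zlib stands in for the
# real codec; the comparison is about how much data must be decompressed to
# serve one 64-byte read, not about compression ratio.
import os
import zlib

PAGE = 4096   # bytes per page
LINE = 64     # bytes per cache line

page = os.urandom(PAGE // 2) * 2   # a somewhat compressible 4 KB page

# Page-granularity scheme: the whole page is one compressed unit.
comp_page = zlib.compress(page)

# Cache-line-granularity scheme: each 64-byte line compressed independently.
# (Random 64-byte lines barely compress; ratio is not the point here.)
comp_lines = [zlib.compress(page[i:i + LINE]) for i in range(0, PAGE, LINE)]

def read_line_page_granular(idx: int) -> bytes:
    # Must decompress the full 4 KB page to recover a single 64-byte line.
    return zlib.decompress(comp_page)[idx * LINE:(idx + 1) * LINE]

def read_line_line_granular(idx: int) -> bytes:
    # Only the requested 64-byte line is decompressed.
    return zlib.decompress(comp_lines[idx])

assert read_line_page_granular(7) == read_line_line_granular(7)
```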
So our solution: like I said, ZeroPoint is an IP licensing company, and we have an IP solution that plugs into pretty much any ASIC device, plug and play. We've already announced our first integration with a CXL customer. If you look at the purple block there, our IP block slots right in over an AXI interface, a pretty standard system bus interface, so it just drops into your ASIC. Then, transparently, we compress and decompress data as it comes in from the host over the CXL.mem interface, just before it hits the memory. By the way, we have a demo on the show floor if you haven't already seen it; Evangelos and Dimitrios are here and can show you, and it looks much nicer in real life. What we do is dynamically adjust the memory tiers: as data comes in, we compress and decompress on the fly, and the capacity grows and shrinks dynamically without the user having to babysit or manage this third compressed tier.
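If you want a rough mental model of that dynamic sizing, here's a toy calculation, not our actual controller logic: the capacity the device can advertise to the host scales with the compression ratio observed on the data that's currently resident.

```python
# Toy model of a dynamically sized compressed tier (illustration only, not the
# actual controller logic). The advertised capacity tracks the compression
# ratio observed on the data currently stored in the device.
PHYSICAL_GB = 1024   # e.g. 1 TB of physical DRAM behind the CXL device

def exposed_capacity_gb(uncompressed_gb: float,
                        compressed_gb: float,
                        physical_gb: float = PHYSICAL_GB) -> float:
    """Rough estimate of the capacity the device can advertise to the host."""
    if compressed_gb == 0:
        return physical_gb                 # nothing stored yet: assume 1:1
    ratio = uncompressed_gb / compressed_gb
    return physical_gb * ratio             # physical space scaled by observed ratio

print(exposed_capacity_gb(400, 200))   # 2:1 so far -> ~2048 GB exposed
print(exposed_capacity_gb(400, 100))   # 4:1 so far -> ~4096 GB exposed
```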
And again, because what we deliver is soft IP, we want to democratize access to this kind of solution. In the spirit of the Open Compute Project, we want to make this IP available to any ASIC vendor on essentially any process node: whether you're developing at 7 nanometers, 5, or maybe 3, it's portable. The other thing is it's a transparent solution, meaning you don't need to change your software stack to take advantage of it. That was another motivation behind the OCP spec: preserve the software investments companies have already made rather than forcing everyone to change their software stack. So we do offer that transparency. And if you look at the size of our IP, the basic blocks are pretty minuscule, so it plugs in compactly and delivers this 2 to 4x compression in a very small area.
The other thing of interest from an end-user perspective: what they care about is, OK, this is great, you do cache-line compression, it's transparent, it's easily pluggable, but what's in it for me as an end user? Yes, I get this capacity, and what that effectively does is lower your total cost of ownership. When you look at these four bars, what we've done is characterize the total cost of ownership. We have detailed models, by the way; feel free to reach out to me and I'm happy to share them. We're also introducing some of these models in the OCP Composable Memory Solutions sub-workgroups, the workload workgroups, so if folks have different thoughts on TCO models, we'd love to hear them. With our initial model, there's a benefit just from moving to CXL, because you get rid of the stranded memory we've heard about in several talks here. With compression, you can improve the total cost of ownership further. And when you combine CXL with compression, that reduces your total cost dramatically, by up to 25%. And this is from both a CAPEX and an OPEX perspective, because if you're going to invest in CXL infrastructure, what compression lets you do is amortize that cost over a larger memory pool: a terabyte of memory effectively looks like four terabytes.
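To give you a back-of-the-envelope feel for that amortization argument, here's a tiny sketch with completely made-up numbers, not our actual TCO model: the fixed cost of the module is spread over the effective capacity the host sees rather than the physical capacity.

```python
# Back-of-the-envelope TCO illustration with made-up numbers (not ZeroPoint's
# model). A fixed module cost is amortized over effective, not physical, capacity.
def cost_per_effective_gb(dram_cost_per_gb: float,
                          physical_gb: float,
                          module_overhead: float,
                          compression_ratio: float) -> float:
    """Hardware cost divided by the capacity the host actually sees."""
    hardware_cost = dram_cost_per_gb * physical_gb + module_overhead
    effective_gb = physical_gb * compression_ratio
    return hardware_cost / effective_gb

# Hypothetical inputs: $3/GB DRAM, a 1 TB module, $500 controller/board overhead.
baseline = cost_per_effective_gb(3.0, 1024, 500, compression_ratio=1.0)
with_cmp = cost_per_effective_gb(3.0, 1024, 500, compression_ratio=2.5)
print(f"no compression:   ${baseline:.3f} per effective GB")
print(f"2.5x compression: ${with_cmp:.3f} per effective GB")
```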
Like I was saying, this is just a snapshot of the demo, and it looks much nicer if you see it in real life. So talk to these guys, we have the booth out there, and you can see a pretty cool live demo.
In terms of performance, one thing about lossless compression is that the performance depends on your incoming data. So we've characterized this IP across some of the popular data center workloads and benchmarks: SPEC, Renaissance, which is a Java-based benchmark, and of course MLPerf, because you can't have a slide here that doesn't talk about AI benchmarks. You're seeing three bars here. The purple one is the proprietary algorithm, and we also implement LZ4, the industry-standard algorithm, which we've highly optimized for performance. We provide 2 to 3x compression ratios, and it's actually pretty competitive even compared to LZ4, even though we operate only on a single cache line, which is not a lot of data to work with. When you think of compression, you're typically operating at file size, compressing PDFs or pages and blocks, but we're able to achieve these ratios at 64-byte granularity.
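And if you want a ballpark for how compressible your own data is at 64-byte granularity, here's a simple harness, again just an illustration with zlib standing in for the real algorithms, so treat the number as a rough indication only, not what the hardware IP will deliver.

```python
# Rough harness to gauge how compressible a buffer is at 64-byte granularity
# (illustration only; zlib stands in for the real algorithms, so treat the
# result as a ballpark, not what the hardware IP achieves).
import zlib

LINE = 64

def cacheline_ratio(buf: bytes) -> float:
    """Uncompressed/compressed size with every 64-byte line compressed independently."""
    total_in = total_out = 0
    for i in range(0, len(buf) - len(buf) % LINE, LINE):
        line = buf[i:i + LINE]
        total_in += len(line)
        # Never store an expanded line: fall back to the raw 64 bytes.
        total_out += min(len(line), len(zlib.compress(line)))
    return total_in / total_out if total_out else 1.0

# Example: a memory snapshot with many zero lines compresses very well.
snapshot = (b"\x00" * LINE) * 800 + bytes(range(256)) * 16
print(f"cache-line granularity ratio: {cacheline_ratio(snapshot):.2f}x")
```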
So, to summarize: this IP solution is available, it's already OCP spec compliant, it's portable across multiple process nodes, and it's a drop-in IP solution. It's also production ready as of this year, and we've had our first customer, a major manufacturer, license this IP, with more to come. The call to action here: while we've developed the IP solution, we need to work together to actually develop this third tier and deliver on the promise of CXL as a total cost of ownership value proposition. That's where we'd like to work and partner with other OCP members, especially if you're working with controller manufacturers, or if you are a controller manufacturer or memory vendor. We want to work together to address the hyperscale requirements in the OCP spec. On the software side, we've done a lot of work; we'll be upstreaming a Linux driver that supports this capability. But again, we need the community's help to provide feedback and think through the different use cases, because in the end we want what we deliver to be widely accessible. If you're not a Meta or a Google or an Azure, if you're an enterprise company that wants to leverage this, we want to enable those use cases too. There's more info in the link. So let me stop here for a quick set of questions if anyone has any.
Yeah, Alan.
So on the latency question, that's a big one, right? I mean, CXL just by itself already adds enough latency. So do you provide a bypass? If I say latency is critical for this particular application or function, can you bypass the compression algorithm and keep the latency?
Yeah, absolutely. Absolutely. So the question is, is there a way to shut this off? Yes, there will be controls provided to the hyperscalers, so they can actually target the right algorithm for the right latency. Actually, the proprietary algorithms operate in a couple of nanoseconds, not hundreds of nanoseconds. But yeah, we do have that knob.
Any other questions? We have a couple more seconds. And yeah, by the way, we'll still be around out here at the exhibit booth. So please drop by with some additional questions. Yeah, go ahead, Suresh.
So you said you support both your Zip and LZ4, right? Yeah, so the question is, do we support both algorithms?
Yes, in this IP solution, we will support the proprietary algorithm and LZ4.
But is that dynamically selectable, or do you have to power cycle?
So we have a roadmap for dynamically selecting it, but in the initial product it will be something you select at boot time. Thanks. I think with that I'm out of time, maybe? So thank you. Thanks.