Hello everyone. Thank you for attending the CXL Consortium's "Making Memories at Hyperscale with CXL" webinar. Today's webinar will be led by Brian Morris from Google and Prakash Chauhan from Meta. If you have any questions, please feel free to enter them into the question box, and we will address them during the Q&A. With that, I will go ahead and begin the webinar.
Welcome to Making Memories at Hyperscale! Thanks for spending your morning with us. We are Brian and Prakash from Google and Meta, and we are here to showcase a way to build platform memory with CXL that we believe makes sense at scale.
If you've been following CXL technology for a while, you've seen this pyramid any number of times. This pyramid drives me crazy. It does a nice job of capturing the latency of the different memory technologies, but it ignores cost. Yes, main memory is slower than cache, but the motivation for adding main memory to your platform is that it saves cost. But CXL, as it stands today, is more expensive than main memory. You just can't get around the fact that you're paying for all the same media, and you have to buy a CXL card to host that media. So why would I add CXL memory to the platform if it is both more expensive and lower performance? The whole thing smells like a pyramid scheme. In this presentation, we'll outline a way to address the cost problem, which moves CXL from being an interesting niche to being a critical high-volume platform component.
Let's talk about the problem we're trying to solve here. Memory cost is a significant problem for new platforms. To illustrate this, think about a modern platform where memory is half the cost of the platform. That's not always true. Some platforms populate more memory, some populate less, and the cost of memory fluctuates in the market. But if you think of memory as half the platform cost, it makes the rest of the math simple, and it won't lead you too far astray. Memory capacity scales with CPU performance. You've seen cloud service providers sell a virtual machine with a certain number of cores and a certain amount of memory, and intuitively, the more cores you have, the more memory you need to run workloads on those cores. Also intuitively, memory cost scales with capacity: you want twice as much memory, it's going to cost you twice as much money. So now, let's imagine that we're able to buy some new magic CPU that has twice the performance. I say magic because, at the end of the Moore's law era, this requires magic. We're not seeing new CPUs with twice the performance of the last-generation CPUs, at least not at iso-cost. The CPU becomes more performant, but it also becomes more expensive. So that's why we call this a magic CPU: it's twice the performance, it's no more expensive, it's no more power. All the same properties, just twice the performance. When we populate a platform with this new magic CPU, then based on the scaling we talked about above, we now need to populate twice as much memory, which means that the cost of the platform is 50% higher. So two times the performance at 50% more cost means that you've got one-third more performance per dollar. So, you know, we started with an optimistic new magic CPU technology, and the economics of the memory have brought it down to something that's significantly less compelling. Your takeaway from this should be that memory cost is a real problem for improving the performance per dollar of future platforms.
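To spell out that arithmetic, here is the same calculation as a worked equation, assuming the illustrative 50/50 split between CPU cost and memory cost used above:

% Baseline platform: CPU cost C, memory cost C, performance P.
%   Cost = 2C,  performance per dollar = P / 2C.
% "Magic" CPU platform: same CPU cost, 2x performance, so 2x memory is needed.
%   Cost = C + 2C = 3C (50% more),  performance per dollar = 2P / 3C.
\[
\frac{2P / 3C}{P / 2C} \;=\; \frac{4}{3} \;\approx\; 1.33
\]
% i.e., only about one-third more performance per dollar, despite doubling performance.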
Before we show you how to solve this problem, we need to do some homework so that you'll believe the solution once we show it to you. Technically, someone else already did the homework, and we're just going to copy their answers. On the next three slides, we're going to reference three academic papers. Each of these has a lot of great content, so I apologize to the authors for distilling their hard work into a graph and a single-sentence takeaway from each. The first paper, which we see here, is from Meta, and it tells us that there is a lot of cold data stored in DRAM today. The graph suggests that for these eight workloads, half of the memory contents haven't been used in the past minute, and a minute is an eternity from a performance perspective. If your data is only going to be touched once per minute, it doesn't need to be in fast, expensive memory. Instead, cheap, slow memory out on CXL is sufficient.
This next paper is from Google, and it tells us that cold data has good compressibility. So, of course, not all data is compressible, but across a wide range of workloads, cold data can compress at a 3 to 1 ratio—though it's more like a 2 to 1 ratio once you account for pages that can't be compressed.
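As a rough illustration of why 3 to 1 on compressible pages lands closer to 2 to 1 overall (the 25% incompressible fraction below is an assumed number for illustration, not a figure from the paper):

% If a fraction f of cold pages is effectively incompressible (1:1) and the rest
% compresses at 3:1, the blended compression ratio is
\[
R_{\text{eff}} \;=\; \frac{1}{\,f + \dfrac{1 - f}{3}\,}.
\]
% With an assumed f = 0.25:  R_eff = 1 / (0.25 + 0.25) = 2, i.e., roughly 2:1 overall.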
This third paper suggests that cold data doesn't need very much bandwidth. That isn't terribly surprising. If the data hasn't been touched in the past minute, then it probably doesn't need a lot of bandwidth. But the degree to which the required bandwidth drops is impressive. The chart on the left suggests that the vast majority of data can get by with just 1% of the platform bandwidth. With CXL, that bandwidth is additive to the baseline platform bandwidth, but this tells us we don't need an aggressive and expensive CXL solution to satisfy our bandwidth requirements.
When we put the learnings from these three papers together, we come to a powerful conclusion. We can use CXL to build a new layer in the memory hierarchy. This new layer in the pyramid is cheaper than DRAM but more performant than an SSD. We can say this with confidence because we know that there is a lot of cold data that can go into the slower tier. We know we can use compression to halve the cost of the media, and we know we can afford to pay the performance overhead that comes with compression. Best of all, we don't need a new, unproven memory technology to do this. We can do this with DIMMs that are already available in the market or, perhaps, sitting around idle in your data center. With that, I'll hand off to Prakash, who will walk you through how we make this new architecture a reality.
Hi everyone. From Brian's content, it should have become clear that our major goal here was to reduce server memory costs using CXL, because memory has become a large and growing part of the server TCO. However, we did not see the CXL ecosystem providing the right solution for our needs. So, Meta and Google worked together to draft a controller specification that provides a tailored feature set for at-scale deployment and significantly reduces costs through the maximal use of DDR4 DIMMs, of which we have an abundant supply. To further improve the cost per gigabyte of the CXL-attached memory, we introduced an inline compression feature that can further increase the usable capacity while leveraging silicon area on the CXL controller that would otherwise have gone unused. The compression engine is paired with an on-chip cache to provide spatial reuse and a reduction in latency, as well as more efficient use of the DDR bus bandwidth.
Let's dig a little deeper into why we prioritized certain things from a system viewpoint. Look at the example system on the right with 12 channels of DDR5 directly attached to a CPU and one CXL controller with 12 DDR4 DIMMs. Using DDR5 DIMMs of different capacities, for example, 32, 48, or 64 gigabytes, and DDR4 DIMMs of various capacities, for example, 16 and 32 gigabytes, we can construct a machine with different ratios of DDR5 to CXL memory, as might be appropriate for the kind of workload that we have. The usage of DIMMs behind CXL provides flexibility, ease of serviceability, and the ability to leverage an existing and mature supply chain for the entire lifecycle of this memory. Using DDR4 as the media of choice behind CXL helps us reduce costs directly by reusing memory from older systems or any leftover inventory. Also, since there are so many memory chips per system, DDR4 reuse helps lower the carbon footprint. One of the key insights from our focus on cost efficiency was that we need to increase the amount of memory we can support behind a single CXL controller, to help amortize infrastructure costs such as boards, cables, and power supplies. So, we specced out a part that can support four channels of DDR4 at three DIMMs per channel, which gives you up to 12 DIMMs; a single CXL buffer can thus support a large number of DRAM chips.
Let's look at the carbon impact in a more visual form. As mentioned before, DRAM accounts for the largest share of silicon chips in a server system. The figure on the right shows how a system with just DDR5 memory would have a large carbon footprint. By moving a fraction of this capacity to reused DDR4, we reduce the carbon footprint. And by deploying compression, we can further lower the amount of new DDR5 needed, lowering the carbon footprint further.
We have been asked a lot of questions about this controller specification, and we would like to cover some of the frequently asked questions in this slide. One of the questions that is asked repeatedly is: which workloads are you targeting for a CXL-enabled system? The answer is, our goal is to deploy CXL broadly on all compute servers. This means that there's a huge range of compute workloads that have to work efficiently on this system. Another question we get asked is: how do you deal with encrypted data? Now, clearly, encrypted data is not very compressible. So, we would have to basically decrypt the data at the ingress, compress it, and then re-encrypt it. Another question asked is: why did you not talk about memory disaggregation? The answer is that the goal to get to cheaper memory does not obviate other use cases like disaggregation and pooling.
Finally, let's talk about the future. The controller spec we created is a base specification with plenty of room for innovation. For example, memory controller vendors can provide more efficient compression and decompression schemes to make the solution tunable, with knobs that help trade off speed and compression effort. Similarly, better caching and prefetching features can be enabled, accommodating different regions of memory with different performance and cost trade-offs. There's also support for other media types and additional tiers of memory, as well as support for features like hot page detection and telemetry that reduce software complexity. A major value add would be to have better RAS schemes, which are tailored for reused memory. These might offer better system-level failure and swap rates by proactively offlining pages with excessive errors or continuing to operate with bad channels. From the DRAM vendor's perspective, there is definitely room for a low-cost and low-power DRAM, which could be at lower performance. So, with that, we come to the end of the talk, and we'll open it up for questions.
Great, with that, I will go ahead and dive into the Q&A. The first question that we received is: How do you see DRAM and CXL memory coexisting with different BW/latency characteristics?
I can take that question, and thanks, John, for asking it. So, as Brian had described, you have a lot of hot and cold pages in the system, and the way to deal with it transparently, or in a non-application aware way, is to kind of have software pieces in your kernel that identify hot pages in CXL memory, move them to main DRAM, or identify cold pages in main DRAM and move them to CXL memory. The other possibility is when your application can actually handle, or can deal with, different tiers of memory directly, in which case you allow the application to have visibility into the multiple tiers, and they can directly allocate in the slower memory when appropriate.
Great, our next question is: How do you tolerate the variable memory capacity that results from compression?
Yeah, I can take this one. It's a good question because it's not straightforward, right? It definitely requires software enablement—you should not expect to be able to just plug in a device like this and take advantage of the extra capacity, because without the ability to handle that variable capacity, you're at some point going to run out of memory when you hit a data set that's not as compressible as you would like, right? So you certainly need back pressure in the system. You need the ability to either migrate or kill jobs that are not as responsive, or you need the ability to move pages to SSD or out over the network to spill over in the case where you are running out of capacity. And you can tune the whole system so that you're only hitting that case rarely, and if you're willing to pay enough in terms of a safety buffer, maybe you never hit that case, right? So it really comes down to your tolerance for the tail condition of something like having to swap a page to SSD or migrate the whole job, or how tolerant you are of killing a lower-priority job. It factors into the larger system design, certainly.
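As a rough illustration of that back-pressure idea, here is a minimal sketch; the Node interface, thresholds, and helper functions are hypothetical stand-ins for real kernel and fleet-management mechanisms, not an actual API:

from dataclasses import dataclass

SOFT_LIMIT = 0.90   # start spilling cold pages past 90% of effective capacity
HARD_LIMIT = 0.97   # start shedding work when spilling alone can't keep up

@dataclass
class Node:
    used_bytes: int
    effective_capacity_bytes: int   # shrinks or grows as compressibility changes

def spill_cold_pages(node: Node, fraction: float) -> None:
    """Placeholder: demote the coldest pages to SSD or remote memory."""
    print(f"spilling ~{fraction:.1%} of capacity to SSD / network")

def kill_lowest_priority_job(node: Node) -> None:
    """Placeholder: reclaim capacity quickly by shedding the least important work."""
    print("killing lowest-priority job")

def handle_capacity_pressure(node: Node) -> None:
    usage = node.used_bytes / node.effective_capacity_bytes
    if usage > HARD_LIMIT:
        kill_lowest_priority_job(node)            # last resort
    elif usage > SOFT_LIMIT:
        spill_cold_pages(node, usage - SOFT_LIMIT)
    # Below SOFT_LIMIT: do nothing; size the safety buffer so this path is rare.

handle_capacity_pressure(Node(used_bytes=93, effective_capacity_bytes=100))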
Okay, our next question is: What kind of lifetime do you expect from recycled DIMMs?
I can take that, and Brian, you can add if you have more. We think that, with the recertification methodology that we're using, we can definitely run these beyond 10 years of total life and not see substantial failures from reaching the end of the bathtub curve.
The only thing I would add is that there's a pretty interesting paper that came out from Microsoft in June, I think—the GreenSKU paper—that talks about being able to reuse DIMMs and has some data on longevity, including longevity for DIMMs that they've artificially aged through temperature and voltage, presumably, which suggests, exactly as Prakash says, that north of a decade is very achievable.
Great, our next question is: why is memory cost becoming a larger portion of system cost?
I can take that. Basically, if you look back around a decade, the scaling of DRAM became much harder. With logic, you can keep cramming more in, but with the 3D-like construction for DRAM, it becomes harder and harder. So, the cost of adding capacity now scales roughly in step with the capacity itself. While for a compute element like a processor you can get some cost savings going from generation to generation with a higher core count, with memory there is no such saving. You keep paying the same cost per gigabyte, and if your memory is doubling, you have to pay double the amount of money.
Great! Our next question is: Has any vendor created an expander device conforming to the spec? Can you share examples?
So, I think one vendor announced a controller that conforms to the specification at FMS this year, and there are others that are working on it. These devices will be available in 2025.
Okay, our next question is: What versions of CXL do you find necessary for this application?
I can field this one. The interesting thing is that a lot of the capabilities being added to CXL are not fundamentally required to do this. There are a lot of things showing up in terms of standardization of flows, so that you don't end up with a bunch of vendor-specific handshakes, which is nice, right? It lets us have a pipeline of these devices that potentially come from different vendors. But for the basic functionality, really, all you need is reads and writes. I mean, there are other capabilities that the device has to have, which you see spelled out in the OCP spec, but mostly they aren't things that are absolutely required. There's talk about improved hotness monitoring in a future version of the CXL spec, and that'll be super interesting for this kind of usage, but it's not necessarily required.
Yeah, so I think CXL 2.0 has most of the features that are needed for this, but there are a lot of enhancements that make future revisions more interesting. In fact, with the direct-attached model, I think CXL 2.0 is sufficient. If you start looking at more disaggregated kinds of scenarios, you probably need the higher levels of spec compliance.
Our next question: memory in your setup does not seem to be used for LD/ST operations. Given that, is it used as a way of staging/holding colder data? Could you share any comments on this?
So, that's not correct. Sorry, Brian—I'll start, and you can carry over. CXL memory is load-store memory; it's just that the latency impact of a load or store over CXL is higher, so you want frequently accessed things to hit in local DRAM. Basically, the reason why we are monitoring pages is that it's not easy to track hotness at a per-cache-line granularity, but the accesses themselves are load-store.
Yeah, to help visualize how that works with the inline compression—you can get more of this from the OCP spec—essentially, a 64-byte cache line read transaction comes over CXL to the device, and the device has compressed the data at a larger block granularity, right? Say, at a 4K granularity, so it's taken a 4K page, which is 64 different cache lines. And that page, let's pretend, has been compressed two to one. So that single load transaction that came from the core turns into 32 cache line reads from the DRAM behind the CXL device, which is, you know, appalling from the bandwidth perspective—that's rough, right? But then it takes the single cache line that was originally requested and returns it to the core. So, from the core's perspective, it's done a standard cache line read. Now, the fact that we have that 32x bandwidth inflation is addressed by the block cache that we talk about in the spec, because fundamentally, we see a lot of locality here. If you haven't touched a page in two minutes and now you're touching it, there's a pretty good chance you've got spatial and temporal locality for that page. And so, you should expect a high hit rate in the cache.
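To make that read flow concrete, here is a minimal software model of the path Brian describes; the 4 KB block size and roughly 2:1 ratio match the example above, but the compressor, cache policy, and data structures are illustrative assumptions, since the real device does this in hardware:

import zlib

BLOCK = 4096   # compression granularity: one 4 KB page = 64 cache lines
LINE = 64      # CXL read granularity: one 64-byte cache line

compressed_dram = {}   # block address -> compressed bytes (device-side DDR4)
block_cache = {}       # block address -> decompressed bytes (on-chip block cache)

def write_block(block_addr: int, data: bytes) -> None:
    # The device stores cold pages compressed, e.g. roughly 2:1 for cold data.
    compressed_dram[block_addr] = zlib.compress(data)

def read_line(addr: int) -> bytes:
    block_addr = addr - (addr % BLOCK)
    if block_addr not in block_cache:
        # Cache miss: fetch and decompress the whole block, i.e. the equivalent
        # of ~32 cache-line reads from the device's DDR4 for one 64 B request.
        block_cache[block_addr] = zlib.decompress(compressed_dram[block_addr])
    # Return only the 64 B line the core asked for, like a normal cache-line read.
    offset = addr % BLOCK
    return block_cache[block_addr][offset:offset + LINE]

write_block(0, bytes(BLOCK))    # a very compressible cold page
print(len(read_line(128)))      # -> 64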
Great! The next question: I've seen papers showing the different characteristics of genuine CXL devices (various memory access latency). How will the data centers or users deal with this?
The fact that this data is cold, right—meaning that you haven't touched it in a couple of minutes—means that you're a lot more tolerant of that high latency. A couple extra hundred nanoseconds, you know, 200 more nanoseconds of latency for something you touch all the time would be crippling, right? It would be terrible, right? But a couple hundred nanoseconds of latency for something you touch every couple of minutes is not significantly impactful to performance.
Yeah, I think it's basically a statistical thing. If you have looked at computer architecture stuff, there's a thing called average memory access time. And it always is a question of how many times you are hitting something, at what latency. So, you can have a high-latency device that you hit infrequently, and a low-latency device that you hit frequently, and you get a good average access latency. So, the trick is just making sure that stuff that is accessed frequently stays closer to the CPU.
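For reference, the average memory access time Prakash mentions is just a weighted sum; the percentages and latencies below are illustrative numbers, not measurements:

\[
\text{AMAT} \;=\; p_{\text{local}} \, t_{\text{local}} \;+\; p_{\text{CXL}} \, t_{\text{CXL}}
\]
% For example, 99% of accesses hitting local DRAM at ~100 ns and 1% going to
% CXL at ~300 ns gives
\[
0.99 \times 100\,\text{ns} \;+\; 0.01 \times 300\,\text{ns} \;=\; 102\,\text{ns},
\]
% only about 2% slower on average, despite the 3x slower tier.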
Great. Our next question is, are there any papers that characterize this expander and its performance?
Since we haven't publicly seen the first silicon here yet, I think it's going to be a little while before the papers show up. That said, the papers that we reference have the different component pieces, as we alluded to in the talk. You can do very similar things with Optane DIMMs, for instance. You just don't have all the beautiful properties of CXL, like having a separate bandwidth pool to draw from, and having the better latency properties that you get from DDR instead of Optane.
Great. Our next question is: Does this type of CXL compressed memory tier have value in GPGPU clusters?
I'm not going to speak to anything related to ML, but feel free.
I'm just going to say that, as long as your latency tolerance allows it, you can provide capacity using CXL. In your large ML applications, a high amount of capacity is desirable. But the question is: are the bandwidth and latency offered by these CXL devices going to meet that requirement? Maybe it's useful as a second tier of memory, as long as you can manage the data movement between the CXL memory and the local HBM; there should be nothing to stop you from that. But that's not really my area of specialty, so I would say you have to do the work yourself to figure out how to deploy it.
Great. Our next question is: in what time frame do you anticipate media controller vendors delivering controllers compliant with the spec—CXL 3.x or later?
As Prakash mentioned, you saw one announced at FMS. And I don't recall what they said about when the broad availability of it is. But you can go look up that announcement. I don't believe that it is tied to CXL 3.x. I believe CXL 2 is sufficient.
Yeah, I think there will be some CXL 3.x ECNs that will be in these devices—not necessarily full CXL 3.x compliance. But I expect devices in ’25 and production readiness sometime in ’26.
Great. And our next question is: would 3D NAND/SSD suffice as a low-performing memory?
Brian, do you want to take that?
Sure. You know, the real problem I have with NAND in this context is tail latencies. If your system can tolerate those tail latencies, fabulous, right? And there are some places where we find that we can tolerate those tail latencies. And then, by all means, if you can use NAND, use NAND, right? The dollar per gigabyte for NAND—that is much better, right? And it is TBD to what degree a low-latency NAND could get us there, right? It's nice because you remove the swap that we usually associate with NAND, so that saves a bunch of latency. If you can put that low-latency NAND behind CXL, right? That can be quite interesting. The flip side of that is, you've now taken a load from the core and tied it up for potentially the tail latencies of your low-latency NAND, which, despite the name, still has a fair amount of latency at the tail. Locking up buffers in the core and the cache for a long period of time to satisfy that request could be quite impactful for performance. But maybe not, right? I think that is an interesting space to explore.
Yeah, I think intuitively you don't want to expose the core to a load latency that is dictated by the NAND tail latency. But yeah, people should try to do this and see how it works out.
Great. Next question: What percentage of system bandwidth is needed to move hot/cold pages from DRAM to CXL memory?
Yeah. The interesting thing is that there aren't hard limits here, right? It depends on the application and what you consider hot and cold—how far you want to turn the knob. If you turn the knob so that more data counts as cold, you're going to get a lot more cost savings, but you're also going to be more impactful to performance, and one of the vectors where it's more impactful is the increased promotion and demotion bandwidth that you need to move things to that next tier. I believe the data that you see in the paper is based off of Optane, where Google has built a tiering stack similar to this using Optane rather than CXL. Relative to the bandwidth available in the second tier, it was something like an order of magnitude less that was used for promotion and demotion. But again, that's very tunable, depending upon where you set your tolerances and what your priorities are, right? You might say that it is of utmost importance to promote something that even has a hint of starting to warm up, and then you're going to be much more aggressive about moving things from the cold tier to the hot tier, in which case you're going to need more promotion bandwidth. Or, if your application is more tolerant of those latencies and you have the bandwidth to spare, then maybe you say, "Yeah, maybe I won't promote this right away. Maybe I'll wait and see if it's really needed." And the large SRAM cache that you see us recommending in the OCP spec helps facilitate that, because if you have something that is hot and it's been decompressed and it's sitting in that block cache, then you're just talking about SRAM latencies—after you get across CXL, of course, but only SRAM latencies—and no DDR media bandwidth is necessary to respond to those requests.
One more thing I would add is that, initially, we'll be deploying this with some caution, trying to steer as little bandwidth to CXL as possible. As we get better at tuning the systems, we can look at more disproportionate sizes of CXL memory and also introduce things like job-specific rules on how much of a workload's footprint can reside on CXL versus main memory. Those are things that will basically depend on how things go with first-generation deployments.
Great. Our next question: Is there work ongoing with the CXL Consortium to standardize the variable capacity that would result from compression?
There's this notion of a dynamic capacity device, but I'm not entirely sure whether we've mapped that onto the variable capacity introduced by compression. Brian, do you want to add anything?
I don't think they map well, and I think it's still an open space. I also think there's an interesting question about the degree to which there should still be innovation. Perhaps there are different ways to do this or think about this, versus saying it's time to standardize and lock it down so everybody has the same view of the world across devices.
"Great. Our next question: which module form factors are the focus for CXL expansion memory—E3.S, E1.S, or are you looking at custom form factors? And then finally, are you looking at the new E5 form factor being defined at SNIA?
I'll go first. I suspect Prakash and I have different answers to this one. My opinion is that the form factor is a third-tier consideration, but it is one of the first questions that people ask me. The key is to focus on value, right? Define a device, find an architecture that has high value, and then start worrying about the form factor. Now, obviously, if the form factor is extreme and you can never make it fit in your server, then maybe that's a problem. But, you know, NVIDIA has been wildly successful with their very nice GPU products over the past decade, and that success is not riding on the back of snapping to a standard form factor. The value should come first, and then after that, we figure out the form factor. In the case of what you see in the OCP spec, we're really going to have DIMMs. We're going to have a lot of DDR4 DIMMs to amortize those costs, and that leads to something much larger than the standard form factors.
Yeah, I... I don't disagree at all with whatever Brian said. I think the question is being asked, probably, in terms of what vendors of memory solutions should be producing. To date, I don't think we have established a standard that will work well with these DIMM-type devices. I'm not familiar with E5, so I cannot comment on that. But I think adapting to a different or custom form factor is a much easier problem for us than getting the right feature set in the controller ASIC. So that's what we focused on in our specification.
Great! Our next question: in addition to the cold data, could you go into whether CXL with compression would be a good fit for in-memory data analytics, which tend to be less sensitive to random access latency?
I don't think I'm enough of an expert on the workload to say. I can say more generically that if you aren't terribly sensitive to latency, and you have compressibility, and either you don't have enormous bandwidth requirements or you have really good locality, then it makes sense to put your data in a device like this, right? So you could imagine a database where you have a whole lot of capacity needs that are kind of warmish—maybe they're not cold, maybe they're only warmish—you can tolerate a little bit of extra latency, but the cost of the platform is really dominated by the memory cost, and that could be a good fit, if you really do have compressibility for your data.
I think the question might be relevant in terms of a near-memory compute kind of application. I'm not sure if that's the intent, but clearly those things are possible. We haven't yet looked at the deployment models for a specific application, though.
Next, regarding tiering: do you see auto-tiering being handled at the node level first, at the resource scheduler level, or all at once?
I think everybody has to participate because of the variable capacity. The tiering at the node has to be able to react immediately. But of course, when you're scheduling jobs, you need to understand what that variable capacity means and how you've built in the appropriate buffering for scheduling work onto that node.
Next question: What CXL link bandwidth versus capacity ratio is appropriate for this type of cold data, low-cost expansion?
I can take this one. Basically, our principle for defining the speeds and feeds on this device—these are actually captured in the specification—is to match the CXL link bandwidth to the device-side DRAM bandwidth. This approach allows us to support both DDR4 and DDR5 DIMMs at different speeds and with different DIMMs per channel. You can take a look at the specification to see which bandwidths we are targeting, but the actual usable bandwidths may be kept significantly lower to keep performance under control.
Our next question: do you see any usage of such compression-capable CXL devices for AI/ML workloads?
I think we covered this in an earlier question. I don't see anything that prevents it from being used; it's just a question of whether you can tolerate a higher latency and lower bandwidth than your native attachment to HBM.
For our next question, could you delve a bit into the software requirements for this application?
The fundamental pieces that you need are the ability to do promotion and demotion of pages, right? So, demote pages as you find that they're becoming cold. There's a lot of talk about using MGLRU for these purposes, which is really interesting. Promotion needs some work to be done in a standardized fashion; if you look at how Google was doing it with Optane, that was not yet done in an upstream-compliant way. The interesting thing that's coming along with these types of devices—and especially once we see hotness monitoring show up in CXL as a standard—is having an interface to the device where it keeps track of what the hottest pages are. You can walk page tables to figure out what's hot, but that's slow; you're going to be late to the game in figuring out that something's become hot. You can sample addresses with performance monitoring to determine what's becoming hot, and that can work quite well. But the best you're going to get is from devices that have the telemetry baked in to identify the hottest of the hot pages to prioritize for promotion. And then the other piece is the stuff we talked about earlier in terms of how you tolerate variable capacity.
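As a very simplified sketch of that promotion/demotion loop—the hotness sources and page moves below are stand-ins for MGLRU aging, PMU sampling, and device telemetry, not real kernel APIs, and the thresholds are made-up numbers:

import time

COLD_AFTER_S = 120      # pages untouched this long become demotion candidates
HOT_THRESHOLD = 8       # minimum recent accesses before promoting a CXL page
PROMOTE_BATCH = 256     # cap on pages promoted per pass

def demotion_candidates(pages, now):
    # Demote: local-DRAM pages whose last access is older than the threshold.
    # (A real kernel would get this from MGLRU-style aging, not raw timestamps.)
    return [p for p in pages if p["tier"] == "local" and now - p["last_access"] > COLD_AFTER_S]

def promotion_candidates(pages):
    # Promote: the hottest CXL-resident pages, ideally ranked by device telemetry.
    hot = [p for p in pages if p["tier"] == "cxl" and p["access_count"] >= HOT_THRESHOLD]
    hot.sort(key=lambda p: p["access_count"], reverse=True)
    return hot[:PROMOTE_BATCH]

def tiering_pass(pages):
    now = time.time()
    for p in demotion_candidates(pages, now):
        p["tier"] = "cxl"      # stand-in for migrating the page out to CXL memory
    for p in promotion_candidates(pages):
        p["tier"] = "local"    # stand-in for migrating the page back to local DRAM

pages = [{"tier": "local", "last_access": time.time() - 300, "access_count": 0},
         {"tier": "cxl",   "last_access": time.time(),       "access_count": 42}]
tiering_pass(pages)
print([p["tier"] for p in pages])   # -> ['cxl', 'local']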
Yeah, as far as the hotness and coldness goes, at least Meta's work has been in the public domain. There's a paper on transparent page placement, and regular contributions are happening to the upstream Linux kernel. As for the variable capacity and compression, that stuff is not yet in development, and it will basically need some of these devices in hand before it can happen.
Great. And then, our next question: could you share a bit more about doing compression/decompression on the CPU side to reduce bandwidth on the CXL link?
So it's possible, but now you say, “I need to cut open every CPU that I want to use to add this capability.” It also has the trade-off of, now, you are stuck with whatever the CPU has implemented, rather than being able to have multiple vendors build devices that are tailored to the sort of things that you might be interested in doing over CXL. It is possible, but it's certainly not where we are in the first generation. That said, I could see a CXL link compression capability as being attractive to artificially improve the bandwidth of the CXL link, independent of what we're doing here.
Great. Our next question: Do you have workloads that can be completely placed into CXL-attached memories, or do your workloads need some amount of working set in local DRAM?

I'm sure that some applications can tolerate being fully in CXL memory. They just don't access RAM very frequently, I guess. Yeah, it's very workload-specific. I would think that the batch kinds of jobs that have no latency guarantees at the user level could definitely run fully out of CXL, and it should not harm anything. And Brian, feel free to...
Yeah, I agree. It’s wildly workload-dependent, right? I mean, we talked earlier about how you could imagine a large in-memory database where the vast majority is out on CXL. Would I go so far as to say 100%? Well, I don’t really see a reason to take the hot subset and punish it that way. But you do what you feel is best, right?
I think we have time for one last question. Is there any extra level of validation or verification expected for compliant devices, apart from general CXL validation?
I think, because these are devices that have DIMMs attached, there is a lot of interoperability testing required with different DIMM vendors to make sure that these DIMMs are trained properly with sufficient margins in many different populations, like three DIMMs per channel, two DIMMs per channel, or one DIMM per channel. Depending on the use case we are targeting, we could have many different configurations in which this device is used. So, making sure that it works reliably in all those corner cases is important.
Excellent. Well, thank you so much to all of our speakers. That's going to be our last question for the day. For any questions that we didn't have the chance to address, we'll be publishing a recap blog that addresses them. Also, the webinar recording will be made available on the CXL Consortium YouTube channel, as well as here on our BrightTALK channel. We will also make these slides available for download on our website. Thank you all for your great questions, and thank you very much.