All right, welcome again to our webinar to talk about CXL memory utilization. I am your moderator, Kurtis Bowman. I'm the co-chair of the CXL Marketing Workgroup, along with Kurt Lender from Intel, and also the director of server system performance at AMD. Today we are joined by some great panelists. We have Bill Gervasi, principal system architect at Wolley. Vijay Nain is the senior director of CXL product management at Micron. And Tracy Spitler is the co-founder and VP of engineering at IntelliProp. So with that, let me hand it off to our great presenters, and we'll start today with Vijay.
Thank you, Kurtis, for the introduction. Good morning, good evening, everyone. I'll start by saying that for the success of any new technology, it's super important that all the critical ecosystem players are proactive in their contributions. And this has definitely been the case with CXL.

Let's start with the enablers, or in other words, the CPU vendors. At Micron, we've seen an extremely high level of collaboration, participation, and support from CPU suppliers. And this goes all the way from basic bring-up and ironing out early issues to ensuring that the validation of CXL memory modules is done correctly. Micron, as most of you know, is a memory supplier, and for the success of any new memory technology you simply can't have one supplier. So we're really happy with the fact that we're joined by the other two large suppliers in providing CXL-based memory solutions, which bodes very well for the future of CXL. Building up from CXL 2.0 compliant CPUs and memory modules, we've seen extremely strong engagement from all the major server manufacturers, or OEMs, regardless of whether the end platforms make their way into traditional enterprise environments or hyperscale environments. And then to make all of this work, there needs to be a healthy CXL ASIC ecosystem, and there definitely is. There are multiple vendors in the space, each offering slightly differentiated solutions while still being compliant with the baseline CXL 2.0 spec. So this really presents the industry with various options to implement proofs of concept, map out different value propositions, and so on.

One of the really interesting and important areas for CXL is the software stack. And this is really unique, because if you consider the case of traditional memory modules, you plug into a server socket, you do some basic compatibility testing with the CPU of your choice, and that pretty much completes the scope of testing you need to do. With CXL, things are different. We need to make sure that the first level of software sitting on bare metal, whether it's a hypervisor, commercial or open source, or just a Linux distro or your Windows operating system in the data center, has awareness of CXL memory modules to be able to extract the best possible value for the different use cases. And then, moving on from that base layer of software, there are applications that we're very interested in looking at from a compatibility and use case perspective. There are two pieces to this. One is applications that help you better manage the tiers of different memory that CXL brings, and these work with the tools and capabilities of the operating system to better extract value from CXL memory. That's one part of it, and we've seen companies come forward with really elegant solutions to maximize your CXL value propositions. We're also seeing that as we move higher up the stack, there are end-user applications that are already showcasing some really good benefits with CXL, whether that's from bandwidth expansion, capacity expansion, or a combination of both. And then finally, to make all of this work well together in the 2.0 time frame, there has been excellent support from test vendors, who have come up with really solid test infrastructure for us to make sure that anything we build complies without exception to the baseline spec.
Now let's talk a little bit about what hardware is actually available today and where we've seen some performance improvements that we can talk about. On the top left here, I have a good example of what's available. We have a CXL 2.0 compliant memory module that Micron has made public. This is based on the E3.S 2T form factor, and there are two capacity points, 128 and 256 gigabytes. And the important thing here is to compare it to, let's say, an RDIMM: the bandwidth that you would get from one of these modules is approximately equivalent to the bandwidth you would get from a single DDR5 RDIMM running at maybe 4.8 gigatransfers per second. That's pretty useful in comparing what you can get in both bandwidth and capacity between something that's directly attached and a CXL module. There are two clear benefits of any CXL memory expansion: you get more capacity, and you get more bandwidth as well. So by adding eight 256-gigabyte memory modules, we see two terabytes of increased capacity. By using four additional modules with a baseline of 12 RDIMMs, we can see a 24% increase in memory bandwidth. Those are some very basic first-order benefits that you would see with memory expansion through CXL.

A couple of use cases where we're seeing clear benefits. On the top right is the most obvious benefit that you would see when you add pure memory capacity to any system. SQL servers are limited in the amount of memory, and everything that's not in memory has to be on disk. By adding DRAM, whether it's direct attached or CXL, you would see a substantial increase in performance for these types of database workloads. So that 2x increase comes purely as a capacity play, and it's a clear value add for such use cases. On the bottom right is a slightly more interesting and more nuanced use case, which is around DLRM, or deep learning recommendation model, a workload that's been open sourced by Meta. They use some variant of this in their recommendation engines. One of the more intensive operations they need to perform is embedding reduction, which in mathematical terms is taking a bunch of vectors and compressing them into a smaller number of vectors or a single vector, which then helps with downstream computations. There are really two takeaways from this chart. One is that as you move from left to right, the net throughput increases, because you are increasing the number of threads and the amount of compute that's available. If you look at the histogram all the way on the left, there really isn't much of an improvement when you add CXL memory, because the limitation comes from the compute capability and we're not yet hitting a memory bandwidth or capacity limitation. As we move to the right, throughput increases, but now you can see relative changes across the individual bars of the histograms. For each histogram, you can think of the bars, as they move from left to right, as indicative of the ratio of memory that you allocate directly through an RDIMM versus through CXL-attached memory. And DLRM finds a sweet spot where you get the maximum possible throughput with a certain combination of locally attached and CXL-attached memory. The key takeaway here is that for every end application you're running, there's going to be some ratio which delivers the best throughput for the system.
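As a rough back-of-the-envelope check on those expansion numbers, here is a minimal sketch in Python; the per-channel bandwidth figure is an assumption based on the DDR5-4800 comparison above, not Micron's measured data:

```python
# Back-of-the-envelope sketch of the expansion numbers quoted above.
# Assumption (not from the talk's measured data): each CXL E3.S module is
# treated as roughly one DDR5-4800 RDIMM channel's worth of peak bandwidth.

RDIMM_GTS = 4.8          # DDR5-4800, gigatransfers per second
CHANNEL_BYTES = 8        # 64-bit data bus = 8 bytes per transfer
rdimm_bw = RDIMM_GTS * CHANNEL_BYTES        # ~38.4 GB/s peak per channel

baseline_channels = 12                      # 12 RDIMMs in the baseline server
cxl_modules = 4                             # 4 added CXL modules

added_capacity_gb = 8 * 256                 # eight 256 GB modules -> 2048 GB
peak_bw_gain = (cxl_modules * rdimm_bw) / (baseline_channels * rdimm_bw)

print(f"Added capacity: {added_capacity_gb} GB (~2 TB)")
print(f"Theoretical peak bandwidth gain: {peak_bw_gain:.0%}")
# Prints ~33%; the ~24% quoted in the talk is a measured figure, which is
# lower because of CXL protocol and controller overheads.
```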
Now let's talk about some of the new use cases that we've been looking at. And again, this is all with CXL 2.0, so a lot of this indicates really good progress being possible in the 3.0 timeframe. The chart on the left is an AI inference workload that Micron ran on the Llama language model. I'm sure everybody on this call is very familiar with Llama; it's the open-source language model of choice these days. We ran a 70-billion-parameter model, and there are really two takeaways here. The first, on the lower part of this chart, is a net 22% improvement in performance, measured by looking at the number of tokens per second, or the raw throughput of the language model. And this comes from bandwidth expansion. The upper part of this chart shows again that it depends on the type of workload you're running, and Llama, by the way, is a very read-intensive workload with very little write. By determining the different interleaving ratios between direct-attached memory and CXL-attached memory, you can optimize the throughput. We're showcasing here a software interleaving approach, which gives you a lot more flexibility in fine tuning.

Moving to the top right, there are use cases in machine learning training where you could take various combinations of media, put them behind a CXL interface, and gain benefits in the training throughput of a GPU. As we know, all GPUs in the data center today use high bandwidth memory. High bandwidth memory is extremely good at what it does, but it's limited in terms of how much you can cram onto a GPU, and it's also extremely pricey. So what usually happens when you overflow the memory requirements that can be supported by the locally attached HBM is you go to the CPU and use some of the memory attached to the CPU, or you have to go to disk. In either of those latter scenarios, CXL can help, either by just expanding the byte-addressable memory associated with the CPU, or with some kind of hybrid media: you could have a combination of an extremely large DRAM buffer along with flash sitting behind CXL and gain throughput benefits for GPUs in training.

And then finally, at the bottom right, in the virtual machine space there are a bunch of different opportunities that CXL 2.0 unlocks, from baseline use cases of expanding the memory capacity of certain virtual machines and giving them more bandwidth, to going to a tiered virtual machine architecture where you could say, I don't want all the memory that this virtual machine uses to be direct attached; I'm okay with having some portion of it be a little higher latency and connected through CXL. Those use cases allow for tiered VM opportunities as well, which we are starting to see more interest in across the industry. So I know I touched a little bit on ML training, but I'm going to hand it off to Bill, who's going to give you a lot more detail on that specific area.
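To illustrate why the interleaving ratio between direct-attached and CXL-attached memory has a sweet spot for a bandwidth-bound, read-heavy workload like the DLRM and Llama cases above, here is a toy model; the bandwidth numbers are placeholders, not figures from the talk:

```python
# Toy model of the interleave-ratio "sweet spot". Bandwidth numbers below
# are illustrative placeholders, not Micron's data.

LOCAL_BW = 300.0   # GB/s, aggregate direct-attached DRAM (assumed)
CXL_BW = 90.0      # GB/s, aggregate CXL-attached memory (assumed)

def effective_bw(cxl_fraction: float) -> float:
    """Traffic is split by the interleave ratio; the slower side of the
    split determines how long the whole access stream takes."""
    local_time = (1.0 - cxl_fraction) / LOCAL_BW
    cxl_time = cxl_fraction / CXL_BW
    return 1.0 / max(local_time, cxl_time)

best_bw, best_share = max((effective_bw(f / 100), f) for f in range(0, 101))
print(f"best CXL share ~{best_share}% -> effective bandwidth ~{best_bw:.0f} GB/s")
# The optimum lands near CXL_BW / (LOCAL_BW + CXL_BW); pushing more or less
# traffic to CXL than that leaves one tier idle while the other saturates.
```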
Thank you, Vijay.
So we're going to zoom in a little bit on the use of CXL for artificial intelligence acceleration, machine learning, and so forth.
As we were looking at it, when you want to use HBM as an acceleration factor for your engines, it's great, except that it is capacity limited. Even if you stack these things up pretty high, you get a fairly limited amount of capacity that you can add to your system. If you look, for example, at a typical application, you might get 80 gigabytes of total capacity. But you also have to have 9,600 signals between the accelerator and the memory devices, and they have to be less than two millimeters away. So physical limitations alone mean that you're going to hit the wall. If you look at what's going on with large language models in particular, they just don't fit in that HBM, and that's where you hit this wall where you simply can't fit the models into memory. In particular, what that means is that the way you add bandwidth is by adding more AI accelerators. So there's a fundamental cost, a fundamental footprint increment, that you're going to suffer by taking that approach to solving the problem. What we at Wolley wanted to do was to look at alternatives to this. We're working with a customer who needs to expand for the large models but simply can't fit enough HBM.
What's happening is that the roofline model indicates a point up to which adding memory helps, and beyond which you become compute bound. The challenge for large language models is to change that inflection point. If we can increase the amount of memory, we can change the point at which you switch from being memory bound to compute bound. And we just can't keep adding more HBM, because of the physical limitations it presents.
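For reference, a minimal roofline sketch, with purely illustrative peak-compute and bandwidth numbers, shows how added memory bandwidth moves that inflection point:

```python
# Minimal roofline sketch: attainable throughput is bounded either by memory
# bandwidth (low arithmetic intensity) or by peak compute.
# Numbers are illustrative placeholders, not vendor specifications.

PEAK_COMPUTE = 1000.0   # GFLOP/s (assumed accelerator peak)
MEM_BW = 100.0          # GB/s   (assumed memory bandwidth)

def attainable(arith_intensity: float, mem_bw: float = MEM_BW) -> float:
    """Roofline: min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(PEAK_COMPUTE, mem_bw * arith_intensity)

for ai in (1, 5, 10, 20):
    base = attainable(ai)
    expanded = attainable(ai, mem_bw=MEM_BW * 1.5)   # e.g. adding CXL bandwidth
    print(f"AI={ai:>2} FLOP/byte: {base:6.0f} -> {expanded:6.0f} GFLOP/s")
# Adding bandwidth moves the ridge point left: workloads that were
# memory-bound become compute-bound sooner.
```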
So what we're proposing to the industry is a rethinking of how memory is architected. In particular, if you look at a CXL interface, the FLIT, or flow control unit, that comes from the host has everything that a memory needs. It has an address, it has a command (read or write), it has data, it has some metadata; it has everything that the memory subsystem needs. So we've created something we call CXL native memory, which takes that and implements it using a standalone chip done in a logic process, and then naked memory die attached to that logic device. In some ways, this is similar to HBM. However, we're focusing on a much narrower interface, a 32-signal interface, a PCIe x8 interface on the host side, and then taking all of the logic that you would implement in the DRAMs, consolidating it into one place, and implementing a wide memory bus. That wide I/O gives us the ability to process a full cache line, or FLIT, in every clock cycle.
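As a conceptual sketch only (not the actual flit bit layout defined in the CXL specification), the fields called out above might be modeled like this:

```python
# Conceptual sketch only: the fields a CXL.mem request carries, as called out
# in the talk (address, read/write command, data, metadata). This is not the
# actual flit layout from the CXL specification.

from dataclasses import dataclass
from enum import Enum

class MemOp(Enum):
    READ = 0
    WRITE = 1

@dataclass
class MemRequestFlit:
    address: int          # host physical address of the 64-byte line
    op: MemOp             # read or write
    data: bytes = b""     # 64-byte payload for writes
    metadata: int = 0     # e.g. meta-state bits carried with the request

    def __post_init__(self):
        if self.op is MemOp.WRITE and len(self.data) != 64:
            raise ValueError("a write carries a full 64-byte cache line")

# A controller in a "CXL native memory" device could service such a request
# directly against a wide internal memory array, one cache line per cycle.
req = MemRequestFlit(address=0x1000, op=MemOp.WRITE, data=bytes(64))
```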
With this kind of approach, we're not trying to displace HBM; we recognize HBM has its place. This is a way to incrementally add memory capacity, but in a very embedded environment. And so over a very narrow channel, we can add significant amounts of memory. With PCIe Gen 5, we can get 32 gigabytes per second, and the PCIe interfaces are full duplex; CXL allows you to do reads and writes simultaneously, so in a peak sense you can actually get both of these running at the same time. Since the Supercomputing conference, we have initiated standardization of this form factor. We're proposing a variation of the M.2 connector that is used for NVMe today, expanding that to an eight-bit interface. In a module roughly an inch by an inch, you can have increments of eight to 16 gigabytes of memory on each of these 32-pin ports. And this is designed for a longer throw, like 100 millimeters, so it can sit further away, past the HBM. So with the incremental addition of these memory modules, this is how we're going to raise that roofline and increase the performance of the overall solution set. Note that we also see UCIe as a very compatible companion to CXL here, so everything you're seeing, while it's CXL focused, could also be applied to a UCIe implementation.
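A quick sanity check of the 32 gigabytes per second figure for a PCIe Gen 5 x8 link:

```python
# Quick check of the bandwidth figure quoted above for a PCIe Gen5 x8 link.
GEN5_GT_PER_LANE = 32          # GT/s per lane
LANES = 8
# Gen5 uses 128b/130b encoding, so raw bits are close to payload bits.
raw_gbps_per_dir = GEN5_GT_PER_LANE * LANES * (128 / 130)
gb_per_s_per_dir = raw_gbps_per_dir / 8
print(f"~{gb_per_s_per_dir:.0f} GB/s per direction")   # ~31.5 GB/s
# The link is full duplex, so reads and writes can each approach this figure
# at the same time, before protocol overheads.
```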
Now, energy is a huge problem. All data center managers recognize that the demands, especially for artificial intelligence, are taking things through the roof. The US Department of Energy has embarked on a program to identify the sources of this. And as you can see in the graph in the lower left, cryptocurrency and artificial intelligence use many, many orders of magnitude more energy than standard programming. So we've identified that this is a problem. What are we going to do about it? And wouldn't a CXL solution make the power situation even worse? It turns out that's a bit of a misconception.
When we put this together, what we find is that by reducing the throw and targeting a solder-down or small module form factor on the motherboard, as opposed to that E3.S form factor for the plug-in modules, we can bring power down to be competitive with HBM. In this particular model, we compared this to LPDDR, and what you see is significantly higher performance and a lower power profile than even low-power DDR. This is because we can assume a short throw. We can eliminate the DDR interface to the memory completely; all of those redundant circuits that are in all of those DDR chips get put in one place. We can also think about re-architecting how refresh is done. A tremendous amount of energy is spent just refreshing the capacitors in all those DRAMs; we can now tie refresh into the CXL memory allocator and reduce power even further. Finally, the fact that we use a FLIT-oriented page means we can focus the memory on satisfying the needs of CXL instead of the general-purpose solution that DDR and even LPDDR provide to the industry. By being FLIT-oriented, we can move 64 bytes of data in and out of the core and achieve an efficiency 2,000 times better than the efficiency of the 1-kilobyte page per DRAM that you get in today's memory modules.

So what about performance? Performance is also a concern in the sense that we recognize CXL latency is a little longer, so what does that do? Well, you have to remember that the CXL interface, by being full duplex, doesn't degrade as quickly as LPDDR, HBM, or other solutions like that. Under a light load, they're nearly equivalent; the DDR solution offers maybe a little better performance at light load. But as the load gets heavier, you'll see that the degradation of the CXL solution is much flatter than the degradation of LPDDR, because now you're dealing with things like turnaround-time bubbles.
So in conclusion, what we're seeing here is that large language models and the growing number of AI applications, including cryptocurrency, are fundamentally memory-bound and need some kind of expansion. This use of CXL addresses the limitation of having HBM as your only high-speed source for the models. This CXL native memory solution is FLIT-oriented: it takes the concept of CXL and delivers it directly to the memory cells themselves. If you need lots of capacity expansion, 32-pin channels are an easy way to augment what you're already putting into your system. CXL memory power can be very competitive with HBM if you design it for these on-motherboard solutions. And finally, the full duplex operation is what offsets some of the latency penalties you would get from CXL. With that, let me hand it over to the next presenter.
Thanks, Bill. So my name is Tracy Spitler, as Kurtis introduced me earlier. I work for IntelliProp. We've been working on composable memory systems for five to six years now.
With CXL 3.1, I think CXL has started to really realize its ability to compose memory. Bill and Vijay have talked about memory expansion, bandwidth, all those things that have been with CXL since the beginning and through 2.0 and 3.0. With the introduction of 3.1, it really expands into true composability. With 3.1, you get the introduction of a port-based routing switch, and what a port-based routing switch gives you is support for topologies beyond a basic tree topology. With a hierarchy-based routing (HBR) switch, you were pretty much limited to the PCIe tree-based topology. Now, with 3.1 and a port-based routing architecture, you can move away from that. Some of the other features are that you can set up your switches for address-based routing, such that you can define best routing paths, redundant routing paths, and secondary routing paths. And again, moving away from the tree topology, you can start supporting topologies such as trees, mesh, ring, star, butterfly, and multi-dimensional topologies. And I'm not going to touch too much on it yet, but this was a big thing with 3.1: now that you're sharing memory and you potentially have multiple hosts attached to the same memory, maybe to the same regions, maybe not, there have been enhancements in the security protocols.
As with just about any new technology, you start looking at the use cases and where you can apply this. Both Vijay and Bill touched on these use cases, but I want to move on to how composable memory applies to them.

If we start with cloud computing, and specifically virtual machines, one of the talking points through the years, as composable memory has been contemplated and introduced, is whether you can address the stranded memory problem. I footnoted a paper here that was put together by the Microsoft Azure team, and they do a really good job of walking through the different scenarios of stranded memory and how CXL pooled memory can address it. I think there are limited cases here, and depending on the scale of your memory and the size of your system, it may or may not be worth the cost. Again, this paper does a good job of pointing out that if you're going to address the stranded memory problem with CXL pooled memory, here's really the way you have to do it. Moving beyond stranded memory, we've had a lot of interest from VM providers, software companies, and people who deploy VMs around security. Folks worry that when you're running a VM, if the CPU becomes compromised, a malicious actor could get to your data. Now, with pooled memory, there's this use case where multiple CPUs, running not just out of their local memory but out of fabric-attached memory, could all be running VMs out of the same physical memory. I think there are use cases where this is a good idea, and we'll get into the challenges of deploying VMs into fabric memory in a little bit. And then, of course, everyone worries about performance; it's a cost-performance trade-off. But if you're doing large database applications, things that require a lot of swaps and experience a lot of page faults, rather than swapping to storage, and I think Vijay mentioned this as well, you can now swap to fabric-attached memory and get orders of magnitude better performance, both in latency and bandwidth.

The other use case a lot of folks like to talk about is AI training and inference models. Obviously, this has blown up over the last 18 months or two years. There are a lot of tricks or algorithms people use to deal with these large data models, and a lot of them involve storage, swapping data in and out of storage, and swapping data from CPU memory to GPU memory. There's a great use case here where, with fabric-attached memory, you can have multiple processing elements that all share the same dataset and have direct access to that dataset. And because it's dynamically composable, you can scale the memory needed on a per-use, per-training, or per-inference basis, and then scale it back down, or back up, for the next go-around on your dataset.
So again, the potential of a CXL memory pool matches up with those use cases, but this expands to where it could go. With cloud computing and HPC, you do incur, and both Bill and Vijay talked about latency, you do incur a latency hit. But that latency can be very manageable, as even the Microsoft Azure paper pointed out, and a lot of software can deal with the higher latencies. If you have visibility into what that latency is on your fabric-attached memory, you can gear your software workloads to match those latencies. And again, I talked a little bit on the last slide about dynamically sizing to workload fluctuations. You're not stuck with whatever memory you allocated when you booted your server; with fabric-attached memory, you can now add and take memory away. We've experimented with this and worked with customers on this, and it's really an ideal situation for a lot of use cases: as your workload increases, you can dynamically change that memory size based on the composability in the fabric, and you can scale it back down as your workload drops.

I mentioned VMs earlier, and a little bit about security. One of the things we've talked about with folks deploying VMs and with VM companies is this idea of frictionless host migration. If you're concerned that your CPU or your host has been compromised, but the data for your VM is all sitting out in fabric-attached memory, it's almost frictionless to save a few of the CPU states and completely move that VM to a different CPU that is repointed to the same data set.

And then there's the potential in AI and GPU deployments. Again, I keep hitting on this, but the dynamic memory capacity is really something a lot of folks can take advantage of, based on their current workload and whatever workload they'll have in a month, six months, or a year. They can change that capacity dynamically without taking the server down. AI and GPU applications may be ideal for fabric-attached memory because they are very bandwidth hungry but tend to be very latency tolerant, so they match up well with the idea of a memory fabric that can provide high bandwidth but may impact latency negatively. And again, it gets back to shared processing: you could have a single data set sitting in fabric memory and have CPUs, DPUs, and GPUs all looking at and operating on that exact same memory. Now you can increase your memory capacity without adding expensive GPUs. We've seen several customers come to us and say, "The only way I can get more memory," and Bill touched on this as well, "is to add GPUs, because they're so limited by their HBM memory."
Now, if your GPUs can directly attach to the fabric-attached memory, you can increase your memory capacity without just stacking more GPUs in your cluster. And again, this gets back to reducing, or maybe even eliminating, that high-cost HBM by substituting fabric-attached memory that the GPU can directly address. Bill touched on this as well: energy is a big concern. I read a paper recently that was kind of interesting; a lot of people talk about training and the amount of power in training, but it's really inference where you want to try to save a ton of energy, because that's the ongoing cost. Training happens once; inference can go on forever. So by having fabric-attached memory and reducing data copies, you can perhaps have multiple GPUs that are doing the same inference but are all using the exact same data copy. You don't need a hundred or a thousand copies of the data in your memory; you can have a single copy that they all access. And again, going back to the idea that you can scale your memory size up and down based on your workload: if you have excess memory that you're not using in a fabric-attached memory scenario, you could power that memory down; there's no need to keep burning that power if you're not using that memory. I put this drawing on the right, and I kind of skipped over it the first time, but this is right out of the CXL 3.1 specification, and it shows the multi-pathing and the multi-host, multi-device capability of a port-based routing topology.
I do want to talk about this. It's a little off topic from the hardware side of things, but both Vijay and Bill mentioned software stacks and the requirement for software to support CXL and CXL-attached memory. In a fabric topology, this becomes more than just software applications: now you need a fabric manager, which is something fairly new to memory management, because now you have memory out there that can be dynamically composed, and you have routing that needs to be set up in these switches. A big part of that is the fabric manager. The fabric manager is a software stack that would typically run on your CXL PBR switch. Some of its responsibilities are topology discovery and reporting: on initialization, or as devices come and go, the fabric manager is responsible for keeping an inventory of what is attached and where it is attached. There's composition control: as a host wants memory, the fabric manager is the software entity that assigns any of those CXL devices to any of the hosts. There's the routing control we mentioned earlier: for address-based routing, the fabric manager is the input into the switch that defines the routing. Then there's security control, which we mentioned earlier. And of course, now you have switches and memory that may be on different power supplies, may be outside of your CPUs, and likely will be outside of your CPU box, and keeping track of health and monitoring in that fabric is essential. The fabric manager has to deal with all of those things.

Then there are the interfaces to and from the fabric manager. The fabric manager understands the hardware, so it interfaces down to the switch hardware. On the other side of the fabric manager, there's typically going to be a composability agent. That composability agent is the one who's going to say, well, this host wants this memory, so now I need to go tell the fabric manager to assign it to that host; or this host doesn't need that memory anymore, so I need to tell the fabric manager to deallocate it. A couple of examples out there: the OpenFabrics Alliance has their Sunfish framework, and then there's LiqidOS; we've done a lot of work with those guys, and they're known for doing composability. And I believe there are other folks out there, OCP and others, that are looking at this composability agent.
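As an illustrative sketch of those responsibilities (class and method names here are hypothetical, not the CXL Fabric Manager API), a fabric manager's role might be outlined like this:

```python
# Illustrative sketch of the fabric manager responsibilities listed above.
# Class and method names are hypothetical; the real interfaces are the CXL
# Fabric Manager API commands and whatever the composability agent exposes.

from dataclasses import dataclass, field

@dataclass
class FabricManager:
    topology: dict = field(default_factory=dict)   # device_id -> switch port
    bindings: dict = field(default_factory=dict)   # device_id -> host_id

    def discover_topology(self, switch_ports):
        """Inventory what is attached and where (discovery and reporting)."""
        self.topology = {p.device_id: p.port for p in switch_ports if p.device_id}

    def bind(self, device_id, host_id):
        """Composition control: assign a CXL memory device to a host."""
        self.bindings[device_id] = host_id

    def unbind(self, device_id):
        """Composition control: release the device back to the free pool."""
        self.bindings.pop(device_id, None)

    def set_route(self, src_port, dst_port, path):
        """Routing control: program address-based and redundant routes."""
        ...

    def health_report(self):
        """Health and monitoring for switches and memory outside the CPU box."""
        return {"devices": len(self.topology), "bound": len(self.bindings)}

# A composability agent (for example, something like Sunfish or a vendor
# stack) would call bind()/unbind() as hosts request more or less memory.
```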
So this all sounds great, but it doesn't come for free; nothing ever does. Let's talk a little bit about CXL pooled memory challenges and the hurdles that need to be tackled to deploy fabric-attached memory. One that gets mentioned a lot is software friction. By software friction, I mean that there are existing HPC, GPU, and AI software stacks out there that were built without fabric-attached memory in mind. Those software stacks are qualified, and I think everyone knows how hard it is to requalify software, particularly at an HPC or server level. So here are some suggestions for where software changes can occur, starting in the VM space. There's going to be longer latency to get to fabric-attached memory; there's just no way around it when you're hopping through switches to get to memory. So the VM folks and software developers can work to, A, figure out how to tolerate longer latency, or B, understand the memory subsystem such that they can request higher latency when they can tolerate it or lower latency when they need it. And of course, it's not just raw latency; there's going to be variation. Based on how much traffic you have in your fabric, that latency may be somewhat less predictable. Again, it gets back to being latency tolerant, and potentially NUMA aware. On NUMA awareness, we've done some experimenting where we attach different NUMA domains to different latency domains, so that a composability manager can understand, based on what tier of latency you have, where to put that memory in NUMA, and then NUMA-aware software can understand how much latency to expect. And then from the AI side, the programming and the scheduler change. Today you're running GPUs where you're scheduling data swaps into the GPU's HBM memory; if the GPU can instead have direct load-store access to fabric-attached shared memory, that scheduler certainly would change. And this idea that you could reduce GPU HBM, or maybe even eliminate it, is really going to come down to use cases, your architecture, and looking at things like latency and bandwidth. And again, with the AI stuff, can you rethink the data copies and movement?

Some of the other challenges out there are around scalability. Bill mentioned the physical limitations of HBM; CXL runs on PCIe, and if you're going to scale up and out on fabric-attached memory with only centimeters' worth of physical interconnect to work with, you may not be able to scale out very far. And of course, when you're adding a bunch of switches, you're compounding your latency. Then there's the cost; you're not going to get this for free. The switches are going to cost money, and the memory controllers, depending on how much intelligence you want to put into them, are going to add cost. None of this comes for free, and it's not just dollar cost but also latency cost: if you're trying to scale out, you may have to put in retimers just to get out to the switches. And this is compounded when you try to figure out how much it's going to cost and how much it's going to save; I think everyone knows how much DRAM costs can fluctuate. So those are some of the challenges I think we see. It's certainly not an exhaustive list of the challenges people are going to hit, but these are the early hurdles we've seen folks bring up.
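On the NUMA-awareness point, CXL memory typically shows up on Linux as a CPU-less NUMA node; a sketch like the following reads the sysfs node distances to classify latency tiers, with thresholds that are illustrative rather than standard:

```python
# Sketch of the NUMA-awareness idea: on Linux, CXL memory typically appears
# as a CPU-less NUMA node, and the SLIT distances in sysfs give a rough
# latency tier. Thresholds here are arbitrary examples, not standard values.

import glob, os

def numa_tiers(local_node: int = 0):
    tiers = {}
    for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        node = int(os.path.basename(node_dir)[4:])
        with open(os.path.join(node_dir, "distance")) as f:
            distances = [int(x) for x in f.read().split()]
        dist = distances[local_node]      # distance from this node to node 0
        # Common convention: 10 = local, ~20 = remote socket, larger = CXL.
        tiers[node] = "near" if dist <= 20 else "far (likely CXL)"
    return tiers

if __name__ == "__main__":
    for node, tier in numa_tiers().items():
        print(f"node{node}: {tier}")
```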
So with that, I'll turn it over to Kurtis to moderate any questions from the crowd or a discussion.
All right. Well, first I'd like to thank Vijay, Bill, and Tracy for the presentations. Great presentations today, guys. Now let's open it up for questions. Please type in any questions you have and we'll get to them. I'm actually going to start off with a question that I have for Vijay. As you talked about the memory on CXL, what do you envision as the types of memory that will be connected via CXL into the servers they support?
That's really an interesting question. I think that if you look at where things are today, what we're already seeing amongst the hyperscalers is an interest in finding a way to use older generations of memory within the CXL construct. And that serves two really valuable purposes. The first, of course, is the cost aspect: memory that was directly attached to servers can now find a new home as a potential second tier of memory, as opposed to being sold on the open market. The second benefit of that approach is that we're all doing everything we can now to be more environmentally conscious and aware, and reuse is a big deal, so it serves that purpose as well. So you will see different vintages, different flavors of DDR4 and DDR5 in CXL deployments. That's one. The second is that I think there is going to be a need for lower power consumption in memory used in data centers. We're starting to see some of those trends anyway; there are conversations around LPCAMM as being applicable to data centers. So there's certainly going to be an interest in lower power memories being made available through CXL. And then finally, I think that as we navigate the CXL landscape, we'll see more cases that ask for different media behind CXL: some combination of DRAM, or some other memory that supports load-store semantics, along with more persistent media behind the same interface, providing value to specific workloads and use cases. There are already a few examples of that last case, and I think we'll see more of those solutions being deployed.
Excellent. Thank you. We do have a question from the audience. William from Samsung asks, "Regarding AI accelerators over CXL, we talked about CXL native memory on PCIe Gen 6 x8 lanes, and maybe 11 of those would be needed in order to keep up with HBM performance." His question is really, "Is there a limit to the number of PCIe lanes in a CPU or cluster that we're aware of that would make it feasible, or how do we scale in the data center?"
Yeah, that's a great question, and I think there are going to be multiple answers to it. One of them will come from me. In theory, a 32-wire interface is a whole lot easier to manage than the current 1,920 wires, and HBM4 is going to double the pin count again. So you have some fundamental, just basic, pin count limitations there. Part of the answer is going to be, how many ports do you want to put on your CPU, your GPU, or your APU? And that is going to be something that evolves over time as things like CXL native memory, or other solutions where a lot of expansion can be done over a low pin count interface, enter the market. The second answer is actually going to come from Tracy. Tracy, would you like to field the part of the question about how switches could be used for that expansion?
Sure. I agree with Bill that switches are part of the answer here. Physically, you might not have that number of connections to the CPU or the GPU, but the switch gives you the capability of expanding the number of memory units. And if your switch, topology, and architecture are built well, you can start thinking about interleaving across those CXL memories attached to a switch. Let's say you have 16 CXL memories attached on one side, and on the GPU side you have maybe four or eight x16 CXL interfaces; you can keep those CXL interfaces running at maximum bandwidth by interleaving across the memories. And I think you also want to start thinking beyond PCIe Gen 6, about whether there are other physical connections that PCIe or CXL could adopt, something like 800 gigabit, that would allow more reach and get you the scalability and higher bandwidth.
Yeah, and I think another answer is going to come from the UCIe front: reducing the number of signals on a UCIe interface is also going to be valuable where we're designing for a little bit longer throw. So it could be on a substrate as well that we see these kinds of connections to the processing.
All right. Well, thanks for that insight, guys. Bill, a question for you: how do CXL memory devices manage power?
Well, first of all, you want to look at the efficiency of today's DRAM solutions, and it's not really pretty. If you look, for example, at a CXL memory module with 10 DDR5 devices on a rank, what happens is that you come along with an activation to get a page. Each of those DRAMs has a one-kilobyte page size, so across those 10 devices, 10 kilobytes of data are activated from the core out to the sense amps. And that's a destructive activation; it erases the contents of every one of those capacitors, so a precharge cycle needs to be done to restore those contents when you're done. That's 20 kilobytes of data movement. And how much is a flit? A flit is 64 bytes. Do the math on that: it means that a DRAM, by its current definition, is 0.025% efficient. I would argue there's a whole lot of low hanging fruit in how to do memory management if you were to redesign and re-architect memories to be flit-centric instead of the traditional page mode we've been using since the 1990s.

Another aspect of this is granularity. Yes, as we were discussing with our friend from Samsung, you might put 10 or 11 or 12 of these CXL expansion channels off of a processing resource. But that assumes you're constantly working the same workload, and in a data center that's far from accurate. Some data centers report at best 75% utilization; at other times of day, they might be down below 30% utilization. So another aspect of CXL granularity is the ability to shut down channels that you're not using. That scalability is much more graceful when you have small interfaces that you can shut down, or put into low power states between uses, which is impossible with things like direct attached HBM memories. Vijay, do you want to jump in on that question?
I think both of you covered pretty much all that I had to add there. Definitely, with 3.0 and 3.1 bringing in switching capabilities, the aggregate bandwidth that we can access would be enormous. And a big part of how available that bandwidth is to either GPUs or accelerators would be a function of the pipe between the GPU and the switch, or the accelerator and the switch, and a few other variables. But both of you covered that already; nothing to add, really.
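The exact percentage in Bill's answer depends on what you count as internal data movement, but a minimal sketch with the assumptions he stated (10 devices, 1-kilobyte pages, activate plus restore) lands well under one percent either way:

```python
# Activation-efficiency arithmetic with the assumptions stated above made
# explicit. The exact percentage depends on how you count devices, page size,
# and precharge, but it is a tiny fraction of a percent either way.

DEVICES_PER_RANK = 10       # 10 DDR5 devices on a rank (as stated)
PAGE_BYTES = 1024           # 1 KB row per device
FLIT_BYTES = 64             # one CXL flit / cache line actually requested

activated = DEVICES_PER_RANK * PAGE_BYTES   # 10 KB out to the sense amps
moved = activated * 2                       # destructive read plus restore
efficiency = FLIT_BYTES / moved
print(f"{efficiency:.4%} of the internally moved data is actually used")
# The point stands: a flit-oriented memory can avoid dragging whole kilobyte
# rows around just to deliver a single 64-byte line.
```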
All right. Well, we've got a question coming in. We'll start with Tracy on this one, from Mohan at HPE. He asks: where do you see the future of fabric management? Will it be distributed in switches or centralized in some sort of fabric management station?
Yeah, so in the current CXL 3.1 spec there actually is a fabric management specification. From our experience, and given how it's specified, the fabric manager will reside in a central place. It's difficult to distribute the functions of the fabric manager across different switches, particularly when you're trying to connect to a higher-level composability agent. Now, I will say that one place that needs to be looked at is a failover fabric manager. We've experimented with this: you have a fabric manager running on one entity that heartbeats to another fabric manager just to stay alive and stay in sync. If the switch where that fabric manager resides goes down, you really want that redundancy and that failover. So from a distributed standpoint, I think the first step, and probably the most important step, is redundancy, so that you don't lose your whole fabric manager if you happen to lose a switch.
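A minimal sketch of that active/standby heartbeat idea; the timings and state-sync mechanism here are illustrative only:

```python
# Minimal sketch of the active/standby fabric-manager heartbeat idea.
# Timings and the notion of a synced state snapshot are illustrative only.

import time

HEARTBEAT_INTERVAL = 1.0     # seconds between heartbeats (arbitrary)
FAILOVER_TIMEOUT = 3.0       # missed-heartbeat window before takeover (arbitrary)

class StandbyFabricManager:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False

    def on_heartbeat(self, synced_state):
        """Called whenever the active FM sends a heartbeat plus state snapshot."""
        self.last_heartbeat = time.monotonic()
        self.state = synced_state          # stay in sync on bindings and routes

    def tick(self):
        """Periodic check; take over if the active FM (or its switch) is gone."""
        if not self.active and time.monotonic() - self.last_heartbeat > FAILOVER_TIMEOUT:
            self.active = True
            print("active fabric manager lost; standby taking over with synced state")
```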
All right. And then a question from Tong at ScaleFlux. He asks, "Is it a fair statement that the majority of CXL memory deployments do not require chip kill type die failure protection?"
Yes, that's actually a very relevant question. I will say at the start that the product we've launched, the 2.0 product, does support chip kill. The question then is, going forward as we enter the 3.0 and 3.1 timeframes, what does this look like? I can share that the feedback we've received so far indicates that a vast majority of customers are very comfortable with chip kill and don't want to give up the assurance it provides, in addition to the standard RAS capabilities you would have anyway for memory. The reason for this is that a lot of these customers have service level agreements with their own customers, and it's just too expensive if you have downtime because of a memory failure that impacts SLAs. So that's what we're seeing today. That said, from a CXL memory supplier perspective, we are definitely looking at whether there are use cases where we don't necessarily have to have the same level of chip kill, and we're going to evolve as the market needs evolve.
Thanks, Vijay. A question for Tracy this time: how is coherency managed through the fabric?
Yeah, we only have a couple of minutes left, so I'll keep it to a couple of minutes; I could probably talk for hours. The CXL specification, if you want to deploy CXL.cache, provides great communication between compute elements and the memory controller about the state of the cache lines. So if someone else updates something in the memory, there's a communication mechanism, if it's implemented, to notify the CPU that, hey, this cache line has been invalidated, and then the CPU cache controller can decide whether to reload that data. There's also a kind of cache-lite capability, back invalidate, introduced I believe in 3.0, which essentially allows shared memory and a memory controller to simply notify the CPU that, hey, this cache line was touched. And then, folks have been tackling this for years, and part of the answer to the coherency question is how you lock memory in a shared memory environment. There are things like OpenFAM out there that do this, and there are libraries in Linux to lock memory. So if you're concerned about the CXL.cache protocol, because it could generate a lot of traffic in your fabric, it may be, and it all depends on the use case, more advantageous to go with OpenFAM or a shared locking mechanism that's already been deployed, already tested, and already in the industry. But I'll keep it at that so we don't run too far over time here.
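As a conceptual model of that back-invalidate flow (a toy illustration, not the actual CXL back-invalidate message protocol):

```python
# Conceptual model of the back-invalidate idea: when shared fabric memory is
# written by one agent, the device tells other caching hosts that their copy
# of that line is stale. Toy model only, not the CXL BI message protocol.

class CachingHost:
    def __init__(self, name):
        self.name = name
        self.cache = {}                      # line address -> data

    def back_invalidate(self, addr):
        """Device-initiated: drop the stale line; reload later if needed."""
        if self.cache.pop(addr, None) is not None:
            print(f"{self.name}: line {addr:#x} invalidated")

class SharedMemoryDevice:
    def __init__(self):
        self.mem = {}
        self.sharers = {}                    # line address -> set of hosts

    def read(self, host, addr):
        self.sharers.setdefault(addr, set()).add(host)
        host.cache[addr] = self.mem.get(addr, 0)
        return host.cache[addr]

    def write(self, host, addr, value):
        self.mem[addr] = value
        for other in self.sharers.get(addr, set()) - {host}:
            other.back_invalidate(addr)      # notify the other sharers

# Usage: two hosts share a line; a write by one invalidates the other's copy.
dev, h1, h2 = SharedMemoryDevice(), CachingHost("hostA"), CachingHost("hostB")
dev.read(h1, 0x40); dev.read(h2, 0x40)
dev.write(h1, 0x40, 7)                       # hostB's cached copy is invalidated
```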
All right. Well, thank you very much, Tracy. And I think that wraps it up for us. We appreciate everyone's attendance and the great questions that came in, and we look forward to talking with you on future webinars.