YouTube:https://www.youtube.com/watch?v=8oOUCYVuPJQ
Text:
The gist of my talk is how CXL is a big boon for the memory problems we have today. Server memory is basically not scaling very well considering the growth in the number of cores, and my main point is how CXL is going to help us in that regard. Also, Intel has fully embraced this protocol; it's here to stay. Anyone looking at adopting CXL should feel pretty confident that it's not going to be a fly-by-night thing that just goes away. There's definitely a lot of momentum. With that introduction, again, my name is Anil Godbole. I work in what you could call the Xeon product planning group. Xeon, as you know, has the compute cores, of course, but there are also all these other goodies on the side, like the accelerators, the PCIe IP, and the CXL IP, which are now just as important as the cores. We are part of that group, and I also help with marketing, since these are some very advanced features and I have good subject-matter knowledge in this area.
With that introduction, let's go to the first slide. Let me start by making sure the audience agrees with me. The very first box on the left shows some of the popular memory-intensive workloads. Today's audience is mostly IT folks from various companies, so they will definitely agree that these are indeed some of the most popular applications today, and that they are really memory intensive. You wish you could buy a lot more memory for your servers at reasonable prices. But look at the next chart on the right: the number of CPU cores on all the CPUs, not just ours but our competition's and the ARM guys' as well, keeps growing. Today on a laptop everyone needs at least eight gigabytes of memory, and most people actually use 16 to make the laptop run well. Now imagine a server with 200-plus cores: if you don't give it at least eight gigabytes per core, forget 16, then there was no point buying the 200-core server. So CPU core counts keep growing, but the chart on the right shows that DRAM density is not keeping up. In my young days, the late '80s and that era, DRAM used to scale 4x every three years. Today we are lucky to get a 50 percent increase. 16-gigabit DRAMs are in vogue today; the 24-gigabit chips, a 50 percent increase, are already out and some vendors may already be shipping them in some quantities, and 32-gigabit chips are coming but are further out. Needless to say, since DRAM chip density is not scaling, higher-capacity DIMMs, something like 256-gigabyte DIMMs, which will be needed as these high-core-count CPUs come in, are very, very expensive today. Yes, if you pay the money you can get them, but the punchline of this slide is that memory density and cost are not keeping pace with the growth of data center infrastructure, the CPUs, and the workloads. Memory cost definitely dominates the cost of a server. If you thought Intel was making all the money, no, it's the memory guys.
Next slide. This one and the next are a quick introduction to CXL; maybe I could even skip them, but this is straight from the CXL Consortium. So what is the purpose of CXL? We have used PCIe as the CPU-to-device link for more than two decades now, so why invent CXL? CXL basically does two things. First, it makes the device memory coherent. In the PCIe world, device-side memory is MMIO, memory-mapped I/O, so when the CPU reads it, it cannot be cached, and that's why it's usually the device that moves the data around by doing DMAs. With CXL, the device-side memory instead extends the address space of the CPU: if the CPU has eight gigabytes and the device has another eight, your system boots up saying, "Hey, I've got 16 gigabytes total." That's the beauty of CXL. So one big purpose was to improve accelerator performance, and the other one, the second box below, is simply to add more memory. And it's turning out that the capability to add more memory accounts for something like 98% of CXL use cases today. "Show me how to add memory with CXL" is really what everyone wants. Okay, next slide.
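To make the "extends the address space" point concrete: on a Linux host, CXL-attached memory is typically exposed through the standard NUMA interfaces, usually as an extra memory node with no CPUs attached. Here is a minimal sketch (my illustration, not from the talk) that enumerates the NUMA nodes and their sizes from sysfs, assuming a Linux system where those paths exist:

```python
# Minimal sketch: list each NUMA node, its memory size, and whether it has CPUs.
# A node with memory but no CPUs is typically far memory such as a CXL expander.
import glob
import re

for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = node_dir.rsplit("/", 1)[-1]
    with open(f"{node_dir}/meminfo") as f:
        meminfo = f.read()
    total_kb = int(re.search(r"MemTotal:\s+(\d+) kB", meminfo).group(1))
    with open(f"{node_dir}/cpulist") as f:
        cpus = f.read().strip()
    kind = f"CPUs {cpus}" if cpus else "no CPUs (possibly CXL / far memory)"
    print(f"{node}: {total_kb / 1024 / 1024:.1f} GiB, {kind}")
```

On a system like the one described in the talk, the DRAM shows up under the CPU nodes and the CXL capacity simply adds to the total the OS reports at boot.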
And before I go deeper into CXL, I just want to note, because a lot of people are still not aware, how CXL actually connects to a CPU. A lot of people think, "Oh, yet another set of wires on the CPU." No, not so. CXL rides on the rails of PCIe. What used to be the PCIe links, starting at Intel with the fourth-generation Xeon, which is Sapphire Rapids, now work as shown in the diagram: the CPU host IP can be either PCIe or CXL. The PCIe connector becomes a flexible port; you can plug in either a CXL card or a PCIe card. As the link comes up, the device tells the CPU, "Hey, I'm going to speak CXL," or "I'm going to speak PCIe," and the host switches its host IP accordingly. So just a quick primer. In that sense, adoption of CXL should really be helped, because we don't need any new infrastructure. I'm showing a picture of the motherboard of Archer City, our Sapphire Rapids reference board, and those four PCIe slots can also be CXL slots. Okay, next one, please.
Okay, so now let's get into how CXL really helps. The title of the slide is Augment Your System Memory with CXL. The picture shows a CPU with eight DRAM channels, and our next-generation CPUs will feature more. Keep in mind, this slide is trying to show how CXL helps with some of the ills of DRAM. DRAM is expensive to add. We wish we could keep adding DRAM channels, because at the end of the day, as I show on the left side, CPU-attached DRAM has the lowest memory latency. So don't get me wrong: just because we now have CXL, CPU manufacturers are not going to ditch DRAM capability anytime soon. They'll do their best to keep adding more channels of DRAM, improve DRAM bandwidth, and so on. But at some point you run into a package limitation. And as I show on the right side, adding a CXL link is much cheaper for the CPU package; do the math there. One x16 CXL link has roughly the bandwidth of two DDR5 channels, and it costs about 66 pins on the CPU package, while doing the same thing with two DDR5 channels would cost you about 250. So once you have used up all the DRAM pins you can afford, the only way to add even more memory is CXL, is what I'm trying to say. Then there are other issues with CPU-attached DRAM. The moment you try to increase capacity by going to two DIMMs per channel (gone are the days of four DIMMs per channel; with DDR5 you can do one or two DIMMs per channel), the signal integrity of the channel degrades and the DDR speed drops. So you got more capacity, but you lost overall bandwidth. On the CXL side, look at that picture, which shows an add-in-card form factor and the EDSFF form factor, an SSD-like form factor. Both are logically the same: there's a CXL link coming out of the CPU, and the black chip in the center is the CXL memory buffer, which converts the CXL protocol to DRAM. What I'm showing there are four DIMM slots; on the EDSFF version the memory is soldered down, so you can pack it more densely. And once you have that memory buffer in between, you can put slower memory behind the CXL link if you want to. In fact, one of the big reuse cases for CXL, especially from the cloud service providers, is that they want to reuse their DDR4 DIMMs. As the Gen5 CPUs roll in, they would normally be recycling those DIMMs, but now they can stick them behind these add-in cards and reuse them, because the CXL memory buffer can absorb the slower speed of the memory and still deliver the full CXL bandwidth. In fact, one of the vendors, Microchip, has a buffer chip where behind a CXL x8 link, roughly equivalent to one DDR5 channel, they put two DDR4 channels. So that's one way to use it. One big thing, though: CXL being a farther-away memory, you will have to contend with higher latency. That's where the next evolution on the server software side comes in, which is what's going on now. So next slide, please.
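The pin-efficiency argument is easy to sanity-check. The pin counts (66 versus roughly 250) are the speaker's figures; the raw bandwidth numbers below are my illustrative assumptions (a CXL x16 link at 32 GT/s, DDR5-4800 channels), so treat this as a back-of-the-envelope sketch rather than official data:

```python
# Rough check of bandwidth per CPU package pin: CXL x16 link vs. two DDR5 channels.
# Pin counts come from the talk; bandwidth figures are assumed for illustration.
CXL_X16_GBPS  = 64.0   # ~32 GT/s * 16 lanes, per direction (assumed CXL 2.0 rate)
DDR5_CH_GBPS  = 38.4   # DDR5-4800: 4800 MT/s * 8 bytes per channel (assumed speed)
CXL_X16_PINS  = 66     # speaker's figure for one x16 CXL link
TWO_DDR5_PINS = 250    # speaker's figure for two DDR5 channels

print(f"x16 CXL link:     {CXL_X16_GBPS / CXL_X16_PINS:.2f} GB/s per pin")
print(f"2x DDR5 channels: {2 * DDR5_CH_GBPS / TWO_DDR5_PINS:.2f} GB/s per pin")
# Roughly 0.97 vs 0.31 GB/s per pin: about 3x the bandwidth per package pin for CXL,
# which is the point this slide is making.
```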
Actually, before I go there, I wanted to show how Intel is embracing CXL. I meant to show this slide a little later, but now that we have it, let's speak to it. First of all, as I said, we started with the fourth- and fifth-generation Xeons, Sapphire Rapids and Emerald Rapids respectively, both on the same Eagle Stream platform. That is where we first offered CXL; in fact Sapphire Rapids was the first, and both are at version 1.1. But as we know, the CXL ecosystem, all these device manufacturers, is just coming up, and nobody is really shipping in big volumes today. So for those CPUs the big focus was to jump-start CXL and get it going. Just as Intel has done in the past, Intel was a founder of PCIe, of USB, of SATA, all these things, and in the first few years of any standard you launch, you have to nurture and develop the ecosystem and not really count on the revenues. Our next generation, GNR, which we announced on the Birch Stream platform, will support the CXL 2.0 spec, which brings memory pooling and all the modes that are all the rage today. And yes, Sierra Forest too; the CPUs shown there are actually two different CPUs, and both will support it. Then of course, in the future we will put out the next Xeons and keep up with the CXL roadmap. Intel is fully committed to CXL; that is really the purpose of this slide. Okay, next one. Actually, before I go there, note the two modes we have for the BHS CPUs. Everyone asks for memory pooling, and memory pooling is there, but we have one other unique mode called flat memory mode, which I'll talk about on the next slide. Next slide, please.
Okay, this is the slide I actually wanted to show earlier, but it ended up out of order. What this slide shows is what happens once you add CXL memory; remember, I said it has higher latency. As you boot the system, the OS will normally see two NUMA nodes: one is the near memory, which is the DRAM, and the other is the CXL memory. I'm talking about a simple single-server system, which has its own DRAM plus this CXL-attached memory. Normally, as shown on the left, Linux is already evolving here; it has put in a lot of capabilities, but it's up to the OS to manage this. Say a big workload starts running with memory on both the DRAM side and the CXL side. If it starts executing out of the CXL side, it will slow down, and you don't want that. So what Linux, or any OS, or even software from the likes of MemVerge will do is take the hot pages that are in CXL and promote them to the DRAM side. At the same time, it has to find cold pages on the DRAM side to replace, because otherwise at some point DRAM will fill up with the migrating hot pages; it creates room by demoting the cold pages out of DRAM. The OS and software like MemVerge's will do all of this. But there are telemetry overheads, because you have to keep detecting which page is hot and which is not, and when you decide to move a page, you move 4K at a time, because the page size is normally four kilobytes. So yes, Linux and this software will work, but let me put in a little plug for Intel on the right side: we also have two built-in hardware-controlled memory tiering modes. One is interleaving of DRAM and CXL memory addresses. This lets you not only add capacity with CXL but also, as I show there, get bandwidth expansion, so bandwidth-intensive workloads like machine learning can take advantage of this mode. The other mode is actually a TCO play, as I show there: flat memory mode. In this mode you can use cheaper memory, like DDR4, on the CXL side, and the system is made to think it has one big memory footprint. And for both of these modes, can you hit return? I think I have the punchline below. Hit the return, the enter button.
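To make the OS-driven tiering concrete, here is a toy model of the promote-hot, demote-cold loop described above. It is my own sketch, not the actual Linux or MemVerge implementation: per-page access counters stand in for the telemetry, and pages move between tiers at 4 KB granularity.

```python
# Toy model of OS-driven memory tiering: promote hot CXL pages to DRAM,
# demote the coldest DRAM pages to make room. Not a real kernel mechanism.
from collections import Counter

PAGE_SIZE = 4096            # promotion/demotion granularity is a 4 KB page
dram = {0, 1, 2, 3}         # page numbers currently resident in fast DRAM
cxl = {4, 5, 6, 7}          # page numbers currently resident on slower CXL
access_counts = Counter()   # the "telemetry": per-page access counters

def touch(page):
    """Record one access to `page` (this bookkeeping is the telemetry overhead)."""
    access_counts[page] += 1

def rebalance(hot_threshold=3):
    """Promote hot CXL pages into DRAM, demoting the coldest DRAM pages."""
    for hot_page in [p for p in cxl if access_counts[p] >= hot_threshold]:
        victim = min(dram, key=lambda p: access_counts[p])  # coldest DRAM page
        dram.discard(victim); cxl.add(victim)                # demote cold page
        cxl.discard(hot_page); dram.add(hot_page)            # promote hot page

# A workload keeps hitting page 6, which starts out on the slow CXL tier.
for _ in range(5):
    touch(6)
rebalance()
print(f"{PAGE_SIZE} B pages; DRAM:", sorted(dram), "CXL:", sorted(cxl))  # page 6 now in DRAM
```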
There you go, yes. In both of these modes, the unique thing is, first of all, they're unique to Intel Xeons, and the system boots up as a single NUMA node. Keep that in mind: on the left side, as I said, the system boots up as two different NUMA nodes, while here it boots up as a single NUMA node. That means the OS is fooled into thinking it doesn't have to do anything, no page movements; the CPU hardware does all the movement. So if there's any interest, please get in touch with me; my email address is there. That's really how CXL memory can be handled: even though it has higher latency, there are ways to reduce that latency, or rather to hide it, not reduce it. Okay, next slide, maybe our last slide now.
And the last slide is really to say that the CXL standard is now firmly entrenched. We show all the founding members on the left side. In the years prior to CXL, there were these other standards: OpenCAPI from IBM, Gen-Z, which everyone knows, and CCIX from our competition, originally from Xilinx. All of them said, hey, CXL is going in the right direction, so they folded their IP into CXL and now everyone is in the same community. As I say, with 250-plus member companies, you can feel confident that CXL is definitely coming to a server near you. Okay, so that's my talk.
Let's see, I have a question for you. Luis: what is the advantage of flat 2LM over Linux auto NUMA mode? Linux already supports auto NUMA mode, which offers a similar capability.
Correct, flat 2LM. Okay, let's stay right there. I should have put up a backup slide, but I will add one when I send you the final deck, Frank. The big advantage of flat 2LM is that it allows you to use cheaper memory on the CXL side. What we have demoed, in fact I was the one demoing it at the Intel Innovation show and at the Supercomputing show last November, was a server running SAP HANA with 256 gigabytes of memory, first all on pure DRAM, the native memory side. Then we split it half and half: 128 gigabytes on the native side and 128 on the CXL side, with the CXL side built entirely from DDR4. The OS is fooled into thinking it's one big memory, and we showed that SAP HANA performance does not degrade. Flat memory mode is not about performance expansion; it's about TCO. We are saying you can use cheaper memory on CXL and not affect performance by more than about 5%. What flat memory mode does is this: if a cache line misses in DRAM, the hardware knows it will be on the CXL side, because the total footprint is 256 gigabytes in this case. If a line is outside of that, it goes to SSD and there's nothing you can do, but if it's within that footprint, it is either in DRAM or on the CXL side. The CPU migrates only a cache line worth of data: on a DRAM miss, it fetches the sibling cache line from the CXL side and swaps the two lines. For most applications with good temporal and spatial locality, flat memory mode works very well. It does not work well when the application jumps around widely, something like a graph database, where going from one node to the next you don't know where that next node is, so accesses wander all over and there's no good spatial or temporal locality. But that's how flat memory mode works: it does only cache-line movements, so it's much better and faster than auto NUMA or other techniques.
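The swap-on-miss behavior described here can be illustrated with a small toy model. This is my own sketch under stated assumptions, not Intel's actual hardware: equal-sized DRAM and CXL regions, with DRAM acting like a direct-mapped cache of the combined address space at 64-byte cache-line granularity.

```python
# Toy model of flat memory mode: DRAM and CXL regions of equal size; on a DRAM
# miss, the requested line and its sibling on the CXL side are swapped, so the
# hot line is always served from DRAM afterwards. Illustration only.
LINE = 64              # cache-line granularity (not 4 KB pages)
NUM_LINES = 8          # lines per tier, kept tiny for illustration

dram = [f"line{i}" for i in range(NUM_LINES)]             # near tier contents
cxl = [f"line{i + NUM_LINES}" for i in range(NUM_LINES)]  # far tier contents
swaps = 0

def read(addr):
    """Read from the flat (2 * NUM_LINES * LINE bytes) address space."""
    global swaps
    line_no = addr // LINE
    slot = line_no % NUM_LINES           # each slot holds one of two sibling lines
    wanted = f"line{line_no}"
    if dram[slot] != wanted:             # DRAM miss: swap with the sibling on CXL
        dram[slot], cxl[slot] = cxl[slot], dram[slot]
        swaps += 1
    return dram[slot]                    # execution is always out of DRAM

# Repeated accesses to a "far" line pay one swap, then hit in DRAM every time.
for _ in range(4):
    read(10 * LINE)
print("swaps performed:", swaps)         # -> 1
```

A workload with good locality, like the SAP HANA case above, pays the swap cost rarely; a pointer-chasing workload such as a graph traversal would keep triggering swaps, which matches the caveat in the answer.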
Another question related to flat 2LM.
Okay, by the way, the new name is flat memory mode. We took out the 2LM because it was too techy. Anyway, go ahead.
So does flat 2LM need CXL-CMM to support heterogeneous interleaving?
In flat 2LM there is no heterogeneous interleaving. You can think of it as two-tier memory, except the OS is fooled into thinking there's only one tier. And yes, it needs a CMM, a CXL memory module; as I said, we used some with DDR4 memory. For CSPs, DDR4 memory is effectively free, so the point we're making is that half the memory is free and yet the SAP workload ran very well. You may be confusing it with the hetero-interleave mode, which is different. For that one you need fast memory; you cannot use cheaper, lower-bandwidth memory on the CXL side. It is just like today, where an Intel CPU interleaves across the DRAM channels to increase bandwidth; now you can add CXL channels to the mix, but the CXL channels had better have the same DRAM as the main memory. In that sense it's not a TCO play; it should have the same bandwidth.
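For contrast with flat memory mode, here is an illustrative sketch of what heterogeneous interleaving means: consecutive address blocks striped round-robin across DRAM channels and CXL links, so a streaming workload draws bandwidth from all of them at once. The mapping and granularity are assumptions for illustration, not Intel's actual address decoder.

```python
# Illustrative round-robin interleaving of physical addresses across DRAM
# channels and CXL links (assumed mapping, for explanation only).
INTERLEAVE_GRANULARITY = 256   # bytes per stripe (assumed)
targets = ["DDR5-ch0", "DDR5-ch1", "DDR5-ch2", "DDR5-ch3", "CXL-link0", "CXL-link1"]

def target_for(addr: int) -> str:
    """Map a physical address to the channel or link that services it."""
    stripe = addr // INTERLEAVE_GRANULARITY
    return targets[stripe % len(targets)]

# A sequential 2 KB stream touches every DRAM channel and both CXL links.
for addr in range(0, 2048, INTERLEAVE_GRANULARITY):
    print(f"0x{addr:05x} -> {target_for(addr)}")
```

This also shows why the CXL memory needs comparable bandwidth in this mode: every Nth block lands on a CXL link, so a slow link would drag down the whole stream.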
The next question is, will GNR be released around 2025 as scheduled?
Oh, actually I'm not supposed to say, but I believe it's even earlier than that.
Next question: in flat memory mode, is the chunk of data moved between DDR and CXL 4K or larger?
No, it's cache line size.
Okay, the next one is asking: does the capacity of DDR need to match the capacity of CXL?
Correct. Today that is the limitation, because it does cache-line swapping. So if you have, let's say, 256 gigabytes on the native memory side, then you had better have exactly 256 gigabytes on the CXL side. You can always have more memory on the native side: the BIOS will see that this system wants flat 2LM, and if it finds excess memory on the native side, it will boot the system with the memory in two modes, with part of the excess native memory treated as 1LM memory. Say the native side has 384 gigabytes: 256 gigabytes will be used to match the 256 gigabytes on the CXL side, and the extra 128 gigabytes will be considered excess memory, which can of course still be used. Some of our customers, actually some CSPs, have found innovative ways to use that: if an application is not benefiting from flat memory mode, say it's getting too many misses, they just use that 1LM region to keep that application in that area, something like that.
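A quick worked example of that capacity-matching rule, as a sketch of the split described in the answer (the 384 GB and 256 GB figures come from the example above):

```python
# Split native DRAM into the portion paired 1:1 with CXL for flat memory mode
# and the leftover exposed as ordinary 1LM memory. Illustration of the rule only.
def split_native_memory(native_gb: float, cxl_gb: float):
    if native_gb < cxl_gb:
        raise ValueError("flat memory mode needs at least as much native DRAM as CXL memory")
    paired_gb = cxl_gb                  # matched 1:1 with the CXL capacity
    excess_1lm_gb = native_gb - cxl_gb  # still usable, just outside the paired tier
    return paired_gb, excess_1lm_gb

paired, excess = split_native_memory(native_gb=384, cxl_gb=256)
print(f"paired for flat memory mode: {paired} GB, exposed as 1LM: {excess} GB")
# -> paired for flat memory mode: 256 GB, exposed as 1LM: 128 GB
```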
Okay, here's the next question; it's kind of a long one. What is the working procedure of the hardware-controlled tiering feature? Is it fully independent from software, e.g. from middleware support or the OS tiering scheme? And will it be supported in GNR but not in SPR?
The OS-based tiering, which I explained on the left side of the diagram, is supported on all the CPUs; it is always there. The hardware-controlled modes are offered in EMR and in GNR. Hetero-interleave is there on EMR; EMR does not have flat memory mode, but GNR has flat memory mode as well. And OS tiering is independent. Like I said, when you boot up the system, you decide which mode you want; you cannot mix them.
Okay, next question. Has Intel done any performance measurements with workloads comparing auto NUMA and flat 2LM mode, to showcase its advantages?
No, we have not done that yet, but yes, to make any claim, before I tout a feature I should have performance data. I have performance data for pure flat memory mode, but I have not compared it against TPP or some such thing that is built into Linux nowadays. We have not done that, but it's on the list of things to do, yes.
Okay, next question. Do the DRAM clock speeds need to be the same for CXL and native DRAM in flat 2LM?
No, no. In fact, remember I said we used DDR4 on the CXL side for flat 2LM, while the native memory was DDR5. We showed that in the demo: SAP HANA with a mid-size footprint, around 200 gigabytes of database records, running in 128 gigabytes of native DRAM plus 128 gigabytes of CXL memory, and the 128 of CXL was using DDR4. I'll send a slide on that; I'll include it in the deck. We don't have it here.
Okay.
Because, see, what matters is that it's not so much a bandwidth play. Most of the time you want to execute out of DRAM; in fact, you never execute out of CXL. The moment a miss happens in DRAM, the workload stalls for a brief period while the cache lines are swapped, but the workload then gets its cache line in DRAM, and execution always happens out of DRAM. This is unlike auto NUMA, where for some time the workload could actually be using data directly from the CXL side: only after the telemetry kicks in does the OS know that a page is now hot on the CXL side, and that's when it stalls the app, moves the page, and resumes the app, something like that.