All right, so I'll get this session started. This session is about CXL development discussion points. A little bit of background: I did see some of the discussion on the tiering side, and to be honest I was trying to avoid all of that and focus on some low-level parts of the stack here. Tiering is a part of all of this, but we also have driver development and memory reliability issues being faced in the CXL world, in terms of what was built into the spec and what we're starting to see. So I wanted to touch on some of those points and show how development efforts in CXL actually transfer over to other subsystems; there's another side to CXL.
So my first ask, and this has come up several times: one thing I think it's good for people to know is that there's a Discord server, managed by Dan I believe, and we do have discussion on that Discord server. There are monthly meetings as well where we discuss CXL-related issues, and one thing that was explicitly called out is having more reviews on patches. So I'm trying to get people who may be interested involved, and maybe I'll show some more reasons why. Tiering is very exciting, but we also want to make sure we have a robust driver underneath all of this, and even internally at Samsung we're interested in driving some features related to our hardware, and it can be hard to coordinate all of this. One thing that was really nice recently is that CXL development is starting to use patchwork, so now you can see patches queued up and how many reviews they have, which gives people a jumping-in point. That's feedback we hear from a lot of people: OK, you're running forward with this, how do we even get involved, and where's a good place to start? I always tell people that reviews are one of the most important things, and, through the discussion that was recently had, let us know through the mailing list if anything should be queued. Another thing: I was talking with an attendee here who is doing a lot of work in PCIe for an unrelated use case, and I told him that if you're interested in this area, PCIe knowledge is just very, very beneficial in CXL, because of the way the PCIe hierarchy is used: you program what are called root decoders, you have to walk the hierarchies, and more and more functionality is being tied to the port level. So this is valuable for the community in general: start looking more at the PCIe pieces that are shared. One thing I wanted to point out, but haven't seen much on, and maybe Dan has seen this, is the CXL support for PCIe port devices. I think AMD talked about this at Plumbers, but I couldn't find anything on the list yet; I don't know if anyone has seen anything.
That work is still in progress, but it's interesting because PCIe knowledge is really useful here, especially because CXL basically makes your PCIe device your memory controller, and we still need to contend with PCIe error handling and all of those things. That patch in particular is also running into the fact that PCIe has this concept called the port driver, which is where AER, DPC, and these other PCIe error events are reported, and they end up with a driver architecture that makes it really hard to add CXL-specific things. So the invitation for people with PCIe core knowledge to get involved is useful, because we're having to unwind some of the ways in which CXL is stressing out the PCIe core. This support is in progress, but it's running up against that.
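Editor's note: for readers unfamiliar with the port driver being discussed, the sketch below shows the shape of the existing PCIe port "service" model (as in drivers/pci/pcie/ in the kernel), where each service such as AER or DPC registers against a fixed slot on the port. The probe/remove bodies are placeholders for illustration, not the real AER driver.

```c
/*
 * Minimal sketch of the PCIe port-driver service model: the port driver
 * owns the port device and dispatches to registered service drivers
 * (AER, DPC, PME, hotplug). Placeholder bodies, for illustration only.
 */
#include <linux/pci.h>
#include "portdrv.h"	/* struct pcie_port_service_driver (drivers/pci/pcie/) */

static int example_probe(struct pcie_device *dev)
{
	/* set up per-port state, request the service interrupt, etc. */
	return 0;
}

static void example_remove(struct pcie_device *dev)
{
	/* tear down per-port state */
}

static struct pcie_port_service_driver example_service = {
	.name      = "example_service",
	.port_type = PCIE_ANY_PORT,
	.service   = PCIE_PORT_SERVICE_AER,	/* one of the fixed service slots */
	.probe     = example_probe,
	.remove    = example_remove,
};

static int __init example_init(void)
{
	return pcie_port_service_register(&example_service);
}
```

The fixed set of service slots, with the port driver owning the port device, is part of what makes it awkward to hang CXL-specific handling off the port, as described above.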
I think the key thing was that they went with a service-based model: there's an owner, and then you register a service, and that model worked well, so it would be great to work out any limits of that model for this use case. Related to that, something I thought was quite interesting: one area we're looking at is reliability and serviceability in general, and internally I've been pushing for us to look across subsystems that are dealing with memory errors, like EDAC, because we see it coming for CXL as well. One thing that came up recently was how to deal with poison. If I take a step back, and this goes back to Dan's comment about the memory controller being on the device: the device can interrupt the host and tell it about events that have happened, or events the device is aware of, one of them being that the device has found poison, a device physical address that should be treated as poisoned and shouldn't be used. I found this discussion interesting. The first part of the patch is a clean-up for address translation between the device physical address and the host physical address, and that was picked up. I like to encourage that kind of work; even when looking at new functionality, when it starts pulling out pieces that can be used generally, that's a really good thing, even though the second part of this is a bit more controversial, which I'll pull up on the next slide.
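Editor's note: the address-translation clean-up mentioned above is, conceptually, mapping a device physical address (DPA) reported in an event back to a host physical address (HPA) through the region that maps it. Below is a deliberately simplified sketch that ignores interleave (a real region can interleave across devices, so the kernel's helper also accounts for interleave ways and granularity); the names are illustrative, not the driver's.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: a CXL region maps a window of host physical
 * addresses (HPA) onto a range of device physical addresses (DPA).
 * For a non-interleaved region the translation is plain arithmetic.
 */
struct example_region {
	uint64_t hpa_base;	/* start of the region in host PA space */
	uint64_t dpa_base;	/* start of the mapped DPA range */
	uint64_t size;		/* bytes mapped by this region */
};

static bool example_dpa_to_hpa(const struct example_region *r,
			       uint64_t dpa, uint64_t *hpa)
{
	if (dpa < r->dpa_base || dpa >= r->dpa_base + r->size)
		return false;	/* reported poison falls outside this region */
	*hpa = r->hpa_base + (dpa - r->dpa_base);
	return true;
}
```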
So I have a quote from Dan on the slide; I'm sorry, it has a typo, but I pulled it as is. I've talked to several people here, and I think it's a very accurate description: the memory controller, as Dan put it, has moved to PCIe; from a device vendor's side we sometimes say it's moving towards the device. That boundary is very blurred, but this memory controller responsibility is clearly not part of the CPU itself anymore; it just interacts with it. And because this is happening, what I am personally seeing, beyond the company I work for, is a push for differentiated memory. What does that mean? I think there were good examples in the tiering discussion: a second NUMA node with higher latency can be one way of approximating it, but there's also memory with different bandwidth characteristics; there's HBM that's tied to CPUs and can be exposed as a NUMA node; there's error handling in general, where you might have memory with more potential for error issues; and we have an upcoming talk about a device with compressed memory and how to expose it. So it does open up the possibilities of what you can do with memory. I'm not here to debate whether that's a good thing or a bad thing, but it's clear where people are taking it. The key thing here is this poison event handling. The basic model of the driver is that the device can send events; you poll for them, or you can be interrupted, and you see an event related to the media, and then what do you do with it? This is actually a question for Dan: when I looked at this, and I didn't compare everything, EDAC basically does some things very similar to the CXL driver in terms of reporting memory-related events. As we look at this, we're asking: should EDAC be the one that does that, or what was different in the CXL case? I did not look at those patches when they first came out, but why was there a push for CXL to have its own events, and for rasdaemon to understand the CXL ones, versus piggybacking off what was already there for EDAC? I'd be very curious about that.
Let me step back to the history of EDAC. EDAC was a subsystem invented basically to understand all the different architecture-specific memory controller layouts and how to extract error information out of the Intel memory controller versus the IBM thing versus whatever. It's basically trying to wrap some commonality around a whole bunch of different things. CXL, to me, is kind of a standardization of that: rather than teaching EDAC to understand everybody's memory controller, we teach Linux to understand CXL, and now hardware people are responsible for making their hardware look like CXL. But we're in an in-between stage right now where the CXL driver knows how to handle its native CXL events, and we have EDAC that knows how to tell rasdaemon about memory errors in a generic way. So what we're working on now is taking these CXL events and translating them into EDAC, to get the benefit of rasdaemon already knowing how to harvest those errors.
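Editor's note: to make the "translate CXL events into EDAC" idea concrete, here is a rough sketch of forwarding a corrected CXL media error through the EDAC core's reporting entry point, edac_mc_handle_error(), so that existing consumers such as rasdaemon see it like any other memory-controller error. The mem_ctl_info handle and the already-translated host physical address are assumed context; this is not the actual patch series.

```c
#include <linux/edac.h>
#include <linux/mm.h>
#include "edac_mc.h"	/* edac_mc_handle_error(), EDAC core under drivers/edac/ */

/*
 * Illustrative: report a corrected CXL media error via the EDAC core.
 * 'mci' and 'hpa' are assumed to come from the CXL<->EDAC glue code.
 */
static void example_report_ce(struct mem_ctl_info *mci, u64 hpa)
{
	edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
			     1,				/* error count */
			     PHYS_PFN(hpa),		/* page frame number */
			     offset_in_page(hpa),	/* offset within the page */
			     0,				/* syndrome unknown */
			     0, -1, -1,			/* layer positions */
			     "CXL media error",		/* message */
			     "");			/* other detail */
}
```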
So correct my assumption: was rasdaemon changed to understand the CXL events that are output at the moment?
I think that's what we don't want. I know some people added that; they said, oh, let's teach rasdaemon about CXL events. But I think there's value in the fact that you can have a legacy, existing rasdaemon that knows how to check for corrected memory errors and has a leaky-bucket algorithm that says: too many corrected errors at this physical address, take the page offline. That's something that can be done generically for any memory, and we kind of don't want to teach it to do the exact same thing with some CXL-specific event. So for those cases where there's existing rasdaemon value to harvest, I think we should translate the CXL events into something rasdaemon already understands. Then we can also backfill with new CXL-specific things if there's value there. But I don't want to teach rasdaemon a different way to do the exact same thing it already knows how to do with an EDAC event.
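Editor's note: here is a sketch of the leaky-bucket behavior described above, in the spirit of rasdaemon's page isolation but with made-up thresholds and bookkeeping: count corrected errors per page and, when a page exceeds the threshold within a window, ask the kernel to retire it through the existing soft-offline sysfs interface.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/*
 * Illustrative leaky bucket: offline a page after too many corrected
 * errors within a time window. Threshold and window are arbitrary.
 */
#define CE_THRESHOLD	16
#define CE_WINDOW_SEC	(24 * 60 * 60)

struct page_bucket {
	uint64_t paddr;		/* page-aligned physical address */
	unsigned int count;	/* corrected errors seen in the current window */
	time_t window_start;
};

static void soft_offline(uint64_t paddr)
{
	/* Ask the kernel to migrate data away and retire the page. */
	FILE *f = fopen("/sys/devices/system/memory/soft_offline_page", "w");
	if (!f)
		return;
	fprintf(f, "0x%llx\n", (unsigned long long)paddr);
	fclose(f);
}

static void record_corrected_error(struct page_bucket *b, time_t now)
{
	if (now - b->window_start > CE_WINDOW_SEC) {
		b->window_start = now;	/* "leak": start a fresh window */
		b->count = 0;
	}
	if (++b->count >= CE_THRESHOLD)
		soft_offline(b->paddr);
}
```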
I'm in the same boat. If EDAC already covers the same use cases CXL is trying to achieve, just through a different interface, we should merge the two in some way. So I think we'll actively be looking in this space and trying to help out. And I think Dan's response on the mailing list is spot on: there are different actors that can inform the OS about memory errors, and there should be commonality. It's clear that this patch didn't look at all these different options, but I think it's nice that we started discussing it. It was definitely on our radar as well, from our viewpoint of what we're looking at, so I think there's some momentum here. The other thing -- oh, go for it.
Something funny we found here: if you're not familiar with these acronyms, ACPI, GHES and so on, these memory errors can either be reported to the OS directly, or they can be reported to your BIOS first so it can do something, log it in some system or platform log, and then tell the OS about it later. One of the funny, or not so funny, things we found was that for some of these protocol errors, if the error goes to the BIOS first, when the BIOS tells the OS, hey, here's a critical error, the next thing the kernel does is panic, because it's told there's a GHES fatal error --
Oh, right, just a conceptual problem. This is a daemon, and as a daemon it's running in memory, and we're trying to figure out memory errors. What if the memory error is on the very page the daemon is running on?
I mean, a lot of these RAS events are opportunistically trying to get you some forensic information if possible. But, yeah, if you get a memory error that kills rasdaemon, you don't get to know about it.
Then you don't have anything to work with. So the question remains: do we see any error at all if it happens to kill rasdaemon?
The answer is: it depends. Personally, I don't like the fact that the BIOS gets first crack at these things; I like OS-native handling. I'm a kernel developer, I like to work on kernel things, so I like the OS getting first crack at them. The problem is that it's then susceptible to rasdaemon being impacted. For the firmware-first cases, at least if -- let's say it was --
It's not about firmware first; it's the conceptual thing. Do we see anything if rasdaemon is killed?
When you access it -- I think that's a different question; that's the latent error case. You would see the machine check exception when you actually access the poisoned memory. rasdaemon, I think, is looking at the non-fatal stuff --
Corrected errors.
Yeah, errors that were corrected, things like that. Once you actually try to access the memory, that would be the machine check exception, and it would be handled differently.
Right, yeah, we should be clear about which error we're talking about. There's a spectrum: from a system error where we might not get anything, to a poisoned cache line where we get a local machine check that could be recoverable. So you get anywhere from nothing to very precise error information, depending on the error. But the discrepancy I found was that when some of these errors are reported firmware-first, the first thing the firmware tells the kernel to do is panic, you had a fatal error; whereas if we're not talking to the BIOS first and the PCI core gets it first, PCI error recovery will run through to the end, and at the bottom there's a nice comment that says "TODO: should we panic?", question mark. So it's an opportunity to clean that up and have some consistency between those two paths.
Yeah, and from a Samsung perspective we see this in another place. There's another Samsung team in a different division, working on a design using a PCIe controller, facing the same thing: they were trying to make sense of PCIe ports that have PMU information as well as error injection capabilities, and they were asking, what do we do, how do we handle this, who's responsible for what? Basically the response was: why don't you put it in EDAC, at least the error injection path. It was kind of quiet after that. So that's one thing our team might try to help out with: making sense of how to put these all together in some common way. We see it within a company, too, in multiple places other than just CXL, and I think that's one thing I'd like people to take away from this talk. I'm not talking about this just because of CXL; it's coming from platforms. Something is happening around memory that's more error prone, and it's coming from more angles than just CXL.
Yeah. Another recent development is that folks are trying to define a new memory scrub subsystem so that you can control the scrubbing rate of your memory. It basically becomes a tradeoff: how much performance are you willing to give up for scrubbing to keep your system up and running, versus would you rather scrub less, get more performance, but maybe incur more memory errors. Some environments want to be able to control that scrub rate. There's an ACPI mechanism to do this, there's a CXL mechanism to do this, and what the memory scrub people found going upstream was that there's a whole bunch of different EDAC ways to do it as well. So from the CXL side we're stumbling over the fact that people have been independently solving problems, maybe in ways we could do better if we all worked together more closely. You have the scrub item right there.
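Editor's note: the overlap described above is that ACPI (RAS2), CXL patrol scrub, and existing EDAC drivers all want to expose essentially the same "scrub rate" control. A purely hypothetical sketch of the kind of common contract being discussed might look like the following; none of these names come from the actual scrub subsystem patches.

```c
#include <linux/types.h>

/*
 * Hypothetical illustration only: a single scrub-control contract that
 * different backends (ACPI RAS2, CXL patrol scrub, a vendor EDAC driver)
 * could implement, so userspace sees one knob rather than three.
 */
struct example_scrub_ops {
	/* report the supported scrub cycle range, in hours */
	int (*get_cycle_range)(void *ctx, u32 *min_hours, u32 *max_hours);
	/* get/set the current scrub cycle, in hours */
	int (*get_cycle)(void *ctx, u32 *hours);
	int (*set_cycle)(void *ctx, u32 hours);
	/* enable or disable background (patrol) scrubbing */
	int (*set_enabled)(void *ctx, bool enable);
};

struct example_scrub_device {
	const char *name;			/* backend identifier */
	void *ctx;				/* backend-private state */
	const struct example_scrub_ops *ops;
};
```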
Yeah, definitely. I think one other interesting thing about piggybacking on CXL support, again from a device vendor's point of view: one of the things CXL supports is called Get/Set Features, and it's a way to figure out from the device which features it supports. One of those features happens to be scrub control. So when we look at this and see that a device has Get/Set Features support, we're excited, and on the whole RAS side we're excited as well; it aligns nicely with practical things we expect to see. So we're very happy to see this work. But we've definitely noticed that the problem now is not really CXL; it's about integrating it into the bigger picture, at least for the pieces I've brought up so far. The other one, which I have not looked at, is the AMD one that you brought up in discussion as well. I think the only thing it does is report poisoned memory, very similar to having a poison list, as you mentioned. So, again, another source -- that's the memory controller from their MI300.
No, that's their AI accelerator.
Oh, okay.
And as far as I understand, it can keep a persistent list of memory poison locations, so the next time you boot up you can say, hey, don't use those pages again, because we know they're bad. There's a similar CXL command for this. The other piece of this, which we kind of started in the pmem space, is that these technologies have a way to repair errors as well. Typically memory poison and memory failures have been permanent events: never touch that page again. But you can scrub the error and sometimes get the page back, and we don't have a comprehensive way to go back and forth. We do it with filesystems, where we can deallocate that physical page and bring it back in, but there's no generic way for, say, the page allocator to figure out, oh, I can send this page off to be repaired and bring it back.
Okay.
I think it's hilarious that the memory vendors aren't learning from the disk manufacturers. Do you remember, like 30 or 40 years ago, when hard drives used to come with a list of sectors that were bad, and the filesystem had this bad-blocks thing you could use to map them out?
Pretty much the same thing.
And now we don't do that anymore.
I'm going to claim innocence and say it's someone else pushing; we do what they want. But understood, yeah. I just think this is happening beyond CXL, and that's the most important thing to think about here. That's the world as it's evolving.
But to your point, the reason that went away is that the drives got much better at not reporting poison.
No, no. They have extra space, they map that stuff out, and they use other blocks instead; there's spare provisioning.
This was pre-bad-blocks: drives that couldn't map it out and put it somewhere else. They didn't remap sectors to say --
One thing I wanted to bring up, and I enjoyed the discussion on the tiering side too, is benchmarks. It would be nice when we're having tiering discussions, and I don't know if this is possible, if there were some agreed configuration. It's hard with different server platforms and all these different things, but it would help if there were some agreement on, say, three workloads that we're willing to run to see if there are any regressions. My comment there, too, is that I know on the call there was an ask to the whole group about which workloads we're interested in. We've looked at Redis, in-memory databases, things like that. But as a device vendor, in my personal opinion, I'm not the most valuable person to be providing benchmark information. Of course we're interested in running them and understanding more, but I'm not the end workload user, and that's the key person who should really be giving us that information if possible. I know there are some venues for this. OCP is one where I do see some potential overlap in providing benchmarks; they have this composable memory systems workstream, and not all of it will be directly relevant, it's broader and more general, but there are some goals within that workstream to work on getting traces. That brings up another thing about CXL: OCP was a forum for interested parties to come up with a model of a CXL device that compresses memory, driven, I think, by Meta and Google as the primary authors. So you see these things coming from a variety of sources. What I'm trying to tell people is that I look at it from a memory expander viewpoint, but there are many different use cases, and CXL is a potential way to do many of these things; it's the flexibility of moving the memory controller. And that tiering working group, I think it's a great venue, and I hope we will have members involved and contributing. And, yeah, that's it.
I was going to ask: isn't this a general problem, in that touching mm is hard even without CXL being part of it? People proposing mm changes generally aren't sure which workloads they're regressing. It's not until, say, Mel runs things and talks to his customers that it's, oh no, you regressed this very important use case.
It's not systematic. I'm not at SUSE, but we chat, and we've got some legacy from there; there was mmtests, which came out of there, and I think there were some benchmarks worth running and configurations that were reproducible, that were donated in some way. He would say, here, I ran this benchmark, and I think he was great at that, more than many other people; I felt like I could reproduce some of it. I feel that's lacking a lot of the time, and it's just my hope: I'd love to see one of these tiering patches come with an associated workload I could just go run and check out at the same time. But that's kind of the way we see it. And then I'm very curious that I don't hear about anybody using HBM that much. I know it's a small use case compared to the large ones, but to me it's like a pmem kind of thing: HBM can be used as a cache or it can be exposed as a NUMA node, and I would imagine people using those systems might want some sort of tiering there. I haven't seen anything, though.
Maybe somebody else can speak up, and this is just my personal opinion, but it feels like people who pay for HBM want the kernel to get out of the way.
Okay.
So they don't come to us and say, hey, your NUMA balancing is not working for HBM, because they've turned everything off and allocated HBM to their own special thing.
That's fair. I know one user like that, a supercomputing kind of thing; they want one workload and know exactly what they're doing. I think that's consistent.
The last thing, and we're kind of running out of time: this is super challenging, but we briefly discussed it. Internally, we've been trying to settle on a baseline, and we haven't been good at it, I'll be completely honest, because it changes so fast: what's in CXL, what's supported, the emulation of it. There's QEMU, there's cxl_test, there was some user space tooling proposed. It's moving very fast, and that makes it very hard to settle on what a baseline should be. But I would say we should leverage what's currently there. If more than one vendor of, say, CPUs or devices would be willing to donate some hardware -- now might not be the time, it's a little early -- at some point I think it would be great to have a central place where we could report information about devices and CPUs. We have had success allowing people to use some hardware that we have, but I can see the resistance from some people, because it's a little bit scary saying, hey, I'm going to access this machine provided by Samsung, as just one person. So somehow wrapping this into some envelope that people trust more and that won't go away would be nice. But where do you host it, what tests do you run, how do you present the data? I think this is being talked about in many subsystems. Where we provide hardware is in these co-location centers, and it costs money just to have it there, plus the money for the hardware itself. So there are all these pieces. I know there are several parties for whom there would be value, but I don't know whether they're willing to pitch in on these kinds of efforts; that's the hard part. But in a dream world, I'd love to see this. I think that's all I have.
That does sound like a good idea. Adam and I were talking about this earlier; we're also a memory vendor, and our team is trying to figure out how to contribute more, stumbling a little bit on what exactly to do. This is a sort of now-or-later offer. Specifically, we're interested in assignments: tell us what you need. We know which mailbox commands we need to enable and so on, and there are patches pushed for that. But if we could figure out a shared wish list or to-do list kind of thing, or if you can just reach out to me and I'll connect you with the people, we'll try to help with that. Of course it's challenging because some of them are on the opposite side of the earth, which makes things like the monthly sync difficult.
My suggestion is that patchwork will help, in my opinion: you can see where there are no reviews and what's queued up. When I work internally, the first thing I ask people to do is review, and then build from there; I think that's a great start. And then there's a lot of stuff on the RAS side that's unsolved, and someone just needs to get in there and figure those things out, too. But it's hard to coordinate; that's the hardest part. I don't know if Dan has thoughts on that.
Yeah, that's also kind of a general problem: how do people get involved in the kernel, how do they find things to work on? But we should publicize the Discord more. I feel like people will use that more than the lists for those fly-by questions where you don't want to send to a global email list, so that might be a good place to ask these things. There's no grand vision; a lot of these work tasks just pop up, coming from review or from people having pain points. So if we coordinate more on what those pain points are and who's not looking at them, people can grab things.
Yeah, the Discord server is great because there's history there, but at the same time it's time-consuming to go through that history.
Yeah.
One of the things about getting people to donate hardware: you have both the CPU side and the device side. Who's building devices with which IP out there? There's a limited set, so I don't know how many connections you have with the Rambuses or the Intels or the Synopsyses of the world that have CXL IP, or someone with home-grown IP they're using in a device, and how do you get them all to be compliant?
Yeah, Samsung is a big company; we have connections to many different places. From a device perspective, the memory expander device is what I see as the most near-term, and given enough time that should be more generally available; I think it's around the corner, I'll put it at that. But it's a bit of a chicken-and-egg problem: once the servers are more widely available, then I imagine the devices will come out more broadly, too. And there's been a little resistance to the first generation of CXL with its limited capabilities; everyone's like, eh, it's kind of limited in some ways. I understand the benefits of it as well, for prototyping and figuring things out, but it's a little chicken-and-egg right now, and my gut sense is that this will get worked out. From our team's perspective, we've had more hardware in our hands in the past year than previously, if that makes sense, so it's definitely around the corner. Yeah, that's it. All right, thanks for your time.