All right, so I'll get this session started. This session is about CXL development discussion points. A little bit of background: I did see some of the discussion on the tiering side, and to be honest I was trying to avoid all of that and focus on some low-level parts of the stack here. Tiering is a part of all of this, but we also have driver development and memory reliability issues being faced in the CXL world, in terms of what was built into the spec and what we're starting to see. So I wanted to touch on some of those points and show how development efforts in CXL actually transfer over to other subsystems; there's another side to CXL.
So my first ask, and this has come up several times: one thing I think it's good for people to know is that there's a Discord server, managed by Dan I believe, and we do have discussion on that Discord server. There are monthly meetings as well where we discuss CXL-related issues, and one thing that was explicitly called out is having more reviews on patches. So I'm trying to get people who may be interested involved, and maybe I'll show some more reasons why. Tiering is very exciting, but we also want to make sure we have a robust driver underneath all of this, and even internally at Samsung we're interested in driving some features related to our hardware, and it can be hard to coordinate all of this. One thing that was really nice recently is that CXL development is starting to use patchwork, so now you can see patches queued up and how many reviews they have, which gives people a jumping-in point. That's feedback we hear from a lot of people: OK, you're running forward with this, how do we even get involved, and where's a good place to start? I always tell people that reviews are one of the most important things, and, through the discussion that was recently had, let us know through the mailing list if anything should be queued. Another thing: I was talking with an attendee here who is doing a lot of work in PCIe for an unrelated use case, and I told him that if you're interested in this area, PCIe knowledge is just very, very beneficial in CXL, because of the way the PCIe hierarchy is used: you program what are called root decoders, you have to walk the hierarchies, and more and more functionality is being tied to the port level. So this is valuable for the community in general: start looking more at the PCIe pieces that are shared. One thing I wanted to point out, but haven't seen much on, and maybe Dan has seen this, is the CXL support for PCIe port devices. I think AMD talked about this at Plumbers, but I couldn't find anything on the list yet; I don't know if anyone has seen anything.
That work is still in progress, but it's interesting because PCIe knowledge is really useful here, especially because CXL basically makes your PCIe device your memory controller, and we still need to contend with PCIe error handling and all of those things. That patch in particular is also running into the fact that PCIe has this concept called the port driver, which is where AER, DPC, and these other PCIe error events are reported, and they end up with a driver architecture that makes it really hard to add CXL-specific things. So the invitation for people with PCIe core knowledge to get involved is useful, because we're having to unwind some of the ways in which CXL is stressing out the PCIe core. This support is in progress, but it's running up against that.
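Editor's note: for readers unfamiliar with the port driver being discussed, the sketch below shows the shape of the existing PCIe port "service" model (as in drivers/pci/pcie/ in the kernel), where each service such as AER or DPC registers against a fixed slot on the port. The probe/remove bodies are placeholders for illustration, not the real AER driver.

```c
/*
 * Minimal sketch of the PCIe port-driver service model: the port driver
 * owns the port device and dispatches to registered service drivers
 * (AER, DPC, PME, hotplug). Placeholder bodies, for illustration only.
 */
#include <linux/pci.h>
#include "portdrv.h"	/* struct pcie_port_service_driver (drivers/pci/pcie/) */

static int example_probe(struct pcie_device *dev)
{
	/* set up per-port state, request the service interrupt, etc. */
	return 0;
}

static void example_remove(struct pcie_device *dev)
{
	/* tear down per-port state */
}

static struct pcie_port_service_driver example_service = {
	.name      = "example_service",
	.port_type = PCIE_ANY_PORT,
	.service   = PCIE_PORT_SERVICE_AER,	/* one of the fixed service slots */
	.probe     = example_probe,
	.remove    = example_remove,
};

static int __init example_init(void)
{
	return pcie_port_service_register(&example_service);
}
```

The fixed set of service slots, with the port driver owning the port device, is part of what makes it awkward to hang CXL-specific handling off the port, as described above.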
I think the key thing was that they went with a service-based model: there's an owner, and then you register a service, and that model worked well, so it would be great to work out any limits of that model for this use case. Related to that, something I thought was quite interesting: one area we're looking at is reliability and serviceability in general, and internally I've been pushing for us to look across subsystems that are dealing with memory errors, like EDAC, because we see it coming for CXL as well. One thing that came up recently was how to deal with poison. If I take a step back, and this goes back to Dan's comment about the memory controller being on the device: the device can interrupt the host and tell it about events that have happened, or events the device is aware of, one of them being that the device has found poison, a device physical address that should be treated as poisoned and shouldn't be used. I found this discussion interesting. The first part of the patch is a clean-up for address translation between the device physical address and the host physical address, and that was picked up. I like to encourage that kind of work; even when looking at new functionality, when it starts pulling out pieces that can be used generally, that's a really good thing, even though the second part of this is a bit more controversial, which I'll pull up on the next slide.
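Editor's note: the address-translation clean-up mentioned above is, conceptually, mapping a device physical address (DPA) reported in an event back to a host physical address (HPA) through the region that maps it. Below is a deliberately simplified sketch that ignores interleave (a real region can interleave across devices, so the kernel's helper also accounts for interleave ways and granularity); the names are illustrative, not the driver's.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: a CXL region maps a window of host physical
 * addresses (HPA) onto a range of device physical addresses (DPA).
 * For a non-interleaved region the translation is plain arithmetic.
 */
struct example_region {
	uint64_t hpa_base;	/* start of the region in host PA space */
	uint64_t dpa_base;	/* start of the mapped DPA range */
	uint64_t size;		/* bytes mapped by this region */
};

static bool example_dpa_to_hpa(const struct example_region *r,
			       uint64_t dpa, uint64_t *hpa)
{
	if (dpa < r->dpa_base || dpa >= r->dpa_base + r->size)
		return false;	/* reported poison falls outside this region */
	*hpa = r->hpa_base + (dpa - r->dpa_base);
	return true;
}
```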
So I have a quote from Dan on the slide; I'm sorry, it has a typo, but I pulled it as is. I've talked to several people here, and I think it's a very accurate description: the memory controller, as Dan put it, has moved to PCIe; from a device vendor's side we sometimes say it's moving towards the device. That boundary is very blurred, but this memory controller responsibility is clearly not part of the CPU itself anymore; it just interacts with it. And because this is happening, what I am personally seeing, beyond the company I work for, is a push for differentiated memory. What does that mean? I think there were good examples in the tiering discussion: a second NUMA node with higher latency can be one way of approximating it, but there's also memory with different bandwidth characteristics; there's HBM that's tied to CPUs and can be exposed as a NUMA node; there's error handling in general, where you might have memory with more potential for error issues; and we have an upcoming talk about a device with compressed memory and how to expose it. So it does open up the possibilities of what you can do with memory. I'm not here to debate whether that's a good thing or a bad thing, but it's clear where people are taking it. The key thing here is this poison event handling. The basic model of the driver is that the device can send events; you poll for them, or you can be interrupted, and you see an event related to the media, and then what do you do with it? This is actually a question for Dan: when I looked at this, and I didn't compare everything, EDAC basically does some things very similar to the CXL driver in terms of reporting memory-related events. As we look at this, we're asking: should EDAC be the one that does that, or what was different in the CXL case? I did not look at those patches when they first came out, but why was there a push for CXL to have its own events, and for rasdaemon to understand the CXL ones, versus piggybacking off what was already there for EDAC? I'd be very curious about that.
Let me step back to the history of EDAC. EDAC was a subsystem invented basically to understand all the different architecture-specific memory controller layouts and how to extract error information out of the Intel memory controller versus the IBM thing versus whatever. It's basically trying to wrap some commonality around a whole bunch of different things. CXL, to me, is kind of a standardization of that: rather than teaching EDAC to understand everybody's memory controller, we teach Linux to understand CXL, and now hardware people are responsible for making their hardware look like CXL. But we're in an in-between stage right now where the CXL driver knows how to handle its native CXL events, and we have EDAC that knows how to tell rasdaemon about memory errors in a generic way. So what we're working on now is taking these CXL events and translating them into EDAC, to get the benefit of rasdaemon already knowing how to harvest those errors.
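Editor's note: to make the "translate CXL events into EDAC" idea concrete, here is a rough sketch of forwarding a corrected CXL media error through the EDAC core's reporting entry point, edac_mc_handle_error(), so that existing consumers such as rasdaemon see it like any other memory-controller error. The mem_ctl_info handle and the already-translated host physical address are assumed context; this is not the actual patch series.

```c
#include <linux/edac.h>
#include <linux/mm.h>
#include "edac_mc.h"	/* edac_mc_handle_error(), EDAC core under drivers/edac/ */

/*
 * Illustrative: report a corrected CXL media error via the EDAC core.
 * 'mci' and 'hpa' are assumed to come from the CXL<->EDAC glue code.
 */
static void example_report_ce(struct mem_ctl_info *mci, u64 hpa)
{
	edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
			     1,				/* error count */
			     PHYS_PFN(hpa),		/* page frame number */
			     offset_in_page(hpa),	/* offset within the page */
			     0,				/* syndrome unknown */
			     0, -1, -1,			/* layer positions */
			     "CXL media error",		/* message */
			     "");			/* other detail */
}
```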
So correct my assumption: was rasdaemon changed to understand the CXL events that are output at the moment?
I think that's what we don't want. I know some people added that; they said, oh, let's teach rasdaemon about CXL events. But I think there's value in the fact that you can have a legacy, existing rasdaemon that knows how to check for corrected memory errors and has a leaky-bucket algorithm that says: too many corrected errors at this physical address, take the page offline. That's something that can be done generically for any memory, and we kind of don't want to teach it to do the exact same thing with some CXL-specific event. So for those cases where there's existing rasdaemon value to harvest, I think we should translate the CXL events into something rasdaemon already understands. Then we can also backfill with new CXL-specific things if there's value there. But I don't want to teach rasdaemon a different way to do the exact same thing it already knows how to do with an EDAC event.
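Editor's note: here is a sketch of the leaky-bucket behavior described above, in the spirit of rasdaemon's page isolation but with made-up thresholds and bookkeeping: count corrected errors per page and, when a page exceeds the threshold within a window, ask the kernel to retire it through the existing soft-offline sysfs interface.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/*
 * Illustrative leaky bucket: offline a page after too many corrected
 * errors within a time window. Threshold and window are arbitrary.
 */
#define CE_THRESHOLD	16
#define CE_WINDOW_SEC	(24 * 60 * 60)

struct page_bucket {
	uint64_t paddr;		/* page-aligned physical address */
	unsigned int count;	/* corrected errors seen in the current window */
	time_t window_start;
};

static void soft_offline(uint64_t paddr)
{
	/* Ask the kernel to migrate data away and retire the page. */
	FILE *f = fopen("/sys/devices/system/memory/soft_offline_page", "w");
	if (!f)
		return;
	fprintf(f, "0x%llx\n", (unsigned long long)paddr);
	fclose(f);
}

static void record_corrected_error(struct page_bucket *b, time_t now)
{
	if (now - b->window_start > CE_WINDOW_SEC) {
		b->window_start = now;	/* "leak": start a fresh window */
		b->count = 0;
	}
	if (++b->count >= CE_THRESHOLD)
		soft_offline(b->paddr);
}
```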
I'm in the same boat. If EDAC already covers the same use cases CXL is trying to achieve, just through a different interface, we should merge the two in some way. So I think we'll actively be looking in this space and trying to help out. And I think Dan's response on the mailing list is spot on: there are different actors that can inform the OS about memory errors, and there should be commonality. It's clear that this patch didn't look at all these different options, but I think it's nice that we started discussing it. It was definitely on our radar as well, from our viewpoint of what we're looking at, so I think there's some momentum here. The other thing -- oh, go for it.
Something funny we found here: if you're not familiar with these acronyms, ACPI, GHES and so on, these memory errors can either be reported to the OS directly, or they can be reported to your BIOS first so it can do something, log it in some system or platform log, and then tell the OS about it later. One of the funny, or not so funny, things we found was that for some of these protocol errors, if the error goes to the BIOS first, when the BIOS tells the OS, hey, here's a critical error, the next thing the kernel does is panic, because it's told there's a GHES fatal error --
Oh, right, just a conceptual problem. This is a daemon, and as a daemon it's running in memory, and we're trying to figure out memory errors. What if the memory error is on the very page the daemon is running on?
I mean, a lot of these RAS events are opportunistically trying to get you some forensic information if possible. But, yeah, if you get a memory error that kills rasdaemon, you don't get to know about it.
Then you don't have anything to work with. So the question remains: do we see any error at all if it happens to kill rasdaemon?
The answer is: it depends. Personally, I don't like the fact that the BIOS gets first crack at these things; I like OS-native handling. I'm a kernel developer, I like to work on kernel things, so I like the OS getting first crack at them. The problem is that it's then susceptible to rasdaemon being impacted. For the firmware-first cases, at least if -- let's say it was --
It's not about firmware first; it's the conceptual thing. Do we see anything if rasdaemon is killed?
When you access it -- I think that's a different question; that's the latent error case. You would see the machine check exception when you actually access the poisoned memory. rasdaemon, I think, is looking at the non-fatal stuff --
Corrected errors.
Yeah, errors that were corrected, things like that. Once you actually try to access the memory, that would be the machine check exception, and it would be handled differently.
Right, yeah, we should be clear about which error we're talking about. There's a spectrum: from a system error where we might not get anything, to a poisoned cache line where we get a local machine check that could be recoverable. So you get anywhere from nothing to very precise error information, depending on the error. But the discrepancy I found was that when some of these errors are reported firmware-first, the first thing the firmware tells the kernel to do is panic, you had a fatal error; whereas if we're not talking to the BIOS first and the PCI core gets it first, PCI error recovery will run through to the end, and at the bottom there's a nice comment that says "TODO: should we panic?", question mark. So it's an opportunity to clean that up and have some consistency between those two paths.
Yeah, and from a Samsung perspective we see this in another place. There's another Samsung team in a different division, working on a design using a PCIe controller, facing the same thing: they were trying to make sense of PCIe ports that have PMU information as well as error injection capabilities, and they were asking, what do we do, how do we handle this, who's responsible for what? Basically the response was: why don't you put it in EDAC, at least the error injection path. It was kind of quiet after that. So that's one thing our team might try to help out with: making sense of how to put these all together in some common way. We see it within a company, too, in multiple places other than just CXL, and I think that's one thing I'd like people to take away from this talk. I'm not talking about this just because of CXL; it's coming from platforms. Something is happening around memory that's more error prone, and it's coming from more angles than just CXL.
Yeah. Another recent development is that folks are trying to define a new memory scrub subsystem so that you can control the scrubbing rate of your memory. It basically becomes a tradeoff: how much performance are you willing to give up for scrubbing to keep your system up and running, versus would you rather scrub less, get more performance, but maybe incur more memory errors. Some environments want to be able to control that scrub rate. There's an ACPI mechanism to do this, there's a CXL mechanism to do this, and what the memory scrub people found going upstream was that there's a whole bunch of different EDAC ways to do it as well. So from the CXL side we're stumbling over the fact that people have been independently solving problems, maybe in ways we could do better if we all worked together more closely. You have the scrub item right there.
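Editor's note: the overlap described above is that ACPI (RAS2), CXL patrol scrub, and existing EDAC drivers all want to expose essentially the same "scrub rate" control. A purely hypothetical sketch of the kind of common contract being discussed might look like the following; none of these names come from the actual scrub subsystem patches.

```c
#include <linux/types.h>

/*
 * Hypothetical illustration only: a single scrub-control contract that
 * different backends (ACPI RAS2, CXL patrol scrub, a vendor EDAC driver)
 * could implement, so userspace sees one knob rather than three.
 */
struct example_scrub_ops {
	/* report the supported scrub cycle range, in hours */
	int (*get_cycle_range)(void *ctx, u32 *min_hours, u32 *max_hours);
	/* get/set the current scrub cycle, in hours */
	int (*get_cycle)(void *ctx, u32 *hours);
	int (*set_cycle)(void *ctx, u32 hours);
	/* enable or disable background (patrol) scrubbing */
	int (*set_enabled)(void *ctx, bool enable);
};

struct example_scrub_device {
	const char *name;			/* backend identifier */
	void *ctx;				/* backend-private state */
	const struct example_scrub_ops *ops;
};
```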
Yeah, definitely. I think one other interesting thing about piggybacking on CXL support, again from a device vendor's point of view: one of the things CXL supports is called Get/Set Features, and it's a way to figure out from the device which features it supports. One of those features happens to be scrub control. So when we look at this and see that a device has Get/Set Features support, we're excited, and on the whole RAS side we're excited as well; it aligns nicely with practical things we expect to see. So we're very happy to see this work. But we've definitely noticed that the problem now is not really CXL; it's about integrating it into the bigger picture, at least for the pieces I've brought up so far. The other one, which I have not looked at, is the AMD one that you brought up in discussion as well. I think the only thing it does is report poisoned memory, very similar to having a poison list, as you mentioned. So, again, another source -- that's the memory controller from their MI300.
No, that's their AI accelerator.
Oh, okay.
And as far as I understand, it can keep a persistent list of memory poison locations, so the next time you boot up you can say, hey, don't use those pages again, because we know they're bad. There's a similar CXL command for this. The other piece of this, which we kind of started in the pmem space, is that these technologies have a way to repair errors as well. Typically memory poison and memory failures have been permanent events: never touch that page again. But you can scrub the error and sometimes get the page back, and we don't have a comprehensive way to go back and forth. We do it with filesystems, where we can deallocate that physical page and bring it back in, but there's no generic way for, say, the page allocator to figure out, oh, I can send this page off to be repaired and bring it back.
Okay.
I think it's hilarious that the memory vendors aren't learning from the disk manufacturers. Do you remember, like 30 or 40 years ago, when hard drives used to come with a list of sectors that were bad, and the filesystem had this bad-blocks thing you could use to map them out?
Pretty much the same thing.
And now we don't do that anymore.
I'm going to claim innocence and say it's someone else pushing; we do what they want. But understood, yeah. I just think this is happening beyond CXL, and that's the most important thing to think about here. That's the world as it's evolving.
But to your point, the reason that went away is that the drives got much better at not reporting poison.
No, no. They have extra space, they map that stuff out, and they use other blocks instead; there's spare provisioning.
This was pre-bad-blocks: drives that couldn't map it out and put it somewhere else. They didn't remap sectors to say --
One thing I wanted to bring up, and I enjoyed the discussion on the tiering side too, is benchmarks. It would be nice when we're having tiering discussions, and I don't know if this is possible, if there were some agreed configuration. It's hard with different server platforms and all these different things, but it would help if there were some agreement on, say, three workloads that we're willing to run to see if there are any regressions. My comment there, too, is that I know on the call there was an ask to the whole group about which workloads we're interested in. We've looked at Redis, in-memory databases, things like that. But as a device vendor, in my personal opinion, I'm not the most valuable person to be providing benchmark information. Of course we're interested in running them and understanding more, but I'm not the end workload user, and that's the key person who should really be giving us that information if possible. I know there are some venues for this. OCP is one where I do see some potential overlap in providing benchmarks; they have this composable memory systems workstream, and not all of it will be directly relevant, it's broader and more general, but there are some goals within that workstream to work on getting traces. That brings up another thing about CXL: OCP was a forum for interested parties to come up with a model of a CXL device that compresses memory, driven, I think, by Meta and Google as the primary authors. So you see these things coming from a variety of sources. What I'm trying to tell people is that I look at it from a memory expander viewpoint, but there are many different use cases, and CXL is a potential way to do many of these things; it's the flexibility of moving the memory controller. And that tiering working group, I think it's a great venue, and I hope we will have members involved and contributing. And, yeah, that's it.
I was going to ask: isn't this a general problem, in that touching mm is hard even without CXL being part of it? People proposing mm changes generally aren't sure which workloads they're regressing. It's not until, say, Mel runs things and talks to his customers that it's, oh no, you regressed this very important use case.
It's not systematic. I'm not at SUSE, but we chat, and we've got some legacy from there; there was mmtests, which came out of there, and I think there were some benchmarks worth running and configurations that were reproducible, that were donated in some way. He would say, here, I ran this benchmark, and I think he was great at that, more than many other people; I felt like I could reproduce some of it. I feel that's lacking a lot of the time, and it's just my hope: I'd love to see one of these tiering patches come with an associated workload I could just go run and check out at the same time. But that's kind of the way we see it. And then I'm very curious that I don't hear about anybody using HBM that much. I know it's a small use case compared to the large ones, but to me it's like a pmem kind of thing: HBM can be used as a cache or it can be exposed as a NUMA node, and I would imagine people using those systems might want some sort of tiering there. I haven't seen anything, though.
Maybe somebody else can speak up, and this is just my personal opinion, but it feels like people who pay for HBM want the kernel to get out of the way.
Okay.
So they don't come to us and say, hey, your NUMA balancing is not working for HBM, because they've turned everything off and allocated HBM to their own special thing.
That's fair. I know one user like that, a supercomputing kind of thing; they want one workload and know exactly what they're doing. I think that's consistent.
The last thing, and we're kind of running out of time: this is super challenging, but we briefly discussed it. Internally, we've been trying to settle on a baseline, and we haven't been good at it, I'll be completely honest, because it changes so fast: what's in CXL, what's supported, the emulation of it. There's QEMU, there's cxl_test, there was some user space tooling proposed. It's moving very fast, and that makes it very hard to settle on what a baseline should be. But I would say we should leverage what's currently there. If more than one vendor of, say, CPUs or devices would be willing to donate some hardware -- now might not be the time, it's a little early -- at some point I think it would be great to have a central place where we could report information about devices and CPUs. We have had success allowing people to use some hardware that we have, but I can see the resistance from some people, because it's a little bit scary saying, hey, I'm going to access this machine provided by Samsung, as just one person. So somehow wrapping this into some envelope that people trust more and that won't go away would be nice. But where do you host it, what tests do you run, how do you present the data? I think this is being talked about in many subsystems. Where we provide hardware is in these co-location centers, and it costs money just to have it there, plus the money for the hardware itself. So there are all these pieces. I know there are several parties for whom there would be value, but I don't know whether they're willing to pitch in on these kinds of efforts; that's the hard part. But in a dream world, I'd love to see this. I think that's all I have.
That does sound like a good idea. Adam and I were talking about this earlier; we're also a memory vendor, and our team is trying to figure out how to contribute more, stumbling a little bit on what exactly to do. This is a sort of now-or-later offer. Specifically, we're interested in assignments: tell us what you need. We know which mailbox commands we need to enable and so on, and there are patches pushed for that. But if we could figure out a shared wish list or to-do list kind of thing, or if you can just reach out to me and I'll connect you with the people, we'll try to help with that. Of course it's challenging because some of them are on the opposite side of the earth, which makes things like the monthly sync difficult.
My suggestion is that patchwork will help, in my opinion: you can see where there are no reviews and what's queued up. When I work internally, the first thing I ask people to do is review, and then build from there; I think that's a great start. And then there's a lot of stuff on the RAS side that's unsolved, and someone just needs to get in there and figure those things out, too. But it's hard to coordinate; that's the hardest part. I don't know if Dan has thoughts on that.
Yeah, that's also kind of a general problem: how do people get involved in the kernel, how do they find things to work on? But we should publicize the Discord more. I feel like people will use that more than the lists for those fly-by questions where you don't want to send to a global email list, so that might be a good place to ask these things. There's no grand vision; a lot of these work tasks just pop up, coming from review or from people having pain points. So if we coordinate more on what those pain points are and who's not looking at them, people can grab things.
Yeah, the Discord server is great because there's history there, but at the same time it's time-consuming to go through that history.
Yeah.
One of the things about getting people to donate hardware: you have both the CPU side and the device side. Who's building devices with which IP out there? There's a limited set, so I don't know how many connections you have with the Rambuses or the Intels or the Synopsyses of the world that have CXL IP, or someone with home-grown IP they're using in a device, and how do you get them all to be compliant?
Yeah, Samsung is a big company; we have connections to many different places. From a device perspective, the memory expander device is what I see as the most near-term, and given enough time that should be more generally available; I think it's around the corner, I'll put it at that. But it's a bit of a chicken-and-egg problem: once the servers are more widely available, then I imagine the devices will come out more broadly, too. And there's been a little resistance to the first generation of CXL with its limited capabilities; everyone's like, eh, it's kind of limited in some ways. I understand the benefits of it as well, for prototyping and figuring things out, but it's a little chicken-and-egg right now, and my gut sense is that this will get worked out. From our team's perspective, we've had more hardware in our hands in the past year than previously, if that makes sense, so it's definitely around the corner. Yeah, that's it. All right, thanks for your time.