Hello, my name is Robert Richter. This talk will be about CXL error reporting. It was created together with Yazen, who has probably joined online.
Here is a short overview of the presentation. I will give a brief overview of the CXL design with respect to its impact on RAS and error reporting. I will especially talk about CXL memory devices and then go deeper into the error handling. I will also briefly touch on RCD mode, which is a restricted mode introduced basically with CXL 1.1, and its impact on error handling, and then give an outlook on the kernel implementation and also user space. So in the end, I hope there's some room for discussion.
So what does a CXL system look like? Basically, in addition to the existing DRAM in the system, there is now a CXL memory device plugged into the same system, similar to PCIe. It is attached through a link here, the CXL link. This enables us to add memory to the system through a standardized interface. We now have a way to include memory with different characteristics, which can be optimized with respect to cost, capacity, power, et cetera. This memory resides on the device, but it is managed and controlled by the host, which means the memory itself is mapped into the system's coherent address space. As such, it is accessible to the system through load/store semantics. I also want to mention that this device topology is visible in the PCIe hierarchy to some degree; I will talk about this a little later. And this link can be shared with PCIe, so you can also use it to plug in PCIe devices, which then show up as PCIe devices to the PCIe host.
So how does all this impact RAS and error detection on these devices? In a native system, we just have the DRAMs and the CPU. There is typically a memory controller in the CPU, and that's it, no big deal; we can fetch the errors from there. With CXL, this is different. We now have a host and a device where the memory resides. We no longer have a single memory controller in the system; there are now more entities that can detect errors in system memory. The device can also report errors and provide error details. Once an error pops up in the device's memory, it must go through the device, through the link, and back to the host. So many components are involved in detecting the error compared to local system memory. And with more locations, we also see different error types in the system, which have different error flows. All this makes things more complex. Another note: these devices come from different vendors, so it is no longer a uniform system from one vendor. We have different memory cards inserted, which might also have different implementations of error reporting.
What else affects error reporting here? We also have firmware first and OS first; I will talk about this later in the details on memory errors. We have the two CXL modes: VH mode (virtual hierarchy), which is basically the current implementation, and the restricted mode. Both modes have different topologies, and this also causes a different error flow. Another thing is that we see CXL memory the same as standard system memory, so we also have the expectation of the same look and feel with respect to error reporting. That raises the question of how we integrate this into existing subsystems. But this might not fit well because of the changed way the errors are reported. We might need to implement new tools to detect the errors, or we might add the error handling part to existing tools like ndctl. The next complexity is error injection itself. We have many components and tools involved, and for each kind of error type or component, we might need different tools and different approaches for how these errors can be injected.
So where can errors happen in a CXL memory device? You see here the host and the device, connected through the Flex Bus, or basically a CXL link if it is connected as CXL. Then we have these three components: the CXL RAS capabilities, the PCIe AER capabilities, and down here the mailbox interface. For the CXL link, there are three CXL protocols: .io, .mem, and .cache. These can be formed into two groups: one is the .io part, and the other is the .cache/.mem part. Depending on where the error happens, it can occur in the host or on the device. So we have four points where protocol and link errors can happen: one for the .cache/.mem part, on host and device each, and one for the .io part, also on host and device each. And then there are the device errors themselves, or device-related errors, which report memory or thermal errors on the device. These are logged in the event log, and the mailbox is used to read them out of the device.
Based on this system design, we have four types of errors that can happen. The first two are poison and viral. These are basically integrated into the memory management of the CPU. Both are unrelated to CXL drivers; we don't need additional drivers, so they already have solutions, also from the handling point of view of the host. For poison, we tell the host that there is corrupted data within a data packet. The host then has some host-processor-specific behavior, which could be marking this memory bad, depending on what is happening. The only difference here is that we need to enable poisoning through the mailbox command mechanism, and then it is just enabled. Viral is similar: once an uncorrectable fatal error occurs on the device, for example, it can tell the host processor, and then there is also some host-processor-specific behavior, which basically means that once such an error occurs, the system mostly cannot continue to work. Typically, this leads to a system freeze or a system reset, depending on the implementation. Then there are the CXL memory errors. We have these two types, the protocol errors and the component errors; I will go into more detail in later slides. Basically, for protocol errors, we are using the PCIe subsystem, the AER error reporting. And for component errors, we have a mailbox that can be used to read the error log or event log.
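To make the mailbox command mechanism a bit more concrete, here is a minimal sketch of querying a device's poison list. The opcode and input payload follow the CXL 2.0 specification's Get Poison List command, but the `cxl_mbox_submit()` helper and the error handling are hypothetical simplifications, not an existing kernel or library API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mailbox submission helper (not a real kernel/libcxl API). */
int cxl_mbox_submit(uint16_t opcode, const void *in, size_t in_len,
                    void *out, size_t out_len);

#define CXL_MBOX_OP_GET_POISON_LIST 0x4300 /* CXL 2.0, media and poison mgmt */

struct get_poison_in {
	uint64_t dpa;    /* device physical address to start the query at */
	uint64_t length; /* length of the region to query */
} __attribute__((packed));

/* Ask the device which addresses in [dpa, dpa + length) are poisoned. */
static int query_poison(uint64_t dpa, uint64_t length, void *buf, size_t bufsz)
{
	struct get_poison_in in = { .dpa = dpa, .length = length };

	return cxl_mbox_submit(CXL_MBOX_OP_GET_POISON_LIST, &in, sizeof(in),
			       buf, bufsz);
}
```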
For protocol errors, there are these two differences between firmware first and OS first. In the firmware-first case, the firmware just consumes the error, and it may then signal the error that happened to the OS using existing ACPI methods. For this, the CPER error format was extended in the latest UEFI specification to also represent CXL protocol and device errors. These are then typically sent through a GHES interface to the OS. In the OS-first case, the OS must handle the AER events using the PCIe subsystem. If there is an internal error, then the OS also needs to examine the CXL RAS structures. And depending on where the error was detected, either the downstream port or the upstream port needs to be checked for errors.
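To make the OS-first flow more concrete, here is a minimal sketch of examining the CXL RAS capability structure of a port. The register offsets follow the CXL 2.0 RAS capability layout, but `cxl_check_ras()` and the surrounding plumbing are a hypothetical illustration, not the actual kernel implementation.

```c
#include <linux/io.h>
#include <linux/printk.h>

/* CXL 2.0 RAS capability structure, register offsets. */
#define CXL_RAS_UNCORRECTABLE_STATUS	0x00
#define CXL_RAS_CORRECTABLE_STATUS	0x0c

/*
 * On an AER internal error, look at the RAS capability of the port
 * (downstream or upstream) that signaled it. @ras is the mapped base
 * of that port's RAS capability structure.
 */
static void cxl_check_ras(void __iomem *ras)
{
	u32 ue = readl(ras + CXL_RAS_UNCORRECTABLE_STATUS);
	u32 ce = readl(ras + CXL_RAS_CORRECTABLE_STATUS);

	if (ue)
		pr_err("CXL uncorrectable protocol error: %#x\n", ue);
	if (ce)
		pr_warn("CXL correctable protocol error: %#x\n", ce);

	/* The status registers are RW1C: write ones back to clear. */
	writel(ue, ras + CXL_RAS_UNCORRECTABLE_STATUS);
	writel(ce, ras + CXL_RAS_CORRECTABLE_STATUS);
}
```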
For component errors, we need a CXL driver, which basically serves two purposes. One is to access the event log down here with some sort of event handler, and the other is to maintain or control the mailbox interface to read out these errors. Once the driver is initialized, it must first take control of the memory error reporting. This uses the ACPI _OSC method, which means that the OS must request that it wants to handle error reporting, and the firmware needs to grant this request. Then it can continue and basically look into the event log. If there is an event pending, the mailbox interface is used to fetch the errors from there. The event driver then parses the error records and reports them onward, and the operating system may take further actions on the error.
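As an illustration of the event-handler side, here is a minimal sketch of draining one event log through the mailbox. The opcode matches the CXL 2.0 Get Event Records command, but `cxl_mbox_submit()` and `parse_and_report()` are hypothetical placeholders rather than an existing driver API.

```c
#include <stddef.h>
#include <stdint.h>

#define CXL_MBOX_OP_GET_EVENT_RECORDS 0x0100 /* CXL 2.0, events */

/* Hypothetical helpers: mailbox submission and record parsing. */
int cxl_mbox_submit(uint16_t opcode, const void *in, size_t in_len,
                    void *out, size_t out_len);
int parse_and_report(const void *payload, size_t len); /* 1 if more pending */

/* Drain one event log (informational/warning/failure/fatal) until empty. */
static int drain_event_log(uint8_t log)
{
	uint8_t out[4096];
	int more;

	do {
		int ret = cxl_mbox_submit(CXL_MBOX_OP_GET_EVENT_RECORDS,
					  &log, sizeof(log), out, sizeof(out));
		if (ret)
			return ret;
		more = parse_and_report(out, sizeof(out));
	} while (more);

	return 0;
}
```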
Some notes on the differences between VH mode and restricted CXL mode. This is the software view of the VH case. Basically, the most interesting part is this one here in the middle; this is the part where no restricted CXL device is involved. We see here the root port, a switch in between with a downstream port and an upstream port, and this is then connected to the endpoint. All these components are visible in the PCIe hierarchy, and we can use PCIe access methods to access the structures of every component here, whether in the root port, the downstream or upstream ports, or the endpoint. This means there is basically a way to use the existing methods to access the PCIe structures that handle the AER errors. For restricted hosts, this is different: you see here one node with the restricted CXL devices. Once the device is in the system, a restricted CXL host is created, which forms a pair, the same as for this device, which was originally attached to this switch. So we always have a host-device pair available in case the device only supports restricted CXL mode. And on another node here, you see a PCIe link showing up, which means a PCIe device is connected there. So the same CXL node can also be used for a PCIe device.
Some more details on restricted CXL device (RCD) mode. This diagram shows how this is seen in the system. We have a restricted CXL host here with a downstream port, and here is the restricted CXL device, which also has an upstream port; it is not visible in this image, but it exists. In RCD mode, which is the former 1.1 mode, we don't see these downstream and upstream ports; they are not part of the PCIe hierarchy. The device shows up as a root complex integrated endpoint. That is why we also have an event collector here, an RCEC, which is now responsible for propagating the AER errors. To access the downstream and upstream ports, we need an additional root complex register block (RCRB), which resides somewhere in the memory-mapped ranges. This needs to be mapped separately to get access to the downstream and upstream ports. This register block basically shows up with a PCI type 0 header, similar to PCI, and there is also a link to the component register set. The component register set contains the CXL RAS capability structure to read out the CXL protocol errors. So this is the same as before, but it needs to be handled differently once an error is detected here in the RCEC.
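To sketch what the separate mapping could look like: per the CXL specification, the RCRB is a fixed-size memory-mapped block, 4 KB each for the downstream- and upstream-port halves. The helper below is a hypothetical sketch; a real driver would get the RCRB base from platform firmware (for example, the ACPI CEDT) and do proper error handling.

```c
#include <linux/io.h>
#include <linux/sizes.h>

#define RCRB_SIZE SZ_8K /* 4K downstream port + 4K upstream port */

/* Hypothetical sketch: map an RCRB and return its upstream-port half. */
static void __iomem *map_rcrb_usp(resource_size_t rcrb_base)
{
	void __iomem *rcrb = ioremap(rcrb_base, RCRB_SIZE);

	if (!rcrb)
		return NULL;

	/*
	 * Each half starts with a PCI type 0 header; BARs in it point to
	 * the component registers, which contain the RAS capability.
	 */
	return rcrb + SZ_4K; /* upstream port half */
}
```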
What is missing in the kernel? All of this. So RCD mode is missing; I sent a patch series on this two weeks ago. We also need a CPER extension for the CXL protocol errors; I think there was already a submission on the mailing list. And the AER handling of the PCIe subsystem needs to be modified in two ways. One is to make use of the CXL RAS capability structures in case an internal error was detected for a CXL device or host. And then we also need to add AER support for RCD-mode devices, which means we need to detect the messages coming from an RCEC for a CXL device, and we need to extract the downstream and upstream ports that are in separate memory ranges; for this, we need to extend the PCIe infrastructure in the kernel. What is also missing in the kernel is interrupt support; we need this as well. And there is also no event driver at the moment. It does not necessarily have to be a separate driver; we could also extend, for example, the CXL mem driver already in the kernel. And then the next question is how we can extend existing subsystems. We need to extend the kernel logs, and we might want to attach the error reports to tracepoints.
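For the tracepoint idea, here is a minimal sketch using the kernel's standard TRACE_EVENT machinery. The event name and fields are invented for illustration, and in a real driver this definition would live in a trace header with the usual TRACE_SYSTEM/define_trace boilerplate around it.

```c
#include <linux/tracepoint.h>

/* Hypothetical tracepoint for CXL component error records. */
TRACE_EVENT(cxl_event_record,
	TP_PROTO(int log, u8 severity, u64 dpa),
	TP_ARGS(log, severity, dpa),
	TP_STRUCT__entry(
		__field(int, log)
		__field(u8, severity)
		__field(u64, dpa)
	),
	TP_fast_assign(
		__entry->log = log;
		__entry->severity = severity;
		__entry->dpa = dpa;
	),
	TP_printk("log=%d severity=%u dpa=%#llx",
		  __entry->log, __entry->severity, __entry->dpa)
);
```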
Sorry, can I just ask a quick question? Just a clarification on the previous slide: for the CPER record, is there an _OSC to negotiate whether the OS has any idea what that is?
Is there an _OSC? As in, can the firmware query the OS to ask whether it's understood? Or does it need to-- well, the CPER records have changed. Because if we have an unaware OS, this should all still work--
Well, OK. I think it's just an extension of the bitmap of the kind of error that happens. So the _OSC is already extended to support this.
OK, cool.
Yeah, this handshake.
The next slide is about user space. Many things could be moved out to user space, which could include mailbox interaction, monitoring tools, and event handlers. There might also be a need for address translation, to translate between device addresses and physical system addresses, which might not be trivial. Fault analyzers and interleaving region setup could also be moved out. But the general question is: what is the kernel-user interface? How simple do we want the kernel drivers to be, and what could this look like? To some degree, though, the kernel should handle the errors as well and be aware of at least the fatal errors.
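To show why the address translation is not trivial, here is a simplified sketch of the math for a power-of-two interleaved region. Real decoders add XOR-based interleave math and multiple decoder levels, so treat this as an illustration of the principle only; the function and parameter names are made up.

```c
#include <stdint.h>

/*
 * Simplified sketch: translate a device physical address (DPA) back to
 * a host physical address (HPA) for one power-of-two interleaved region.
 * 'ways' devices share the region, rotating every 'granularity' bytes;
 * 'pos' is this device's position in the interleave set.
 */
static uint64_t dpa_to_hpa(uint64_t dpa, uint64_t region_base,
                           uint64_t granularity, unsigned int ways,
                           unsigned int pos)
{
	uint64_t chunk = dpa / granularity;  /* which chunk on this device */
	uint64_t offset = dpa % granularity; /* offset within the chunk */

	/* Chunks from all devices interleave round-robin in host memory. */
	return region_base + (chunk * ways + pos) * granularity + offset;
}
```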
To summarize all of this: with CXL we have a flexible way to add memory to the system, and we want, or need, the same look and feel for RAS as for native memory. But there are many components involved, with a variety of errors that can happen. Some of this is already available in the kernel, for example machine check exceptions and parts of the PCIe error reporting, which can be reused. But the CXL protocol and component error implementation is completely new, and we need new patches in the kernel for this. I hope we can also contribute a little in this area.
So that's the end of my presentation.
I would like to thank you, and I hope there's some room for discussion or some questions.
At least with regards to the mailbox interrupt stuff. This is less CXL and more about where we are with the kernel. I also want mailbox interrupts. And right now, the caveat is basically dynamically allocating vectors for the DOE. But I'm wondering: for the users that want interrupt support and don't have that problem, can we just have a basic MSI/MSI-X implementation and go from there?
OK, so the interrupt situation today is that ideally we'd like dynamic resizing of the MSIs allocated, so that various different bits of the driver come along, they all want an MSI, they can all request one, and we just keep sizing the thing up. Unfortunately, the kernel doesn't yet support that. There's been some discussion of it, but it hasn't happened. But what we can do is the equivalent of what the PCIe port stuff does for the switches, where you basically have a whole load of pre-registration things: little bits of code for all of the features that might be there that go snooping around in config space and various other places, figure out the largest MSI number, and then just allocate the thing. Yeah, so that's what we've got in, as you say, the DOE stuff before. And my personal feeling on that is we just take it with the first user. What we haven't yet done is accept any of the users, so it's been in several patch sets.
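A rough sketch of that pre-sizing approach, using the standard pci_alloc_irq_vectors() API: each feature probe reports the highest interrupt message number it would use, and the driver allocates once for the maximum. The feature-probing helpers here are hypothetical.

```c
#include <linux/minmax.h>
#include <linux/pci.h>

/*
 * Hypothetical probes: return the highest interrupt message number a
 * feature would use, or -1 if the feature is absent.
 */
int cxl_mbox_irq_msgnum(struct pci_dev *pdev);
int cxl_event_irq_msgnum(struct pci_dev *pdev);

static int cxl_alloc_irqs(struct pci_dev *pdev)
{
	int max_msgnum = max(cxl_mbox_irq_msgnum(pdev),
			     cxl_event_irq_msgnum(pdev));

	if (max_msgnum < 0)
		return 0; /* nothing wants an interrupt */

	/* Allocate enough vectors to cover the largest message number. */
	return pci_alloc_irq_vectors(pdev, 1, max_msgnum + 1,
				     PCI_IRQ_MSIX | PCI_IRQ_MSI);
}
```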
One simple solution to that crossed my mind, but it's obviously not usable: just allocate an obscene number of vectors. Yeah, that's why.
Don't do that.
OK, just one thing. I would like to see all the parsing and log handling in user space. People like me will not do log handling in-band for CXL; we do everything out-of-band, for various reasons. So anything that lives only inside the Linux kernel will be useless to me.
But obviously, you need stuff in the Linux kernel, and I'm fine with that. But all the parsing and handling of the log you get from the device is something I would wish were in user space, so it can be reused by other people.
If it's a memory error, you've got to handle it in the kernel.
You get the poison.
Yeah, you get the poison. I don't need anything else. That's fine. I won't be doing anything in-band. I actually-- technically, I cannot do anything in-band.
OK, so you do nothing preemptive if you get a memory error. So normally, the first thing you do if you get a memory error like that from a RAM scrubber-- what do you do for DDR RAM scrubbing?
So theoretically-- you have to think, in my case, it's bare metal. I don't know who is running the operating system, I don't know what the operating system knows, I don't know what it is. So it's like, I can't do anything; it's not my problem. But I need to be able to get the error log out-of-band.
So you don't even tell them that the memory is corrupt?
So they're going to get the poison. I assume, by default, if you have a basic operating system, you're going to get the poison. The poison is going to be seen by the CPU, the CPU is going to go to EDAC, EDAC is going to do page poisoning and so on, and you're going to get a machine check exception and so on. So it's going to be fine on their front. But on my side, I still want to know about the error.
So I understand what you're saying, and it makes sense to me. But that's an implementation choice. No, a customer use choice. A data center choice. A policy choice. Different data centers would have different strategies. But the CXL spec supports this. What happens is you have two mailboxes; the device is required to have two mailboxes. One is intended for firmware, so in this case it could be out-of-band, and the host uses the other mailbox. So the device could say: hey, send the correctable errors to one mailbox for the out-of-band side to consume, and the uncorrectable errors to the other mailbox for the kernel and OS.
Yeah, somebody could correct me. But it's just the primary mailbox that we currently have.
Yeah, there's two sets of mailboxes.
Yeah, unfortunately not. It's optional.
The secondary.
Secondary one's optional. There's no guarantee it's there. And to be honest, the use model around that is horrendous for any guarantees that someone isn't already using it. So it was put there for that purpose. It's not easy.
Also, the design is that the kernel driver will do the direct interaction with the device for the primary mailbox, nobody else. So once it gets an error, it can choose not to deal with it; then it stays silent, and the out-of-band side takes care of it.
OK, let's thank the speaker.