Hello everyone. Thanks for taking the time to attend this session. My name is Sayanta Pattanayak and I am from ARM.
The objective is to have a memory expansion device. When we have a memory expansion device, the kernel expects a memory-only NUMA node, which is described in the System Resource Affinity Table, and that is where the SRAT ACPI structure comes into the picture; that is what we have prepared for the CXL-aware kernel. In addition, if there are heterogeneous properties like memory latency and bandwidth attributes to be shared with the kernel, the HMAT table is prepared and shared as well. That is from the memory point of view. From the point of view of the CXL root device, a CXL-aware kernel expects the root device to be present in the ACPI namespace, both for the integration of downstream CXL devices and for understanding the properties of interleaving across host bridges. So an ACPI object with the HID ACPI0017 is needed; it indicates to the kernel that a CXL root device is present, and it kicks off the probe routine and subsequently the enumeration process, discovering the devices in the downstream topology. It also indicates the presence of the CEDT, even though the kernel is not completely dependent on that. And for each host bridge present in the system there should be a unique ACPI object with HID ACPI0016, which is also being prepared. It has a number of methods associated with it: some of the methods for discovering and configuring devices are the same as for PCIe host bridges, and some, like _CBR, which points to the root complex register block for configuring the HDM decoders, would also be needed. But in the current work only host bridges present at boot time are considered, so those methods are not implemented.
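To give a rough idea of the memory-only NUMA node mentioned above, here is a minimal C sketch of how a SRAT Memory Affinity entry for the CXL window might be filled in. The field layout follows the ACPI specification's Memory Affinity Structure; the type and field names here are illustrative rather than the actual EDK2 definitions, and the flag choice is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* ACPI SRAT Memory Affinity Structure (Type 1), per the ACPI spec.
 * Packed layout; names are illustrative, not the EDK2 definitions. */
#pragma pack(1)
typedef struct {
  uint8_t  Type;              /* 1 = Memory Affinity */
  uint8_t  Length;            /* 40 bytes */
  uint32_t ProximityDomain;   /* NUMA node this range belongs to */
  uint16_t Reserved1;
  uint32_t BaseAddressLow;
  uint32_t BaseAddressHigh;
  uint32_t LengthLow;
  uint32_t LengthHigh;
  uint32_t Reserved2;
  uint32_t Flags;             /* bit0: Enabled, bit1: Hot Pluggable, bit2: NonVolatile */
  uint64_t Reserved3;
} SRAT_MEMORY_AFFINITY;
#pragma pack()

/* Build a memory affinity entry for the CXL fixed memory window.
 * The proximity domain is one that no CPU affinity entry references,
 * so the kernel sees it as a CPU-less, memory-only node. */
static void BuildCxlMemAffinity(SRAT_MEMORY_AFFINITY *Entry,
                                uint32_t ProximityDomain,
                                uint64_t Base, uint64_t Size)
{
  memset(Entry, 0, sizeof(*Entry));
  Entry->Type            = 1;
  Entry->Length          = sizeof(*Entry);       /* 40 */
  Entry->ProximityDomain = ProximityDomain;
  Entry->BaseAddressLow  = (uint32_t)Base;
  Entry->BaseAddressHigh = (uint32_t)(Base >> 32);
  Entry->LengthLow       = (uint32_t)Size;
  Entry->LengthHigh      = (uint32_t)(Size >> 32);
  Entry->Flags           = 0x1;   /* Enabled; hot-pluggable could also be set,
                                     depending on the platform's policy */
}
```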
Then there is the CXL Early Discovery Table, which gives the kernel the pointers it needs for configuring the host bridges' HDM decoders, and also describes the memory windows available in the host, through its structures. The CEDT CHBS structure shares the pointer to the host bridge registers, where any HDM decoder configuration required by the topology can be done. For each host bridge there should be an associated CHBS, and the _UID object under the host bridge ACPI object should match the UID of that CHBS structure; that is how they are related and how the topology is picked up. And the fixed memory window structure (CFMWS) defines the memory ranges available in the host system that can be mapped for accessing the remote device memory. If there are multiple targets for a host memory range, that can also be described in the structure, along with whatever interleaving is configured. All those properties and details can be put into the structure, and the kernel picks up the details and does the necessary configuration. So these are the tables prepared by the firmware in EDK2 and shared with the kernel, for the kernel's CXL framework to consume, build the right topology, and do the right configuration.
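To make the CHBS/CFMWS relationship a bit more concrete, here is a simplified C view of the two CEDT record types. The layout reflects my reading of the CXL 2.0 / ACPI CEDT definitions and the names are illustrative, not the actual firmware code; the key point is that CHBS.Uid must match the host bridge's _UID, and the CFMWS target list refers to host bridge UIDs, one per interleave way.

```c
#include <stdint.h>

#pragma pack(1)
/* CEDT CXL Host Bridge Structure (record type 0) -- simplified. */
typedef struct {
  uint8_t  Type;          /* 0 = CHBS */
  uint8_t  Reserved0;
  uint16_t RecordLength;
  uint32_t Uid;           /* must match the _UID of the ACPI0016 host bridge object */
  uint32_t CxlVersion;    /* e.g. CXL 2.0 host bridge */
  uint32_t Reserved1;
  uint64_t Base;          /* component register block, where the HDM decoders live */
  uint64_t Length;
} CEDT_CHBS;

/* CEDT CXL Fixed Memory Window Structure (record type 1) -- simplified. */
typedef struct {
  uint8_t  Type;                  /* 1 = CFMWS */
  uint8_t  Reserved0;
  uint16_t RecordLength;
  uint32_t Reserved1;
  uint64_t BaseHpa;               /* host physical base of the window */
  uint64_t WindowSize;
  uint8_t  InterleaveWays;        /* encoded number of interleave ways */
  uint8_t  InterleaveArithmetic;  /* e.g. modulo */
  uint16_t Reserved2;
  uint32_t Granularity;           /* host bridge interleave granularity */
  uint16_t Restrictions;          /* volatile vs. persistent, Type 3 only, etc. */
  uint16_t QtgId;
  uint32_t TargetList[];          /* host bridge UIDs, one per interleave way */
} CEDT_CFMWS;
#pragma pack()
```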
Coming to the hardware platform: the Fixed Virtual Platform (FVP) is the test platform for this work. A little background about the FVP: it is a complete simulation of an ARM system. It includes the processors, the interconnect, memory, and some of the peripherals, and it gives a lot of flexibility in customizing the model. For the current work the RD-N2 FVP model was chosen, and CXL support is being added into it. That support is not yet complete; it is still evolving. Some of the features that have been added are the DVSEC registers, the mailbox, and the CDAT and DOE capabilities; the HDM interleaving logic is not fully complete yet, but that is an ongoing task. In a sense, this FVP model gives a lot of flexibility in designing and customizing according to need. And why the FVP? In the absence of real hardware, it allows the software community and developers to prepare the framework and verify functionality. For functional verification it is a well suited model, and it is fast. If there is a need for timing accuracy or performance tuning, then this platform is not suitable, but for our CXL firmware development it is well suited, and we can keep adding CXL features and keep evolving both the FVP and the software framework. It also allows the user to debug: it comes with a lot of libraries and tools, so there are many ways to observe what kind of data transactions are happening and things like that.
This is a very high level view of the end-to-end setup, just to show what kind of hardware we are using and how the communication happens. We have the SCP block, the System Control Processor; it has a Cortex-M7 core in it and it controls the whole system's power, reset, and clock domains, so in a sense it is a kind of master. There is the application processor block, which has Armv9 Neoverse cores. The CMN-700 is the main interconnect in this design, with different nodes on it connecting all the blocks around it. Then there is the I/O virtualization block, which has multiple lanes under it; multiple root ports can be connected to it, and devices can be connected to its downstream ports. That is what is being utilized: in this development work a CXL capable root device has been plugged in under one of these I/O virtualization blocks, and a CXL Type 3 device with the properties highlighted here has been placed under that root device. That is how the CXL.io communication happens through this block. For the remote memory range, a portion of system memory has been chosen and is treated as remote memory, and that range is configured in the CCG port. The CCG port is the CMN-700 port for doing CXL.mem transactions; it is CXL 2.0 compatible, and that is what is configured for performing any CXL.mem transactions. The HDM memory highlighted here is the remote memory, treated as CXL memory, and the DRAM is the local memory. That is how we are trying to demonstrate memory expansion capability using the CXL framework. One thing to mention is that the FVP allows, just by modifying some script logic, multiple CXL devices to be placed under this root port, so that sort of topology change can be done on the fly.
Coming to the firmware work: we start with the System Control Processor, where the actual boot process starts. The first thing done for this work is the interconnect configuration, so that PCIe enumeration can happen and a CXL device, with the CXL extended capabilities and the DOE capability, can be found. If one is found, the CDAT tables are read out; basically it tries to find out what kind of memory ranges the device supports. Once the memory range is found, the interconnect needs to be configured so CXL.mem accesses to that range can work. That is the sole purpose, and it is done before the AP boots; that is why the enumeration, the DOE operations, understanding the supported ranges, and configuring the interconnect accordingly are done in the System Control Processor firmware. Subsequently, when the boot phase reaches EDK2, during EDK2's PCIe enumeration phase it also invokes the CXL DXE, which is newly introduced. The CXL DXE routines again look for a PCIe device with the CXL extended capability and DOE, and if one is found, the CDAT structures such as DSMAS and DSEMTS are fetched to understand what kind of memory ranges and which EFI memory types the device supports. Those details are captured and kept in a local data structure for the platform driver to use later. When execution reaches the platform DXE and it prepares the SRAT, HMAT, and related tables, it invokes the CXL DXE protocol interfaces, fetches those details, and prepares all the necessary ACPI structures; that is how the ACPI structures are prepared. The structures for the CEDT and the CXL root device are also prepared and shared with the kernel in the next phase of the boot. When the kernel comes up, it picks up all those ACPI tables; the kernel already has a framework to understand those tables, pick up the details, do the configuration, and build the right topology. So in the kernel, for now, we are just utilizing the existing, well established CXL framework and validating whether the firmware prepares the tables properly or not. The kernel not only helped us in validating this firmware work, it also helped us understand how exactly it should be done in the firmware, what kind of data should be populated in the ACPI tables, and what the right topology is to describe there. In that way it was good for us to understand the framework and utilize it. For memory nodes it just uses the NUMA framework. The EDK2 work has been published recently as an RFC and the links are shared here; please have a look when you get time and share your feedback, and we would be happy to address it and rework accordingly.
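As a rough sketch of the hand-off described above, this is how a protocol between the CXL DXE and the platform DXE might look in EDK2-style C. The protocol name, GUID, and fields are hypothetical, invented purely for illustration; the actual RFC patches may structure this quite differently.

```c
// Hypothetical protocol published by the CXL DXE so the platform ACPI driver
// can fetch what was learned from the device's CDAT (DSMAS/DSEMTS) over DOE.
// All names and the GUID below are illustrative, not from the actual patches.

#include <Uefi.h>

#define CXL_PLATFORM_INFO_PROTOCOL_GUID \
  { 0x11111111, 0x2222, 0x3333, { 0x44, 0x44, 0x55, 0x55, 0x66, 0x66, 0x77, 0x77 } }

typedef struct {
  UINT64  DpaBase;           // device physical address base, from DSMAS
  UINT64  Length;            // range length, from DSMAS
  UINT32  EfiMemoryType;     // EFI memory type/attribute hinted by DSEMTS
  UINT32  ReadLatencyPs;     // latency figure feeding the HMAT entries
  UINT32  ReadBandwidthMBs;  // bandwidth figure feeding the HMAT entries
} CXL_MEM_RANGE_INFO;

typedef struct _CXL_PLATFORM_INFO_PROTOCOL CXL_PLATFORM_INFO_PROTOCOL;

typedef
EFI_STATUS
(EFIAPI *CXL_GET_MEM_RANGES)(
  IN     CXL_PLATFORM_INFO_PROTOCOL  *This,
  IN OUT UINTN                       *Count,   // in: array capacity; out: ranges found
  OUT    CXL_MEM_RANGE_INFO          *Ranges
  );

struct _CXL_PLATFORM_INFO_PROTOCOL {
  // Called by the platform DXE while it builds the SRAT, HMAT, and CEDT.
  CXL_GET_MEM_RANGES  GetMemRanges;
};
```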
And this is just a view of the kind of topology currently in place. We have a single CXL capable endpoint, a single root port, and a single host bridge (there are multiple endpoints in the system, but only one is CXL capable). The FVP model we are using right now does not have the interleaving logic complete, so we are just using the framework to populate the data structures properly: when we publish the fixed memory window structures, it is a one-way decoder configuration, in other words a non-interleaved configuration. But the FVP has the flexibility to add another endpoint to this topology just by modifying the script at runtime, and the table configurations can be modified accordingly. We are trying to make the interleaving logic work here so that it adds more value to the whole solution.
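For context on what "one-way" means here: a fixed memory window with N interleave ways spreads consecutive granules of host physical address space across N host bridge targets, and with one way everything goes to the single target. A minimal sketch of modulo-based target selection, assuming modulo interleave arithmetic as described in the CFMWS, might look like this:

```c
#include <stdint.h>

/* Pick the interleave target index for a host physical address inside a
 * CXL fixed memory window, assuming modulo interleave arithmetic. With
 * ways == 1 (our current FVP setup) this always returns 0, i.e. the
 * single host bridge target. */
static unsigned int
cfmws_target_index(uint64_t hpa, uint64_t window_base,
                   unsigned int ways, unsigned int granularity_bytes)
{
    uint64_t offset = hpa - window_base;
    return (unsigned int)((offset / granularity_bytes) % ways);
}
```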
On the status: what is completed and what we are aiming for next. Currently, the CXL device enumeration and all the ACPI table preparation are done, to the extent the FVP currently supports, but this will keep evolving; more content and more intricacies will be added to these tables and data configurations. On the future side, the first thing we would like to have is the complete interleaving capability in the FVP, and the software would also be enhanced to demonstrate that capability. And of course we will continue engaging with the developer community and contributing in whatever way possible, in both the firmware and the kernel. From the architectural point of view, memory pooling is one of the topics in the next architecture revision that is being investigated; we will work on it and come back to it when there is something more substantial in the picture. Another thing we definitely want to cover is SBBR, the Server Base Boot Requirements. Many of the points in SBBR are already covered by this development, like preparing the ACPI tables and considering the memory types and attributes for this memory expansion use case, but as we add more features there will be more points to cover. We would like to have as much of that coverage as we can, to make the solution more standard and acceptable to the community. And from the kernel point of view, we will continue using the latest kernel to validate the firmware, and if there is any contribution we can make from our side, we will try to do that.
And on the reference side, these are a couple of references we are sharing. The first is the link where the Neoverse reference design solution is published; it gives a sense of what kind of support is there, with some details, which may help developers. And there is the FVP download link; the FVP is downloadable by everyone, so anyone can download it and try whatever experiments they want to do. That would be all from my side for now. Thanks for attending. Are there any questions?
Yeah, one question. You have mentioned that the CXL memory is treated as a NUMA node, and I saw that in the previous slides. And then in the future work you said you will deal with memory pooling. So how will we deal with the CXL memory in that case? Because I don't think it belongs in a NUMA node.
Right, I get your point. That is a topic which is still under a bit of debate, and I don't think there is a very conclusive answer to it yet; that is why it is still under investigation. Yes, of course, treating pooled memory as a NUMA node may not be efficient, and that may not be the right way to use it. How that far memory, or near memory, will be managed is still being discussed. So I don't have a clear answer to that yet; maybe in the future we will be in a better position to answer it.
So now, for example, in our system we have high bandwidth memory, which is very fast DRAM, and we have CXL memory, and we put them in the same memory address range. That means some address ranges will be slower than others. Is there any proposal or patch related to this? Because when we introduce CXL memory into the kernel, we must deal with it. It looks like a cache hierarchy, but it is not a cache hierarchy.
Sorry, can you repeat the last part of your question?
My question is, when we introduce CXL memory into the kernel, we will have address ranges with different memory speeds. Some address ranges will be faster, such as the high bandwidth memory, and some will belong to the CXL memory. It looks like a cache hierarchy, I mean, different speeds, but they are all part of the same memory address range.
Yes, yes, that is true. When you have CXL memory, it may have different properties and attributes, and it may have different latency and bandwidth characteristics. That is what we try to capture in some of the ACPI structures, in the HMAT and those nodes. But that is not enough; there need to be algorithms or mechanisms for efficiently managing this and for distinguishing the remote memory from the local memory. That is a bit out of scope for our work right now. Maybe in the near future, when we go into further testing and things like that, we will have a clearer answer to that.
I was going to ask a naive question about whether this code is going into the EDK2 mainstream and whether QEMU can take advantage of it, but we can take that offline. I think we are going to talk later about memory pooling and related topics, right? So we will circle back to some of these questions in later sessions.
So let's thank the speaker.
Thank you.