Hello, my name is Mohamad El-Batal. I'm from Seagate, office of the CTO, and with me is Hongjian Fan, a software architect and designer, who is also going to present the software layer here. I'll start with the composable memory appliance contribution that we just made within the CMS team. Based on that, we are in the process of creating the CMA, the Composable Memory Appliance, under the CMS working group. We are inviting people to join the CLA in order to work with us on this.
So, why memory pooling and a memory appliance? The bottom line is that stranded memory is a real problem, and so is the memory wall: compute is scaling beyond what the memory interface can keep up with in terms of bandwidth and capacity. The idea was to expand across multiple client nodes to extend the memory capability and provide pooled memory initially; potentially, with software coherency, you can actually do shared memory before hardware sharing is available as well. This solution targets about two NUMA hops: it's connected via cabling, etc. You'll get more details, but essentially the current memory within the server is a single NUMA hop, and the current contribution is two NUMA hops. We are currently at the FPGA stage, and when we go to ASIC technology we have line of sight toward higher bandwidth and lower latency.
The current architecture is basically a three-node solution, which can be either 19 inches or 21 inches, and the appliance can be configured in various ways. Currently, it maps to two nodes with full pooling or four nodes with partial pooling. We have eight blades, each equipped with eight of these memory controllers, which have multi-headed capability. The future solution is envisioned as Gen6 and will support up to 8 nodes with full pooling and 16 nodes with partial pooling, so that's a significant upgrade from the current architecture. Currently, the type of part that we showed at the Innovation Village is a Samsung 128-gigabyte DIMM per blade. Each box has sixty-four of them, and that basically gives us a terabyte of capacity with four hundred gigabytes per second of bandwidth and around 400 nanoseconds of latency. So, that's what we're able to accomplish at this point in a memory-pooled environment.
This is the actual booth. I added this picture because I knew the booth would be closed right now, so you can't see it. This is the actual appliance that we demonstrated, along with the CFM, the Composable Fabric Manager. The Composable Fabric Manager is a full memory-composition tool that provides a GUI and a CLI and deals with the various nuances of how you transfer memory capacity and bandwidth between nodes in a pool. The work that we demonstrated outside will be covered in more detail by Hongjian, who is going to show you how we're doing it, with which API, et cetera. The table on the right shows what we can actually do in terms of latency, because we're really not using a switch: we're using a multi-headed memory controller with a crossbar, all based on address decode, so we bypass the switch and avoid its latency adder.
As for the configuration, like I was indicating, currently each server has to have a retimer. In the demo we used Astera retimers, sixteen of them, Gen 5, mounted on AIC cards to connect in a mesh architecture.
This solution is capable of x8 connectivity as well. So, in an 8-blade system, we would have x8 connectivity between each retimer card, which is capable of bifurcation. If the nodes are capable of bifurcation, you can utilize all 8 blades in the appliance. If a node is not capable of bifurcation, it will use half the blades in the appliance and connect in partial form.
Within the Server group, we also chaired the PCIe Extended Connectivity group, which spawned the PCI-SIG effort around the optical working group. This effort has led to the development of optical PCIe, which is now fully formed; at least there's a line of sight toward productization. We've seen a couple of samples from a couple of vendors, and we would like to start working on extending short-reach optical up to 7 meters for CXL. That's what we believe is needed from a requirements standpoint for the OCP specification right now.
So, with that, I'd like to talk a little bit about the Composable Fabric Manager. It is based on standard APIs and standard connectivity to orchestration software. With that as background, I'd like to hand the discussion over to Hongjian, who will tell you more about it.
Hi, my name is Hongjian Fan. I'm in the research group at Seagate. For this Composable Memory Appliance, we have developed software that we call the Composable Fabric Manager. The graph shown here is the architecture. On the right is the appliance: each of the blades has its own BMC to manage the blade itself, with the CXL logic underneath it. On the left side is the management plane, where at the top there is a composer and the fabric manager, CFM, which stands for Composable Fabric Manager. We have implemented two interfaces: one is the CFM interface, which is based on the hardware logic, and the other is the Redfish interface. We also have a service client installed on each server, and those are all Redfish compliant. In the future, we're planning to add what's shown in blue here: a memory interface to Kubernetes. We're open to collaboration on that part.
The next slide is a closer look at the software design. On the left, we have the different APIs and the web UI interfaces. In the middle, we have the manager layer; and on the right, we have the Redfish client, which talks to the memory appliance and to the CXL host servers.
And this is an illustration of the management model. In order to map a CXL device into a Redfish resource, we had to decide how we want to present it. The DMTF Redfish specifications are what we need to adhere to: we abstract our device into a multi-headed, multi-logical-device model, but with static binding. What we based this on is already open in the spec; we had a choice to make about how to implement it, and we chose to use the device model. However, if Redfish were to introduce an abstraction for DCD, the dynamic capacity device model, in the future, we might move to that as well.
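To make the mapping a little more concrete, here is a rough sketch, in Python, of how a single blade might look once abstracted as a multi-headed device with statically bound logical devices. The URIs, property names, and capacity values are assumptions for illustration only; the exact schema is defined by the Redfish specifications and the open-source CFM, not by this sketch.

```python
# Illustrative only: a Redfish-style view of one blade as a multi-headed device
# whose logical devices are statically bound to host ports. All names and values
# here are assumptions, not the exact schema exposed by the CFM.
blade_view = {
    "@odata.id": "/redfish/v1/Chassis/Appliance/PCIeDevices/Blade0",  # hypothetical URI
    "Id": "Blade0",
    "Description": "Multi-headed CXL memory device, static binding",
    "LogicalDevices": [
        # one logical device per head; each is pre-bound to a single host port
        {"Id": "LD0", "CapacityGiB": 512, "BoundPort": "Port0"},
        {"Id": "LD1", "CapacityGiB": 512, "BoundPort": "Port1"},
    ],
}
```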
And this is the CFM model, which, as we said, is based on the hardware topology. Basically, we have an appliance root at the bottom, which has blades underneath it. Each blade supports two ports that connect to two host servers. Then we have the memory path in the middle, which represents the composed memory. And on the right-hand side, we have the host path, where each port is a connected CXL device.
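As a rough sketch of that hierarchy, the structures below express the appliance-root, blade, and port relationships just described in Python. The class and field names are my own illustration of the topology, not the CFM's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Conceptual sketch of the CFM topology described above: an appliance root with
# blades underneath, each blade exposing two ports toward host servers, plus a
# simple view of composed memory. Names are illustrative assumptions.

@dataclass
class BladePort:
    port_id: str
    connected_host: Optional[str] = None  # host server on the other end, if any

@dataclass
class Blade:
    blade_id: str
    ports: List[BladePort] = field(default_factory=list)          # two per blade in the current design
    composed_regions_gib: List[int] = field(default_factory=list)  # memory currently composed out

@dataclass
class ApplianceRoot:
    appliance_id: str
    blades: List[Blade] = field(default_factory=list)

# Example: one appliance with a single blade wired to two host servers.
appliance = ApplianceRoot(
    appliance_id="memory-appliance-0",
    blades=[Blade("blade0", ports=[BladePort("port0", "host1"),
                                   BladePort("port1", "host2")])],
)
```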
And here is the Redfish model that we have implemented. We have the chassis, which represents the CXL host servers, and the fabric, which represents the memory appliance. The system collection is a more abstracted way of representing the memory appliance, while the chassis collection is used to represent the CXL host servers. In the middle, we have the fabric collections, which are used for composing and decomposing memory. One thing to note is that when we compose memory, we need to POST the endpoint that connects the CXL memory appliance and the CXL host server.
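As a minimal sketch of that compose step, the snippet below POSTs a new connection between an appliance-side endpoint and a host-side endpoint through a Redfish-style fabric. The base URL, fabric name, endpoint names, credentials, and payload fields are all assumptions for illustration; the real resources and schema come from the CFM's Redfish implementation.

```python
import requests

# Hypothetical CFM service address and fabric resource; adjust to the real deployment.
CFM_BASE = "https://cfm.example.com"
FABRIC = "/redfish/v1/Fabrics/CXL"

# Request body linking one endpoint on the memory appliance to one endpoint on the
# CXL host server. Field names are illustrative, not the exact CFM schema.
payload = {
    "Name": "host1-blade0",
    "Links": {
        "Endpoints": [
            {"@odata.id": f"{FABRIC}/Endpoints/Blade0-Port0"},
            {"@odata.id": f"{FABRIC}/Endpoints/Host1"},
        ]
    },
}

resp = requests.post(
    f"{CFM_BASE}{FABRIC}/Connections",
    json=payload,
    auth=("admin", "password"),  # placeholder credentials
    verify=False,                # demo only; verify TLS certificates in real use
)
resp.raise_for_status()
print("Composed memory connection:", resp.json().get("@odata.id"))
```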
So, the CFM API and the Redfish API are completely equivalent: they serve the same purpose, but with slightly different syntax. To illustrate this, we've put together an example list of the most frequently used calls, showing how to use both sets of APIs.
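To make the equivalence concrete, here is a small sketch that issues the same kind of "list the hardware" query through a CFM-native path and through a Redfish path. Both URLs are assumptions for illustration; the real routes are defined by the open-source CFM and its Redfish schema.

```python
import requests

BASE = "https://cfm.example.com"  # hypothetical CFM service address
AUTH = ("admin", "password")      # placeholder credentials

# CFM-native style: resources follow the hardware topology (appliance -> blades).
cfm_resp = requests.get(f"{BASE}/cfm/v1/appliances/appliance0/blades",
                        auth=AUTH, verify=False)

# Redfish style: similar information reached through the standard chassis model.
redfish_resp = requests.get(f"{BASE}/redfish/v1/Chassis",
                            auth=AUTH, verify=False)

print("CFM API status:", cfm_resp.status_code)
print("Redfish API status:", redfish_resp.status_code)
```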
Besides those APIs, which are easy for machines to use in automation and orchestration software, we also provide tools with nice user interfaces for human interaction. This slide shows the web user interface, which allows us to see the details of each blade or CXL host server: how much memory has been used, which devices it is located on, and how many devices have been connected. And, of course, we also have controls for composing and decomposing memory here.
And this one shows the CLI tool, which we've provided to simplify command-line use.
So, call to action: we've made this contribution, and the CFM is open source. We would like to have you join us in both the CFM subgroup and the CMA subproject. We'd like you to look at this as a start, as a 0.5 spec, even though the base specification we released is labeled 1.0, because we believe it works. But we want to make sure the team, with its collaborative OCP spirit, uses it, improves it, and makes it something that can be consumed by OCP customers. So, again, we want to make sure that, as a group, we can actually expand on this. We believe in the power of collaboration and community input. I'll leave some time for questions now. All right, if no one has any questions, thank you very much. Thank you.