All right. Hello, everyone. This is Matthew. I'm happy to share our great open collaboration, thanks to Grant, Vikrant, and Ready. So what is this? It spans each layer of the full system stack, including Uber's applications and usage, Jackrabbit Labs' cloud orchestration software, and SK Hynix's CXL-based memory solution, considering Intel's recent CPU architecture. All of our members play a critical role in proving the feasibility and value in their professional sectors. So, you know, here at OCP is the right place to bring these technologies into a real system. Today, we are introducing a composable memory architecture with Kubernetes fabric-attached memory orchestration.
Today's agenda promises an exciting blend of ground-breaking solutions and insightful messages about composable memory systems, especially focusing on multi-host CXL memory solutions. Based on the OCP CMS logical and physical architecture, we implemented a feasible hardware and software solution for memory pooling through this collaboration, which we are showing now.
An open-source, software-based composable memory systems environment paves the way for straightforward, simple-to-use, and efficient data movement over CXL fabrics. With this type of infrastructure, by leveraging CXL 2.0 and 3.x beyond CXL 1.1, memory pooling, which mitigates stranded memory, becomes practical. Multiple hosts can dynamically allocate and de-allocate their portions of CXL memory according to each node's memory usage by means of the CXL specification's Dynamic Capacity Device (DCD) feature.
Actually, this work has already been done in the OCP CMS work stream, and this is captured in the OCP CMS white papers. CXL is a well-recognized interconnect for composable memory architecture, with current and future expected usages. At a higher level, the host, with its firmware, operating system, and virtual machine manager, essentially consumes local DRAM and remote CXL direct-attached and multi-headed memory. Last but not least, the data center memory fabric manager stitches all of these pieces together and works with the orchestrator for provisioning and de-provisioning the memory. You can read more about the logical architecture in the OCP CMS white papers.
CXL 2.0 allows device memory to be allocated to multiple hosts through a CXL switch. For better connectivity, CXL 3.x allows direct peer-to-peer device memory access through multi-level CXL switches, which removes a bottleneck because traffic no longer has to go through the host. Another piece is the data center memory fabric manager, also known as the CMS platform orchestrator, which focuses on supporting the existing data-center-scale resource scheduler via fabric manager APIs and the CCI for handling composable memory operations.
Well, what is the idea behind this collaboration, and what is its impact? To get to the answer, we can think through the reasoning via the six principles of 5W1H. In Uber's data center usage, stranded memory is a real pain point in a Kubernetes environment. So, as one solution to this pain point, we built a composable memory system with SK Hynix's FPGA-based real CXL pooled memory prototype and Jackrabbit Labs' cloud orchestration software for memory pooling under Uber's data center usage.
This is a high-level system diagram composed of multi-host servers as Kubernetes workers, an orchestration server as the Kubernetes master, and the CXL pooled memory solution. We have dubbed the CXL pooled memory prototype "Niagara." Niagara can connect to a maximum of eight hosts and can support up to one terabyte of capacity using four channels. To support CXL-spec-compatible DCD functionality, Niagara consists of two parts: a pooled memory manager and a pooled memory controller. In response to requests from the separate orchestrator, the pooled memory manager of Niagara sends DCD commands to the pooled memory controller, and the pooled memory controller then plays the important role of supporting DCD functionality to allocate and de-allocate memory blocks; a rough sketch of that request flow follows below. While we use an FPGA-based CXL pooled memory prototype to support the DCD functionality, we implemented the software stack interfaces for the host, the CMS platform orchestrator, and the pooled memory manager and controller. I will hand it over to Grant, who will talk about the mapping of this architecture to Kubernetes with cloud orchestration software and an FPGA-based CXL pooled memory prototype. It's your turn, Grant.
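As a rough, hypothetical sketch of that orchestrator-to-pooled-memory-manager exchange (the operation and field names below are illustrative assumptions, not the actual fabric manager API), an allocation request might look something like this in YAML:

```yaml
# Hypothetical capacity-allocation request from the CMS platform orchestrator
# to the Niagara pooled memory manager; all names here are assumptions.
operation: allocate-dcd-capacity
targetHost: worker-node-1      # one of up to eight attached hosts
capacityGiB: 64                # carved out of the shared 1 TB, four-channel pool
# On receipt, the pooled memory manager would issue DCD add-capacity commands
# to the pooled memory controller, which exposes the new blocks to that host;
# a matching release-dcd-capacity request would trigger de-allocation.
```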
Yeah, so thank you. And I want to thank Uber for their help in talking about how Kubernetes would like to interface with a composable memory architecture, that is, what's a good way to expose this type of platform, and SK Hynix for providing the open innovation lab and access to the Niagara, so we can show a real, functional demo of how these technologies can be useful for application end users.
So, Vikrant made a comment about how his exposure here is proportional to the number of years he has been doing this. Last year, he had a presentation here in this forum, and he said, "Wouldn't it be cool if Kubernetes could talk to the DCMFM and schedule pods on CMS without having to worry about how the underlying memory technology works?"
And so we did that. This is in the innovation village now; we've been running it since Tuesday. It's essentially bone-stock Kubernetes, vanilla Kubernetes running on a vanilla Linux kernel, running workloads on a CXL expander, with no modifications to anything off the shelf. The Niagara does have DCD functionality, and there are some special kernel bits needed to make that functional, because that hardware support isn't generally available yet. The demo itself is stock; we're not using those mechanisms, and there's no reason why we couldn't, but we are using just standard Linux kernel stuff. If your kernel is 6.3 or newer, you're good to go. The implementation bits are pretty straightforward. We have a few things running on the hosts themselves: a couple of orchestration daemons that just manage the memory that's visible to the host, and a control point for those daemons running on the different worker nodes. The rest of it is just deployed into Kubernetes. There is a monitor pod that has a Kubernetes lifecycle, so it's robust and it restarts, and it serves as the interface between the daemons running on the bare-metal hardware and the Kubernetes scheduler itself. And then your applications, your pods, run unmodified. Init containers are a very familiar thing in Kubernetes; they're the setup you need for your container applications to run. So you just add a few things that say, "Hey, I need remote memory, I need CMS memory," and then it just gets scheduled. The application doesn't know how it's being scheduled, and it doesn't care, because it shouldn't have to. You provide a couple of YAML files, and everything is good to go; a rough sketch of what that looks like is below. Like I said, come check it out, it's live. Bring a workload; we'll run it. It's awesome. All right.
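As a minimal sketch of the pod-side change just described (the annotation, init container, and image names are illustrative assumptions, not the project's actual interface), the deployment addition might look roughly like this:

```yaml
# Illustrative sketch only: the annotation and init container below stand in
# for the real claim mechanism, whose exact names are not shown in the talk.
apiVersion: v1
kind: Pod
metadata:
  name: workload-with-cms-memory
  annotations:
    cms.example.org/memory-claim: "64Gi"    # hypothetical request for pooled CXL memory
spec:
  initContainers:
    - name: cms-memory-init                 # hypothetical init container that asks the
      image: example.org/cms-init:latest    # orchestrator to satisfy the claim
  containers:
    - name: app
      image: example.org/my-workload:latest # the application container runs unmodified
```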
So, this is the larger overview, the setup that we're running right now. We've got two Emerald Rapids platforms and the Niagara system running, connected through a Gen 4 by 8 link. The cluster orchestration mechanism is Kubernetes, and then our bits run in between the application and the EMR. And that's pretty much it. So, we've kind of covered this, right? All right.
So, I was going to subject you to a video, but then I decided I would subject you to this instead. We'll run through it; it's a lot. But essentially, you deploy a familiar-looking YAML file, and what this does is pull our orchestrator container, load it in, and create a namespace in Kubernetes. You add a little bit to your pod deployment for the init to ask for the claim. And then in the highlighted bit at the top, you'll see it's sitting at zero memory for the NUMA node.
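The first YAML applied in that flow might look roughly like the following sketch (the namespace, labels, and image are assumptions for illustration, not the demo's actual manifest):

```yaml
# Illustrative sketch of the orchestrator deployment applied at the start of
# the demo; the namespace, labels, and image name are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cms-orchestrator
  namespace: cms-system              # hypothetical namespace created for the demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cms-orchestrator
  template:
    metadata:
      labels:
        app: cms-orchestrator
    spec:
      containers:
        - name: orchestrator
          image: example.org/cms-orchestrator:latest  # pulled as a container image
```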
You deploy it like you always would, and the memory shows up. You can check the status of the pods; everything's running, everything's deployed, and you just have memory on your server now. Easy peasy. Everyone who's used Kubernetes can do this.
And then, you can check the status of all your claims from the bare metal side, from the orchestrator, to make sure that everything's connected, everything's running, everything's happy. You can see that the orchestrator understands the pods and the lifecycle of the pods. So, it can garbage collect these memory resources when the pods exit or get terminated.
Last thing for this is that we did run actual workloads on it, not just some test stuff. It's a Gen 4 by 8 interface to DDR4 memory. We ran two benchmarks that Uber found interesting for this work. One was GoBench, from Cloudflare, and the other was a Java benchmark, which is a higher-bandwidth transactional processing workload that kind of mimics an e-commerce store. GoBench ran at parity: we ran GoBench on the DDR5 memory, and then we ran GoBench pinned to the Niagara, and because the Go kernel is so small, it was basically cached and running in the CPU. Even on the Niagara, all the GoBench benchmarks, like the compression, the decompression, the regex stuff, and the HTML string parsing, ran pretty much the same. The Java benchmark was constrained by the interface, but you can do simple spherical-cow math and say that if it weren't constrained by the interface, remote memory would have been suitable for these workloads. So, we're pretty happy with where we are on this.
All right. So, as Grant and Matthew mentioned, we started this last year. We basically said, "Let's put together a solution that essentially works in Kubernetes." We didn't have many of the moving parts: we didn't have a CXL memory buffer, the silicon wasn't ready, the server wasn't ready, and a working software stack wasn't ready. Here we are, with deep collaboration among four different companies, plus lots of others in OCP. This kind of highlights, you know, the openness and the innovation that can actually bring us together. So, we have a working Kubernetes end-to-end solution that does not only the control plane but also the data plane, and you are able to actually launch workloads. This whole thing, as far as Kubernetes is concerned, is completely transparent; that's what we aimed for. So, our goal is to build upon this success to drive more and more memory pooling use cases, and we also want to focus on high availability as well.
So, I strongly urge you to participate in the development of what the team has done already, within the context of Kubernetes: contributions to the Kubernetes community, as well as building out the memory orchestration layer, the specs, the white paper implementations, and the proof-point collateral. In everything you can think of, all the way from development to production implementation, we strongly request that you join. Thank you.