Welcome to our full-day session for Composable Memory Systems. We have a full agenda, and thanks for being the early risers and strong supporters of memory systems. What we want to do is go over everything Composable Memory Systems is doing and what the composition of the work streams is, and then make sure we talk about what our deliverables are. Composable Memory Systems is a sub-project within the Server project, and we focus primarily on memory configurations for general-purpose compute as well as for AI systems.
So, we'll give a little bit of an overview of where we started, where we are right now, and then work through what deliverables we have for this year, with quick updates as we go through the day. We have a full-day agenda going up to 4 o'clock, and it's quite a tight schedule, so Reddy and I will be sitting here and will try to manage the time as much as we can. There's a timer here for the speakers, and we'll give updates on how much time is left.
So, a CMS overview. For those of you who have been here for the last two years, perhaps this is not new. The whole idea was, as we start looking at the memory challenges in front of us, to make sure we have solutions on the table. How do we compose memory? How do we configure multi-tiered solutions for different applications? What applications are driving these use cases? And how do we make sure the industry is involved in building the benchmarks, workloads, and profiles, and that the solutions are built and brought forward in as open a way as we can, as we want in OCP, for hardware as well as software?
So, this is the overall blueprint we defined: we'll talk about use cases, we'll talk about the system architecture for such composable memory systems, and then give the industry workloads. Then we'll talk about how these frameworks and benchmarks can be used to validate the solutions the industry wants to bring to market for composable memory subsystems that apply to the use cases we talked about. And all of this has to come together with a management framework.
We also expanded our work to feed into academia and make sure the work continues there, and to work with the broader industry. There are a lot of standards bodies working in different areas: JEDEC, SNIA, DMTF, UEC, UALink; the list continues to expand. In OCP we are not defining the standards, but we are defining the use cases and the solutions that are being brought forward. So we want to collaborate with the groups that are actually defining the standards and make sure the overall solutions are enabled.
So, that's the overall idea for Composable Memory Systems: for everything that has memory built in, whether general-purpose compute systems, AI systems, or storage systems, what are the solutions out there and how do we build them?
At a high level, when we started, we looked at where the solutions are. We have one way of connecting memory right now: DDR or LPDDR, or some direct memory connected to a CPU. There are more ways coming along that allow you to tier the memory; one of them could be direct-attached CXL, or any other memory fabric that comes along. You can imagine, and I'll show it through a picture: direct-attached memory; or pooled memory, where multiple hosts connect to a memory device and pool the capacity, with dedicated capacity for each of them; or those servers or end nodes connect to a pooled memory but share it, so there is a common region that can be shared in a consistent way for their use cases. Then there is fabric memory, where a whole memory fabric is created, memory sits as end nodes, and they connect to each other. It's quite visual; you can download the presentation later, and you will have access to that.
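To make those configurations concrete, here is a minimal Python sketch of the four attachment modes just described. The names and fields are ours for illustration only; they are not taken from any CMS specification.

```python
# Illustrative model of the attachment modes described above (names are ours).
from dataclasses import dataclass, field
from enum import Enum, auto


class AttachMode(Enum):
    DIRECT_ATTACHED = auto()   # DDR/LPDDR or CXL memory owned by one host
    POOLED_DEDICATED = auto()  # pooled device, capacity carved out per host
    POOLED_SHARED = auto()     # pooled device, one region shared by hosts
    FABRIC = auto()            # memory exposed as end nodes on a fabric


@dataclass
class MemoryRegion:
    mode: AttachMode
    capacity_gib: int
    hosts: list[str] = field(default_factory=list)  # hosts with access


if __name__ == "__main__":
    regions = [
        MemoryRegion(AttachMode.DIRECT_ATTACHED, 512, ["host-0"]),
        MemoryRegion(AttachMode.POOLED_DEDICATED, 256, ["host-0", "host-1"]),
        MemoryRegion(AttachMode.POOLED_SHARED, 128, ["host-0", "host-1", "host-2"]),
    ]
    for r in regions:
        print(f"{r.mode.name}: {r.capacity_gib} GiB -> {', '.join(r.hosts)}")
```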
History, where we started: we started a long time back as an FTI, a Future Technologies Initiative. A few people came together, it became a sub-project, and now we have very strong membership: more than 250 people are members of the group, and more than 40 to 50 people join the weekly call every Friday at 9 o'clock. So there is a strong commitment from people, and a significant amount of contributions are being made. This year we have already published specifications, white papers, and GitHub code for the memory fabric as well as for the workloads and benchmarks, so there are a lot of contributions coming from the member community. And as we look beyond general-purpose computing into AI systems, AI systems have significant challenges that the whole community is bringing forward: use cases such as accelerator memory expansion through the host, accelerator direct-attached expansion, and fabric-attached memory, all the AI-related use cases, plus what kinds of memories can be used, how they influence our decisions about what goes into AI clusters, and how they solve the AI memory wall problems, whether capacity or bandwidth. All of this is being discussed on a weekly basis, and we are quite excited to see how much work we can do. The work became significant, so we ended up defining and distributing it across five work streams. Let me hand it to Reddy to talk about the work streams and what we are going to do this year.
Okay, so, like Manoj said, we split the work streams based on the CMS charter. Obviously, we need to start with the use cases that hyperscalers and the next wave of customers actually see, so we have focused on the use cases. Then we have focused on the workloads: once we have the use cases, we need a set of targeted workloads, covering general-purpose compute as well as AI and HPC as two distinct focus areas. For us to build a total solution, we also need a data-center-wide composable memory orchestration framework; this builds on top of DMTF standards but is able to compose memory dynamically for each of these workloads. Then, like Manoj said, we have beefed up our focus this year on AI and HPC; fabric is a very critical component for AI- and HPC-specific workloads. Computational programming is something we do want to target, so we are not losing track of what is happening today, but we are also focusing on what's coming in the future. We believe computational programming is going to be one of the key aspects of driving system-wide optimizations for AI workloads as well as general-purpose workloads. And academic research is an area where we want to keep an eye on what's happening in academia and provide feedback, so there is bi-directional communication with the academic research community as well. That's the way we segmented the work.
These focus areas are divided into work streams. Each one has work stream leads, and they are going to talk today; for each one you're going to see specific, focused presentations as well. For this year and next year, our focus is going to be on the composable workloads, covering both general-purpose and AI-specific workloads. You have probably heard about caching with Redis and CacheLib, plus Cassandra and Spark; CacheLib is an open-source embedded caching library. So we will be focusing on general purpose, and you're going to see quite a bit of that this year, but come next year you're probably going to see more AI and HPC. Data center memory orchestration is about how we stitch everything together, from Kubernetes-style orchestration down to specific nodes and composing memory on the node: how do we stitch together the DMTF and CXL standards along the way, and have a white paper covering those implementation aspects. For AI and HPC, the target is an inference white paper, plus a focus on end-user use cases and more clarity there, by having a logical system architecture that supports AI and HPC as well. For computational programming, the immediate focus is educating on what the opportunities are, through a white paper as well as a blog. For academic research, we will have a blog coming up; we were targeting this summit, but we pushed it out to sometime after the summit.
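On the orchestration point above, here is a hypothetical sketch of the flow from an orchestrator down to a node's management controller: the orchestrator asks, out of band, for extra capacity before placing a memory-hungry workload. The endpoint path and JSON fields are placeholders for illustration, not actual DMTF/Redfish or CMS definitions.

```python
# Hypothetical out-of-band composition request; endpoint and fields are
# placeholders, not real DMTF/Redfish schema.
import json
import urllib.request

BMC_URL = "https://bmc.host-0.example"  # placeholder management endpoint


def request_capacity(gib: int, tier: str) -> dict:
    payload = json.dumps({"CapacityGiB": gib, "Tier": tier}).encode()
    req = urllib.request.Request(
        f"{BMC_URL}/compose/memory",      # hypothetical path
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # e.g. ask for 256 GiB of far (fabric-attached) memory for host-0
    print(request_capacity(256, tier="far"))
```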
Then there are broad, horizontal initiatives, like Manoj said: hardware-software co-design initiatives, and the composable memory appliance specification, which got published thanks to Seagate, who drove that. There will be more and more of that in CMS. There was a proposal from one of the key contributors, Brian, to see if we can have a memory tiering contest: take the workloads we have defined and promote a vibrant ecosystem among device vendors as well as total-solution vendors, to participate in demonstrating memory tiering, not just the concept but also the performance attributes and everything else. That is still in the works. And last year we did the memory access tracking ECN through the CXL Consortium, which highlights our work with other standards bodies, like Manoj said. There is also a white paper that was published, so I strongly encourage you to take a look at all the white papers published so far; they are on the CMS wiki.
From the logical architecture view, we essentially look at compute as one building block. It can be general-purpose compute coming from a CPU, or it can come from a GPU; those are the consumers of the memory. The concept is that you have near memory: if it is a CPU, you are talking about DDR, or remote, CXL directly-attached memory one hop away; if it is a GPU, you could be talking about high-bandwidth memory or LPDDR, whatever you have. Far memory is where you essentially have fabric-attached memory. There is a nuance in the performance and bandwidth attributes, so we segregated them into near and far, and then you have storage as the element for holding large content. That's how we segregated them in the system architecture.
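As a minimal sketch of the near/far split from the host's point of view, assuming a Linux system where a CXL memory expander shows up as a CPU-less NUMA node: nodes with local CPUs are treated as near memory, and CPU-less nodes as far-memory candidates. A real classification would also look at latency and bandwidth attributes, which this ignores.

```python
# Classify NUMA nodes as "near" (has local CPUs) or "far candidate" (CPU-less),
# a rough proxy for the near/far split described above.
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")


def classify_nodes() -> dict[str, str]:
    tiers = {}
    for node in sorted(NODE_ROOT.glob("node[0-9]*")):
        cpulist = (node / "cpulist").read_text().strip()
        tiers[node.name] = "near (CPU-attached)" if cpulist else "far candidate (CPU-less)"
    return tiers


if __name__ == "__main__":
    for node, tier in classify_nodes().items():
        print(f"{node}: {tier}")
```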
So, clicking down on the logical architecture view, I won't go into the details of each block; there is a white paper that explains all the details of each of these building blocks. But in general, the theme is that you have a compute element, and then there is also a management controller. We didn't want a very implementation-specific view, so this is a logical view: you don't see it labeled as a baseboard management controller, but technically it is a BMC managing the host. Then there is out-of-band connectivity to the data center orchestration. And then you have the CXL-based memory expansion capability: you can attach it to the host directly, share it among multiple hosts, or it can come from the fabric. Last but not least, there may be a memory enclosure sitting somewhere on a completely different transport, with nothing to do with CXL and PCIe. So we are looking at accommodating all of these usage models in the logical system architecture.
From the workload focus areas I briefly outlined before, we will continue to focus on general-purpose compute, looking at caching-related workloads, streaming-related workloads, Spark and Presto data analytics, Cassandra, and so on. But we'll also increase our focus on both training and inference workloads. And as and when time permits, we'll extend the focus to HPC as well; currently we are not focusing on HPC because we have so much work in front of us. You're going to see today that Vikrant and company are going to present; there are two topics, if I remember correctly.
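As one illustrative way such a caching workload could be pinned to a memory tier for an experiment, here is a sketch that launches it under numactl with allocations bound to a chosen NUMA node. It assumes numactl and redis-server are installed; the node numbers and the choice of Redis are examples, not CMS requirements.

```python
# Launch a caching workload with its memory bound to one NUMA node, so near
# and far tiers can be compared; node numbers below are examples only.
import subprocess

NEAR_NODE = 0   # CPU-attached DRAM node (example)
FAR_NODE = 2    # CPU-less CXL/far node (example)


def run_on_node(mem_node: int, cpu_node: int = 0) -> subprocess.Popen:
    cmd = [
        "numactl",
        f"--membind={mem_node}",      # allocate only from this node
        f"--cpunodebind={cpu_node}",  # keep compute where we expect it
        "redis-server", "--port", "6380",
    ]
    return subprocess.Popen(cmd)


if __name__ == "__main__":
    proc = run_on_node(FAR_NODE)   # repeat with NEAR_NODE to compare results
    proc.wait()
```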
So, I strongly encourage all of you to participate in the weekly meetings; we meet every week on Friday. That would be my request to all of you: come and join, take a look. Last but not least, we also want you to actively drive some of the key problem areas. Things like: how do you take advantage of fabric-attached memory for inference workloads? How do you speed up inference? Using GPU and CPU resources collectively with pooled memory resources, how does the orchestration need to shape up in that type of architecture? There are a lot of deep-dive discussions on these, so if you're really an AI expert, we want you to attend and contribute as much as possible. All the details are on the CMS project wiki; it's updated with meeting minutes, the latest white papers, and all the latest presentations. You can also subscribe to the OCP mailing list, so you get an idea of what's happening, and if there are specific topics you are interested in where we have a discussion going on, you can join selectively if that's what you prefer. I think that's pretty much what we have. Manoj, yeah.
That's about it, I think. We'll take a short break and return in three minutes so we can start on the next presentations. All right, we'll start then. Thank you very much, guys. Please join the work, and let's drive the memory solutions for these systems. Thank you.