YouTube: https://www.youtube.com/watch?v=mumLhBDYSgo
Text:
Good morning, good afternoon, good evening, everyone. My name is Parag, and thank you so much for the opportunity to be here today. I'm responsible for storage segment marketing at ARM, and today I'm here to talk about enabling CXL memory pooling devices. With that, let's get started.
From an agenda point of view, these are the topics we're going to cover. First and foremost, we will discuss the main reasons for the disaggregation of compute and memory. We'll cover perspectives from some of the Cloud Service Providers, whom I'm going to refer to simply as hyperscalers. Then we'll cover CXL from both the host and device side perspectives, and how CXL is enabling new data center architectures, focusing specifically on memory pooling devices. Finally, we will summarize and discuss specific next steps.
Let's move to the main reasons for the disaggregation of compute and memory, and how we look at it from a market dynamics point of view. The first reason is the inefficiency of DRAM memory utilization. We'll show you in some of the upcoming slides how much DRAM is stranded at each of the Cloud Service Providers. The second reason is the decline in memory channel bandwidth per core. As you've seen recently in announcements from Ampere, NVIDIA, and others, they're going to release products with more than 128 ARM cores, and many of the hyperscalers would like to have at least four gigabytes per core. You'll see that upcoming designs are considering going beyond eight memory channels to support these requirements, but we expect core counts to keep increasing faster over time. The third reason is the desire to reduce total cost of ownership. As many of you know, DRAM is one of the highest expense items in the data center, and anything that increases the efficiency of existing hardware indirectly contributes to a lower total cost of ownership. The fourth reason is PCIe speeds: a PCIe Gen 5 x4 link is roughly equivalent to single-channel DDR5 bandwidth, once you account for efficiency (see the rough comparison below). And last but not least, workloads at hyperscalers and Cloud Service Providers are becoming more and more divergent, creating a need for more configurability, because you don't want to build machines for specific workloads but rather configure general-purpose servers for different workloads. These are, in our view, the main reasons for the disaggregation of compute and memory.
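To make that PCIe comparison concrete, here is a minimal back-of-the-envelope sketch. The protocol-efficiency factors and the DDR5-4800 speed grade are illustrative assumptions on my part, not figures from the talk:

```python
# Rough bandwidth comparison behind the "PCIe Gen 5 x4 vs. single-channel
# DDR5" claim. Efficiency factors below are illustrative assumptions.

PCIE_GEN5_GT_PER_LANE = 32          # GT/s per lane
ENCODING = 128 / 130                # 128b/130b encoding overhead
LANES = 4

# Raw unidirectional PCIe bandwidth in GB/s (1 GT/s ~ 1 Gb/s per lane).
pcie_raw = PCIE_GEN5_GT_PER_LANE * ENCODING * LANES / 8   # ~15.75 GB/s
pcie_effective = pcie_raw * 0.90                          # assume ~90% protocol efficiency

DDR5_MT_S = 4800                    # DDR5-4800 (entry speed grade, assumed)
BUS_BYTES = 8                       # 64-bit data channel
ddr5_raw = DDR5_MT_S * BUS_BYTES / 1000                   # 38.4 GB/s
ddr5_effective = ddr5_raw * 0.70                          # assume ~70% achievable efficiency

print(f"PCIe Gen5 x4, per direction: ~{pcie_effective:.1f} GB/s")
print(f"DDR5-4800 single channel:    ~{ddr5_effective:.1f} GB/s")
# Counting both PCIe directions (~2x per-direction bandwidth) brings the
# link into the same ballpark as one DDR5 channel, which is the comparison
# being made in the talk.
```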
With that, let us dive a little deeper into the details of two of the inputs that we have received from Microsoft and Meta; this is all public information. On the left, you see charts from Microsoft showing how memory stranding grows as the number of scheduled cores increases. In this specific case, they show about 20 percent of memory stranded, which is a lot. If you do some simple math, assuming that 50 percent of data center server cost is memory, then decreasing the amount of stranded memory by 10 percent saves almost five percent of total data center cost (see the arithmetic below), which amounts to many millions of dollars. Microsoft has also started experimenting by adding latency to its workloads to see the impact; this is the second chart at the bottom left. You can see that some workloads show no slowdown at all, while others slow down because of the additional 64 nanoseconds of latency. If you zoom in a little, you can see that databases such as Redis and Spark are not really impacted by this additional latency, so you can assume these are good candidates to start using CXL, with no changes on the application side. However, we do expect that hyperscalers will ask customers to tier memory for their applications and charge them accordingly, so that customers are motivated to reduce costs on their own through tiering and pricing. Similarly, on the right, you see workloads from Meta showing the split of hot, warm, and cold data in terms of capacity and allocation. In terms of capacity, they have a lot of cold data; in terms of allocation, they have a mix of cold, warm, and hot data on the anonymous side and mostly cold data on the file side. Based on this data, one can make appropriate choices about enabling the right memory for each workload. Both of these companies are working on many proofs of concept, so we are excited to be working together to make CXL happen, because as you can see, there is a real need here.
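The savings math quoted above is simple enough to write down directly; this just restates the talk's own numbers:

```python
# Back-of-the-envelope version of the savings math from the talk:
# if memory is ~50% of server cost and you recover 10 points of
# stranded memory, you save roughly 5% of total server spend.

memory_cost_share = 0.50       # memory as a fraction of server cost (from the talk)
stranding_reduction = 0.10     # stranded memory recovered (from the talk)

total_savings = memory_cost_share * stranding_reduction
print(f"Savings on total server cost: {total_savings:.0%}")   # -> 5%
```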
Based on these data center needs, these are the two solutions that CXL Type 3 devices enable. On the left, you can see CXL memory expansion, wherein memory is added to existing servers. On the right is what we call a more advanced solution: memory pooling, which can be built with multiple DIMMs behind a memory pooling controller that needs to support cache coherency. We also expect CXL memory pooling controllers to be driven by DPUs based on ARM. Memory pooling solutions will really help enable higher memory capacities, thus addressing some of the market dynamics we mentioned a couple of minutes ago.
Now, we want to dive a little deeper into the memory pooling side of things. We expect multiple hosts connected to multiple CXL devices. In this diagram we show only three of them, but we expect that some solutions will support at least eight hosts. This is what we showcase in the picture on the right, where you can see eight CXL input ports. All of these input ports are connected to a CXL PHY and an endpoint, which are in turn connected to an ARM coherent interconnect, along with some other ARM products to manage the fabric, the SoC, and so on. The coherent NoC is then connected to eight DDR memory controllers (a minimal sketch of this topology follows below). As I mentioned before, this eight-by-eight memory pooling controller is an architecture around which we feel there is consensus; many in the host ecosystem are really looking for such a solution. You can also assume that the ARM coherent interconnect will provide the lowest latency possible, because this is one of the key attributes of our product portfolio. And last but not least, ARM is in a unique position to provide an end-to-end CXL solution covering both the host and device sides, because we have that knowledge; we can come up with solutions catered to specific requirements and optimize both sides very efficiently.
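As a way to picture the eight-by-eight topology, here is a minimal sketch in Python of a pooling controller handing out slices of the DRAM behind its eight DDR controllers to up to eight CXL host ports. All names, capacities, and the allocation policy are illustrative assumptions, not ARM's actual design:

```python
# Minimal sketch of an 8-host x 8-channel pooling controller: capacity
# behind eight DDR controllers is carved up and assigned to up to eight
# CXL host ports. Everything here is illustrative, not a real design.

from dataclasses import dataclass, field

NUM_HOST_PORTS = 8
NUM_DDR_CHANNELS = 8
CHANNEL_CAPACITY_GB = 256          # assumed capacity per DDR controller

@dataclass
class MemoryPool:
    free_gb: int = NUM_DDR_CHANNELS * CHANNEL_CAPACITY_GB
    allocations: dict[int, int] = field(default_factory=dict)  # host port -> GB

    def allocate(self, host_port: int, size_gb: int) -> bool:
        """Grant a slice of pooled DRAM to a host, if capacity remains."""
        if not 0 <= host_port < NUM_HOST_PORTS or size_gb > self.free_gb:
            return False
        self.allocations[host_port] = self.allocations.get(host_port, 0) + size_gb
        self.free_gb -= size_gb
        return True

    def release(self, host_port: int) -> None:
        """Return a host's entire slice to the pool (e.g. when its VMs exit)."""
        self.free_gb += self.allocations.pop(host_port, 0)

pool = MemoryPool()
pool.allocate(host_port=0, size_gb=512)
pool.allocate(host_port=3, size_gb=256)
print(pool.free_gb)  # 1280 GB still available to the other hosts
```

The point of the sketch is the key property of pooling: capacity released by one host is immediately available to any other host port, instead of sitting stranded inside a single server.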
Now, you have seen different solutions on the CXL side, but we also think it's important to look at this at a much larger scale. This is the current view of data centers, where you can see a set of compute servers with memory and storage, Ethernet-based networks as the backbone, TLC and QLC flash in the warm tiers, and hard drives in cold storage. To enable the needs of hyperscaler data centers, we expect changes. The first change, obviously, is that hosts will have CXL-enabled ports. Then we expect to see memory expansion within the compute server itself. The next stage would be to add a disaggregated memory pool with DRAM. Over time, we also expect CXL-based memory in the warm tier, and then probably even CXL with flash. So we can really see the change that is going to happen in the data center over time.
So, we have clearly seen how CXL-enabled memory disaggregation works from a data center point of view. But the question many people ask is: can you quantify it, and how does it play out over time? So, we want to showcase how CXL can enable memory disaggregation over time and what the benefits would be. Let's start with the baseline on the left. We have a set of jobs, each with its own compute and memory requirements, and we have clearly seen how much of the memory is stranded; based on the information in the Microsoft paper, let's assume 25 to 40 percent of memory is stranded. We then expect the first level of disaggregation to occur, in which overall DRAM cost comes down slightly with memory expansion, while the efficiency of near memory actually grows, which is really good. Over time, with appropriate block sizes, migration rates, migration algorithms, and sharing costs, we expect the cost savings to increase to around 30 percent, according to our internal estimates (a simple model is sketched below). As you can see, near memory becomes very well utilized, and far memory is utilized based on the needs of each particular workload.
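Here is a minimal model of that savings argument, assuming pooling lets near memory be sized close to actual demand while a shared pool serves the remainder. Apart from the quoted 25 to 40 percent stranding range and the ~30 percent savings figure, all numbers below are illustrative assumptions:

```python
# Illustrative model of the disaggregation savings argument: pooling lets
# provisioned near memory shrink toward actual demand, with the remainder
# served from a shared far-memory pool. Splits and pool stranding are
# assumptions, not figures from the talk.

def dram_cost(provisioned_gb: float, cost_per_gb: float = 1.0) -> float:
    return provisioned_gb * cost_per_gb

demand_gb = 1000                 # aggregate memory the jobs actually use
stranded_fraction = 0.30         # within the 25-40% range from the talk

# Baseline: every server over-provisions, so stranded memory is paid for.
baseline_gb = demand_gb / (1 - stranded_fraction)

# Pooled: near memory sized close to demand; far memory is shared across
# hosts, so its stranding is much lower (assumed 5% here).
near_gb, far_fraction = demand_gb * 0.8, 0.2
pooled_gb = near_gb + (demand_gb * far_fraction) / (1 - 0.05)

savings = 1 - dram_cost(pooled_gb) / dram_cost(baseline_gb)
print(f"DRAM cost savings: {savings:.0%}")   # ~29%, in line with the ~30% quoted
```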
The other question we get asked very frequently is about the timeline, so here is the timeline view that we share with some of our customers. Earlier this year, in the first half, we saw Samsung announce its first CXL device. On the host side, we expect some hosts to provide initial support for CXL 1.1 and then CXL 2.0. We should also see ARM-based hosts with CXL support appear in the latter half of this year and next year, with NVIDIA and Ampere solutions. In the 2024 timeframe, we expect second-generation memory expansion devices to appear in the market, along with first-generation memory pooling devices. Then in the 2026 timeframe, we expect second-generation memory pooling controller devices to be available. This is how we see the market from an enablement point of view.
Then, the other question we get asked quite frequently is: how big is this market, really? We understand the problem, and we understand how to provide solutions to it, but how big is it? Here are some estimates from our side on CXL memory expansion and pooling from a pure TAM point of view. The way we approached this is that we started with the server DRAM TAM, which by our estimates is roughly 15,000 million gigabytes in the 2025 timeframe. We then estimated that the server DRAM TAM will grow at a 20 percent CAGR, and applied an attach rate estimate, which is basically the chart at the top right. In the 2028 timeframe, we expect 10 percent of the bits to be catered by CXL DRAM, which translates to roughly 2,000 million gigabytes (the arithmetic is sketched below). Using these metrics, we then calculated, based on base product capacities, the dollar TAM and unit TAM. Based on all of these estimates, we think we can easily reach a four billion dollar TAM in the 2027-2028 timeframe, which is actually a conservative estimate in our view; we expect it could go much higher. We have also shown a split of the products between memory expansion and pooling, and as I mentioned on the timeline earlier, memory expansion devices will roll out first, eventually shifting to pooling. This is really how we see the market evolving over time.
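The bit-TAM arithmetic above can be reproduced directly from the quoted figures; the exact three-year compounding window is my assumption, and the talk rounds the result down to roughly 2,000 million GB:

```python
# Reproducing the TAM arithmetic from the talk: start from the 2025 server
# DRAM TAM, grow it at 20% CAGR, and apply the CXL attach rate in 2028.

server_dram_2025_m_gb = 15_000     # million GB (from the talk)
cagr = 0.20                        # from the talk
attach_rate_2028 = 0.10            # 10% of bits on CXL DRAM by 2028

server_dram_2028_m_gb = server_dram_2025_m_gb * (1 + cagr) ** 3
cxl_dram_2028_m_gb = server_dram_2028_m_gb * attach_rate_2028

print(f"2028 server DRAM TAM: {server_dram_2028_m_gb:,.0f} million GB")  # ~25,920
print(f"2028 CXL DRAM bits:   {cxl_dram_2028_m_gb:,.0f} million GB")     # ~2,592, quoted as roughly 2,000
```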
Before I summarize, I also wanted to highlight why ARM is a great partner for CXL. First and foremost, the technology is in its development cycle, and ARM has very good know-how on coherent device technology, thanks to our earlier efforts on CCIX (pronounced "C-six"). As many of you know, AMBA CHI is among the most popular bus architectures, used across the board in many SoCs today. ARM has relationships with many cloud service providers on the host side, and ARM has clearly demonstrated very strong ecosystem leadership. On the ecosystem side, we have already proven that ARM is a major player in building ecosystems around the world. On the IoT side, we have several million developers, and we have initiatives such as ARM SystemReady, Project Cassini, and Project Centauri, which are all about building ecosystem engagements. At the same time, we have also shown how we can build ecosystems in newer segments, such as automotive, where we made the SOAFEE announcement in September 2021. And on the data center side, we have ARM SystemReady for servers. So, ARM is really in a very good position to enable this ecosystem, and that's the reason we are looking forward to working with everyone on enabling the CXL-based ecosystem. With that, thank you so much for your time, and I really look forward to working with you all in the future. Bye.