Hello, everyone. My name is Ahmed Medhioub. I'm part of the product management team for Astera Labs' CXL Smart Memory Controllers. Today, I'm going to be presenting "Breaking Through the Memory Wall with CXL."
I will cover, first of all, the problem statement, that's the memory wall, and how we at Astera Labs are breaking through it with CXL. Then I'm going to discuss a few memory-bound use cases, dive into the new modular shared infrastructure that is now popular for CXL, discuss some current ongoing efforts in ecosystem enablement, and finish the presentation with some calls to action.
To quickly go over the problem statement: looking at the diagram on the bottom left of the slide, we see that over the last 20 years, compute performance has grown at a rate of about 3x every two years, compared to memory performance at about 1.6x every two years. This limited scalability in memory performance is what defines the memory wall. There is also a significant memory latency delta between tiers of memory. The industry has been challenged by this problem for years, trying to chip away at it with proprietary system configurations and deployments, for example persistent memory. Applications and SLAs built around memory performance face complexity in software stack integration when trying to scale.
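To put rough numbers on that gap, here is a small sketch using only the growth rates quoted above (about 3x for compute and 1.6x for memory every two years) to show how far apart the two curves end up over 20 years; the figures are illustrative, not measured data.

```python
# Illustrative sketch of the compute/memory growth gap described above.
# Assumes the quoted rates: compute ~3x and memory ~1.6x every two years.

PERIODS = 10  # 20 years = 10 two-year periods

compute_growth = 3.0 ** PERIODS   # roughly 59,000x over 20 years
memory_growth = 1.6 ** PERIODS    # roughly 110x over 20 years

print(f"Compute performance growth over 20 years: ~{compute_growth:,.0f}x")
print(f"Memory performance growth over 20 years:  ~{memory_growth:,.0f}x")
print(f"Gap (the 'memory wall'):                  ~{compute_growth / memory_growth:,.0f}x")
```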
So we at Astera Labs are helping to directly address this problem with Compute Express Link, or CXL. CXL is a high-speed, high-capacity protocol for CPU-to-device and CPU-to-memory connections designed for high-performance data center computing, and it is built on top of the PCIe physical and electrical interface. It includes a PCIe-based block input/output protocol, CXL.io, and new cache-coherent protocols for accessing system memory, CXL.cache, and device memory, CXL.mem. The diagram on the bottom right of the slide shows a 12-channel architecture with eight local DIMMs and four CXL-expanded DIMMs via two Leo memory controller chips on our Aurora x16 add-in card running at DDR5-5600 speeds, thus increasing the capacity of the system by 50% and, under load, reducing the overall latency by 25%. Using standard DIMMs allows hyperscalers to have a flexible supply chain and take control of their cost structure. We are also able to seamlessly expand memory for existing and new applications by working closely with ecosystem partners. A good example of that is the hardware interleaving feature we enable with Intel's fifth-generation Xeon processors. Interleaving mode is a setting where all the memory in the system is grouped into one single NUMA node.
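As a quick way to see how CXL-attached memory shows up to the operating system, here is a minimal Linux sketch that lists each NUMA node and its capacity by reading standard sysfs paths. With hardware interleaving enabled as described above, you would expect local and CXL DIMMs to appear under a single node; the exact output will vary by platform, and this is generic Linux tooling, not an Astera Labs utility.

```python
# Minimal sketch: list NUMA nodes and their memory capacity on Linux.
# With hardware interleaving (as described above), local + CXL memory
# is expected to appear under a single NUMA node; without it, CXL
# memory typically shows up as a separate, CPU-less node.
import glob
import re

for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = node_dir.rsplit("/", 1)[-1]
    with open(f"{node_dir}/meminfo") as f:
        meminfo = f.read()
    total_kb = int(re.search(r"MemTotal:\s+(\d+) kB", meminfo).group(1))
    with open(f"{node_dir}/cpulist") as f:
        cpus = f.read().strip() or "(no CPUs)"
    print(f"{node}: {total_kb / 1024 / 1024:.1f} GiB, CPUs: {cpus}")
```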
So some of the use cases where we see a significant impact from CXL expansion as a means to break through this wall fall into two primary camps: standard SQL databases on the left, and new applications fueled by the momentum around GenAI on the right. For the SQL databases, we are able to show how CXL can significantly boost time to insight in "what is happening now" and "what has happened" computations, and there are industry benchmarks, which I'll walk through on the next slide, that show that. On the right-hand side, we're showing a vector database that caches relevant images, so GenAI can take a hybrid approach, providing a semantic cache to supplement the data models.
So, for online transaction processing and online analytical processing: looking at the OLTP SQL database acceleration case here on the left, we're able to improve transactions per second by 150% and CPU utilization by about 15%. On the bottom left is the configuration we tested. In gray, there is 128 gigabytes of local DRAM only, which is our base case. In blue, we have 128 gigabytes of local DRAM and an extra 128 gigabytes of CXL-attached memory. This simulates about 1,000 clients with a Percona Lab benchmark. You can see that we are measuring peak performance and comparing the delta between configurations. On the right-hand side, for the OLAP case, we used TPC-H with a slightly different configuration. The local-DRAM case is 512 gigabytes, and in blue we have 512 gigabytes of DRAM plus 256 gigabytes of CXL-attached memory using the hardware interleaving mode I described earlier. Here, we're cutting query times by half in some cases.
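For readers who want to reproduce something in the same spirit, below is a minimal sketch of how an OLTP run like the one above could be driven using sysbench's standard oltp_read_write workload against MySQL. The talk references a Percona Lab benchmark with roughly 1,000 clients, so the workload name, table sizes, thread count, and connection parameters here are stand-in assumptions, not the exact test harness that produced the results shown.

```python
# Sketch of an OLTP load similar in spirit to the test above, using
# sysbench's oltp_read_write workload. Thread count, duration, dataset
# size, and connection details are assumptions for illustration only.
import subprocess

MYSQL = ["--db-driver=mysql", "--mysql-host=127.0.0.1",
         "--mysql-user=sbtest", "--mysql-password=sbtest", "--mysql-db=sbtest"]
DATASET = ["--tables=16", "--table-size=10000000"]
LOAD = ["--threads=1000", "--time=600", "--report-interval=10"]

# Prepare the dataset once, then run the timed workload.
subprocess.run(["sysbench", "oltp_read_write", *MYSQL, *DATASET, "prepare"],
               check=True)
subprocess.run(["sysbench", "oltp_read_write", *MYSQL, *DATASET, *LOAD, "run"],
               check=True)
```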
So CXL has a broad range of applicability, and you can see that with some of the applications on screen. On the far left, RecSys is an AI caching simulation popularly used for semantic search and recommendation engines. What this represents, generally, is the relative performance of a computer vision system identifying scenes, with the configurations I described on the previous slide and the same hardware interleaving scheme. What's essentially happening is that a multidimensional model caches scenes bearing membership in semantic categories, which could be mountains, forests, or streets. The caching service allowed the model to recall images more efficiently with the additional memory bandwidth. Some of the other benchmarks, like CFD and EDA, are commonly used in high-performance computing, and we can improve performance by up to 50%.
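To make the caching idea concrete, here is a toy sketch of the kind of semantic cache described above: scene embeddings are stored once and new queries are matched against the cache by cosine similarity before falling back to the full model. The embedding size, category names, and similarity threshold are made up for illustration; the real system's vector database and model are not shown here.

```python
# Toy sketch of a semantic cache for scene embeddings, in the spirit of
# the use case above. Embeddings, categories, and the threshold are
# illustrative values only.
import numpy as np

rng = np.random.default_rng(0)
DIM = 128  # assumed embedding size

# Cached scene embeddings, keyed by semantic category.
cache = {
    "mountains": rng.normal(size=DIM),
    "forests":   rng.normal(size=DIM),
    "streets":   rng.normal(size=DIM),
}

def lookup(query: np.ndarray, threshold: float = 0.8):
    """Return the best cached category if it is similar enough, else None."""
    best_cat, best_sim = None, -1.0
    for cat, emb in cache.items():
        sim = float(query @ emb / (np.linalg.norm(query) * np.linalg.norm(emb)))
        if sim > best_sim:
            best_cat, best_sim = cat, sim
    return (best_cat, best_sim) if best_sim >= threshold else (None, best_sim)

# A query close to the "mountains" embedding should hit the cache.
query = cache["mountains"] + 0.1 * rng.normal(size=DIM)
print(lookup(query))
```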
So what does this look like as a server infrastructure? Typical architectures used for such applications, like the ones you see on the left, use about 48 DIMMs across two separate two-socket systems to service in-memory databases. The challenge here is that you're over-provisioning, buying more than what you would probably need to run your in-memory database: extra CPUs, backplanes, drives, power supplies, et cetera. Compare that to getting one dual-socket CXL box with eight x16 Aurora A1000 memory expansion add-in cards in a two-DIMMs-per-channel configuration, which allows you to provision a total of 56 DIMMs. The value here is the ability to add more DIMMs without the need for a second server.
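As a back-of-the-envelope comparison using only the figures quoted above (48 DIMMs spread over two two-socket servers versus 56 DIMMs in a single CXL-expanded dual-socket box), here is the consolidation math; the per-server numbers are a straightforward division of those figures, not a detailed bill of materials.

```python
# Back-of-the-envelope consolidation math using the figures quoted above.
baseline_dimms, baseline_servers = 48, 2   # two separate two-socket systems
cxl_dimms, cxl_servers = 56, 1             # one dual-socket CXL-expanded box

print(f"Baseline: {baseline_dimms / baseline_servers:.0f} DIMMs per server, "
      f"{baseline_servers} servers")
print(f"CXL box:  {cxl_dimms / cxl_servers:.0f} DIMMs per server, "
      f"{cxl_servers} server")
print(f"Extra DIMMs with half the servers: {cxl_dimms - baseline_dimms}")
```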
So here's a deeper, more detailed look at what the node architecture might look like. You can see on the bottom left the modular density-optimized, or M-DNO, host processor module. What we see working with hyperscalers is this new MXIO connector, based on the SNIA SFF-TA-1037 or SFF-TA-1033 specification. It allows high-density nodes to have a coplanar connection with the motherboard. The value here is that servicing the card and DIMMs no longer requires pulling the full unit off the rack onto a cart; the module can simply be removed. We see these types of solutions being used with the x16 Leo solution in both 1 and 2 DPC configurations to expand memory bandwidth with low latency impact.
In a similar way, but for designs that require more distance between the CPU and the extension board, the MXIO cables based on the SNIA-defined SFF-TA-1016 spec address the same usability case with a lot more flexibility as a cabled solution. The compact nature of the cable allows for extra reach and room to remove the module outside the enclosure without unplugging it when a DIMM needs to be replaced, for example. There are other applications for this cable solution, such as coplanar-based and top-cover-mounted designs, as well as tray modules like the one I will show later.
But other options are available for folks who don't want cabling. An edge connector like the one here, defined by SNIA SFF-TA-1033, is a great example. It's an elegant solution, well suited to designs that have a blade architecture, as it handles the repeated plugging and unplugging stress of test and servicing well. For system builders, this connection offers a lot of optionality in how to design pluggable coplanar modules. What's essential is that the silicon on the modules that unlocks memory expansion tends to lend itself well to all sorts of designs, regardless of the mechanical challenges.
There are definitely multiple hardware challenges that we face when building this memory-rich compute infrastructure. For example, not all systems can accommodate double-width cards, and the number of cards we can fit in CEM slots is usually limited. What's interesting is that we see the OCP community converging toward a standard system architecture that directly addresses these challenges: the Data Center Modular Hardware System, or DC-MHS, such as the Yosemite OCP reference platform on the top right of the slide. Here, the motherboard designs are based on M-DNO partial-width density-optimized platform modules to allocate more space for expansion cards, like the eight modules shown on the top right. This shared infrastructure allows for coexisting, interchangeable processing, acceleration, and memory expansion as well.
But there are a number of challenges with such designs, some of which are: signal integrity; link bifurcation and configuration (1x16 to 2x8 to 4x4, et cetera); link diagnostics and monitoring; and performance and latency. Here at Astera Labs, we provide a comprehensive solution to alleviate the growing pains of deploying memory-expanded architectures at scale.
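On the link diagnostics and monitoring point, a very small example of what host-side monitoring can look like is below: it walks the PCIe devices Linux exposes in sysfs and prints the negotiated link width and speed, which is one way to confirm that a bifurcated 2x8 or 4x4 configuration actually trained as expected. This is generic Linux sysfs inspection, not an Astera Labs tool.

```python
# Minimal sketch: report negotiated PCIe link width/speed per device on
# Linux via sysfs. Useful as a sanity check that a bifurcated link
# (e.g. 2x8 or 4x4) trained at the expected width and speed.
import glob
import os

for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
    width_path = os.path.join(dev, "current_link_width")
    speed_path = os.path.join(dev, "current_link_speed")
    if not (os.path.exists(width_path) and os.path.exists(speed_path)):
        continue  # not every PCI function exposes link attributes
    with open(width_path) as f:
        width = f.read().strip()
    with open(speed_path) as f:
        speed = f.read().strip()
    print(f"{os.path.basename(dev)}: x{width} @ {speed}")
```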
So, to present the full view of this comprehensive portfolio of CXL solutions: first, shown on the left of the slide, is what is currently the most common application, the CPU direct-attached CXL memory expansion case, using the CEM (SFF-TA-1002) or blade (SFF-TA-1033) form factors.
We also provide a solution for short-reach CXL-attached memory, usually used with cabled solutions like MXIO (SFF-TA-1013), depending on where the expansion module is located with respect to the motherboard and the channel loss of that connection. This is essential to unlock backplane and JBOM designs.
And earlier this year, we released our new retimer-based smart cable modules that unlock active copper PCIe and CXL connections of up to 7 meters. This has the potential to enable new architectures like shared and pooled memory for node-to-node as well as rack-to-rack connectivity.
And it would look something like this.
So a reasonable question to ask here is: how are latency and performance impacted by adding retimers in the path for short and long reach? Based on the testing we did internally, shown on the right, we see very minimal impact, less than about 10% added to the overall latency, even with two retimers in the path. The configuration on the bottom describes a common setup used for our testing: an Intel host with a fifth-generation Xeon processor, with DDR5-5600 speeds on the CXL-attached memory cards, running Memory Latency Checker, which is a benchmark that measures the latency for the three configurations we described.
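For anyone repeating this kind of measurement, the sketch below shows one way to drive Intel's Memory Latency Checker from a script and compare idle latency between a baseline run and a run over the CXL/retimer path. The mlc binary location and the placeholder latency values are assumptions about your environment; check the flags and output format against the MLC version you have.

```python
# Sketch: run Intel Memory Latency Checker's idle-latency test and
# compute the relative overhead between two recorded measurements.
# The mlc binary location and placeholder values are assumptions;
# verify flags and output against your MLC version.
import subprocess

def run_idle_latency() -> str:
    """Run `mlc --idle_latency` and return its raw text output."""
    result = subprocess.run(["mlc", "--idle_latency"],
                            capture_output=True, text=True, check=True)
    return result.stdout

def overhead_pct(baseline_ns: float, cxl_path_ns: float) -> float:
    """Relative latency overhead of the CXL/retimer path vs. baseline."""
    return 100.0 * (cxl_path_ns - baseline_ns) / baseline_ns

if __name__ == "__main__":
    print(run_idle_latency())
    # Example: plug in latencies recorded from two runs (placeholder
    # values, not measured data).
    print(f"Overhead: {overhead_pct(110.0, 120.0):.1f}%")
```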
So it really takes a whole village, or ecosystem, to make CXL deployment at scale a reality. We have been, from the very beginning, and continue to be, working very closely with memory vendors, CPU vendors, OS vendors, and other partners in the industry to make sure that CXL memory is discoverable, manageable, and properly tuned and configured for the best performance across all deployment stacks. DIMM interoperability, stability, and performance have always been a central focus of our approach. We have multiple bulletins and reports from our Cloud-Scale Interop Lab covering a wide range of DIMM capacities and speeds, which I'll go over in a bit, and we will continue to work with hardware and software partners to enable more and more features.
So, as I mentioned, our Cloud-Scale Interop Lab is an initiative where we run our devices through a variety of tests, such as PCIe electrical, protocol, configuration, and compliance tests, as well as system and memory tests, DDR traffic and stress tests, and additional security and RAS cases. We cover a wide matrix of hosts and memory devices from ecosystem partners and customers, and we make these reports available on our customer portal.
So the call to action here is: visit our website to learn more about Leo, our CXL smart memory controller. As I mentioned on the previous slide, take some time to explore our Cloud-Scale Interop Lab. OCP also has plenty of information on DC-MHS and the other reference designs we talked about today, and more. And get in touch with me directly with any questions you may have on our CXL products. That was my presentation. Thank you very much.