Hello, everyone. My name is Ahmed Medhioub. I'm part of the product management team for Astera Labs' CXL Smart Memory Controllers. Today, I'm going to be presenting "Breaking Through the Memory Wall with CXL."
I will cover, first of all, the problem statement, that's the memory wall, and how we at Astera Labs are breaking through it with CXL. Then I'm going to discuss a few memory-bound use cases, dive into the new modular shared infrastructure that is now popular for CXL, discuss some current ongoing efforts in ecosystem enablement, and finish the presentation with some calls to action.
To quickly go over the problem statement: looking at the diagram on the bottom left of the slide, we see that over the last 20 years, compute performance has grown at a rate of about 3x every two years, compared to memory performance at about 1.6x every two years. This limited scalability in memory performance is what defines the memory wall. There is also a significant memory latency delta between tiers of memory. The industry has been challenged by this problem for years, trying to chip away at it with proprietary system configurations and deployments, for example persistent memory. Applications and SLAs built around memory performance face complexity in software stack integration when trying to scale.
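To put rough numbers on that gap, here is a small sketch using only the growth rates quoted above (about 3x for compute and 1.6x for memory every two years) to show how far apart the two curves end up over 20 years; the figures are illustrative, not measured data.

```python
# Illustrative sketch of the compute/memory growth gap described above.
# Assumes the quoted rates: compute ~3x and memory ~1.6x every two years.

PERIODS = 10  # 20 years = 10 two-year periods

compute_growth = 3.0 ** PERIODS   # roughly 59,000x over 20 years
memory_growth = 1.6 ** PERIODS    # roughly 110x over 20 years

print(f"Compute performance growth over 20 years: ~{compute_growth:,.0f}x")
print(f"Memory performance growth over 20 years:  ~{memory_growth:,.0f}x")
print(f"Gap (the 'memory wall'):                  ~{compute_growth / memory_growth:,.0f}x")
```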
So we at Astera Labs are helping to directly address this problem with Compute Express Link, or CXL. CXL is a high-speed, high-capacity protocol for CPU-to-device and CPU-to-memory connections designed for high-performance data center computing, and it is built on top of the PCIe physical and electrical interface. It includes a PCIe-based block input/output protocol, CXL.io, and new cache-coherent protocols for accessing system memory, CXL.cache, and device memory, CXL.mem. The diagram on the bottom right of the slide shows a 12-channel architecture with eight local DIMMs and four CXL-expanded DIMMs via two Leo memory controller chips on our Aurora x16 add-in card running at DDR5-5600 speeds, thus increasing the capacity of the system by 50% and, under load, reducing the overall latency by 25%. Using standard DIMMs allows hyperscalers to have a flexible supply chain and take control of their cost structure. We are also able to seamlessly expand memory for existing and new applications by working closely with ecosystem partners. A good example of that is the hardware interleaving feature we enable with Intel's fifth-generation Xeon processors. Interleaving mode is a setting where all the memory in the system is grouped into one single NUMA node.
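As a quick way to see how CXL-attached memory shows up to the operating system, here is a minimal Linux sketch that lists each NUMA node and its capacity by reading standard sysfs paths. With hardware interleaving enabled as described above, you would expect local and CXL DIMMs to appear under a single node; the exact output will vary by platform, and this is generic Linux tooling, not an Astera Labs utility.

```python
# Minimal sketch: list NUMA nodes and their memory capacity on Linux.
# With hardware interleaving (as described above), local + CXL memory
# is expected to appear under a single NUMA node; without it, CXL
# memory typically shows up as a separate, CPU-less node.
import glob
import re

for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = node_dir.rsplit("/", 1)[-1]
    with open(f"{node_dir}/meminfo") as f:
        meminfo = f.read()
    total_kb = int(re.search(r"MemTotal:\s+(\d+) kB", meminfo).group(1))
    with open(f"{node_dir}/cpulist") as f:
        cpus = f.read().strip() or "(no CPUs)"
    print(f"{node}: {total_kb / 1024 / 1024:.1f} GiB, CPUs: {cpus}")
```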
So some of the use cases where we see a significant impact from CXL expansion as a means to break through this wall fall into two primary camps: standard SQL databases on the left, and new applications fueled by the momentum around GenAI on the right. For the SQL databases, we are able to show how CXL can significantly boost time to insight in "what is happening now" and "what has happened" computations, and there are industry benchmarks, which I'll walk through on the next slide, that show that. On the right-hand side, we're showing a vector database that caches relevant images, so GenAI can take a hybrid approach, providing a semantic cache to supplement the data models.
So, for online transaction processing and online analytical processing: looking at the OLTP SQL database acceleration case here on the left, we're able to improve transactions per second by 150% and CPU utilization by about 15%. On the bottom left is the configuration we tested. In gray, there is 128 gigabytes of local DRAM only, which is our base case. In blue, we have 128 gigabytes of local DRAM and an extra 128 gigabytes of CXL-attached memory. This simulates about 1,000 clients with a Percona Lab benchmark. You can see that we are measuring peak performance and comparing the delta between configurations. On the right-hand side, for the OLAP case, we used TPC-H with a slightly different configuration. The local-DRAM case is 512 gigabytes, and in blue we have 512 gigabytes of DRAM plus 256 gigabytes of CXL-attached memory using the hardware interleaving mode I described earlier. Here, we're cutting query times by half in some cases.
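For readers who want to reproduce something in the same spirit, below is a minimal sketch of how an OLTP run like the one above could be driven using sysbench's standard oltp_read_write workload against MySQL. The talk references a Percona Lab benchmark with roughly 1,000 clients, so the workload name, table sizes, thread count, and connection parameters here are stand-in assumptions, not the exact test harness that produced the results shown.

```python
# Sketch of an OLTP load similar in spirit to the test above, using
# sysbench's oltp_read_write workload. Thread count, duration, dataset
# size, and connection details are assumptions for illustration only.
import subprocess

MYSQL = ["--db-driver=mysql", "--mysql-host=127.0.0.1",
         "--mysql-user=sbtest", "--mysql-password=sbtest", "--mysql-db=sbtest"]
DATASET = ["--tables=16", "--table-size=10000000"]
LOAD = ["--threads=1000", "--time=600", "--report-interval=10"]

# Prepare the dataset once, then run the timed workload.
subprocess.run(["sysbench", "oltp_read_write", *MYSQL, *DATASET, "prepare"],
               check=True)
subprocess.run(["sysbench", "oltp_read_write", *MYSQL, *DATASET, *LOAD, "run"],
               check=True)
```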
So CXL has a broad range of applicability, and you can see that with some of the applications on screen. On the far left, RecSys is an AI caching simulation popularly used for semantic search and recommendation engines. What this represents, generally, is the relative performance of a computer vision system identifying scenes, with the configurations I described on the previous slide and the same hardware interleaving scheme. What's essentially happening is that a multidimensional model caches scenes bearing membership in semantic categories, which could be mountains, forests, or streets. The caching service allowed the model to recall images more efficiently with the additional memory bandwidth. Some of the other benchmarks, like CFD and EDA, are commonly used in high-performance computing, and we can improve performance by up to 50%.
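To make the caching idea concrete, here is a toy sketch of the kind of semantic cache described above: scene embeddings are stored once and new queries are matched against the cache by cosine similarity before falling back to the full model. The embedding size, category names, and similarity threshold are made up for illustration; the real system's vector database and model are not shown here.

```python
# Toy sketch of a semantic cache for scene embeddings, in the spirit of
# the use case above. Embeddings, categories, and the threshold are
# illustrative values only.
import numpy as np

rng = np.random.default_rng(0)
DIM = 128  # assumed embedding size

# Cached scene embeddings, keyed by semantic category.
cache = {
    "mountains": rng.normal(size=DIM),
    "forests":   rng.normal(size=DIM),
    "streets":   rng.normal(size=DIM),
}

def lookup(query: np.ndarray, threshold: float = 0.8):
    """Return the best cached category if it is similar enough, else None."""
    best_cat, best_sim = None, -1.0
    for cat, emb in cache.items():
        sim = float(query @ emb / (np.linalg.norm(query) * np.linalg.norm(emb)))
        if sim > best_sim:
            best_cat, best_sim = cat, sim
    return (best_cat, best_sim) if best_sim >= threshold else (None, best_sim)

# A query close to the "mountains" embedding should hit the cache.
query = cache["mountains"] + 0.1 * rng.normal(size=DIM)
print(lookup(query))
```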
So what does this look like as a server infrastructure? Typical architectures used for such applications, like the ones you see on the left, use about 48 DIMMs across two separate two-socket systems to service in-memory databases. The challenge here is that you're over-provisioning, buying more than what you would probably need to run your in-memory database: extra CPUs, backplanes, drives, power supplies, et cetera. Compare that to getting one dual-socket CXL box with eight x16 Aurora A1000 memory expansion add-in cards in a two-DIMMs-per-channel configuration, which allows you to provision a total of 56 DIMMs. The value here is the ability to add more DIMMs without the need for a second server.
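As a back-of-the-envelope comparison using only the figures quoted above (48 DIMMs spread over two two-socket servers versus 56 DIMMs in a single CXL-expanded dual-socket box), here is the consolidation math; the per-server numbers are a straightforward division of those figures, not a detailed bill of materials.

```python
# Back-of-the-envelope consolidation math using the figures quoted above.
baseline_dimms, baseline_servers = 48, 2   # two separate two-socket systems
cxl_dimms, cxl_servers = 56, 1             # one dual-socket CXL-expanded box

print(f"Baseline: {baseline_dimms / baseline_servers:.0f} DIMMs per server, "
      f"{baseline_servers} servers")
print(f"CXL box:  {cxl_dimms / cxl_servers:.0f} DIMMs per server, "
      f"{cxl_servers} server")
print(f"Extra DIMMs with half the servers: {cxl_dimms - baseline_dimms}")
```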
So here's a deeper, more detailed look at what the node architecture might look like. You can see on the bottom left the modular density-optimized, or M-DNO, host processor module. What we see working with hyperscalers is this new MXIO connector, based on the SNIA SFF-TA-1037 or SFF-TA-1033 specification. It allows high-density nodes to have a coplanar connection with the motherboard. The value here is that servicing the card and DIMMs no longer requires pulling the full unit off the rack onto a cart; the module can simply be removed. We see these types of solutions being used with the x16 Leo solution in both 1 and 2 DPC configurations to expand memory bandwidth with low latency impact.
In a similar way, but for designs that require more distance between the CPU and the extension board, the MXIO cables based on the SNIA-defined SFF-TA-1016 spec address the same usability case with a lot more flexibility as a cabled solution. The compact nature of the cable allows for extra reach and room to remove the module outside the enclosure without unplugging it when a DIMM needs to be replaced, for example. There are other applications for this cable solution, such as coplanar-based and top-cover-mounted designs, as well as tray modules like the one I will show later.
But other options are available for folks who don't want cabling. An edge connector like the one here, defined by SNIA SFF-TA-1033, is a great example. It's an elegant solution, well suited to designs that have a blade architecture, as it handles the repeated plugging and unplugging stress of test and servicing well. For system builders, this connection offers a lot of optionality in how to design pluggable coplanar modules. What's essential is that the silicon on the modules that unlocks memory expansion tends to lend itself well to all sorts of designs, regardless of the mechanical challenges.
There are definitely multiple hardware challenges that we face when building this memory-rich compute infrastructure. For example, not all systems can accommodate double-width cards, and the number of cards we can fit in CEM slots is usually limited. What's interesting is that we see the OCP community converging toward a standard system architecture that directly addresses these challenges: the Data Center Modular Hardware System, or DC-MHS, such as the Yosemite OCP reference platform on the top right of the slide. Here, the motherboard designs are based on M-DNO partial-width density-optimized platform modules to allocate more space for expansion cards, like the eight modules shown on the top right. This shared infrastructure allows for coexisting, interchangeable processing, acceleration, and memory expansion as well.
But there are a number of challenges with such designs, some of which are: signal integrity; link bifurcation and configuration (1x16 to 2x8 to 4x4, et cetera); link diagnostics and monitoring; and performance and latency. Here at Astera Labs, we provide a comprehensive solution to alleviate the growing pains of deploying memory-expanded architectures at scale.
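On the link diagnostics and monitoring point, a very small example of what host-side monitoring can look like is below: it walks the PCIe devices Linux exposes in sysfs and prints the negotiated link width and speed, which is one way to confirm that a bifurcated 2x8 or 4x4 configuration actually trained as expected. This is generic Linux sysfs inspection, not an Astera Labs tool.

```python
# Minimal sketch: report negotiated PCIe link width/speed per device on
# Linux via sysfs. Useful as a sanity check that a bifurcated link
# (e.g. 2x8 or 4x4) trained at the expected width and speed.
import glob
import os

for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
    width_path = os.path.join(dev, "current_link_width")
    speed_path = os.path.join(dev, "current_link_speed")
    if not (os.path.exists(width_path) and os.path.exists(speed_path)):
        continue  # not every PCI function exposes link attributes
    with open(width_path) as f:
        width = f.read().strip()
    with open(speed_path) as f:
        speed = f.read().strip()
    print(f"{os.path.basename(dev)}: x{width} @ {speed}")
```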
So, to present the full view of this comprehensive portfolio of CXL solutions: first, shown on the left of the slide, is what is currently the most common application, the CPU direct-attached CXL memory expansion case, using the CEM (SFF-TA-1002) or blade (SFF-TA-1033) form factors.
We also provide a solution for short-reach CXL-attached memory, usually used with cabled solutions like MXIO (SFF-TA-1013), depending on where the expansion module is located with respect to the motherboard and the channel loss of that connection. This is essential to unlock backplane and JBOM designs.
And earlier this year, we released our new retimer-based smart cable modules that unlock active copper PCIe and CXL connections of up to 7 meters. This has the potential to enable new architectures like shared and pooled memory for node-to-node as well as rack-to-rack connectivity.
And it would look something like this.
So a reasonable question to ask here is: how are latency and performance impacted by adding retimers in the path for short and long reach? Based on the testing we did internally, shown on the right, we see very minimal impact, less than about 10% added to the overall latency, even with two retimers in the path. The configuration on the bottom describes a common setup used for our testing: an Intel host with a fifth-generation Xeon processor, with DDR5-5600 speeds on the CXL-attached memory cards, running Memory Latency Checker, which is a benchmark that measures the latency for the three configurations we described.
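For anyone repeating this kind of measurement, the sketch below shows one way to drive Intel's Memory Latency Checker from a script and compare idle latency between a baseline run and a run over the CXL/retimer path. The mlc binary location and the placeholder latency values are assumptions about your environment; check the flags and output format against the MLC version you have.

```python
# Sketch: run Intel Memory Latency Checker's idle-latency test and
# compute the relative overhead between two recorded measurements.
# The mlc binary location and placeholder values are assumptions;
# verify flags and output against your MLC version.
import subprocess

def run_idle_latency() -> str:
    """Run `mlc --idle_latency` and return its raw text output."""
    result = subprocess.run(["mlc", "--idle_latency"],
                            capture_output=True, text=True, check=True)
    return result.stdout

def overhead_pct(baseline_ns: float, cxl_path_ns: float) -> float:
    """Relative latency overhead of the CXL/retimer path vs. baseline."""
    return 100.0 * (cxl_path_ns - baseline_ns) / baseline_ns

if __name__ == "__main__":
    print(run_idle_latency())
    # Example: plug in latencies recorded from two runs (placeholder
    # values, not measured data).
    print(f"Overhead: {overhead_pct(110.0, 120.0):.1f}%")
```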
So it really takes a whole village, or ecosystem, to make CXL deployment at scale a reality. We have been, from the very beginning, and continue to be, working very closely with memory vendors, CPU vendors, OS vendors, and other partners in the industry to make sure that CXL memory is discoverable, manageable, and properly tuned and configured for the best performance across all deployment stacks. DIMM interoperability, stability, and performance have always been a central focus of our approach. We have multiple bulletins and reports from our Cloud-Scale Interop Lab covering a wide range of DIMM capacities and speeds, which I'll go over in a bit, and we will continue to work with hardware and software partners to enable more and more features.
So, as I mentioned, our Cloud-Scale Interop Lab is an initiative where we run our devices through a variety of tests, such as PCIe electrical, protocol, configuration, and compliance tests, as well as system and memory tests, DDR traffic and stress tests, and additional security and RAS cases. We cover a wide matrix of hosts and memory devices from ecosystem partners and customers, and we make these reports available on our customer portal.
So the call to action here is: visit our website to learn more about Leo, our CXL smart memory controller. As I mentioned on the previous slide, take some time to explore our Cloud-Scale Interop Lab. OCP also has plenty of information on DC-MHS and the other reference designs we talked about today, and more. And get in touch with me directly with any questions you may have on our CXL products. That was my presentation. Thank you very much.