YouTube: https://www.youtube.com/watch?v=KC548VU4Hd0
Text:
Welcome to my session, where I'll discuss using our MemVerge Memory Machine software, which is specifically designed to use CXL memory capacity expansion, pooling, and sharing hardware topologies to improve application performance. I'll first walk you through the software and user interfaces. Then we'll dive into real-world use cases, showing actual results and the benefits of CXL and our software. But first, my name is Steve Scargall and I'm the Director of Product Management for CXL and AI products at MemVerge.
So our Memory Machine software is a 100% user-space solution that enables server memory expansion for unmodified applications using our latency and bandwidth quality-of-service policies. For fabric-attached CXL appliances, applications can use our Gizmo object store APIs to take advantage of memory sharing across compute nodes. I'll talk more about each of these in a moment and show you how we use this technology to improve application performance.
Our Memory Machine has two memory placement policies, focusing on latency and bandwidth. Firstly, the latency tiering policy intelligently identifies and manages data placement and movement based on page temperature. Hot pages are kept in the fastest tier and cooler pages are moved to the slower tiers. Pages will migrate between DRAM and CXL over time, depending on application workload behavior and access patterns, ensuring optimal application performance. Our bandwidth tiering policy strategically places data across the memory devices in the system based on their physical bandwidth characteristics. For example, if a system has a NUMA node with eight DRAM modules and four CXL devices, we can specify the desired ratio or interleaving strategy to the policy using the number of devices available to the system. This benefits applications that need higher bandwidth per core than DRAM alone can provide.
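As a rough illustration of that bandwidth idea (not MemVerge's actual implementation, which is proprietary), a minimal Python sketch that turns the device counts from the eight-DRAM, four-CXL example into an interleave ratio and a page placement pattern might look like this; the device counts are the only inputs, and everything else is assumed for the example:

# Hypothetical sketch of the bandwidth-tiering idea: derive a DRAM:CXL
# interleave ratio from the number of memory devices behind each tier.
from math import gcd

def interleave_ratio(dram_devices, cxl_devices):
    # Reduce the device counts to the smallest DRAM:CXL page ratio.
    g = gcd(dram_devices, cxl_devices)
    return dram_devices // g, cxl_devices // g

def place_pages(num_pages, dram_devices, cxl_devices):
    # Assign pages round-robin following the reduced DRAM:CXL ratio.
    d, c = interleave_ratio(dram_devices, cxl_devices)
    cycle = ["DRAM"] * d + ["CXL"] * c
    return [cycle[i % len(cycle)] for i in range(num_pages)]

if __name__ == "__main__":
    print(interleave_ratio(8, 4))    # -> (2, 1): two pages on DRAM for every page on CXL
    print(place_pages(6, 8, 4))      # -> ['DRAM', 'DRAM', 'CXL', 'DRAM', 'DRAM', 'CXL']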
We're currently shipping Memory Machine 1.5, and if you're not familiar with our product, I'll walk you through the UI and explain the many features. To quickly summarize MMX, as we call it: it supports multiple compute nodes through the single user interface you're seeing here. A small agent runs on each node and allows us to manage and collect telemetry, which is then displayed through the management interface I'm about to showcase. We show CPU, DRAM, CXL, and GPU utilization and capacity consumption. We provide an inspection feature that allows you to watch your running applications to see exactly how much memory they're using, along with how much of that memory makes up the hot working set. You can enable the memory quality-of-service features that include the latency and bandwidth policies I just discussed. And finally, if you have a CXL fabric, we can show its topology. So when you log into the Memory Machine UI, we display the dashboard showing all of the registered compute nodes, switches, CXL devices, GPUs, etc. We show high-level aggregated DRAM and CXL utilization telemetry for all of the systems that are registered, and this is akin to looking at your data center from a 5,000-foot level.
So our CXL fabric view shows how the nodes, switches, memory JBOMs, and appliances are all connected to each other. In this example, we're showing two fabrics, each with a switch in the center and various nodes and appliances connected to it.
Now if we zoom in a little further, a single compute node's dashboard shows the key system-level information, topology, and telemetry. Each component in the topology view is interactive, so I can hover over and click each item to show even more detailed information, and the telemetry section in the lower half shows live metrics for CPU, DRAM, CXL, and GPU.
So to gain further detailed insights into the hot working set size of any running application or process on your system, our insights feature allows you to monitor the chosen processes for a period of time and will report the live and historical memory usage. This is quite beneficial in a DRAM-only environment to understand what your application is doing and exactly how much memory it uses and how much of that memory is truly hot. So this then allows us to architect a CXL solution.
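The Insights feature itself is part of Memory Machine, but the basic idea of sampling a process's memory footprint over time can be sketched with standard Linux interfaces. A minimal example, which only samples resident set size from /proc and does not measure page hotness the way the product does, might look like this:

# Minimal sketch (Linux only): periodically sample a process's resident memory
# from /proc/<pid>/status. This illustrates the monitoring idea only; it does
# not track page temperature the way a real working-set profiler would.
import sys
import time

def rss_kib(pid):
    # Return the current resident set size of `pid` in KiB.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

if __name__ == "__main__":
    pid = int(sys.argv[1])
    for _ in range(10):                 # ten samples, one per second
        print(f"pid {pid}: RSS = {rss_kib(pid)} KiB")
        time.sleep(1)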
Now on systems with DRAM and CXL, enabling our bandwidth or latency memory quality-of-service feature allows us to control memory placement for the selected processes. In this example, we're showing a Qdrant vector database using the bandwidth policy, and for each selected process we track and report the CPU utilization, the memory usage, and which NUMA node or nodes the process is using. This is very beneficial not only for seeing where the memory accesses are coming from, but also for identifying whether there are any bottlenecks in the system.
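Outside of the Memory Machine UI, similar per-process NUMA placement information is visible through standard Linux interfaces; a small, illustrative sketch that totals resident pages per NUMA node from /proc/<pid>/numa_maps could look like this:

# Minimal sketch (Linux, run as the process owner or root): sum how many pages
# a process has resident on each NUMA node using /proc/<pid>/numa_maps, where
# fields such as "N0=123" record pages on node 0 for each mapping.
import sys
from collections import Counter

def pages_per_node(pid):
    counts = Counter()
    with open(f"/proc/{pid}/numa_maps") as f:
        for line in f:
            for field in line.split():
                if field.startswith("N") and "=" in field:
                    node, pages = field[1:].split("=")
                    counts[int(node)] += int(pages)
    return counts

if __name__ == "__main__":
    for node, pages in sorted(pages_per_node(int(sys.argv[1])).items()):
        print(f"node {node}: {pages} pages")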
Let's now look at how Memory Machine is being used with the various CXL topologies of local server memory expansion, memory pooling, and memory sharing. In the first use case of memory expansion, we assume one or more CXL devices are physically installed inside the server. In the memory pooling use case, memory is provisioned from an external memory appliance, similar to how storage is provisioned from a NAS, DAS, SAN, or other storage appliance. And then finally, memory sharing allows memory from an external appliance to be shared between two or more compute nodes, with software responsible for managing coherency between the instances to avoid data corruption. In all three cases, Memory Machine can be used. For example, in the memory expansion and pooling cases, our latency and bandwidth quality-of-service policies can manage data placement across the CXL devices, as we previously discussed. This is great for existing, unmodified applications that need the additional capacity and bandwidth that CXL delivers. For memory sharing, we have a global, IO-free shared memory object store that we call Gizmo. Gizmo provides a common object-store API for applications and manages the data and coherency behind the scenes. Applications do need to be modified to use the Gizmo APIs, however.
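The actual Gizmo API is not shown in this talk, so the names below are purely hypothetical; this toy sketch only conveys the general idea of a put/get object store backed by a shared memory region, with a simple lock standing in for real cross-node coherency management:

# Purely hypothetical sketch of an object-store-style interface over a shared
# memory segment. Class and method names are invented for illustration and are
# NOT the real Gizmo API; coherency here is modelled with a single lock.
import pickle
from multiprocessing import Lock, shared_memory

class SharedObjectStore:
    # Toy put/get store backed by one shared-memory segment.
    def __init__(self, name, size=1 << 20):
        self._shm = shared_memory.SharedMemory(name=name, create=True, size=size)
        self._lock = Lock()           # stand-in for real cross-node coherency
        self._index = {}              # object key -> (offset, length)
        self._next = 0

    def put(self, key, obj):
        data = pickle.dumps(obj)
        with self._lock:
            off = self._next
            self._shm.buf[off:off + len(data)] = data
            self._index[key] = (off, len(data))
            self._next = off + len(data)

    def get(self, key):
        with self._lock:
            off, length = self._index[key]
            return pickle.loads(bytes(self._shm.buf[off:off + length]))

    def close(self):
        self._shm.close()
        self._shm.unlink()

if __name__ == "__main__":
    store = SharedObjectStore("gizmo_demo")
    store.put("greeting", {"msg": "hello from shared memory"})
    print(store.get("greeting"))
    store.close()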
To provide a concrete example of a real server using CXL and Memory Machine, this is the new MSI S2301 server. You can see from the system topology that it has two AMD Genoa CPUs and eight Samsung E3.S CXL memory expander devices, and these are installed in the front of the server, just like NVMe drives. So with one terabyte of CXL and 768 gigabytes of DRAM, this system has about 1.7 terabytes of usable memory. The system itself supports up to 8 terabytes of memory, which would be 6 terabytes of DRAM and 2 terabytes of CXL.
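On a system like this, the CXL expanders typically show up as one or more CPU-less NUMA nodes. A quick, illustrative way to list each node's capacity and see which nodes have no CPUs (Linux only, paths assumed to follow the standard sysfs layout) is:

# Minimal sketch (Linux): list each NUMA node's total memory and whether it has
# CPUs, which is one way to spot CXL capacity exposed as CPU-less NUMA nodes.
import glob

for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = path.rsplit("node", 1)[-1]
    with open(f"{path}/meminfo") as f:
        total_kb = int(f.readline().split()[3])   # "Node X MemTotal: N kB"
    cpus = open(f"{path}/cpulist").read().strip()
    kind = "CPU-less (possibly CXL)" if not cpus else f"CPUs {cpus}"
    print(f"node {node}: {total_kb // (1024 * 1024)} GiB, {kind}")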
So let's start looking at real-world applications and use cases. This first one uses FlexGen, an open-source inference engine, specifically focusing on the use case where the models don't fit into GPU memory. FlexGen can intelligently tier data between GPU memory, main system memory, and NVMe storage if required. The ideal use case is when you don't have access to high-end GPU servers or clusters, but do have a single mid-range commodity GPU, for example. What we're showing here is a comparison of running an OPT 66-billion-parameter model in two different scenarios. The first scenario uses an NVIDIA A10, which has 16 gigabytes of GPU memory, 256 gigabytes of main system memory, and a fast NVMe drive. This particular model is too large to fit into GPU memory entirely, so it has to spill out into main memory and eventually into the NVMe storage. The benefit of FlexGen is that the workload completes, albeit slowly; ordinarily, with other inference engines, the job would die because the GPU returns an out-of-memory condition. The normal solution to this is to use more expensive GPUs with more memory, use multiple GPUs, or somehow shard the workload so it only works on the part of the problem that does fit into the GPU. The second scenario is to use CXL and our solution. Doing so allows the data to comfortably fit in the combined capacity of DRAM and CXL, so the GPU can access the data directly using DMA operations. Now, you can see the results from these two scenarios. The x-axis is the time elapsed for the test, and the y-axis is the GPU utilization. The NVMe solution is the green line, which clearly shows that the GPU utilization drops significantly once the data has to be copied in from NVMe. Contrast this with the blue line, which is the result we got from the DRAM and CXL solution. We can clearly see that the GPU utilization is maintained above about 95% for the duration of the test when running with CXL. So that gives us a 77% improvement in GPU utilization relative to NVMe. We're also able to get to the time to insight, or time to result, much faster: the result is a little under 300 seconds for the DRAM and CXL solution versus a little over 600 seconds for the NVMe solution. This also has the benefit of allowing more tokens to be generated, or decoded; we saw a 3x improvement in decoded tokens per second using DRAM and CXL versus NVMe. And of course, because we weren't using NVMe, there's no storage IO to worry about.
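As a conceptual illustration of the tiering decision described above (not FlexGen's actual code), the sketch below greedily places model weights, KV cache, and activations into the fastest tier with room and spills to slower tiers; the capacities and tensor sizes are assumptions chosen to mirror the A10 scenario:

# Conceptual sketch: greedily place tensors in the fastest tier with room,
# spilling to slower tiers, the way an offloading inference engine tiers data
# across GPU memory, DRAM+CXL, and NVMe. All sizes (GiB) are illustrative.
TIERS = [("GPU", 16), ("DRAM+CXL", 256 + 1024), ("NVMe", 4096)]

def place(tensors_gib):
    # tensors_gib: list of (name, size in GiB). Returns {name: tier}.
    free = {name: cap for name, cap in TIERS}
    placement = {}
    for name, size in tensors_gib:
        for tier, _ in TIERS:           # try the fastest tier first
            if free[tier] >= size:
                free[tier] -= size
                placement[name] = tier
                break
    return placement

if __name__ == "__main__":
    tensors = [("weights", 132), ("kv_cache", 900), ("activations", 24)]
    for name, tier in place(tensors).items():
        print(f"{name:12s} -> {tier}")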
Retrieval Augmented Generation, or RAG, is a very popular solution these days because it allows us to use a pre-trained model such as Llama 2, but add our own data sets or knowledge base that it wasn't previously trained on. Once the data has been pre-processed, embedded, and ingested into a vector database, when a query enters the RAG pipeline the nearest-neighbor results to the query are extracted from the database and added to the input of the model so it has that relevant information. The model then uses that information to generate an answer, and this is what we call a naive RAG pipeline. Depending on the model and data set size, there are several challenges to this solution. The embedding stage is usually compute bound. The generation, or LLM inference, stage usually requires GPUs. And because of the data set sizes, the embeddings might actually consume more memory than the system has available, so they could eventually spill to disk. And of course, if that does happen, then reading and writing the vector databases incurs significant penalties. So to improve on this, we focused on the vector database, since that's where the data is stored, and using LlamaIndex's simple composable memory approach we can spread the data across multiple vector databases. The remainder of the pipeline we left unmodified, and it will be a focus of future work. The results showed that adding CXL devices to the system, to increase not only the capacity but also the memory bandwidth, achieved a 30% improvement in the number of requests per second from the databases, so this effectively increased the scalability of the databases. And because we have multiple databases, each with a different set of data, we can scale horizontally and vertically, and of course this also helped to improve the answer quality for users thanks to the richer retrieval context. The large memory capacity that we added can also be shared and pooled across multiple nodes, and this also improved the RAG performance, particularly on the GPU utilization side.
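The exact LlamaIndex wiring isn't shown in the talk, so rather than guess at its API, here is a library-free sketch of the retrieval step described above: query several vector stores, merge the nearest neighbors, and prepend them to the prompt. The toy embeddings and cosine search are illustrative assumptions; a real deployment would use a vector database and a real embedding model:

# Library-free sketch of naive RAG retrieval across multiple vector databases.
# Each "database" is just a list of (embedding, text) pairs for illustration.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, databases, k=3):
    # Merge nearest neighbors from every database and keep the global top-k.
    scored = []
    for db in databases:
        scored.extend((cosine(query_vec, vec), text) for vec, text in db)
    return [text for _, text in sorted(scored, reverse=True)[:k]]

def build_prompt(question, contexts):
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using this context:\n{context_block}\n\nQuestion: {question}"

if __name__ == "__main__":
    db1 = [([1.0, 0.0], "CXL expands memory capacity."), ([0.9, 0.1], "CXL adds bandwidth.")]
    db2 = [([0.0, 1.0], "GPUs run inference."), ([0.7, 0.3], "Vector DBs store embeddings.")]
    ctx = retrieve([1.0, 0.05], [db1, db2], k=2)
    print(build_prompt("How does CXL help RAG?", ctx))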
Now moving on to a different use case, here is the Samsung CMM-B memory pooling appliance, which enables flexible allocation of memory resources by supporting up to eight E3.S CXL memory modules in the appliance itself, and we can connect those directly to multiple hosts. For this use case, we ran the TPC-C workload against a MySQL database using our Memory Machine latency policy. After normalizing the results, we see up to a 60% increase in transactions per second and a decrease of up to 40% in query latency.
Now let's move on and talk about memory sharing. In this use case, we ran the TPC-DS benchmark against a Spark stack. TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including the queries and data maintenance. The benchmark includes around 100 queries that provide a representative evaluation of the performance of a general-purpose decision support system, and TPC-DS enables emerging technologies, such as big data systems, to execute the benchmark. That's why we chose Spark for this particular example. Each benchmark result measures the query response time and query throughput. Looking elsewhere in the stack, Alluxio is an open-source virtual distributed file system. Ordinarily, it would cache data locally on each node using its DRAM, so there might be multiple copies of the same data floating around, and the cache has to be handled per server. So we slightly modified the setup, replacing the Alluxio caching layer with a version that uses the Gizmo APIs and the shared memory from the CXL memory appliance, and this has many benefits. Not only does no data have to be cached locally in DRAM, which means each node could potentially reduce its DRAM capacity and lessen the server BOM cost, but the application can then use the full DRAM capacity, whereas before some of that DRAM capacity was used for the cache. Instead, cache data is written to Gizmo, so all of the connected nodes have access to exactly the same data and can byte-addressably access it directly in place, without any additional memory copy operations, which significantly reduces data duplication and improves access to the data. MemVerge's Gizmo product is a global, disk- and IO-free shared memory object store. It's based on a CXL multi-server shared-memory architecture, so data access and collaboration in distributed environments can be simplified, and it enables real-time data sharing using memory across multiple servers. When we map the same physical addresses into each of the two servers in this example, they're both accessing exactly the same data in exactly the same physical address range, and this has the benefit of reducing or eliminating the need for traditional network or disk IO to move data around the system as each application comes online, goes offline, or needs to move. In this example, server one runs the Spark primary and a worker node, server two runs just a worker node, and the shared CXL memory is provided by Montage, using their CXL memory expansion add-in card, sitting behind an XConn switch that is connected to both compute nodes. The observations we made from our testing showed some significant benefits: because Alluxio uses our Gizmo implementation, you don't have local node caching using expensive DRAM, and we removed or reduced the disk and network IO that would traditionally be handled by HDFS, because all of the data is accessible over the CXL memory fabric rather than Ethernet. We're improving the latency of each query, and by doing that we're also able to sustain higher queries or requests per second, and of course that all results in a reduced time to result or insight.
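Gizmo's internals aren't public in this talk, but the core idea of mapping the same CXL-backed region into multiple processes or hosts and accessing it byte-addressably in place can be sketched with a plain mmap. The /dev/dax0.0 path and the mapping size below are assumptions for a system that exposes the shared CXL region as a DAX device:

# Minimal sketch: map a (hypothetical) DAX device that exposes shared CXL
# memory and read/write it in place, with no disk or network IO involved.
# /dev/dax0.0 and the 2 MiB mapping size are assumptions for illustration.
import mmap
import os

DEVICE = "/dev/dax0.0"      # assumed device node for the shared CXL region
SIZE = 2 * 1024 * 1024      # DAX mappings are typically 2 MiB aligned

fd = os.open(DEVICE, os.O_RDWR)
region = mmap.mmap(fd, SIZE, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)

region[0:5] = b"hello"      # a producer mapping the region writes in place...
print(bytes(region[0:5]))   # ...a consumer mapping the same range reads it back

region.close()
os.close(fd)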
Now this is an identical software stack and setup to the previous use case; the exception here is that we're now using the Niagara appliance from SK hynix. The Niagara appliance is a prototype, so it's direct-connected versus the switched appliance we saw previously, and this allows numerous hosts to share or pool memory capacity from Niagara. When comparing the first nine results from the 100 queries we ran in the TPC-DS benchmark suite, with CXL memory from Niagara versus the Ethernet and DRAM solution, we do see Niagara is slightly slower, but really not that much slower. Given that Niagara is using PCIe Gen 3 FPGAs, the results are rather encouraging, and we definitely expect the performance to improve as the hardware evolves.
So thank you, I appreciate your time, and feel free to reach out with any questions at any time via memverge.com or our LinkedIn, and we'll be happy to follow up. Thank you.