YouTube: https://www.youtube.com/watch?v=eQoB4A49Wnc
Text:
Okay, good morning everyone. My name is Charles Fan and I'm co-founder and CEO of MemVerge. It's great to kick off this Memory Fabric Forum, with its strong focus on CXL. And I'm going to talk about Big Memory Computing for AI.
As all of us have been witnessing over the last year and a half, AI is seeing tremendous growth thanks to advances in large language models and generative AI. Over the last 10 years, it has gone through a couple of waves of breakthroughs, and new models keep coming out with ever more parameters, demanding more and more memory. As you see on this curve, the blue curve shows the different models, and the y-axis is the number of parameters on a logarithmic scale. As the number of parameters increases, the amount of memory required increases. The red line is the computation: GPUs are growing in compute capability as well as in memory, but their growth in computation is much faster than the growth of the high-bandwidth memory attached to those GPUs. So once we cross the dotted line, memory becomes the limiting factor, both memory bandwidth and memory capacity, and that demands a number of new memory technologies to be developed.
We believe that CXL will be one of these memory technologies that can really help in this area. And it will be part of the general transformation of computing architecture from the x86 era to the AI era. If we look at x86-era data centers, the three pillars, compute, connectivity, and data, are handled by x86 processors talking to DDR DRAM plugged into DIMM slots, interconnected by Ethernet primarily running TCP/IP networking, with most of the data stored on storage connected to that TCP/IP network, whether it's object storage, file storage, et cetera. The combination of the three forms the data center we have known over the last 20 years.
But in the AI era, while those three components continue to exist and will continue to exist for a long time, a new center of gravity has emerged where the AI workload takes place. Starting with compute, GPUs and other AI processors have taken over most of the computation for AI training and AI inference. And the memory is primarily supplied by the high-bandwidth memory that's directly attached to those GPUs and processors.
At the same time, a new AI fabric is emerging. For the NVIDIA ecosystem, it's NVLink, a point-to-point link interconnecting the different processors at very high bandwidth. There are also emerging open-system standards, such as Ultra Ethernet and CXL. A number of leaders in the industry are driving the advancement of CXL and the underlying PCIe protocols to increase the bandwidth and to provide an open-system way for these new processors to communicate with each other and with memory. Exactly how the AI fabric is going to turn out is a very interesting question that will unfold over the next five years. But what we know is that some of these technologies will dominate and create another fabric in addition to the Ethernet we know today.
And then there's a third leg: how is data going to be handled? Certainly storage will continue to exist, but it is our belief that there will be a new data layer enabled by memory-centric systems attached directly to this AI fabric. This layer of technology is going to deliver higher bandwidth and lower latency to all the AI processors, speaking memory-semantic protocols. And this completes the new center of gravity, which we believe will be the new AI computer where most of the AI workload will take place. So as we introduced, CXL is one of the leading technologies driving the AI fabric. And at MemVerge we are working on the big memory software that enables the memory-centric systems attached to that fabric, starting with CXL.
So now let's dive into CXL, the protocol. I'm sure the audience is already pretty familiar with it by now. Version 1 was published in 2019, with products starting to appear last year, and we're going to see production-ready memory expansion devices hitting the market this year, in 2024. Version 2 defined the pooling methods that allow disaggregation between memory and compute, enabling elastic memory on demand, and we are going to see the first POCs of memory appliances that support the version 2 protocol this year as well. Version 3 defines cache-coherent memory sharing as well as cascading of CXL switches, and that will appear in the next two to three years. So we're going to see a slew of exciting products coming to market, starting this year and over the next three years.
And what MemVerge is working on is the software to go along with the CXL hardware that is coming out: the expander cards as well as the CXL switches and CXL-attached memory systems. What we are introducing is a software product called Memory Machine X that contains two components. The first component is a memory tiering technology that enables optimal performance for server memory expansion use cases. The second component is a memory sharing technology that works on top of these fabric-attached memory systems, allowing multiple servers to share memory on top of CXL 2.0, where our software performs the necessary cache coherence as well as the application synchronization for a good subset of workloads. So now let me dive into each of these two components to see how they work.
First, server memory expansion.
This is a picture of a typical server with DDR DIMMs plugged in. The CXL memory expander comes in two forms. It can come in the form of AICs, add-in cards, from vendors like Astera Labs, Smart Modular, Montage, and others, where you plug regular DDR DIMMs, either DDR4 or DDR5, into the card and plug the card into the PCIe or CXL slots inside the server. The second form is the E3.S form factor, where the add-in memory looks much like an SSD and goes into the front slots of servers that support CXL. Each of these modules has its own benefits: E3.S is more compact and easier to manage, while with AICs you can use existing DIMMs, even ones recycled from older servers, and reuse that memory for expansion purposes. These products are becoming available in larger mass-production quantities this year, and you will see memory leaders like Samsung, in this case, as well as SK Hynix and Micron, all introducing products in the E3.S form factor.
So how do they perform compared to DDR memory? Here are some measurements we have done in our lab with a single CXL device running on PCIe Gen5 with eight lanes. We are seeing bandwidth around 50 gigabytes per second on the x-axis, while exhibiting end-to-end latency at that bandwidth of somewhere around 250 to 300 nanoseconds. The bandwidth is comparable to a single DDR5 DIMM, but there is a latency difference of roughly 2.5 to 3x. So when you're adding memory expanders to your system, depending on your goal, there are ways to optimize either the bandwidth or the latency of your hybrid memory system.
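The talk doesn't name the tool behind these measurements; purely as a point of reference, here is a minimal pointer-chasing sketch of the kind commonly used to estimate dependent-load latency on a single memory tier, assuming the CXL expander is exposed on Linux as a CPU-less NUMA node (node 1, the buffer size, and the use of libnuma are all assumptions, not part of the talk):

```c
/* latency_chase.c: rough estimate of dependent-load latency on one NUMA node.
 * Build: gcc -O2 latency_chase.c -lnuma -o latency_chase
 * Assumption: the CXL expander appears as NUMA node 1 (check `numactl -H`). */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODE 1                          /* assumed CXL node              */
#define N (32UL * 1024 * 1024)          /* 32M pointers = 256 MB region  */

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    void **buf = numa_alloc_onnode(N * sizeof(void *), NODE);
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    /* Build a random cyclic permutation so prefetchers can't hide latency. */
    size_t *idx = malloc(N * sizeof(size_t));
    for (size_t i = 0; i < N; i++) idx[i] = i;
    for (size_t i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < N; i++) buf[idx[i]] = (void *)&buf[idx[(i + 1) % N]];
    free(idx);

    /* Chase the chain: every load depends on the previous one. */
    struct timespec t0, t1;
    void **p = (void **)&buf[0];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)N;
    printf("node %d: ~%.0f ns per dependent load (end %p)\n", NODE, ns, (void *)p);
    numa_free(buf, N * sizeof(void *));
    return 0;
}
```

Bandwidth is usually measured separately with multi-threaded streaming reads and writes; a latency gap like the 2.5 to 3x quoted above is what a chase like this would make visible when run once against the DDR node and once against the CXL node.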
And that's where our tiering engine comes in. At the heart of our tiering engine is a policy engine that supports different policies. One set of policies optimizes bandwidth, spreading data between DDR memory and CXL memory in proportion to the bandwidth of each tier, so the overall bandwidth of the system can be fully utilized. Another set minimizes latency, placing the hotter data on the lower-latency tier and the colder data on the higher-latency tier. These policies are configurable. And we have done a number of performance measurements showing that this is superior to the default hardware interleaving settings or the kernel tiering software.
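The internals of the Memory Machine X policy engine aren't shown in the talk. As a rough illustration of the underlying mechanism a latency-oriented policy could build on, here is a sketch that migrates an already-identified "hot" buffer onto the DDR node and a "cold" buffer onto the CXL node using the Linux move_pages() system call; the node numbers, buffer roles, and the hot/cold classification itself are assumptions:

```c
/* tier_place.c: sketch of hot/cold page placement across a DDR NUMA node and
 * a CXL NUMA node using move_pages(2). Build: gcc -O2 tier_place.c -lnuma
 * Assumptions: DDR is node 0, the CXL expander is CPU-less node 1, and some
 * access-tracking logic (not shown) has already labeled the buffers. */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static long place_on_node(void *buf, size_t len, int node) {
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    void **pages  = malloc(npages * sizeof(void *));
    int   *nodes  = malloc(npages * sizeof(int));
    int   *status = malloc(npages * sizeof(int));
    for (size_t i = 0; i < npages; i++) {
        pages[i] = (char *)buf + i * page;   /* one entry per page   */
        nodes[i] = node;                     /* desired target node  */
    }
    /* Ask the kernel to migrate these pages of the current process. */
    long rc = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
    free(pages); free(nodes); free(status);
    return rc;
}

int main(void) {
    size_t len = 64UL * 1024 * 1024;           /* 64 MB per buffer    */
    char *hot  = malloc(len);                  /* frequently accessed */
    char *cold = malloc(len);                  /* rarely accessed     */
    memset(hot, 1, len); memset(cold, 2, len); /* fault the pages in  */

    if (place_on_node(hot, len, 0) != 0)
        fprintf(stderr, "hot buffer not fully migrated to DDR node 0\n");
    if (place_on_node(cold, len, 1) != 0)
        fprintf(stderr, "cold buffer not fully migrated to CXL node 1\n");

    /* A real policy engine would rescan access patterns periodically and
     * migrate pages again as they heat up or cool down. */
    free(hot); free(cold);
    return 0;
}
```

A bandwidth-oriented policy would instead interleave allocations across both nodes, for example via set_mempolicy() with MPOL_INTERLEAVE, or weighted interleaving where the kernel supports it.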
In fact, we have another session in this forum dedicated to this topic, happening tomorrow. From the software, not only do we have the policies, we also give you good visualization of the usage and performance of your memory, CPU, and GPU subsystems. And it supports multiple CPU platforms, including Intel and AMD.
And it supports almost all of the memory subsystems out there from the industry leaders.
As I mentioned, more details, including the performance numbers, will be presented by Steve Scargall at 12:30 p.m. Pacific time tomorrow. If you're interested, this software is available today, and we are also working with our hardware partners to recommend the right configurations for your system. You can contact my colleague for a potential POC.
All right. So that's the tiering software, the first module of our Memory Machine X.
Now let's go into the second module, the memory sharing module, which supports fabric-attached memory that allows memory to be shared across multiple nodes. In this case, you can have multiple hosts connecting to a single system. As part of this system, if you look at the picture on the left, there is a CXL switching controller that can support somewhere between four and eight servers connecting to this single box. Within this box there is also memory media, either AIC cards or E3.S modules, or memory directly on board. There are a number of partners presenting today and tomorrow who are working on such memory appliances. These memory appliances can be connected to multiple servers, and the memory can be provisioned on demand to those servers. More interestingly, at least physically, this allows the different servers to have visibility into the same region of memory. While there is still no hardware support for cache coherence, this can potentially allow applications or a middleware layer of software to do interesting things, given that this memory can be accessed by multiple servers at the same time. In the picture on the right, this can also be configured so that the switch box and memory box are separate from each other, where the switch box is just a number of CXL ports connecting to the servers on one side and to the memory appliances on the other. So that's another form factor that some of the hardware vendors are working on as well.
Okay, so now let's dive into the memory sharing case. Why is that case interesting? Because it's really related to one of the most fundamental problems in computer science: what is the best way for computing processes to send data to each other, or to share data among a group of processes?
Traditionally, there are three methods. Sorry for the formatting here, it's a little overlapping. The first is message passing. That's the fundamental method: it allows a process to send data to another process in a message. On a single node, this can take the form of a socket, a queue, or a pipe; there are different ways for one process to send data to another. The second method is shared storage. On a single node this is typically a file, where one process can open a file and write to it, and another process or processes can read from it. They can also write to the same file, and everyone can read from it as well. The file stays there and can be persisted. So that's another way for processes to share information: through a common storage unit such as a file. The third method is shared memory, which has long been well supported by operating systems. Within a single node, or within a cache-coherent domain, different processes running on the same node, even on different processors, can access and load/store to the same memory as long as they are in the same cache-coherent domain. That provides another highly performant way for different processes to share data. On a single node, all three methods are very popular, and developers can pick whichever one fits the specific requirements of their applications.
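As a concrete single-node illustration of the third method (generic POSIX, not tied to any MemVerge product), here is a minimal sketch where one process creates and writes a shared region and another maps it and reads it with plain loads; the segment name and size are arbitrary:

```c
/* shm_demo.c: single-node shared memory between two processes via POSIX shm.
 * Build: gcc -O2 shm_demo.c -lrt -o shm_demo
 * Run "./shm_demo writer" in one shell, then "./shm_demo reader" in another. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_region"   /* arbitrary segment name */
#define SHM_SIZE 4096

int main(int argc, char **argv) {
    int writer = (argc > 1 && strcmp(argv[1], "writer") == 0);

    /* Both sides open the same named segment; the writer creates and sizes it. */
    int fd = shm_open(SHM_NAME, writer ? (O_CREAT | O_RDWR) : O_RDONLY, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (writer && ftruncate(fd, SHM_SIZE) != 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, SHM_SIZE, writer ? (PROT_READ | PROT_WRITE) : PROT_READ,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (writer)
        snprintf(p, SHM_SIZE, "hello from pid %d", (int)getpid()); /* plain store */
    else
        printf("reader sees: \"%s\"\n", p);                        /* plain load  */

    munmap(p, SHM_SIZE);
    close(fd);
    return 0;
}
```

Within one cache-coherent node the hardware keeps both views consistent; the discussion that follows is about what extra work is needed once the sharing crosses node boundaries.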
But when you're talking about transmitting and sharing data across multiple nodes that are not in the same cache-coherent domain, only the first two methods are possible today. Message passing, of course, is the predominant way: you have a network fabric, perhaps Ethernet running TCP/IP, or InfiniBand, or RDMA; underneath, it's a message-passing mechanism. If you have multiple nodes that need to do collective communication, there are libraries such as MPI or NCCL that allow that collective communication to happen efficiently. Smart people have done a lot of great work to make message passing as easy and performant as possible for application developers. For some other cases, shared storage can be put to use. Shared storage lives on the network, so it sits on top of the message-passing fabric, but it places data into a common store that processes on multiple nodes can access. It has the benefit of keeping the data in one place, so you don't have to occupy space on each individual node, and you can persist it as needed. But it incurs higher latency and lower bandwidth; it's a lower-performance option. Shared memory, however, has not been possible. First of all, before CXL there was no way for different nodes to access the same physical memory. And also, the overhead of performing cache-coherence synchronization can often outweigh the benefit shared memory brings, especially if you're dealing with a general-purpose read/write workload.
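For contrast, the message-passing baseline across nodes looks like this minimal MPI broadcast (a generic example, not specific to any product in the talk), where each rank ends up holding its own copy of the data in local memory after a trip over the network fabric:

```c
/* bcast_demo.c: minimal one-to-N collective communication with MPI.
 * Build: mpicc -O2 bcast_demo.c -o bcast_demo
 * Run:   mpirun -np 4 ./bcast_demo */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[4] = {0};
    if (rank == 0)                         /* root rank produces the data */
        for (int i = 0; i < 4; i++) data[i] = i * 1.5;

    /* Every rank receives its own copy over the interconnect. */
    MPI_Bcast(data, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    printf("rank %d got data[3] = %.1f\n", rank, data[3]);

    MPI_Finalize();
    return 0;
}
```

Each non-root rank spends memory on a full copy and pays for the network hop; that is the cost the shared-memory approach discussed next tries to avoid.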
And this can potentially change with CXL. CXL provides the physical hardware foundation where multiple nodes can have access to the same memory unit. And with the right software, we believe that while it is not yet performant for general-purpose read/write workloads, there is a subset of workloads for which it is efficient to implement the right software layer to provide cache coherence, making shared memory a preferred method of sharing data across multiple nodes.
Now let's go into where this might be preferred. First of all, let's look at the benefit of shared memory over traditional message passing. With traditional message passing, node A writes to its local memory, transmits the data across a network, and then node B reads from node B's local memory. So there are three steps for data to go from A to B. If we take a simplistic picture of a shared memory system, it just involves node A writing to the shared memory and node B reading from the shared memory. The second step, the networking step, often the slowest step, is removed from the process. Therefore you achieve a performance advantage using shared memory compared to message passing over a classic network: you basically take the I/O out of the transmission. But there is a cost. If the two nodes are not in the same cache-coherent domain, you need to do cache coherence and coordination across the different nodes so that access is done correctly, particularly when the data is being written, updated, or deleted. This is where synchronization and cache coherence become very important. What we discovered is that it is possible to devise a single-writer, multi-reader system for memory sharing on top of CXL through a middleware memory layer, and that's the layer of software we are developing. This is particularly high performance when the read-to-write ratio of the data is high. The good news is that in today's world, there is more and more data where the read-to-write ratio is high. About 10 years ago, when I was working at VMware leading the big data effort, that's one observation we made. While classical transactional data, which still exists today, deals with the CRUD cycle, create, read, update, delete, the new big data, or today's AI data, sees fewer updates and fewer deletes. Usually data just keeps being created; people read from it and append to it, and there is a lot of pipelining of this data through different stages of processing. So it's more CRAP-type data than CRUD-type data, and the data keeps piling up. This type of data processing is actually quite a good fit, in many cases, for a shared memory system like the one we are proposing here. Since it's not a general-purpose shared memory system, it's not really ready to be implemented in the OS kernel yet, which leaves room for a layer of software like the Memory Machine X software to deliver this to the application, so the application has an easy-to-use API to enjoy the benefits a shared memory system can deliver. There are some other considerations: shared memory can be advantageous when you have one-to-N communication rather than one-to-one communication, and further efficiency can be achieved when the data is not easily shardable across nodes. More and more of this data needs to be shared, and when it is shared rather than kept as separate copies in different local memories, it also saves on memory cost because of the deduplication taking place. So there are a number of reasons why, for this subset of workloads, shared memory is preferable.
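MemVerge hasn't published the internals of its coherence layer in this talk, so the following is purely a conceptual sketch of a single-writer, multi-reader publication pattern over non-coherent shared memory: the writer bumps a version counter and explicitly writes its cache lines back, and readers drop their locally cached copies before reading and retry if the version changed mid-read (a seqlock-style scheme). The struct layout, flush strategy, and all names are assumptions, not Memory Machine X's actual design:

```c
/* swmr_sketch.c: conceptual single-writer, multi-reader publication over a
 * non-coherent shared region (seqlock-style). NOT MemVerge's design; layout,
 * flush strategy, and names are assumptions.
 * Build: gcc -O2 -mclflushopt -c swmr_sketch.c */
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define CL 64                               /* cache-line size */

typedef struct {
    _Alignas(CL) volatile uint64_t version; /* even = stable, odd = updating */
    _Alignas(CL) char payload[1024];        /* shared data                   */
} shared_obj_t;

/* Write back / drop every cache line of [p, p+len) so the far memory and
 * this CPU's cache agree on its contents. */
static void flush_range(const volatile void *p, size_t len) {
    for (const char *c = (const char *)p; c < (const char *)p + len; c += CL)
        _mm_clflushopt((void *)c);
    _mm_sfence();
}

/* The single writer: mark busy, update, flush, then publish. */
void publish(shared_obj_t *o, const void *data, size_t len) {
    o->version++;                            /* odd: readers will retry  */
    flush_range(&o->version, CL);
    memcpy(o->payload, data, len);
    flush_range(o->payload, len);
    o->version++;                            /* even: update is visible  */
    flush_range(&o->version, CL);
}

/* Any reader on any host: drop stale lines, copy out, retry on torn reads. */
int read_obj(shared_obj_t *o, void *out, size_t len) {
    for (int tries = 0; tries < 1000; tries++) {
        flush_range(o, sizeof *o);           /* invalidate local copies  */
        uint64_t v1 = o->version;
        if (v1 & 1) continue;                /* writer is mid-update     */
        memcpy(out, o->payload, len);
        flush_range(&o->version, CL);
        if (o->version == v1) return 0;      /* consistent snapshot      */
    }
    return -1;                               /* gave up                  */
}
```

The point of the sketch is the division of labor: a single writer means there are no write-write conflicts to arbitrate, and a high read-to-write ratio means the flush-and-retry overhead is paid rarely relative to the network round trips it replaces.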
At the same time, shared memory can be preferable to shared storage as well, for the obvious reason that the performance is higher. It does not persist data, though, so the sweet spot is intermediate, transient data that needs higher performance but doesn't need permanent persistence. For data that does need permanent persistence, shared memory can serve as a shared memory cache in front of the shared storage.
So this is what we have been developing over the last couple of years. We are going to present different APIs for different workloads, and Gizmo is one of the APIs we have developed. It stands for Global IO-Free Shared Memory Objects. It presents an object store that runs in memory; we provide an SDK we call the Gizmo Library, and we also have a Gizmo Manager that acts as a coordinator across multiple nodes. This allows different applications to create memory objects that are accessible by the other nodes: they can memory-map the objects and do load/store operations against that memory.
We have integrated it into multiple applications and application frameworks. One of the first ones we integrated is Ray, the AI application framework. We replaced the single-node object store in Ray with Gizmo and had it talk to shared CXL memory. While waiting for actual shared-memory hardware systems to become available, we ran some benchmarks in a software-emulated environment with Gizmo and Ray, and they demonstrate some performance improvements. Accessing a local object is as fast: accessing the shared memory is as fast as local memory. Accessing an object that previously lived on another node becomes 675% faster. And running a shuffle across a four-node cluster becomes 280% faster. So we are seeing initial indications of the performance advantage of using shared memory over a network message-passing system, because it eliminates the network I/O, reduces the number of copy operations, and makes more efficient use of memory, which can reduce the amount of spilling that would otherwise happen on the system.
And this Memory Machine X software will be available for POC in Q2. The API is an alpha API, meaning it will certainly continue to evolve as we hear additional requirements. We do have a number of customers lining up to test this library. If you are interested, again, contact my colleague on our customer representative team and we can set up a POC. We're working with hardware partners who are building the appliances; some of them will be presenting today and tomorrow, and we can coordinate and deliver the hardware-plus-software system pre-integrated for you to test out. So that's a quick introduction to the Memory Machine X software, which can optimize performance for memory expansion and enable single-writer, multi-reader memory sharing for fabric-attached memory systems.
Together, they form the Memory Machine X product line that serves the new CXL use cases. Just a quick introduction to MemVerge: we are a software company started in 2017, and Memory Machine X is the new product we are introducing this year. We also have a product called Memory Machine Cloud that enables cost savings on the cloud using our big memory technologies. We have customers in financial services, scientific computing, as well as cloud service provider markets. So thank you, everyone.