YouTube: https://www.youtube.com/watch?v=eQoB4A49Wnc
Text:
Okay, good morning everyone. My name is Charles Fan and I'm co-founder and CEO of MemVerge. It's great to kick off this Memory Fabric Forum, with its strong focus on CXL. And I'm going to talk about Big Memory Computing for AI.
As all of us have been witnessing over the last year and a half, AI is seeing tremendous growth thanks to advances in large language models and generative AI. Over the last 10 years, it has gone through a couple of waves of breakthroughs, and new models keep coming out with ever more parameters, demanding more and more memory. As you see on this curve, the blue curve shows the different models, and the y-axis is the number of parameters on a logarithmic scale. As the number of parameters increases, the amount of memory required increases. The red line is the computation: GPUs are growing in compute capability as well as in memory, but their growth in computation is much faster than the growth of the high-bandwidth memory attached to those GPUs. So once we cross the dotted line, memory becomes the limiting factor, both memory bandwidth and memory capacity, and that demands a number of new memory technologies to be developed.
We believe that CXL will be one of these memory technologies that can really help in this area. And it will be part of the general transformation of computing architecture from the x86 era to the AI era. If we look at x86-era data centers, the three pillars, compute, connectivity, and data, are handled by x86 processors talking to DDR DRAM plugged into DIMM slots, interconnected by Ethernet primarily running TCP/IP networking, with most of the data stored on storage connected to that TCP/IP network, whether it's object storage, file storage, et cetera. The combination of the three forms the data center we have known over the last 20 years.
But in the AI era, while those three components continue to exist and will continue to exist for a long time, a new center of gravity has emerged where the AI workload takes place. Starting with compute, GPUs and other AI processors have taken over most of the computation for AI training and AI inference. And the memory is primarily supplied by the high-bandwidth memory that's directly attached to those GPUs and processors.
At the same time, a new AI fabric is emerging. For the NVIDIA ecosystem, it's NVLink, a point-to-point link interconnecting the different processors at very high bandwidth. There are also emerging open-system standards, such as Ultra Ethernet and CXL. A number of leaders in the industry are driving the advancement of CXL and the underlying PCIe protocols to increase the bandwidth and to provide an open-system way for these new processors to communicate with each other and with memory. Exactly how the AI fabric is going to turn out is a very interesting question that will unfold over the next five years. But what we know is that some of these technologies will dominate and create another fabric in addition to the Ethernet we know today.
And then there's a third leg: how is data going to be handled? Certainly storage will continue to exist, but it is our belief that there will be a new data layer enabled by memory-centric systems attached directly to this AI fabric. This layer of technology is going to deliver higher bandwidth and lower latency to all the AI processors, speaking memory-semantic protocols. And this completes the new center of gravity, which we believe will be the new AI computer where most of the AI workload will take place. So as we introduced, CXL is one of the leading technologies driving the AI fabric. And at MemVerge we are working on the big memory software that enables the memory-centric systems attached to that fabric, starting with CXL.
So now let's dive into CXL, the protocol. I'm sure the audience is already pretty familiar with it by now. Version 1 was published in 2019, with products starting to appear last year, and we're going to see production-ready memory expansion devices hitting the market this year, in 2024. Version 2 defined the pooling methods that allow disaggregation between memory and compute, enabling elastic memory on demand, and we are going to see the first POCs of memory appliances that support the version 2 protocol this year as well. Version 3 defines cache-coherent memory sharing as well as cascading of CXL switches, and that will appear in the next two to three years. So we're going to see a slew of exciting products coming to market, starting this year and over the next three years.
And what MemVerge is working on is the software to go along with the CXL hardware that is coming out: the expander cards as well as the CXL switches and CXL-attached memory systems. What we are introducing is a software product called Memory Machine X that contains two components. The first component is a memory tiering technology that enables optimal performance for server memory expansion use cases. The second component is a memory sharing technology that works on top of these fabric-attached memory systems, allowing multiple servers to share memory on top of CXL 2.0, where our software performs the necessary cache coherence as well as the application synchronization for a good subset of workloads. So now let me dive into each of these two components to see how they work.
First, server memory expansion.
This is a picture of a typical server with DDR DIMMs plugged in. The CXL memory expander comes in two forms. It can come in the form of AICs, add-in cards, from vendors like Astera Labs, Smart Modular, Montage, and others, where you plug regular DDR DIMMs, either DDR4 or DDR5, into the card and plug the card into the PCIe or CXL slots inside the server. The second form is the E3.S form factor, where the add-in memory looks much like an SSD and goes into the front slots of servers that support CXL. Each of these modules has its own benefits: E3.S is more compact and easier to manage, while with AICs you can use existing DIMMs, even ones recycled from older servers, and reuse that memory for expansion purposes. These products are becoming available in larger mass-production quantities this year, and you will see memory leaders like Samsung, in this case, as well as SK Hynix and Micron, all introducing products in the E3.S form factor.
So how do they perform compared to DDR memory? Here are some measurements we have done in our lab with a single CXL device running on PCIe Gen5 with eight lanes. We are seeing bandwidth around 50 gigabytes per second on the x-axis, while exhibiting end-to-end latency at that bandwidth of somewhere around 250 to 300 nanoseconds. The bandwidth is comparable to a single DDR5 DIMM, but there is a latency difference of roughly 2.5 to 3x. So when you're adding memory expanders to your system, depending on your goal, there are ways to optimize either the bandwidth or the latency of your hybrid memory system.
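The talk doesn't name the tool behind these measurements; purely as a point of reference, here is a minimal pointer-chasing sketch of the kind commonly used to estimate dependent-load latency on a single memory tier, assuming the CXL expander is exposed on Linux as a CPU-less NUMA node (node 1, the buffer size, and the use of libnuma are all assumptions, not part of the talk):

```c
/* latency_chase.c: rough estimate of dependent-load latency on one NUMA node.
 * Build: gcc -O2 latency_chase.c -lnuma -o latency_chase
 * Assumption: the CXL expander appears as NUMA node 1 (check `numactl -H`). */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODE 1                          /* assumed CXL node              */
#define N (32UL * 1024 * 1024)          /* 32M pointers = 256 MB region  */

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    void **buf = numa_alloc_onnode(N * sizeof(void *), NODE);
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    /* Build a random cyclic permutation so prefetchers can't hide latency. */
    size_t *idx = malloc(N * sizeof(size_t));
    for (size_t i = 0; i < N; i++) idx[i] = i;
    for (size_t i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < N; i++) buf[idx[i]] = (void *)&buf[idx[(i + 1) % N]];
    free(idx);

    /* Chase the chain: every load depends on the previous one. */
    struct timespec t0, t1;
    void **p = (void **)&buf[0];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)N;
    printf("node %d: ~%.0f ns per dependent load (end %p)\n", NODE, ns, (void *)p);
    numa_free(buf, N * sizeof(void *));
    return 0;
}
```

Bandwidth is usually measured separately with multi-threaded streaming reads and writes; a latency gap like the 2.5 to 3x quoted above is what a chase like this would make visible when run once against the DDR node and once against the CXL node.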
And that's where our tiering engine comes in. At the heart of our tiering engine is a policy engine that supports different policies. One set of policies optimizes bandwidth, spreading data between DDR memory and CXL memory in proportion to the bandwidth of each tier, so the overall bandwidth of the system can be fully utilized. Another set minimizes latency, placing the hotter data on the lower-latency tier and the colder data on the higher-latency tier. These policies are configurable. And we have done a number of performance measurements showing that this is superior to the default hardware interleaving settings or the kernel tiering software.
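The internals of the Memory Machine X policy engine aren't shown in the talk. As a rough illustration of the underlying mechanism a latency-oriented policy could build on, here is a sketch that migrates an already-identified "hot" buffer onto the DDR node and a "cold" buffer onto the CXL node using the Linux move_pages() system call; the node numbers, buffer roles, and the hot/cold classification itself are assumptions:

```c
/* tier_place.c: sketch of hot/cold page placement across a DDR NUMA node and
 * a CXL NUMA node using move_pages(2). Build: gcc -O2 tier_place.c -lnuma
 * Assumptions: DDR is node 0, the CXL expander is CPU-less node 1, and some
 * access-tracking logic (not shown) has already labeled the buffers. */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static long place_on_node(void *buf, size_t len, int node) {
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    void **pages  = malloc(npages * sizeof(void *));
    int   *nodes  = malloc(npages * sizeof(int));
    int   *status = malloc(npages * sizeof(int));
    for (size_t i = 0; i < npages; i++) {
        pages[i] = (char *)buf + i * page;   /* one entry per page   */
        nodes[i] = node;                     /* desired target node  */
    }
    /* Ask the kernel to migrate these pages of the current process. */
    long rc = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
    free(pages); free(nodes); free(status);
    return rc;
}

int main(void) {
    size_t len = 64UL * 1024 * 1024;           /* 64 MB per buffer    */
    char *hot  = malloc(len);                  /* frequently accessed */
    char *cold = malloc(len);                  /* rarely accessed     */
    memset(hot, 1, len); memset(cold, 2, len); /* fault the pages in  */

    if (place_on_node(hot, len, 0) != 0)
        fprintf(stderr, "hot buffer not fully migrated to DDR node 0\n");
    if (place_on_node(cold, len, 1) != 0)
        fprintf(stderr, "cold buffer not fully migrated to CXL node 1\n");

    /* A real policy engine would rescan access patterns periodically and
     * migrate pages again as they heat up or cool down. */
    free(hot); free(cold);
    return 0;
}
```

A bandwidth-oriented policy would instead interleave allocations across both nodes, for example via set_mempolicy() with MPOL_INTERLEAVE, or weighted interleaving where the kernel supports it.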
In fact, we have another session in this forum dedicated to this topic, happening tomorrow. From the software, not only do we have the policies, we also give you good visualization of the usage and performance of your memory, CPU, and GPU subsystems. And it supports multiple CPU platforms, including Intel and AMD.
And it supports almost all of the memory subsystems out there from the industry leaders.
As I mentioned, more details, including the performance numbers, will be presented by Steve Scargall at 12:30 p.m. Pacific time tomorrow. If you're interested, this software is available today, and we are also working with our hardware partners to recommend the right configurations for your system. You can contact my colleague for a potential POC.
All right. So that's the tiering software, the first module of our Memory Machine X.
Now let's go into the second module, the memory sharing module, which supports fabric-attached memory that allows memory to be shared across multiple nodes. In this case, you can have multiple hosts connecting to a single system. As part of this system, if you look at the picture on the left, there is a CXL switching controller that can support somewhere between four and eight servers connecting to this single box. Within this box there is also memory media, either AIC cards or E3.S modules, or memory directly on board. There are a number of partners presenting today and tomorrow who are working on such memory appliances. These memory appliances can be connected to multiple servers, and the memory can be provisioned on demand to those servers. More interestingly, at least physically, this allows the different servers to have visibility into the same region of memory. While there is still no hardware support for cache coherence, this can potentially allow applications or a middleware layer of software to do interesting things, given that this memory can be accessed by multiple servers at the same time. In the picture on the right, this can also be configured so that the switch box and memory box are separate from each other, where the switch box is just a number of CXL ports connecting to the servers on one side and to the memory appliances on the other. So that's another form factor that some of the hardware vendors are working on as well.
Okay, so now let's dive into the memory sharing case. Why is that case interesting? Because it's really related to one of the most fundamental problems in computer science: what is the best way for computing processes to send data to each other, or to share data among a group of processes?
Traditionally, there are three methods. Sorry for the formatting here, it's a little overlapping. The first is message passing. That's the fundamental method: it allows a process to send data to another process in a message. On a single node, this can take the form of a socket, a queue, or a pipe; there are different ways for one process to send data to another. The second method is shared storage. On a single node this is typically a file, where one process can open a file and write to it, and another process or processes can read from it. They can also write to the same file, and everyone can read from it as well. The file stays there and can be persisted. So that's another way for processes to share information: through a common storage unit such as a file. The third method is shared memory, which has long been well supported by operating systems. Within a single node, or within a cache-coherent domain, different processes running on the same node, even on different processors, can access and load/store to the same memory as long as they are in the same cache-coherent domain. That provides another highly performant way for different processes to share data. On a single node, all three methods are very popular, and developers can pick whichever one fits the specific requirements of their applications.
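As a concrete single-node illustration of the third method (generic POSIX, not tied to any MemVerge product), here is a minimal sketch where one process creates and writes a shared region and another maps it and reads it with plain loads; the segment name and size are arbitrary:

```c
/* shm_demo.c: single-node shared memory between two processes via POSIX shm.
 * Build: gcc -O2 shm_demo.c -lrt -o shm_demo
 * Run "./shm_demo writer" in one shell, then "./shm_demo reader" in another. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_region"   /* arbitrary segment name */
#define SHM_SIZE 4096

int main(int argc, char **argv) {
    int writer = (argc > 1 && strcmp(argv[1], "writer") == 0);

    /* Both sides open the same named segment; the writer creates and sizes it. */
    int fd = shm_open(SHM_NAME, writer ? (O_CREAT | O_RDWR) : O_RDONLY, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (writer && ftruncate(fd, SHM_SIZE) != 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, SHM_SIZE, writer ? (PROT_READ | PROT_WRITE) : PROT_READ,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (writer)
        snprintf(p, SHM_SIZE, "hello from pid %d", (int)getpid()); /* plain store */
    else
        printf("reader sees: \"%s\"\n", p);                        /* plain load  */

    munmap(p, SHM_SIZE);
    close(fd);
    return 0;
}
```

Within one cache-coherent node the hardware keeps both views consistent; the discussion that follows is about what extra work is needed once the sharing crosses node boundaries.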
But when you're talking about transmitting and sharing data across multiple nodes that are not in the same cache-coherent domain, only the first two methods are possible today. Message passing, of course, is the predominant way: you have a network fabric, perhaps Ethernet running TCP/IP, or InfiniBand, or RDMA; underneath, it's a message-passing mechanism. If you have multiple nodes that need to do collective communication, there are libraries such as MPI or NCCL that allow that collective communication to happen efficiently. Smart people have done a lot of great work to make message passing as easy and performant as possible for application developers. For some other cases, shared storage can be put to use. Shared storage lives on the network, so it sits on top of the message-passing fabric, but it places data into a common store that processes on multiple nodes can access. It has the benefit of keeping the data in one place, so you don't have to occupy space on each individual node, and you can persist it as needed. But it incurs higher latency and lower bandwidth; it's a lower-performance option. Shared memory, however, has not been possible. First of all, before CXL there was no way for different nodes to access the same physical memory. And also, the overhead of performing cache-coherence synchronization can often outweigh the benefit shared memory brings, especially if you're dealing with a general-purpose read/write workload.
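For contrast, the message-passing baseline across nodes looks like this minimal MPI broadcast (a generic example, not specific to any product in the talk), where each rank ends up holding its own copy of the data in local memory after a trip over the network fabric:

```c
/* bcast_demo.c: minimal one-to-N collective communication with MPI.
 * Build: mpicc -O2 bcast_demo.c -o bcast_demo
 * Run:   mpirun -np 4 ./bcast_demo */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[4] = {0};
    if (rank == 0)                         /* root rank produces the data */
        for (int i = 0; i < 4; i++) data[i] = i * 1.5;

    /* Every rank receives its own copy over the interconnect. */
    MPI_Bcast(data, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    printf("rank %d got data[3] = %.1f\n", rank, data[3]);

    MPI_Finalize();
    return 0;
}
```

Each non-root rank spends memory on a full copy and pays for the network hop; that is the cost the shared-memory approach discussed next tries to avoid.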
And this can potentially change with CXL. CXL provides the physical hardware foundation where multiple nodes can have access to the same memory unit. And with the right software, we believe that while it is not yet performant for general-purpose read/write workloads, there is a subset of workloads for which it is efficient to implement the right software layer to provide cache coherence, making shared memory a preferred method of sharing data across multiple nodes.
Now let's go into where this might be preferred. First of all, let's look at the benefit of shared memory over traditional message passing. With traditional message passing, node A writes to its local memory, transmits the data across a network, and then node B reads from node B's local memory. So there are three steps for data to go from A to B. If we take a simplistic picture of a shared memory system, it just involves node A writing to the shared memory and node B reading from the shared memory. The second step, the networking step, often the slowest step, is removed from the process. Therefore you achieve a performance advantage using shared memory compared to message passing over a classic network: you basically take the I/O out of the transmission. But there is a cost. If the two nodes are not in the same cache-coherent domain, you need to do cache coherence and coordination across the different nodes so that access is done correctly, particularly when the data is being written, updated, or deleted. This is where synchronization and cache coherence become very important. What we discovered is that it is possible to devise a single-writer, multi-reader system for memory sharing on top of CXL through a middleware memory layer, and that's the layer of software we are developing. This is particularly high performance when the read-to-write ratio of the data is high. The good news is that in today's world, there is more and more data where the read-to-write ratio is high. About 10 years ago, when I was working at VMware leading the big data effort, that's one observation we made. While classical transactional data, which still exists today, deals with the CRUD cycle, create, read, update, delete, the new big data, or today's AI data, sees fewer updates and fewer deletes. Usually data just keeps being created; people read from it and append to it, and there is a lot of pipelining of this data through different stages of processing. So it's more CRAP-type data than CRUD-type data, and the data keeps piling up. This type of data processing is actually quite a good fit, in many cases, for a shared memory system like the one we are proposing here. Since it's not a general-purpose shared memory system, it's not really ready to be implemented in the OS kernel yet, which leaves room for a layer of software like the Memory Machine X software to deliver this to the application, so the application has an easy-to-use API to enjoy the benefits a shared memory system can deliver. There are some other considerations: shared memory can be advantageous when you have one-to-N communication rather than one-to-one communication, and further efficiency can be achieved when the data is not easily shardable across nodes. More and more of this data needs to be shared, and when it is shared rather than kept as separate copies in different local memories, it also saves on memory cost because of the deduplication taking place. So there are a number of reasons why, for this subset of workloads, shared memory is preferable.
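MemVerge hasn't published the internals of its coherence layer in this talk, so the following is purely a conceptual sketch of a single-writer, multi-reader publication pattern over non-coherent shared memory: the writer bumps a version counter and explicitly writes its cache lines back, and readers drop their locally cached copies before reading and retry if the version changed mid-read (a seqlock-style scheme). The struct layout, flush strategy, and all names are assumptions, not Memory Machine X's actual design:

```c
/* swmr_sketch.c: conceptual single-writer, multi-reader publication over a
 * non-coherent shared region (seqlock-style). NOT MemVerge's design; layout,
 * flush strategy, and names are assumptions.
 * Build: gcc -O2 -mclflushopt -c swmr_sketch.c */
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define CL 64                               /* cache-line size */

typedef struct {
    _Alignas(CL) volatile uint64_t version; /* even = stable, odd = updating */
    _Alignas(CL) char payload[1024];        /* shared data                   */
} shared_obj_t;

/* Write back / drop every cache line of [p, p+len) so the far memory and
 * this CPU's cache agree on its contents. */
static void flush_range(const volatile void *p, size_t len) {
    for (const char *c = (const char *)p; c < (const char *)p + len; c += CL)
        _mm_clflushopt((void *)c);
    _mm_sfence();
}

/* The single writer: mark busy, update, flush, then publish. */
void publish(shared_obj_t *o, const void *data, size_t len) {
    o->version++;                            /* odd: readers will retry  */
    flush_range(&o->version, CL);
    memcpy(o->payload, data, len);
    flush_range(o->payload, len);
    o->version++;                            /* even: update is visible  */
    flush_range(&o->version, CL);
}

/* Any reader on any host: drop stale lines, copy out, retry on torn reads. */
int read_obj(shared_obj_t *o, void *out, size_t len) {
    for (int tries = 0; tries < 1000; tries++) {
        flush_range(o, sizeof *o);           /* invalidate local copies  */
        uint64_t v1 = o->version;
        if (v1 & 1) continue;                /* writer is mid-update     */
        memcpy(out, o->payload, len);
        flush_range(&o->version, CL);
        if (o->version == v1) return 0;      /* consistent snapshot      */
    }
    return -1;                               /* gave up                  */
}
```

The point of the sketch is the division of labor: a single writer means there are no write-write conflicts to arbitrate, and a high read-to-write ratio means the flush-and-retry overhead is paid rarely relative to the network round trips it replaces.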
At the same time, shared memory can be preferable to shared storage as well, for the obvious reason that the performance is higher. It does not persist data, though, so the sweet spot is intermediate, transient data that needs higher performance but doesn't need permanent persistence. For data that does need permanent persistence, shared memory can serve as a shared memory cache in front of the shared storage.
So this is what we have been developing over the last couple of years. We are going to present different APIs for different workloads, and Gizmo is one of the APIs we have developed. It stands for Global IO-Free Shared Memory Objects. It presents an object store that runs in memory; we provide an SDK we call the Gizmo Library, and we also have a Gizmo Manager that acts as a coordinator across multiple nodes. This allows different applications to create memory objects that are accessible by the other nodes: they can memory-map the objects and do load/store operations against that memory.
We have integrated it into multiple applications and application frameworks. One of the first ones we integrated is Ray, the AI application framework. We replaced the single-node object store in Ray with Gizmo and had it talk to shared CXL memory. While waiting for actual shared-memory hardware systems to become available, we ran some benchmarks in a software-emulated environment with Gizmo and Ray, and they demonstrate some performance improvements. Accessing a local object is as fast: accessing the shared memory is as fast as local memory. Accessing an object that previously lived on another node becomes 675% faster. And running a shuffle across a four-node cluster becomes 280% faster. So we are seeing initial indications of the performance advantage of using shared memory over a network message-passing system, because it eliminates the network I/O, reduces the number of copy operations, and makes more efficient use of memory, which can reduce the amount of spilling that would otherwise happen on the system.
And this Memory Machine X software will be available for POC in Q2. The API is an alpha API, meaning it will certainly continue to evolve as we hear additional requirements. We do have a number of customers lining up to test this library. If you are interested, again, contact my colleague on our customer representative team and we can set up a POC. We're working with hardware partners who are building the appliances; some of them will be presenting today and tomorrow, and we can coordinate and deliver the hardware-plus-software system pre-integrated for you to test out. So that's a quick introduction to the Memory Machine X software, which can optimize performance for memory expansion and enable single-writer, multi-reader memory sharing for fabric-attached memory systems.
Together, they form the Memory Machine X product line that serves the new CXL use cases. Just a quick introduction to MemVerge: we are a software company started in 2017, and Memory Machine X is the new product we are introducing this year. We also have a product called Memory Machine Cloud that enables cost savings on the cloud using our big memory technologies. We have customers in financial services, scientific computing, as well as cloud service provider markets. So thank you, everyone.