Welcome, everyone. This is Siamak Tavallaei, and I have my colleague here.
Yeah, my name is Jinin.
We're going to split the time. I will talk for about 10 minutes overall, covering what our product is and how CXL can improve it. Jinin will then talk about the overall effort we have with partners and ecosystem developers across the world.
Welcome! You know, everybody has been talking about how AI is changing the landscape for memory.
You went through the keynote speech today. By now, everybody knows that the requirements for systems—whether there are accelerators, networking devices, or memory—are growing very fast because the expectations we have of these artificial intelligence systems are growing. We are humans; our brains have developed over a billion years. We expect similar responses from AI. For that, the size of the data needed to feed all of these computation elements is growing. And because of this, smart people are developing memory models, algorithms, AI models, and high-performance computing models that are variable. They change over time. Every six months, they double the required number of bytes or data packets they need. These models are coming from different smart people, different companies, and different innovators, all developing at the same time. Based on all of that, we are providing bigger, more varied services on top of it all.
Okay. What does that all mean to memory? These computation elements are growing fast. They need to be fed by data, and data needs to come from sample observations. It needs to come through the network and, eventually, needs to land on memory before it gets processed. So, that's why we need more and more memory.
Now, as systems are developing around AI, expectations of these solutions are getting bigger and bigger. Data sizes, instead of being 4K pages, are reaching 20-megabyte high-resolution pictures or hundreds of megabytes for one minute of video frames. So, all in all, the requirement for data is going up. It is not just static data; data needs to come in, and it needs to come in fast. Therefore, the bandwidth required to bring data into memory is going up as well. And what we see is that capacity is required alongside high bandwidth—not only bandwidth, but also capacity.
Okay, so what does all that mean? We are used to having a hierarchy of memory organized by response time—basically latency—and capacity, and the ratio of the two is essentially a bandwidth requirement: how fast can things come in? So we have two tiers of requirements: a capacity tier and a bandwidth tier. Traditionally, we have fed all of that with wide buses—DDR buses—and nowadays, with HBM (high-bandwidth memory), we are providing a lot of bandwidth. But because capacity requirements are growing, we need other solutions. High-bandwidth devices are normally small, and they need to be close to the computational elements. As you know, we sometimes trade off space for time: when we want something done fast, we run it in parallel, and because we run it in parallel, we need more space. The interplay between the time, space, and energy required to do all of that is the physics behind why we are here and what we are talking about.

So, new technologies are coming. CXL, for example, is an interconnect to memory that adds capability on both bandwidth and capacity. That is where we can bring more memory closer to the computational elements. Now, you have seen the specifications and presentations on CXL, so there is no need to discuss the specification itself. All that needs to be said is that CXL has all the good attributes of a successful technology. How? It is built on the already understood PCI infrastructure—first, do no harm. A lot of companies are joining the consortium—more than 260 have joined already—and it takes work away from developers because it is consistent with the load/store programming model that people need. All of those are good attributes for a technology that can succeed.
Now, what do we do with it? At Samsung, we really believe in CXL's success, and it is more than just specifications. We are part of the CXL Consortium, and beyond the specification, we have products. We have gone all-in with CXL-connected memory modules. Another very important factor for a successful technology is that it plugs easily into current servers and systems. For that, we have adopted standard form factors; E3.S, for example, is a form factor for CXL. DIMMs are already a good solution—on one DIMM memory module we can have 40 DRAM devices, and that is a wonderful solution. To the extent that locally attached memory satisfies the need, we do have DIMMs.
But when we need more than that, CXL comes in to help expand the space we have for DRAM devices. For that, Samsung has a number of product solutions, and we will go through them one at a time: CMM-D (CXL Memory Module – DRAM); CMM-H (H for Hybrid, combining DRAM and NAND flash), targeted at tiered memory; and CMM-H PM, targeted at persistent memory.
So, engineers within Samsung have done a tremendously good job of packing 80 DRAM devices onto one standard module. An E3.S module with 80 DRAM devices, using current technology of 16 gigabits per die—or 32 gigabits per die—can easily provide 128 GB or 256 GB of capacity. What that means is that with four such modules, one normal server can have a terabyte of memory in addition to what is already locally attached to the CPU.
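As a sanity check on those capacity figures, here is one plausible accounting—an assumption on our part, not a figure from the talk: in a standard DDR5-style ECC arrangement, two of every ten devices carry ECC rather than data, leaving 64 of the 80 dies for usable capacity:

$$
64 \times 16\ \text{Gb} = 1024\ \text{Gb} = 128\ \text{GB},
\qquad
64 \times 32\ \text{Gb} = 2048\ \text{Gb} = 256\ \text{GB}.
$$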
Now, another interesting technology here is to introduce not just DRAM but also NAND flash as part of the memory hierarchy. In that case, CMM-H is suitable. An FPGA device translates CXL cycles into what a flash array requires and presents NAND flash to the operating system as though it were DRAM. It moves the data from flash into DRAM, and DRAM is accessed by the CPU. In that model, with only 64 gigabytes of DRAM, this package can introduce the equivalent of one terabyte or up to two terabytes of memory footprint to the operating system. These solutions are good for large databases and scenarios that require a very large memory footprint.
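To make the mechanism concrete, below is a loose software analogy for what the FPGA does in hardware: a small DRAM tier caching lines of a much larger flash tier, invisible to software that simply issues loads and stores. The sizes, names, and the direct-mapped policy are all illustrative assumptions, not Samsung's actual design.

```c
/* Loose software analogy for CMM-H: a DRAM tier caching lines of a
 * larger flash tier. In the real device this logic lives in the FPGA;
 * software just performs ordinary loads and stores. All parameters
 * and the direct-mapped policy are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 4096u           /* assumed migration granularity */
#define N_LINES   (1u << 14)      /* ~64 MiB of DRAM cache in this toy */

struct cache_line {
    uint64_t tag;                 /* which flash line is resident */
    int      valid;
    uint8_t  data[LINE_SIZE];
};

static struct cache_line dram_cache[N_LINES];

/* Stub for the flash back end: a real device reads NAND here. */
static void flash_read(uint64_t line, uint8_t *dst)
{
    (void)line;
    memset(dst, 0, LINE_SIZE);
}

/* A load that misses in the DRAM tier first fills from flash. */
uint8_t cmm_h_load(uint64_t addr)
{
    uint64_t line = addr / LINE_SIZE;
    struct cache_line *cl = &dram_cache[line % N_LINES];

    if (!cl->valid || cl->tag != line) {   /* miss: migrate the line */
        flash_read(line, cl->data);
        cl->tag = line;
        cl->valid = 1;
    }
    return cl->data[addr % LINE_SIZE];     /* hit: served from DRAM */
}
```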
Now, another model is when we want DRAM for its properties—responsiveness, low latency, and high bandwidth—but we would also like persistence; in other words, non-volatility. In that model, CXL can help disassociate the requirements for DRAM and battery backup. Battery-backed DRAM is controlled by an FPGA device, and the interface is CXL. From the system's point of view, it looks like a load/store DRAM solution, but the battery backup makes it persistent, so it can serve as a storage solution. Databases can benefit from that, and we have a number of customers for it.
Now, putting it all together, Samsung is also working on a larger solution. We call it CMM-B—B for box—a consolidated box of power, cooling, interface, and interconnect. It houses several of these devices, which could be CMM-D or CMM-H, to provide up to 16 or 24 terabytes of memory.
Now, the beauty of that is, when we have a system that is standalone, you can have different processors, GPUs, or accelerators plugged into it. The underlying memory or medium could be whatever makes sense. It could be DDR4, DDR5, Flash, or a hybrid of those. Therefore, that provides very good investment protection for multiple companies.
So, with all of that, I will give the floor back to Jinin. He's going to tell us about the collaborations we've been doing with our partners.
Thank you, Siamak. Hello, my name is Jinin So, head of the CXL System Architecture Group in Samsung Memory Division. It's my honor to share our collaboration status with industry partners at OCP. From now on, I will discuss success cases and future plans where Samsung collaborates with the industry ecosystem to demonstrate the value of CXL technology.
First, let me introduce the SMRC, the Samsung Memory Research Center, an OCP experience center located in Samsung Korea. SMRC is a compact data center where Samsung's customers and partners can collaborate. It offers an infrastructure that combines the latest server, network, and memory devices, providing a remote environment for the industry ecosystem to collaborate. For this collaboration, SMRC presents CXL reference architectures.
At SMRC, Samsung's current and next-generation memory products are available, enabling customers to evaluate their workloads or services on the latest, industry-leading, partner-built infrastructure. These collaborations generate critical requirements for the participating partners, creating a virtuous cycle in which the products are integrated into systems and maximize customer value. Currently, many customers, including SAP, Uber, Dell, and Synopsys, are developing various uses of CXL memory at SMRC.
Software support is just as crucial as hardware reliability. To provide a reliable software environment for CXL devices, Samsung collaborated with Red Hat, a leader in enterprise solutions. In May, Samsung became the world’s first company to receive CXL product certification from Red Hat. Additionally, in October, Samsung enabled CXL devices on Red Hat's container product, OpenShift, by developing operator software. This means you can use Samsung CXL devices in Red Hat OpenShift container instances. In other words, customers can evaluate Samsung CXL devices certified on the Red Hat Enterprise OS environment to ensure high data reliability.
In cooperation with the industry ecosystem, Samsung is developing and refining CXL reference architectures to ensure a good customer experience. These architectures follow the CXL specifications: Gen 1, a direct-attached type, is already developed and in use by several customers and academic institutions; Gen 2, focused on CXL pooling, is anticipated to be ready by early 2025; and Gen 3, which expands into a fabric, is expected to be completed by the end of 2025.
Regarding CXL reference architecture Gen 1, we collaborated with Supermicro to co-develop a system based on a CPU that supports the CXL 1.1 spec. This platform supports the latest CXL-enabled CPUs and memory solutions, offering up to two terabytes of DDR5. CXL memory is provided either as an AIC card on the board or as up to four CXL E3.S cards on the backplane, offering more than one terabyte of CXL memory. Additionally, various tiering and monitoring software, such as weighted interleaving on the Red Hat OS and Samsung SMDK, are implemented in the software stack to use CXL effectively. This setup provides an optimal environment for customers to evaluate the bandwidth and capacity expansion effects of CXL memory without modifying their software.
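Because a direct-attached expander shows up as ordinary system RAM—typically a CPU-less NUMA node on Linux—a workload can use it with no code changes, or bind to it explicitly. Here is a minimal sketch using libnuma, assuming the CXL memory is enumerated as NUMA node 2 (the actual node number varies by platform; check `numactl --hardware`):

```c
/* Minimal sketch: place a buffer on CXL-attached memory via libnuma.
 * Assumes the expander is enumerated as NUMA node 2; the real node
 * number varies by platform. Build: gcc cxl_alloc.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available\n");
        return 1;
    }

    const int    cxl_node = 2;            /* assumed CXL NUMA node */
    const size_t size     = 1UL << 30;    /* 1 GiB test buffer */

    void *buf = numa_alloc_onnode(size, cxl_node);
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }

    memset(buf, 0, size);  /* touch pages so they land on that node */
    /* ... from here on, plain load/store access: no special API ... */

    numa_free(buf, size);
    return 0;
}
```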
Here's a customer success story using CXL reference architecture Gen 1. RAG is a solution to the hallucination problem of large language models. The fundamental computation in RAG is extracting similar vectors from a vector database via vector inner products, using KNN or ANN search. The vector database is large—terabyte-class—and cannot be loaded entirely into CPU main memory, so performance degrades due to frequent accesses to SSDs. By adopting Samsung CXL memory, the customer was able to process RAG in memory, resulting in a 2.5x faster execution time compared to the existing DDR-only server system. In addition, various tiering techniques are being researched with Samsung to further improve RAG performance.
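For reference, the kernel at the heart of that workload—scoring a query against every stored vector by inner product and keeping the best match, i.e. exact, brute-force KNN—is sketched below. With a terabyte-class database, the `db` array is precisely the data that would live in CXL-expanded memory; the names and shapes are illustrative.

```c
/* Brute-force (exact) top-1 KNN by inner product, the core kernel of
 * the RAG retrieval described above. With a terabyte-class database,
 * the `db` array is what would reside in CXL-expanded memory. */
#include <stddef.h>

static float dot(const float *a, const float *b, size_t dim)
{
    float s = 0.0f;
    for (size_t i = 0; i < dim; i++)
        s += a[i] * b[i];
    return s;
}

/* Returns the index of the database vector most similar to `query`. */
size_t knn_top1(const float *db, size_t n_vecs, size_t dim,
                const float *query)
{
    size_t best = 0;
    float  best_score = dot(db, query, dim);

    for (size_t v = 1; v < n_vecs; v++) {
        float s = dot(db + v * dim, query, dim);
        if (s > best_score) {
            best_score = s;
            best = v;
        }
    }
    return best;
}
```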
Secondly, here's a performance improvement case in a relational database management system. We are conducting joint research with major DB vendors on the use cases and value of CXL memory in DB applications. DB applications use various software-based cache structures to enhance performance. Among these, the cache layer, which typically uses relatively fast SSDs, was migrated to Samsung CXL memory devices without requiring any software code changes. Using CXL memory as the cache media instead of SSDs resulted in up to 2.6 times performance improvement in the TPC-H benchmark.
CXL Reference Architecture Gen2 offers a memory pooling solution where multiple hosts can access the CXL memory pool through a switch. This approach provides customer applications with greater memory capacity and bandwidth than a single-server form factor can support. Additionally, the solution minimizes stranded memory, reducing customer TCO through efficient memory allocation and deallocation. The CXL Reference Architecture Gen2 is expected to be ready by early 2025.
To build CXL Reference Architecture Gen2, a CXL switch is required, and we are collaborating with partners like XConn and H3 Platform. The server CPUs are Intel Granite Rapids (GNR) and AMD Turin, with the server systems being developed by Inspur and K2s. The architecture can expand memory up to 24 terabytes, which can be dynamically allocated to each host through the Fabric Manager. On the software side, we are developing our own Fabric Manager API and orchestration software following a standard RESTful API (a sketch of such a call appears below), and partnering with Red Hat to accelerate commercial solution development. Additionally, we are researching RAS features in the memory pooling system.
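To illustrate what "orchestration over a standard RESTful API" might look like from a client's side, here is a hypothetical sketch using libcurl. The endpoint URL and JSON body are invented for illustration only and are not Samsung's actual Fabric Manager API:

```c
/* Hypothetical sketch of driving a CXL Fabric Manager over REST with
 * libcurl. The endpoint URL and JSON body are invented for
 * illustration; the real orchestration software defines its own API.
 * Build: gcc fm_alloc.c -lcurl */
#include <curl/curl.h>
#include <stdio.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);

    CURL *curl = curl_easy_init();
    if (curl == NULL) {
        curl_global_cleanup();
        return 1;
    }

    /* Hypothetical request: allocate 512 GiB from the pool to host-3. */
    const char *body = "{ \"host\": \"host-3\", \"capacity_gib\": 512 }";

    struct curl_slist *hdrs =
        curl_slist_append(NULL, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL,
                     "https://fabric-manager.local/memory/allocations");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```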
I believe in-memory databases can showcase the best performance gains on CXL Reference Architecture Gen2. We conducted a POC with SAP using a pre-production system and confirmed a performance gain of up to 32% in SAP HANA DB. This POC demonstrated that performance still scales in customer applications when CXL memory is extended through a switch.
Reducing customer TCO is an important collaboration focus for CXL reference architecture Gen2. A decrease in server purchases due to deployment of the CXL memory box is a critical proof point. Applications whose performance is bound by memory capacity and bandwidth will be enhanced simply by adopting the CXL memory pooling system, and cloud customers can increase the number of container instances without purchasing additional servers. Samsung will validate this through customer collaboration using CXL reference architecture Gen2. Currently, we are conducting joint research with Ladder to enable the Ladder OS on top of CXL reference architecture Gen2.
Lastly, we have the CXL reference architecture Gen3. Currently in development, it's aiming to integrate with GPU fabric. The existing GPU is heavily burdened due to the limited memory capacity of a single device. CXL reference architecture Gen3 aims to alleviate this by supporting direct load/store operations, P2P transactions, and memory sharing.
The basic concept is to design a structure where GPU cores share a large memory pool through a switch in the GPU fabric, similar to NVLink and UALink. This architecture also allows direct access to network storage to prefetch data needed by the GPU. Additionally, it serves as an optimal platform for realizing near-memory processing technology, which can be implemented at the CXL switch level or at the CXL memory controller level. This setup provides GPU applications not only with substantial memory capacity but also with aggregated bandwidth, since the energy of large data transfers, measured in picojoules per bit, can be minimized. GPU cores can share the attached large memory and access it directly through load/store operations. This is essential for providing terabyte-scale SaaS services, enabling high-quality, long-context large language model services at low cost.
Here's the expected use case for the CXL reference architecture Gen3. For LLM services, the model size is a concern, but the KV cache size caused by batch processing and long context is an even bigger issue. When the sequence length reaches 2048, the required KV cache size cannot be served even with the latest GB200 GPU system. The CXL reference architecture Gen3 will eliminate this memory constraint, enabling higher-quality LLM services within a reasonable SLA time.
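To see why the KV cache dominates, consider a back-of-the-envelope formula: each token stores a key and a value per layer per attention head, so

$$
\text{KV bytes} \approx 2 \times L \times H \times d_{\text{head}} \times S \times B \times \text{bytes/element}.
$$

With illustrative parameters of our choosing (not figures from the talk)—$L = 96$ layers, $H = 96$ heads, $d_{\text{head}} = 128$, sequence length $S = 2048$, batch $B = 64$, FP16 (2 bytes)—this comes to roughly 576 GiB for the cache alone, before the model weights.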
Secondly, there is the RAG vector database use case. Storing a terabyte-scale vector database on a GPU is challenging, and data movement is required for LLM serving. By loading the RAG database onto CXL reference architecture Gen3 and incorporating processing-near-memory technology, the GPU can focus solely on LLM serving. Since the amount of data that moves from the retrieved documents into the LLM's input is very small relative to the vector database size, the burden on the GPU fabric will decrease.
Checkpoint storage is also a critical use case for CXL reference architecture Gen3. Storing tens of terabytes of checkpoints in the CXL memory pool reduces the burden on the GPU network, allowing the GPU system to focus solely on computation.
Okay, call to action. Yeah, please, just bring your own workload to Samsung SMRC. Optimize your service using Samsung's CXL devices and reference architectures. We hope to develop even better CXL devices and reference architectures as a result of this industry ecosystem collaboration. Thank you.
Any questions? We're here. We have at least four or five minutes. Are you interested in memory, CXL, and solutions that Samsung is providing? Any future technologies you'd like to see us talk about next time?
Thank you for the presentation. Matt Brummage from Arm: What use cases do you guys see in memory pooling or other memory expansion for shared memory?
For now, we are researching the use of CXL memory pooling, especially for memory sharing. One customer, a DB vendor, is using software-based cache coherence. I think that with the hardware-based memory sharing supported by CXL 3.1, that software overhead will be eliminated, and they can reduce their TCO with this kind of concept. That's our target.
Of course, as you know, the answer also depends on the use case and the medium you select. If the device is a high-capacity device—for example, a NAND flash device on the CXL fabric—and the consumers of it, smaller processors, don't need that much capacity, we might build large modules within the industry, but it would be nice for them to be subdivided. We have the name "pooling" for that. It's not shared—not concurrently shared—but different regions of the same device can be assigned to different CPUs. That's the CXL memory pooling concept, as you know. With CXL, we can extend that to memory sharing, as Jinin suggested. In that model, we reduce the time and complexity involved in moving data from one compute element to another when they're not directly connected. If they're directly connected, well, have fun. But later tomorrow, we will have diagrams showing how large these systems are. We cannot do all-to-all; we have to have switches in between, in a hierarchical topology. Each switch in that model can provide a conduit to memory, and in that case, the time and energy required to move data will be less.
What do you think about SDRAM with a native CXL interface? Right now, you have to use a bridge, right? The CXL controller plugs into the DRAM.
DRAM.
DDR, right?
Yeah.
So, what do you think about SDRAM as a native CXL interface without a controller, or with a separate controller?
As you know, memory cells are very simple. So, you either talk to them using a simple interface—high, low, high, low, bit twiddling; a DDR bus, for example—or, if they are more sophisticated and serialized, you have to have a media controller. OMI, the OpenCAPI Memory Interface, was one such solution. Fully Buffered DIMM was a solution that had serial buses connecting to wide buses. Now, CXL is the de facto standard version of that. So, you could think of the CXL controller as a media controller that connects to the native interface the DRAM device—or a future emerging memory device—needs, and then abstracts it behind a standard interface, the CXL protocol. Whether you call it a CXL controller or a media controller is just semantics.
Yeah, I understand. But what I'm trying to say is, we could all save power.
Saving power.
Yeah, because you have to convert for the DRAM—from the DDR interface on the front end, convert to CXL.
Of course. You're right on—the power consumption in that controller is non-zero. And I totally understand that all of us want to reduce power and energy consumption; that's a good thing we need to do. On the other hand, compared to a GPU, compared to a large accelerator, power is so many cents per kilowatt-hour. So, if we can turn consumed power into value, people are OK with it. What's not OK is to provision power and not use it—provision a GPU and not use it, provision memory and not use it. The way I have looked at the problem, power and cooling provisioning is on a 10- or 20-year time frame: you build a large system, and it's provisioned. As engineers and technologists, our job is to use it fully for good—to turn all that power into something meaningful. If I put a controller in there, and that allows a much bigger fan-out to a larger memory array, that reduces the time I need to run the application. Because it takes less time, I can turn the machine off when I don't need it. Therefore, I save power.
OK. Thank you.
You talked about disaggregated memory, shared memory, or memory pooling inside a rack. What about deep disaggregation, where you have maybe a rack full of GPUs and some compute, some CPUs, talking to a rack full of memory—fast memory, maybe 10 meters away? Do you see that happening? And when, possibly?
OK, yeah, that's a good question. Actually, as you know, a copper cable can reach across a rack. So, our first approach targets a one-rack distance. As you said, there are a lot of challenges when the reach goes far beyond a rack. So, we are thinking about optical solutions and other technologies, but that's a bit of a future story; we need to research which technology will offer better performance and cost efficiency. Yeah, that's a good question.
Yeah, one of the trade-offs is time, space, and complexity. With the value you suggested comes complexity. The CXL specification allows for that, and there are other interconnect technologies that allow for it as well. Depending on whether it is get/put semantics, bulk data movement, or load/store semantics, CXL can do a very good job. As a matter of fact, as Jinin suggested, with the CXL switches in the box he described, we could put that in a rack and cable it together. It can very well be done. But I normally talk about three principles. First, do no harm: make sure it is backward compatible and software can run on it. Then, put things where they belong: if it makes sense for everything to be in one chassis, reduce the complexity and go do it. If we can benefit by adding value—cabling it to something else and separating accelerators entirely from memory, because memory technology might change every two years while accelerators change every six months—that is investment protection; let's do that. But if it hurts, don't do it. In other words, if by adding all of this complexity we're reducing the reliability of the system, let's stop, think about it, and find another way. Maybe the last question?
Yeah, I'm seeing there's a delay from companies that are making CXL switches, especially for 3.1. I know that you're just focused on memory—memory expansion and memory pooling. But for memory copy with persistent memory, you need CXL 3.1. Do you see that the delay of IC vendors coming out with the CXL switch impacts your product rollout?
Okay. Yeah, for the CXL switch, we are working with a limited number of companies, such as XConn. I believe XConn is the only company that will make that version of the CXL switch. But the CXL switch is based on PCIe, as you know, and many companies can make their own ASICs, of course. That's why we are trying to prove out the CXL technology for our customers. I know that Broadcom and Cisco have plans to make their own ASICs. At that point, I believe the ecosystem for CXL support will expand.
Okay, so I know and understand that XConn has the CXL switch, and they can support PCIe Gen 6; they currently have Gen 5 and CXL 2.0. But you think that in the future, more companies from the CXL consortium will likely produce switches for the industry. And if the switches aren't available, a company can use its own ASIC or FPGA.
As you know, we are technologists. We are working together, and we dream big—dream about what things can be done. We work together and come up with specifications, writing them down: if you were to do it, do it this way so it is interoperable. The CXL specification is an example. What comes next is realizing that one and bringing it into practice. POCs are happening—go on the show floor. You do see CXL switches. You do see CXL controllers. You plug them in, and they work. Okay, so we covered maybe desirability: people want it. There’s a marketing requirement for it. And we talked about physical implementation and the capabilities that the technology has. Then, what's left is business viability. Does it make sense? Is it too expensive? Will people buy it? Do you need it? What problem does it solve? That's part of the work that's going on. That's why we're working with a lot of different partners across the world to do POCs, measure, review the results, look at the reliability marks, check the performance marks, and assess if it makes sense or not.
Thanks. I hope it's not going to be like Optane from Intel. They also had proofs of concept, but the market wasn't ready to adopt it because the money was spent on accelerators. And now, with CXL, we're seeing that it might be delayed for another year or so.
It's just hard to predict the future. But certain attributes are important to talk about. A technology that is sole-sourced from only one company will gather only so much momentum. If a technology has over 260 companies participating—go on the show floor, and you see a lot of different people testing out different things—if things are vibrant, if they're solving problems, you can predict that, hey, something good will come out of it. We're banking on that.
Thank you very much.
Of course. Thank you.