Hi. I'm Siamak Tavallaei, serving at Samsung. I'm a server architect, a computer architect, and I've been involved with CXL and OCP for a long time. Composable Memory Systems is a very active work stream for all of us.
So, today I'm going to talk about the various ways we are engaging to take advantage of memory. Memory, as you know, has density, bit rate, capacity, and fault tolerance; these are all important aspects of delivering a good system.
When we have memory, the closer it is to the compute, the easier it is for us to manage. Data is important, but once data is processed it becomes more insightful and more valuable, and moving data takes energy. Those are the main considerations in transforming data into insight. So people have used SRAM as caches, and when we don't have enough SRAM (die space was raised as an issue in a previous session), we back it up with DRAM, for example. When we don't have enough DRAM, we sometimes back it up with storage, and sometimes we have to go out to the network and bring in new, fresh data. When memory is close to the processing element but the same kind of memory can also be pushed farther away, I use the term extended memory; when we take advantage of memory, DRAM for example, that is backed by NAND flash, I use the term expanded memory. Hopefully that distinction will be clear in the discussions we have today. And then, as you can see, CXL provides other utilities for us. We can increase our access to memory using pooling: pooling really means we have a device with a lot of memory that subdivides itself and gives each segment to a different processor or a different GPU, so segments are not concurrently shared, they are sequentially shared. There is also the concept of memory sharing, which means a region of memory can be accessed by more than one processing entity at the same time.
So again, this is the traditional memory hierarchy people have talked about, and I have some diagrams and cartoons to show here. We have CPUs directly connected to some memory, and nowadays HBM, high bandwidth memory, is an interesting tier of that hierarchy. Traditionally, software programs want load-store semantics; they don't want to know where memory is, they just want it to work. Other models can deal with moving data ahead of time and managing data in big blocks, and artificial intelligence and HPC benefit from that as well. So, all in all, we have local memory, we bring in fresh data from the network, sometimes staged on an SSD local to the server, and bulk memory abstracts the storage or the network for the data being operated on by the main CPU. The dilemma we sometimes face is that different VMs, virtual machines, and different workloads need different memory capacity, so a system may be provisioned maximally for all kinds of applications and programs that might run on it. But if we over-provision and don't use all of those resources, that's uncomfortable; it is too expensive. CXL has come in and tried to facilitate a solution for that. And because it is an interoperable interconnect that different suppliers can build to, that's perhaps the reason CXL is taking off and becoming the de facto standard, at least for memory expansion now, and hopefully soon for accelerators as well.
So again, this is the concept of thin provisioning. In other words, you can allocate less than you normally would to a CPU, but if you have access to a pool of resources, you know the program will not crash: the moment you need more, you start allocating from the pool. It takes some time, and it takes software that understands how that works, but this is what the CXL Consortium has been talking about. A dynamic capacity device allows onlining and offlining semantics for adding memory to and subtracting memory from a CPU. The concept of sequential sharing, again, is akin to pooling: you don't touch the same memory block at the same time, you sequence it. You touch it now, and then you give that access to somebody else. Sharing proper, though, is concurrent sharing, and that's where cache-coherence models become important. Now, the diagram over here goes a little deeper to reflect some of the things that Samir and Manoj talked about: you have a block of memory that's accessible by multiple CPUs in a shared or pooled area.
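As a rough sketch of what that onlining step can look like on Linux, newly surfaced capacity shows up as memory blocks under sysfs that the operating system brings online. The block index below is purely hypothetical; a real dynamic-capacity flow would first discover which blocks belong to the new region.

```c
/* Minimal sketch: bring a hot-added memory block online on Linux by writing
 * to sysfs. The block index (memory424) is hypothetical; a real dynamic
 * capacity flow would first discover which blocks the new region created. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/devices/system/memory/memory424/state";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    /* "online_movable" keeps the block removable, so it can be offlined
     * again when the capacity is handed back to the pool. */
    if (fprintf(f, "online_movable\n") < 0)
        perror("fprintf");
    fclose(f);
    return 0;
}
```

Writing "offline" to the same file is the counterpart when capacity is returned to the pool.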
So, some cartoons to demonstrate these ideas: a processor connected to local memory and storage can nowadays actually use CXL as well. Networking can use CXL, and CXL provides an environment for all of that to work within one domain of memory, all coherent with each other.
Some numbers to share: local memory could be, for example, 500 gigabytes of LPDDR directly connected to the CPU. HBM, high bandwidth memory, can be locally attached; something around 200 gigabytes of HBM is reasonable to have these days. On storage, one server is comfortable with two, three, four terabytes of SSD. This is a node, as Manoj referred to it, a node that runs one operating system. And then you normally need to go outside to get fresh data; something around 800 gigabits per second of I/O bandwidth is normal these days.
Now, in these two diagrams, I'm trying to illustrate that CXL comes in, expands capacity for us, and adds bandwidth. You can see a DDR bus connecting memory to the CPU, and in almost the same hierarchical relationship, the same DDR can be connected through a CXL controller and be accessed by the same CPU. This looks very much like a dual-socket system: memory behind CXL looks like it is one NUMA hop away. From a software point of view, it is in the same category, the same rank, the same tier of operation: load-store, cache coherent, everything done by hardware.
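Because the CXL-attached DRAM surfaces as a memory-only NUMA node, ordinary NUMA APIs can already target it. Here is a minimal sketch using libnuma, assuming the expander shows up as node 1; the node number and the 1 GiB size are assumptions for illustration.

```c
/* Sketch: allocate a buffer on the NUMA node backed by CXL-attached memory.
 * Assumes libnuma is installed and the CXL expander appears as node 1;
 * both are assumptions for illustration. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    const int cxl_node = 1;      /* hypothetical CXL memory node */
    size_t size = 1UL << 30;     /* 1 GiB */

    void *buf = numa_alloc_onnode(size, cxl_node);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", cxl_node);
        return 1;
    }

    /* Plain load/store access: hardware keeps it cache coherent,
     * the memory simply sits one NUMA hop away. */
    memset(buf, 0, size);

    numa_free(buf, size);
    return 0;
}
```

Build with something like `gcc -o cxl_alloc cxl_alloc.c -lnuma`. The point is simply that no new programming model is required; the buffer is just farther away.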
HBM itself, high bandwidth memory, is a very important factor these days, connected directly on the processor die or the GPU die.
Now, with CXL, we have the option of adding more capacity and more bandwidth just by using another port. To the extent that CPUs have CXL ports, the CXL protocol running on PCIe lanes, CXL will be successful. Devices generated in the future have a choice of being PCIe-only or CXL-only, but as long as processors provide CXL protocol support on every root port, you can imagine that future suppliers would choose to implement CXL as a superset rather than just PCIe.
Now, Manoj and Samir talked about memory being directly connected to the GPU, such as HBM, or connected to a CPU, to support artificial intelligence and ML workloads and HPC. On the back end of that, CPUs can have direct access to a pool of memory, so data movement is simpler and takes less time and less energy. And the flexibility of choosing the size of each memory footprint is also available through this method.
Now, if we put them all together, we are OCP: we have a number of suppliers building chassis and system enclosures, and we can put all of this in one enclosure. These interconnects can be handled with a copper backbone, so we reduce the complexity of cables, for example, all within the chassis. And each memory module, in this case a multi-ported memory module, can be replaced with a switch, and that switch can connect to storage or memory devices and serve internal or external nodes.
So, a concept I'd like to share with you is very similar to this hotel model. Independent people live in hotel rooms; they have local resources. But to get access to facilities like this one, the restaurants or the convention center, they can come downstairs and use shared resources that are local to that hotel. In this model, I'm also showing a node controller that can be enabled by CXL: we don't live in this hotel, we came from a different hotel, but we are using some of the resources on the mezzanine level of this hotel. So mezzanine-level resources can serve internal tenants and external tenants using CXL concepts in the form of pooling or sharing.
So, we are very familiar with memory devices and the tradeoffs we make for fault tolerance, size, and dimensions.
On densities, memory technology is nowadays getting to 32-gigabit dies. With 32-gigabit technology and a dual-rank DIMM, we can very easily have a 128-gigabyte DIMM; these are just some useful numbers to keep in mind. DDR5 provides robust error detection and correction with on-die ECC, and at the DIMM level, 8 DRAM devices plus 2 gives us good enough protection. On bit rates, as we push to higher speeds, MRDIMM is a way of multiplexing multiple ranks onto the same channel, and we can get to 8800 megatransfers per second, for example.
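To make the 128-gigabyte figure concrete, here is the back-of-envelope arithmetic as a tiny sketch. It assumes x4 devices with 16 data devices per 64-bit rank, which is one common organization, and it ignores the extra ECC devices; those details are assumptions, not something stated in the talk.

```c
/* Back-of-envelope check of the 128 GB dual-rank DIMM figure.
 * Assumed organization: x4 DRAM devices, 16 data devices per 64-bit rank,
 * ECC devices not counted toward usable capacity. */
#include <stdio.h>

int main(void)
{
    const int die_gbit = 32;                  /* 32 Gbit per DRAM die    */
    const int gbytes_per_die = die_gbit / 8;  /* = 4 GB per die          */
    const int data_devices_per_rank = 16;     /* 64-bit bus with x4 dies */
    const int ranks = 2;                      /* dual-rank DIMM          */

    int dimm_gb = gbytes_per_die * data_devices_per_rank * ranks;
    printf("DIMM capacity: %d GB\n", dimm_gb); /* prints 128 GB          */
    return 0;
}
```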
Different form factors have been developed to house all of these DRAMs, either in the traditional way using DIMMs and RDIMMs, or the DDIMM, which was done by the Gen-Z people. And nowadays there is E3.S, with 80 DRAM devices, which you can go see on the show floor.
E3.S packs 80 DRAM devices, equivalent to two DIMMs' worth of capacity, and it can be connected through CXL. There are different form factors here for you to take a look at: AIC add-in cards on PCIe, or proprietary methods.
So when we put them all together, we can enable memory tiering, putting everything in one machine, one chassis, and demonstrate disaggregation using a locally disaggregated pool.
And once we have multiple of these devices, you could have buffered DRAM, or NAND-flash-backed DRAM, for memory capacity. These are all possibilities we have these days.
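When the far tier is just another NUMA node, tiering policies can be expressed with ordinary page migration. Below is a minimal sketch using libnuma's move-pages wrapper to demote a single page from local DRAM to the far tier; treating node 0 as local and node 1 as the CXL tier is an assumption, and production tiering would be driven by access-frequency tracking rather than a hard-coded move.

```c
/* Sketch: demote one "cold" page from the local tier to a far (CXL) tier
 * using libnuma's page-migration wrapper. Node numbers are assumptions:
 * node 0 = local DRAM, node 1 = CXL-attached memory. */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;

    long page_size = sysconf(_SC_PAGESIZE);

    /* Allocate one page on the assumed local node and fault it in. */
    void *page = numa_alloc_onnode(page_size, 0);
    if (!page)
        return 1;
    memset(page, 0, page_size);

    void *pages[1]  = { page };
    int   dest[1]   = { 1 };   /* hypothetical CXL tier node */
    int   status[1];

    /* pid 0 = calling process; MPOL_MF_MOVE migrates pages we own. */
    if (numa_move_pages(0, 1, pages, dest, status, MPOL_MF_MOVE) < 0)
        perror("numa_move_pages");
    else
        printf("page now on node %d\n", status[0]);

    numa_free(page, page_size);
    return 0;
}
```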
So interoperability is a key point for us here; that's perhaps one of the reasons CXL will be successful, because we have over 260 companies interested in it. It is creating a standard, interoperable method of expanding the bandwidth and capacity of memory, and the memory use case is the one being exploited right now. Composable Memory Systems is a track you could join to help us make better sense of all of this. Thank you.