YouTube: https://youtu.be/i6sO5lrFY50
Text:
Very excited to be here! We have a couple of important pieces of news to share with you. Today, we're introducing the CXL 3.0 specification on a public site, so it goes public today. As you know, the CXL Consortium started with nine promoters in 2019, when CPU suppliers, system suppliers, and hyperscalers got together and came up with important use cases that needed to be resolved. Later, as the consortium grew, other companies joined, and now we have over 206 members in the consortium.
We started with the CXL 1.0 specification when it was first released. Soon after that, we added a compliance chapter as part of CXL 1.1; at that point we had only about 15 members. By the time we grew to CXL 2.0, about 130 members had joined the consortium, and now we are at the 3.0 level. The specification goes public today, and over 200 members are part of the group developing it.
The reason we need this kind of effort is that data is growing, requirements for computation are growing, and the number of cores in CPUs is growing, so we need more and more bandwidth to memory, and memory itself needs to be used efficiently. CXL provides that methodology for an infrastructure efficient enough for data centers and large clouds.
For computation, you can imagine we need a compute element and we need scratchpad memory; data needs to come to the compute element, and eventually the results need to be stored or sent over the fabric. Throughout the history of computing we have created an extensive network from the processing element all the way out to the public networks. CXL sits smack in the middle of that, as a fabric close to the CPUs, GPUs, DPUs, and IPUs, between compute elements and memory elements. CXL 2.0 was a group effort by the initial promoter companies and the contributors who came in immediately after; it was a team-building effort. After the group formed and the teams learned to work efficiently together, they tackled more and more use cases, reworking old use cases more efficiently with CXL and coming up with new ones, and out of all of that came features we all wanted to implement. CXL 3.0 is the collection of those features.
CXL is happening. It is real now. It is not just PowerPoint slides and written specifications; CXL solutions exist. People have already built emulators and FPGA-based proofs of concept, and silicon-based solutions are already in development, validation, and qualification. Today, I hear that people have demos of these silicon-based solutions on the show floor. At Supercomputing 21 there were a number of announcements, and more announcements will happen today.
Another piece of news I'd like to share with you is that CXL is gathering momentum on other fronts as well. You have heard that Gen-Z IP and assets were transferred to CXL in February, and yesterday the OpenCAPI Consortium announced a letter of intent to move the OpenCAPI and OMI IP and assets to the CXL Consortium as well. That makes it a wonderful environment for the consortia to work together; a number of members already belong to both. Having the specifications available allows the team members, as they tackle different use cases, to take advantage of the IP and techniques available through Gen-Z, OpenCAPI, OMI, or CXL, or whatever else they invent that is new.
As I stated earlier, the CXL 3.0 spec is out. It was approved after an extensive review, and it is my proud pleasure to announce that it is now on a public site. It is a public document, available for all developers to build to.
As the industry grows, use cases emerge, and people look at old use cases and want to build efficiencies around them. A lot of solutions have worked very well using networks, using Ethernet, using RDMA over InfiniBand. With CXL, you can bring them closer to memory: memory and compute can be closer to each other. The efficiency of moving data is a key point we would like to promote with CXL, not in a proprietary way but in a common, standard way. There have been a lot of solutions in the past that did a very good job of moving data, but they were proprietary. The CXL Consortium does it with a standards-based specification, with a number of companies involved, so common solutions emerge. A number of features were developed to address that, and I will go through them in a minute. At a high level, CXL 3.0 provides a method for multi-ported devices and an enhancement in moving data through the fabric. The notion of a fabric is introduced in CXL 3.0, and because of that, compute disaggregation and then composition of systems become possible, based on a fabric manager and on efficiencies in routing data through switches. Memory pooling was introduced as part of 2.0, but there are enhancements here that I will get into in a minute. Backward compatibility is very important for the CXL Consortium, and one of the interesting aspects as CXL moved from 1.0 to 2.0 and now to 3.0 is that CXL 3.0 offers its features and capabilities a la carte. Companies already building toward, say, the 2.0 specification can, when the 3.0 spec is available, pick and choose features of CXL 3.0 and be 3.0 compatible. For example, CXL 3.0 uses PCIe's physical layer and runs the link at 64 giga transfers per second as a new feature, but if companies are already building to 32 giga transfers per second, they can pick and choose other features and not necessarily tackle the PAM-4 model yet. That is a choice they can make.
So, as a summary: CXL 1.0 established the major features, using the 64-byte flit mode running at 32 giga transfers per second and defining device types based on three protocols: .io, .mem, and .cache. A Type 1 device supports .io and .cache. A Type 3 device supports .mem and .io. A Type 2 device combines the two: it supports .io, .cache, and .mem. Those were the essential features of CXL 1.0, which was basically designed for point-to-point connections. As the team formed, what they immediately saw was the need to fan out from one root port to multiple devices. For that, we introduced the CXL switch as a fan-out device, and, once you can fan out to multiple devices and memory devices are growing to high capacities, it made sense for a device to subdivide itself into multiple logical devices, an MLD, and be connected to multiple hosts. That introduced the concept of memory pooling. Aside from that, because we now had a more complex formation of devices, switches, and processors as root complexes, security became important, so IDE, an encryption method over the link, was part of CXL 2.0 as well.

Then, as the team grew and momentum formed behind CXL 2.0, different use cases were introduced, old use cases were explored to make them more efficient, and a strong team formed across multiple workgroups: the software workgroup, protocol workgroup, PHY workgroup, and memory systems workgroup. They all worked together to enable the new features of CXL 3.0. Now, after a year and a half, we are here, and the specification is complete with a number of new features. On the bit rate, we doubled it without adding latency, using the PCIe Gen 6 data rate of 64 giga transfers per second with PAM4: no additional latency, but increased throughput. The memory pooling that CXL 2.0 offered was very natural when you look at the topology and the system block diagrams: multiple hosts have access to the same memory device. But pooling did not allow data to move from one virtual hierarchy to another; memory sharing is the natural progression of memory pooling. With memory sharing, a segment of memory within a memory device can be allocated to more than one host, and the hosts, using the cache coherence model of CXL, can actually cache the shared region of memory. For that, new channels, such as the back-invalidate channel, had to be introduced and invented, and that is what CXL 3.0 offers. In addition, on peer-to-peer connections: when you look at the block diagram, with multiple devices connected to a switch, it is natural to expect devices to be able to talk to each other without having to go through the CPU. For that, again, the back-invalidate channel helps. Accelerators can talk to their subordinate devices directly through the switch without choking bandwidth through the processor.
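To make the device-type taxonomy above concrete, here is a minimal sketch (my own illustration in Python, not text from the specification; the names are hypothetical) of which protocols each device type carries:

```python
# Illustrative sketch only: CXL device types and the protocol mix each supports,
# as described in the summary above.
from enum import Flag, auto

class Proto(Flag):
    IO = auto()     # CXL.io    - discovery, configuration, PCIe-style I/O
    CACHE = auto()  # CXL.cache - device coherently caches host memory
    MEM = auto()    # CXL.mem   - host load/store access to device-attached memory

DEVICE_TYPES = {
    "Type 1": Proto.IO | Proto.CACHE,              # caching device / accelerator
    "Type 2": Proto.IO | Proto.CACHE | Proto.MEM,  # accelerator with its own memory
    "Type 3": Proto.IO | Proto.MEM,                # memory expander / pooled memory
}

def supports(device_type: str, proto: Proto) -> bool:
    """True if the given device type runs the given CXL protocol."""
    return proto in DEVICE_TYPES[device_type]

assert supports("Type 3", Proto.MEM) and not supports("Type 3", Proto.CACHE)
```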
When you put all of these elements together, you have switches, you have compute elements, you have devices, and you have a way to control the devices and compose the system. Now, with CXL 3.0, switching can be cascaded; it is not just one layer of switches. You can interconnect switches to each other using fabric ports. Together, this creates a very robust interconnect fabric available to a large ensemble of devices, accelerators, memory devices, and storage elements, all on the same fabric. System designers can choose to put them together in large ensembles and large crates, bringing efficiency to computing and to the composition of systems as we move forward.
So we encourage people to come and join and take advantage of all of these capabilities that your colleagues have already built. The specification is nice and robust. We keep evaluating different use cases; for each use case, different companies and individuals come up with proposals, they work together, they arrive at a common, effective method of implementing the solutions, and eventually they specify them in a coherent way so people can benefit and implement solutions.
If I have time, I can show you some of these diagrams. Frank, do I have time? To give you an idea, again: CXL 2.0 offers one layer of switching and a fan-out to multiple memory devices, but it allows only one .cache-capable device. CXL 3.0 breaks that barrier: all of the devices can be Type 1, Type 2, or Type 3, and .mem and .cache can run on any of those devices.
As I suggested earlier, multi-layer switching is possible within CXL 3.0. Doing it that way, you can create topologies like the one shown here, which allows different devices to reach from one end of the fabric to the other, creating the concept of disaggregated computing and, in reverse, composing new systems from those disaggregated resources. All of that is done using a fabric manager, a piece of software that can run in-band in a virtual machine on one of the hosts, or be totally out-of-band using MCTP-capable devices, managing the interconnect between these devices.
The concept of peer-to-peer I touched on earlier. Since we have a little more time, it is good to go over the diagram here. You see this switch is multi-host capable. Each color designates a different virtual hierarchy and a different host ensemble. In this model, host 2 has access to devices 1, 2, 5, and 6; all of those devices are within its virtual hierarchy, whereas device 4 is accessible only to host 4. Now, within this switch interconnect, devices 1, 2, 3, and 4 can talk to each other directly through the peer-to-peer mechanisms the team invented using the capabilities that are there. The end result is that, compared to CXL 2.0, they do not have to reach each other through the CPU, so the link between the CPU and the switch is not a choke point anymore. That is the value CXL 3.0 is offering.
We did talk about memory pooling. Memory pooling using switches takes advantage of a device that is a multi-logical device; a multi-logical device has multiple segments of memory, and each segment can be dedicated to one host. That is the pooling concept, and it is the key to the pooling-versus-sharing disambiguation: sharing allows multiple hosts to access the same region of memory within the device concurrently, while pooling does not. So with sharing, you see that S1, which is shared region 1, can be cached by host 1 and host 2, as an example, and they can talk to each other, passing data back and forth through memory, very similar to what people expect from symmetrical multiprocessing (SMP) systems. Devices can be developed to do pooling or to support sharing, and the same devices can be put on a fabric, not just a single layer of switches but multiple layers. Once we do that, those devices become globally fabric-attached memory capable.
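As a way to see the pooling-versus-sharing distinction in code, here is a small sketch (illustrative only; the segment names, host IDs, and API are assumptions of mine, not anything defined by the specification):

```python
# Illustrative sketch only: pooled segments belong to a single host at a time,
# while shared segments can be accessed (and cached) by several hosts at once.
from dataclasses import dataclass, field

@dataclass
class Segment:
    name: str
    hosts: set = field(default_factory=set)  # hosts currently bound to this segment
    shared: bool = False                      # True -> concurrent access allowed

@dataclass
class MultiLogicalDevice:
    segments: list

    def assign(self, seg_name: str, host: str) -> None:
        seg = next(s for s in self.segments if s.name == seg_name)
        if not seg.shared and seg.hosts and host not in seg.hosts:
            # Pooling: a pooled segment is dedicated to exactly one host.
            raise ValueError(f"{seg_name} is pooled and already owned by {seg.hosts}")
        seg.hosts.add(host)

mld = MultiLogicalDevice(segments=[
    Segment("P1"),               # pooled segment, dedicated to a single host
    Segment("S1", shared=True),  # shared region, cacheable by several hosts
])

mld.assign("P1", "host1")
mld.assign("S1", "host1")
mld.assign("S1", "host2")        # fine: S1 is a shared region
# mld.assign("P1", "host2")      # would raise: P1 is already dedicated to host1
```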
This is an example that shows memory pooling and memory sharing done at the same time, using a switch that is multi-host capable. Again, to highlight: these concepts have been done in the past, and use cases that take advantage of them have been available. What CXL is doing is doing it in a standard way. Other companies have already implemented some of these features using PCIe; multi-host-capable switches have been built before, but in a non-standard way that did not follow the PCIe standard, using proprietary capabilities. All these smart people got together as part of CXL and decided to do it in a standard way, and the solution emerged as part of CXL 3.0 as a complete set, common for everybody to use.
Thank you very much. They're asking for questions. Do we have time? Yeah. Yes, sir.
So what is the purpose of integrating Gen-Z and OpenCAPI into this forum, CXL? Are you going to run them in parallel, or take the good things from those standards and roll them into CXL 4.0 or whatever?
Well, what I've seen is that a lot of good people have gotten together, and we'll do what makes sense. I'll give you examples. Individual Gen-Z member companies joined CXL before the notion of moving assets from Gen-Z to CXL was even thought of. Those individual members joined CXL, and we were working towards building a larger fabric, as is now done by CXL 3.0. At the time we were working on CXL 1.1 and 2.0, the use case emerged: hey, OK, we have a nice interconnect between a CPU and a memory; what if we do more? And who has already done that "more"? Well, Gen-Z, that was their central focus. Individual companies got together. As a matter of fact, first we did an MOU and created a subgroup that was an intersection of Gen-Z member and CXL member companies. They worked together to come up with a definition of that use case, and that definition could eventually be introduced into CXL or into Gen-Z. We worked under that environment for a while, and after that people saw that it was kind of inefficient. The best method forward was for people to feel comfortable using their own experience and the tools available to them, in one common forum, without having to worry about IP contamination, or sharing, or any of those blocking issues. That model worked very well: once the Gen-Z assets came over, people saw, OK, this is the way I used to do it, and now it makes sense to do it within CXL. They didn't label it as a particular method from a particular chapter of Gen-Z; it was just in their brains, they just knew how to do it. So CXL 3.0 is a collection of things that came from their past experiences and understanding. The same can be expected as the OpenCAPI and OMI assets move into the CXL Consortium: the members of the CXL Consortium, which will be a union of the two consortia, will have access to all the assets available, and can pick and choose whatever makes sense to best address a particular use case of interest to the community.
Thank you. I have a second question. CXL 3.0 just got introduced, and as you know, it's probably very complex. We have seen standards with fabric integration and multi-host integration before, and a solution will require multiple players to come together. So when you look at your crystal ball, when do you think actual broad-based systems will be deployed using the 3.0 specification? Not just boutique applications, but broad-based applications. Thank you.
A lot of times, to address questions like this, as you said, we look at the crystal ball. Sometimes it is very hard to know what will happen, but it is sometimes easier to understand what will not happen. We are working as a community of smart people from different companies, and everybody is trying to invent something new and do something useful. What I can predict will not happen is a force-fit, shoehorned solution. It might happen for one month, one generation, but it will not prevail. So what I encourage people to do is to identify important use cases, come up with solutions, perhaps at the beginning multiple solutions that need to be merged, and receive feedback. Based on the feedback, solutions can form and gel into something solid that can stay the course. That I expect to happen, especially with a lot of wood behind the same arrow: all of the smart people who used to do OpenCAPI and OMI and Gen-Z and CXL are now part of the same team. Please join. Please help them. Come up with your own use cases that are important, come up with your proposals, work together, receive feedback, and put it into the CXL 4.0 spec.
You mentioned the transfer rates; I didn't quite understand how you're getting those transfer rates. And how does the new spec relate to flit efficiency?
The CXL goal has been to, again, take advantage of the good work that has been done elsewhere. On the physical layer, CXL is drafting behind what PCI-SIG is doing. At the beginning, PCIe Gen 5 running at 32 giga transfers per second was established. A lot of us have already designed to PCIe 1, 2, 3, 4, and 5, and soon 6, so it makes sense to take advantage of that efficiency: the design tools, debug tools, protocol analyzers, all of the tools that have already been built around that physical layer. Taking advantage of an already-established physical layer was important; on top of that, the protocol layers were there to produce the results. The same thing is happening as the bit rate goes up from 32 giga transfers per second to 64 using PAM4 in PCI-SIG; it makes sense for CXL to pick that up. Efficiencies, security, robustness, and the bit error rates are all part of that physical layer as well. So again: if the solution is working, we will use it. If the solution is not working to the fullest possibility, and there are inventive people on the team, we can in fact deviate from it. But to the extent that the physical layer works, we will use it.
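As a rough back-of-the-envelope check on the data-rate point (my own arithmetic, not figures quoted in the talk), the raw per-direction bandwidth of a x16 link scales like this before flit framing, CRC, and FEC overhead:

```python
# Back-of-the-envelope only: raw x16 link bandwidth per direction,
# ignoring flit framing, CRC, and FEC overhead.
def raw_bw_gb_per_s(gt_per_s: float, lanes: int = 16) -> float:
    # GT/s counts bits per lane per second; divide by 8 for bytes.
    return gt_per_s * lanes / 8

print(raw_bw_gb_per_s(32))  # 32 GT/s (PCIe Gen 5 era): 64.0 GB/s
print(raw_bw_gb_per_s(64))  # 64 GT/s PAM4 (PCIe Gen 6 / CXL 3.0): 128.0 GB/s
```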
Hi, this is Rishikesh from Samsung. I just want to ask a higher-level question, and I'm not a DRAM person. Adding all these switches, multiple levels, will add latency. What is that impact? Is it like 30%, 2x, 3x? And whatever the impact is, doesn't it dilute the value proposition of DRAM? People pay for DRAM because they get the lowest latency. So are there still use cases willing to pay the same cost of DRAM for a diluted latency?
Very good question. It is true that when signals go through a physical element such as a switch, they incur latency; that is a fact. A signal going across a cable incurs latency too, the speed of light over the distance. So latency is there, and the question is: how much is it, and is it affordable? Compared to other solutions, if we were to do memory disaggregation using common solutions such as Ethernet, you can imagine the latency through that is large, large enough that the solutions we talked about did not emerge. CXL says: yes, latency is there, but now it is much smaller. The latency target we have asked people to work toward is 50 nanoseconds through each hop in each direction, and I've seen people beating that; they can do less than 50 nanoseconds of latency through one of these physical elements.

So if you add layers of switching, you add that for every hop, and the end question is to look at the end device on the other side: what value is it providing to me, in terms of pooling, sharing, and consolidation, compared to what I already have? I already have RDMA over InfiniBand, or DMA, or Ethernet methods. Will CXL bring something new to me? Yes: the latency is getting smaller and smaller, and other capabilities such as cache coherence are there as well, and when you put them all together, applications can emerge to take advantage of it. One simple model to think about: if you have one CXL memory controller connected to the CPU, the latency expectation for DRAM behind that CXL controller is about the same as when two processors are interconnected and you access memory off the second processor's controller, one NUMA hop away. That is the framework to work through, and you know that people have built dual-socket, four-socket, and eight-socket systems; applications work well in those environments, and for the value they gain they have accepted that latency impact. So it follows that CXL will be successful.
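To put that latency framing into a rough model: only the 50 ns per-hop target comes from the talk; the local-DRAM and NUMA numbers in this sketch are illustrative assumptions of mine, not measurements.

```python
# Rough latency model for the discussion above. Only SWITCH_HOP_NS comes from
# the talk; the other numbers are illustrative assumptions.
LOCAL_DRAM_NS = 100          # assumed local DDR access latency
NUMA_HOP_PENALTY_NS = 100    # assumed extra cost of one NUMA hop
SWITCH_HOP_NS = 50           # CXL target per switch hop, per direction (from the talk)

def cxl_memory_latency_ns(switch_hops: int) -> int:
    # Request and response each cross every switch hop once; the CXL memory
    # controller's own latency is folded into LOCAL_DRAM_NS for simplicity.
    return LOCAL_DRAM_NS + 2 * switch_hops * SWITCH_HOP_NS

print("direct-attached CXL memory:", cxl_memory_latency_ns(0), "ns")
print("one switch hop            :", cxl_memory_latency_ns(1), "ns")
print("one NUMA hop (comparison) :", LOCAL_DRAM_NS + NUMA_HOP_PENALTY_NS, "ns")
```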