Good morning and good afternoon. Thank you for attending the CXL Consortium's "CXL 3.0: Enabling Composable Systems with Expanded Fabric Capabilities" webinar. We will have a Q&A session at the end of the presentation, so please submit your questions via the chat feature and they will be addressed. At this time, I will turn the presentation over to Danny Moore from member company Rambus.
Hey, thanks, Elsa. Hello, everybody, and welcome to our 3.0 webinar. We're all really excited to introduce the latest 3.0 specification and some of the novel device types, capabilities, and interconnects that the consortium has been busy architecting. My name is Danny Moore. I'm a senior manager in strategy and product management at Rambus, where I primarily drive product planning in our data center business unit, with a focus on CXL. I've also been fortunate enough to have been involved in the consortium since its beginnings in 2019; currently I'm an active contributing member of the CXL marketing work group and have most recently been focused on the 3.0 specification release. My co-presenter is Debendra Das Sharma. He's one of the foundational architects of the CXL specification. He serves as co-chair of the CXL technical task force and is a senior fellow at Intel.
In the CXL Consortium, we're always interested in growing our participation. There are a few different levels at which you or your company can get involved in CXL. There's an adopter membership, which is free of charge. Or, if you choose to take an active role in the specification development, you can sign up for the contributor or promoter levels of membership. You can head over to computeexpresslink.org for more information. Our membership continues to grow year over year; we currently have over 200 member companies. And like the membership, our board of directors represents a wide array of companies from industry, including CSPs, hyperscalers, the major CPU and accelerator vendors, memory suppliers, and several OEMs. To reiterate, we're always interested in strengthening our member community, so if you're not currently participating, please do head over to computeexpresslink.org and find out how you or your company can contribute.
So as I mentioned previously, it was back in March of 2019 when the 1.0 specification was first released. And by September, we'd already published the 1.1 specification and officially incorporated the CXL consortium. By November of 2020, we had released the 2.0 specification. And just in August of this year, we published the 3.0 specification. I won't get into any details regarding features and capabilities that come along with that specification cadence as Debendra will be walking you through some of those historical details before jumping into the 3.0 features and capabilities. So with that said, I'll hand it off to Debendra to provide a technical overview of the CXL specification. Thank you very much.
Thank you, Danny. Greetings, everybody. Here's the agenda for today. We'll start with a recap of the industry landscape, followed by a recap of CXL 1.0/1.1 and 2.0. Then we'll dive into the CXL 3.0 features and capabilities, and conclude. There have been a lot of webinars where we have talked about CXL 1.0 and CXL 2.0 in detail, so you may want to look into those for more depth. Today I'll just do a very brief recap.
When we look at the industry landscape today, we see some very clear mega trends emerge. Cloud computing has become ubiquitous. Networking and edge computing are exploding, increasingly using the cloud infrastructure. Artificial intelligence and analytics are needed to process the huge amount of data that we are generating. All of these are driving a lot of innovations across the board. We see increasing demand for heterogeneous processing as people want to deploy different types of compute for different applications, whether it is general-purpose CPUs, GPGPUs, custom ASICs, or FPGAs. Each of them is important and best suited to solve some class of problems. In addition to the demand for heterogeneous computing, we also see the need for increased memory capacity and bandwidth in our platforms. That's natural, because as we process more, we need to feed the beast, so to speak. There is also a class of memory between DRAM and SSD that needs to be thought of as a separate memory tier, offering a compelling value proposition due to its capacity and persistence semantics. So the mega trends we are discussing here take advantage of these types of memory in addition to the evolution in traditional DRAM memory and storage.
Compute Express Link is defined from the ground up to address the challenges in this evolving landscape by making both heterogeneous computing and different types of memory efficient. And it's meant to sustain the demands of the computing segment for many years to come. We're going to look into these in a lot of detail: CXL is built on PCIe infrastructure, adds memory and coherence semantics, and is an open industry standard, as we have talked about.
CXL has followed a three-pronged approach for ensuring ease of ecosystem adoption. At the fundamental level, CXL is built on top of, and leverages, PCI Express infrastructure. It overlays the caching and memory protocols on top of the existing PCI Express protocol, and it runs on PCIe 5 channels. There are three protocols. CXL.io, which as we said is essentially PCI Express, is used for device discovery, event reporting (errors, et cetera), I/O virtualization services, and direct memory access for data movement; we didn't want to reinvent any of that wheel. What we did was add the CXL.cache semantics for cache coherency and CXL.mem for the memory protocol. Those were the key additions we made in order to enable the types of usages that we talked about. We started with the 32 GT/s data rate because of the high performance it offers. PCIe adoption helps lower the barrier of entry since PCI Express 5 is ubiquitous, and it has low latency and low power characteristics. This also enables plug and play: a customer can choose to use a PCIe slot to plug in either a PCIe device or a CXL device depending on the need, and different customers will have different needs. So the first aspect was plug and play on the PCIe infrastructure, building the coherency and memory semantics on top of it. The second aspect is low latency. It is critical because caching and memory protocols need low latency to not adversely affect system performance. We are expecting latencies in the range of existing symmetric cache coherency links for the cache and memory traffic. For example, the specification provides guidance on expected latency: 50 nanoseconds pin-to-pin for snoops, 80 nanoseconds pin-to-pin for DRAM accesses. There is also a reporting mechanism so that system performance is not adversely affected if slower memory devices are used: the device can report its latency and the system software will map things accordingly. The third important aspect is that CXL is an asymmetric protocol. The protocol flows and message classes are different between the host processor and the devices. That's a conscious decision to keep the protocol simple and the implementation easy. That's why you will see the home agent is present in the host processor; it tends to be very dependent on the inherent microarchitecture, and different implementations do it very differently. We have abstracted all of that away and given some very simple, MESI-based coherency semantics, and on the memory side you don't really need to understand how cache coherency works; you just need to provide the data along with the metadata.
CXL 1.0 and 1.1 are a direct connect between a host processor and a device, and they targeted three types of usage models. On the left is a Type 1 device, which basically needs caching semantics for usages like a SmartNIC to deliver better performance. An example would be a PGAS ordering model, where the ordering model is different from the PCI Express ordering model; with caching, you can complete those operations within your local cache and enforce that ordering. This can be done in addition to the traditional PCI-style load/store I/O. So that's the Type 1 device. The middle example is a Type 2 CXL device. Typical usages include things like GPGPUs, FPGAs, and dense compute. These devices have some local memory attached to them, which is used for their computation. We expect them to implement all three protocols: CXL.io, CXL.cache, and CXL.mem. Caching and memory semantics would be used to populate and pass operands and results back and forth between the different computing entities with very low latency and high bandwidth efficiency. The system on the right is a Type 3 device. The usages here would be things like memory bandwidth expansion, memory capacity expansion, and also tiered memory, including storage class memory; these address the memory challenges we talked about. Type 3 devices only need to implement the CXL.io and CXL.mem semantics. The memory will be mapped into the system memory as cacheable memory. The host processor orchestrates the cache coherency here, and that relieves the devices from having to implement the complex coherency flows, because of the asymmetric nature of CXL. So the device doesn't even need to know anything about caching semantics; it doesn't need to implement CXL.cache.
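As a quick reference for the three device types just described, here is a minimal Python sketch (an illustration only, not anything defined by the specification) of which CXL protocols each device type implements:

```python
# Illustrative mapping of CXL device types to the protocols they implement,
# as described above. The names and structure are just for illustration.
DEVICE_PROTOCOLS = {
    "type1": {"cxl.io", "cxl.cache"},              # e.g. a caching SmartNIC
    "type2": {"cxl.io", "cxl.cache", "cxl.mem"},   # e.g. GPGPU/FPGA with local memory
    "type3": {"cxl.io", "cxl.mem"},                # e.g. memory expander / pooled memory
}

def implements(device_type: str, protocol: str) -> bool:
    """True if the given device type carries the given CXL protocol."""
    return protocol in DEVICE_PROTOCOLS[device_type]

assert not implements("type3", "cxl.cache")  # Type 3 never needs caching semantics
```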
After CXL 1.0 and 1.1, we did CXL 2.0, where we added support for pooling of resources across multiple servers, as shown in the picture here. These resources can be accelerators, or they can be memory. Memory pooling is done at a finer-grained level and can be supported by multi-logical devices (MLDs), which can support up to 16 hosts, or 16 servers, or virtual hierarchies; we use the terms interchangeably. They are all assigned non-overlapping memory. For example, if you look at the picture, D3 is assigned to two hosts: H3, which is colored purple, and H2, which is colored green. So you can do up to 16. And these resources, by the way, can move between different hosts following a standard hot-plug flow. This capability with CXL 2.0 enables resources like accelerators or memory to be pooled at the rack level. It's a huge innovative breakthrough, as we no longer need to have captive resources in a tightly coupled cache-coherent system. In today's systems, you've got memory, I/O, GPGPUs, and accelerators along with your processors, and those are tightly coupled to that particular server. If another server needs more memory, you cannot borrow memory from a neighboring server and still use load/store semantics. With CXL 2.0, we enabled exactly that, which is a huge breakthrough from that perspective. Now we can dynamically compose systems with additional memory or accelerators from the pool. And in the picture, even though it shows only memory pooling, you can imagine accelerators being pooled similarly. For example, if a host needs an additional three accelerators for a task that it is running, it can come to the pool, get the three accelerators, and once it is done, release them back to the pool for somebody else to use. That results in significant total cost of ownership benefit along with delivering better power-efficient performance.
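A fabric manager orchestrates this kind of pooling through standardized APIs (discussed later); the following is a minimal, hypothetical Python sketch of just the core bookkeeping, not the actual fabric manager API: carving an MLD's capacity into non-overlapping regions, assigning them to at most 16 hosts, and releasing them back to the pool.

```python
# Hypothetical sketch of MLD-style pooling bookkeeping; not the CXL FM API.
class PooledMemoryDevice:
    MAX_HOSTS = 16  # an MLD can be carved up among at most 16 hosts

    def __init__(self, capacity_gb: int):
        self.free_gb = capacity_gb
        self.assignments: dict[str, int] = {}  # host id -> GB assigned (non-overlapping)

    def assign(self, host: str, size_gb: int) -> None:
        """Give a host an exclusive, non-overlapping slice of the pool."""
        if host not in self.assignments and len(self.assignments) >= self.MAX_HOSTS:
            raise RuntimeError("MLD already serves 16 hosts")
        if size_gb > self.free_gb:
            raise RuntimeError("not enough free capacity in the pool")
        self.free_gb -= size_gb
        self.assignments[host] = self.assignments.get(host, 0) + size_gb

    def release(self, host: str) -> None:
        """Managed hot-remove: return the host's slice to the pool."""
        self.free_gb += self.assignments.pop(host, 0)

pool = PooledMemoryDevice(capacity_gb=1024)
pool.assign("H2", 256)   # green host in the slide
pool.assign("H3", 256)   # purple host in the slide
pool.release("H2")       # capacity goes back to the pool for another host
```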
CXL 2.0, of course, enables a single level of switching for the fan out. I mean, fan out is where you can have a single link and then you can connect multiple devices underneath a switch. So that part we enabled with CXL 2.0. And of course, the more advanced thing is the pooling notion that we talked about.
Fundamentally, with CXL 2.0 we talked about resource pooling and disaggregation, and that happens through managed hot-plug flows to move resources across different hosts. A Type 1 or Type 2 device behind the switch is assigned to one host at a time. A Type 3 device, if it is a multi-logical device (MLD), can be assigned to up to 16 hosts at a time; you can still move it, and you can move different chunks of memory across hosts, which enables pooling at the rack level. All of this has direct load/store, low-latency access, similar to memory attached to a neighboring CPU, which is very different from doing RDMA over a network, which has much, much higher latencies. We defined persistence flows to support persistent memory. We also defined how a fabric manager works, with APIs for the switches as well as for the devices, so that you can manage this resource migration across different hosts. We enhanced security by adding authentication and encryption with CXL 2.0. Fundamentally, CXL 2.0 goes from node-level to rack-level connectivity. It enables disaggregation of the system with CXL, which is done to optimize resource utilization; that in turn lowers total cost of ownership and gives us better power-efficient performance. All of this while CXL 2.0 is fully backward compatible with CXL 1.0 and CXL 1.1.
Now let us recap the industry trends that we just talked about. From the trends point of view, we've got use cases that need ever higher bandwidth, because we've got a lot of high-performance accelerators, system memory whose bandwidth demand is insatiable, SmartNICs, and leading-edge networking. As you process more, you need more bandwidth moving around the system while maintaining low latency, low power, and all of those good things. We also know that CPU efficiency is declining due to reduced memory capacity and bandwidth per core, so we need to shore that up with CXL. We need to do efficient peer-to-peer resource sharing across multiple domains, and we're going to see this in some of the examples. And we have memory bottlenecks due to CPU pin-count and thermal constraints, because the DDR buses are pin-inefficient. Those are the industry trends. Some of them we have addressed with CXL 1.0, 1.1, and 2.0. With CXL 3.0, we are introducing fabric capabilities, and we're going to see more of those: multi-headed devices, enhanced fabric management, and composable disaggregated infrastructure that we're going to see examples of. We have better capacity and scalability, better resource utilization for memory pooling, larger systems, new enhanced coherency capabilities, and improved software capabilities. At the same time, we have managed to double the bandwidth while keeping the latency flat. All of this while being fully backward compatible with CXL 2.0, CXL 1.1, and CXL 1.0.
From a CXL 3.0 specification point of view, these are the key capabilities: better bandwidth, fabric capabilities, improved memory pooling and sharing, enhanced coherency semantics, peer-to-peer, and so on. We're going to go through these in a lot more detail in the upcoming slides.
This is a nice tabular view of the different features that are available with the different revisions of the specification. We started, as I said, with 32 gigatransfers per second (GT/s) in CXL 1.0 and kept that through 2.0. With CXL 3.0 we are doubling the bandwidth to 64 GT/s. Of course, you can still operate at 32 GT/s, but the maximum data rate is 64 GT/s. The flit was 68 bytes up to 32 GT/s, and that is maintained all the way through 3.0, but with 3.0 at 64 GT/s we run a 256-byte flit. Type 1, Type 2, and Type 3 devices have been supported from the 1.0 days, and that continues. With 2.0 we saw memory pooling, as well as accelerator pooling with MLDs; that starts with 2.0 and continues through 3.0. We defined the persistence flows from 2.0 onwards. IDE is the security enhancement that I talked about, and that is there from 2.0 onwards. Single-level switching got introduced in 2.0, and multi-level switching got introduced in 3.0. In addition to that, CXL 3.0 has other features like direct memory access for peer-to-peer, which results in better bandwidth and, of course, much better bisection bandwidth in the system. We've got enhanced coherency to better manage accesses, which will result in better efficiency in the system, as we will see. We have also defined memory sharing across multiple nodes; this is a very new concept, not just pooling but sharing of memory, which enables a lot of different usages. We are also increasing the number of Type 1 and Type 2 devices that can be supported per root port, so this goes from one to multiple. And of course we have fabric capabilities, and we are going to go through these in some detail in the upcoming slides.
First, we did double the bandwidth with CXL 3.0: we use PCIe 6.0 running at 64 gigatransfers per second. In order to go to 64 GT/s, we have to do PAM4 signaling, which results in a very high bit error rate, and this has been mitigated by PCIe 6.0 with the use of forward error correction (FEC) and an 8-byte CRC. The PCIe 6.0 flit layout is shown in the top picture there: you've got some number of TLPs, which are transaction layer packets, then the DLP, the data link layer packets, followed by the 8-byte CRC and then a 6-byte FEC. The way the FEC works is that you get a particular flit, you correct that flit using the FEC, then you apply the CRC. If the flit passes, you consume it; if it fails, you cause a link-level replay. There is a PCIe webinar that goes through this in a lot of detail, and also there are a few papers; one of them I have listed here, and if you are interested you can take a look at it, as it covers a lot of the technical details. For CXL 3.0, we took advantage of the same thing. There are two types of flit arrangement. The picture in the middle shows what we call the standard flit layout; as you will see, it's very similar to the PCIe 6.0 flit layout. We've got a 2-byte flit header: you need the header at the beginning to tell which stack the flit is going to, whether it is a CXL.io flit, a CXL.cache flit, or a CXL.mem flit, and in addition you get things like the sequence number and what you need for managing it, ACK, NAK, replay, those kinds of things. The 8-byte CRC, you will see, is in the same place, and then you've got the 6-byte FEC, and the rest of it is the data, which can be CXL.io, CXL.cache, or CXL.mem. We also have a latency-optimized version, which is the picture at the bottom. The 256 bytes get split into two 128-byte halves, each of them independently protected by a 6-byte CRC, but the FEC is across the whole 256 bytes. The idea here is that when you get 128 bytes, if you pass the CRC, you just consume it without even doing the FEC. If you fail the CRC, then you gather the entire 256 bytes, apply the FEC, and then do the independent CRC check on each 128-byte half; if it passes, you consume it, and if it fails, you go and ask for the replay. That results in much lower latency, because you are only doing the FEC correction when there is a failure, and also the CRC accumulation is not over 256 bytes, it is at 128 bytes. That's the reason why we get essentially a zero-latency adder, and all of that good stuff. These flit modes extend to the lower data rates also, because once you are in a particular flit mode, you need to stay in that flit mode; you cannot go back and forth between them. And because the flit sizes are bigger, we have enabled several new CXL 3.0 protocol enhancements with this 256-byte flit format.
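To illustrate the latency-optimized receive path just described, here is a minimal Python sketch under stated assumptions: the check_crc, run_fec, and request_replay helpers are hypothetical placeholders, not spec-defined functions, and the sketch only captures the decision flow, namely CRC per 128-byte half first, FEC over the full 256 bytes only on a CRC failure, and replay as the last resort.

```python
# Hypothetical decision-flow sketch of the latency-optimized 256B flit receive path.
# check_crc(), run_fec(), and request_replay() stand in for real PHY/link-layer logic.
def receive_latency_optimized_flit(halves, check_crc, run_fec, request_replay):
    """halves: two 128-byte chunks, each carrying its own CRC."""
    if all(check_crc(h) for h in halves):
        return halves              # common case: consume both halves, no FEC latency

    corrected = run_fec(halves)    # FEC runs across the whole 256-byte flit
    consumed = []
    for half in corrected:
        if check_crc(half):
            consumed.append(half)  # corrected half is good, consume it
        else:
            request_replay()       # still bad -> fall back to link-level replay
            return None
    return consumed
```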
Coming to the protocol enhancements, there are two major ones that are very simple constructs but extremely powerful, and they are going to have a profound impact on the world of compute for decades to come. Those are Unordered I/O, UIO for short, and Back Invalidate, BI for short. There are multiple reasons why we went with these; one of them is listed here. We basically wanted to enable non-tree topologies. So far, if you look at PCI Express or CXL, those are based on tree topologies, and we need a tree topology because on the I/O side you have what is known as producer-consumer ordering semantics. Those semantics get enforced at every entity, whether it is a switch, an endpoint, or a root complex; everybody enforces that producer-consumer ordering, and that doesn't work with anything other than a hierarchical tree topology. That works fine for a lot of applications, but for fabric-style applications, where you have disaggregated resources and you want them to talk to each other directly, you don't want to always go through the host for every communication; that basically makes the host the bottleneck. In this example, let's say I've got a device D1 that wants to access HDM memory residing in, say, D5 or D6. HDM stands for host-managed device memory; it's basically coherent memory. With CXL 1.0 or 2.0, D1 would have to go through the host, which orchestrates the coherency for that particular memory; the host would fetch the line from the device, say D5, and if it is a read, it would then send it back. Instead, we want to bypass the whole thing and go to that device directly. That eliminates a lot of unnecessary traffic, and it also enables parallel paths, which is good for delivering better bandwidth and better latency. Those are the things peer-to-peer enables. Now, why do I need Unordered I/O and Back Invalidate? Let's say D1 tries to read from D5, and it is coherent memory. If D5 notices that the host has that particular line exclusive, then D1 may not get the latest and greatest copy of the data. Today, with CXL 1.0 or 2.0, there is no way for a device to send a snoop back to the host, because it's an asymmetric protocol. So what we did was take CXL.mem, which has the least dependency from a protocol-dependence point of view, and send a Back Invalidate to the host saying, hey, somebody wants to access this particular line. The host's cache coherency mechanism will then kick in, and it will respond to that Back Invalidate. Most of the time you really do not expect there to be a conflict, so you're going to just go to the device, find out nobody has it, and complete your transaction. In the rare case where that is not true, you're going to invoke the Back Invalidate flow.
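Here is a minimal, hypothetical Python sketch of that peer-to-peer read flow (the class and method names are illustrative, not from the specification): the target device completes the access locally in the common case and only issues a Back Invalidate toward the host when the host currently holds the line.

```python
# Hypothetical model of a peer-to-peer read to HDM with Back Invalidate (BI).
class Host:
    def __init__(self):
        self.cache = {}                          # cacheline address -> dirty data
    def snoop_and_invalidate(self, addr):
        return self.cache.pop(addr, None)        # drop the line, return dirty data if any

class HdmDevice:
    def __init__(self):
        self.memory = {}                         # cacheline address -> data
        self.host_has_exclusive = set()          # lines the host currently owns exclusive

    def back_invalidate(self, host, addr):
        """Ask the host (conceptually via a CXL.mem BI) to give up the line."""
        latest = host.snoop_and_invalidate(addr)  # host resolves its own caches
        if latest is not None:
            self.memory[addr] = latest            # absorb the writeback
        self.host_has_exclusive.discard(addr)

    def peer_read(self, host, addr):
        """Read issued directly by a peer device (e.g. D1 -> D5), bypassing the host."""
        if addr in self.host_has_exclusive:       # rare conflict case
            self.back_invalidate(host, addr)
        return self.memory.get(addr)              # common case: complete locally

host, d5 = Host(), HdmDevice()
d5.memory[0x80] = b"old"; host.cache[0x80] = b"new"; d5.host_has_exclusive.add(0x80)
print(d5.peer_read(host, 0x80))   # BI resolves the conflict first, so the peer sees b"new"
```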
With UIO, what happens is that once you try to mix the coherency world with the producer-consumer world, you run into problems because you've got multiple paths. Unordered I/O is a way to break that: it moves the producer-consumer enforcement to the source. These two mechanisms are going to have that profound impact. With this, even for I/O traffic, I don't have to have a tree hierarchy, because my ordering point is at the source. D1 has to enforce that ordering: writes are no longer posted, writes get a completion back, so I know when the write made it, and then I can do my producer-consumer ordering that way. So these are two constructs that are extremely powerful, and you can build very capable disaggregated and composable systems using fabric topologies with these two protocol enhancements. Fundamentally, with this, peer-to-peer to coherent memory doesn't need to involve the host unless there is a coherency conflict, so it removes that bottleneck. For example, if you are a NIC, you can directly access the HDM memory, which may have its own local processing, and simply inform the host only when there is a coherency conflict; otherwise you can just complete it at the device level, at the Type 3 or Type 2 device.
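The following short Python sketch (hypothetical names; a conceptual model, not the UIO wire protocol) shows what "moving producer-consumer enforcement to the source" means: the source waits for completions of its data writes before issuing the flag write, so intermediate switches and memory devices are free to service the writes in any order.

```python
# Conceptual model of source-enforced producer-consumer ordering with UIO.
import concurrent.futures
import time

def uio_producer(fabric_write, data_addrs, payload, flag_addr):
    """fabric_write(addr, value) models a hypothetical non-posted write that
    returns a future which completes when the target acknowledges it."""
    # Data writes may take different paths / targets and complete in any order.
    pending = [fabric_write(addr, payload) for addr in data_addrs]
    concurrent.futures.wait(pending)        # source waits for all completions...
    fabric_write(flag_addr, 1).result()     # ...before publishing the flag (ordering!)

# Example wiring, with a thread pool standing in for the fabric:
pool = concurrent.futures.ThreadPoolExecutor()
memory = {}
def fake_write(addr, value):
    return pool.submit(lambda: (time.sleep(0.01), memory.__setitem__(addr, value)))

uio_producer(fake_write, data_addrs=[0x0, 0x40], payload="data", flag_addr=0x80)
assert memory[0x80] == 1 and memory[0x0] == "data"  # consumer seeing the flag sees the data
```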
With CXL 2.0, and 1.1 and 1.0 before that, we had the bias flip flow, and I already talked about why we need Back Invalidate for peer-to-peer access. There are other reasons why we also needed it. The existing bias flip mechanism is available for Type 1 and Type 2 devices, not for Type 3, and the problem with it is that the host needs to be tracked fully, since the device could not back-snoop the host. So whatever memory you are hosting, you need to track the entire thing, either through a directory, or else the size of your snoop filter dictates the size of the memory that you can map into the HDM space. With Back Invalidate in CXL 3.0, we can enable snoop filter implementations, resulting in large memory that can be mapped as HDM. Let's look at the example here. The system on the left shows a Type 2 device with a snoop filter implementation that it uses to track which lines are with the host. Looking at the picture on the right: when a memory read request to cacheline X arrives at the device, let's say the snoop filter is full. We are going to hold off on X and figure out where X would go; X would go into the same location where cacheline Y is, so the device issues a Back Invalidate to cacheline Y, because it needs to evict Y from its snoop filter in order to make an entry for X. That Back Invalidate goes on the CXL.mem channel, as you see with Back Invalidate Y. That triggers the snoop flows from the host home agent side into other peer caches, and it all gets resolved; finally, in this example, you may end up getting a memory write to Y and the completion for it. The memory write goes to the device memory, and at that point you can make room for X, get the data for X, and provide that data back to the requester. So this enables implementing a snoop filter and still mapping your entire memory into the system coherency space, the HDM space.
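Here is a minimal, hypothetical Python sketch of that snoop-filter eviction flow (a direct-mapped filter and made-up method names, purely to illustrate the sequence described above): when the set is occupied, the device issues a Back Invalidate for the victim line before tracking the new one.

```python
# Hypothetical direct-mapped snoop filter on a Type 2 device, evicting via BI.
class SnoopFilter:
    def __init__(self, num_sets, send_back_invalidate):
        self.sets = [None] * num_sets          # each set tracks one host-held line
        self.num_sets = num_sets
        self.send_back_invalidate = send_back_invalidate  # BI on the CXL.mem channel

    def track(self, addr):
        """Track cacheline `addr` as given to the host, evicting a victim if needed."""
        index = (addr >> 6) % self.num_sets    # assume 64-byte cachelines
        victim = self.sets[index]
        if victim is not None and victim != addr:
            # Set is full: evict Y with a Back Invalidate and wait for its completion
            # (which may include a memory write of dirty data back to device memory).
            self.send_back_invalidate(victim)
        self.sets[index] = addr                # now there is room to track X

sf = SnoopFilter(num_sets=4, send_back_invalidate=lambda y: print(f"BI for line {y:#x}"))
sf.track(0x100)   # tracked in set 0
sf.track(0x140)   # tracked in set 1, no eviction
sf.track(0x200)   # maps to set 0 again -> Back Invalidate for 0x100 before tracking
```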
This picture shows CXL 3.0 multi-level switching. Earlier it was only a single level of switch; the picture on the right shows a hierarchical arrangement, where you can have one switch and, underneath that, a set of switches, with a lot of devices connecting to a host. So we enable much larger fan-out and much larger system construction with CXL 3.0 and multiple levels of switching. The picture on the left shows more of a cascaded switching arrangement, where you've got different devices connected; it's like a fabric topology, with switches connecting to each other and multiple hosts and multiple devices attached, and you can build your system that way.
The picture here shows multiple devices per root port, which got introduced with CXL 3.0. The picture on the left is the CXL 2.0 picture; notice that there is only one Type 1 or Type 2 device underneath a switch. You can have more memory, but you cannot have more than one Type 1 or Type 2 device, because every link only tracks one caching agent on the other side. With CXL 3.0 we have removed that restriction, and we can have up to, I believe, 16 Type 1 and Type 2 devices that you can track; of course you can have as many Type 3 devices as the fan-out allows. Multiple Type 1 and Type 2 devices underneath a CXL switch is what gets enabled with CXL 3.0.
CXL 3.0 also enables the notion of sharing. We had the notion of pooling with 2.0, and we expanded the use cases for pooling because we now have multiple levels of switches, but what does sharing really mean? Pooling effectively means that any given memory location is assigned to a given host at a given point in time; at a different point in time, that same memory location can be assigned to a different host as you go through the hot-plug flow. With sharing, multiple hosts can share the same memory location in a coherent fashion. How is that possible? Each of the hosts is a different cache coherency entity; they are independent systems. The home agent in H1, for example, is not going to talk to the home agent in H2 to orchestrate cache coherency; these are very independent systems with their own independent system maps. So what happens is that the device, say D4 in the picture, holds the shared memory. Let's say it is sharing the memory across multiple hosts, and five hosts want it shared; fine, you can give them the line shared, that's allowed. Now, if somebody wants it exclusive, you are going to launch the Back Invalidate flow to the other five and wait for them to complete before you can give the line to the host asking for exclusive ownership. That's the Back Invalidate flow we introduced, and because it is carried in CXL.mem, even Type 3 devices that really don't understand much about cache coherency can just issue Back Invalidates to the respective hosts involved and enforce a shared coherent memory space. We have also defined something called global fabric-attached memory (GFAM), which can provide access for up to 4095 entities. So we go from 16 hosts for pooling in CXL 2.0 to 4095 entities that can do not just pooling but also memory sharing, and of course we have enhanced the CXL fabric manager to do all the setup and deployment. Actually, going back to that, let me elaborate a little bit on the Type 3 devices. We have three basic kinds of Type 3 devices defined in CXL. There is the single logical device (SLD), which is assigned only to a single host CPU; this comes from the CXL 1.0/1.1 base, and I get this question a lot, so I thought I'd elaborate a little here. The second is the multi-logical device (MLD), which we introduced in CXL 2.0. Here you can assign the device to multiple hosts for pooling, and the maximum number of hosts that can be supported at a time is 16; with 3.0, MLDs can also do sharing. The third kind, introduced with 3.0, is global fabric-attached memory (GFAM). This is the same as an MLD in its basic capabilities, but it supports a large scale in terms of the number of hosts that can actively use the device, up to 4095. That scaling relies on the device directly participating in what we call the port-based routing (PBR) protocol extension, which comes with 3.0.
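A minimal, hypothetical Python sketch of that sharing rule follows (illustrative names only, not spec flows): the device hands out the same line shared to many hosts, and before granting exclusive ownership it Back Invalidates every current sharer.

```python
# Hypothetical model of a shared GFAM/MLD line: many sharers, or one exclusive owner.
class SharedLine:
    def __init__(self, send_back_invalidate):
        self.sharers = set()         # hosts holding the line shared
        self.exclusive_owner = None  # at most one host may hold it exclusive
        self.send_back_invalidate = send_back_invalidate

    def grant_shared(self, host):
        if self.exclusive_owner is not None:
            self.send_back_invalidate(self.exclusive_owner)  # demote the owner first
            self.exclusive_owner = None
        self.sharers.add(host)

    def grant_exclusive(self, host):
        # Back Invalidate every other sharer (and any owner) before handing over.
        for other in list(self.sharers - {host}):
            self.send_back_invalidate(other)
            self.sharers.discard(other)
        if self.exclusive_owner not in (None, host):
            self.send_back_invalidate(self.exclusive_owner)
        self.sharers.discard(host)
        self.exclusive_owner = host

line = SharedLine(send_back_invalidate=lambda h: print(f"BI -> {h}"))
for h in ["H1", "H2", "H3", "H4", "H5"]:
    line.grant_shared(h)             # five hosts share the line coherently
line.grant_exclusive("H6")           # BIs go to H1..H5 before H6 gets ownership
```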
So, Type 3 devices have three different flavors: SLD, MLD, and GFAM, and all three can be used within a switch hierarchy at any level. You can build a CXL switch with traditional hierarchy-based routing (HBR), as originally defined in CXL 2.0, or with the new port-based routing (PBR) extensions that we have defined in 3.0. GFAM devices rely on PBR extensions being supported by the switch, whereas SLDs and MLDs can be connected to either switch type. I hope that clarifies it.
Now, looking into multiple levels of switching with CXL 3.0: with 2.0, we only had a single level of switching and only one Type 1 or Type 2 device. With 3.0, we have up to 16 CXL.cache devices, so 16 Type 1 or Type 2 devices. And, of course, you can have any number of Type 3 devices that the fan-out is going to support.
We talked about the shared memory mechanism a few slides back; this goes through it in some more detail. Device memory can be shared across all hosts, and the usage model is to increase data flow efficiency and improve memory utilization. Earlier there was no coherent memory mechanism for independent hosts, including the devices within those hosts, to do memory-based message passing or semaphores between them; with shared memory, those are now possible. You can imagine building large HPC systems where, of course, you have your pooled memory, but you also have some shared memory where different entities doing different compute can use the shared region either to pass messages between each other or to hold a common data structure that they are all working off of. So we enable a lot of those usages with very large-scale systems, up to 4095 entities under the switch hierarchy that we have.
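As a toy illustration of that message-passing usage (purely hypothetical; real systems would use architected atomics and cache management on the shared HDM region), here is a short Python sketch of a one-producer, one-consumer mailbox living in a memory region shared coherently between two hosts.

```python
# Toy mailbox in a coherently shared memory region between two hosts.
# Real code would use proper atomics/flushes; this only shows the pattern.
class SharedRegion:
    def __init__(self):
        self.flag = 0          # "message ready" flag in shared, coherent memory
        self.payload = None    # message body in the same shared region

def host_a_send(region: SharedRegion, msg: str) -> None:
    region.payload = msg       # write the data first...
    region.flag = 1            # ...then publish (producer-consumer ordering)

def host_b_poll(region: SharedRegion):
    if region.flag == 1:       # coherency guarantees B sees A's latest writes
        region.flag = 0
        return region.payload
    return None

shared = SharedRegion()
host_a_send(shared, "work item 42")
print(host_b_poll(shared))     # -> "work item 42"
```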
This slide speaks to the essence of the capabilities brought to bear with CXL 3.0. Just to recap, we have broken the limitations of tree-hierarchy topologies, which enables high bisection bandwidth with parallel paths. As you can see in the picture, this is a true composable system, a true fabric; there is nothing like a tree topology here. And of course this is not the only way you can construct it, but you can imagine building systems with spine-level switches and leaf-level switches. The picture on the right basically shows a rack. At the bottom of it you have multiple host CPUs, but fundamentally each CPU is like an independent server; that's what it represents, with its own memory, I/O, and so on. You've got some amount of memory, a bunch of accelerators, a GFAM memory device, and NICs; these are all end devices connected through leaf switches and spine switches. At 64 gigatransfers per second per lane, with cables, the rack is definitely within reach, and depending on the distance and how many retimers you use, or whether you go optical, you can even do pod-level connectivity; CXL 3.0 allows for that. We have also enabled computational storage with CXL 3.0. The memory that you see, for example, can do some local computation, because it participates, indirectly, in the cache coherency by doing Back Invalidates; you can ask the memory entity to do some local processing and it will still be coherent with the rest of the system. We enable direct peer-to-peer, which gives better performance, and we talked about shared coherent memory, which enables communication across hosts and devices using load-store semantics. So, fundamentally, what you have is a set of compute nodes, a set of memory nodes, a set of accelerator nodes, other types of I/O nodes, and you can create fungible systems. Not only can you compose systems, growing memory capacity by going to the pooled memory or asking for additional accelerators, but these systems can now also work collaboratively through things like GFAM or message passing using CXL.io. We have all of these capabilities built in. Fundamentally, it's a tremendous breakthrough that we have been able to achieve with 3.0. We started with 2.0, and with 3.0, effectively, the load-store interconnect is moving from the node level, which was a single domain, a single server, to the rack level and beyond.
So, in conclusion, CXL 3.0 offers full fabric capabilities along with fabric management. We have expanded the switching topologies, offered enhanced coherency capabilities, and are able to do peer-to-peer resource sharing. We have doubled the bandwidth while keeping the latency flat compared to CXL 2.0, which is very important for us because we are doing coherency and memory semantics. And all of this while being fully backward compatible with CXL 2.0, 1.1, and 1.0, the prior generations. This backward compatibility enables us to really innovate without creating a lot of angst amongst people, because their investments are protected; they can make the transition whenever they want to, and it's going to just work. We have enabled a lot of new usage models with memory sharing between hosts and peer devices. We now support multi-headed devices because of the fabric capability; in other words, you can have a Type 3 device with multiple links talking to different CPUs directly or different switches directly. You've got the enhanced coherency capabilities with Back Invalidate, and we've got the expanded support for Type 1 and Type 2 devices. And with GFAM, we provide expansion capabilities for current and future memory. You can download the 3.0 specification; it's available. And, as Danny said, if you have not joined the consortium, please do join the CXL journey. In my mind, it has just started. We have plenty of new, innovative usages that we are working on to evolve this technology further in a fully backward-compatible manner. CXL is already changing the compute landscape, and it's going to continue to change it very, very profoundly in the coming decades.
So with that we'll go for Q&A.
Yeah, thank you, Debendra and Danny. We will now begin the Q&A portion of the webinar, so please share your questions in the question box. The first question is about the host cache that a GFAM device accesses for Back Invalidation: does it consist of cache memory lines of other GFAM devices too, or do they all have separate host caches for different GFAM devices? How about different media partitions in a GFAM device; do they have a separate cache for each media partition?
Okay, I'm assuming the question is from the host's perspective. Any caching agent accessing a memory location is doing that on a per-cacheline basis, so yes, the accesses will still be on a per-cacheline basis. As far as the GFAM device is concerned, it's going to provide that access just like any memory device. And if it is supporting shared coherency across different hosts, it needs to enforce that using the hardware cache coherency mechanisms. Danny, did you want to add anything to that?
No I think that covers it all.
Thank you. With the change in flit size, will CXL 3.0 remain compatible with CXL 2.0 devices? Can hosts or switches support a mix of CXL 2.0 and 3.0 devices connected to the same switch or fabric?
If you noticed in the table that we saw, CXL 3.0 needs to support the 68-byte flit size as well as the 256-byte flit size, so yes, that's how it will interoperate. The 68-byte flit size at 32 GT/s is the lowest common denominator, so yes, switches and hosts can mix and match CXL 2.0 and CXL 3.0 devices, and it will just work fine.
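As a rough illustration of that "lowest common denominator" idea (a hypothetical sketch, not the actual link-training algorithm), each side advertises what it supports and the link settles on the highest mode both understand:

```python
# Hypothetical negotiation sketch: pick the best flit mode / rate both ends support.
# Ordered from most basic to most capable, per the feature table discussed earlier.
MODES = [
    ("68B flit, 32 GT/s", 1),    # CXL 1.x / 2.0 baseline
    ("256B flit, 32 GT/s", 2),   # CXL 3.0 flit at the lower rate
    ("256B flit, 64 GT/s", 3),   # CXL 3.0 at the full rate
]

def negotiate(host_caps: set[str], device_caps: set[str]) -> str:
    common = [(name, rank) for name, rank in MODES
              if name in host_caps and name in device_caps]
    return max(common, key=lambda m: m[1])[0]   # best mutually supported mode

cxl3_host = {"68B flit, 32 GT/s", "256B flit, 32 GT/s", "256B flit, 64 GT/s"}
cxl2_device = {"68B flit, 32 GT/s"}
print(negotiate(cxl3_host, cxl2_device))  # -> "68B flit, 32 GT/s" (lowest common denominator)
```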
Yeah, maybe I'll just add a small tidbit there. You know, one of the primary goals in the CXL specification is to maintain that backward compatibility. So, you know, we've taken cautious steps to make sure that compatibility is maintained as we proceed through the newer versions of the specification.
So, the Back Invalidate flow expects devices or switches to keep track of the status of all cache lines; otherwise, it will trigger BI on all peer accesses. Keeping track of all these cache lines will make switching devices more complex and will break the premise of keeping devices simple. Is that a correct assumption?
Let me answer the question first, and some of it we'll get to later. With all optional capabilities, anytime you want a new feature there is an extra amount of hardware that you need to build. That's expected. Now, the question is: is that hard, does it cause a lot of complexity? Let's look into it. If you're, for example, doing Type 3 memory, and that's pretty much where you would expect this, or even Type 2, anytime you have memory mapped into the HDM space, fundamentally you are supporting the meta bits, which are basically the directory. So those exist. What you are really doing is looking into that and trying to figure out, "Do I need to issue a Back Invalidate?" I don't think that's a huge lift. We are not participating in any other cache coherency action; all the orchestration still gets done by the host processor. So, yes, I believe it is still really simple. And, if I may: no, that assumption is not a correct assumption; it is still simple.
Could you repeat the latency limits you mentioned?
Those latency numbers are guidance in the specification. The guidance we have given is: snoop to response, pin to pin, is 50 nanoseconds; and if you're accessing memory like DRAM or HBM on the device side, it should be 80 nanoseconds pin to pin. It's guidance, not a limitation or a mandate.
Could you please explain the relation between UIO and BI again?
Sure. What UIO does is move the producer-consumer ordering enforcement to the source. By that, what I mean is: let's say I'm a NIC device doing a bunch of writes and reads to memory. Today, I just issue those writes and reads; I expect myself to follow the producer-consumer ordering, and I expect the switches to do the same. As a result, writes are posted: I can just send the writes and forget about them. With Unordered I/O, which is again an optional feature, what you can do is say, "Look, the writes are not posted. I'm going to wait for the writes to get a response, and because I'm waiting for those responses, I, as the source, meaning the NIC that is generating this, am going to make sure that producer-consumer ordering is enforced. The rest of you can do these reads and writes in any order in UIO." Why is that important, and how does that work with BI, Back Invalidate? It becomes important because imagine you are sending reads and writes as a device, like a NIC, to memory that can be in multiple places. You don't have to worry about, "Did this write get to the other memory device or not?" because you just told them they can do the writes in any order; you are enforcing the ordering yourself. So they get those writes, and each of those memory controllers or Type 3 memory devices, of which there might be 10, 15, or however many in a system, might get a different write from the same device and just perform it locally. Now, if there is a coherency conflict, say you are writing to a device and the host processor has that line, then you need the Back Invalidate. But most of the time, when I/O is accessing the memory, we really do not expect a coherency conflict. It can happen, but it's not a high-probability event. And even if there were a conflict, if you count the number of hops going back and forth to the processor, you will still come out ahead with this Back Invalidate approach.
Is it possible to connect multiple racks with CXL 2.0/3.0?
From a protocol point of view, you can connect multiple racks. I think the real question is the physical reach. Electrical cables can only go a certain distance, so as long as that can be managed, yes, you can reach them using electrical methods. The other option is an electrical-optical-electrical conversion, and then the distance problem gets solved. From a CXL protocol point of view, we wanted to solve the large-scale problem at the protocol level, and of course there is a reach aspect as well. As I said, within a rack the reach is easy with cables; across racks it's a challenge with electrical cables, but that's where you can deploy optical. Danny, did you want to add something there?
Yeah, maybe I'll add that with port-based routing, fabrics can reach over 4,000 devices. To implement something that large, fundamentally you would need to go beyond a single rack. As Debendra mentioned, there are obviously some physical limitations with respect to SI and the physical link, but those can be overcome with different mechanisms, such as optical, or even retimers; you can add those for longer reach.
So, in a GFAM scenario, how do we meet the latency and bandwidth? Are any retimers planned? If so, how many are allowed?
Okay, let me do the easier one first: are any retimers planned? Yes, we support retimers, and that's independent of GFAM. CXL has been supporting retimers; in fact, even when we were doing the 128b/130b encoding, we had things like sync header off for doing low latency in the retimer. Those are supported currently, and you can pretty much use a PCIe retimer and it will still work; if it supports the CXL optimizations, you will get additional lower latency. Up to two retimers per link are allowed. Now, in the GFAM scenario, how do we meet the latency and bandwidth requirements? The bandwidth is easy; it's mostly about how many conflict cases you are dealing with. The question of latency becomes interesting because, of course, you are going to go through multiple levels of checks, especially for the shared memory, so you are going to increase the latency a bit. But the most important factor will be whether you run into conflicts; that is where the latency can start going up, because you're trying to resolve coherency conflicts there. So, that's a little bit of a nuanced answer. Danny, did you want to add something?
The latency targets that Debendra mentioned earlier in the call are just that: targets. As we expand to larger and larger fabrics, it will be harder to hit some of those targets, but obviously you get the benefit of the expansion to more devices, so it's not super critical that those targets get hit. When you talk about, say, a fully deployed 4,000-device fabric, there's the potential of tiering things, such as tiering memory. So you may have longer latency to certain memory devices, for example, but that could be acceptable based on your workload and the overall TCO of the system. There are a lot of trade-offs when you start talking about latency and fabrics in particular, and they are going to be system-dependent.
Okay so will traditional PCIe devices attached to a CXL 3.0 switch port be able to use the UIO peer-to-peer flows with other PCIe and/or CXL devices attached to the switch?
The UIO part we are developing with PCIe 6.0, so that PCIe devices can also take advantage of it. When it is to other PCIe devices, remember that that memory is non-coherent memory. We can still do UIO to the non-coherent memory, but the coherent memory is where you are going to get a lot of benefit from a CXL perspective. So you should be able to take advantage of it, even for PCIe, assuming the PCIe devices have implemented the UIO semantics.
Okay, well, thank you, Debendra and Danny, for sharing your expertise. The presentation slides will be available on the CXL Consortium's website, and we will address all the questions we received today in a future blog post, so please follow the CXL Consortium on Twitter and LinkedIn for updates. Danny, back to you.
Thanks, Elsa, and thanks everybody for attending and submitting your questions. The presentation recording is going to be available on the CXL Consortium YouTube channel, and the slides will be available on the consortium website. We hope you'll reach out with any questions you may have after reviewing the evaluation copy of the 3.0 specification, which can also be found on our website at computeexpresslink.org. Thanks again. Have a wonderful remainder of your week. Thanks for watching.