Good morning and good afternoon. Thank you for attending the CXL Consortium's "CXL 3.0: Enabling Composable Systems with Expanded Fabric Capabilities" webinar. We will have a Q&A session at the end of the presentation, so please submit your questions via the chat feature and they will be addressed. At this time, I will turn the presentation over to Danny Moore from member company Rambus.
Hey, thanks, Elsa. Hello, everybody, and welcome to our 3.0 webinar. We're all really excited to introduce the latest 3.0 specification and some of the novel device types, capabilities, and interconnects that the consortium has been busy architecting. My name is Danny Moore. I'm a senior manager in strategy and product management at Rambus, where I primarily drive product planning in our data center business unit, with a focus on CXL. I've also been fortunate enough to have been involved in the consortium since its beginnings in 2019; currently I'm an active contributing member of the CXL marketing work group and have most recently been focused on the 3.0 specification release. My co-presenter is Debendra Das Sharma. He's one of the foundational architects of the CXL specification. He serves as co-chair of the CXL technical task force and is a senior fellow at Intel.
In the CXL Consortium, we're always interested in growing our participation. There are a few different levels at which you or your company can get involved in CXL. There's an adopter membership, which is free of charge. Or, if you choose to take an active role in the specification development, you can sign up for the contributor or promoter levels of membership. You can head over to computeexpresslink.org for more information. Our membership continues to grow year over year; we currently have over 200 member companies. And like the membership, our board of directors represents a wide array of companies from industry, including CSPs, hyperscalers, the major CPU and accelerator vendors, memory suppliers, and several OEMs. To reiterate, we're always interested in strengthening our member community, so if you're not currently participating, please do head over to computeexpresslink.org and find out how you or your company can contribute.
So as I mentioned previously, it was back in March of 2019 when the 1.0 specification was first released. And by September, we'd already published the 1.1 specification and officially incorporated the CXL consortium. By November of 2020, we had released the 2.0 specification. And just in August of this year, we published the 3.0 specification. I won't get into any details regarding features and capabilities that come along with that specification cadence as Debendra will be walking you through some of those historical details before jumping into the 3.0 features and capabilities. So with that said, I'll hand it off to Debendra to provide a technical overview of the CXL specification. Thank you very much.
Thank you, Danny. Greetings, everybody. Here's the agenda for today. We'll start with a recap of the industry landscape, followed by a recap of CXL 1.0/1.1 and 2.0. Then we'll dive into the CXL 3.0 features and capabilities, and conclude. There have been a lot of webinars where we have talked about CXL 1.0 and CXL 2.0 in detail, so you may want to look into those for more depth. Today I'll just do a very brief recap.
When we look at the industry landscape today, we see some very clear mega trends emerge. Cloud computing has become ubiquitous. Networking and edge computing are exploding, increasingly using the cloud infrastructure. Artificial intelligence and analytics are needed to process the huge amount of data that we are generating. All of these are driving a lot of innovations across the board. We see increasing demand for heterogeneous processing as people want to deploy different types of compute for different applications, whether it is general-purpose CPUs, GPGPUs, custom ASICs, or FPGAs. Each of them is important and best suited to solve some class of problems. In addition to the demand for heterogeneous computing, we also see the need for increased memory capacity and bandwidth in our platforms. That's natural, because as we process more, we need to feed the beast, so to speak. There is also a class of memory between DRAM and SSD that needs to be thought of as a separate memory tier, offering a compelling value proposition due to its capacity and persistence semantics. So the mega trends we are discussing here take advantage of these types of memory in addition to the evolution in traditional DRAM memory and storage.
Compute Express Link is defined from the ground up to address the challenges in this evolving landscape by making both heterogeneous computing and different types of memory efficient. And it's meant to sustain the demands of the computing segment for many years to come. We're going to look into these in a lot of detail: CXL is built on PCIe infrastructure, adds memory and coherence semantics, and is an open industry standard, as we have talked about.
CXL has followed a three-pronged approach for ensuring ease of ecosystem adoption. At the fundamental level, CXL is built on top of, and leverages, PCI Express infrastructure. It overlays the caching and memory protocols on top of the existing PCI Express protocol, and it runs on PCIe 5 channels. There are three protocols. CXL.io, which as we said is essentially PCI Express, is used for device discovery, event reporting (errors, et cetera), I/O virtualization services, and direct memory access for data movement; we didn't want to reinvent any of that wheel. What we did was add the CXL.cache semantics for cache coherency and CXL.mem for the memory protocol. Those were the key additions we made in order to enable the types of usages that we talked about. We started with the 32 GT/s data rate because of the high performance it offers. PCIe adoption helps lower the barrier of entry since PCI Express 5 is ubiquitous, and it has low latency and low power characteristics. This also enables plug and play: a customer can choose to use a PCIe slot to plug in either a PCIe device or a CXL device depending on the need, and different customers will have different needs. So the first aspect was plug and play on the PCIe infrastructure, building the coherency and memory semantics on top of it. The second aspect is low latency. It is critical because caching and memory protocols need low latency to not adversely affect system performance. We are expecting latencies in the range of existing symmetric cache coherency links for the cache and memory traffic. For example, the specification provides guidance on expected latency: 50 nanoseconds pin-to-pin for snoops, 80 nanoseconds pin-to-pin for DRAM accesses. There is also a reporting mechanism so that system performance is not adversely affected if slower memory devices are used: the device can report its latency and the system software will map things accordingly. The third important aspect is that CXL is an asymmetric protocol. The protocol flows and message classes are different between the host processor and the devices. That's a conscious decision to keep the protocol simple and the implementation easy. That's why you will see the home agent is present in the host processor; it tends to be very dependent on the inherent microarchitecture, and different implementations do it very differently. We have abstracted all of that away and given some very simple, MESI-based coherency semantics, and on the memory side you don't really need to understand how cache coherency works; you just need to provide the data along with the metadata.
CXL 1.0 and 1.1 are a direct connect between a host processor and a device, and they targeted three types of usage models. On the left is a Type 1 device, which basically needs caching semantics for usages like a SmartNIC to deliver better performance. An example would be a PGAS ordering model, where the ordering model is different from the PCI Express ordering model; with caching, you can complete those operations within your local cache and enforce that ordering. This can be done in addition to the traditional PCI-style load/store I/O. So that's the Type 1 device. The middle example is a Type 2 CXL device. Typical usages include things like GPGPUs, FPGAs, and dense compute. These devices have some local memory attached to them, which is used for their computation. We expect them to implement all three protocols: CXL.io, CXL.cache, and CXL.mem. Caching and memory semantics would be used to populate and pass operands and results back and forth between the different computing entities with very low latency and high bandwidth efficiency. The system on the right is a Type 3 device. The usages here would be things like memory bandwidth expansion, memory capacity expansion, and also tiered memory, including storage class memory; these address the memory challenges we talked about. Type 3 devices only need to implement the CXL.io and CXL.mem semantics. The memory will be mapped into the system memory as cacheable memory. The host processor orchestrates the cache coherency here, and that relieves the devices from having to implement the complex coherency flows, because of the asymmetric nature of CXL. So the device doesn't even need to know anything about caching semantics; it doesn't need to implement CXL.cache.
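As a quick reference for the three device types just described, here is a minimal Python sketch (an illustration only, not anything defined by the specification) of which CXL protocols each device type implements:

```python
# Illustrative mapping of CXL device types to the protocols they implement,
# as described above. The names and structure are just for illustration.
DEVICE_PROTOCOLS = {
    "type1": {"cxl.io", "cxl.cache"},              # e.g. a caching SmartNIC
    "type2": {"cxl.io", "cxl.cache", "cxl.mem"},   # e.g. GPGPU/FPGA with local memory
    "type3": {"cxl.io", "cxl.mem"},                # e.g. memory expander / pooled memory
}

def implements(device_type: str, protocol: str) -> bool:
    """True if the given device type carries the given CXL protocol."""
    return protocol in DEVICE_PROTOCOLS[device_type]

assert not implements("type3", "cxl.cache")  # Type 3 never needs caching semantics
```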
After CXL 1.0 and 1.1, we did CXL 2.0, where we added support for pooling of resources across multiple servers, as shown in the picture here. These resources can be accelerators, or they can be memory. Memory pooling is done at a finer-grained level and can be supported by multi-logical devices (MLDs), which can support up to 16 hosts, or 16 servers, or virtual hierarchies; we use the terms interchangeably. They are all assigned non-overlapping memory. For example, if you look at the picture, D3 is assigned to two hosts: H3, which is colored purple, and H2, which is colored green. So you can do up to 16. And these resources, by the way, can move between different hosts following a standard hot-plug flow. This capability with CXL 2.0 enables resources like accelerators or memory to be pooled at the rack level. It's a huge innovative breakthrough, as we no longer need to have captive resources in a tightly coupled cache-coherent system. In today's systems, you've got memory, I/O, GPGPUs, and accelerators along with your processors, and those are tightly coupled to that particular server. If another server needs more memory, you cannot borrow memory from a neighboring server and still use load/store semantics. With CXL 2.0, we enabled exactly that, which is a huge breakthrough from that perspective. Now we can dynamically compose systems with additional memory or accelerators from the pool. And in the picture, even though it shows only memory pooling, you can imagine accelerators being pooled similarly. For example, if a host needs an additional three accelerators for a task that it is running, it can come to the pool, get the three accelerators, and once it is done, release them back to the pool for somebody else to use. That results in significant total cost of ownership benefit along with delivering better power-efficient performance.
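A fabric manager orchestrates this kind of pooling through standardized APIs (discussed later); the following is a minimal, hypothetical Python sketch of just the core bookkeeping, not the actual fabric manager API: carving an MLD's capacity into non-overlapping regions, assigning them to at most 16 hosts, and releasing them back to the pool.

```python
# Hypothetical sketch of MLD-style pooling bookkeeping; not the CXL FM API.
class PooledMemoryDevice:
    MAX_HOSTS = 16  # an MLD can be carved up among at most 16 hosts

    def __init__(self, capacity_gb: int):
        self.free_gb = capacity_gb
        self.assignments: dict[str, int] = {}  # host id -> GB assigned (non-overlapping)

    def assign(self, host: str, size_gb: int) -> None:
        """Give a host an exclusive, non-overlapping slice of the pool."""
        if host not in self.assignments and len(self.assignments) >= self.MAX_HOSTS:
            raise RuntimeError("MLD already serves 16 hosts")
        if size_gb > self.free_gb:
            raise RuntimeError("not enough free capacity in the pool")
        self.free_gb -= size_gb
        self.assignments[host] = self.assignments.get(host, 0) + size_gb

    def release(self, host: str) -> None:
        """Managed hot-remove: return the host's slice to the pool."""
        self.free_gb += self.assignments.pop(host, 0)

pool = PooledMemoryDevice(capacity_gb=1024)
pool.assign("H2", 256)   # green host in the slide
pool.assign("H3", 256)   # purple host in the slide
pool.release("H2")       # capacity goes back to the pool for another host
```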
CXL 2.0, of course, enables a single level of switching for the fan out. I mean, fan out is where you can have a single link and then you can connect multiple devices underneath a switch. So that part we enabled with CXL 2.0. And of course, the more advanced thing is the pooling notion that we talked about.
Fundamentally, with CXL 2.0 we talked about resource pooling and disaggregation, and that happens through managed hot-plug flows to move resources across different hosts. A Type 1 or Type 2 device behind the switch is assigned to one host at a time. A Type 3 device, if it is a multi-logical device (MLD), can be assigned to up to 16 hosts at a time; you can still move it, and you can move different chunks of memory across hosts, which enables pooling at the rack level. All of this has direct load/store, low-latency access, similar to memory attached to a neighboring CPU, which is very different from doing RDMA over a network, which has much, much higher latencies. We defined persistence flows to support persistent memory. We also defined how a fabric manager works, with APIs for the switches as well as for the devices, so that you can manage this resource migration across different hosts. We enhanced security by adding authentication and encryption with CXL 2.0. Fundamentally, CXL 2.0 goes from node-level to rack-level connectivity. It enables disaggregation of the system with CXL, which is done to optimize resource utilization; that in turn lowers total cost of ownership and gives us better power-efficient performance. All of this while CXL 2.0 is fully backward compatible with CXL 1.0 and CXL 1.1.
Now let us recap the industry trends that we just talked about. From the trends point of view, we've got use cases that need ever higher bandwidth, because we've got a lot of high-performance accelerators, system memory whose bandwidth demand is insatiable, SmartNICs, and leading-edge networking. As you process more, you need more bandwidth moving around the system while maintaining low latency, low power, and all of those good things. We also know that CPU efficiency is declining due to reduced memory capacity and bandwidth per core, so we need to shore that up with CXL. We need to do efficient peer-to-peer resource sharing across multiple domains, and we're going to see this in some of the examples. And we have memory bottlenecks due to CPU pin-count and thermal constraints, because the DDR buses are pin-inefficient. Those are the industry trends. Some of them we have addressed with CXL 1.0, 1.1, and 2.0. With CXL 3.0, we are introducing fabric capabilities, and we're going to see more of those: multi-headed devices, enhanced fabric management, and composable disaggregated infrastructure that we're going to see examples of. We have better capacity and scalability, better resource utilization for memory pooling, larger systems, new enhanced coherency capabilities, and improved software capabilities. At the same time, we have managed to double the bandwidth while keeping the latency flat. All of this while being fully backward compatible with CXL 2.0, CXL 1.1, and CXL 1.0.
From a CXL 3.0 specification point of view, these are the key capabilities: better bandwidth, fabric capabilities, improved memory pooling and sharing, enhanced coherency semantics, peer-to-peer, and so on. We're going to go through these in a lot more detail in the upcoming slides.
This is a nice tabular view of the different features that are available with the different revisions of the specification. We started, as I said, with 32 gigatransfers per second (GT/s) in CXL 1.0 and kept that through 2.0. With CXL 3.0 we are doubling the bandwidth to 64 GT/s. Of course, you can still operate at 32 GT/s, but the maximum data rate is 64 GT/s. The flit was 68 bytes up to 32 GT/s, and that is maintained all the way through 3.0, but with 3.0 at 64 GT/s we run a 256-byte flit. Type 1, Type 2, and Type 3 devices have been supported from the 1.0 days, and that continues. With 2.0 we saw memory pooling, as well as accelerator pooling with MLDs; that starts with 2.0 and continues through 3.0. We defined the persistence flows from 2.0 onwards. IDE is the security enhancement that I talked about, and that is there from 2.0 onwards. Single-level switching got introduced in 2.0, and multi-level switching got introduced in 3.0. In addition to that, CXL 3.0 has other features like direct memory access for peer-to-peer, which results in better bandwidth and, of course, much better bisection bandwidth in the system. We've got enhanced coherency to better manage accesses, which will result in better efficiency in the system, as we will see. We have also defined memory sharing across multiple nodes; this is a very new concept, not just pooling but sharing of memory, which enables a lot of different usages. We are also increasing the number of Type 1 and Type 2 devices that can be supported per root port, so this goes from one to multiple. And of course we have fabric capabilities, and we are going to go through these in some detail in the upcoming slides.
First, we did double the bandwidth with CXL 3.0: we use PCIe 6.0 running at 64 gigatransfers per second. In order to go to 64 GT/s, we have to do PAM4 signaling, which results in a very high bit error rate, and this has been mitigated by PCIe 6.0 with the use of forward error correction (FEC) and an 8-byte CRC. The PCIe 6.0 flit layout is shown in the top picture there: you've got some number of TLPs, which are transaction layer packets, then the DLP, the data link layer packets, followed by the 8-byte CRC and then a 6-byte FEC. The way the FEC works is that you get a particular flit, you correct that flit using the FEC, then you apply the CRC. If the flit passes, you consume it; if it fails, you cause a link-level replay. There is a PCIe webinar that goes through this in a lot of detail, and also there are a few papers; one of them I have listed here, and if you are interested you can take a look at it, as it covers a lot of the technical details. For CXL 3.0, we took advantage of the same thing. There are two types of flit arrangement. The picture in the middle shows what we call the standard flit layout; as you will see, it's very similar to the PCIe 6.0 flit layout. We've got a 2-byte flit header: you need the header at the beginning to tell which stack the flit is going to, whether it is a CXL.io flit, a CXL.cache flit, or a CXL.mem flit, and in addition you get things like the sequence number and what you need for managing it, ACK, NAK, replay, those kinds of things. The 8-byte CRC, you will see, is in the same place, and then you've got the 6-byte FEC, and the rest of it is the data, which can be CXL.io, CXL.cache, or CXL.mem. We also have a latency-optimized version, which is the picture at the bottom. The 256 bytes get split into two 128-byte halves, each of them independently protected by a 6-byte CRC, but the FEC is across the whole 256 bytes. The idea here is that when you get 128 bytes, if you pass the CRC, you just consume it without even doing the FEC. If you fail the CRC, then you gather the entire 256 bytes, apply the FEC, and then do the independent CRC check on each 128-byte half; if it passes, you consume it, and if it fails, you go and ask for the replay. That results in much lower latency, because you are only doing the FEC correction when there is a failure, and also the CRC accumulation is not over 256 bytes, it is at 128 bytes. That's the reason why we get essentially a zero-latency adder, and all of that good stuff. These flit modes extend to the lower data rates also, because once you are in a particular flit mode, you need to stay in that flit mode; you cannot go back and forth between them. And because the flit sizes are bigger, we have enabled several new CXL 3.0 protocol enhancements with this 256-byte flit format.
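To illustrate the latency-optimized receive path just described, here is a minimal Python sketch under stated assumptions: the check_crc, run_fec, and request_replay helpers are hypothetical placeholders, not spec-defined functions, and the sketch only captures the decision flow, namely CRC per 128-byte half first, FEC over the full 256 bytes only on a CRC failure, and replay as the last resort.

```python
# Hypothetical decision-flow sketch of the latency-optimized 256B flit receive path.
# check_crc(), run_fec(), and request_replay() stand in for real PHY/link-layer logic.
def receive_latency_optimized_flit(halves, check_crc, run_fec, request_replay):
    """halves: two 128-byte chunks, each carrying its own CRC."""
    if all(check_crc(h) for h in halves):
        return halves              # common case: consume both halves, no FEC latency

    corrected = run_fec(halves)    # FEC runs across the whole 256-byte flit
    consumed = []
    for half in corrected:
        if check_crc(half):
            consumed.append(half)  # corrected half is good, consume it
        else:
            request_replay()       # still bad -> fall back to link-level replay
            return None
    return consumed
```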
Coming to the protocol enhancements, there are two major ones that are very simple constructs but extremely powerful, and they are going to have a profound impact on the world of compute for decades to come. Those are Unordered I/O, UIO for short, and Back Invalidate, BI for short. There are multiple reasons why we went with these; one of them is listed here. We basically wanted to enable non-tree topologies. So far, if you look at PCI Express or CXL, those are based on tree topologies, and we need a tree topology because on the I/O side you have what is known as producer-consumer ordering semantics. Those semantics get enforced at every entity, whether it is a switch, an endpoint, or a root complex; everybody enforces that producer-consumer ordering, and that doesn't work with anything other than a hierarchical tree topology. That works fine for a lot of applications, but for fabric-style applications, where you have disaggregated resources and you want them to talk to each other directly, you don't want to always go through the host for every communication; that basically makes the host the bottleneck. In this example, let's say I've got a device D1 that wants to access HDM memory residing in, say, D5 or D6. HDM stands for host-managed device memory; it's basically coherent memory. With CXL 1.0 or 2.0, D1 would have to go through the host, which orchestrates the coherency for that particular memory; the host would fetch the line from the device, say D5, and if it is a read, it would then send it back. Instead, we want to bypass the whole thing and go to that device directly. That eliminates a lot of unnecessary traffic, and it also enables parallel paths, which is good for delivering better bandwidth and better latency. Those are the things peer-to-peer enables. Now, why do I need Unordered I/O and Back Invalidate? Let's say D1 tries to read from D5, and it is coherent memory. If D5 notices that the host has that particular line exclusive, then D1 may not get the latest and greatest copy of the data. Today, with CXL 1.0 or 2.0, there is no way for a device to send a snoop back to the host, because it's an asymmetric protocol. So what we did was take CXL.mem, which has the least dependency from a protocol-dependence point of view, and send a Back Invalidate to the host saying, hey, somebody wants to access this particular line. The host's cache coherency mechanism will then kick in, and it will respond to that Back Invalidate. Most of the time you really do not expect there to be a conflict, so you're going to just go to the device, find out nobody has it, and complete your transaction. In the rare case where that is not true, you're going to invoke the Back Invalidate flow.
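Here is a minimal, hypothetical Python sketch of that peer-to-peer read flow (the class and method names are illustrative, not from the specification): the target device completes the access locally in the common case and only issues a Back Invalidate toward the host when the host currently holds the line.

```python
# Hypothetical model of a peer-to-peer read to HDM with Back Invalidate (BI).
class Host:
    def __init__(self):
        self.cache = {}                          # cacheline address -> dirty data
    def snoop_and_invalidate(self, addr):
        return self.cache.pop(addr, None)        # drop the line, return dirty data if any

class HdmDevice:
    def __init__(self):
        self.memory = {}                         # cacheline address -> data
        self.host_has_exclusive = set()          # lines the host currently owns exclusive

    def back_invalidate(self, host, addr):
        """Ask the host (conceptually via a CXL.mem BI) to give up the line."""
        latest = host.snoop_and_invalidate(addr)  # host resolves its own caches
        if latest is not None:
            self.memory[addr] = latest            # absorb the writeback
        self.host_has_exclusive.discard(addr)

    def peer_read(self, host, addr):
        """Read issued directly by a peer device (e.g. D1 -> D5), bypassing the host."""
        if addr in self.host_has_exclusive:       # rare conflict case
            self.back_invalidate(host, addr)
        return self.memory.get(addr)              # common case: complete locally

host, d5 = Host(), HdmDevice()
d5.memory[0x80] = b"old"; host.cache[0x80] = b"new"; d5.host_has_exclusive.add(0x80)
print(d5.peer_read(host, 0x80))   # BI resolves the conflict first, so the peer sees b"new"
```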
With UIO, what happens is that once you try to mix the coherency world with the producer-consumer world, you run into problems because you've got multiple paths. Unordered I/O is a way to break that: it moves the producer-consumer enforcement to the source. These two mechanisms are going to have that profound impact. With this, even for I/O traffic, I don't have to have a tree hierarchy, because my ordering point is at the source. D1 has to enforce that ordering: writes are no longer posted, writes get a completion back, so I know when the write made it, and then I can do my producer-consumer ordering that way. So these are two constructs that are extremely powerful, and you can build very capable disaggregated and composable systems using fabric topologies with these two protocol enhancements. Fundamentally, with this, peer-to-peer to coherent memory doesn't need to involve the host unless there is a coherency conflict, so it removes that bottleneck. For example, if you are a NIC, you can directly access the HDM memory, which may have its own local processing, and simply inform the host only when there is a coherency conflict; otherwise you can just complete it at the device level, at the Type 3 or Type 2 device.
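The following short Python sketch (hypothetical names; a conceptual model, not the UIO wire protocol) shows what "moving producer-consumer enforcement to the source" means: the source waits for completions of its data writes before issuing the flag write, so intermediate switches and memory devices are free to service the writes in any order.

```python
# Conceptual model of source-enforced producer-consumer ordering with UIO.
import concurrent.futures
import time

def uio_producer(fabric_write, data_addrs, payload, flag_addr):
    """fabric_write(addr, value) models a hypothetical non-posted write that
    returns a future which completes when the target acknowledges it."""
    # Data writes may take different paths / targets and complete in any order.
    pending = [fabric_write(addr, payload) for addr in data_addrs]
    concurrent.futures.wait(pending)        # source waits for all completions...
    fabric_write(flag_addr, 1).result()     # ...before publishing the flag (ordering!)

# Example wiring, with a thread pool standing in for the fabric:
pool = concurrent.futures.ThreadPoolExecutor()
memory = {}
def fake_write(addr, value):
    return pool.submit(lambda: (time.sleep(0.01), memory.__setitem__(addr, value)))

uio_producer(fake_write, data_addrs=[0x0, 0x40], payload="data", flag_addr=0x80)
assert memory[0x80] == 1 and memory[0x0] == "data"  # consumer seeing the flag sees the data
```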
With CXL 2.0, and 1.1 and 1.0 before that, we had the bias flip flow, and I already talked about why we need Back Invalidate for peer-to-peer access. There are other reasons why we also needed it. The existing bias flip mechanism is available for Type 1 and Type 2 devices, not for Type 3, and the problem with it is that the host needs to be tracked fully, since the device could not back-snoop the host. So whatever memory you are hosting, you need to track the entire thing, either through a directory, or else the size of your snoop filter dictates the size of the memory that you can map into the HDM space. With Back Invalidate in CXL 3.0, we can enable snoop filter implementations, resulting in large memory that can be mapped as HDM. Let's look at the example here. The system on the left shows a Type 2 device with a snoop filter implementation that it uses to track which lines are with the host. Looking at the picture on the right: when a memory read request to cacheline X arrives at the device, let's say the snoop filter is full. We are going to hold off on X and figure out where X would go; X would go into the same location where cacheline Y is, so the device issues a Back Invalidate to cacheline Y, because it needs to evict Y from its snoop filter in order to make an entry for X. That Back Invalidate goes on the CXL.mem channel, as you see with Back Invalidate Y. That triggers the snoop flows from the host home agent side into other peer caches, and it all gets resolved; finally, in this example, you may end up getting a memory write to Y and the completion for it. The memory write goes to the device memory, and at that point you can make room for X, get the data for X, and provide that data back to the requester. So this enables implementing a snoop filter and still mapping your entire memory into the system coherency space, the HDM space.
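Here is a minimal, hypothetical Python sketch of that snoop-filter eviction flow (a direct-mapped filter and made-up method names, purely to illustrate the sequence described above): when the set is occupied, the device issues a Back Invalidate for the victim line before tracking the new one.

```python
# Hypothetical direct-mapped snoop filter on a Type 2 device, evicting via BI.
class SnoopFilter:
    def __init__(self, num_sets, send_back_invalidate):
        self.sets = [None] * num_sets          # each set tracks one host-held line
        self.num_sets = num_sets
        self.send_back_invalidate = send_back_invalidate  # BI on the CXL.mem channel

    def track(self, addr):
        """Track cacheline `addr` as given to the host, evicting a victim if needed."""
        index = (addr >> 6) % self.num_sets    # assume 64-byte cachelines
        victim = self.sets[index]
        if victim is not None and victim != addr:
            # Set is full: evict Y with a Back Invalidate and wait for its completion
            # (which may include a memory write of dirty data back to device memory).
            self.send_back_invalidate(victim)
        self.sets[index] = addr                # now there is room to track X

sf = SnoopFilter(num_sets=4, send_back_invalidate=lambda y: print(f"BI for line {y:#x}"))
sf.track(0x100)   # tracked in set 0
sf.track(0x140)   # tracked in set 1, no eviction
sf.track(0x200)   # maps to set 0 again -> Back Invalidate for 0x100 before tracking
```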
This picture shows CXL 3.0 multi-level switching. Earlier it was only a single level of switch; the picture on the right shows a hierarchical arrangement, where you can have one switch and, underneath that, a set of switches, with a lot of devices connecting to a host. So we enable much larger fan-out and much larger system construction with CXL 3.0 and multiple levels of switching. The picture on the left shows more of a cascaded switching arrangement, where you've got different devices connected; it's like a fabric topology, with switches connecting to each other and multiple hosts and multiple devices attached, and you can build your system that way.
The picture here shows multiple devices per root port, which got introduced with CXL 3.0. The picture on the left is the CXL 2.0 picture; notice that there is only one Type 1 or Type 2 device underneath a switch. You can have more memory, but you cannot have more than one Type 1 or Type 2 device, because every link only tracks one caching agent on the other side. With CXL 3.0 we have removed that restriction, and we can have up to, I believe, 16 Type 1 and Type 2 devices that you can track; of course you can have as many Type 3 devices as the fan-out allows. Multiple Type 1 and Type 2 devices underneath a CXL switch is what gets enabled with CXL 3.0.
CXL 3.0 also enables the notion of sharing. We had the notion of pooling with 2.0, and we expanded the use cases for pooling because we now have multiple levels of switches, but what does sharing really mean? Pooling effectively means that any given memory location is assigned to a given host at a given point in time; at a different point in time, that same memory location can be assigned to a different host as you go through the hot-plug flow. With sharing, multiple hosts can share the same memory location in a coherent fashion. How is that possible? Each of the hosts is a different cache coherency entity; they are independent systems. The home agent in H1, for example, is not going to talk to the home agent in H2 to orchestrate cache coherency; these are very independent systems with their own independent system maps. So what happens is that the device, say D4 in the picture, holds the shared memory. Let's say it is sharing the memory across multiple hosts, and five hosts want it shared; fine, you can give them the line shared, that's allowed. Now, if somebody wants it exclusive, you are going to launch the Back Invalidate flow to the other five and wait for them to complete before you can give the line to the host asking for exclusive ownership. That's the Back Invalidate flow we introduced, and because it is carried in CXL.mem, even Type 3 devices that really don't understand much about cache coherency can just issue Back Invalidates to the respective hosts involved and enforce a shared coherent memory space. We have also defined something called global fabric-attached memory (GFAM), which can provide access for up to 4095 entities. So we go from 16 hosts for pooling in CXL 2.0 to 4095 entities that can do not just pooling but also memory sharing, and of course we have enhanced the CXL fabric manager to do all the setup and deployment. Actually, going back to that, let me elaborate a little bit on the Type 3 devices. We have three basic kinds of Type 3 devices defined in CXL. There is the single logical device (SLD), which is assigned only to a single host CPU; this comes from the CXL 1.0/1.1 base, and I get this question a lot, so I thought I'd elaborate a little here. The second is the multi-logical device (MLD), which we introduced in CXL 2.0. Here you can assign the device to multiple hosts for pooling, and the maximum number of hosts that can be supported at a time is 16; with 3.0, MLDs can also do sharing. The third kind, introduced with 3.0, is global fabric-attached memory (GFAM). This is the same as an MLD in its basic capabilities, but it supports a large scale in terms of the number of hosts that can actively use the device, up to 4095. That scaling relies on the device directly participating in what we call the port-based routing (PBR) protocol extension, which comes with 3.0.
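A minimal, hypothetical Python sketch of that sharing rule follows (illustrative names only, not spec flows): the device hands out the same line shared to many hosts, and before granting exclusive ownership it Back Invalidates every current sharer.

```python
# Hypothetical model of a shared GFAM/MLD line: many sharers, or one exclusive owner.
class SharedLine:
    def __init__(self, send_back_invalidate):
        self.sharers = set()         # hosts holding the line shared
        self.exclusive_owner = None  # at most one host may hold it exclusive
        self.send_back_invalidate = send_back_invalidate

    def grant_shared(self, host):
        if self.exclusive_owner is not None:
            self.send_back_invalidate(self.exclusive_owner)  # demote the owner first
            self.exclusive_owner = None
        self.sharers.add(host)

    def grant_exclusive(self, host):
        # Back Invalidate every other sharer (and any owner) before handing over.
        for other in list(self.sharers - {host}):
            self.send_back_invalidate(other)
            self.sharers.discard(other)
        if self.exclusive_owner not in (None, host):
            self.send_back_invalidate(self.exclusive_owner)
        self.sharers.discard(host)
        self.exclusive_owner = host

line = SharedLine(send_back_invalidate=lambda h: print(f"BI -> {h}"))
for h in ["H1", "H2", "H3", "H4", "H5"]:
    line.grant_shared(h)             # five hosts share the line coherently
line.grant_exclusive("H6")           # BIs go to H1..H5 before H6 gets ownership
```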
So, Type 3 devices have three different flavors: SLD, MLD, and GFAM, and all three can be used within a switch hierarchy at any level. You can build a CXL switch with traditional hierarchy-based routing (HBR), as originally defined in CXL 2.0, or with the new port-based routing (PBR) extensions that we have defined in 3.0. GFAM devices rely on PBR extensions being supported by the switch, whereas SLDs and MLDs can be connected to either switch type. I hope that clarifies it.
Now, looking into multiple levels of switching with CXL 3.0: with 2.0, we only had a single level of switching and only one Type 1 or Type 2 device. With 3.0, we have up to 16 CXL.cache devices, so 16 Type 1 or Type 2 devices. And, of course, you can have any number of Type 3 devices that the fan-out is going to support.
We talked about the shared memory mechanism a few slides back; this goes through it in some more detail. Device memory can be shared across all hosts, and the usage model is to increase data flow efficiency and improve memory utilization. Earlier there was no coherent memory mechanism for independent hosts, including the devices within those hosts, to do memory-based message passing or semaphores between them; with shared memory, those are now possible. You can imagine building large HPC systems where, of course, you have your pooled memory, but you also have some shared memory where different entities doing different compute can use the shared region either to pass messages between each other or to hold a common data structure that they are all working off of. So we enable a lot of those usages with very large-scale systems, up to 4095 entities under the switch hierarchy that we have.
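As a toy illustration of that message-passing usage (purely hypothetical; real systems would use architected atomics and cache management on the shared HDM region), here is a short Python sketch of a one-producer, one-consumer mailbox living in a memory region shared coherently between two hosts.

```python
# Toy mailbox in a coherently shared memory region between two hosts.
# Real code would use proper atomics/flushes; this only shows the pattern.
class SharedRegion:
    def __init__(self):
        self.flag = 0          # "message ready" flag in shared, coherent memory
        self.payload = None    # message body in the same shared region

def host_a_send(region: SharedRegion, msg: str) -> None:
    region.payload = msg       # write the data first...
    region.flag = 1            # ...then publish (producer-consumer ordering)

def host_b_poll(region: SharedRegion):
    if region.flag == 1:       # coherency guarantees B sees A's latest writes
        region.flag = 0
        return region.payload
    return None

shared = SharedRegion()
host_a_send(shared, "work item 42")
print(host_b_poll(shared))     # -> "work item 42"
```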
This slide speaks to the essence of the capabilities brought to bear with CXL 3.0. Just to recap, we have broken the limitations of tree-hierarchy topologies, which enables high bisection bandwidth with parallel paths. As you can see in the picture, this is a true composable system, a true fabric; there is nothing like a tree topology here. And of course this is not the only way you can construct it, but you can imagine building systems with spine-level switches and leaf-level switches. The picture on the right basically shows a rack. At the bottom of it you have multiple host CPUs, but fundamentally each CPU is like an independent server; that's what it represents, with its own memory, I/O, and so on. You've got some amount of memory, a bunch of accelerators, a GFAM memory device, and NICs; these are all end devices connected through leaf switches and spine switches. At 64 gigatransfers per second per lane, with cables, the rack is definitely within reach, and depending on the distance and how many retimers you use, or whether you go optical, you can even do pod-level connectivity; CXL 3.0 allows for that. We have also enabled computational storage with CXL 3.0. The memory that you see, for example, can do some local computation, because it participates, indirectly, in the cache coherency by doing Back Invalidates; you can ask the memory entity to do some local processing and it will still be coherent with the rest of the system. We enable direct peer-to-peer, which gives better performance, and we talked about shared coherent memory, which enables communication across hosts and devices using load-store semantics. So, fundamentally, what you have is a set of compute nodes, a set of memory nodes, a set of accelerator nodes, other types of I/O nodes, and you can create fungible systems. Not only can you compose systems, growing memory capacity by going to the pooled memory or asking for additional accelerators, but these systems can now also work collaboratively through things like GFAM or message passing using CXL.io. We have all of these capabilities built in. Fundamentally, it's a tremendous breakthrough that we have been able to achieve with 3.0. We started with 2.0, and with 3.0, effectively, the load-store interconnect is moving from the node level, which was a single domain, a single server, to the rack level and beyond.
So, in conclusion, CXL 3.0 offers full fabric capabilities along with fabric management. We have expanded the switching topologies, offered enhanced coherency capabilities, and are able to do peer-to-peer resource sharing. We have doubled the bandwidth while keeping the latency flat compared to CXL 2.0, which is very important for us because we are doing coherency and memory semantics. And all of this while being fully backward compatible with CXL 2.0, 1.1, and 1.0, the prior generations. This backward compatibility enables us to really innovate without creating a lot of angst amongst people, because their investments are protected; they can make the transition whenever they want to, and it's going to just work. We have enabled a lot of new usage models with memory sharing between hosts and peer devices. We now support multi-headed devices because of the fabric capability; in other words, you can have a Type 3 device with multiple links talking to different CPUs directly or different switches directly. You've got the enhanced coherency capabilities with Back Invalidate, and we've got the expanded support for Type 1 and Type 2 devices. And with GFAM, we provide expansion capabilities for current and future memory. You can download the 3.0 specification; it's available. And, as Danny said, if you have not joined the consortium, please do join the CXL journey. In my mind, it has just started. We have plenty of new, innovative usages that we are working on to evolve this technology further in a fully backward-compatible manner. CXL is already changing the compute landscape, and it's going to continue to change it very, very profoundly in the coming decades.
So with that we'll go for Q&A.
Yeah, thank you, Debendra and Danny. We will now begin the Q&A portion of the webinar, so please share your questions in the question box. The first question is about the host cache that a GFAM device accesses for Back Invalidation: does it consist of cache memory lines of other GFAM devices too, or do they all have separate host caches for different GFAM devices? How about different media partitions in a GFAM device; do they have a separate cache for each media partition?
Okay, I'm assuming the question is from the host's perspective. Any caching agent accessing a memory location is doing that on a per-cacheline basis, so yes, the accesses will still be on a per-cacheline basis. As far as the GFAM device is concerned, it's going to provide that access just like any memory device. And if it is supporting shared coherency across different hosts, it needs to enforce that using the hardware cache coherency mechanisms. Danny, did you want to add anything to that?
No I think that covers it all.
Thank you. With the change in flit size, will CXL 3.0 remain compatible with CXL 2.0 devices? Can hosts or switches support a mix of CXL 2.0 and 3.0 devices connected to the same switch or fabric?
If you noticed in the table that we saw, CXL 3.0 needs to support the 68-byte flit size as well as the 256-byte flit size, so yes, that's how it will interoperate. The 68-byte flit size at 32 GT/s is the lowest common denominator, so yes, switches and hosts can mix and match CXL 2.0 and CXL 3.0 devices, and it will just work fine.
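As a rough illustration of that "lowest common denominator" idea (a hypothetical sketch, not the actual link-training algorithm), each side advertises what it supports and the link settles on the highest mode both understand:

```python
# Hypothetical negotiation sketch: pick the best flit mode / rate both ends support.
# Ordered from most basic to most capable, per the feature table discussed earlier.
MODES = [
    ("68B flit, 32 GT/s", 1),    # CXL 1.x / 2.0 baseline
    ("256B flit, 32 GT/s", 2),   # CXL 3.0 flit at the lower rate
    ("256B flit, 64 GT/s", 3),   # CXL 3.0 at the full rate
]

def negotiate(host_caps: set[str], device_caps: set[str]) -> str:
    common = [(name, rank) for name, rank in MODES
              if name in host_caps and name in device_caps]
    return max(common, key=lambda m: m[1])[0]   # best mutually supported mode

cxl3_host = {"68B flit, 32 GT/s", "256B flit, 32 GT/s", "256B flit, 64 GT/s"}
cxl2_device = {"68B flit, 32 GT/s"}
print(negotiate(cxl3_host, cxl2_device))  # -> "68B flit, 32 GT/s" (lowest common denominator)
```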
Yeah, maybe I'll just add a small tidbit there. You know, one of the primary goals in the CXL specification is to maintain that backward compatibility. So, you know, we've taken cautious steps to make sure that compatibility is maintained as we proceed through the newer versions of the specification.
So, the Back Invalidate flow expects devices or switches to keep track of the status of all cache lines; otherwise, it will trigger BI on all peer accesses. Keeping track of all these cache lines will make switching devices more complex and will break the premise of keeping devices simple. Is that a correct assumption?
Let me answer the question first, and some of it we'll get to later. With all optional capabilities, anytime you want a new feature there is an extra amount of hardware that you need to build. That's expected. Now, the question is: is that hard, does it cause a lot of complexity? Let's look into it. If you're, for example, doing Type 3 memory, and that's pretty much where you would expect this, or even Type 2, anytime you have memory mapped into the HDM space, fundamentally you are supporting the meta bits, which are basically the directory. So those exist. What you are really doing is looking into that and trying to figure out, "Do I need to issue a Back Invalidate?" I don't think that's a huge lift. We are not participating in any other cache coherency action; all the orchestration still gets done by the host processor. So, yes, I believe it is still really simple. And, if I may: no, that assumption is not a correct assumption; it is still simple.
Could you repeat the latency limits you mentioned?
Those latency numbers are guidance in the specification. The guidance we have given is: snoop to response, pin to pin, is 50 nanoseconds; and if you're accessing memory like DRAM or HBM on the device side, it should be 80 nanoseconds pin to pin. It's guidance, not a limitation or a mandate.
Could you please explain the relation between UIO and BI again?
Sure. What UIO does is move the producer-consumer ordering enforcement to the source. By that, what I mean is: let's say I'm a NIC device doing a bunch of writes and reads to memory. Today, I just issue those writes and reads; I expect myself to follow the producer-consumer ordering, and I expect the switches to do the same. As a result, writes are posted: I can just send the writes and forget about them. With Unordered I/O, which is again an optional feature, what you can do is say, "Look, the writes are not posted. I'm going to wait for the writes to get a response, and because I'm waiting for those responses, I, as the source, meaning the NIC that is generating this, am going to make sure that producer-consumer ordering is enforced. The rest of you can do these reads and writes in any order in UIO." Why is that important, and how does that work with BI, Back Invalidate? It becomes important because imagine you are sending reads and writes as a device, like a NIC, to memory that can be in multiple places. You don't have to worry about, "Did this write get to the other memory device or not?" because you just told them they can do the writes in any order; you are enforcing the ordering yourself. So they get those writes, and each of those memory controllers or Type 3 memory devices, of which there might be 10, 15, or however many in a system, might get a different write from the same device and just perform it locally. Now, if there is a coherency conflict, say you are writing to a device and the host processor has that line, then you need the Back Invalidate. But most of the time, when I/O is accessing the memory, we really do not expect a coherency conflict. It can happen, but it's not a high-probability event. And even if there were a conflict, if you count the number of hops going back and forth to the processor, you will still come out ahead with this Back Invalidate approach.
Is it possible to connect multiple racks with CXL 2.0/3.0?
From a protocol point of view, you can connect multiple racks. I think the real question is the physical reach. Electrical cables can only go a certain distance, so as long as that can be managed, yes, you can reach them using electrical methods. The other option is an electrical-optical-electrical conversion, and then the distance problem gets solved. From a CXL protocol point of view, we wanted to solve the large-scale problem at the protocol level, and of course there is a reach aspect as well. As I said, within a rack the reach is easy with cables; across racks it's a challenge with electrical cables, but that's where you can deploy optical. Danny, did you want to add something there?
Yeah, maybe I'll add that with port-based routing, fabrics can reach over 4,000 devices. To implement something that large, fundamentally you would need to go beyond a single rack. As Debendra mentioned, there are obviously some physical limitations with respect to SI and the physical link, but those can be overcome with different mechanisms, such as optical, or even retimers; you can add those for longer reach.
So, in a GFAM scenario, how do we meet the latency and bandwidth? Are any retimers planned? If so, how many are allowed?
Okay, let me do the easier one first: are any retimers planned? Yes, we support retimers, and that's independent of GFAM. CXL has been supporting retimers; in fact, even when we were doing the 128b/130b encoding, we had things like sync header off for doing low latency in the retimer. Those are supported currently, and you can pretty much use a PCIe retimer and it will still work; if it supports the CXL optimizations, you will get additional lower latency. Up to two retimers per link are allowed. Now, in the GFAM scenario, how do we meet the latency and bandwidth requirements? The bandwidth is easy; it's mostly about how many conflict cases you are dealing with. The question of latency becomes interesting because, of course, you are going to go through multiple levels of checks, especially for the shared memory, so you are going to increase the latency a bit. But the most important factor will be whether you run into conflicts; that is where the latency can start going up, because you're trying to resolve coherency conflicts there. So, that's a little bit of a nuanced answer. Danny, did you want to add something?
The latency targets that Debendra mentioned earlier in the call are just that: targets. As we expand to larger and larger fabrics, it will be harder to hit some of those targets, but obviously you get the benefit of the expansion to more devices, so it's not super critical that those targets get hit. When you talk about, say, a fully deployed 4,000-device fabric, there's the potential of tiering things, such as tiering memory. So you may have longer latency to certain memory devices, for example, but that could be acceptable based on your workload and the overall TCO of the system. There are a lot of trade-offs when you start talking about latency and fabrics in particular, and they are going to be system-dependent.
Okay so will traditional PCIe devices attached to a CXL 3.0 switch port be able to use the UIO peer-to-peer flows with other PCIe and/or CXL devices attached to the switch?
The UIO part we are developing with PCIe 6.0, so that PCIe devices can also take advantage of it. When it is to other PCIe devices, remember that that memory is non-coherent memory. We can still do UIO to the non-coherent memory, but the coherent memory is where you are going to get a lot of benefit from a CXL perspective. So you should be able to take advantage of it, even for PCIe, assuming the PCIe devices have implemented the UIO semantics.
Okay, well, thank you, Debendra and Danny, for sharing your expertise. The presentation slides will be available on the CXL Consortium's website, and we will address all the questions we received today in a future blog post, so please follow the CXL Consortium on Twitter and LinkedIn for updates. Danny, back to you.
Thanks, Elsa, and thanks everybody for attending and submitting your questions. The presentation recording is going to be available on the CXL Consortium YouTube channel, and the slides will be available on the consortium website. We hope you'll reach out with any questions you may have after reviewing the evaluation copy of the 3.0 specification, which can also be found on our website at computeexpresslink.org. Thanks again. Have a wonderful remainder of your week. Thanks for watching.