Good morning and good afternoon. Thank you for attending the CXL Consortium's "Introducing the CXL 3.1 Specification" webinar. Today's webinar will be led by CXL Consortium Technical Task Force Co-Chair Mahesh Wagh and Protocol Working Group Co-Chair Rob Blankenship. I will now hand it off to the presenters to begin the webinar.
The CXL ecosystem started with direct-connect use cases and has really grown over the last four years. The CXL specification sits at its center, but it addresses a lot of needs. It addresses CPU and GPU interconnects. It meets the needs of accelerators and FPGAs, where programmable compute and accelerators can participate in providing a solution. It opens up a lot of opportunities for CXL IP: as the specification develops, there is room for IP vendors to provide building blocks that can be used in CPUs, GPUs, accelerators, or devices, which is a significant business opportunity. It meets the needs of memory expanders. What CXL fundamentally changes is that it completes the memory hierarchy the industry has been working on for many years, which was missing a key piece: an interconnect that allows memory expansion. That has enabled the memory expanders, quite a few of which you can see in the ecosystem right now. From a scaling perspective, it creates business opportunities for switches, because, as we'll cover, CXL increases fan-out and enables new and interesting use cases built around switching. And as with any interconnect, you need analyzers and traffic generators so that validation and debug infrastructure can be set up, and CXL enables that as well. Riding on top of all of this are the software solutions, whether accelerator solutions or new and interesting memory use cases that run over CXL. So overall, the specification is meeting the needs of all of these components and their interconnection with each other. From a specification perspective, there are a couple of things we make sure we address. We work very hard to protect your investment in the protocol and your products so that you can recoup that investment, so fully backwards compatible solutions are a key requirement for all the development we do. We also keep overall system cost in mind: you can do a lot of innovation, but we continue to make sure solutions balance cost and performance. And along with that goes comprehensive support for compliance and testing. Writing a specification is one thing, but for the technology to succeed you need comprehensive compliance and testing support alongside it, so that is already planned and provided, and the specification has sections on how to design for compliance and testing as well. So from an overall perspective, it covers all the aspects needed for a very strong ecosystem.
We'll cover the scope of CXL: where it started, as an overview, and where it is scaling. It started with CXL 1.1, focusing on the single node and the interconnect to the CPU, enabling use cases for accelerator direct attach or memory expansion, and we'll cover those in more detail. It introduced the notion of Type 1, Type 2, and Type 3 profiles for devices and their use cases. That's how it started with CXL 1.1. In that time frame, the CXL link itself was not visible to legacy PCI bus drivers, although there were ways to discover those devices. Having said that, it enabled the use case of directly attaching accelerators and memory expanders to a processor interconnect.
CXL 2.0 built on that success: having defined the ability to attach accelerators and memory expansion devices, the need was then to provide scaling for those devices. So CXL 2.0 brought in a couple of new features. One is support for hot-plug capability, which CXL 1.1 did not have, so you can add and remove devices. We extended the support so that the link itself is discoverable by a PCI bus driver, which lets you take advantage of all of the development that was done in PCIe for hot plug, for example. On the memory and accelerator side, it added CXL 2.0 switching capability, which at a first level provides fan-out: you can fan out from one link to many CXL devices, as an example. It also introduced the concept of memory pooling, where you take a particular memory resource, divide it up into distinct segments, and assign those segments to a given host image. You can allocate an entire device through a CXL 2.0 switch to one host image, or you can pool a device between multiple host images. That is supported through CXL 2.0 switch capabilities, and an example is shown here: the colors on the hosts H1 through H4 denote host images, D1, D2, and so on at the bottom are devices, and a color belongs to a given host image, showing the mapping of resources to that host image. One additional thing CXL 2.0 did was define in the spec the notion of multi-headed CXL 2.0 switches. This has always been possible in PCI Express, for example, but CXL went in and defined the switching architecture that supports multi-headed CXL switches as well as multi-headed CXL devices. As you can see, device D2 can be connected to and pool its resources across host 1 and host 3, just as an example. So that's what CXL 2.0 enabled at a high level. One additional thing from a security perspective: CXL 2.0 brought in support for link encryption, which you can enable across direct-attach links or across CXL 2.0 switches.
When we get to CXL 3.0 and 3.1, it is really about growing the scale of what we can do with switching. It introduced the concept of multi-level switching, as well as the notion of a composable fabric, in which many devices and host nodes are interconnected with each other, providing a composable fabric that allows both disaggregation and pooling of accelerator and memory resources. One additional thing we will cover is that CXL 3.0 and 3.1 bring in the concept of sharing of resources, meaning a given memory resource can now be directly shared, accessed, and mapped by multiple host images. You could always support that with software coherency mechanisms, but if you want the coherency handled in hardware, a new protocol channel was needed, so one was introduced in CXL 3.0 to enable the back-invalidate flows. We'll cover some of those details today. At a high level, this shows the scope of CXL as it has expanded: node-level properties in CXL 1.1, developed further through fan-out switches and pooling in CXL 2.0, and extended in CXL 3.0 and 3.1. As we do this development, we continue to keep CXL backwards compatible, so the investments people have made in CXL 1.1 continue to be useful: you can take a CXL 1.1 device and still play in the CXL 3.0 ecosystem.
With that, we'll do a quick recap of the representative CXL use cases. It starts with what's called CXL Type 1: caching devices or accelerators. For those familiar with CXL, it runs on the PCI Express physical layer but brings in three protocols that can be interleaved at a much finer granularity: CXL.io, CXL.cache, and CXL.mem. A Type 1 device makes use of only the CXL.io and CXL.cache protocols. The use case is accelerators that need only CXL.cache, a good example being a PGAS NIC or atomics, where you can bring a cache line into the NIC and then apply atomics or do source-based ordering if that is what you intend to do with the Type 1 device. Type 2 devices have memory associated with an accelerator that is mapped into the system memory address space. To enable that, you need all three protocols: CXL.io, CXL.cache, and CXL.mem. In all of these examples, the CXL.io protocol is required for device discovery, configuration, and enabling the device; you then use CXL.cache and CXL.mem depending on your use case. A Type 2 device uses both CXL.cache and CXL.mem, which allows the processor to cache device-attached memory and allows the accelerator to cache both processor-attached and device-attached memory. Good examples are GPGPU use cases and accelerators with a high touch rate on memory. Finally, on the right side are memory buffers, which provide memory expansion capability with CXL. In that case you use CXL.io for discovery and CXL.mem for accessing the memory. You have a memory controller, so you can use it for memory bandwidth expansion, expanding beyond your processor-attached memory by going over the CXL link, or for memory capacity expansion. In addition, you can also think about storage-class memory being supported. The way to think about these memory buffers is that they give you access to a CXL.mem controller and abstract out the media connected behind it, whether DDR4, DDR5, or storage-class memory, which creates opportunities for the memory controller ecosystem to innovate with CXL Type 3. On top of CXL Type 3, you can then start to make use of the extensions I talked about that support memory pooling or memory sharing; as you get more advanced, you can extend those capabilities to those use cases.
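For readers who find it easier to see the device-type taxonomy in code, here is a minimal C sketch of the protocol mix per device type described above. The enum values and function names are purely illustrative and are not defined by the CXL specification.

```c
#include <stdio.h>

/* Illustrative only: the three CXL protocols multiplexed on one link. */
enum cxl_protocol { CXL_IO = 1 << 0, CXL_CACHE = 1 << 1, CXL_MEM = 1 << 2 };

/* Protocol mix per device type, as described in the talk. */
static unsigned protocols_for_type(int type)
{
    switch (type) {
    case 1: return CXL_IO | CXL_CACHE;            /* caching accelerator, e.g. PGAS NIC   */
    case 2: return CXL_IO | CXL_CACHE | CXL_MEM;  /* accelerator with host-mapped memory  */
    case 3: return CXL_IO | CXL_MEM;              /* memory buffer / expander             */
    default: return 0;
    }
}

int main(void)
{
    for (int t = 1; t <= 3; t++) {
        unsigned p = protocols_for_type(t);
        printf("Type %d uses: %s%s%s\n", t,
               (p & CXL_IO)    ? "CXL.io "    : "",
               (p & CXL_CACHE) ? "CXL.cache " : "",
               (p & CXL_MEM)   ? "CXL.mem"    : "");
    }
    return 0;
}
```

In every case CXL.io is present, since it carries discovery and configuration; the cache and mem protocols are added or dropped according to the device's role.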
A quick look at the timeline. I talked about the expansion, but from a specification perspective, the 1.0 spec was released in March 2019. Then in September there was a revision, the consortium was officially incorporated, and a CXL 1.1 specification was released based on initial feedback. In November 2020 we released the 2.0 specification, addressing some of the things I talked about. August 2022 was the CXL 3.0 specification, and then last year, in November 2023, we released the CXL 3.1 specification. Today we'll cover aspects of the CXL 3.1 spec.
From a feature-summary perspective, if you're looking at how the progression works and what is covered in each spec: with 1.1, the data rate is 32 gigatransfers per second, it supports what we now call flit mode with a 68-byte flit, and Type 1, Type 2, and Type 3 devices are supported. CXL 2.0 added more, as I touched on. It stayed at 32 gigatransfers per second but added the ability to do memory pooling with multi-logical devices, support for persistent memory with global persistent flush, security, and switching. With CXL 3.0, a couple of things change. We moved the data rate to 64 gigatransfers per second, so you get a doubling of bandwidth, and along with that we introduced a new flit type, the 256-byte flit. This was done carefully so that moving to 256-byte flits does not take a hit on latency; there were several enhancements to make sure we kept the focus on performance. From a use-case perspective, 3.0 broadened the mechanisms to address multi-level switching, direct memory access for peer to peer, enhanced coherency support, and the ability to support multiple Type 1 and Type 2 devices per root port; CXL 2.0 had limitations on how many Type 1 and Type 2 devices you could support, and that was extended in CXL 3.0. Fabric capabilities were also defined, and they were further enhanced in CXL 3.1, which Rob will cover in the next few slides.
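To put the data-rate doubling in perspective, here is a minimal sketch of the raw-bandwidth arithmetic, assuming a x16 link and ignoring flit, encoding, and protocol overhead (so these are ceiling numbers, not delivered throughput).

```c
#include <stdio.h>

/* Raw unidirectional link bandwidth, ignoring flit and protocol overhead. */
static double raw_gbytes_per_s(double gt_per_s, int lanes)
{
    /* Each transfer carries 1 bit per lane; divide by 8 for bytes. */
    return gt_per_s * lanes / 8.0;
}

int main(void)
{
    printf("CXL 1.1/2.0 x16 @ 32 GT/s: %.0f GB/s per direction\n",
           raw_gbytes_per_s(32.0, 16));   /* 64 GB/s  */
    printf("CXL 3.x     x16 @ 64 GT/s: %.0f GB/s per direction\n",
           raw_gbytes_per_s(64.0, 16));   /* 128 GB/s */
    return 0;
}
```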
We'll quickly recap the CXL 3.0 view of the difference between pooling and sharing. I've talked about pooling before: a good example is device D2, which is shown in two different colors. Those colors denote the host images where the device's resources are mapped. Looking at the host images up there, D2's resources are partitioned: its orange resources are allocated to host 4 and its magenta resources are allocated to host 3, as an example. The specification supports doing this as a static allocation, or dynamically, meaning you can change those allocations at runtime. All of the support for dynamic allocation of memory is in the CXL 3.0 specification, and there is a lot of development happening in the ecosystem to enable dynamic pooling of those memory resources. Now, with sharing: whereas pooling takes your resources, partitions them, and allocates a given partition to one host, sharing lets you map the same device memory into multiple address spaces. As shown here, the S denotes sharing: on device 1, S2 is the region now shared between host 1 and host 2, so both of them have that memory region mapped in their address space and can access it simultaneously. Coherency for that region, depending on the scheme you choose, can be done through software-based coherency or through hardware-based coherency, and the specification provides all the mechanisms you need to enable that hardware-based coherence. A key thing in all of these memory sharing and pooling use cases is that you need a fabric manager to set up, deploy, and manage this memory. There are specific APIs defined that allow a fabric manager to understand and allocate those resources across the multiple host images so you can enable a memory sharing or pooling use case; those APIs and the commands you need for the fabric manager are also in the specification.
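As a hypothetical sketch of that management-plane flow, the C program below contrasts pooling (partition a device, bind each partition to exactly one host) with sharing (map one region into several hosts' address spaces). None of the function names here are the spec-defined Fabric Manager API commands; they are stand-ins used only to illustrate the flow.

```c
#include <stdio.h>

typedef int host_id_t;
typedef int region_id_t;

static region_id_t next_region = 0;

/* Stub stand-ins for fabric-manager commands (assumed, for illustration). */
static region_id_t fm_partition_device(const char *dev, unsigned gib)
{
    printf("partition %u GiB on %s -> region %d\n", gib, dev, next_region);
    return next_region++;
}

static void fm_bind_region(region_id_t r, host_id_t h)          /* pooling */
{
    printf("bind region %d to host H%d\n", r, h);
}

static void fm_map_shared_region(region_id_t r, const host_id_t *hs, int n) /* sharing */
{
    printf("share region %d with", r);
    for (int i = 0; i < n; i++)
        printf(" H%d", hs[i]);
    printf("\n");
}

int main(void)
{
    /* Pooling: each partition of D2 belongs to a single host image. */
    fm_bind_region(fm_partition_device("D2", 256), 3);
    fm_bind_region(fm_partition_device("D2", 256), 4);

    /* Sharing: region S2 of D1 is mapped by hosts H1 and H2 at the same time;
     * coherency is then software-managed or uses the back-invalidate flows. */
    host_id_t sharers[] = { 1, 2 };
    fm_map_shared_region(fm_partition_device("D1", 128), sharers, 2);
    return 0;
}
```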
As we go into CXL 3.1, the feature enhancements were basically addressing new usage models. One was the ability to scale out CXL fabric resources, which calls for CXL fabric improvements: at the scale we were looking at, you need new ways of routing requests, or flits, across the fabric. So the notion of CXL fabrics with port-based routing was defined, which supports the scale-out needs of the ecosystem. The additional thing we looked into is the ability to support CXL-attached memory in confidential-computing environments, which requires supporting TSP, the Trusted Execution Environment (TEE) Security Protocol. We had memory link encryption as a baseline, but you need more than that to allow confidential compute, so TSP was added to the CXL 3.1 specification. There were also some memory expander improvements that increase the metadata as well as the RAS capabilities in the CXL 3.1 specification, and Rob will go through and cover those.
So with that, I'll hand over to Rob to go through the next level of detail on what is new in CXL 3.1.
Thanks, Mahesh. On the fabric improvements: CXL 3.0 defined the basic PBR fabric, and we've enhanced that by defining more details on fabric decode and routing rules. We added a feature we call Global Integrated Memory, which allows for host-to-host communication. We added the ability to do direct peer-to-peer CXL.mem; the precursor to that is PBR switches. This effectively makes the CXL.mem protocol symmetric from an accelerator device's point of view: the accelerator can be both a destination of CXL.mem and an initiator targeting a peer device. 3.0 defined the ability to do unordered I/O to CXL memory, but this enables using the CXL.mem protocol natively, which allows the accelerator to directly cache that memory without having to go through the host. And lastly, we added the full definition of the Fabric Manager API for PBR switches. 3.0 defined the PBR switch construct but left the API open, so 3.1 closes that hole and defines the API for the fabric manager.
Okay. Talking about the details of the fabric, let's start with the picture on the left side of the slide, which shows HBR switches. HBR switches were part of the CXL 2.0 definition. These switches are built on the idea of a tree-based hierarchy, and every link has an upstream and a downstream direction, so from the hierarchy's point of view things flow top to bottom and bottom to top. There is some peer-to-peer traffic within that, but the CXL.cache and CXL.mem protocols specifically are directional in a hierarchy-based switch topology. In the picture on the right we have PBR switches, and you'll see that each switch in this picture has both hosts and devices. There is still a concept of a logical tree-based topology, but the links between the switches can carry traffic in both directions, meaning a host on one switch can talk to a device on the other switch while a host on that other switch talks to a device on the first one. So the traffic on those links between the switches really flows in both directions; there is no longer a directional nature to the protocols on those links. We break down the basic constraint we had with hierarchy-based switches in the past, and this enables a much more flexible topology.
Now, talking about Global Integrated Memory: the idea is that each host in this picture can expose a memory region onto the fabric, which allows other hosts or requesters to use the unordered I/O protocol to directly access that host's memory. It's basically a window into that host, and it enables host-to-host communication models. It's built on top of the basic fabric address mechanism that PBR introduced. It is similar to, though distinct from, fabric-attached memory. You can see the device at the bottom, the Type 3 memory device; think of that as a fabric-attached memory device that's exposed to many hosts. There is a decode that happens as messages enter the fabric, decoding the fabric address space, and it can map to either that fabric-attached memory device or the Global Integrated Memory in a peer host. The one important difference is that the Global Integrated Memory in a peer host is accessed only through the unordered I/O protocol, not through CXL.mem, but the decode mechanisms are similar.
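Here is a hypothetical decode sketch of that idea in C. It is not the spec's actual decoder; the addresses, range sizes, and port IDs are invented purely to show how a fabric address entering the fabric might select either a fabric-attached (GFAM) device reachable with CXL.mem, or a peer host's GIM window reachable only with unordered I/O.

```c
#include <stdint.h>
#include <stdio.h>

enum target_kind { TGT_GFAM_MEM, TGT_PEER_HOST_GIM, TGT_NONE };

struct fabric_range {
    uint64_t base, size;
    enum target_kind kind;
    int dest_port;              /* PBR destination port ID (illustrative) */
};

/* Example fabric address map: one GFAM device and one peer-host GIM window. */
static const struct fabric_range fabric_map[] = {
    { 0x100000000000ull, 1ull << 40, TGT_GFAM_MEM,      7 },  /* Type 3 GFAM  */
    { 0x200000000000ull, 1ull << 38, TGT_PEER_HOST_GIM, 2 },  /* host H2 GIM  */
};

static enum target_kind decode(uint64_t addr, int *port)
{
    for (unsigned i = 0; i < sizeof fabric_map / sizeof fabric_map[0]; i++) {
        if (addr >= fabric_map[i].base &&
            addr <  fabric_map[i].base + fabric_map[i].size) {
            *port = fabric_map[i].dest_port;
            return fabric_map[i].kind;
        }
    }
    return TGT_NONE;
}

int main(void)
{
    int port;
    if (decode(0x200000001000ull, &port) == TGT_PEER_HOST_GIM)
        printf("route via UIO to peer-host GIM on port %d\n", port);
    if (decode(0x100000002000ull, &port) == TGT_GFAM_MEM)
        printf("route via CXL.mem to GFAM device on port %d\n", port);
    return 0;
}
```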
Next slide. Talking about the peer-to-peer CXL.mem feature enabled in 3.1: this enables an accelerator, shown in the diagram as the purple device. It can again be the target of a CXL.mem access from the host, and it can also initiate peer-to-peer CXL.mem accesses to the peer Type 3 device. This utilizes port-based routing features, so it relies on a PBR switch. The Type 3 memory device it targets can either be dedicated to the accelerator, so that the memory exposed by that device is accessible only from the accelerator, or, if it is a multi-logical device, that memory can be shared between the accelerator and a host. And as Mahesh was saying, that shared memory can also be shared across multiple hosts.
So, Rob, there's one question on this one as you were talking about. It's from Chandraprakash. Question is, what do you mean by CXL type 1 device supporting CXL.mem protocol peer to peer? Will CXL type 1 device in this case have CXL memory inside it? I think what you were saying is it's a type 1 accelerator that is accessing the type 3 memory.
Yeah. And in the use models you were showing, Mahesh, it did show the Type 1 device being able to initiate CXL.mem to a peer device. So in that context, the device is a Type 1 device, meaning it can initiate CXL.cache requests to the host and it may not expose memory to the host, but it can still initiate peer-to-peer CXL.mem to the peer Type 3 device.
Right, it's accessing memory on the peer device that's mapped to CXL.mem.
All right. Mahesh went through some of this: the fabric manager is the entity that configures the PBR switches and even the CXL 2.0 switches. It is a necessary entity for configuring these resources, which may be assigned to different hosts or shared between multiple hosts, and also for general configuration of the PBR switch, because the links between the switches aren't exposed to individual hosts. The host really sees the PBR fabric as a single level of switching, even though there are intermediate links, and all of that is controlled through the fabric manager. 3.1 really defines that fabric manager API for PBR switches.
Switching now to the TSP definition. We have two acronyms here: the trusted execution environment is TEE, and the TEE Security Protocol is TSP.
Just to recap, in 2.0 we added the link IDE feature, which Mahesh also discussed. This provides both encryption and integrity protection against hardware adversaries to protect the links: between host and device, host and switch, and switch and device. TSP builds on top of this feature, extending further into the protocol level.
Specifically, we're enabling virtualized environments to support trusted virtual machines, where a host may be running trusted virtual machines alongside regular virtual machines. The TSP protocol is intended to provide memory isolation. On the left side, we have a host with its own directly attached memory, with regions of trusted memory for particular trusted VMs as well as regular host memory; the two colors indicate that. We're extending that into CXL, so on the right we have a CXL-capable device that has both the blue trusted memory and untrusted memory for regular VMs. And again, we build on top of the link IDE feature to provide security across the link, protecting against hardware adversaries there. We made sure the features added meet the needs of that virtualized environment: different hosts have different particular requirements, but we abstracted the TEE security protocol so that a device can meet the needs of different hosts running their trusted VMs and conform to best practices in security, so we can meet the needs of customers running these trusted VMs.
Talking about specific elements of TSP: we have trusted execution state and access control, configuration of those different elements, attestation and authentication, and locking. We have memory-at-rest encryption, covering how data stored in the device gets encrypted, as enabled through the spec, as well as transport security for data in flight, which is primarily the IDE part. Currently, the TSP definition supports HDM-H, the directly connected memory expander type, but we do intend to extend it in the future for coherent memory, the HDM-DB memory type, as well as for switches. How access to memory is controlled is the first part. Configuration is basically how we determine the security features in the device, enable those features, and lock that configuration to ensure it doesn't change while the trusted application is running. Attestation and authentication is about trusting who you're talking to on the device side. The data-at-rest part allows either the device or the host to do the data-at-rest encryption, so this is configurable depending on the needs of the host and of the use model in the device. And then, again, the transport security is primarily that link IDE.
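The ordering of those elements matters: configuration must be locked before trusted workloads run. Below is a hypothetical C sketch of that ordering as a small state machine. The state names and functions are invented for illustration and are not TSP messages from the specification.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative ordering: attest/authenticate the device, configure its
 * security features (including who does data-at-rest encryption), lock the
 * configuration, and only then admit trusted-VM traffic. */

enum tsp_state { TSP_IDLE, TSP_ATTESTED, TSP_CONFIGURED, TSP_LOCKED };

static enum tsp_state state = TSP_IDLE;

static bool tsp_attest(void) { state = TSP_ATTESTED; return true; }

static bool tsp_configure(bool device_encrypts_at_rest)
{
    if (state != TSP_ATTESTED) return false;
    (void)device_encrypts_at_rest;   /* host-side vs device-side encryption choice */
    state = TSP_CONFIGURED;
    return true;
}

static bool tsp_lock(void)
{
    if (state != TSP_CONFIGURED) return false;
    state = TSP_LOCKED;              /* configuration may no longer change */
    return true;
}

static bool tsp_admit_trusted_vm(void) { return state == TSP_LOCKED; }

int main(void)
{
    tsp_attest();
    tsp_configure(true);
    tsp_lock();
    printf("trusted VM admitted: %s\n", tsp_admit_trusted_vm() ? "yes" : "no");
    return 0;
}
```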
So last topic is memory expander improvements.
For 3.1, we added an extension to provide 32 bits of metadata per cache line. The spec originally defined two bits of metadata: for non-coherent memory they can be put to host-specific use, but for coherent memory they carry the coherence state, so the host uses those bits for the coherence definition and they can't be used for host-specific purposes. These new 32 bits are intended to work for both coherent memory and HDM-H memory, the non-coherent memory type, so they can be used for either, allowing host-specific use models to extend to both coherent and non-coherent memory expansion. And with up to 32 bits now, there is more data available for more advanced use cases. Some use cases have been talked about; the spec doesn't define them, leaving that up to the host or platform design. Possible use cases include access control, where data is tagged and the requester must present the right tag to access it, or memory-tiering algorithms that rely on this extra storage to manage the tiering of memory. DDR6 is also proposed to align to having up to 16 to 32 bits of metadata per cache line, so this follows other DDR memory technology in that regard. We also included some new spec-defined APIs for managing memory devices: you can manage the correctable error limits on the device, there is the ability to expose more information about the source and type of errors that happen in the device, and there is more control over memory RAS, memory sparing, patrol scrubbing, and those sorts of features. And lastly, we added direct peer-to-peer CXL.mem; we covered that in the fabric discussion, so I won't talk about it more, but it is also relevant to memory expansion devices.
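As an illustrative layout only (the spec deliberately does not define how the 32 bits are used), here is a small C sketch of a 64-byte cache line carried with 32 bits of platform-defined metadata, split into a hypothetical access-control tag and a tiering hint.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

struct cacheline_with_meta {
    uint8_t  data[64];     /* the cache line itself                    */
    uint32_t metadata;     /* 32-bit extended metadata (CXL 3.1)       */
};

/* Hypothetical platform-defined encoding of the 32 metadata bits. */
#define META_ACCESS_TAG(m)   ((m) & 0xFFFFu)          /* requester must match */
#define META_HOTNESS(m)      (((m) >> 16) & 0xFFu)    /* tiering hint         */

int main(void)
{
    struct cacheline_with_meta line;
    memset(&line, 0, sizeof line);
    line.metadata = (42u << 16) | 0x1234u;   /* hotness 42, tag 0x1234 */
    printf("tag=0x%04x hotness=%u\n",
           META_ACCESS_TAG(line.metadata), META_HOTNESS(line.metadata));
    return 0;
}
```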
All right. This is the feature list again. I just went over the first three. Highlighting again, the four new features shown in this table for 3.1 are: closing out the PBR definition with the Fabric Manager API, host-to-host communication through Global Integrated Memory for port-based routing, the TSP definition for the security protocol, and the memory expander enhancements, meaning the RAS capabilities and the 32 bits of metadata.
All right. To summarize, the spec is continuing to evolve to meet new use cases. We have the three areas of new features introduced in 3.1: fabric improvements and extensions, the TSP security protocol, and the memory expander improvements. To support future spec developments, please join the consortium; we have many member companies and many contributors today. The 3.1 spec is available for download on the consortium website, and you can follow us on social media for additional updates.
All right. Q&A.
Thanks, Rob. There are about four or five questions and more coming in. Let's start with the first one from Ankar from NVIDIA. His question is, what is the difference between the symmetrical CXL.mem and UIO usage for P2P?
Yeah. So UIO is a new capability that's part of PCI Express now, envisioned originally in the 3.0 definition, that allows both PCI Express devices and CXL accelerators to directly access memory in a CXL Type 2 accelerator or a Type 3 device. Symmetric CXL.mem is really focused on a CXL accelerator directly accessing a memory expander; it's not intended for targeting a CXL Type 2 device, for example, it's really targeting pure memory expanders. And there's some reasons that... the value of that, potentially, is that the CXL accelerator can now use basic, simple HDM-H memory expanders and get direct access to them, and those targets don't need the unordered I/O protocol. So it is a more specific use case for the CXL.mem protocol, where UIO is more general-purpose and maybe more flexible depending on the use.
Thanks, Rob. This next one is from Vishnu from Micron. Hey, Vishnu. His question was: what solutions is the CXL spec offering to get around the blast-radius issue in a fabric? Customers are concerned about a single device failure impacting multiple servers in a pooling environment. I can take a stab at that, and Rob, you can add to it. My take is, from a pooling perspective, yes, you should look at device and media types of errors, and all of that really needs to be handled from a device perspective. When you see errors on the media while your link is still functional and the device controller is functional, you have means of mapping that to data poison or viral, and means of notifying the host, and the host then handles it the way hosts typically handle memory errors today. The next level is when the device is non-responsive: how do you handle that? The CXL 2.0 specification added an ECN for error isolation, which makes sure that when a link goes down or a device is not responsive, it is taken care of at the host interface: the host takes ownership and creates responses on behalf of that device. So all of those mechanisms are built in. In addition, as you look at pooling, there is the DCD device, and there are mechanisms a device vendor can build on their side to differentiate their capabilities, monitoring the media and the device for health and taking certain actions a priori depending on what they see. So there are various ways you can address it; the spec provides the means to handle these situations. But of course, as new and interesting use cases come up, if there is something you see that needs to be addressed in the spec, please bring it through your representative into the work group, or if you have a proposal, bring that into the work group so we can understand the problem statement and come up with ways to solve it. Rob, anything you want to add to that?
That was good. Maybe just one thing. We do want to be careful in CXL not to go overboard on adding very heavyweight features, so say like end-to-end protocol retry or something that would compromise kind of the lightweight low latency attributes that we get in CXL. So we have to be discriminating to make sure we address the right problems and we don't want to do Ethernet style reliability because that just is probably overboard for a CXL environment.
All right. There are lots of interesting questions on peer-to-peer and those use cases. Maybe we go with the one from Mohan from HPE. The question is: does GIM need to be CXL memory, or can it be DDR5 memory directly attached to the host?
It can be either. So, yeah, GIM is a region of memory that's owned by a particular host and that memory could be native, DDR connected, or CXL attached below that host.
Okay. I'm going to group a bunch of UIO-related questions, Rob, and maybe we can address those together. One is from Zaman from Rambus: what is the difference in benefit between host-to-host with GIM using UIO versus multiple hosts sharing a Type 3 device?
Yeah, I mean, it's a different case. You have fabric-attached memory, or GFAM, which is a memory expansion device that may be shared by multiple hosts. That's a very flexible approach that I would expect to address many high-bandwidth use cases, including ones where the host also wants to cache that data directly. In the GIM use case with UIO, the requester uses unordered I/O, so it wouldn't be expecting to cache that data; it's more like a memory-mapped I/O region from the requester's point of view, and it's coherent at the destination. So it's a different, asymmetric data flow versus fabric-attached memory, which can be made more symmetric across multiple hosts.
All right. Thanks, Rob. The next question is from Mahmoud from Siemens. Do we need UIO to support the host-to-host communication use case, or device peer-to-peer communication, or switch-to-switch communication?
Yeah, I largely addressed that in the last response. GIM is UIO-only for that host-to-host flow. Device peer-to-peer today can use traditional PCIe ordered flows, or it can use unordered I/O, and unordered I/O specifically for peer-to-peer flows lets you access a CXL memory region without going through the host. So within your switch hierarchy, you can use UIO to directly access another accelerator's memory.
And then you can also address that through symmetric CXL.mem. You can still get peer-to-peer, but the intended use case there is only reaching out to memory expansion that's connected, right?
So, yeah.
There's a question from Randy Bright from Intel. Hey, Randy. For CXL switches, are there any unique needs besides the various types of command packets to be moved around? And like other switches, is there a sensitivity to a certain performance like latency?
Yeah, I mean, the CXL.cache and CXL.mem protocols are extremely focused on latency and cache-line-size accesses, whereas UIO, which we were talking about, has variable length; it flows through the CXL.io stack, or PCIe stack, and would tend to have higher latency because of that compared to the CXL.cache and CXL.mem protocols. So we generally use the CXL.cache and CXL.mem protocols, or focus on them, for latency optimization.
Yeah. And I would like to add to that: we're trying to keep and maintain node-level properties as we scale. From a use-model and system perspective, you need to take into account what you expect from a given solution in terms of performance, because at the end of the day a lot of these protocols depend on the resources you have available to sustain the latency and deliver a certain performance, and that doesn't change. So you have to carefully weigh your use case and your performance requirements. CXL is bringing a lot of solutions, but it will depend on what problem you're solving and what your performance requirements are, and then you can see whether you have adequately resourced the system to enable those use cases. We have about four minutes, so maybe we can take a couple more questions. There's one from Sanjay Goyal from Rambus: from memory read fill to sending the DRS back, which credit needs to be available, S2M DRS or S2M NDR?
Yeah, maybe I'll give a little more context here. Memory fill is a new command introduced for TSP, intended to enable the host to do data merging when there's host-based encryption. It flows on the same channel that writes flow on, the RWD channel, as the request does. It is unique in that it is a request that flows on the RWD channel but returns a DRS from the memory expander, and it uses standard DRS credits. The DRS is the data response, and it uses the same data response that any normal read would use to return data.
Okay, thanks. There was one question from Kulwinder Singh from Microchip, from when you were talking about peer-to-peer: will the fabric manager take all the control? Part of it is that the fabric manager is responsible for enabling the setup and the path, but control is very different; the fabric manager is more of the management plane, and that's the way to think about it. Rob, do you want to add more to that?
No, I agree.
Yeah. Okay, I think there was one last question, maybe in a minute. It was just reference that was from Donald from Red Hat. He was just asking what is an MC? It was in one of the slides. Probably it was a memory controller, but I could be wrong without the context.
All right. Well, thank you, Rob and Mahesh for sharing your expertise. Being mindful of the time here, we will wrap up today's webinar. We weren't able to address all the questions we received today, but we will be addressing them in a future blog. So please follow the CXL Consortium on social media for updates. Once again, I'd like to thank you for attending the CXL Consortium Introducing the CXL 3.1 Specification Webinar. Good day.
Thank you.
Thank you.