YouTube: https://www.youtube.com/watch?v=9QTOR9zgcXc
Text:
Okay, thank you very much, Frank, for the kind introduction. Charles laid out the opportunities that we have with memory sharing in software applications, and Thibault covered the opportunity that CXL brings for all of us. Today, I'm going to cover the incremental specification update in 3.1. Tomorrow, if you join us, I will cover more activities within OCP around CXL. So today is just a summary note.
Basically, as you know, the job of the interconnect is to connect compute, storage, and memory across a large data center. CXL 1.1 covered a local node. CXL 2.0 increased the capability through one layer of switching. And now, with CXL 3.0 and 3.1, we are covering fabrics and the new topologies they make possible. This is just a summary note, and we will try to peel the onion a little bit.
Charles did talk about the advantage of CXL and coherent memory for the sake of sharing memory. It's important for us to build this capability on top of things that we already know. CXL relies on PCIe infrastructure for the physical layer and, for the most part, for programming. So "first, do no harm" starts there: what you already expect from PCIe, CXL offers.
Then it offers two new protocols, cxl.cache and cxl.mem, that are optimized for cache-coherent memory transfers.
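As a minimal sketch of how these protocols combine (not from the talk; the device-type breakdown follows the public CXL specification, and the C names are my own), each standard device type carries a different mix of cxl.io, cxl.cache, and cxl.mem:

```c
#include <stdio.h>

/* The three CXL protocols, expressed as flags. */
enum cxl_protocol { CXL_IO = 1 << 0, CXL_CACHE = 1 << 1, CXL_MEM = 1 << 2 };

/* Protocol mix per standard CXL device type. */
static unsigned protocols_for_type(int device_type)
{
    switch (device_type) {
    case 1: return CXL_IO | CXL_CACHE;            /* Type 1: caching accelerator            */
    case 2: return CXL_IO | CXL_CACHE | CXL_MEM;  /* Type 2: accelerator with local memory  */
    case 3: return CXL_IO | CXL_MEM;              /* Type 3: memory expander                */
    default: return 0;
    }
}

int main(void)
{
    for (int t = 1; t <= 3; t++) {
        unsigned p = protocols_for_type(t);
        printf("Type %d device: %s%s%s\n", t,
               (p & CXL_IO)    ? "cxl.io "    : "",
               (p & CXL_CACHE) ? "cxl.cache " : "",
               (p & CXL_MEM)   ? "cxl.mem"    : "");
    }
    return 0;
}
```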
When we put them all together, that's when we get to the converged memory concept. And once the accelerator and the CPU can work off of the same memory block, you can imagine data movement is reduced, energy consumed is reduced, time is reduced. And therefore, we gain performance.
So, all in all, the converged memory environment is basically the main key to ease of software programming. Asking people to do a lot of new work creates friction, and that would make adoption difficult. Efficient data movement is also good for sustainability. It's good for the Earth, it's good for time, and it's good for money as well.
Okay, we did cover this already, but here is a summary of the use cases people have thought about for CXL. Simply, memory can be plugged into one CPU or one GPU; or a new medium, not necessarily DRAM, can be housed within a package and plugged into the CPU independent of what the CPU natively offers. That is a capability that CXL brings as well. And then, using multi-ported memory expanders or switches, we can increase the scope to multiple hosts and larger memory pools, as we discussed today. So those are the current use cases.
Now, life doesn't stop. People's expectations grow when we produce something new. And specifications need to respond to that as well. So CXL 3.1 is in response to some of the good feedback. People talk about reliability and security as an example.
So let's see what CXL 3.1 does in that regard. The new features of CXL 3.1 can be summarized in three aspects: improvements and extensions in the fabric so that we can create new topologies; specific additions for security, privacy, and encryption, which I will touch on in a minute; and, for the memory expander itself, features added to align it with future DDR6, for example.
So on the fabric, as we expect to connect more and more devices, we expect them to be connected through different topologies. For that, new routing requirements have been discussed and addressed. Then, once we have multiple hosts connected to a fabric, being able to share memory across those hosts has been an interesting concept, and that is done through the global integrated memory. It would also be nice if devices could talk to each other directly so they don't create choke points; that is direct peer-to-peer using CXL.mem. Now, putting it all together, some manager, a fabric manager, needs to create and compose a system, so there have been enhancements in the definition of the fabric manager as well.
To the left, you see a system based on the normal hierarchy that we know of, a tree hierarchy. Devices can talk to each other, but to do that, a CXL or PCIe device using this topology has to go up the tree to some central point and then down to the other device. To have efficient data movement, it would be nice if we could think of different topologies. Machine learning, artificial intelligence, and HPC are growing in multiple dimensions, and moving data east-west, not only north-south, has become important. Topologies that in the past were only drawn on paper can now, in fact, be implemented using CXL 3.1 fabric features. So it would be nice if any device could talk to any other device without having to go through a normal tree, which concentrates congestion on the top branches of that tree. So how do we do all that?
For example, device-to-device communication can go through a switch without having to bother the CPU or the link that goes to the CPU. That is done using peer-to-peer .mem. And in that model, since it is CXL.mem, the data can be cached. So that's a new feature of CXL 3.1: accelerators may read or write Type 3 devices and cache the data they receive locally. That is based on the port-based routing technique that the switch may offer, and the devices need to be connected to the downstream ports of an edge switch.
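A toy decision function, with hypothetical identifiers that are not spec-defined, to illustrate the condition just described: direct peer-to-peer .mem applies when both endpoints hang off downstream ports of the same PBR-capable edge switch; otherwise traffic goes up toward the host as in a plain tree:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified view of where a device sits in the fabric. */
struct dev {
    const char *name;
    int  edge_switch_id;      /* which edge switch it is attached to           */
    bool on_downstream_port;  /* attached to a downstream port of that switch  */
};

/* Direct peer-to-peer CXL.mem is possible when both endpoints are on
 * downstream ports of the same PBR-capable edge switch. */
static bool can_use_p2p_mem(const struct dev *src, const struct dev *dst)
{
    return src->on_downstream_port && dst->on_downstream_port &&
           src->edge_switch_id == dst->edge_switch_id;
}

int main(void)
{
    struct dev accel = { "accelerator",      7, true };
    struct dev mem_a = { "type3-expander-a", 7, true };
    struct dev mem_b = { "type3-expander-b", 9, true };

    printf("%s -> %s: %s\n", accel.name, mem_a.name,
           can_use_p2p_mem(&accel, &mem_a) ? "direct P2P .mem" : "route up via host");
    printf("%s -> %s: %s\n", accel.name, mem_b.name,
           can_use_p2p_mem(&accel, &mem_b) ? "direct P2P .mem" : "route up via host");
    return 0;
}
```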
Similarly, if two hosts are connected to a CXL fabric and they want to communicate through memory, traditionally they have to go through another fabric, such as Ethernet or InfiniBand, to move data between them. In a fabric built on CXL, the question has always been: why couldn't we move data amongst the servers, amongst the hosts themselves? That capability was added in 3.1 using the concept of global integrated memory. To do that, what we need is the concept of unordered I/O, so that software can move and push data from one host to another. But just remember that that data is not cached, because it uses the unordered I/O semantic on .io, not .mem.
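A minimal sketch, using my own modeling rather than any spec-defined structures, of the caching distinction just described: peer-to-peer CXL.mem data may be cached by the requester, while host-to-host pushes over global integrated memory use unordered I/O and are not cached:

```c
#include <stdbool.h>
#include <stdio.h>

/* The two transfer semantics contrasted in the talk. */
enum xfer_semantic { XFER_CXL_MEM, XFER_UIO };

struct transfer {
    const char *route;
    enum xfer_semantic semantic;
};

/* Only .mem traffic may be cached by the requester; UIO data is not cached. */
static bool requester_may_cache(const struct transfer *t)
{
    return t->semantic == XFER_CXL_MEM;
}

int main(void)
{
    struct transfer p2p = { "device -> device (P2P CXL.mem)", XFER_CXL_MEM };
    struct transfer h2h = { "host -> host (GIM over UIO)",    XFER_UIO    };

    printf("%-32s cacheable: %s\n", p2p.route, requester_may_cache(&p2p) ? "yes" : "no");
    printf("%-32s cacheable: %s\n", h2h.route, requester_may_cache(&h2h) ? "yes" : "no");
    return 0;
}
```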
Again, to manage all of this, we need a fabric manager. To describe port-based-routing-capable switches, the fabric manager semantics have been enhanced as well, to comprehend how you can disaggregate components and then re-aggregate them through composition, and to enable the dynamic capacity device concept, in which memory can be allocated to different hosts while the system is running.
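To illustrate the dynamic capacity idea, here is a sketch with made-up names and sizes (these are not fabric manager API calls): a fabric manager hands out capacity from a shared pool to hosts at run time and can take it back for re-composition:

```c
#include <stdio.h>

#define POOL_GB 1024                 /* total pooled capacity (illustrative) */

static int pool_free_gb = POOL_GB;

/* Allocate capacity from the pool to a host; refuse if not enough is free. */
static int fm_allocate(const char *host, int gb)
{
    if (gb > pool_free_gb) {
        printf("FM: cannot give %d GB to %s (only %d GB free)\n", gb, host, pool_free_gb);
        return -1;
    }
    pool_free_gb -= gb;
    printf("FM: +%4d GB -> %-7s (free %4d GB)\n", gb, host, pool_free_gb);
    return 0;
}

/* Return capacity to the pool so it can be re-composed elsewhere. */
static void fm_release(const char *host, int gb)
{
    pool_free_gb += gb;
    printf("FM: -%4d GB <- %-7s (free %4d GB)\n", gb, host, pool_free_gb);
}

int main(void)
{
    fm_allocate("host-a", 256);
    fm_allocate("host-b", 512);
    fm_release("host-a", 256);   /* host-a gives its capacity back        */
    fm_allocate("host-c", 384);  /* re-composed onto a different host     */
    fm_allocate("host-c", 640);  /* more than is free, so it is refused   */
    return 0;
}
```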
Thibault mentioned the need for management, and specifically security, as a headwind. CXL 3.1 tries to shoot in front of that requirement: a number of companies worked very hard to build a new security protocol on top of the available TEE, the trusted execution environment, and that is part of CXL 3.1.
You know that we have had link-level encryption as part of integrity and data encryption, IDE, which covers data that goes on the link from one device to another device.
What TSP, the TEE Security Protocol, does is build on top of that and on top of an already known trusted execution environment to eventually create an environment for confidential computing. So the data at rest is encrypted, data in transit is encrypted, and once a host or accelerator understands all those topics, then they can compute and therefore have an enclave for workloads that require confidential compute. Once we do that, we could have trusted VMs do whatever they need to do without having to share or expose their data to the hypervisor or the virtual machine manager of a cloud service provider, for example, so they can be assured that their data is sovereign and protected. To do that, we need to configure these devices, encrypt sensitive data, and verify all of these devices; that is part of what the TSP section of CXL 3.1 covers.
Other enhancements basically describe how you would identify a device and what capabilities it has, configure it, and then enable the specific features that are consistent with what the system administrator or the platform software requires. Once you do that, it is important to lock it so that nobody else can reconfigure and reassign things without the administrator knowing. These are all features that have been covered as part of CXL 3.1. Once we do all of that, we can meet this requirement of data-at-rest encryption and data-in-flight encryption, and therefore eventually arrive at computation done in enclaves for confidential compute.
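A hedged sketch of that configure-then-lock flow, with field and function names of my own (they are not TSP message names): once the configuration is locked, further changes are rejected until the administrator unlocks it:

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified security configuration state of a device. */
struct sec_config {
    bool ide_enabled;       /* link-level integrity and data encryption */
    bool memory_encrypted;  /* data-at-rest encryption                  */
    bool locked;            /* configuration is frozen                  */
};

/* Apply a configuration; refuse if the device has been locked. */
static bool configure(struct sec_config *d, bool ide, bool mem_enc)
{
    if (d->locked) {
        puts("rejected: configuration is locked");
        return false;
    }
    d->ide_enabled = ide;
    d->memory_encrypted = mem_enc;
    puts("configured");
    return true;
}

int main(void)
{
    struct sec_config dev = { false, false, false };

    configure(&dev, true, true);    /* administrator enables encryption features */
    dev.locked = true;              /* lock so nothing is silently reconfigured  */
    configure(&dev, false, false);  /* later attempt is rejected                 */
    return 0;
}
```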
On the memory expander capability, to be aligned with the future capabilities of DDR6, CXL 3.1 introduces 32 new bits of metadata. You know that two bits are sufficient to define cache coherence in the form of a cache line being shared, exclusive, modified, or invalid; that is customary as part of metadata. But with 3.1, we have 32 bits that can be used for many use cases. Different companies and different software platforms can choose to use those 32 bits to enhance the integrity of the data, either against faults, for RAS, or against malicious use, for security. Some of these features are being discussed at JEDEC for DDR6, and CXL 3.1 is shooting in front of that puck as well. Once we do that, we enable a bit more of the resiliency and security that customers and workloads are requiring. That's how we can move up the chain and provide a better fabric that is highly available and robust for all the expansion capabilities that we expect of a fast fabric.
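As a sketch of that idea (the layout is illustrative, not the CXL 3.1 or DDR6 wire format), two bits cover the classic modified/exclusive/shared/invalid state, and the extra 32 bits are left for platform-defined uses such as RAS tags or security markings:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Classic 2-bit cache coherence state. */
enum coh_state { COH_INVALID = 0, COH_SHARED = 1, COH_EXCLUSIVE = 2, COH_MODIFIED = 3 };

/* Per-cache-line metadata: 2 coherence bits plus 32 usage-defined bits. */
struct line_meta {
    unsigned coh_state : 2;
    uint32_t extended;       /* meaning chosen by the platform (RAS, integrity, security) */
};

int main(void)
{
    struct line_meta m = { COH_EXCLUSIVE, 0xA5A5A5A5u };
    static const char *names[] = { "Invalid", "Shared", "Exclusive", "Modified" };

    printf("coherence state:   %s\n", names[m.coh_state]);
    printf("extended metadata: 0x%08" PRIX32 " (platform-defined)\n", m.extended);
    return 0;
}
```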
Putting it all together, CXL 3.1 is a superset of CXL 3.0 with new features, and it is backward compatible with earlier specifications. This table is trying to show that. Therefore, when people are building a new device or new capability, they can in fact start with the CXL 3.1 feature set. It is true that not all processors or devices support all of the features. But since the spec is there, and there is a very methodical way of describing which features a particular device offers through discovery and capability bits, it is quite all right to pick some features that are 3.1 and still use them in a system that is mostly CXL 2.0, because it is backward compatible. That device can live through another generation and be forward compatible with future processors and future systems. People are doing that already, picking up 3.1 spec features and testing them through POCs, even on CXL 2.0 or even 1.1 systems, because through software magic you can test them out, learn from them, and be ready for when the full system is aligned with CXL 3.1. That is basically my recommendation: you don't have to wait until everything is ready before you start. You really need to start now and be aligned when everything is compatible.

So, summarizing: the CXL specification continues to evolve to meet all of the usage models. For that, CXL 3.1 has offered new fabric capabilities, added features for trusted environments, and increased capabilities for memory expanders. That's basically a summary of the CXL 3.1 spec modifications and enhancements.

Tomorrow, we will go much deeper into how these will be applied to systems and how OCP members, Open Compute Project members, are taking advantage of CXL as a technology to build larger and larger systems. There is great collaboration going on as part of three major workstreams: CMS, Composable Memory Systems; the Extended Cable and Connectivity workstream; and Data Center-Ready Modular Hardware System, which touches on shapes, module sizes, dimensions, and mechanical features of these CXL-enabled systems. Putting it all together eventually goes to racks, large racks, and Rack and Power is another very active project within OCP. So we will cover some of these things tomorrow. If there are any questions, I'm open.