All right. My name is Shyam Iyer. I am the chair of the SNIA SDXI Technical Working Group. SDXI stands for Smart Data Accelerator Interface. I'm also a member of the SNIA Technical Council and a Distinguished Engineer at Dell. But here I'm representing the technical working group and what it managed to accomplish for the version 1.0 specification and beyond.
So, the standard legal notice on the presentation applies. In terms of the agenda, what I'll cover today is what's happening in the industry, and I'll talk a little bit about the need for a memory-to-memory data mover. I'm going to touch on some of the use cases where a memory-to-memory data accelerator could help you. Then I'm going to tell you about the SNIA SDXI Technical Working Group, how it's trying to solve some of these problems, and what design goals it had in mind when it wrote the specification. I'll go a little into the specification as well, to give you a bit of a pre-read if you haven't had a chance to read it. Then I'll dive into some of the futures, where the standard is heading. And it's not just spec talk; we are also looking at implementations, so one of the key things is community, including the software community, and how we are trying to foster a community around SDXI. Then I'll sum it up. Before I begin, how many of you have heard of SDXI? Okay, that's a fair amount. Did you already have a question in mind before we get started? Or let's get to it, and we can make it interactive as we go.
So if you've been paying attention to, or viewing, the computer architecture industry for a while, we've had some loosely defined bubbles, as I call them. An application typically interacts with a compute bubble for any kind of computing horsepower. The data being computed on is stored in memory; that's the data-in-use memory. When an application needs to scale, you get more threads and more cores attached to it. Typically, the compute and the memory are in a coherency domain. That's historically been the case for performance reasons, because that's where the near memory for the compute is. Any time you wanted to store data or transport data, you typically used an I/O device. Compute and I/O have typically been in a non-coherent relationship, if you will. If you wanted to transport the data that is in memory, or store it, or bring it back into memory from the storage or transport layers, you typically had DMAs to help you with that. And I/O performance has been tuned with latency and bandwidth optimizations. So this is what has been happening in our computer architecture for a long time now.
I mean, this isn't anything new, but things have been changing, right? All of us have been seeing that application demands are not being met by just one kind of compute. So you have a variety of compute elements that can help an application perform its functions, whether it be CPUs, GPUs, computational storage drives, NICs and DPUs that can help you with some kind of processing, or even FPGAs that serve specific functions for the application. Memory is also changing. If you heard the talk right before this one, Jim covered the various types of memory that are becoming available in the market, not just because Optane enabled them, and now it's not there, but because CXL is now enabling a whole bunch of memory types. And to connect them, we have these new memory links or fabrics like CXL. But if you look, an application is still concerned about latency, bandwidth, coherency, control, capacity, all these things. So there is a mashing together of these different technologies, compute and memory types, and we need a memory-to-memory data movement standard for various reasons.
If you look at what our data movement standard is today, it's a software-based memcpy. Does anyone disagree? I mean, it's a trick question, but it is what it is, because for many years we have had very stable instruction set architectures; it doesn't matter if you're on x86, Arm, or any other processor family, memcpy has been very stable. But if you're spending all your time moving data, that's not what a compute element may want to do all the time. So you're taking away from application performance by using memcpys just for doing the data movement. You may also incur software overhead just because you need to provide various context isolation layers, whether that's virtualization, containerization, whatever it is. But they are there for a reason: they provide isolation and security. Offload DMA engines are not a new concept. I mean, how many of you have worked with DMAs all your life? I'm sure I can't count them on my fingers. Each of them has its own programming interface. And the funny part is that each vendor, for performance reasons, may not be able to keep the same programming interface between generations, or if the use cases change. So we haven't had an architectural data mover in x86 or any other processor architecture family for a long time. Old-timers tell me there used to be a DMA device, but that is long gone. We don't have an architectural data mover in a server architecture today, and it's not standardized for user-level software. Which means that an application cannot directly use a DMA interface if it needed to, unless it hides behind different kinds of software frameworks that enable that.
So I talked a little bit about the topologies, the architecture topologies, but you also have newer application patterns that you may want to think about, right? If all you're doing is a memcpy from application space, there is absolutely a case for an accelerator. Instead of waiting for the copy to finish (imagine copying a terabyte's worth of memory while your application just sits on that, when it could be doing something else), what if you told an accelerator, 'Hey, I have some new work for you. I want you to copy this terabyte of memory from this location to this other location'? So you enqueue a work descriptor and ring a doorbell to tell the accelerator, 'I have some work for you.' The accelerator goes and does this data copy of one terabyte of memory, and when it's done, it gives you a completion saying, 'I'm done.' All this while, the application can be doing something more useful. That might be interesting for an accelerator.
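To make that enqueue, doorbell, completion pattern concrete, here is a minimal C sketch of the flow; the sdxi_* helpers and the copy_request structure are hypothetical illustrations of the pattern described above, not the actual SDXI or libsdxi interface.

```c
#include <stdint.h>
#include <stdbool.h>

struct copy_request {
    uint64_t src;   /* source address */
    uint64_t dst;   /* destination address */
    uint64_t len;   /* e.g. one terabyte */
};

/* Assumed helpers: enqueue a descriptor, ring the doorbell, poll completion. */
extern void sdxi_enqueue_copy(struct copy_request *req);
extern void sdxi_ring_doorbell(void);
extern bool sdxi_copy_done(void);
extern void do_other_useful_work(void);

void offload_large_copy(struct copy_request *req)
{
    sdxi_enqueue_copy(req);      /* post the work descriptor */
    sdxi_ring_doorbell();        /* tell the accelerator there is new work */

    while (!sdxi_copy_done())    /* completion not yet signaled */
        do_other_useful_work();  /* the application keeps making progress */
}
```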
Another application pattern could be that you're trying to store data into storage or retrieve it back. If you look closely at your stack today, a typical application may be in user space, and there is a kernel-space boundary. So the data may need to be copied from user space to a kernel buffer, right? And depending on the operating system, you may need to do another copy from the kernel buffer to a DMA-able memory buffer. Why? Because you need a DMA buffer for this particular storage device to pick up the data from; it can't copy from a space it can't reach into, right? So you've already done two copies before the data was stored in the storage. If you're trying to retrieve it back, you have to walk back the same steps in terms of the number of copies: a copy into a DMA-able buffer, then another copy into a kernel buffer, and then it gets copied into a user-space buffer. So it's not very efficient in many cases, right? Many of the persistent memory use cases ask: why don't we just store data into a persistent memory buffer, and why don't we just use a simple memcpy to do those kinds of data copies? You do get a lot of benefits with that, and lots of research papers have been written on the performance benefits you can get. But if you're doing that for a large data copy, you still have the same problem: your application is now stuck just doing those memcpys. You could use an accelerator for these kinds of application patterns as well, just like the previous one.
A third use case could be: okay, you have a server with different virtual machines, and now you want to move data from one virtual machine's user-space buffer to another virtual machine's user-space buffer. If you think about what is going to happen here, a copy will happen from that virtual machine's user-space buffer to a kernel buffer. And let's imagine for a moment that the hypervisor has optimized it in such a way that it can point that same buffer at an I/O device to transfer it to the other VM. The I/O device may need to use some kind of I/O fabric in the back end, and the data comes back to the same I/O device or a different I/O device, like a NIC. Then the NIC may have to DMA-write that into a kernel-mode buffer before it finally gets copied to the second virtual machine's user buffer. Again, context isolation is nice, but it does add to the number of buffer copies you need to actually do the data movement. Wouldn't it be nice if you had an accelerator interface that you could interact with from a virtual machine's user space and tell it, "Here's my user-space virtual address buffer, and this is what I want you to move to this other virtual machine," and the accelerator did that? It resolved those user virtual address spaces, took the right physical addresses, got the data, turned it around, and wrote it back into the other virtual machine's user address space, protected by hardware isolation. That's the key part, right? And these are the kinds of needs we are looking to solve with an accelerator.
Something else, like we talked about, is memory expansion itself. You had DRAM on DDR buses, and now you're expanding to a far memory with CXL-based fabrics. And you basically want to be able to read from different tiers of memory. Remember, all of these have different memory characteristics, so some may be fast and some may be slow. Certainly everyone is trying to make them as fast as they can, but there is going to be more latency than accessing near memory. You may want to use an accelerator to do these kinds of data copies instead of having a compute element or a CPU just wait on the data being transferred.
So what would a stack look like if you were to build something like this? There are various places where applications exist that might need this kind of accelerator, because in the olden days people used accelerators only from a kernel-mode driver or an application in the kernel, because accelerators could not target user-mode addresses. So a kernel-mode driver may, at minimum, initialize the accelerator and discover its capabilities. You may want a kernel-mode application to be able to perform data movement; say, for example, the operating system wants to tier memory between persistent memory and local DRAM. That may be an application in kernel space that wants to do that, so you want to give it a work descriptor ring for it to perform that kind of tiering and work directly with the accelerator. Now, what about an application in user space? For that, you may want to enable it with the help of libraries that set things up and make sure it gets access to an accelerator interface. And now the accelerator is directly available to a user-mode application. This application gets its own descriptor ring, a context, and ways to interact with the accelerator through standard, architected states. So this is desirable if you are just looking at a bare-metal stack.
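As a rough sketch of that bare-metal stack from the application's side, assuming a hypothetical user-space library (the function names and the /dev/sdxi0 node below are illustrative, not the actual libsdxi API), the flow might look something like this:

```c
#include <stddef.h>

struct sdxi_context;   /* opaque handle: the app's ring, indexes, doorbell mapping */

/* Assumed library calls; the kernel-mode driver is presumed to have already
 * discovered the device and set up the context tables. */
extern struct sdxi_context *sdxi_open_context(const char *device_path);
extern int  sdxi_submit_copy(struct sdxi_context *ctx,
                             void *dst, const void *src, size_t len);
extern int  sdxi_wait(struct sdxi_context *ctx);
extern void sdxi_close_context(struct sdxi_context *ctx);

int user_mode_copy(void *dst, const void *src, size_t len)
{
    /* The library maps this application's own context and descriptor ring. */
    struct sdxi_context *ctx = sdxi_open_context("/dev/sdxi0"); /* hypothetical node */
    if (!ctx)
        return -1;

    int rc = sdxi_submit_copy(ctx, dst, src, len);
    if (rc == 0)
        rc = sdxi_wait(ctx);

    sdxi_close_context(ctx);
    return rc;
}
```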
If you were trying to do the same thing with different tiers of memory, you want the same kind of access from both kernel-mode and user-mode applications, because you want to know when to bring in those different tiers of memory or move data elsewhere.
And it's not just one application. You may have multiple user-mode applications, and for good reasons they are separated into different address spaces. Think about one container and another container: they are generally in different process address spaces. Now, if you wanted to move data between them, again you're going down the path of pushing those data copies through multiple layers of software. This accelerator should let you copy from one user-space buffer to another user-space buffer without involving privileged software to do those copies for you, in a secure way, of course.
You want to allow the same thing if your application is virtualized, whether it's a virtualized kernel-mode application doing this kind of VM-to-VM data copy or a user-mode application doing it. Sometimes people tell me, "Okay, I have an RDMA device emulated in my guest kernel, and that's a kernel-mode application to me." A real RDMA device may or may not exist behind that virtual machine. But if you're just doing memory-to-memory copies, you could default to this kind of accelerator to do that VM-to-VM data movement, even if you didn't actually have an RDMA device on that system. But you still get hardware acceleration.
So the SNIA SDXI TWG started working on this new standard for a memory-to-memory data movement and acceleration interface that is, at its heart, extensible, forward compatible, and independent of the I/O interconnect technology. So far I haven't said anything about what interconnect this memory data mover is going to use, and that's the goal, because we are trying to describe the structures that are required for you to perform the data movement, and implementations can bring their own actual implementations and make them perform faster for different kinds of workloads and use cases. This TWG was formed in June of 2020 and tasked to work on this. About 23 companies contributed, although the membership is much larger than the number of companies I listed here, and 89 members participated in various meetings and reviews. The good news is that 1.0 is released. We started in 2020 and we released in 2022, so that was a really fast spec development effort. Part of it is because some of the companies had started work on this prior to SNIA and brought in a not fully formed spec and got reviews from the community. And it went through multiple public reviews before becoming a 1.0 standard.
So what are the design tenets? Like I mentioned, and I don't want to just repeat it, we are trying to make sure the data movement can happen between different address spaces, whether they are in different virtual machines or different user address spaces. Data movement should also happen without mediation by privileged software. Once you have allowed access for, say, two user-mode address spaces to do data movement, you don't have to mediate each transfer, and that's built into the spec. It also allows for abstraction and virtualization pretty easily, because we tried to architect the states that way. You can also quiesce, suspend, or resume the state of this data mover. That's very important, because when you're using an accelerator, someone may say, "I want to move this work to some other host that may or may not have the accelerator. How do I stop this work in a predictable way, with architected states, before I move to the next server?" We have also tried to enable forward and backward compatibility for future versions. And the way the standard is written, you can incorporate additional offloads as you're moving the data; I'll explain in a few slides how we're trying to do that. And it's a concurrent DMA model, which means it's not the case that if you started one DMA, that's the only thing the device will do; there can be multiple DMAs happening at the same time.
Okay, something else we tried to do was make sure it is not specific to one CPU architecture family. There isn't a specific instruction that you need in order to generate the work or have the work completed. Now, certainly some architecture implementations may be able to generate work for this interface a little more effectively if you had a certain kind of instruction, but the specification does not require you to be wed to a specific CPU architecture. And that's really important for a standard like this. We make sure that we are cutting down the software layers, so you can program this directly from user mode while preserving the isolation and security constructs. We're also designing it in such a way that you won't just be targeting DRAM-class memory. Other kinds of memory are also something we intend to support as part of the standard, and we are just waiting for different kinds of memory to show up to be able to do that.
Yes.
Yeah, great question. Let me repeat the question: does the accelerator impose any ordering, or is it left to the transport? The answer is, the accelerator does have certain knobs that can impose ordering, but because the standard is interconnect independent, on specific interconnects you may see other kinds of ordering imposed by the interconnect itself. For example, if this were a PCI-based DMA, then the DMA would happen using PCI ordering. If it were CXL, CXL ordering may be in play as far as the interconnect is concerned. But the accelerator also gets certain knobs, not hints, I should say, certain kinds of controls, on how it should order operations at a descriptor level. And I'll get to that in a later slide as well.
Something else the standard is trying to achieve is to make sure the interface can be implemented in different form factors. So an integrated CPU implementation is possible. At the same time, it can be implemented in discrete packages like GPUs, FPGAs, SmartNICs, or I/O devices, right?
So here's a simplified view of how things can be set up. I start with the thing on the left called the SDXI function. I've been asked this before: it's not necessarily a PCI function. For all you know, it could be a software function, because all we're trying to do is establish the standard memory structures that you need to interact with a standard data mover. If you were doing a PCI implementation, this would be a PCI device function. At this level it has some MMIO spaces that you can program, and they point you to the memory structures you need to set up a context. A context is where a user-space application may want to interact with this function and enqueue work to it. You will have context tables that describe the various contexts that have been set up. Each of these contexts has its own control and state information, all architected as part of the spec. Each context has its own descriptor ring, and this is how you interact with the function directly. The ring works like any typical descriptor ring. Any time you enqueue a work descriptor, you increment the write index from the producer's point of view. The reason I say producer is that it may be software, or it could be something else that enqueues the work. Any time the function reads descriptors, the read index is incremented. And there is a mechanism to notify the function of new work using doorbells. Each descriptor may have two, one, or any number of buffers that you intend to do the data movement into. And each descriptor has a pointer to a completion status block where, once an operation is finished, the completion is signaled.
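Here is a structural sketch, in C, of the hierarchy just described: function, context table, per-context state, descriptor ring, and completion status block. The field names and sizes are illustrative only and do not reflect the spec's exact layout.

```c
#include <stdint.h>

struct sdxi_cmpl_status {           /* completion status block in memory */
    volatile uint64_t pending;      /* updated when an operation finishes */
};

struct sdxi_desc {                  /* one work descriptor slot */
    uint8_t  body[48];              /* operation-specific fields */
    uint64_t cmpl_ptr;              /* points at a completion status block */
};

struct sdxi_context {
    struct sdxi_desc *ring;         /* this context's descriptor ring */
    uint64_t          ring_entries;
    volatile uint64_t write_index;  /* producer side: incremented on enqueue */
    volatile uint64_t read_index;   /* function side: incremented on read */
    /* architected control/state for start/stop/suspend lives here too */
};

struct sdxi_function {
    struct sdxi_context **context_table;  /* one entry per configured context */
    uint64_t              num_contexts;
    /* MMIO registers point privileged software at these structures */
};
```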
Yes.
So I look at io_uring as a layer above us, maybe. And yeah, there are certainly applications that use io_uring which could use SDXI downstream of that. Although I haven't looked very deeply at io_uring, I should admit. But from what I've looked at, it seems to be a higher layer than this.
So I talked about address spaces. Something that SDXI allows for is this: if you look at this, this is a context address space, right? The address space that the context lives in. But if it wanted to target a remote memory, it needs to know which address spaces it can target. That is governed by something called the A-key table, or address access key table, which tells you the different address spaces you can target for your memory data movement. In the same way, if a remote function is trying to reach into your local resources, you also need some kind of protection to make sure it's allowed to do that, and that mechanism is provided by an R-key table. The shaded region I'm talking about here is what a user-space application has access to. So it's a layered model: the user-space application does not need to deal with how to manage the context, control the states, or set it up. The outer region is what the privileged software has control over. This is useful when you want to stop a context or resume a context; you have access to those data structures. There is one error log per function, and that allows privileged software to look at how to remediate any kind of problem. In the 1.0 spec, if you get an error, the context comes to a stop, and then you need to restart it. As you can see, all states are in memory, so it's easy to virtualize. And one of the key things is, we had one of the virtualization gurus write part of the spec here. One of the reasons this is so virtualization friendly is that one of the requirements was that any accelerator they want to use cannot be something that stops live migration, and that's why it was built into the standard. Like I said, PCI implementations need a PCI device binding, and for that reason we went and registered a class code for PCI implementations with PCI-SIG.
This just gets a little deeper into those structures. Multiple contexts can be managed by one function. The context tables point to the different context-level structures, and each of the contexts may have its own ring and address access key (A-key) table. If you notice, there's one error log for each function; that's why the privileged software needs to look at it. And then there is one R-key table to allow remote functions access to the local resources. Again, there is one architected way to start, stop, and administer all these contexts. That's what has been put into the spec.
This goes a little bit into the functions. It might look similar to PCIe SR-IOV: if you are a PCI-based implementation, you can implement this using various PFs and VFs. PFs are a little more special than the VFs, but that's something I'll let you look into in the spec as well. Each of these PFs and VFs can have its own set of contexts. Like I said on the previous slide, a function can have multiple contexts. So even though I'm showing a few contexts per function, you can imagine that up to 2^16 contexts are possible per function. Something else that is different in the spec is that multiple SDXI devices can be part of one function group. What that means is, if you are part of the same function group, you can do data movement from one device's address space to another device's address space, because they are part of the same function group. And that's pretty neat because they could potentially be different devices, or they could be part of the same, let's say, PCI add-in card. As for the fabric between the SDXI devices, the standard does not specify what that fabric needs to be; it's out of the scope of the spec.
So this is the descriptor ring. It's pretty standard. The one thing unique about the ring is that you can keep incrementing the write indexes. The write index values don't wrap around, because each is a 64-bit value, but the actual slot location does wrap around, and that's what is shown in this complicated math here. The descriptors are processed in order, but they can be executed or completed out of order, so there's a whole bunch of parallelism built in. The function is always allowed to read valid descriptors even without receiving a doorbell. So if it's constantly polling for descriptors, it doesn't have to wait for a doorbell to start working on the data, as long as the descriptor is valid.
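The index math amounts to free-running 64-bit counters whose difference gives the ring occupancy, with only the slot location wrapping. A small sketch, with illustrative names:

```c
#include <stdint.h>
#include <stdbool.h>

/* The 64-bit write/read indexes never wrap; only the slot location does. */
static inline uint64_t slot_of(uint64_t index, uint64_t ring_entries)
{
    return index % ring_entries;    /* physical slot wraps around */
}

static inline bool ring_full(uint64_t write_index, uint64_t read_index,
                             uint64_t ring_entries)
{
    /* Because the indexes are free-running counters, occupancy is a
     * simple difference. */
    return (write_index - read_index) >= ring_entries;
}
```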
So this is the format, just a brief description of the format of the descriptor. If you look at the top part of the descriptor, there's that valid bit. The descriptor has to be valid for the function to start working on it. The function can receive a doorbell, or it can proactively go and read a descriptor as long as it's valid. The control fields, Chandra, to your question, have those bits. One of the bits is a sequential bit, which means that if a previous descriptor operation has writes associated with it, you want the writes of this descriptor to follow the writes from the previous descriptor, or within the same descriptor itself. So that kind of ordering is followed between descriptors. There is also a bit for fencing, which says that all operations prior to this descriptor have to complete before this descriptor's operation happens. And there are other control fields we are looking at for the next version of the spec, like: can we have read-specific barriers so that you read only after the previous write has completed?

Operations are grouped into operation groups. Some of the operation groups I'm showing here are the DMA group, an atomic operation group, and a vendor-defined group, although there are not too many vendor-defined groups at the moment. There is also an administrative operation group. One thing to note is that different contexts can support different sets of operation groups, but context zero must support at least the administrative operation group. So context zero is a little more special than the other contexts for each function. You may notice that there's an atomic operation group as well as a minimal atomic operation group. Depending on the interconnect, atomics may not always be possible; for example, with PCIe you cannot always get full atomics for certain kinds of implementations. So we also defined certain minimal atomic operations that you can achieve with those interconnects, although that's more of an interconnect-specific decision than the spec's. If the interconnect supports full atomics, then you are free to do that as well.

These are some of the operations that have been defined in the 1.0 spec. Copy is pretty common. Write immediate basically means you want to immediately write something into a location and embed that data as part of the descriptor, which is pretty useful for messaging. The admin operations are all part of the admin operation group, like start, stop, update, or sync. The body itself carries different data structures, depending on what kind of operation we are talking about, so I won't get into that right now. But the completion pointer points to a completion status block: the memory address where a function will signal that it has finished the operation. The spec defines a completion signal as atomically decrementing the value at this location. So if you had just a one there, you could simply overwrite it with zero; but if you had a larger number and multiple descriptors pointed to the same completion status block, the function has to atomically decrement the value each time when they all share the same completion status block.
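As an illustration of the fields just discussed, here is a hedged sketch of a descriptor and of the completion-signaling rule. This is not the spec's bit-level encoding, just the shape of the information the descriptor carries.

```c
#include <stdint.h>
#include <stdatomic.h>

struct sdxi_desc_sketch {
    unsigned int valid      : 1;   /* function only processes valid descriptors */
    unsigned int sequential : 1;   /* order this descriptor's writes after the previous one's */
    unsigned int fence      : 1;   /* all prior descriptors must complete first */
    unsigned int op_group   : 8;   /* e.g. DMA, atomic, admin, vendor-defined */
    unsigned int op_type    : 8;   /* e.g. copy, repeat copy, write immediate */
    uint8_t  body[48];             /* operation-specific fields (addresses, A-key indexes, ...) */
    uint64_t cmpl_ptr;             /* address of the completion status block */
};

/* The completion rule described above: atomically decrement the value at the
 * completion status block when an operation finishes. */
static void signal_completion(_Atomic uint64_t *cmpl)
{
    atomic_fetch_sub(cmpl, 1);
}
```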
But from a producer's point of view, how you want to do that is all implementation defined. You can also implement it kind of like how NVMe does it: put the completions in a ring, and each completion goes to a location in a ring memory buffer. So there's flexibility from that point of view.
So this is just one of the examples I wanted to give, for repeat copy, which was there in the previous picture. Something hypervisors, or virtualization vendors, sorry, really like to do is take a large memory buffer and fill it all with zeros. So this is something that went into the spec: you can take a four-kilobyte buffer of zeroed pages and fill a four-gigabyte memory buffer with all zeros. That's something you can do with an operation like this. We are trying to add more of these kinds of memory manipulation operations to the spec, more like POSIX-style memory operations, in the next versions of the spec.
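A sketch of how a producer might express that zero-fill, again with hypothetical structures and helper names rather than the spec's actual descriptor encoding:

```c
#include <stdint.h>

struct repeat_copy_req {
    uint64_t src;        /* 4 KiB buffer of zeros */
    uint64_t src_len;    /* 4096 */
    uint64_t dst;        /* start of the region to fill */
    uint64_t dst_len;    /* e.g. 4 GiB */
};

/* Assumed helper that builds and enqueues the corresponding descriptor. */
extern void sdxi_submit_repeat_copy(const struct repeat_copy_req *req);

void zero_fill_region(uint64_t zero_page, uint64_t region, uint64_t region_len)
{
    struct repeat_copy_req req = {
        .src = zero_page, .src_len = 4096,
        .dst = region,    .dst_len = region_len,   /* 4ULL << 30 for 4 GiB */
    };
    sdxi_submit_repeat_copy(&req);   /* the accelerator replicates the source */
}
```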
This goes a little into the format and some other things as well. An address is an address, but we try to make sure you can also use a host physical address, a host virtual address, a guest physical address, or a guest virtual address. Other things are associated with the address: you need to know which address space you're talking about when you're talking about that address, because I don't want to go to Boston using the map of San Francisco. So that's one of the important things about an address space. There are also cacheability attributes associated with an address. For example, PCIe has TLP hints: do you want to have a hint on what to do when you're accessing that address space? Maybe the address space you're targeting is an MMIO space. Those are the kinds of attributes you can put into your descriptor. For access, I talked about the A-key table. The A-key table has a set of A-key table entries in contiguous locations, and they denote the different address spaces that the function can address.
Yes.
Okay, so the question is, in architectures where the cache is not coherent, does the standard take care of flushing the cache? And I think it depends. Maybe you're asking, will it flush it to a persistent memory buffer? Is that the question? Yeah. Right. So you certainly need to know what architecture or interconnect you are trying to do this data movement over. In many cases, if you read back what you wrote, that will take care of flushing it. But some of the guarantees have to come from the interconnect. For a posted memory write, like in PCIe, once you've issued it, the operation may be reported complete, but just to make sure it actually went through, you might want to read it back. That's something you can do as an application programmer.
I was talking about the access key table entries. These are the different address spaces that you can access. If you notice, there is something called a target sfunc. This is basically a function handle; it tells you what function handle to use when you're accessing that remote memory. In the case of PCIe, that could be as simple as a requester ID. Now, you may also want to use a PASID to address that remote address space, because even though you use the same requester ID, the remote function's address space may be split into multiple address spaces using PASIDs. There is an identifier for an R-key, and this is an index that you use to look into the remote function's R-key table. So this index points you to the R-key table entry in the remote function. If you look at the R-key table entry here, you got to that specific R-key table entry using the index from the A-key table entry you referenced in the descriptor. Only if the requesting sfunc matches the requesting sfunc recorded in that R-key table entry will you be able to access the remote function's address space. This is the kind of security built into the spec. And you also need to match the PASID that you are trying to address.
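Conceptually, the check works roughly like the sketch below. The structures are simplified illustrations, and the real A-key and R-key table entries carry more state than shown, but they capture how the target's R-key entry gates access by requesting function and PASID.

```c
#include <stdint.h>
#include <stdbool.h>

struct akey_entry {              /* in the requester's A-key table */
    uint16_t target_sfunc;       /* function handle to use for the access */
    uint32_t target_pasid;       /* address space within that remote function */
    uint16_t rkey_index;         /* index into the target function's R-key table */
};

struct rkey_entry {              /* in the target function's R-key table */
    uint16_t allowed_req_sfunc;  /* the requesting function this entry permits */
    uint32_t exposed_pasid;      /* the local address space this entry grants */
};

bool remote_access_permitted(const struct rkey_entry *rk,
                             uint16_t requesting_sfunc, uint32_t requested_pasid)
{
    /* The target only honors the access if the requesting function matches
     * what its R-key table entry allows, and the PASID matches too. */
    return rk->allowed_req_sfunc == requesting_sfunc &&
           rk->exposed_pasid     == requested_pasid;
}
```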
So this is a somewhat complicated example, but you can imagine that you may do single-address-space data movement or two-address-space data movement. This example deals with data movement happening between address space A and address space C, but the entity that is actually doing the data movement is in address space B. So the producer is in address space B. An example of this might be a hypervisor in address space B, with virtual machine A in address space A and virtual machine B in address space C: the hypervisor is reaching into address space A, taking the data, and writing it into address space C. Typically that would be performed using software-based memory copies. If you had an accelerator like this, how would that happen, while still preserving security? Basically, you enqueue the work, and from the descriptor you get the A-key indexes that get you to the A-key table entries. From the A-key table entry, you get the R-key index, which is what the target function uses to look into its R-key table entry. In that entry, if the requesting function matches, you are able to access that address space. You do the same thing for the destination buffer's address space. If that matches, you can now do the DMA read from the source buffer here and a DMA write into the destination buffer there. So it's a more complicated example, but it gives you a feel for the power of this and how you can achieve some of these things.
Let me get into the futures. We are looking at 1.1, and these are some of the investigations we are doing. In 1.0, we talked about how you can do this if you had multiple address spaces, but we didn't talk about what makes the connection. You need a connection manager to broker the connection between address space A and address space C. Some of the thoughts there were that it can be done through a sideband or out-of-band channel, but there is also interest in making sure that is architected and made part of the standard. So we're looking at that as part of the 1.1 investigations. We are looking at newer data mover operations as we expand the spec. We're looking at host-to-host data movement, because although it is possible to do host-to-host using the 1.0 spec, we are taking a more holistic approach with 1.1 to see what the additional needs are for host-to-host data movement. We are trying to address some of the scalability and latency improvements we can make, and some security features, not security concerns, security features. For example, if you had confidential memory, like confidential computing has, can you use this accelerator to reach from one confidential memory to another confidential memory, or within that confidential memory? Those are the kinds of things this accelerator interface is also looking at. We're also looking at how to do QoS so that there aren't noisy-neighbor problems and other things.
Let me get into some of them in pictorial form, because some people, people like me, really like pictures. This is how we can add more operations as you're moving the data from a source buffer to a destination buffer: for example, compression, or maybe you want to do a CRC, or maybe you just want to do some kind of XORing, for all you know. These are the kinds of operations we are looking at for the 1.1 spec. Some sample set may get added as part of 1.1, and some others may get pushed out to 1.2, but if you have a favorite operation that you wanted to get standardized, now's the time.
Then there's the connection manager work that I talked about. Basically, if you had an address space A and an address space B, how do you actually create the connection so that address space A can effect data movement from that address space to this address space? Like I mentioned, between the two functions, the inter-function fabric is out of scope for the spec; it can be implemented in many different ways. And that allows people to ask, "Oh, can I do this host to host as well?" The answer is yes, as long as you conform to the structures facing the host; for these kinds of use cases, you can have any kind of inter-function fabric in between.
This is something that has come up in a lot of discussions as well: okay, what about CXL? CXL certainly adds to the system physical address space. The picture on the left shows a possible implementation of an SDXI device that is PCI based, and it can certainly move data from CPU-attached memory to another region of CPU-attached memory. But once the memory address space gets expanded because of, say, a CXL bus in the architecture, you can also target memory behind a CXL memory bus; all of that is possible. The picture on the right shows the device as an actual CXL device. So the SDXI device could be a CXL device as well, and it can have memory attached to it. Now you're talking about data movement between CPU-attached memory and device-attached memory as well.
This is work that is ongoing with the Computational Storage work group; Jason and I did a talk yesterday, so here's a plug for that. Take a look at those slides if you get a chance. The two devices on the left show some of the ways we imagine this could fit into, say, an NVMe subsystem. An SDXI data mover could be transparent to the host and sit within the NVM subsystem; that's the type A device on the left. Or it could be more explicit, where it actually has a function interface to the host. There are various data movement operations it can help accelerate, doing transformations as the data is being moved, using these two kinds of device types. And there are various configurations of this that we went through in yesterday's presentation. The picture on the right, I apologize, is not meant as a vision test. It is a complicated picture showing all the different ways in which an SDXI data mover can move data in and out of a computational storage device, a shared memory pool, or between hosts. All of these kinds of use cases are possible, and there's a subgroup discussing all these possibilities and ways to enrich both specs.
So I would be remiss if I didn't talk about the ecosystem. We are already working on some software within the working group. There is a library called libsdxi; it's an OS-agnostic user-space library project. We are also enabling upstream driver efforts. We are looking at how to emulate SDXI devices using something like QEMU-based emulation, and we're looking at how to enhance tools for interoperability. And I already mentioned that we are working with the Computational Storage group in a subgroup.
So here's my summary. I'm almost running out of time; let me check if there are any questions. I guess that was pretty clear, I suppose. All right, there are no questions. Please take a moment to rate this session, and I'll be around for any questions you may have. Thank you.