Welcome, everybody. My name is Steve Scargall. I'm a product manager and software architect at MemVerge. If you're not familiar with what we do, we're kind of the software guys leading the effort on software development, tools, tiering, sharing, et cetera, on top of all the hardware devices that you'll see across the show and in the talks today. We're also sponsoring the CXL Forum tomorrow, so if you have time, come visit us downstairs, where you'll definitely get to learn more about our product suite. But today's focus is really on the telemetry side. Being software folk, we get to play in the sandbox with everybody. It's a great position to be in because we get to see all the innovation that's happening here. But we also see the challenges, the limitations of where software is today. We solve a lot of them ourselves, but we do need assistance from the hardware guys: that might be the device vendors, the switch vendors, or the CPU and platform guys, to help with our innovation.
So what I would like to propose is some of the challenges that we have today, mostly in the visibility area. There are currently no device statistics coming out of the devices that we can tap into. If you have a single device attached to a single socket, that's easy. If you attach more devices, it becomes harder to know which device is busier than the others. Similarly, there are no native OS tools, and I'm talking mostly Linux here, like you would have with iostat, vmstat, mpstat, et cetera. We have some tools that we've developed ourselves that emulate a lot of this, and we would like to work with the community and push some of that back upstream. Similarly on the CXL topology side: the CXL spec gives us the ability, at least it will once 2.0 and 3.0 are fully implemented, to go and scan the topology of the environment. But when you're talking about tiering, moving data around, access patterns, and all that type of stuff, understanding the topology and which devices and device types you're talking to, particularly in a heterogeneous environment, is very important to us. So there are some challenges in that area as well.

Now, it is possible on Intel and AMD to get telemetry about CXL, so you can see the throughput, and you can measure latency from application layers, benchmarks, and everything else. But it's not a standard, right? So it would be nice if we had some common tool or tools that worked across platforms to help us understand what was going on below us: inside the operating system, inside the device, or between the host and the device or devices. The CPU guys provide these CPU metrics. You can get them with perf if you know which ones to look for. But it's not obvious, right? Not all of them are tagged with "CXL" in the name. I'll talk about that in a minute. It would be nice if we didn't have to go off and produce all these fairly complicated computations across many different metrics, some of which you have to multiplex to get the data anyway, which dilutes the results. It would be nice if we had some logical view of the thing we actually want to target. And then the CXL spec talks about CDAT in terms of what a device is intended to do: what's its latency, what are its bandwidth characteristics? Actually, I checked the other day, and this last bullet point is somewhat fixed. In the very latest Linux kernel, 6.6, there are some patches that have just recently been merged that now allow us to access the CDAT table. But that requires that the CXL vendors actually populate that information to tell us what the throughput and latency of the device or devices are.
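To make that "complicated computations" point concrete, here is a minimal sketch of what deriving CXL throughput from raw CPU counters looks like today. The event names below are placeholders, not real perf events; the actual uncore events differ between Intel and AMD (and between CPU generations), which is exactly the portability problem being described.

```python
#!/usr/bin/env python3
"""Sketch: derive "CXL bandwidth" by combining raw uncore counters.

The event names are HYPOTHETICAL placeholders. Real platforms expose
different, vendor-specific events that are rarely labelled "cxl" in
`perf list`. Requires perf and root (or CAP_PERFMON).
"""
import subprocess

# Placeholder event names; substitute your CPU's vendor-specific uncore events.
EVENTS = ["uncore_placeholder/cxl_read_lines/",
          "uncore_placeholder/cxl_write_lines/"]
INTERVAL_S = 1
CACHELINE_BYTES = 64  # assumption: each counted event is one 64-byte line

def sample(events, seconds):
    """Run `perf stat` system-wide for `seconds`, return {event: count}."""
    cmd = ["perf", "stat", "-a", "-x", ",", "-e", ",".join(events),
           "sleep", str(seconds)]
    out = subprocess.run(cmd, capture_output=True, text=True)
    counts = {}
    for line in out.stderr.splitlines():
        fields = line.split(",")
        # CSV format: value,unit,event,... ; skip "<not counted>" etc.
        if len(fields) >= 3 and fields[0].strip().isdigit():
            counts[fields[2]] = int(fields[0])
    return counts

if __name__ == "__main__":
    for event, lines in sample(EVENTS, INTERVAL_S).items():
        mbps = lines * CACHELINE_BYTES / INTERVAL_S / 1e6
        print(f"{event}: ~{mbps:.1f} MB/s")
```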
So the objective of this talk, really, is to accelerate CXL adoption with some of the things that we're proposing here today. Again, improve the observability of this big black box, whether it be CXL 2.0, or 3.0 with multilevel switching and hundreds of devices, or simply a local CXL device plugged into your box. We also want to make sure that the telemetry we're getting is useful. We want to easily get this data out and then take some action, gain some insight from it. That might be RAS: the device could be failing, and that device could be part of an interleave set. Should we go and replace that device proactively? Should we move data off that device, or move data onto a newly provisioned device? We want to be able to build all of this intelligence and insight into OS-native tools, into your DCIM software for data centers, into orchestrators like Kubernetes, so that when it's provisioning hardware, it understands what hardware to provision for the application workloads that will run on it. And this is nothing new. The network and storage sectors have been doing this for decades. So we're not really proposing anything new, just following what they've been doing, with a view of memory, now that it's becoming disaggregated.
So there are a lot of use cases and reasons why we want to do this. The current area of focus, obviously, is tiering: latency-based and bandwidth optimization. Meaning, if I have two different types of memory, or devices with many different characteristics in my server, then we want to do n-way tiering for latency optimization for applications that are latency sensitive. For applications that are bandwidth sensitive, we want to be able to provision the hardware and optimize the bandwidth, stripe widths, and all that type of stuff. Then you get into workload allocation: where do I place my application if I schedule it with Kubernetes, for example? Workload rebalancing: what happens if I want to take a cluster down, or take a node in a cluster down for scheduled maintenance? What happens if it crashes? I want to be able to move my workload around based on the resources available to me. Also real-time monitoring, the obvious one. Capacity planning and scaling: what do I do today? What am I doing tomorrow? What am I doing next year? How much CXL do I need? How many servers do I need? Can I bin-pack a lot more? Health monitoring, alerting, reporting. There's a ton of use cases out there that would benefit from improved observability.
So one of the proposals is that we would like some of these CXL metrics to be exposed through sysfs, just like the kernel counters behind iostat, vmstat, and everything else. It should be fairly straightforward to do this: being able to get not only real-time but also some historical information about what the latency and bandwidth of a device are. And then, what kind of granularity are we looking at? Do we want an endpoint view of the world? Do we want it per CPU socket, depending on how many devices it has? Per root port? Per region, whether I'm interleaving or not? Per NUMA node, if I'm using system-RAM-type namespaces? All of these levels of observability, at different levels of the stack, are what we're looking at with some of these proposals.
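As a strawman of what that could look like, the sketch below polls a purely hypothetical sysfs layout (per-memdev counters under /sys/bus/cxl/devices/memN/stats/) the way iostat polls block-device counters. None of these attribute files exist in today's kernels; the sketch illustrates the shape of the proposal, not an implementation.

```python
#!/usr/bin/env python3
"""Strawman poller for proposed per-device CXL counters.

The layout /sys/bus/cxl/devices/mem*/stats/{read_bytes,write_bytes} is
HYPOTHETICAL: it is what the proposal asks for, not what current
kernels provide.
"""
import glob, os, time

STATS = ["read_bytes", "write_bytes"]  # proposed counter names (hypothetical)
INTERVAL_S = 1.0

def read_counters(dev):
    vals = {}
    for name in STATS:
        try:
            with open(os.path.join(dev, "stats", name)) as f:
                vals[name] = int(f.read())
        except (FileNotFoundError, ValueError):
            vals[name] = 0
    return vals

def main():
    devs = sorted(glob.glob("/sys/bus/cxl/devices/mem*"))
    prev = {d: read_counters(d) for d in devs}
    while True:
        time.sleep(INTERVAL_S)
        print(f"{'device':<10} {'rd MB/s':>10} {'wr MB/s':>10}")
        for d in devs:
            cur = read_counters(d)
            rd = (cur["read_bytes"] - prev[d]["read_bytes"]) / INTERVAL_S / 1e6
            wr = (cur["write_bytes"] - prev[d]["write_bytes"]) / INTERVAL_S / 1e6
            print(f"{os.path.basename(d):<10} {rd:>10.1f} {wr:>10.1f}")
            prev[d] = cur

if __name__ == "__main__":
    main()
```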
So what it might look like, from a perf perspective, is that it would be very handy if the perf telemetry metrics actually had CXL in the name. Not all of them do. And that it was obvious what each metric was measuring: throughput, loads and stores, over time, in real time. Is it a local read? Is it a remote read, maybe over a different socket, maybe to a different server? All of that is very useful for understanding what it is we're trying to get out. NUMA stats have been around a long time, but there are things we can do here: can we get the throughput and the latency, maybe even a heat map, from a NUMA perspective? And that NUMA node might be a single device, or it might be an interleave set behind a region. And then we get into the device stats, at least from a Linux perspective, either through the physical PCI path or maybe through the memdev device. So those are some of the things that we're currently working on.
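The per-node NUMA counters mentioned here do already exist today. The sketch below samples /sys/devices/system/node/node*/numastat over an interval, which gives allocation hit/miss behavior per node (including a CXL-backed system-RAM node), though not latency or a heat map.

```python
#!/usr/bin/env python3
"""Sample per-NUMA-node allocation counters from sysfs.

Uses /sys/devices/system/node/node*/numastat, which current Linux
kernels provide; it shows allocation hits/misses per node but not
latency or a heat map.
"""
import glob, os, time

INTERVAL_S = 5

def read_numastat(node_path):
    stats = {}
    with open(os.path.join(node_path, "numastat")) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"))
before = {n: read_numastat(n) for n in nodes}
time.sleep(INTERVAL_S)
after = {n: read_numastat(n) for n in nodes}

print(f"{'node':<8} {'numa_hit/s':>12} {'numa_miss/s':>12}")
for n in nodes:
    hit = (after[n]["numa_hit"] - before[n]["numa_hit"]) / INTERVAL_S
    miss = (after[n]["numa_miss"] - before[n]["numa_miss"]) / INTERVAL_S
    print(f"{os.path.basename(n):<8} {hit:>12.0f} {miss:>12.0f}")
```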
And then, as we introduce switches: switch vendors are doing this, of course, maybe not necessarily from a host perspective, but certainly from a management perspective. Am I saturating a link somewhere? That's quite common with networks. How do I know if a device is failing? Is it the device that's failing? Is it my link? Is it something else going on here? Understanding the link speed, right? And that goes back to a device advertising its bandwidth. It might come down to the device being a x16 that's been put on a x4 link, for example. That would be a misconfiguration, but without the ability to observe it, you wouldn't know, and you couldn't get a real-time report out. And then the usual other things: port errors, throughput, and that type of stuff.
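That downtraining case is already detectable with the standard PCIe sysfs attributes. Here is a small sketch, assuming you know the PCI addresses of your CXL endpoints, that compares the negotiated link width and speed against what the device is capable of.

```python
#!/usr/bin/env python3
"""Flag PCIe/CXL links that trained below the device's capability.

Reads the standard PCIe sysfs attributes (current_link_width,
max_link_width, current_link_speed, max_link_speed); you supply the
PCI addresses of your CXL endpoints on the command line.
"""
import sys

def read_attr(bdf, name):
    with open(f"/sys/bus/pci/devices/{bdf}/{name}") as f:
        return f.read().strip()

def check_link(bdf):
    cur_w = int(read_attr(bdf, "current_link_width"))
    max_w = int(read_attr(bdf, "max_link_width"))
    cur_s = read_attr(bdf, "current_link_speed")
    max_s = read_attr(bdf, "max_link_speed")
    if cur_w < max_w or cur_s != max_s:
        print(f"{bdf}: DOWNTRAINED, running x{cur_w} @ {cur_s}, "
              f"device supports x{max_w} @ {max_s}")
    else:
        print(f"{bdf}: ok (x{cur_w} @ {cur_s})")

if __name__ == "__main__":
    # usage: ./linkcheck.py 0000:37:00.0 0000:38:00.0 ...
    for bdf in sys.argv[1:]:
        check_link(bdf)
```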
So one of the things we'd like to bring to the OCP forum, to the CMS team, is this: it would be nice if we could settle on an industry naming convention. Everybody's got their own idea of what to call things, and that's great. But if I go to Intel, and I go to AMD, and I go to maybe XConn, they all use different names for the same thing. So if there's a way we could standardize on nomenclature, that would be good. Again, bring in standard OS utilities, thinking Linux here: a cxlstat doesn't exist today, but it would be nice if it did in the near-term future for host-level telemetry, with iostat and vmstat being the obvious models. And then from a fabric perspective: if I'm sitting on a management host, I can integrate some telemetry into my existing tools, and that's all very well. fm_cli has recently been proposed, at least in the Linux community; it has a stat-level subcommand that will hopefully collect information about the ports and links and that type of stuff. And going back to the topology side of things: understanding distances. NUMA distances are a discussion for another day, but they give us some indication of how far away a device is from a particular CPU. That only works if the CXL device is in system-RAM mode, I should say. If you're provisioning this from a hardware perspective, that might not necessarily tell you exactly what you need to know. Similarly, if that device is several switch hops away, how do I know how many hops away it is right now? Today, we don't know.
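On the NUMA-distance point: the distances themselves are easy to read today, as the sketch below shows using /sys/devices/system/node/node*/distance. But, as noted, they only exist once the CXL capacity is online as a system-RAM NUMA node, and a distance value says nothing about how many switch hops sit behind it.

```python
#!/usr/bin/env python3
"""Print the kernel's NUMA distance matrix from sysfs.

Distances only appear for CXL capacity that is online as a system-RAM
NUMA node, and they express a relative cost, not the number of switch
hops to the device.
"""
import glob, os

nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"),
               key=lambda p: int(os.path.basename(p)[4:]))
ids = [os.path.basename(n)[4:] for n in nodes]

print("      " + " ".join(f"n{i:>3}" for i in ids))
for node, i in zip(nodes, ids):
    with open(os.path.join(node, "distance")) as f:
        dists = f.read().split()
    print(f"n{i:>3}  " + " ".join(f"{d:>4}" for d in dists))
```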
So in the CXL 3.0 specification, the CPMU, the CXL Performance Monitoring Unit, was introduced. It's a starting point for us to have these conversations about standardizing telemetry across devices, fabrics, et cetera. There is a section in the specification, 13.2, that outlines exactly what needs to be done. I think there's some more work that needs to be done in this area, for sure. And then recently, recently being about kernel 6.5, there were some patches pushed to introduce at least the starting framework for the CPMU into the Linux kernel itself. So if you install 6.5, or a 6.6 release candidate today, you can go have a look at this. There is some very minimal documentation around it. But again, this is an interesting area for us as the software guys: this is the foundation for where we need to be heading to enable a lot of the software.
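A quick way to see whether the CPMU framework is in play on a given 6.5+ kernel is to look for CXL PMUs registered with perf under /sys/bus/event_source/devices/ and list whatever events their drivers expose. The sketch below does that; the exact PMU names and event sets depend on the kernel version and on what the device advertises, so treat the naming as an assumption.

```python
#!/usr/bin/env python3
"""List CXL performance-monitoring units registered with perf.

Looks under /sys/bus/event_source/devices/ for PMUs whose name contains
"cxl" and prints the events their drivers expose. Exact PMU names and
event sets depend on kernel version and device support.
"""
import glob, os

pmus = [p for p in glob.glob("/sys/bus/event_source/devices/*")
        if "cxl" in os.path.basename(p).lower()]

if not pmus:
    print("No CXL PMUs found (kernel < 6.5, or no CPMU-capable devices).")

for pmu in pmus:
    print(os.path.basename(pmu))
    for event in sorted(glob.glob(os.path.join(pmu, "events", "*"))):
        with open(event) as f:
            print(f"  {os.path.basename(event):<30} {f.read().strip()}")
```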
So the call to action is: if we can have these discussions with the device vendors, the CPU guys, and the switch guys, consolidate on nomenclature, and improve the specifications, then we can work with the Linux kernel community, upstream patches, develop the tools that I proposed, and make this a much easier technology to adopt. Again, we can talk about telemetry, the requirements, and what an architecture looks like; we had a really good talk earlier today about hotness tracking, and that's the level of detail we need to get to for the telemetry side as well. So I encourage you to join the forum. If you're not already there, come in; we always welcome everybody. Join the kernel community if you're interested in that area; it's a very vibrant community right now. We've got Discord servers with chats going on all the time, and the Linux mailing lists are a great place to find resources as well. So we definitely encourage you to join all of this. I welcome any questions or open discussion if anybody has any at the moment. Okay. Thank you very much for your time. I appreciate it.