So, this is a little different: it's a different perspective on how to use CXL. I'm mostly focused on shared memory, and I'm the FAMFS guy, so I'm going to talk about what FAMFS is and why it exists.
So, first, a little bit about system RAM versus DAX mode. Then a Dynamic Capacity overview from my perspective, because I think we're going to have competing ones of those. Then some FAMFS details and some info about cache coherency. I'm going to try to skip over the FAMFS details unless there are a lot of questions about them. There's quite a bit of detail in the slides, so I encourage people to look at those.
So, kind of a busy slide, but: system RAM versus DAX mode. If it's system RAM and you online it, Linux owns it. It's eligible to be given to anybody in response to an allocation. That gives you a lot of flexibility: you can migrate pages, you can run auto NUMA. But memory and connectivity failures affect system RAS, because you don't know what the kernel put in it. DAX mode, on the other hand, basically requires apps that know how to use it. But that includes FAMFS, which makes it look like a file system. And when you do it that way, system RAS isn't affected in the same way, because Linux doesn't put anything in the memory. In fact, not even FAMFS puts anything in the memory from the kernel, except in response to read and write calls or mmap on behalf of a user process. So the blast radius of a memory failure is normally just the process that was accessing it.
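To make the DAX-mode point concrete, here's a minimal sketch of an app mapping a FAMFS file. The mount point and file name are hypothetical, and this assumes the file was already created and allocated.

```python
import mmap
import os

# Hypothetical pre-created file on a FAMFS mount (names are illustrative).
path = "/mnt/famfs/dataset0"

fd = os.open(path, os.O_RDWR)
length = os.fstat(fd).st_size

# The mapping goes straight to the DAX-backed memory. The kernel only
# touches that memory on behalf of this process (read/write/mmap), so the
# blast radius of a memory failure is this process, not the whole system.
buf = mmap.mmap(fd, length, prot=mmap.PROT_READ | mmap.PROT_WRITE)
buf[:5] = b"hello"   # stores land directly in the DAX-backed memory
buf.close()
os.close(fd)
```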
So, a quick overview of DCD. I'm one of the authors of DCD in the spec, so I have some perspective on why things were put there. It's seen as something complicated, and I think there's a certain amount of misconception about it. A DCD is just a memory device with allocation and access control built in. When you connect it, there's no memory provided; you have to allocate some, and you need a fabric manager for that. Luckily, Gisela's here. And tagged allocations: allocations can be tagged or not, but if they're shareable, they must be tagged. It's mandatory. And tagged allocations, I claim, are file-like. How is that? Well, there's a tag that you should be able to use to find that memory. If an orchestrator allocates memory for an application, or to back a VM or something, then the orchestrator should provide the tag to the VM or to the code that starts up the VM. That's how you agree on which memory is which. If the memory is shared, that's how we agree which memory we're going to go find: I'm going to put stuff in it, and you're going to look at it, right? So, there's some talk about memfds and whatnot. I'm a little concerned about that. I know DAX devices work here, and I think memfds probably can, too. But when we introduced FAMFS, we drew an analogy from these character DAX devices to block devices. With block devices, you've got libblkid and lsblk and so on: a set of tools that looks at all the block devices in a system, checks whether there are recognizable superblocks on them, and tells you what the devices it finds are. That includes logical volumes; those have superblocks, too. So the character DAX device is a convenient container to have. Another thing about tags: a tag is UUID-sized, and I claim, fabric manager people, please make them UUIDs. They must be globally unique. There are some ways you can use this where you don't need that, but broadly it's broken if we don't make them globally unique. A tag is actually local to one memory device, so you've got to generate these with the right discipline for UUIDs so that the tag space on one device doesn't collide with the tag space on another device. Again, think of them as file names, except that they're UUIDs, which are less user-friendly but unique.
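As a sketch of that tag discipline (the allocation call itself is hypothetical; only the UUID generation is real):

```python
import uuid

def new_tagged_allocation(purpose: str) -> dict:
    # Generate the tag as a random UUID so the tag namespace on one memory
    # device can never collide with the tag namespace on another device.
    tag = uuid.uuid4()
    # A real orchestrator would now ask the fabric manager for tagged (and,
    # if needed, shareable) capacity under this tag, then hand the tag to
    # the app or VM that is supposed to find the memory. That call is not
    # shown here because it depends on the fabric manager's API.
    return {"tag": str(tag), "purpose": purpose}

alloc = new_tagged_allocation("shared data set for job 42")
print(alloc["tag"])  # consumers use this like a file name to find the memory
```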
OK, so tags are essential to find the memory that was allocated for some purpose. And if the memory is shareable, I claim, don't online it as system RAM. That's nonsense: system RAM gets zeroed, so whatever you were going to share was just lost. You basically can't use shareable tagged capacity as system RAM, unless it was allocated as shareable but you're not actually going to share it, because otherwise it just doesn't work.
The bullet about FAMFS being able to interleave is slightly out of context here, so I'm going to move on. I want to talk about the core insight behind FAMFS. There's been a lot of work in this area over the years. HP had a project called 'The Machine' that a lot of people will have been aware of, if you're old enough; that was maybe 10 years ago. They had resistive RAM, which was persistent and shareable. But the thinking really was, 'Hey, it's a new paradigm. We need new abstractions.' And that never works out very well, unless you don't want the new stuff adopted, in which case it works out pretty well, because it's just too hard to use. So that, I claim, is a bad idea. Now, with DAX devices, when a memory device is shared, there will be a DAX device on one system that references the same memory as a DAX device on another system. If it's tagged capacity, I claim, that should be a virtual DAX device that maps to that tag's capacity, with a corresponding one on the other system, and you do need a way to resolve tags to those devices. But once you've got that, that's a thing you can share: an application on one host can map it, an application on another host can map it, and it's the same memory. But I claim that's still a little bit too hard, because apps don't already support DAX devices (a few of them do), and you can't stat a DAX device to find out what size it is; there's a different procedure for that. So the core insight was: this is a little too hard, but all the plumbing we needed to make this look just like a file system is already there, except for what turns out to be a little less than 1,000 lines of code currently in the kernel for FAMFS. This is still at the patch-set stage; it's not upstream yet. So I implemented it, and there are quite a few universities and companies kicking the tires of shared memory using this, and it's fairly far along.
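For contrast, here's a small sketch of the size question, assuming the usual device-DAX sysfs layout and a hypothetical FAMFS mount:

```python
import os

# A character DAX device has no meaningful st_size; on typical kernels you
# read its size out of sysfs instead (the usual device-DAX layout).
def daxdev_size(devname: str = "dax0.0") -> int:
    with open(f"/sys/bus/dax/devices/{devname}/size") as f:
        return int(f.read().strip())

# A FAMFS file is just a file: stat, open, and mmap behave the way
# applications already expect. (The path is a hypothetical FAMFS mount.)
def famfs_file_size(path: str = "/mnt/famfs/dataset0") -> int:
    return os.stat(path).st_size
```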
This, I think, I'm not going to dig into, because it's too much detail. But I encourage people to look at the slides and to reach out if you want to talk about the details. FAMFS is basically an append-only, log-structured file system. It solves the problem of being able to mount the same memory from multiple hosts, which is analogous to mounting the same storage device, except it's not a storage device; it's memory. It handles that by having a master node that gets to append the metadata log, while client nodes just get to play the log, which means they can't create files, but they can see all the files. And for any given file, if you choose to make it writable by everybody, you can, because everybody's mmap really just goes to the memory. You can also (this is the thing I added into the CXL spec) give a host a read-only mapping of a CXL device or of a tagged capacity allocation, and in that case you can actually require the device to drop writes from a host that's not allowed to write.
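Here's a toy model of the append-only log idea, purely illustrative and not FAMFS's actual on-media format:

```python
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    name: str      # file name
    extents: list  # [(device_offset, length), ...] within the tagged capacity

@dataclass
class FamfsLog:
    entries: list = field(default_factory=list)

    def append_file(self, entry: LogEntry) -> None:
        # Only the master node appends to the metadata log.
        self.entries.append(entry)

    def play(self) -> dict:
        # Any client can play the log to learn the file namespace. A stale
        # client just sees a prefix of the log, which is still consistent
        # because nothing is ever rewritten or deleted.
        return {e.name: e.extents for e in self.entries}
```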
So, a little bit about the metadata format. There are some updates to this, including an interleaved format. One of the things you'd want to do with data on fabric-attached memory is interleave across devices for performance, and we actually support that now. It's not pushed to the mainline repo yet, but it's going to be there. This is an interesting enablement, because CXL hardware can be programmed to interleave across devices, and therefore it can be programmed to interleave across tagged capacity instances. But the extent list has to be identical on all the devices, or you can't do it: if it's one extent, it's got to be at the same DPA on all the devices for the hardware to interleave it for you. If you're already doing a file system, though, interleaving is actually straightforward: it's just an interleaved extent list. So my suspicion is that asking fabric managers for an allocation like "give me 32 gigabytes each from these eight devices, and make sure it's exactly the same DPA extent list on every one" is going to be hard.
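Here's a sketch of why interleaving is easy once you have an extent list. The structure and names are made up for illustration; the point is that each device can contribute a different device offset, unlike hardware interleave.

```python
# Resolve a file offset through an interleaved extent: the file is striped
# across N devices in fixed-size chunks, and each device contributes its
# own starting offset (its own DPA range), which need not match the others.
def resolve_interleaved(offset, chunk_size, devices):
    """devices: list of (daxdev_name, device_start_offset) tuples."""
    nstrips = len(devices)
    chunk_index = offset // chunk_size
    dev_index = chunk_index % nstrips        # which device holds this chunk
    chunks_on_dev = chunk_index // nstrips   # chunks before it on that device
    dev_name, dev_start = devices[dev_index]
    return dev_name, dev_start + chunks_on_dev * chunk_size + (offset % chunk_size)

# File offset 5 MiB, 2 MiB chunks, striped across four devices:
print(resolve_interleaved(5 * 2**20, 2 * 2**20,
                          [("dax0.0", 0), ("dax1.0", 2**30),
                           ("dax2.0", 0), ("dax3.0", 0)]))
# -> ('dax2.0', 1048576)
```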
Quick status update: I introduced FAMFS at Plumbers last year, before it was released. It's on GitHub. The V1 RFC came out in February; V2 came out in April. I led a talk on it at LSFMM, and the net of that (which wasn't my goal, but is actually good news in the big picture) is that there's a desire to merge this capability into FUSE. So I'm working on that, and we have Miklos, the maintainer, coming in tomorrow for a meetup. Why is that possible? Well, FAMFS is about 1,000 lines of kernel code and about 6,500 lines of user space code today. The user space code writes the superblocks and initializes the log; all log entries are created by user space code, and all log plays are handled by user space code. With the current implementation, when the log gets played, each file gets created; briefly, it's kind of an empty ramfs file. Then an ioctl on the file gets called that says, "Here's your extent list," at which point it has what it needs to handle mapping faults and handle reads and writes. The FUSE implementation changes how that works a little bit: FUSE has the concept of caching metadata and timing it out, and whatnot. But the key thing that has to be added to FUSE, and Miklos is on board with this, is that a file whose metadata isn't timed out must have its metadata fully cached in the kernel. Because we're enabling memory, it's got to run at memory speeds. What does a file system do a million times a second? It answers the question from down below: "Tell me where the data is for this offset in this file." And the answer is in the form of, "It's at this offset on this DAX device." That's got to be fast. That's why we cache the metadata in the kernel, and that's why we'll have to keep doing that.
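And here's that million-times-a-second question in toy form, as a plain extent-list lookup; this is illustration only, not the kernel data structure:

```python
def resolve_offset(file_offset, extents):
    """extents: list of (daxdev_name, device_offset, length) in file order."""
    pos = 0
    for dev, dev_off, length in extents:
        if file_offset < pos + length:
            # Answer: "it's at this offset on this DAX device."
            return dev, dev_off + (file_offset - pos)
        pos += length
    raise ValueError("offset past end of file")

# 100 MiB into a file made of two 64 MiB extents on dax0.0:
print(resolve_offset(100 * 2**20,
                     [("dax0.0", 0, 64 * 2**20),
                      ("dax0.0", 256 * 2**20, 64 * 2**20)]))
```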
So, interesting use cases. There's a ton of uses for data frames. Apache Arrow is among my favorites because it's a memory-friendly format. It was created by the data analytics community in conjunction with the GPU people, who said, in effect: rather than everybody re-parsing the CSV data they imported or the log they scraped to get it into memory, let's build a canonical format where a column of eight-byte floats is packed in memory, vectorized, and so on. These formats are already super friendly to memory mapping; in fact, they're made to be stored in files and memory-mapped. It's just that, until now, files have been demand-paged things. In this case, if you dump it into FAMFS for analysis, it's just memory; there's no storage backing it. An in-memory database is also interesting, because those also tend to be memory-mapped formats.
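A minimal sketch of consuming an Arrow file in place; the path assumes a producer already wrote an Arrow IPC file into a hypothetical FAMFS mount.

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Arrow IPC files are laid out to be consumed in place via mmap. If the
# file lives on a FAMFS mount, the "file" is the shared memory itself, so
# there's no storage to demand-page from. (Path is hypothetical.)
source = pa.memory_map("/mnt/famfs/trades.arrow", "r")
table = ipc.open_file(source).read_all()   # zero-copy view over the mapping
print(table.num_rows, table.schema)
```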
And let's see. So yeah, FAMFS doesn't create any new cache coherency problems; it just makes the old ones worse. This was a benchmark we showed at FMS. The purple line is the RocksDB database in DRAM, and the x-axis is normalized to system memory size. When we run RocksDB query workloads against a database that's fully cached in memory, it's fast, right? And when it gets bigger than memory, it gets badly slow, and the p99 latency in the lower plot goes through the roof (and the lower plot is log scale). This particular data shows it maybe more pointedly than will be typical: OK, the CXL memory is a little slower than the regular memory, but you can have a lot of it, and if you have a lot of it and put something big in it, you can have consistent performance scaling all the way to the size of that memory. This particular run had typical DRAM bandwidth on one side and two CXL cards striped on the other; with more cards, we expect to be able to move the green line up towards the purple line on the left side. The thinking that I want to encourage is that CXL memory can change what size of problems fit in memory. There's a set of techniques we use to make things fit in memory, like sharding, which leads you to shuffling. Not all problems shard well; some problems have to be shuffled because you've got to move the data to where the compute is available towards the end of the job, and whatnot, and some problems are just hard to shard. But take, as a thought experiment, four servers with eight terabytes of memory total. You can put two terabytes in each server and take a six-terabyte database and shard it across the four servers. If sharding works well for you and shuffling's not a problem, that's pretty good. Another way to use the same resources is a quarter terabyte of memory in each server and seven terabytes of shared FAMFS memory (you could do it with DAX, too, if you like pain), and now the whole thing's in memory and all the servers can access all of it. Yeah, the memory's a little bit slower, but there is no shuffling. So these "change what fits in memory" kinds of things are interesting.
A couple of words about cache coherency, and one minute ought to be enough for cache coherency, right? CXL has a cache coherent mode, but if that sounds to you like it solves all your cache coherency problems, then you probably don't know enough about cache coherency. There are some implications, like memory barriers needing to wait for all the CLFLUSHes that have been issued to complete before they unblock, and things like that. But my opinion is that shared memory is interesting for the use cases, and there are a lot of them, where data gets dumped into a shared store and consumed, and the outputs go to separate files or data sets; there's a ton of data like that. RocksDB is a good example, because it writes files out and then they're read-only for the rest of their life, so it maps beautifully onto this.
And this is the last slide. There's a big ecosystem of stuff that uses data frames, and also LSM-style code, the kind of key-value store stuff, that is biased towards data that gets written once, or not very often, and then consumed. And so, there you go.
Okay, I've got one question for you.
Yes.
Is there anything you'd particularly like help with, or would like people to look at? Basically, what would you like the people in this room to do to get this there faster?
So that's a great question. I'm at a sort of scalability bottleneck right now, because the FUSE port is gnarly and I didn't know enough about FUSE going in, so I'm just coming up to speed. I expect to be posting FUSE patches later this year. The big things you could do are: number one, if you have a use case that you want to play with this for, please do. Number two, I didn't talk about the limitations, but the limitations are kind of epic at the moment. Files are strictly pre-allocated, so you can't change the size of a file once you've allocated it, and you can't just open a file and start writing to it. You can dd in and out of files, but you have to use conv=notrunc. And today I don't support delete. Now, delete is actually pretty easy to support, but once we do delete, there's a problem. The core problem with FAMFS is that clients may not be up to date playing the log. So if I let you delete a file and then create another file that uses the same memory, some client that's stale still sees an old file that points to the same memory. So I want feedback about that. My opinion is that it does not make sense to try to turn this into a general-purpose file system, although there are some cases where you might want to use XFS in FS-DAX mode on your CXL tagged capacity or CXL memory, if you want a file-oriented allocator that you don't need to share. If you need to share it, you can't use XFS, because it writes back metadata, and that's not shareable. Hannes.
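To make the pre-allocation point concrete: writing into an existing FAMFS file means opening it without truncation, which is what conv=notrunc does for dd. A tiny sketch, with a hypothetical path:

```python
# FAMFS files are strictly pre-allocated, so never open them with a mode
# that truncates. Open read-write and overwrite in place instead.
path = "/mnt/famfs/dataset0"   # hypothetical pre-created file

with open(path, "r+b") as f:   # "r+b": update in place, no truncation
    f.seek(0)
    f.write(b"new payload")    # must fit within the pre-allocated size
```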
Sorry. Before we go to Hannes, Gorun's got a hand up online. Do you want to ask a question, Gorun?
Yeah, I want to ask a question. How do you deal with memory tiering? I mean, the memory backing FAMFS may have tiers, some faster, some close, some remote. How do you move data from one tier to another?
Got it. Today, FAMFS doesn't do anything about tiering, but it does do striping across multiple memories. Actually, the situation today, because we don't have DCDs yet and we don't really have fabrics, is that what we've got is some early switches that are, you know, super experimental and whatnot. What they do is give you one DAX device that is a concatenation of some backend DAX devices: not striped, concatenated. And that's the worst possible case, because a given naive allocation is only going to land on one device. So FAMFS has a striped allocation mechanism that can bucketize based on knowledge of how big the backend devices are, and it can stripe. Later, we could do more complex tiering, but relocation is a problem. It's the same problem that delete is: we've elegantly solved the problem that clients might have a stale view of metadata by making that just not matter, and the moment you do delete, or the moment you relocate something, then it does matter. Deleting and relocating aren't hard, but we have to have an understanding with the application; the application has to understand what its responsibility is, or else it breaks. So a dialogue about how it should go forward in terms of flexibility for delete and whatnot would be very helpful. And then, because of the FUSE port, I don't have... well, there are actually one or two issues up on the GitHub that say "help wanted," but they're kind of DevOps-y things. But help is definitely wanted. And long term, there is a community starting. There are PRs that have been merged from outside, and more would be great. Feel free to reach out to me.
Okay. Oh, yeah. We can do it afterwards.
Oh, okay. Do it offline.