Hi, everybody. John Groves from Micron. This is a little more of a proposal for how we need to do things down the road with CXL 3. I've been spending a lot of time thinking about how to make shared FAM, fabric-attached memory, work in a way that gets us some leverage, and doesn't end up like 3D XPoint, where it was hard to use and it didn't get used. For sure, it is missed by a handful of people.
So I thought about calling this "meet the new abstractions, same as the old abstractions," because that's kind of the approach I'm trying to take. My goals in talking about this are to raise awareness of fabric-attached memory capabilities, and to point out the need for a shared, scale-out FSDAX file system, which I'm going to try to convince you is important.
And then to start a dialogue about the architecture. I'll describe a prototype we have. I've got some ideas about the architecture, but I'm not under the impression that these ideas are going to survive all the way through. So, real quick: a lot of what we've been talking about is what I would call pooling. That's where some memory gets given to a host and gets onlined, and there's a lot of interesting work being done there around tiering and things like that.
But with sharing, it's not going to be onlined in most cases, I claim. Onlined memory gets zeroed. If you're sharing, it's not a case where you want to zero it, because you probably connected to it to find content already in there. And, by the way, that's a lot like persistent memory, right? Actually, can anybody confirm whether the 3.1 spec got released? No? Okay. There's a handful of things that could be talked about once it's out. It was going to come out today, then it's tomorrow... yeah, there will be one someday. Okay. Now that I've said that, it might not. So, CXL shared FAM looks like tagged capacity. There will be a way, if you know the tag, to find the DAX device. And so you can imagine putting data sets on raw tagged capacity, i.e. raw DAX devices, and apps could open them. But I claim, and I've got some backup info on this, that's not going to be good enough. There's a world full of apps that know how to open data sets in files, and we should let them do that.
Because if we do that, we get a ton of leverage. If we don't, then there's a world full of apps that need to be modified. So what does it look like? The CXL devices are DCDs, dynamic capacity devices. This is hard to read on this screen. The DCDs are effectively devices with an allocator built in, but what gets allocated is tagged capacity. There are cases where it's not tagged, but it doesn't make sense to do that if it's shareable, because you need to agree on which one is which.
So that's enough about that. Some observations about shareable memory. It doesn't make sense to online it, because it would get zeroed. The most accessible use cases are things that already know how to deal with data sets and files. It's PMEM-like; I've already said that. On cache coherency, you have some options. There is a hardware-supported cache coherency mode, but the laws of physics apply: that's going to be expensive, and in my opinion it's not going to be attractive for some things. But software-managed coherency, or cases where you're reading and not writing, are going to be really interesting, because they don't pay the cost of hardware coherency.
And real quick about FSDAX. This kind of unfolded over the past few months as an epiphany for me. The VFS layer already lets you have a file that maps to some special-purpose memory, and that's kind of what you'd like for shareable FAM. The only problem is that the way existing FSDAX file systems are implemented, they have write-back metadata, and that's just something you can't mount in more than one place. Although technically you can mount it read-only in more than one place, that's hacky. So if a file has the S_DAX flag on it, and if it's paired with a DAX device that has the right functionality, then if I mmap it, I'm doing exactly what I hoped I would be doing, which is cache-line-level access to the DAX memory, which in this case is interesting because it's shared.
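To make that concrete, here is a minimal user-space sketch of that path, assuming a file that already carries S_DAX and is backed by suitable (shared) DAX memory; the mount point and file name are made up for illustration.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical path to a file whose pages live in fabric-attached memory. */
    const char *path = "/mnt/famfs/dataset0";

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        return 1;
    }

    /*
     * Because the file is S_DAX, there is no page cache behind this mapping;
     * the loads below go straight to the DAX memory, cache line by cache line.
     */
    const uint64_t *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    printf("first word: %llu\n", (unsigned long long)p[0]);

    munmap((void *)p, st.st_size);
    close(fd);
    return 0;
}
```

Nothing here is special to the application; it is ordinary open, fstat, and mmap, which is the point of exposing the memory through files.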
So can we enable it for a lot of use cases? The column on the right is kind of the data science and AI tool chain: Jupyter and Pandas and NumPy and Apache Arrow, et cetera. These things share a lot of infrastructure for dealing with data sets and files. And in fact, they've done some things that are really helpful for fabric-attached memory, for reasons that are actually a little different from the reasons we care about them. Apache Arrow, for example, is what its developers in the data science community call a zero-copy format. The idea is that you wrangle your data all the way to where your columns are vectorized in memory, packed like a C array, so that vector instructions will run on it and computation is efficient. By the way, that's how you'd like the data you're going to share in FAM to be laid out, particularly if you've got S_DAX files pointing to it, so you mmap it and you're just directly accessing it. Read-only data is common in these kinds of data sets, too. And again, can we make this ecosystem deal with raw DAX? Yeah, one app at a time we could. But if we have a file system that can give them access to it, then we get them all. Now, it's still a subset of those use cases that are interesting, especially the ones where the data is actually packed in memory the way you'd want it. Question, Dan?
Absolutely. Are any of those using multiple processes, at least on the same host, doing a MAP_SHARED mapping of the same file?
Yes, and we've run experiments. I don't have data to share about this, but we've got a good set of plausibility arguments. Take Ray, for example. Effectively it's an orchestrator for your AI or computational data science workloads, which I'm not up here as an expert on, but I'm observing and learning about it. It's really typical that you start with a big data set, then you shard it up and kick off 250 processes to each work on a piece of it. A really interesting use case, and one that I believe isn't difficult to demonstrate the value of, is to start with that big data set in shared FAM as a zero-copy data frame. Then each of the sub-jobs runs a query against that data set to get the subset it's supposed to use. That's already part of what Ray does. And actually, we're showing a demo at Supercomputing this week where the value of not having to move the data around overcomes the fact that the CXL memory is slow right now.

What would you want to do with FAM? Well, you'd want to have more FAM than you can fit in a server. When I was in school as a physics major, we used to say that computationally there are two kinds of problems: the ones that fit in memory and the ones that don't. And that's still true. There's a lot more memory now, and at Micron we're in favor of that. But if you've got data sets that are really huge, even if you're sharding up the computation on them, it's potentially a real win to put them in FAM, where you've got more of it than you can fit in a server. And in some cases you can run computation directly against that. Yeah, it's probably slower, in both bandwidth and latency, but you can fit things you couldn't otherwise. You can also have the data there for querying out the subsets the individual workers work on, and in some cases those will be moved into DRAM for that.
So I think I'm going to try not to say a whole lot more about these use cases. Okay, this slide is really just about: why not raw DAX? Why won't DAX by itself enable a lot of apps?
I'll leave that unaddressed, but the slides are online and I'm absolutely happy to talk to anybody who wants to discuss it. So, existing FSDAX file systems (and I've studied this stuff because we have a prototype, which I'm going to talk about in a minute) do write-back metadata. The metadata is mixed into the memory with the data, so it's just not mountable in more than one place. They also do some things that I think can be avoided here, like allocate-on-write; the prototype we've got preallocates instead. So the procedure for creating a file is special, but once a file is created, the procedure for using it can be exactly the same as for any file. Extra points for doing it via mmap, because that's really the thing you want to enable here. And then there are some constraints, like truncate is a problem. Didn't Linus once say, years ago on a list talking about truncate, that we lost the key to the clue box? I'm pretty sure he did.
Okay, so the current implementations just can't scale out. Here's what I think the requirements are. It needs to be an FSDAX abstraction that lives on top of tagged capacity. Files have to efficiently handle VMA faults; that's kind of the main thing file systems do, right? The system says, where's the page at this offset in this file, because there was a page-table miss following a TLB miss. That's got to be handled fast, because we're exposing memory and we want it to run at memory speeds. We have to distribute metadata in a shareable way; that addresses something that isn't done that way in the current FSDAX file systems, because they were built for a different reason, effectively to subdivide NVDIMMs, PMEM. And finally, it has to tolerate clients with a stale copy of the metadata. Somebody might think of more requirements; I want to hear from you.
So what do we have right now? Well, we started with a ramfs clone and added an ioctl on the files to say: okay, here's the extent list of DAX extents that back you, and here's how big you are. That gets set up, and the S_DAX flag gets set on the file. At that point, the VFS layer knows what to do with that file, provided the interactions between the file system and DAX work. Right now, they work the same way they do in XFS, and there are a couple of issues there, but I'll get to those in a minute. So there's a log format. Right now, it's implemented as an append-only log-structured file system. Files can be allocated and you can dump data sets into it. There's a master who gets to actually append the log and do the allocations, and there are clients who consume the log and notice from time to time that it got appended. When you play the log, it instantiates all the files. There's an interesting little twist to this, because the metadata is local to each client; metadata does not get written back to the log. I claim that, for enabling a bunch of apps that already know how to share data, that's actually fine, but it's a thing that should be discussed and debated. Log play for the current prototype is handled from user space, and I think that can continue to be true: user space just reads the log, instantiates the files, and calls the ioctl to set up the allocation. Like I said before, files are allocated by the master. And this is interesting: data may be writable by clients. At log-play time on each client, the file can be made writable or not. So if the software knows how to deal with one writer and multiple readers, or even multiple writers, you can do that. You can also have the files be read-only for everybody but the master, that kind of thing. There's a CLI that gives you the ability to copy existing files into it: preallocate, put the data there. I claim this is actually sufficient to enable a lot of stuff. I don't have demo data right now, but we're playing with it.
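To give a flavor of that "here's the extent list that backs you" interface, here is a hedged sketch of what such an ioctl could look like; the struct layout, names, and ioctl number are hypothetical placeholders, not the prototype's actual ABI.

```c
#include <stdint.h>
#include <linux/ioctl.h>

/* Hypothetical: one extent of tagged DAX capacity backing part of a file. */
struct fam_extent {
    uint64_t dax_offset;            /* offset into the tagged DAX device */
    uint64_t length;                /* extent length in bytes */
};

/* Hypothetical: attach a preallocated extent list to a just-created file. */
struct fam_map_request {
    uint64_t file_size;             /* logical size of the file */
    uint64_t extent_count;          /* entries used in extents[] */
    struct fam_extent extents[8];   /* preallocated extents, in file order */
};

#define FAM_IOC_MAP _IOW('f', 1, struct fam_map_request)

/*
 * The master (or a client playing the log) creates a normal file in the
 * mount, fills in a fam_map_request from the log entry, and then calls
 *
 *     ioctl(fd, FAM_IOC_MAP, &req);
 *
 * after which the file carries S_DAX and its pages come from the listed
 * extents rather than from page cache.
 */
```

The design point this illustrates is that the extent list and size arrive from user space (ultimately from the log), while the kernel side only has to translate file offsets to DAX offsets at fault time.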
I hope to be releasing RFC-ish patch sets pretty soon. So, looking at how it works: mkfs is master-only. Playing the log: the master might remount the file system and need to play the log the way clients would. Creating a file: you allocate some capacity, create a local file backed by that space (that's actually creating the file), then make the ioctl call to turn it into an FSDAX file, initialize the data or put the data in, and then commit a log entry. And then, I'm waving my hands here, I'm not really talking about how we distribute the log right now, but assuming we can do this (and I claim it's already working), anybody who's watching the log for appends gets visibility of the files that got put in there. The policy for whether they can write a file doesn't have to be the same for every client either. And then usage: POSIX read and write work, because that works with DAX; it just looks like a memcpy. And mmap works, and that's the thing you're really interested in.
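Putting those steps together, here is a hedged sketch of the master-side create-and-publish sequence. The path, the FAM_IOC_MAP ioctl, and the single-extent request reuse the hypothetical ABI sketched above, and the log append is shown only as a comment because the log format belongs to the prototype.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <unistd.h>

/* Simplified single-extent version of the hypothetical request above. */
struct fam_map_request {
    uint64_t file_size;
    uint64_t dax_offset;   /* tagged capacity the master already allocated */
    uint64_t length;
};
#define FAM_IOC_MAP _IOW('f', 1, struct fam_map_request)

/*
 * Master-side flow from the talk: allocate capacity, create the local file,
 * ioctl it into an FSDAX file, put the data in, then commit a log entry so
 * clients playing the log can instantiate the same file.
 */
static int create_and_publish(const char *path, const void *data, size_t len,
                              uint64_t dax_offset)
{
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    struct fam_map_request req = {
        .file_size = len,
        .dax_offset = dax_offset,
        .length = len,
    };
    if (ioctl(fd, FAM_IOC_MAP, &req) < 0) {   /* hypothetical ioctl */
        perror("ioctl");
        close(fd);
        return -1;
    }

    /* POSIX write works against an S_DAX file; it behaves like a memcpy. */
    if (write(fd, data, len) != (ssize_t)len) {
        perror("write");
        close(fd);
        return -1;
    }

    /* ...commit a log entry here so clients see the new file on log play. */
    close(fd);
    return 0;
}

int main(void)
{
    const char msg[] = "hello, shared FAM";

    /* "/mnt/famfs" is a made-up mount point for illustration. */
    return create_and_publish("/mnt/famfs/hello", msg, sizeof(msg), 0) ? 1 : 0;
}
```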
So let's see, there is one more problem, and that is the PMEM versus DAX thing. Right now, if it's persistent memory, it gets a block-device PMEM device. That device supports the iomap API, such that the VMA faults, I believe, go to the DAX device, but the DAX device knows to ask the file system. So the file system gets asked to translate this offset in a file to an offset on a DAX device, and then the DAX device finishes up by figuring out what address, or what struct page, that is and reports it back. But character DAX devices don't have fully working iomap support. So right now what we're doing is telling the system they're block DAX devices. Go ahead, sorry.
Right, yeah, sorry. We're telling the system they're PMEM devices, and that's fine as a workaround for that issue. I'm assuming the RFCs we put out for this will just have to solve that problem. I'm sure I'll bug you about it, Dan.
But anyway, back to goals and plans. I'm kind of amazed that I've gotten this far in this time; probably everybody's like, what the hell are you talking about? But I believe this is an opportunity to enable a ton of stuff. And again, the data science tool chain has already figured this out, so a couple of words about that. What does the workflow look like? Well, there's probably some comma-delimited data that you're starting with, some data set from somewhere, and you start reading that stuff into a data frame. It's actually not uncommon for one line of Python code slinging data frames to double the memory usage of the app at a given moment, because it doesn't know how much it needs to allocate, so it takes some data in and then allocates some more. In these workflows, you already interact with your data at the level of data frames and files, and there's already a step where you convert it into a zero-copy data frame, which packs it. The users of these workflows are already doing this kind of thing. That's what you want to put into FAM; you don't want to put your comma-delimited data in FAM, because that's an inefficient use of space. And like I say, I hope to start posting RFCs and finding out who hates it and who loves it. Question, Yazan?
Yeah, have you thought about the error handling flow here? The hardware industry has been trying to contain the surface area of memory errors, making it smaller and smaller, so we don't have catastrophic events, right?
Right.
So here we're going to have a big file system that's shared across multiple hosts. If we have one memory error, how does it affect everybody? Does everything go down? Does just a set of jobs get killed? We can give more thought to that.
Yeah. So that's an area I can't do real justice to right now. I mean, whoever reads the poison and takes the machine check is down, right? But another thing about these data sets: I don't think they're super long-lived in general. You're going to massage up some data and dump a set of data frames into this thing; clients are going to mount it and do something, but in some contained amount of time they finish and you unmount it. There's no way to delete these files right now. Well, I could let you truncate a file shorter, or delete it, as long as I don't reclaim the space, because what I can't guarantee, at least without doing a lot of work, is that all the clients know I deleted something. So space that's been allocated to something can't be reclaimed without addressing that issue. But as for errors, I don't think the problem is unique to this.
Yeah, I think maybe you make a good point. If it's constrained to short-lived work, maybe it's like, you know, you go hard, you go fast, you take a lot of risk for the performance benefit.
Danger is my business.
You live dangerously. But if the job only runs for a little while, that's okay. So maybe it's not worth the effort to figure that out.
Well, I mean, I think it'll ride the coattails of whatever figures out that stuff too because I'm...
Is what you have so far predicated on the existence of back-invalidate, or are you claiming to, for example, have a single writer and many readers, with the intent to just instantiate the entire device and then do a giant flush to make sure it's coherent, or...
I believe this is orthogonal to cache coherency protocols. When you're appending the log, you're going to have to do write-back and invalidate and things like that, and whoever's appending the log has to use locking to serialize both allocation and log append. What the client applications do needs to make sense. The easiest thing that makes sense is that the master dumps some files in there, makes sure the caches are written back (everything software coherency), and the clients are all reading. Then you're good. If they're writing, well, a comment about the laws of physics: cache coherency among multiple threads or processes on one machine is already expensive enough that we try really hard to avoid it, and that terminates at the processor cache. With fabric-attached memory, whatever you do has to go all the way out to the memory, and that's going to be more expensive. That convinces me that the killer app for this is probably not there. But it's important to be able to do that. Yeah.
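As a rough illustration of the writer side of that software-coherency model, here is a sketch for an x86 host with CLWB (build with -mclwb); whether a CPU-side flush is sufficient end to end depends on the platform and the CXL configuration, so treat it as a sketch, not a recipe.

```c
#include <immintrin.h>   /* _mm_clwb, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/*
 * After the writer has updated a region of the shared mapping, write the
 * dirty cache lines back toward the fabric-attached memory so readers on
 * other hosts don't depend on hardware coherency to see the data.
 */
static void flush_to_fam(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clwb((void *)p);    /* write back one cache line */

    _mm_sfence();               /* order the write-backs before publishing */
}
```

Readers that might hold stale copies of the same lines would also need to invalidate them before re-reading, which is part of why read-mostly sharing is the comfortable case here.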
So just to be clear, you're claiming this could be useful even on top of existing 2.0 infrastructure, rather than having to wait all the way for 3.0 with some level of back-invalidate.
Well, 2.0 really doesn't do shared memory, does it?
Doesn't really do, but...
We have some of that. Right. So if you've got shared memory, you can do this. All right. Okay. Absolutely. Thanks, everybody.