All right. Speaking of bad ideas, this is hopefully a little comedic relief, but this topic is actually chock full of bad ideas. So hopefully, if you don't leave this talk having cocked your head to the side and gone, "I don't know about that," at least once, I haven't done my job. The thesis of this talk is, very simply, "What if move_pages, but for physical addresses?" And that should make you cock your head to the side from the beginning.
So, the thesis here is-- I've talked to just about every CXL hardware vendor that has either talked about their hardware openly or not, and they're all kind of discussing, "All right, if we implement multiple levels of memory in our device, what kind of fun things can we offload?" A simple idea might be, "If I have DRAM and NAND in the same device, maybe I can offload page faults." Whether or not that's a valid idea is not for this talk. All we have to think about is that, in terms of memory tiering, there are real problems there. And the core problem is, if you do that, all of a sudden a lot of your monitoring features go away. For example, and we'll talk about this a little bit, transparent page placement uses page faults to detect what's hot, what's not, and what to demote and what to promote. So a lot of these vendors start to talk about, "Well, what if we offload the ability to track heat maps or idle bits or hotness lists? What can I provide tiering software in terms of telemetry to make it easier?" The problem is, as I alluded to, devices only talk host physical address at best, and device physical address at worst. Which means any data you get from these devices is going to be in that format. And that's a problem because there isn't really a way to utilize that directly. You have to convert that right now back to a virtual address and a PID, and use either the move_pages system call or some of the internal kernel interfaces.
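To make today's flow concrete, here's a minimal sketch, under assumed inputs, of what tiering software does once it has translated device telemetry back into a (pid, virtual address) pair: it calls the standard move_pages(2) from libnuma with a target NUMA node per page. The pid, the address list, and the target node are placeholders, not anything from a real deployment.

```c
/*
 * Minimal sketch: demote/promote a set of already-translated virtual
 * addresses in another process using move_pages(2). Link with -lnuma.
 */
#include <numaif.h>
#include <stdio.h>
#include <sys/types.h>

int migrate_to_node(pid_t pid, void **vaddrs, unsigned long count, int target_node)
{
    int nodes[count];
    int status[count];

    for (unsigned long i = 0; i < count; i++)
        nodes[i] = target_node;   /* e.g. a CXL-backed NUMA node */

    /* MPOL_MF_MOVE moves only pages exclusive to this process */
    if (move_pages(pid, count, vaddrs, nodes, status, MPOL_MF_MOVE) < 0) {
        perror("move_pages");
        return -1;
    }
    return 0;   /* status[i] holds the resulting node or a -errno per page */
}
```

The point of the rest of the talk is that everything feeding `vaddrs` here has to be reconstructed from physical addresses first, which is the expensive part.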
So, I kind of allude to this: why do we even want to do this? The current state of tiering is kind of chunked into four or five different mechanisms right now, and they're all kind of being absorbed by a handful of systems like DAMON. But I kind of chunk it into four basic ones. Interception-based, which is kind of the legacy Optane stuff, where you have to replace allocators and things like that to be able to detect accesses. IBS and PEBS are sampling-based systems; they can be configured to provide either virtual addresses or physical addresses, which is kind of interesting from a security perspective. Transparent page placement uses faults and some level of charging to do demotion and promotion. And then, basically, page or folio flag monitoring; a simple one is the idle bitmap, for example. All of these have some performance implications that I want to touch on very quickly.
So, IBS and PEBS, for example. The most common thing that people are using right now to determine hotness of data is basically LLC, or last-level cache, misses, right? And they configure it to give them, using libperf or some other interface, a combination of virtual addresses and a PID. And they utilize that with move_pages or some of the kernel-internal interfaces to actually demote or promote pages. The problems with this, and there have been a number of talks in the past couple of years on this: prefetch traffic can actually bypass these sampling interfaces, which is problematic because the prefetcher can drive like 66% of all bandwidth. So are you really being accurate about what is hot and what's not? There can be some runtime overhead based on the sampling rate. That should be obvious; there's no free lunch. But more generally, these PMU counters are actually already being used in a lot of cases in a lot of these data centers, right? So they may not be available in the first place. Or if they are available, you're potentially time-slicing them.
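As a rough illustration of that sampling approach, here's a hedged sketch using perf_event_open(2) to sample last-level cache misses with both the virtual and physical data address plus the TID. The generic cache event encoding below is an assumption; real deployments typically use vendor-specific raw event encodings for PEBS or IBS. PERF_SAMPLE_PHYS_ADDR generally needs CAP_SYS_ADMIN or a relaxed perf_event_paranoid setting.

```c
/*
 * Minimal sketch: open a sampling event whose samples carry TID, virtual
 * address, and physical address for LLC-miss loads. Samples are consumed
 * from the mmap'd ring buffer on the returned fd (not shown here).
 */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static int open_llc_miss_sampler(pid_t pid, int cpu)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.config = PERF_COUNT_HW_CACHE_LL |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.sample_period = 1000;              /* every 1000th miss; tune for overhead */
    attr.sample_type = PERF_SAMPLE_TID |
                       PERF_SAMPLE_ADDR |   /* virtual data address */
                       PERF_SAMPLE_PHYS_ADDR;
    attr.precise_ip = 2;                    /* request precise (PEBS-style) samples */
    attr.exclude_kernel = 1;

    return syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0);
}
```

Note that, as discussed above, whatever the prefetcher pulls in without a demand miss may never show up in these samples.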
So, yeah. Transparent page placement, this is rather new. The primary mechanism here is fault-based. It basically marks some pages not present and detects whether they get accessed. If they don't, they get marked for demotion in some cases. And some recent extensions enable hot pages on lower tiers of memory to be promoted; that's actually what the memory tier component in MM does right now. So there are some additional extensions. The problem here is fault overhead, right? There are tail latencies. Every time you add these faults into the system, you're going to eat the cost of that fault. And it can be complicated to tune depending on the workload. So it's not a one-size-fits-all mechanism. It's decent in certain areas, but if you're trying to figure out exactly how your software should be utilizing the memory, it's hard to configure sometimes.
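For reference, the fault-based path described here is switched on through a couple of kernel knobs (availability depends on kernel version). This is a minimal sketch, written in C for consistency with the other snippets, though in practice it's usually one sysctl and one echo from a shell; the helper name is made up for the example.

```c
/*
 * Minimal sketch: enable fault-driven tiering. kernel.numa_balancing mode 2
 * turns on memory-tiering hot-page promotion, and
 * /sys/kernel/mm/numa/demotion_enabled lets reclaim demote cold pages to a
 * lower tier instead of swapping them out.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, val, strlen(val));
    close(fd);
    return n == (ssize_t)strlen(val) ? 0 : -1;
}

int enable_fault_based_tiering(void)
{
    /* NUMA_BALANCING_MEMORY_TIERING: promote hot pages from slow tiers */
    if (write_str("/proc/sys/kernel/numa_balancing", "2"))
        return -1;
    /* allow reclaim to demote to the lower tier rather than swap */
    return write_str("/sys/kernel/mm/numa/demotion_enabled", "true");
}
```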
Idle bit tracking. There is a mechanism in the kernel to do some level of page bit tracking, and the kernel does mark some of these PFNs as idle or not idle. So you can determine what has recently been used and what has not. Userland can actually set these idle and not-idle bits directly if it wants, in order to tell the kernel's reclaim feature, "Hey, don't push this out to swap," et cetera. There are some problems with this, too. From userland, at least right now, last time I looked, there's this mass migration from struct page to struct folio, and that broke a lot of the flag stuff from userland. So looking at some of those things may not be accurate on the most recent kernels; it depends on the interface you use, I'll say. So some of the debugging interfaces aren't quite there. The other thing here is that it's actually PFN-based. So in order to use a lot of these interfaces, you already have to have the physical address anyway, which is problematic in some senses, but could be useful if you are getting this information from the devices.
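A minimal sketch of that PFN-based interface, /sys/kernel/mm/page_idle/bitmap: each 64-bit word covers 64 PFNs; writing a set bit marks the page idle, and a later read showing the bit cleared means the page was accessed in between. It requires CAP_SYS_ADMIN, and the helper names here are just for illustration.

```c
/* Minimal sketch of the page_idle bitmap interface (open the file O_RDWR). */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PAGE_IDLE_BITMAP "/sys/kernel/mm/page_idle/bitmap"

static int mark_pfn_idle(int fd, uint64_t pfn)
{
    uint64_t word = 1ULL << (pfn % 64);
    off_t offset = (pfn / 64) * sizeof(uint64_t);

    /* Writing a set bit marks that PFN idle and clears its accessed state */
    return pwrite(fd, &word, sizeof(word), offset) == sizeof(word) ? 0 : -1;
}

/* Returns 1 if the PFN is still idle (untouched), 0 if accessed, -1 on error */
static int pfn_still_idle(int fd, uint64_t pfn)
{
    uint64_t word = 0;
    off_t offset = (pfn / 64) * sizeof(uint64_t);

    if (pread(fd, &word, sizeof(word), offset) != sizeof(word))
        return -1;
    return (word >> (pfn % 64)) & 1;
}
```

Note that, as the talk says, the inputs here are PFNs, so you already have to know the physical side of the mapping to use it.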
And so we come to the proposed stuff. A lot of the hardware vendors I've talked to are saying, "Okay, I want to implement maybe a super device with DRAM and SSD and network-based CXL traffic, and we're going to do all the swaps and all this fancy stuff behind the scenes, and that's great." And maybe they even have a working prototype. And then they've come to us and they've said, "Okay, we have a problem. Software doesn't like it when all of the memory is out in the really cold tier, and we don't know how to get Linux to promote it. So how can we sweeten the pot? How can we get the hotness data out so that the kernel or some other piece of software can say, 'Go ahead and promote this off of the device as fast as possible'?" And they've proposed all of the mechanisms you can imagine. Idle bits, heat maps, hotness lists. I've heard them all. The big problem is physical device addressing, right? Devices have no concept of tasks whatsoever. They don't have a concept of virtual memory, so all you get out is a physical address. And the reverse lookup to go from physical address to virtual address is very, very expensive, because there is no interface in userland, for good reason, to do this. There are a few interfaces you can cobble together to manage it. They all require admin, right? That should be obvious. But because there's no standard interface, it's hard to build core kernel support for that. And because there's no standardized interface, especially in the CXL spec, I wouldn't imagine the core driver is going to implement anything like that. So it's all going to be vendor- and device-specific extensions to do this, right?
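To show why that reverse lookup is so expensive, here's a minimal sketch of the userland cobbling: scanning /proc/&lt;pid&gt;/pagemap (8 bytes per virtual page, PFN in bits 0-54, bit 63 = present) for the virtual page that maps a given PFN. A real tool would have to walk /proc/&lt;pid&gt;/maps of every process and scan every VMA; this only scans one address range of one process. Seeing PFNs requires CAP_SYS_ADMIN.

```c
/* Minimal sketch: reverse-map one PFN within one process's address range. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PFN_MASK ((1ULL << 55) - 1)   /* bits 0-54 */
#define PM_PRESENT  (1ULL << 63)

/* Returns the virtual address mapping target_pfn within [start, end), or 0 */
static uint64_t find_vaddr_for_pfn(pid_t pid, uint64_t target_pfn,
                                   uint64_t start, uint64_t end)
{
    char path[64];
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t found = 0;

    snprintf(path, sizeof(path), "/proc/%d/pagemap", pid);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;

    for (uint64_t va = start; va < end; va += page_size) {
        uint64_t entry;
        off_t offset = (va / page_size) * sizeof(entry);

        if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry))
            break;
        if ((entry & PM_PRESENT) && (entry & PM_PFN_MASK) == target_pfn) {
            found = va;
            break;
        }
    }
    close(fd);
    return found;
}
```

Multiply this by every mapping on the machine and you get the 90-seconds-and-up numbers discussed later in the talk.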
Yes?
You were mentioning the lack of-- not being able to get the virtual address, and hence you want to do the reverse lookup, which is expensive, right? So I thought getting physical addresses, isn't that good enough for you to actually place the page wherever you want? For example, the kernel already does that, right? Using the interfaces that you have. The migrate interfaces work on a struct page or a physical address, right?
So it's two things, right? A lot of the monitoring mechanisms operate on-- you're right, struct page, or monitoring some of those features right now. These device vendors are talking about pushing some of those monitoring mechanisms down into the device-- we'll take page faults, right? If the kernel is marking some of these pages not present-- we'll say toggling the present bits to generate these faults-- it increases overhead. So some of the vendors are asking, "Well, what if we forgo that, and we handle, in FPGA or magic hardware land, wherever, the ability to just go out to my second tier of memory on my device and handle that page fault internally?" Well, now I've gotten rid of my monitoring mechanism from the kernel's perspective, or the CPU's perspective. And so, if you were dependent on that to control promotion and demotion before, you have to replace it with something the device gives you. But the core problem is the device can only tell you device physical address. And this is where the standardized interface becomes a problem: if you don't have something in the specification that says, "This is what the device shall implement if you want to do this," chances are you're going to have a bunch of random drivers implementing it, and they're going to implement their own command sets. And I'm interested in a standardized interface that enables us to build a more general solution for this class of device.
Another question I had related to the IBS thing that you are describing.
Sure.
You were mentioning the prefetch overhead. What exactly is the-- you mentioned that it would kind of pollute or reduce the efficiency of the sampling data that you have.
So it's not an overhead. Prefetch traffic-- the hardware prefetchers will do quite a bit in order to front-run software. And IBS and PEBS in particular-- I'm kind of picking on those two particular monitoring pieces-- either don't report last-level cache misses for prefetches the same way, or they miss the prefetched accesses entirely. So if you were to hit one of these pages or cache lines that should be marked hot, it may not actually show up in the data that you're requesting from IBS or PEBS. So it's not that there's overhead; you might not have as accurate a piece of data as you want about the state of the system.
Regarding the standard interface and whatnot, one idea that's floated around for maybe a year, since I think it first came out of LSF/MM, was to have some sort of-- doing the promotion in the kernel, not triggered by user space-- some sort of kpromoted, where you would harvest the different ways of tracking hotness-- fault hinting, performance counters, whatnot-- and have that daemon, kind of like kswapd I guess, harvest that and then do the promotion asynchronously. That's easier said than done, but that's kind of where the kernel MM folks are trying to go.
Yeah. And-- did you have something? Yeah. Oh, the only thing I was going to say was, and when we cross into what interface should the device provide, there's a little bit of a chicken and egg scenario. Until someone kicks the can over the line, no one knows what to implement. And that's kind of where I'm at in terms of, well, here's an interface you can use to prototype. Go ahead and do it. Whether or not this is ultimately the best interface, probably not. But we can discuss it.
Yeah, and also one of the nice things about that is that you have different kinds of-- the inputs can be virtual addresses, can be PFNs, whatnot. And again, automagically kpromoted would solve that.
And DAMON eats a lot of this already, so DAMON implements a lot of that abstraction as well.
You almost kind of said a solution when you're talking about how the monitoring mechanism doesn't get updated or doesn't know-- like the device knows something that the kernel's currently monitoring mechanisms don't understand. But that seems like a simple matter of hardware. Like what if hardware updated the structures that Linux cares about?
True. So the hardware would have to know about those things so that--
These would be Linux-specific devices.
Yes. So there is that. Yeah.
Well, they're probably architecture-specific. These structures are not necessarily constant cross-architecture.
Well, and also they might even be built--
They're the legacy ones.
So for the sake of time, I'll kind of jump through a bunch of these. So I've talked a little bit about the reverse lookup overhead. You can do this right now using userland. You can do all of this reverse lookup. There's a couple of different things. And we tested, what does it actually cost to do this? What does it cost to do this when we're running with additional compute available and when the system is pressured for compute as well?
So, to kind of summarize these awful charts: the top is the chart when there's compute available. And you see, as we utilize more memory-- 64 gigabytes on the left and 512 gigabytes on the right, and the subcolumns here are the number of processes running-- there's basically an expected linear increase in the amount of time it takes to reverse map this amount of data. So at 512 gigabytes, it takes 90 seconds to reverse map a whole memory map of that region. If you use bigger pages, it'll be less, obviously. But 512 is probably not the target figure for some of these devices; you might be looking at eight terabytes of potential memory for really big capacity plays. And it gets really bad as soon as your system is pressured for compute: all of a sudden it goes from a linear increase to having to steal those cycles from somebody. And now you're putting additional bandwidth pressure on, all this work, in order to move that data. And you're actually eating up time in the process, because you have to build this reverse map before you can make any decisions about where you want to move the data. And the thing that really shoots this dead is the fact that it is only good for one snapshot in time. So during all of that time, during that 90 seconds or five minutes that you're building this map, your data is going more and more out of date. And that means it's less useful from a tiering latency perspective. So yes, reverse mapping is actually really, really bad.
So there are limitations to these offloaded tracking mechanisms, and I'll talk a little bit more about the implementation of move_phys_pages. Obviously, I've mentioned hardware has no contextual information about a page. A physical page could actually be reused very rapidly between dying and newly created processes. And so if you migrated that page up, well, it might just stay hot, right? Because if that particular page is being used by the kernel to just reallocate between processes, well, it's going to look hot, but that's not actually realistic. Transparent huge pages can potentially cause this problem too. So by no means is this a perfect solution, but I do think it is enough to start the conversation. And there are many unknowns in this scenario, as we discussed already, largely because all the hardware vendors are like, "What could you implement?" and the software implementers are like, "Well, no one does this, so I'm not going to bother exploring use cases." Right.
Just a quick comment. -- Sure. -- The rapidly reused page is definitely hot. It may not be hot if you're trying to do your migration for a particular application, but it is something you want to keep cache hot.
So the rapidly reused page would be one that is hot for a very short amount of time, then is freed and released back to the system, and then utilized again. It is hot in the context of a hardware address, or a physical address, not in the context of a virtual address, right? So this is where it gets a little confusing to keep straight between virtual and physical.
So, I do have a working implementation of move_phys_pages. It's based on version 6.6; I can't get it up to 6.7 or mainline at the moment because I have to do some pretty nasty rebases. But it is actually two commits, and it's very stupid simple. There's one refactor commit that takes the existing move_pages code and makes some of it reusable, and then the second commit implements the syscall. This is only on x86, but if you have a particular need-- do you want arm? I'll do it on arm, and it won't take me that long, probably an afternoon. It just reuses all of the move_pages infrastructure. All it does is remove the PID from the syscall, and you hand it a physical address instead. Something to note: yes, it still does the reverse lookup anyway, because it has to, right? So we are going to eat an rmap. I think I have it on the next slide.
Per page, you still have to do this rmap_walk on the folio that the page belongs to, because you have to validate that you're not going to violate cgroups, cpusets, and things like that. So you have to go check every VMA that it belongs to, check that it's actually migratable, and then move it, which you already have to do anyway in move_pages. So it's not really that crazy. There might be one locking rule that I may have violated for cgroups; I've got to go back and fix that. Just one small one. But it also reuses vma_migratable() to double-check that the page belongs to a VMA that's actually migratable. So we are validating whether or not these pages are movable in this interface, and we're reutilizing the vast majority of the code directly from move_pages as well.
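For orientation, here's a hedged sketch of what the proposed interface looks like from userland: the same shape as move_pages(2), minus the pid, with physical addresses in place of virtual ones. The syscall number and the wrapper are hypothetical placeholders for whatever the patch set assigns; this is a prototyping interface, not an upstream API.

```c
/* Hypothetical userland wrapper for the proposed move_phys_pages syscall. */
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_move_phys_pages
#define __NR_move_phys_pages 999   /* placeholder: whatever the patch assigns */
#endif

/* pages[] carries physical addresses; nodes/status/flags mirror move_pages(2) */
static long move_phys_pages(unsigned long count, void **pages,
                            const int *nodes, int *status, int flags)
{
    return syscall(__NR_move_phys_pages, count, pages, nodes, status, flags);
}
```

Internally, as described above, each page still goes through an rmap_walk and the same VMA/cgroup/cpuset validation that move_pages performs.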
And there's already been a little bit of feedback and some other ideas. The first one I heard was, "Well, can we create the interface but not the syscall?" I think that's totally valid. There are two issues that I see with that. First, it allows development, but really only in drivers or in the core. And until you have a standardized interface that devices can implement, it may hurt adoption, or it may hurt people who are interested in doing this but don't want to write a driver; it may not encourage open development. So if somebody actually figures out how to do this really well, they could just close-source their driver, unless we GPL this particular symbol, for example. And then the other one was the obvious feedback, which was, "Userland shouldn't talk physical addresses, go away." And that's a legitimate concern. The response I would give there is that IBS and PEBS can already be configured to give you not just one or the other, but both the virtual and physical address mapping. That's why it requires CAP_SYS_ADMIN, and that's why this interface would also require CAP_SYS_ADMIN. So from a security perspective, it's no worse than IBS and PEBS in that sense.
So I think that's what I got in terms of questions. We got three minutes and then we can all--
I think the other concern about adding the syscall, though, is also forever cutting off the kernel from a policy mechanism.
Yeah. No, absolutely. I totally agree with that. And that's why I said this talk is chock full of bad ideas. But it does give something for people to prototype against in the meantime until we kind of figure out whether or not this is an actual legitimate idea.
But if we imagine a world where a CXL memory expander is giving hotness information, couldn't that expander driver be a generic thing that just calls internal move_pages?
And my follow-up question to you would be, would you integrate that without it being in the CXL spec as a fully formed feature? That's all I'm saying. Well, no-- as a kernel maintainer, wouldn't you want to see that as an actual generalized feature that is defined?
I mean, we generally operate on what's in the spec.
And that's all I'm saying is I completely agree with you. And my feedback would be that sounds like something we need to standardize.
I do think there's this-- kind of going back to Davidlohr's point, right? The applications where this would make sense-- who's going to consume this, right? You could say sophisticated applications wouldn't need this, right? They'll handle placement on their own. And then for the ones that, you know, make the best case possible, let the kernel do it internally, right? Why give them this power afterwards, right? Like, what if--
Yeah, one of the pieces of feedback that I got was the kernel should do this. And as we just discussed, I actually agree with that. But it is a little bit of a chicken and egg scenario. I think until you can prove there's a legitimate use case, the kernel will not do it. And so the proposal here is can we get something so we can at least get it moving along? And in the meantime, the patch is there. People can use it.
Yeah, that's what I was going to say. Why do we need it upstream for the purposes of development? We need the refactoring and everything.
So the refactor, yes, and the core interface, yes. But I don't think the syscall as it is should necessarily be upstream. I think the refactor should go in and the interface should be made available. But it's fun to talk about implementing a new syscall.
Got it.
So this is for the purpose of CXL device memory, kind of hotness access tracking. Why would user space be the first to be handling this kind of information? Why wouldn't it happen in the kernel?
It's kind of in the same realm of the discussion we were just having. Yes, ultimately, I think a general tiering piece of software is probably better handled by the kernel because it has better general information around the system. But there is a bit of a chicken and egg, right? So if we get something that allows us to do this type of movement in user space, we can very rapidly prototype off of custom devices, off of a variety of things that happen to pop up.
I see. So these devices would expose something to the user space and show the hotness information.
And it would be done probably in the form of vendor-specific mailbox commands-- I mean-- well-- no, I mean, it's in the specification, and it's carved out. But it's-- I'm joking. But it is good as a prototyping interface, just to be clear. I'm not saying that-- we should ultimately try and get it into the standards, right? We should ultimately try to do that. So, did you have more?
I mean, just for the record, we had a user space agent working with Intel Optane memory that was using a very similar kind of syscall, but Optane is dead. So I'm very curious what's going to happen.
Medic. No, no, you're right. And there are actually quite a few projects. I mean, DAMON is already trying to do some of this by abstracting some of these interfaces inside the kernel as well. So there's clearly some level of want or need for this type of interface. Whether it ultimately ends up as a syscall, probably not. But again, fun to talk about in that context. That's probably it, and then we've got time.
Can you think of another reason, besides memory policy, that we'd want this? Because basically, think of another minimum viable product for moving pages: driver-directed page moving. And then maybe we could get that piece moving upstream and solve this other problem later.
Yeah, and I think that's probably a better way to do the initial RFC, is the refactor patches plus an interface maybe from the CXL area that allows you to define some device feedback as to hotness.
Yeah, I mean, we're kind of doing this with the type 2 stuff, where we're just kind of like, imagine somebody who cares about this, here's some patches. But yeah, it's hard to move forward with that.
Yeah, it is a chicken and egg problem, right? Until you find the use case, it's hard. We are technically over time, so this will probably be it.
Yeah, just a quick comment, actually. So I'm working on a prototype, an experimental implementation, somewhat related to this. I have, in fact, posted an RFC a few months back. And this is regarding using IBS-provided hints, essentially, to do the page placement, both for migrating pages between regular nodes, as well as for hot page promotion. So this essentially uses the physical address provided by IBS to do the movement. Rather than doing it from user space, I was trying to get everything done from the kernel, driving the existing NUMA balancing logic, both for regular migration and hot page promotion. I'm giving a talk about it on Wednesday also, in the refereed track. I would invite the audience-- it's kind of related, if you find it interesting, maybe.
So maybe follow up question, are you finding that the NUMA balancing interface is sufficient? Or would it be better to have a direct interface you could call akin to move_pages that just says, move this set of pages in particular?
Right now, with the way IBS works and the way it provides data for each access sample, the NUMA balancing interface is looking pretty good for me, but it is not the only use case. There are other subsystems-- reclaim might be there, right? Maybe DAMON itself might benefit from using these accesses. So I want to debate all those things in the talk. What should that interface be? Should we have a common subsystem which actually collects all this information and makes it available to the consumers within the kernel, the different subsystems? But to answer your question, for NUMA balancing, IBS works; I feel the interface works naturally.
If I'm thinking of the same thing, that's been nacked just because you end up doing too much processing in the kernel.
Yeah. Well, that goes back to there's no free lunch. You're stealing the cycles from somewhere, right?
I don't know what we're talking about, but do you really need per-page granularity? Would it be enough to say, go poke a sysfs thing that says promote this node or demote this node?
There is another patch set, which I don't think is public yet, that is aimed at that. And that is, I think, another question of who's willing to take on the maintenance cost of it, where it's maybe not necessarily "promote this node," but maybe "take 10% of what's on this node, distribute it randomly, and move it up or down," or whatever. And I think that is a very solid proposal, especially in terms of rebalancing for interleaving, for example. Now that we're solidly in software-defined memory land, you can come up with a billion use cases that might show up. The question is, as always, who actually wants it? So yes, I think there's a place for that interface for a variety of reasons. But when you talk about two-tier memory devices, potentially with DRAM and SSD, you really want to know what's hot on there to get it off as fast as possible, because the tail latencies could kill you. So that's where it may not be sufficient to just have a general interface. You may want to say, "I really want this page off of there, because it's killing me." Thanks.