Preferred metadata design #19
For me
|
Policy questions
Userspace implementation
Kernel implementation
|
Ok, so you envision something more generic than simply "allocate metadata on metadata preferred devices": classifying devices on a spectrum, and using that spectrum to infer policy. I actually like that idea better, because it removes the need for a separate flag to indicate what the policy is; we simply tag devices with their policy and carry on. From the point of view of the user it would map out roughly like this
For this the on disk implementation would simply be a new item on the device tree per device. Implementation wise I would do something like
I think that's it code wise. |
Sounds good. One question: What does "no preference" mean? metadata-only, metadata-preferred, data-preferred, and data-only imply an ordering, so what is "none" and where does it fit in the order? Or is "none" an alias to one of the others and the default value for new devices, e.g.
and if so, wouldn't it be better to call it "default" instead of "none"? |
Policy questions
ENOSPC
Do we fail when we run out of chunks on the metadata disks?
IMHO no, lower performance is far better than a metadata -ENOSPC.
Do we allow the use of non-metadata disks once the metadata disk is full?
As above, yes. However, there are some corner cases that have to be addressed.
Normal case: data spans sd[cde], metadata spans sd[ab]. What should happen if sdd is full:
Do we allow the user to specify the behavior in this case?
I think no. However, I am open to changing my mind if there is a specific use case.

Device Replace/Removal
Does the preferred flag follow the device that's being replaced?
In the general case no, I don't see any obvious behavior; we can replace a faster disk with a slower one. However, btrfs disk replace/remove should warn the user about the possible risks.
What do you do if you remove the only preferred device in the case that we hard ENOSPC if there are no metadata disks with free space?
I think that "preferred" is just a hint. I would prefer a better btrfs-progs that warns the user about these situations (preferred disks full).

Userspace implementation
sysfs interface for viewing current status of all elements.
Definitely yes.
sysfs interface at least for setting any policy related settings.
I agree; the only limitation is that it is difficult to implement an "atomic change of multiple values". I don't know if this
A btrfs command for setting a preferred disk. The ability to set this at mkfs time without the file system mounted.
It does make sense. Anyway, my suggestion is still to allow a flag to the mount command to set a "standard" behavior in an emergency situation. It may be that the mount options are "transitory", while the sysfs settings are "permanent".

Kernel implementation
The device settings need to be persistent (i.e. which device is preferred for the metadata).
Agree.
The policy settings must also be persistent.
Agree.
The question of how to store this is an open question.
The xattr was a fascinating idea. Unfortunately it suffers from two problems:
The other way is the one which was used to store the default subvolume. The only difference is to use an extensible (and versioned) structure to hold several parameters (even unrelated ones). A "dirty" flag marks the structure to be committed in the next transaction (see btrfs_feature_attr_store() ). |
This is one area where Goffredo and I disagree. I have use cases where there absolutely must not be data or metadata on a non-preferred device type. In my examples above "metadata-only" and "data-only" get used more often than "metadata-preferred" or "data-preferred." In my test cases I never use metadata-preferred or data-preferred at all. I could live with just metadata-only and data-only, but I know others can't, so I included metadata-preferred and data-preferred in all variations of my proposal. On a server with 250GB of NVME and 50TB of spinning disk, it's a 95% performance hit to put metadata on a non-preferred device, and a 0.5% space gain to use preferred devices for data. That tradeoff is insane; we never want that to happen and we'd rather have it just not be possible. We're adults, we can buy NVME devices big enough for our metadata requirements. |
I'm with Zygo here, having had teams tell me first hand that they'd rather have X fall over hard than slowly degrade. I think it's valuable to have the metadata-preferred/data-preferred model for people who want the perf boost+safety, but it's equally valuable for Zygo and other people who would rather it be fast and plan appropriately than to have it slowly break. |
Ok, let me summarize the algorithm.
The above ordering is for metadata. For data, the ordering is reversed. The allocator takes the disks from the first group. To simplify I would like to know your opinion about grouping the different levels:
|
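The ordering discussed in the comment above can be sketched roughly as follows. This is only an illustration, not the patch set's code: the level names are the ones used earlier in this thread, while `Device`, `candidate_groups` and `pick_devices` are hypothetical names. It also assumes that a `*-only` device of the opposite type is never used, as requested for the strict use case, and it ignores how many devices a RAID profile needs.

```python
from dataclasses import dataclass

# Preference levels named in this thread (the string values are illustrative).
METADATA_ONLY = "metadata-only"
METADATA_PREFERRED = "metadata-preferred"
DATA_PREFERRED = "data-preferred"
DATA_ONLY = "data-only"

# Order in which groups are tried for metadata; data walks it in reverse.
METADATA_ORDER = [METADATA_ONLY, METADATA_PREFERRED, DATA_PREFERRED, DATA_ONLY]


@dataclass
class Device:  # hypothetical, for illustration only
    name: str
    hint: str
    free_bytes: int


def candidate_groups(devices, chunk_type):
    """Return non-empty device groups in the order the allocator would try them."""
    if chunk_type == "metadata":
        order, excluded = METADATA_ORDER, DATA_ONLY
    else:
        order, excluded = METADATA_ORDER[::-1], METADATA_ONLY
    groups = []
    for level in order:
        if level == excluded:
            continue  # assumption: "-only" devices never take the other type
        group = [d for d in devices if d.hint == level and d.free_bytes > 0]
        if group:
            groups.append(group)
    return groups


def pick_devices(devices, chunk_type):
    """Take the disks from the first group that still has free space."""
    groups = candidate_groups(devices, chunk_type)
    return [d.name for d in groups[0]] if groups else []
```

With one `metadata-only` NVMe, one `data-preferred` disk and one `data-only` disk, metadata lands on the NVMe first and falls back to the `data-preferred` disk once the NVMe is full; it never reaches the `data-only` disk.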
The "disk with no hints" case is a problem. If we have split the filesystem into disks that only have data or metadata, but we add a new disk, btrfs might put the wrong type of chunk on it. There's a few solutions:
My preference is option #3. Even though it can sometimes put data on an SSD, it is much simpler for users. |
I think that this is the "least surprise" option; I like this |
I can't think of cases where you'd have both -preferred and -only preferences set on devices in a filesystem without also wanting the ordering between them. Conversely, if you don't want the ordering, you also don't need two of the preference levels.
We could dig out my |
Is there any effort underway to implement this? I may be interested in contributing |
Here's a patchset I'm trying to keep updated for LTS versions: |
This is a very interesting topic. As a user and sys admin there would be many benefits with the multi level approach. As a home user I'd say I would like the preferred option, because I am conscious about cost and want to avoid ENOSPC as long as possible. However as a sys-admin I have other resources and priorities, so the *-only would make sense. In a way I wonder if the current train of thought could be expanded in a more tiered storage way where we can have classes of data too? NVME = metadata-only The biggest value add is the original metadata preference idea. But, depending on design choice, maybe would allow for development of tiered allocation later on? |
In the current patches it's sorting devices by preference, and there's only 2 preference types plus a "consider other preference types" bit. We could easily add more preference type bits and then specify the order for filling devices, or create multiple preference lists explicitly and sort each one differently. These will require a different representation in the metadata (and potentially one with multiple items if e.g. there are a lot of tiers). The hard part is that with one monolithic "data" type, the patch only has to touch the chunk allocator to decide where to put new chunks. If there are multiple "data" types and a data classification scheme, then we have to start adding hooks in |
In the years since this stuff was posted I did find another use case: allocation preference
e.g. we set devices 1-6 to I guess |
Isn't it easier to set musage to 100 percent on a device, so that space info then says there is no space for data? |
If I understand it correctly "musage" is just a filter on the balance process which selects only block groups with space usage below that percentage for reallocation (balance). Device is another filter there which selects only block groups on that device. To actually force allocation process to select particular device there is a need to apply patches mentioned above and set allocation hints per device. |
Yes, you do. They are really filters only to select which chunks to consider. If this selects a chunk which has stripes on different devices, it would still work on both devices even if you filtered for just one device. Allocation strategy is a completely different thing, not influenced by these filters. |
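To illustrate the point made in the two comments above, here is a simplified model of how balance filters only select block groups and say nothing about where new chunks are allocated. It is a sketch, not the kernel's code: `BlockGroup` and `select_for_balance` are hypothetical names, and the usage comparison simply follows the "below that percentage" wording used above.

```python
from dataclasses import dataclass


@dataclass
class BlockGroup:  # hypothetical model of a chunk / block group
    start: int
    used: int
    length: int
    devices: tuple  # ids of all devices holding a stripe of this chunk


def select_for_balance(block_groups, usage=None, devid=None):
    """Return the block groups that pass the given balance filters.

    usage: keep only block groups whose used space is below this percentage.
    devid: keep only block groups that have a stripe on this device.
    """
    selected = []
    for bg in block_groups:
        if usage is not None and bg.used * 100 >= usage * bg.length:
            continue
        if devid is not None and devid not in bg.devices:
            continue
        selected.append(bg)
    return selected
```

A chunk with stripes on devices 1 and 2 is still selected, and relocated as a whole, when filtering for device 1 only. Nothing in this selection step influences which device receives the rewritten chunk; that is decided by the allocator.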
Why not implement a system assigning each device two integer priority values, one for metadata and one for data? These integer values would dictate the order of storage priority, ensuring devices with higher numerical values are utilized first, and lower values are used as fallback options when space runs out. An administrator would set their desired metadata device with a high metadata priority, and a low data priority. This structure not only facilitates the creation of a dedicated metadata device to prevent premature space exhaustion but also supports a variety of scenarios, including:
This prioritization system could also benefit future developments in cache device technology, enabling the cohabitation of cache data and metadata on a single device. It would also refine tiered storage strategies with mixed-speed disks, prioritizing specific data on faster devices while caching other data on slower ones. |
@Izzette This idea has some flaws:
I think tiering should really be approached by a completely different implementation. This preferred allocation design has its limited set of applications and it works for that. Tiering is not part of this set and cannot be solved with allocation strategies. Currently, there's already a function in the kernel where the kernel prefers reading data from stripes on non-rotating disks. A future implementation should probably rather track latency to decide which stripe to read, similar to how mdraid does it. But this is completely different from allocation strategies and doesn't belong into allocation strategy design. We should not try to implement tiering as part of an allocation strategy, it's a very different kind of beast. Stripe selection for preferred reading of data is also not part of allocation strategy. Adding a reserve device is already possible with these patches (but I agree, in a limited way only), and preventing allocation during migration is also possible (using the "none" preference). I'm currently using these patches, which implement the idea here, to keep meta data on fast devices only, and to prevent content data from going to these devices. High latency of meta data is the biggest contributor to slow btrfs performance. While this sounds like "tiering", it really isn't, although in case of Just to spin up some ideas but it should probably be discussed in another place:
Currently, the linked patch + bcache mostly implement the above idea except it doesn't know about stripes and does double-caching if different processes read from different stripes, which is kind of inefficient for both performance and storage requirements. |
How about simply having a user space daemon doing this work and submit migration calls, maybe via a modified balance interface? We do not need btrfs handling this kind of thing internally, IMHO. |
This should work and I actually agree to let a user-space daemon handle this (the kernel should not run such jobs). But how does such a daemon know what is hot or cold data? |
In my opinion, this is a very easy limitation to overcome. Priorities of 0 or lower could be used for exclusion, preventing allocation to the device for that data type altogether.
I fully agree, but I can see how it could potentially be used as part of a complete tiering solution, especially for write-back caching.
In fact, this is exactly the case I am encountering / interested in. My proposal is just an imperfect idea, but having "metadata-only" "data-only" etc. feels limiting, telling the user how to use the tool instead of giving the users the tools they need to fit their use case. |
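The two-integer proposal, including the "0 or lower means excluded" rule from the comment above, could look roughly like this. It is a sketch of the idea only; nothing like it exists in the current patches, and `Device` and `allocation_order` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Device:  # hypothetical, for illustration only
    name: str
    metadata_priority: int
    data_priority: int
    free_bytes: int


def allocation_order(devices, chunk_type):
    """Order devices for a chunk type: highest priority first.

    Devices with a priority of 0 or lower for that type are excluded,
    as are devices without free space.
    """
    if chunk_type == "metadata":
        priority = lambda d: d.metadata_priority
    else:
        priority = lambda d: d.data_priority
    usable = [d for d in devices if priority(d) > 0 and d.free_bytes > 0]
    return [d.name for d in sorted(usable, key=priority, reverse=True)]
```

An NVMe with priorities (metadata 100, data 0) then behaves like "metadata-only", while an HDD with (10, 100) takes data first and metadata only as a fallback.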
Yes, this would work. But then, it's essentially what we already have: The list of integer values behind the symbols is actually a priority list with exclusion bits. But I think your idea of splitting it into two integers makes sense and is more flexible, and I like the "0" as exclusion idea. OTOH, there are not many free bits in the meta data to store such information thus it has been designed the way it is currently. Everything more complex will need additional btrfs trees to store such data, as far as I understood it.
Me too, because it will make things much more dynamic. But currently the combination with bcache works for me. E.g., you could use 2x SSD for metadata, mdraid1 NVMe for bcache, and the remaining HDDs will be backends for bcache.
I think we still should rethink a different solution and not try to force metadata allocation hints into a half-baked tiering solution. Yes, you can use it as a similar behaving solution currently, with some limitations, and your priority idea could make it more flexible. But in the end, it won't become a tiering solution that just dynamically works after your disks filled up: Because data sticks to that initial assignment, performance would suddenly drop or start to jitter because some extents are on fast devices, some on slow. This will happen quite quickly due to cow behavior. Allocation hints are to be used for a different scenario, and within those limits, you can trick it into acting as a tiering solution, and even a dynamic one, if we put bcache (or lvmcache) into the solution. Don't get me wrong: I'd rather prefer a btrfs-native solution but allocation hinting is not going to work for this because it only affects initial data placement. In that sense, allocation hinting/priorities is not the correct tool to give users a tool for tiering. We need a different tool. We don't need a fork with some aluminum foil attached to dig a hole, it just would wear off fast and break. |
As far as the metadata patches go, I think they are pretty good as they are. More users would benefit if we could release them officially, maybe with some fallback logic to handle corner cases. The patches themselves seem very solid. If we want to split data into different tiers, then some changes are likely needed. At the least, some metadata that can be set (on extent, inode or subvol?) to classify data. The allocator then needs to use these properties during extent allocation. Another possibility is to make balance accept a placement hint. Balance can already use vrange/drange as src argument, so providing a target hint to the allocator could be enough to let user-space handle the logic. This would be suitable for moving cold data to a lower tier, or mostly-read data to an upper tier. |
This would be perfect. But still, how do we know what hot, cold, or read-mostly data is? I still don't think that tiering should be part of the allocator hinting, but if btrfs could record some basic usage stats and offered an interface for a target hint, we would be almost there by using a user-space daemon. Of course, we could use allocator hinting to put new data on a fast tier first, then let user-space migrate it to slower storage over time. But to do that correctly, we need usage stats. I really have no idea how to implement that in an efficient way, because it probably needs to write to the file system itself to persist it in some sort of database or linked structure. Maybe we could borrow some ideas from what the kernel does for RAM with multi-gen LRU or idle-page tracking. Also, I think meta data should be allowed to migrate to slower storage, too, e.g. if you have lots of meta data hanging around in old snapshots you usually do not touch. This can free up valuable space for tiered caching on the faster drives. |
There is also the situation of files where low latency is desired but which are accessed infrequently, such as the contents of Or perhaps the user plays games, which are sensitive to loading latency, and wants to prevent the files from being stored on spinning rust, but doesn't want their fastest storage consumed by them (think NVMe/Optane + SSD + HDD). It may also be convenient to place all small files on the fastest tier, similar to how ZFS has the Maybe intelligence could be added to place new files at different tiers based on "magic tests" similar to how
Could that be implemented with access count and access time? Update the access count whenever the access time is updated, with a max value (maybe 255). If it's been x amount of time since the last access, reset the count to 0, where x could be tunable. User space could examine the access time and count and decide to move the data. |
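The access-count idea above can be sketched as a small saturating counter. This is an interpretation, not an existing btrfs field: `record_access` is a hypothetical name, timestamps are plain seconds, and the access that triggers a reset is assumed to count as the first access of the new period.

```python
MAX_COUNT = 255  # the "maybe 255" cap from the comment above


def record_access(count, last_access, now, reset_after):
    """Update (count, last_access) for one access happening at time `now`.

    If at least `reset_after` seconds passed since the last access, the
    count is reset to 0 before this access is counted (assumption).
    """
    if now - last_access >= reset_after:
        count = 0
    count = min(count + 1, MAX_COUNT)
    return count, now
```

A user-space tool could then compare the count and the last access time against its own thresholds to decide whether to move the data.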
This is one example why allocation hints should not be used for this. It'll create a plethora of different options and tunables the file system needs to handle. Everything you describe could be easily handled by a user-space daemon which migrates the data. If anything, allocation hints should only handle the initial placement of such data. And maybe it would be enough to handle that per subvolume. But then again: At some point in time, the fast tier will fill up. We really need some measure to find data which can be migrated to slower storage. But you mention a lot of great ideas, we should keep that in mind for a user-space daemon.
I think on a cow file system, we should try to keep updates to meta data as low as possible. Often, atime updates are completely disabled, or at least set to relatime which updates those times at most once per 24h. So this would not make a very valuable statistic. Also, it would probably slow down reading and increase read latency because every read involves a potentially expensive write - the opposite of what we want to achieve. For written data, a user-space daemon could possibly inspect transaction IDs and look for changed files, similar to how bees tracks newly written data. But this won't help at all for reading data. Maybe read events could be placed in an in-memory ring buffer, so a user-space daemon could read it and create a useful statistic from it. And if we lose events due to overrun, it really doesn't matter too much. I think it's sufficient to implement read tiering in a best-effort manner, it doesn't need to track everything perfectly. |
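A best-effort ring buffer of read events, as suggested above, might behave like the following sketch. It only models the idea in user space: `ReadEventRing` and `read_counts` are hypothetical names, events are reduced to inode numbers, and no such kernel interface exists in btrfs today.

```python
from collections import Counter, deque


class ReadEventRing:
    """Fixed-size buffer of read events; the oldest events are lost on overrun."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)
        self.dropped = 0  # number of events overwritten before being read

    def push(self, inode):
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # overrun: the oldest event is about to be lost
        self._events.append(inode)

    def drain(self):
        """Return all buffered events and empty the buffer."""
        events = list(self._events)
        self._events.clear()
        return events


def read_counts(ring):
    """What a daemon might do: turn drained events into per-inode read counts."""
    return Counter(ring.drain())
```

Losing events on overrun only makes the statistic slightly less accurate, which matches the best-effort goal described above.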
Monitoring reads and writes could potentially be done using What are the effects of adding additional modes? For tiering we might need something like The problem with extending the current patch set with additional tiers, I am guessing, is that we only have one type of In the end, for my personal use-cases, the current solution works very well and I think that a lot of users would benefit if we could mainline it as it is. I know there are issues with, for example, free space calculation and possibly sanity checking so that users don't lock themselves out by using bad combinations, etc., but maybe it is enough to solve some of them? |
I'm not sure if this is an official one (as far as the patch set of the original author is "official"). Do you refer to my latest patch in kakra/linux#31? I think we could easily add tier1,2,3 for data. It would use data disks in the preferred order then. To make it actually useful, the balance API needs an option to hint for one of the tiers. And then we'd need some user-space daemon to watch for new data, observe read frequency, and decide whether it should do a balance with a tier hint. This would actually implement some sort of write cache because new data would go to the fastest tier first, and then later become demoted to slower tiers. I wonder if we could use generation/transaction IDs to demote old data to slower tiers if the faster tier is filling up above some threshold - except it was recently read. |
Yes, I think prefer none seems logical to have. It is that patch set that I have been using for some time now. |
Yeah, I added that because I had a use-case for it: A disk starts to fail in a raid1 btrfs. But I currently do not have a spare-disk I want or can use. It already migrated like 30% to the non-failing disks just by using the system. So it seems to work. :-) |
Regarding the original topic, are we any closer to a consensus? Perhaps this should be brought to the mailing list? |
@Forza-tng While using the new bees version, I had the idea of creating a I'm not sure if I understand the btrfs code enough to implement such a feature, or if that is even easily possible. But to me, this idea is quite tempting. But it will also make allocation and performance less predictable because of extent splitting when writing to shared extents, or writing bigger data could suddenly make it slower. It would need some close observation of extent size statistics and free space per device. Also, I'd still prefer the extent type hint over any size-based hint but maybe type hints would even be no longer needed then... |
We need to agree on how this system will work before we start writing code again. The main goal is to provide users the ability to specify a disk (or set of disks) to dedicate for metadata. There's a few policy questions that need to be asked here. Please discuss with new comments, and as we come to a consensus I will update this main entry with the conclusion of our discussions.
Policy questions
Userspace implementation
Kernel implementation