Call for testers - release candidate v0.11-rc3 #300
Building v0.11-rc3 fails for me with
on Ubuntu 20.04.6 LTS. As far as I understand it, Ubuntu 20.04 only provides Linux kernel userspace headers for kernel 5.4. bees v0.11-rc2 builds and works fine. I do not have strong feelings about Ubuntu 20.04 support, so if you decide not to support it any more, that would be fine by me. |
Hello, thank you for your work. I have read a lot of comments about simultaneous compression and dedupe and am still confused, so could you please clarify the following: Sorry to ask here, but some of the older issues have not been updated for a long time and I didn't want to open a new one that duplicated them. |
There's something like that on the roadmap, but it doesn't make bees run any faster, so it's a low priority (or an opportunity for external contribution--patches welcome).
Any form of defragmentation looks like new data to bees, which will attempt to dedupe the new data, possibly reinstating some of its earlier fragmentation. In v0.11, bees will now only introduce fragments when most of the space used by the fragmented extent can be freed as a result of the fragmentation. This produces a healthier balance between defrag and dedupe.
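As a rough illustration of that balance (a sketch only -- the function name and the exact threshold are assumptions, not bees' actual code), the decision boils down to comparing the space a split would free against the space the fragmented extent currently occupies:

```cpp
#include <cstdint>

// Hypothetical sketch of the defrag/dedupe balance described above:
// only introduce a new fragment if doing so frees most of the space
// held by the fragmented extent. The 50% threshold is an assumption
// for illustration, not the value bees actually uses.
bool worth_fragmenting(uint64_t extent_bytes, uint64_t freeable_bytes)
{
    // Splitting pays off only when more than half of the extent's
    // on-disk space would be released as a result.
    return freeable_bytes * 2 > extent_bytes;
}
```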
Compression and dedupe are independent features, and they can be combined to get the good effects of both. Note that in the graph, the compressed and uncompressed data sets are both 45% smaller after dedupe, but they were different sizes at the beginning. The compressed data set went from 100 GiB to 55 GiB, while the uncompressed data set went from 160 GiB to 90 GiB. The effect of dedupe was the same, but the compressed data was smaller due to compression, so fewer bytes were saved by dedupe because they were already saved by compression. Combining compression and dedupe, the data went from 160 GiB (uncompressed without dedupe) to 55 GiB (compressed with dedupe), a total savings of 65%. bees won't recompress data for you. You'll have to do that yourself, with |
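Restating the arithmetic above in a tiny sketch (the GiB figures come from the comment; the helper is purely illustrative):

```cpp
#include <cstdio>

// Percentage saved going from `before` to `after` GiB.
static double saved_pct(double before, double after)
{
    return 100.0 * (before - after) / before;
}

int main()
{
    std::printf("dedupe on uncompressed data: 160 -> 90 GiB, %.1f%% saved\n", saved_pct(160, 90));
    std::printf("dedupe on compressed data:   100 -> 55 GiB, %.1f%% saved\n", saved_pct(100, 55));
    std::printf("compression + dedupe:        160 -> 55 GiB, %.1f%% saved\n", saved_pct(160, 55));
    return 0;
}
```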
It turned out to be not too painful to add compatibility code for older headers, similar to what bees already does for btrfs, so I pushed a fix. If anyone wants to test it on their DEC Alpha...let me know if it works. 😉 |
In #296 you said |
not quite yet ...
|
Doesn't bees create compressed temporary files to place dedupe extents? |
Not necessarily... btrfs stripes reads by PID modulo. So if you have two disk candidates to read from, btrfs picks the copy based on the reader's PID modulo 2. So if bees happens to read only from "even" PIDs, two threads would always read the same stripe, because bees threads are relatively long-lived. I wonder if we could get some stripe selector into the kernel at some point which decides where to read based on IO latency. But this is difficult to do, because switching the stripe too often will reduce throughput. And the gains don't seem to be very high: there has been a patchset in the past trying to do something like that, and it didn't improve things a lot. So even if bees could read from two stripes in parallel, it would probably not gain a lot. |
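A conceptual sketch of that PID-modulo mirror selection (an illustration of the behaviour described above, not the kernel's actual code; the function name is made up):

```cpp
#include <cstdio>
#include <sys/types.h>

// Illustrative only: the default btrfs raid1 read policy derives the
// mirror from the reader's PID/TID, so a long-lived thread keeps
// hitting the same copy for its whole lifetime.
static int pick_mirror(pid_t reader, int num_copies)
{
    return reader % num_copies;
}

int main()
{
    // Two long-lived bees worker threads that both happen to have even
    // IDs always land on mirror 0, leaving the second copy idle.
    const pid_t readers[] = {1000, 1002};
    for (pid_t r : readers)
        std::printf("reader %d -> mirror %d\n", (int)r, pick_mirror(r, 2));
    return 0;
}
```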
Could bees account for the PID modulo 2 thing? Like just use 2 threads for reading, but make sure that one has an odd PID and one an even one. If the PID can't be controlled when creating threads, getting an even and an odd one still shouldn't be too hard: just create threads until you have at least one even and one odd PID, then get rid of the extra threads you made.
If switching is bad, how about just assigning each thread a stripe to read from, but instead of basing it on PID, base it on IO latency or throughput or utilization of each stripe, or whatever tries to balance reads between stripes? Switching could be done very infrequently, maybe once per minute or less.
If you account for the PID thing you'd have 2 independent threads doing reads without switching, so why would it not gain a lot? In theory reading from 2 stripes should be up to a 100% improvement in read speed. |
Because there is still metadata which may not follow the same logic, and because btrfs is not really good at parallelizing reads and writes from the same consumer. Most of the time in bees seems to be spent on very short extents, which involves a lot of metadata IO. So while reads of large extents could benefit from such logic, it would be only a very small gain in the overall extent processing. I'd rather leave that extra "free" IO to other processes. But I'm pretty sure bees did multiple read threads in the past... @Zygo? So this feature must have been dropped for a reason. |
How are two bees threads doing reads any different from two completely independent consumers? As long as the two threads work on scanning different data (don't try to read the same file/extent/whatever bees works with from 2 threads), they should not really interfere with each other. How is bees any different from two completely separate applications reading 2 different things from the 2 copies?
Is there a reason metadata reads wouldn't follow the same logic, assuming it's also raid1 or higher? (I guess if data and metadata are at different raid levels they would use different modulo values.) For me, data is raid10 and metadata is raid1c3.
My guess (I have no idea what bees actually did in previous versions and I'm just going off one part of the last issue saying it now only uses one thread for reading) is that previously bees just didn't have a limit on how many threads could read at the same time, which probably degraded performance/gave no performance gains since you had multiple threads reading from the same disks. |
It wasn't a feature. Old versions simply hurled IO requests at the kernel, and the kernel generally scheduled them in the least efficient way possible.

As a temporary fix for v0.11, all readahead/prefetching was moved to userspace, and a mutex was introduced to ensure that only one thread at a time does prefetch (the downstream processing is still parallel, but it's all working from page cache after prefetch is done). Extent scan ensures that the IO patterns generated by bees reuse cached pages more often, which gains back most of the performance that subvol scans were losing by design.

It's a temporary fix because the more permanent fix will be to remove as much block reading as possible. There's no need to make that code run faster or in parallel if we can avoid running it in the first place. Getting rid of reads is next on the todo list, but we want to get the non-trivial gains we have so far into a stable release so distros will pick it up, and people can stop running v0.10 and earlier. It doesn't make a lot of sense to micro-optimize reads when:
I made some prototypes where bees figures out where we can send reads in parallel based on the chunk tree, and schedules tasks to avoid reading from the same devices at the same time, using the same framework bees currently uses to avoid inode and extent lock contention. They suffer from two problems:
Some of these problems could be solved, but the csum tree scanner simply doesn't have the problems to begin with, and it's hundreds of times faster on all storage configurations. After that's done, we can look at where bees spends its time and focus effort there. Another possible micro-optimization is to use the |
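For reference, the "single prefetcher, parallel consumers" arrangement mentioned a few comments up could look roughly like this (a hedged sketch with made-up names, not bees' actual classes; the posix_fadvise call is just one way to populate the cache):

```cpp
#include <fcntl.h>
#include <mutex>
#include <sys/types.h>
#include <unistd.h>

// Sketch only: serialize the userspace prefetch so one thread at a time
// issues the big sequential read, while the downstream hashing/dedupe
// work on already-cached pages stays fully parallel.
static std::mutex prefetch_mutex;

void prefetch_extent(int fd, off_t offset, off_t length)
{
    std::lock_guard<std::mutex> lock(prefetch_mutex);
    // Hint the kernel to pull the range into page cache; real code
    // might read the data explicitly instead.
    (void)posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
}

void scan_extent(int fd, off_t offset, off_t length)
{
    prefetch_extent(fd, offset, length);
    // ...hashing and dedupe lookups proceed concurrently here,
    // served from the page cache populated above...
}

int main()
{
    const int fd = open("/etc/hostname", O_RDONLY);  // any readable file works
    if (fd >= 0) {
        scan_extent(fd, 0, 4096);
        close(fd);
    }
    return 0;
}
```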
Only for partially duplicate extents where it can't find a duplicate copy of some of the data, and only when those fragments are compressible. In the 160 GiB uncompressed test, only 293 MiB was compressed by bees (and about 7 GiB was zeros, so those are now holes that don't count as space usage).
|
Why wouldn't raid10 benefit? Yes, it already does parallel reads from 2 devices because of the 2 stripes, but there's also another copy like in raid1. Wouldn't the raid1 optimization also work on raid10 (going from reading from 2 devices to all 4 in a 4-drive raid10 setup)? |
That's not how striping works. If your IO request is large enough to cover multiple stripes, yes, they can be read in parallel (increased throughput). If your IO request is smaller than the stripe width, then nothing is read in parallel and you instead gain performance from multithreaded workloads (reduced latency). That's why you want smaller stripes for databases and larger stripes for file servers.

In raid1, the system has multiple (usually 2) independent IO queues to work with, and some latency-dependent IO algorithm could distribute the load to those two queues. But care must be taken to keep logically sequential application access in the same queue, otherwise you'll reduce overall performance by disturbing other independent workloads.

For read workloads, raid0 and raid1 are mostly identical. For write workloads, they are obviously not. For raid5 workloads, you get a less predictable mixture of both, where writes are limited by the slowest member (like in raid1), and reads are limited by (n-1) IO queues. raid5 may do diagonal striping to better distribute competing sequential workloads so the parity stripe does not span a single disk only.

bees has no such predictable workloads. And raid does not parallelize anything at all. By design. |
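To make the stripe-width point concrete, here is generic round-robin striping arithmetic (not btrfs' actual chunk mapping, and the 64 KiB stripe size is just an assumption): a request only touches more than one device if it crosses a stripe boundary.

```cpp
#include <cstdint>
#include <cstdio>

// Generic round-robin striping: which device does a logical offset land on?
static uint64_t device_for_offset(uint64_t offset, uint64_t stripe_size, uint64_t num_devices)
{
    return (offset / stripe_size) % num_devices;
}

int main()
{
    const uint64_t stripe  = 64 * 1024;  // assumed stripe size
    const uint64_t devices = 4;

    // A 16 KiB request at offset 0 stays on device 0 (no parallelism);
    // a 256 KiB request spanning offsets 0..256K touches all 4 devices.
    for (uint64_t off = 0; off < 256 * 1024; off += 64 * 1024)
        std::printf("offset %7lu KiB -> device %lu\n",
                    (unsigned long)(off / 1024),
                    (unsigned long)device_for_offset(off, stripe, devices));
    return 0;
}
```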
With
It has probably been that way for a long time, but due to the new silenced logger I've increased verbosity from 5 to 6. And maybe we should ship bees with verbosity 7 by default instead of 8? |
...sigh. Try now? |
Yes, thanks. Running master at commit 85aba7b for ~9 hours already. |
The quoted paragraph assumes the kernel is already parallelizing raid1 reads (or at least preventing userspace from predicting device usage) with one of the non-default read_policy settings. All the raid1* profiles in btrfs use the same mirror preference choice, so with default read_policy it would be possible to get two non-interfering readers on raid10; however, only two would be possible in the entire filesystem, regardless of the number of drives, due to the striping. (edit: it also relies on btrfs always assigning a device to the same position in the raid10 stripes, which is not guaranteed to happen when there's an even number of drives, and guaranteed not to happen when there's an odd number.) It won't scale up with the number of drives as the other raid1* and single profiles do. Either way, the gain is far too small to be worth pursuing yet. |
Yeah, that error should be captured and ignored. It's trying to find While we're supporting XFS for $BEESHOME, I should put the
...and maybe rearrange some of the log messages. Exception trace logs should maybe be at debug level or one level above, as most of those are expected events where we just have to stop a big, complex operation because the data changed under us while we were working on it. Only actual IO errors should be visible at lower verbosity levels. We don't use log levels 1, 2, or 4 at all. |
I'm just wondering why bees is taking this long to finish its initial run for me with this new version. |
Has been the same for me, and then suddenly it finished. Don't look at the ETA numbers, they are vastly inaccurate. The new bees is actually a lot faster. I've done the same and purged the hash and state, and let the new bees start over.

And coming back to the raid argument: no, that won't help here. Your problem is not throughput, it's how the smaller extents are laid out across your file system. bees probably spends most of its time in kernel calls where the kernel is slow due to how btrfs works. Selectively defragmenting those files which are made of thousands and millions of extents may speed this process up considerably - given that those files are not part of any snapshots. For me, such files are usually log files, the

Actually, I've used bees in the past, and the current version, to spot the files and directories which take a lot of time to process. When bees reaches the end of the smaller size classes, it's probably way behind the current generation number and will spawn another round of extra-long ETA - and then just finish after 1-2 weeks or faster. |
Can confirm the ETA function is terrible. I had a 35-week ETA for 128K extents abruptly finish a few days ago. When I examined the logs in detail, I found that the age of the extents is part of the problem: after a scan has been running for a while, more and more of the filesystem is above the top of the transid range (i.e. the data is newer than the scan). When the ETA is calculated, it's based on proportions of the total space, but it doesn't take time into account. The ETA assumes all the data will be processed in the current scan, when in reality the new data will be skipped in the current scan and become part of the next scan. I don't see how to use the above information to improve the ETA (any stats wizards out there?). If we try to exclude data that's outside of the transid range, then all the size estimations will be based on the most recently written data in later scans. When bees is up to date and processing new writes as they happen, these samples are very small with huge variation, so the ETAs computed from that data are junk. |
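A toy model of that failure mode, with entirely made-up numbers (the real ETA is proportion-based, so this is a simplification): the naive estimate treats every remaining byte as part of the current scan, even though data written after the scan started will be deferred to the next scan rather than processed now.

```cpp
#include <cstdio>

int main()
{
    // Made-up figures purely to illustrate the shape of the error.
    const double total_bytes      = 10e12;  // filesystem size
    const double already_scanned  = 6e12;
    const double newer_than_scan  = 3e12;   // written after the scan's transid range
    const double bytes_per_second = 50e6;

    // Naive ETA: everything not yet scanned belongs to this scan.
    const double naive_days  = (total_bytes - already_scanned) / bytes_per_second / 86400;
    // What actually remains in this scan: newer data is skipped until next time.
    const double actual_days = (total_bytes - already_scanned - newer_than_scan) / bytes_per_second / 86400;

    std::printf("naive ETA:  %.2f days\n", naive_days);   // ~0.93
    std::printf("actual ETA: %.2f days\n", actual_days);  // ~0.23
    return 0;
}
```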
Can you post your progress table and your kernel version? The "RATES" section of the status file might also be informative. I don't want to delay v0.11 chasing down every possible performance improvement--that's what later releases are for--but I also want to make sure we haven't got an unanticipated problem. |
I've actually saved multiple copies of my status files, so here they are. I'll post the full status files in case anything in them could be useful. My kernel version is 6.8.0-51-generic.
This is my status file from yesterday:
This was my status file on the 19th of January
January 11th
January 2nd
|
There seem to be some problems with kernel 6.13. Tomorrow I will check on 6.12 to see if this is the cause of the problem.
UPD: I don't get a similar message on 6.12 on the same data; tomorrow I'll try again on 6.13 |
Hi, I've seen that there have been changes in bees recently, and AFAIR there hadn't been for a while, so I just “git pulled” from master, recompiled and installed it on my 2 NASes: one classical x86_64 Intel Debian machine and a less classical aarch64 Debian Raspberry Pi 5. The first thing I notice running bees with the same configuration as previously is that it seems to be working (well, it uses CPU and the disks are scratching), but I have completely lost any output to syslog (journald). I still get some usual output from startup through “beesd[189647]: bees version v0.11-rc3-9-g85aba7b”. Did I miss something about changes in logging configuration or whatever? (Both machines running Debian 12; logging lost on both.) Thanks in advance. |
That looks like a debug stack trace from |
Check the status files in

The main dedupe search function was rewritten, and a lot of log messages went away. The default log level (4) is lower than the list of all deduplicated filenames and offsets (6) or debug messages (7). Also, when switching an existing configuration to |
7.7 M/s random and cached reads (
Opening files (

Dedupes are a very small component of the overall picture, less than 0.4% (
That might be why the new run seems slower than the old run--because the old run made the new run slower.
This is a cold start ( |
Well, it's a cold start anyway when going to v0.11 from v0.10, isn't it? The new scanner uses a different set of (virtual) subvolids for tracking the progress, and this starts from 0. Would it be a good idea to defrag the files that v0.10 pushed to very high reference counts? Then it would be nice if v0.11 could log such file names where it can clearly see that the reference count is very high, the extent size is very small, and that this was probably caused by pre-v0.11. Is something like this possible? Maybe as a separate tool? Previously, I used the performance log lines to get an indicator of which files I should probably defrag. But I think you removed that. |
It starts from the lowest
Currently, very-high-reference-count extent logical addresses are logged in the debug log, but bees doesn't get the filenames because there are tens of thousands of them. The extent is simply ignored, and removed from the hash table if it was previously inserted (presumably when it had a lower ref count). Other extents with medium-high reference counts are processed normally, with only

It's possible to defrag those files, but they're kind of like asbestos in buildings--if you don't need to modify them, it's better to leave them where they are. They take as long to defrag as it takes to fragment them in the first place. For now, I'd let bees v0.11 grind its way through them. Once they're scanned, bees won't look at them again.
There's the thing @Forza-tng was working on... |
It didn't do a cold start here (it would take weeks on my NAS), but it seems to be happy:
|
Continued from #296. Thanks to all who tested -rc1 and -rc2!
New since -rc2:
- openat2 on kernels that have it. This uses weak symbols for the first time. It may cause building or running issues for anyone building or running with a non-mainstream libc or toolchain.

These changes might cause a regression relative to -rc2. If there are any, I'll back out the problematic changes, and v0.11 will be based on -rc2 or -rc3 with the offending commits reverted.
Todo:
- beescrawl.dat format for extent scan? (Might be too large of a change for v0.11, and we might have to support what -rc1 did for the foreseeable future anyway)