Demystifying needed options with QubesOS pool in btrfs reflink (multiple cow snapshots rotating, beesd dedup and load avg hitting 65+ on 12 cores setup) #283
Comments
@tlaurion Having helped someone (and myself) recently with Btrfs degraded performance, two things stood out:
What I prescribe is pretty simple:
Also:
Worth trying: 'ssd_spread' option, if it has any effect on metadata.
Batch defrag is by far the most important factor above, IMO. Using a margin of additional space in exchange for smooth operation seems like a small price to pay (note that other filesystems make space/frag tradeoffs automatically). Making the defrag a batch op, with days in between, gives us some of the best aspects of what various storage systems do, preserving responsiveness while avoiding the worst write-amplification effects.
'autodefrag' will make baseline performance more like tLVM and could also increase write-amplification more than other options. But it can help avoid hitting a performance wall if for some reason you don't want to use
Long term, I would suggest someone convince the Btrfs devs to take a hard look at the container/VM image use case so they might help people avoid these pitfalls. Qubes might help here as well: if we created one-subvol-per-vm and used subvol snapshots instead of reflinks, then the filesystem could be in 'nodatacow' mode and you would have a metadata performance profile closer to NTFS, with generally less fragmentation, because not every re-write would create detached/split extents. Qubes could also create a designation for backup snapshots, including them in the revisions_to_keep count.
With that said, all the CoW filesystems have these same issues. Unless some new mathematical principle is applied to create a new kind of write-history, the trade-offs will be similar across different formats.
We also need to reflect on what deduplication means for active online systems and the degree to which it should be used; the fact that we can dedup intensively doesn't mean that practice isn't better left to archival or offline 'warehouse' roles (one of Btrfs' target use cases).
FWIW, the Btrfs volume I use most intensively has some non-default properties:
Probably 'no-holes' has the greatest impact. I suspect the jbod hurts performance slightly. The safety margins on Btrfs are such that I'd feel safe turning off RAID1 metadata if it enhances performance. Also, I never initiate balancing (no reason why).
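To illustrate the remark above that CoW re-writes create detached/split extents, here is a minimal toy model in Python. It is only a sketch: the `Extent` tuple and the functions `cow_overwrite` and `inplace_overwrite` are hypothetical names invented for this illustration and do not reflect Btrfs internals or any real API. It assumes a file described as a sorted list of non-overlapping extents, measured in blocks.

```python
from typing import List, Tuple

# (file_offset, length, extent_id), all in blocks. A toy stand-in for
# extent records, not the real Btrfs on-disk format.
Extent = Tuple[int, int, int]


def cow_overwrite(extents: List[Extent], start: int, length: int,
                  new_id: int) -> List[Extent]:
    """Return the extent list after a CoW overwrite of [start, start + length)."""
    w_end = start + length
    result: List[Extent] = []
    for off, ln, eid in extents:
        end = off + ln
        if end <= start or off >= w_end:
            result.append((off, ln, eid))  # untouched by this write
            continue
        if off < start:
            result.append((off, start - off, eid))  # surviving left piece
        if end > w_end:
            result.append((w_end, end - w_end, eid))  # surviving right piece
    # Under CoW the new data is written elsewhere and gets its own extent.
    result.append((start, length, new_id))
    return sorted(result)


def inplace_overwrite(extents: List[Extent], start: int,
                      length: int) -> List[Extent]:
    """Contrast case: an in-place write leaves the extent map unchanged."""
    return list(extents)


if __name__ == "__main__":
    image = [(0, 100, 0)]  # one contiguous 100-block disk image
    image = cow_overwrite(image, 10, 10, new_id=1)
    image = cow_overwrite(image, 15, 10, new_id=2)
    for extent in image:
        print(extent)
```

In this model every partial overwrite of a shared or CoW extent adds entries to the extent map, while the in-place variant never does. Real filesystems add further effects (snapshots forcing a first CoW even under 'nodatacow', allocator behaviour, and so on) that the sketch ignores.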
Better, but not quite there yet. Some still-unchanged options from the Qubes OS 4.2.1 installer's filesystem creation defaults:
fstab:
Ran Happening on @tasket: I thought reflink was not supposed to copy the image, only reference disk images.
Reflink copy will duplicate all the extent information in the source file's metadata to the dest file. It's not like a hard link (which is just one pointer to an inode) but usually much bigger. I am pretty sure Wyng is using reflink copy the same way the Qubes Btrfs driver is. One difference is that after making reflinks, Wyng creates a read-only subvol snapshot, reads extent metadata from it, then deletes the snapshot (when it displays "Acquiring deltas"). You might try looking at a 'top' listing during that phase to see if there is anything unusual. For volumes over a certain size (about 128GB) Wyng will use a tmp directory in /var instead of /tmp; the more complex/deduped a large volume is, the more data it will write to /var (vaguely possible it's creating your spike, but unlikely). Also check for swap activity.
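The difference between a hard link and a reflink copy described above can be sketched with a small toy model. Everything here (`Inode`, `ToyFS`, `hard_link`, `reflink_copy`, `metadata_refs`) is a hypothetical name made up for illustration; it is not how Btrfs, Wyng or the Qubes driver are implemented, only a picture of where the metadata cost comes from.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Inode:
    # Ids of the data extents this file points at (toy representation).
    extent_refs: List[int] = field(default_factory=list)


@dataclass
class ToyFS:
    names: Dict[str, Inode] = field(default_factory=dict)

    def hard_link(self, src: str, dst: str) -> None:
        # One extra directory entry pointing at the very same inode.
        self.names[dst] = self.names[src]

    def reflink_copy(self, src: str, dst: str) -> None:
        # A new inode with its own copy of every extent reference.
        # The data extents themselves stay shared.
        self.names[dst] = Inode(extent_refs=list(self.names[src].extent_refs))

    def metadata_refs(self) -> int:
        # Count extent references once per distinct inode.
        distinct = {id(inode): inode for inode in self.names.values()}
        return sum(len(inode.extent_refs) for inode in distinct.values())


if __name__ == "__main__":
    fs = ToyFS()
    fs.names["root.img"] = Inode(extent_refs=list(range(1000)))
    fs.hard_link("root.img", "hard.img")
    print("after hard link:", fs.metadata_refs())
    fs.reflink_copy("root.img", "clone.img")
    print("after reflink copy:", fs.metadata_refs())
```

In this model the hard link adds no extent references, while the reflink copy adds as many as the source file has, which is why reflinking a heavily fragmented image is not free even though no data is copied.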
PS: Look at
How do snapshots make a difference here vs. reflinks? If you want
But with snapshots or with reflinks, I think the limiting factor is QubesOS/qubes-issues#8767, i.e. the inherent amount of CoW currently required by the Qubes OS storage API.
HOULA. That consumed all available space on the pool. My use case involves cloning qubes and templates, which is totally incompatible with defragmenting, so this should not be advised at all. I am currently deleting wyng1 files: the defrag command ended in errors because the 1.8 TB btrfs partition became full as a result, whereas it was at 40% before the defrag operation. This is not linked to bees at all, since it was not deployed. Defragmenting clones made the pool fill up and should be avoided at all costs. I will most probably have to go back the bees way to revert this action. Hopefully that will fix it; I should never have run this command.
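Why defragmenting reflinked clones can fill a pool can be sketched with a worst-case toy model. It assumes defrag rewrites all of a file's data into fresh private extents, which breaks sharing with the other clones; a real defrag run may rewrite only part of a file. `ToyPool` and its methods are hypothetical names for this illustration, and sizes are in arbitrary units (think GiB).

```python
from typing import Dict, List


class ToyPool:
    """Tracks which files reference which extents, and the space in use."""

    def __init__(self) -> None:
        self.extent_size: Dict[int, int] = {}  # extent id -> size
        self.files: Dict[str, List[int]] = {}  # file name -> extent ids
        self._next_id = 0

    def _new_extent(self, size: int) -> int:
        eid = self._next_id
        self._next_id += 1
        self.extent_size[eid] = size
        return eid

    def create(self, name: str, extent_sizes: List[int]) -> None:
        self.files[name] = [self._new_extent(s) for s in extent_sizes]

    def clone(self, src: str, dst: str) -> None:
        # Reflink-style clone: share the same extents, copy only references.
        self.files[dst] = list(self.files[src])

    def defrag(self, name: str) -> None:
        # Worst case: all data is rewritten into one new, unshared extent.
        total = sum(self.extent_size[e] for e in self.files[name])
        self.files[name] = [self._new_extent(total)]

    def used(self) -> int:
        # Space is held by every extent still referenced by some file.
        live = {e for refs in self.files.values() for e in refs}
        return sum(self.extent_size[e] for e in live)


if __name__ == "__main__":
    pool = ToyPool()
    pool.create("template", [1] * 10)
    for i in range(4):
        pool.clone("template", f"clone{i}")
    print("used before defrag:", pool.used())
    for i in range(4):
        pool.defrag(f"clone{i}")
    print("used after defrag:", pool.used())
```

In the model, space use grows from one copy of the template to one copy per defragmented clone, which matches the direction of the jump reported above. Deduplication afterwards would be the step that restores sharing.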
Deleting wyng1 files has freed more than 522 GB so far, and counting.
The latest master branch (v0.11 rc) should properly revert this action and may even leave you in a better situation than before regarding fragmentation, because the new bees will prefer larger extents over small ones with the new scan mode. So defragmenting and then letting bees do its job is actually quite a good approach. You'll probably see an increase in metadata, but I could see a subjective improvement in data access after such an operation (not in QubesOS, though).
I'll revisit this issue and update this post with details as I gather information.
First, sharing the current frozen screen, waiting for the system to deal with iowait and write changes to disk after having run beesd on QubesOS deployed with qusal. This means a lot of clones were created from base minimal templates, then specialized with different packages installed in derived templates, while the origins of the clones were also updated. In other words, the clones started as intact copies of their origin, then those disk images in the reflink pool diverged, and bees deduped the extents that were the same.
Notes:
qvm-volume revert qube:volume
helper. This permits the end user to revert up to two states of a qube after shutting it down upon realizing they made a stupid mistake, e.g. wiping ~/, for up to two subsequent reboots of the qube, without needing to rely on backups to restore files or disk image states.
@tasket @Zygo: do you have guidelines on the btrfs tuning best known to work for CoW disk images in a virtualization context, and more specifically for the QubesOS use case of btrfs? I am willing to reinstall and restore from backup if needed, though from my current understanding most settings can/should be tweakable via balancing/fstab/tunefs without needing to reinstall.
Any insights welcome. I repeat: if I defrag, the dedup is undone and performance goes back to normal. Thanks for your time.