
[6.6] Track btrfs patches #31

Closed
wants to merge 5 commits into from


kakra
Owner

@kakra kakra commented Nov 26, 2023

Export patch series: https://github.com/kakra/linux/pull/31.patch

  • Allocator hint patches: allow preferring SSDs for meta-data allocations while excluding HDDs from meta-data allocation, which greatly improves btrfs responsiveness. The file system remains compatible with non-patched kernels, but those won't honor the allocation preferences (a re-balance is needed to fix the chunk placement after going back to a patched kernel).

To make use of the allocator hints, apply these patches to your kernel. Then run btrfs device usage /path/to/btrfs and take note of which device IDs are SSDs and which are HDDs.
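A minimal sketch for this step, assuming bash and awk: the hypothetical helper list_btrfs_devids (not part of btrfs-progs) reads btrfs device usage output on stdin and prints one "ID path" pair per device, so IDs can be matched to disks. It only relies on the "/dev/..., ID: N" header lines shown in the real-world example; other btrfs-progs versions may format them differently.

```shell
#!/usr/bin/env bash
# list_btrfs_devids (hypothetical helper): read "btrfs device usage" output
# on stdin and print "ID PATH" for each device header line such as
#   /dev/sde4, ID: 4
list_btrfs_devids() {
    awk '/, ID: [0-9]+$/ {
        path = $1
        sub(/,$/, "", path)   # strip the trailing comma from the device path
        print $NF, path       # $NF is the numeric device ID
    }'
}

# Usage (as root, like in the real-world example):
#   btrfs device usage /path/to/btrfs | list_btrfs_devids
# To tell SSDs from HDDs, "lsblk -d -o NAME,ROTA" shows ROTA=1 for rotational
# disks; stacked devices such as bcache may not report their backing disk's
# type, so double-check those by hand.
```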

Go to /sys/fs/btrfs/BTRFS-UUID/devinfo and run:

  • echo 0 | sudo tee HDD-ID/type to prefer writing data to this device (btrfs will then prefer allocating data chunks from this device before considering other devices) - recommended for HDDs, set by default
  • echo 1 | sudo tee SSD-ID/type to prefer writing meta-data to this device (btrfs will then prefer allocating meta-data chunks from this device before considering other devices) - recommended for SSDs
  • There are also types 2 and 3, which write meta-data only (2) or data only (3) to the specified device - not recommended, as they can result in early out-of-space situations
  • Added 2024-06-27: Type 4 can be used to avoid allocating new chunks from a device, useful if you plan on removing the device from the pool in the future: echo 4 | sudo tee LEGACY-ID/type
  • NEVER EVER use type 2 or 3 if you only have one type of device
  • The default "preferred" heuristics (0 and 1) are good enough because btrfs will always allocate from the devices with the most free space first (respecting the "preferred" type with this patch)
  • After changing the values, a one-time meta-data and/or data balance (optionally filtered to the affected device IDs) is needed
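The steps above can be scripted roughly as follows. This is a sketch assuming bash and root privileges; set_dev_type is a hypothetical helper (not part of btrfs-progs) that checks the value against the types 0 to 4 listed above and writes it to the patched kernel's devinfo/ID/type file, just like the echo | sudo tee commands do.

```shell
#!/usr/bin/env bash
# set_dev_type (hypothetical helper): DEVINFO_DIR DEVID TYPE
#   TYPE 0 = prefer data (default), 1 = prefer meta-data,
#        2 = meta-data only, 3 = data only, 4 = avoid new chunk allocations
set_dev_type() {
    local devinfo=$1 devid=$2 type=$3
    case $type in
        [0-4]) ;;                       # exactly one digit from 0 to 4
        *) echo "invalid allocation type: $type" >&2; return 1 ;;
    esac
    if [ ! -d "$devinfo/$devid" ]; then
        echo "no such device ID under $devinfo: $devid" >&2
        return 1
    fi
    echo "$type" > "$devinfo/$devid/type"
}

# Usage as root, with the device IDs from the real-world example
# (IDs 1-3 are HDD-backed bcache devices, 4-5 are SSD partitions):
#   devinfo=/sys/fs/btrfs/BTRFS-UUID/devinfo
#   for id in 1 2 3; do set_dev_type "$devinfo" "$id" 0; done
#   for id in 4 5;   do set_dev_type "$devinfo" "$id" 1; done
# Then move existing meta-data chunks off the HDD-backed IDs once:
#   btrfs balance start -mdevid=1 /   # repeat for each affected ID
```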

Important note: This setup calls for at least two independent SSDs so that the btrfs meta-data raid1 requirement is still satisfied. You can, however, create two partitions on the same SSD, but then the meta-data is no longer protected against hardware faults: it's essentially dup-quality meta-data, not raid1. Before sizing the partitions, check btrfs device usage for the current amount of meta-data, and make each meta-data partition at least double that size.
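The sizing rule of thumb is simple arithmetic. A sketch assuming bash and awk, with a hypothetical helper that reads btrfs device usage -b output (raw byte counts) and prints twice the largest per-device meta-data allocation. For the real-world example, 27.00 GiB of meta-data per device gives a minimum of 54 GiB per partition, well below the 128 GB partitions actually used.

```shell
#!/usr/bin/env bash
# min_meta_partition_bytes (hypothetical helper): read "btrfs device usage -b"
# output on stdin and print 2x the largest per-device meta-data allocation.
min_meta_partition_bytes() {
    awk 'BEGIN { max = 0 }
         $1 ~ /^Metadata,/ { if ($2 + 0 > max) max = $2 + 0 }
         END { printf "%.0f\n", 2 * max }'
}

# Usage (as root):
#   btrfs device usage -b / | min_meta_partition_bytes
```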

This can be combined with bcache by using the meta-data partitions directly as native SSD partitions for btrfs and routing only the data partitions through bcache. This also takes a lot of meta-data pressure off bcache, making it more efficient and causing less SSD write wear as a result.

Real-world example

In this example, sde is a 1 TB SSD with two meta-data partitions (2x 128 GB); the remaining space is dedicated to a single bcache partition that caches my btrfs pool devices:

# btrfs device usage /
/dev/bcache2, ID: 1
   Device size:             3.63TiB
   Device slack:            3.50KiB
   Data,single:             1.66TiB
   Unallocated:             1.97TiB

/dev/bcache0, ID: 2
   Device size:             3.63TiB
   Device slack:            3.50KiB
   Data,single:             1.66TiB
   Unallocated:             1.97TiB

/dev/bcache1, ID: 3
   Device size:             2.70TiB
   Device slack:            3.50KiB
   Data,single:           752.00GiB
   Unallocated:             1.96TiB

/dev/sde4, ID: 4
   Device size:           128.00GiB
   Device slack:              0.00B
   Metadata,RAID1:         27.00GiB
   System,RAID1:           32.00MiB
   Unallocated:           100.97GiB

/dev/sde5, ID: 5
   Device size:           128.01GiB
   Device slack:              0.00B
   Metadata,RAID1:         27.00GiB
   System,RAID1:           32.00MiB
   Unallocated:           100.98GiB

# bcache show
Name            Type            State                   Bname           AttachToDev
/dev/sdd2       1 (data)        dirty(running)          bcache1         /dev/sde2
/dev/sdb2       1 (data)        dirty(running)          bcache2         /dev/sde2
/dev/sde2       3 (cache)       active                  N/A             N/A
/dev/sdc2       1 (data)        clean(running)          bcache3         /dev/sde2
/dev/sda2       1 (data)        dirty(running)          bcache0         /dev/sde2

A curious reader may notice that sde1 and sde3 are missing: they are my EFI boot partition (sde1) and swap space (sde3).
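After re-balancing, the layout can be checked against the goal of the patches: meta-data chunks should only show up on the SSD device IDs. A sketch assuming bash and awk; the helper name is hypothetical, and it only parses the text format shown in the example output above.

```shell
#!/usr/bin/env bash
# devids_with_metadata (hypothetical helper): read "btrfs device usage" output
# on stdin and print each device ID that holds Metadata chunks (once per ID).
devids_with_metadata() {
    awk '/, ID: [0-9]+$/ { id = $NF; next }       # remember the current device
         $1 ~ /^Metadata,/ && id != "" && !seen[id]++ { print id }'
}

# Usage (as root); for the example above this should print only 4 and 5:
#   btrfs device usage / | devids_with_metadata
```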

@Forza-tng

Hi @kakra, any chance of a rebase against 6.7? The 6.6 patches do apply to the 6.7 kernel, but with some offsets, so it's not a problem yet.

@kakra
Owner Author

kakra commented Feb 29, 2024

I'm only doing this for LTS kernels. Maintaining it for the fast-moving kernel versions involved too much overhead in the past.

But yes, this one probably applies cleanly, with some offsets, to 6.7 and later. It has applied without conflicts since 6.1, or even 5.19. I'm sure it will apply to 6.8, too.

@kakra
Owner Author

kakra commented Jun 27, 2024

Added: type 4 for avoiding new chunk allocations on a disk, useful if you plan to remove the drive in the future, e.g. due to degrading performance or damaged sectors.

@kakra
Owner Author

kakra commented Jul 16, 2024

Added a patch to prevent a very rare and obscure corruption from reaching the disk: the file system will still flip to read-only (RO), but it won't corrupt itself. The patch will be removed when the upstream patches reach the stable LTS kernel.

@kakra kakra force-pushed the rebase-6.6/btrfs-patches branch from cffc5a2 to 76e1b3e on July 19, 2024 09:16
kreijack and others added 5 commits September 9, 2024 21:25
Add the following flags to give a hint about which chunk type should be
allocated on which disk.
The following flags are created:

- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
  preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
  preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
  only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY
  only data chunk allowed

Signed-off-by: Goffredo Baroncelli <[email protected]>
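These flags correspond to the numeric values written to devinfo/ID/type in the usage instructions at the top (0 to 3; type 4 comes from a later patch in this series and is not one of these flags). A small sketch, assuming bash, of a hypothetical helper that translates a flag name into that number:

```shell
#!/usr/bin/env bash
# alloc_type_value (hypothetical helper): map an allocation flag name, with or
# without the BTRFS_DEV_ALLOCATION_ prefix, to the sysfs "type" value.
alloc_type_value() {
    case ${1#BTRFS_DEV_ALLOCATION_} in
        PREFERRED_DATA)     echo 0 ;;  # default
        PREFERRED_METADATA) echo 1 ;;
        METADATA_ONLY)      echo 2 ;;
        DATA_ONLY)          echo 3 ;;
        *) echo "unknown allocation flag: $1" >&2; return 1 ;;
    esac
}

# Example: alloc_type_value BTRFS_DEV_ALLOCATION_PREFERRED_METADATA   # prints 1
```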
When this mode is enabled, the chunk allocation policy is modified as
follows.

Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
This also means that it is a preferred choice.

Each time the allocator allocates a chunk of type X, it first takes the
disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
is not enough, it also uses the disks tagged as ALLOCATION_METADATA_ONLY;
if the space is still not enough, it also uses the other disks, with the
exception of the one marked as ALLOCATION_PREFERRED_Y, where Y is the other
type of chunk (i.e. not X).

Signed-off-by: Goffredo Baroncelli <[email protected]>
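To illustrate the policy, here is a deliberately simplified model, assuming bash, awk and sort, based on the user-facing description at the top rather than on the kernel code: for a chunk of type X, devices tagged X_ONLY or PREFERRED_X are tried first, devices preferring the other type only as a fallback, and Y_ONLY or type 4 devices are skipped; within each group, devices with more free space come first. The real allocator also handles RAID profiles and free extents, so the patch may differ in details. The helper name and its "devid type free_bytes" input format are hypothetical.

```shell
#!/usr/bin/env bash
# order_devices (hypothetical helper) CHUNK_TYPE
#   CHUNK_TYPE: "data" or "metadata"
#   stdin: one "devid type free_bytes" line per device
#   stdout: device IDs in the order this simplified model would try them
order_devices() {
    case $1 in
        data|metadata) ;;
        *) echo "usage: order_devices data|metadata" >&2; return 1 ;;
    esac
    awk -v chunk="$1" '
    {
        id = $1; type = $2 + 0; free = $3
        if (chunk == "data") {
            if (type == 0 || type == 3) tier = 0   # PREFERRED_DATA, DATA_ONLY
            else if (type == 1) tier = 1           # PREFERRED_METADATA: fallback
            else next                              # METADATA_ONLY, type 4: skip
        } else {
            if (type == 1 || type == 2) tier = 0   # PREFERRED_METADATA, METADATA_ONLY
            else if (type == 0) tier = 1           # PREFERRED_DATA: fallback
            else next                              # DATA_ONLY, type 4: skip
        }
        print tier, free, id
    }' | sort -k1,1n -k2,2nr | awk '{ print $3 }'  # tier first, then most free space
}
```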
This is useful when you want to prevent new chunk allocations on a
disk that is going to be removed from the pool anyway, e.g. due to bad
blocks or because it's slow.

Signed-off-by: Kai Krakow <[email protected]>
@kakra kakra force-pushed the rebase-6.6/btrfs-patches branch from 76e1b3e to 186029a on September 9, 2024 19:33
@kakra
Owner Author

kakra commented Sep 9, 2024

The corruption fix patch is part of v6.6.50, so I dropped "validate dref root and objectid".

@kakra
Owner Author

kakra commented Nov 23, 2024

Obsolete, see #36 instead.

@kakra kakra closed this Nov 23, 2024
Labels
None yet
Projects
None yet

3 participants