diff --git a/README.md b/README.md
index 8124ac9..39585bf 100644
--- a/README.md
+++ b/README.md
@@ -10,810 +10,903 @@ associated problem space.
 point that out explicitly and clearly in the associated patches and Cc
 `Christian Brauner <brauner (at) kernel (dot) org`.**
 
-* Ability to mount sub-directories of regular file systems instead of
-  the top-level directory. i.e. for a file system `/dev/sda1` which
-  contains a sub-directory `/foobar` mount `/foobar` without having
-  to mount its parent directory first. Consider something like this:
-
-  ```
-  mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
-  ```
-
-  (This is of course already possible via some mount namespacing
-  shenanigans, but this requires namespacing to be available, and is
-  not precisely obvious to implement. Explicit kernel support at mount
-  time would be much preferable.)
-
-  **Use-Case:** `systemd-homed` currently mounts a sub-directory of
-  the per-user LUKS volume as the user's home directory (and not the
-  root directory of the per-user LUKS volume's file system!), and in
-  order to implement this invisibly from the host side requires a
-  complex mount namespace exercise.
-
-* inotify() events for BSD file locks. BSD file locks
-  (i.e. `flock()`, as opposed to POSIX `F_SETLK` and friends are
-  inode-focussed, hence would be great if one could get asynchronous
-  notification when they are released via inotify.
-
-  **Use-Case:** udevd probes block devices whenever they pop up to
-  create /dev/disk/by-label/* and similar symlinks. Formatting tools
-  can temporarily block this behaviour by taking a BSD file lock on
-  the block device (as per https://systemd.io/BLOCK_DEVICE_LOCKING),
-  in order to make sure udevd doesn't probe file systems/partition
-  tables that are only partially initialized. Currently, udevd uses
-  inotify `IN_CLOSE_WRITE` notifications to detect whenever
-  applications close a block device after writing to it, and
-  automatically reprobes the device. This works reasonably OK given
-  that block devices are usually closed at the same time as their
-  BSD file lock is released, and vice versa. However, this is not
-  fully correct: what udevd actually should be watching is the locks
-  being released, not the devices being closed.
-
-* `SCM_CGROUP` or a similar auxiliary socket message, that allows
-  receivers to figure out which cgroup a sender is part of.
-
-  **Use-Case:** `systemd-journald` picks up cgroup information from
-  logging clients, in order to augment log records and allow
-  filtering via this meta-information. In particular it derives
-  service identity from that (so that requests such as "Show me all
-  log messages of service X!" can be answered). This is currently
-  racy, since it uses `SCM_CREDS`' `.pid` field for this, which it then
-  used to load `/proc/$PID/cgroup`. In particular for programs that
-  log and immediately exit, the cgroup information frequently cannot
-  be acquired anymore by `systemd-journald`.
-
-* Ability to link an `O_TMPFILE` file into a directory while *replacing* an
-  existing file. (Currently there's only the ability to link it in, if the
-  file name doesn't exist yet.)
-
-  **Use-Case:** there are many programs (e.g. `systemd-hostnamed`
-  when updating `/etc/hostname`) that atomically want to update a
-  file, so that either the old or the new version is in place, but
-  never a partially updated one. The canonical way to do this is by
-  creating a temporary file with the new contents, and then renaming
-  it to the filename of the file to update, thus atomically replacing
-  it. Currently, the temporary file for this must be created with a
-  random name, `O_TMPFILE` cannot be used, since for these files
-  atomic-replace is not supported, currently.
-
-* `O_REGULAR` (inspired by the existing `O_DIRECTORY` flag for
-  `open()`), which opens a file only if it is of type `S_IFREG`.
-
-  **Use-Case:** this would be very useful to write secure programs
-  that want to avoid being tricked into opening device nodes with
-  special semantics while thinking they operate on regular
-  files. This is particularly relevant as many device nodes (or even
-  FIFOs) come with blocking I/O (or even blocking `open()`!) by
-  default, which is not expected from regular files backed by "fast"
-  disk I/O. Consider implementation of a naive web browser which is
-  pointed to `file://dev/zero`, not expecting an endless amount of
-  data to read.
-
-* `unlinkat3(dir_fd, name, inode_fd)`: taking one file descriptor
-  for the directory to remove a file in, and another one referring
-  to the inode of the filename to remove. This call should only
-  succeed if the specified filename still refers to the specified
-  inode.
-
-  **Use-Case:** code that operates on a well-know path that might be
-  shared by multiple programs that jointly manage it might want to
-  safely remove a filename under the guarantee it still refers to
-  the expected inode. As a specific example, consider lock files,
-  that should be cleaned up only if they still refer to the assumed
-  owner's instance, but leave the file in place if another process
-  already took over the filename.
-
-* Ability to determine if a mount point belongs to the current user
-  namespace, in order to check if there's a chance a process can
-  safely unmount it (as that only works for mounts owned by the same
-  user namespaces — or one further down the tree, but not any up the
-  tree). A simple, additional field in `/proc/self/mountinfo`
-  container the owning user namespace ID would probably already
-  suffice.
-
-  **Use-Case:** the `systemd` system and service manager tries to unmount
-  all established mounts on shutdown. Inside of container
-  environments where specific mounts are established by the
-  container manager (and not the payload itself) this will
-  ultimately fail if user namespaces are enabled. In order to clean
-  up the shutdown logic it would be very good to be able to
-  determine whether a specific mount could even possibly be
-  unmounted or whether it's not worth the effort to include the
-  unmount in the system shutdown transaction.
-
-* A way to mark mounts that receive mount propagation events from
-  elsewhere so that these propagated mounts are established
-  read-only implicitly. Right now, if a mount receives a mount
-  propagation event it will have the exact same `MS_RDONLY`,
-  `MS_NODEV`, … flags as it has where it originated. It would be
-  very useful if an `MS_RDONLY` could be ORed into the mount flags
-  automatically whenever propagated elsewhere.
-
-  **Use-Case:** various mount namespace based sandboxes
-  (e.g. `systemd`'s `ProtectSystem=` option) mark large parts of the
-  host file hierarchy read-only via mounting it
-  `MS_RDONLY|MS_BIND|MS_REMOUNT`, but generally intend to leave the
-  file hierarchy besides that the way it is, and that includes they
-  typically still want to be able to receive mount events to
-  directories such as `/mnt/` and `/media/` in these sandboxed
-  environments. Right now, any such propagation then happens in
-  writable mode, even if the file hierarchy otherwise is almost
-  entirely read-only. To close this gap it would be great if such
-  propagated mounts could implicitly gain `MS_RDONLY` as they are
-  propagated.
-
-* Ability to turn off `SCM_RIGHTS` reception for `AF_UNIX`
-  sockets. Right now reception of file descriptors is always on when
-  a process makes the mistake of invoking `recvmsg()` on such a
-  socket. This is problematic since `SCM_RIGHTS` installs file
-  descriptors in the recipient process' file descriptor
-  table. Getting rid of these file descriptors is not necessarily
-  easy, as they could refer to "slow-to-close" files (think: dirty
-  file descriptor referring to a file on an unresponsive NFS server,
-  or some device file descriptor), that might cause the recipient to
-  block for a longer time when it tries to them. Programs reading
-  from an `AF_UNIX` socket currently have three options:
-
-  1. Never use `recvmsg()`, and stick to `read()`, `recv()` and
-     similar which do not install file descriptors in the recipients
-     file descriptor table.
-
-  2. Ignore the problem, and simply `close()` the received file descriptors
-     it didn't expect, thus possibly locking up for a longer time.
-
-  3. Fork off a thread that invokes `close()`, which mitigates the
-     risk of blocking, but still means a sender can cause resource
-     exhaustion in a recipient by flooding it with file descriptors,
-     as for each of them a thread needs to be spawned and a file
-     descriptor is taken while it is in the process of being closed.
-
-  (Another option of course is to never talk `AF_UNIX` to peers that
-  are not trusted to not send unexpected file descriptors.)
-
-  A simple knob that allows turning off `SCM_RIGHTS` right reception
-  would be useful to close this weakness, and would allow
-  `recvmsg()` to be called without risking file descriptors to be
-  installed in the file descriptor table, and thus risking a
-  blocking `close()` or a form of potential resource exhaustion.
-
-  **Use-Case:** any program that uses `AF_UNIX` sockets and uses (or
-  would like to use) `recvmsg()` on it (which is useful to acquire
-  other metadata). Example: logging daemons that want to collect
-  timestamp or `SCM_CREDS` auxiliary data, or the D-Bus message
-  broker and suchlike.
-
-* Another alternative to this could be if some form of filtering
-  could be enforced on the file descriptors suitable for en-queuing
-  on the `AF_UNIX` socket. i.e. allow filtering by superblock type
-  or similar, so that policies such as "only `memfd`s are OK to be
-  received" may be expressed. (BPF?).
-
-  **Use-Case:** a above.
-
-* A reliable (non-heuristic) way to detect from userspace if the
-  current process is running in a PID namespace that is not the main
-  PID namespace. PID namespaces are probably the primary type of
-  namespace that identify a container environment. While many
-  heuristics exist to determine generically whether one is executed
-  inside a container, it would be good to have a correct,
-  well-defined way to determine this.
-
-  **Use-Case:** tools such as `systemd-detect-virt` exist to determine
-  container execution, but typically resolve to checking for
-  specific implementations. It would be much nicer and universally
-  applicable if such a check could be done generically. It would
-  probably suffice to provide an `ioctl()` call on the `pidns` file
-  descriptor that reveals this kind of information in some form.
-
-* A way to exclude `pidfd` watched processes from `waitid(P_ALL, …)`.
-
-  **Use-Case:** various programs use `waitid(P_ALL, …)` to collect exit
-  information of exited child processes. In particular PID 1 and
-  processes using `PR_SET_CHILD_SUBREAPER` use this as they may
-  collect unexpected children that have been reparented from dying
-  sub-processes, and that need to be reaped in order to clean up the
-  PID space. Currently, these programs cannot easily mix waiting for
-  specific sub-processes via `pidfd` with waiting for the other
-  *unexpected* children via `waitid(P_ALL, …)` since the latter also
-  reaps (and thus invalidates) the pidfd-tracked
-  children. Specifically, the `systemd` service manager would like
-  to use `pidfd`s to remove PID recycling security issues, but
-  currently cannot as it also needs to generically wait for such
-  unexpected children.
-
-* Mount notifications that do not require continuous re-scanning of
-  `/proc/self/mountinfo`. Currently, if a program wants to track
-  mounts established on the system it can receive `poll()`able
-  events via a file descriptor to `/proc/self/mountinfo`. When
-  receiving them it needs to rescan the file from the top and
-  compare it with the previous scan. This is both slow and
-  racy. It's slow on systems with a large number of mounts as the
-  cost for re-scanning the table has to be paid for every change to
-  the mount table. It's racy because quickly added and removed
-  mounts might not be noticed.
-
-  **Use-Case:** `systemd` tracks the mount table to integrate the mounts
-  into it own dependency management.
-
-* An asynchronous or forced `close()`, that guarantees that
-  userspace doesn't have to risk blocking for longer periods of time
-  when trying to get rid of unwanted file descriptors, possibly
-  received via `recvmsg()` + `SCM_RIGHTS` (see above). Currently,
-  `close()` of various file descriptors (for example those referring
-  to slow storage, e.g. non-responding NFS servers and such) might
-  take arbitrary amounts of time, potentially into the minute range
-  and more. This makes it risky accepting file descriptors on
-  publicly accessible `AF_UNIX` sockets, the way like IPC brokers
-  (e.g. D-Bus) do it: if a rogue client keeps sending file
-  descriptors that because unexpected must be closed immediately it
-  might cause the receiving process to effectively crawl, when it is
-  busy closing them all. A special form of `close()` that simply
-  detaches a file descriptor from the file descriptor table without
-  blocking on IO in any form would be great to close this issue.
-
-  **Use-Case:** any program that receives file descriptors via `AF_UNIX`
-  from untrusted clients would benefit from this. e.g. D-Bus
-  brokers.
-
-* `CLOCK_MONOTONIC` network timestamps. Currently network timestamps
-  are exclusively in `CLOCK_REALTIME`, even though for many (most?)
-  a monotonic clock would be much preferable, as calculations become
-  easier when one doesn't have to think about clock jumps and
-  similar.
-
-  **Use-Case:** `systemd-journald` collects implicit timestamps via
-  `AF_UNIX` time-stamping, in `CLOCK_REALTIME`, even though for its
-  internal logic only monotonic timestamps are used, as log records
-  are searched via bisection in ordered tables, that require
-  strictly increasing timestamps. In particular during boot (where
-  `CLOCK_REALTIME` is often not available, stable or subject to
-  corrections) it would be good to have reliable, monotonic
-  timestamps on all log records.
-
-* Truly immutable loopback block devices. Right now setting up a
-  loopback block device in read-only mode, backed by a read-only
-  file (stored on a regular read/write file system), and then
-  mounting it with `ext4` also in `MS_RDONLY`mode *will* result in
-  changes to the file, quite unexpectedly 🤯. Ideally, if a loopback
-  block device is set up in read-only mode this should guarantee
-  that the backing file remains unmodified by it.
-
-  **Use-Case:** disk image build tools that want to reproducibly and
-  verifiable build images must be able to rely that mounting them in
-  read-only mode does not alter the images in any way. In particular
-  when working in computer forensics one must be able to rely that
-  file systems that are analyzed remain unmodified by the analysis.
-
-* A time-out for the flock() syscall. Faking the time-out in
-  userspace is nasty: most code does it with alarm() (or equivalent
-  APIs), but that's racy since on a heavily loaded system the
-  timeout might trigger before the flock() call is entered, in
-  particular if short time-outs shall be used. More accurate is to
-  do the locking in a short-lived child processed, but that's
-  difficult already in C, and almost impossible in languages that do
-  not allow fork() without execve().
-
-  **Use-Case:** as mentioned above systemd-udev allows synchronizing
-  block device probing via flock(). Often userspace wants to wait
-  for that, but without risking to hang forever.
-
-* Extend `mount_setattr()` to allow changing mount properties ignoring any
-  failures.
-
-  **Use-Case:** workloads that know that there are mounts in a mount tree
-  whose attributes cannot be changed by the caller don't want
-  `mount_settattr()` to fail on the first mount it failed to convert. Give
-  them a flag to request changes ignoring failures.
-
-* Add upgrade masks to `openat2()`. Extend `struct open_how` to allow
-  restrict re-opening of file descriptors.
-
-  **Use-Case:** block services or containers from re-opening/upgrading an
-  `O_PATH` file descriptor through e.g. `/proc/<pid>/fd/<nr` as `O_WRONLY`.
-
-* Make quotas work with user namespaces. The quota codepaths in the kernel
-  currently broken and inconsistent and most interesting operations are
-  guarded behind `capable(CAP_SYS_ADMIN)`, i.e., require `CAP_SYS_ADMIN` in
-  the initial user namespace. We should rework these codepaths to work with
-  user namespaces and then see whether we can make them work with idmapped
-  mounts.
-
-  **Use-Case:** using quotas correctly in containers.
-
-* Add a new `MOUNT_SETATTR_LOCKED` flag to `mount_setattr(..., ..., MOUNT_SETATTR_LOCKED, ..., ...)`.
-  The `MOUNT_SETATTR_LOCKED` flag allow a `ns_capable(mntns->user_ns,
-  CAP_SYS_ADMIN)` caller to lock all mount properties. The mount properties
-  cannot be changed anymore.
-
-  **Use-Case:** allowing processes to lock mount properties even for
-  privileged processes. Locking mount properties would currently involve
-  having to have the mount namespace of the container be owned by an ancestor
-  user namespace. But this doesn't just lock a single mount or mount subtree
-  it locks all mounts in the mount namespace, i.e., the mount table cannot be
-  altered.
-
-* Add a new `OPEN_TREE_CLEAR` flag to `open_tree()`. This flag can only be
-  used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear
-  all mount properties from that mount including the mount's idmapping.
-  Requires the caller to be `ns_capable(mntns->user_ns)`. If idmapped mounts
-  are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)`
-  in the filesystems user namespace.
-
-  Locked mount properties cannot be changed. A mount's idmapping becomes
-  locked if it propagates across user namespaces.
-
-  This is useful to get a new, clear mount and also allows the caller to
-  create a new detached mount with an idmapping attached to the mount. Iow,
-  the caller may idmap the mount afterwards.
-
-  **Use-Case:** A user may already use an idmapped mount for their home
-  directory. And once a mount has been idmapped the idmapping cannot be
-  changed anymore. This allows for simple semantics and allows to avoid
-  lifetime complexity in order to account for scenarios where concurrent
-  readers or writers might still use a given user namespace while it is about
-  to be changed.
-  But this poses a problem when the user wants to attach an idmapping to
-  a mount that is already idmapped. The new flag allows to solve this
-  problem. A sufficiently privileged user such as a container manager can
-  create a user namespace for the container which expresses the desired
-  ownership. Then they can create a new detached mount without any prior
-  mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this
-  mount.
-
-* Enforce that the user namespace about to be attached to a mount must have
-  an idmapping written.
-
-  **Use-Case:** Tighten the semantics.
-
-* Add an extension to `setns()` to allow attaching to all namespaces of
-  a process `SETNS_PIDFD_ALL` different from the caller's namespaces.
-  Currently specifying e.g., `CLONE_NEWUSER` fails if the caller is in the
-  same user namespace as the target process. This is very inconvenient.
-
-  **Use-Case:** Make it trivial to attach to all namespaces of a process
-  without having to figure out whether the caller is already in the same
-  namespace or not.
-
-* (kAPI) Add security hook to `mount_setattr()`.
-
-  **Use-Case:** Allow LSMs to make decisions about what mount properties to
-  allow and what to deny.
-
-* A per-cgroup knob for coredump sizes. Currently coredump size
-  control is strictly per process, and primarily under control of
-  the processes themselves. It would be good if we had a per-cgroup
-  knob instead, that is under control of the service manager.
-
-  **Use-Case:** coredumps can be heavy to generate. For different
-  usecases it would be good to be able to opt-in or opt-out
-  dynamically from coredumps for specific services, at runtime
-  without restarting them.
-
-* A way to race-freely create an (non-file) inode and immediately
-  open it. For regular files we have open(O_CREAT) for creating a
-  new file inode, and returning a pinning fd to it. This is missing
-  for other inode types, such as directories, device nodes,
-  FIFOs. The lack of such functionality means that when populating a
-  directory tree there's always a race involved: the inodes first
-  need to be created, and then opened to adjust their
-  permissions/ownership/labels/timestamps/acls/xattrs/…, but in the
-  time window between the creation and the opening they might be
-  replaced by something else. Addressing this race without proper
-  APIs is possible (by immediately fstat()ing what was opened, to
-  verify that it has the right inode type), but difficult to get
-  right. Hence, mkdirat_fd() that creates a directory *and* returns
-  an O_DIRECTORY fd to it would be great. As would be mknodeat_fd()
-  that creates a device node, FIFO or (dead) socket and returns an
-  O_PATH fd to it. And of course symlinkat_fd() that creates a
-  symlink and returns an O_PATH fd to it.
-
-  **Use-Case:** any program that creates/unpacks not just files, but
-  directories, device nodes, fifos, and wants to ensure that they
-  safely get the right attributes applied, even if other code might
-  simultaneously have access to the same directory tree.
-
-* The io_uring subsystem is open to adding classic existing synchronous
-  system calls (e.g. setns() or mount() or other) to io_uring. They also said
-  they would support adding new functionality into io_uring that is not
-  exposed through system calls yet.
-
-* Explore the idea of mapping different uids/gids to the same uids/gids, i.e.
-  65534:1000:1 50000:1000:1. This will only work if the mount is read-only as
-  the kernel wouldn't know what uid/gid would need to be put to disk
-  otherwise (65534? 50000? the first one that is mapped?).
-
-  **Use-Case:** Delegate multiple {g,u}ids to the same user. Merging
-  ownership similar to how overlayfs merges files. Bindfs
-  (https://bindfs.org/docs/bindfs.1.html#sect3) allows this concept too.
-
-* [`blobfs`](https://fuchsia.dev/fuchsia-src/concepts/filesystems/blobfs)
-  for Linux. i.e. a minimalistic file system, that can store
-  authenticated (Verity) data files, that can be written once, and
-  not be modified after that, and provide stable handles (i.e. is
-  content-addressable) to them.
-
-  **Use-Case:** This would deliver just about enough to place
-  trusted OS resources (binaries, kernels, initrds, fs trees, other
-  resources) in them, without having to trust the medium and IO
-  underneath. Should be simple enough to even implement in a boot
-  loader and similar, without making things vulnerable to rogue file
-  system image attacks. The OS and its payloads (apps, containers,
-  …) could then be composed from these resources, through means like
-  overlayfs, namespacing and more.
-
-* Namespace-able loop and block devices, usable inside user namespaces.
-
-  **Use-Case:** Allow mounting images inside nspawn containers, and using
-  RootImage= and friends in the systemd user manager.
-
-* Support detached mounts with `pivot_root()`
-
-  The new rootfs must currently refer to an attached mount. This restriction
-  seems unnecessary. We should allow the new rootfs to refer to a detached
-  mount.
-
-  This will allow a service- or container manager to create a new rootfs as
-  a detached, private mount that isn't exposed anywhere in the filesystem and
-  then `pivot_root()` into it.
-
-  Since `pivot_root()` only takes path arguments the new rootfs would need to
-  be passed via `/proc/<pid>/fd/<nr>`. In the long run we should add a new
-  `pivot_root()` syscall operating on file descriptors instead of paths.
-
-* Device cgroup guard to allow `mknod()` in non-initial userns
-
-  If a container manager restricts its unprivileged (user namespaced)
-  children by a device cgroup, it is not necessary to deny `mknod()`
-  anymore. Thus, user space applications may map devices on different
-  locations in the file system by using `mknod()` inside the container.
-
-  **Use-Case:** A use case for this, which is applied by users of GyroidOS,
-  is to run `virsh` for VMs inside an unprivileged container. `virsh` or
-  libvirt creates device nodes, e.g., `/var/run/libvirt/qemu/11-fgfg.dev/null`
-  which currently fails in a non-initial userns, even if a cgroup device white
-  list with the corresponding major, minor of `/dev/null` exists. Thus, in
-  this case the usual bind mounts or pre populated device nodes under `/dev`
-  are not sufficient.
-
-  An initial group internal RFC exists in
-  (https://github.com/quitschbo/linux/tree/devcg_guard_rfc).
-  See commit message for more implementation specific details.
-
-* Add `AT_EMPTY_PATH` support for unlinkat().
-
-  **Use-Case:** When dealing with files/directories, allow passing around only a
-  file descriptor without having to keep the path around to be able to unlink
-  the file/directory.
-
-* Race-free mount of block devices
-
-  Introduce a new struct to `fsconfig()` as an alternative to the
-  `source` property. The struct contains at least a pointer to a path,
-  possibly a device minor and major, and a diskseq number. The VFS can
-  expose a helper that filesystems can call and use the diskseq number
-  to verify that the block device they are intending to mount is indeed
-  the one they want to mount.
-
-  **Use-Case:** Race-free mounting of block devices.
-
-* Add ability to reopen a `struct block_device`. This would allow using
-  `blkdev_get_by_path()`/`blkdev_get_{part,whole}()` to claim a device
-  with `BLK_OPEN_READ` and later on reopen with
-  `BLK_OPEN_READ | BLK_OPEN_WRITE`. This in turn would opening block
-  devices at `fsconfig(FS_CONFIG_SET_*)` time and then at `fill_super()`
-  time we would be able to reopen in case the `!(fc->sb_flags & SB_RDONLY)`.
-  Overall this has the effect that we're able to open devices early
-  giving the user early errors when they set mount options rather than
-  very late when the superblock is created.
-
-* When activating a dm-verity volume allow specifying keyring to
-  validate root hash signature against.
-
-  **Usecase:** In systemd, we'd like to authenticate Portable Service
-  images, system extension images, configuration images, container
-  images with different keys, as they typically originate from
-  different sources and it should not be possible to generate a
-  system extension with a key pair that is supposed to be good for
-  container images only.
-
-* Make statx() on a pidfd return additional recognizable identifiers
-  in `.stx_btime`.
-
-  **🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**
-
-  It would be fantastic if issuing statx() on any pidfd would return
-  the start time of the process in `.stx_btime` even after the process
-  died.
-
-  These fields should in particular be queriable *after* the process
-  already exited and has been reaped, i.e. after its PID has already
-  been recycled.
-
-  **Usecase:** In systemd we maintain lists of processes in a hash
-  table. Right now, the key is the PID, but this is less than ideal
-  because of PID recycling. By being able to use the `.stx_btime`
-  and/or `.stx_ino` fields instead would be perfect to safely
-  identify, track and compare process even after they ceased to exist.
-
-* An API to determine the parent process ID (ppid) of a pidfd would be
-  good.
-
-  This information is relevant to code dealing with pidfds, since if
-  the ppid of a pidfd matches the process own pid it can call
-  `waitid()` on the process, if it doesn't it cannot and such a call
-  would fail. It would be very useful if this could be determined
-  easily before even calling that syscall.
-
-  **Usecase:** systemd manages a multitude of processes, most of which
-  are its own children, but many which are not. It would be great if
-  we could easily determine whether it is worth waiting for
-  `SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on
-  them is the only way to get exit notification.
-
-* There should be a way to control the process' `comm` field if
-  started via `fexecve()`.
-
-  Right now, when `fexecve()`/`execveat()` is used, the `comm` field
-  (i.e. `/proc/self/comm`) contains a name derived of the numeric fd,
-  which breaks `ps -C …` and various other tools.  In particular when
-  the fd was opened with `O_CLOEXEC`, the number of the fd in the old
-  process is completely meaningless.
-
-  The goal is add a way to tell `fexecve()`/`execveat()` what Name to use.
-
-  Since `comm` is under user control anyway (via `PR_SET_NAME`), it
-  should be safe to also make it somehow configurable at fexecve()
-  time.
-
-  See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294,
-  https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81.
-
-  **Usecase:** In systemd we generally would prefer using `fexecve()`
-  to safely and race-freely invoke processes, but the fact that `comm`
-  is useless after invoking a process that way makes the call
-  unfortunately hard to use for systemd.
-
-* The LSM module API should have the ability to do path-based (not
-  just inode-based) ACL management.
-
-  **Usecase:** This would be useful in BPF-LSM modules such as
-  systemd's `mntfsd` which allows unprivileged file system mounts in
-  some cases, and which would like to restrict ACL handling based on
-  the superblock involved.
-
-* `overlayfs` should permit *immutable* layers, i.e. layers whose
-  non-directory inodes may not be overriden in an upper writable
-  layer.
-
-  **Usecase:** This would be useful when implementing `/etc/` as a
-  stack of overlayfs layers, each shipping configuration for a
-  different facet of the system, with a writable layer on the top for
-  local modifications. In such a scenario it would be useful to allow
-  the user to change any configuration it likes, except of the files
-  and other inodes shipped in the lower layers.
-
-* `overlayfs` should have an `ioctl()`-based API (or similar) for
-  querying information of the backing file systems/block devices
-
-  **Usecase:** In systemd in various areas we automatically find the
-  block device backing the root file system and other file systems
-  (Example: `systemd-gpt-auto-generator` or `bootctl` wull try to find
-  auxiliary file systems of the OS image by looking in the GPT
-  partition table the root file system is located in). While this
-  logic is good enough to find the backing block devices of some more
-  complex storage such as dm-crypt, dm-verity or btrfs, once
-  `overlayfs` is used as backing for the root file system this logic
-  does not work anymore. It would be great if there was an API to
-  simply query `overlayfs` for the superblock information
-  (i.e. `.st_dev`) of the backing layers.
-
-* An *auto-grow* feature in `btrfs` would be excellent.
-
-  If such a mode is enabled, `btrfs` would automatically grow a file
-  system up to the size of its backing block devices. Example: btrfs
-  is created with 200M in size on a block device 2G in size. Once the
-  file system is filled up fully, `btrfs` would automatically grow the
-  file system as need in the increments it needs, up to the 2G that
-  the backing block device is in size.
-
-  **Usecase:** This would allow creating minimal, compact file
-  systems: just create them small on a sparse block device, and copy
-  files into it, as needed, create subvolumes and whatever else is
-  desired. As long as only files are created and written (but not
-  modified) the resulting fs should be automatically minimal in size.
-  This would specifically be useful in `systemd-homed`, which
-  maintains per-user `btrfs` file systems backed by block
-  devices. Currently, `homed` grows the file systems manually on login
-  and then shrinks them again on logout, but this is less than ideal,
-  since btrfs places files all over the backing store, and thus the
-  shrinking will generate a lot of nonsensical IO that could be
-  reduced if the file system was always kept minimal in size anyway.
-
-* Add process by PIDFD to a cgroup
-
-  At the moment the canonical way to add a process to a cgroup is by
-  echoing its PID into the `cgroup.procs` attribute in the target
-  cgroupfs directory of the cgroup. This is safe as long as the
-  process doing so just forked off the process it wants to migrate and
-  hence can control that it hasn't been reaped yet, and hence
-  guarantees the PID is valid. This is racy however if "foreign"
-  processes shall be moved into the cgroup.
-
-  **Usecase:** In systemd, all user sessions are wrapped in scope
-  units which are backed by a cgroup. The session processes moved into
-  the scope unit are typically "foreign" processes, i.e. not children
-  of the service manager, hence doing the movement is subject to races
-  in case the process dies and its PID is quickly recycled. (This
-  assumes systemd can acquire a pidfd of the foreign process without
-  races, for example via `SCM_PIDFD` and `SO_PEERPIDFD` or similar.)
-
-* Ability to put user xattrs on `S_IFSOCK` socket inodes
-
-  Currently, the kernel only allows extended attributes in the
-  `user.*` namespace to be attached to directory and regular file
-  inodes. It would be tremendously useful to allow them to be
-  associated with socket inodes, too.
-
-  **Usecase:** There are two syslog RFCs in use today: RFC3164 and
-  RFC5424. `glibc`'s `syslog()` API generates events close to the
-  former, but there are programs which would like to generate the
-  latter instead (as it supports structured logging). The two formats
-  are not backwards compatible: a client sending RFC5424 messages to a
-  server only understanding RFC3164 will cause an ugly mess. On Linux
-  there's only a single `/dev/log` AF_UNIX/SOCK_DGRAM socket backing
-  `syslog()`, which is used in a one-way, fire-and-forget style. This
-  means that feature negotation is not really possible within the
-  protocol. Various tools bind mount the socket inode into `chroot()`
-  and container environments, hence it would be fantastic to associate
-  supported feature information directly with the inode (and thus
-  outside of the protocol) to make it easy for clients to determine
-  which features are spoken on a socket, in a way that survives bind
-  mounts. Implementation idea would be that syslog daemons
-  implementing RFC5425 could simply set an xattr `user.rfc5424` to `1`
-  (or something like that) on the socket inode, and clearly inform
-  clients in a natural and simple way that they'd be happy to parse
-  the newer format. Also see:
-  https://github.com/systemd/systemd/issues/19251 – This idea could
-  also be extended to other sockets and other protocols: by setting
-  some extended attribute on a socket inodes, services could advertise
-  which protocols they support on them. For example D-Bus sockets
-  could carry `user.dbus` set to `1`, and Varlink sockets
-  `user.varlink` set to `1` and so on.
-
-* Extend `pidfd_open()` to allow opening the thread-group leader based on the
-  PID of an individual thread. Currently we do support:
-
-  (1) `pidfd_open(1234, 0)` on a thread-group leader PID
-  (2] `pidfd_open(1234, PIDFD_THREAD)` on a thread
-
-   Add an option to go from individual thread to thread-group leader.
-
-   **Use-Case:** Allow for a race free way to go from individual thread to
-   thread-group leader pidfd.
-
-* Add a namespace ioctl to translate a PID between PID namespaces.
-
-  **Use-Case:** This makes it possible to e.g., figure out what a given PID in
-  a PID namespace corresponds to in the caller's PID namespace. For example, to
-  figure out what the PID of PID 1 inside of a given PID namespace is.
+### Mount a subdirectory instead of the top-level directory
+
+Ability to mount a subdirectory of a regular file system instead of
+the top-level directory. E.e. for a file system `/dev/sda1` which
+contains a sub-directory `/foobar` mount `/foobar` without having
+to mount its parent directory first. Consider something like this:
+
+```
+mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
+```
+
+(This is of course already possible via some mount namespacing
+shenanigans, but this requires namespacing to be available, and is
+not precisely obvious to implement. Explicit kernel support at mount
+time would be much preferable.)
+
+**Use-Case:** `systemd-homed` currently mounts a sub-directory of
+the per-user LUKS volume as the user's home directory (and not the
+root directory of the per-user LUKS volume's file system!), and in
+order to implement this invisibly from the host side requires a
+complex mount namespace exercise.
+
+### inotify() events for BSD file locks
+
+BSD file locks (i.e. `flock()`, as opposed to POSIX `F_SETLK` and
+friends are inode-focussed, hence would be great if one could get
+asynchronous notification when they are released via inotify.
+
+**Use-Case:** udevd probes block devices whenever they pop up to
+create /dev/disk/by-label/* and similar symlinks. Formatting tools
+can temporarily block this behaviour by taking a BSD file lock on
+the block device (as per https://systemd.io/BLOCK_DEVICE_LOCKING),
+in order to make sure udevd doesn't probe file systems/partition
+tables that are only partially initialized. Currently, udevd uses
+inotify `IN_CLOSE_WRITE` notifications to detect whenever
+applications close a block device after writing to it, and
+automatically reprobes the device. This works reasonably OK given
+that block devices are usually closed at the same time as their
+BSD file lock is released, and vice versa. However, this is not
+fully correct: what udevd actually should be watching is the locks
+being released, not the devices being closed.
+
+### Auxiliary socket message describing the sender's cgroup
+
+`SCM_CGROUP` or a similar auxiliary socket message, that allows
+receivers to figure out which cgroup a sender is part of.
+
+**Use-Case:** `systemd-journald` picks up cgroup information from
+logging clients, in order to augment log records and allow
+filtering via this meta-information. In particular it derives
+service identity from that (so that requests such as "Show me all
+log messages of service X!" can be answered). This is currently
+racy, since it uses `SCM_CREDS`' `.pid` field for this, which it then
+used to load `/proc/$PID/cgroup`. In particular for programs that
+log and immediately exit, the cgroup information frequently cannot
+be acquired anymore by `systemd-journald`.
+
+### Linking of `O_TMPFILE` files with replacement
+
+Ability to link an `O_TMPFILE` file into a directory while *replacing* an
+existing file. (Currently there's only the ability to link it in, if the
+file name doesn't exist yet.)
+
+**Use-Case:** there are many programs (e.g. `systemd-hostnamed`
+when updating `/etc/hostname`) that atomically want to update a
+file, so that either the old or the new version is in place, but
+never a partially updated one. The canonical way to do this is by
+creating a temporary file with the new contents, and then renaming
+it to the filename of the file to update, thus atomically replacing
+it. Currently, the temporary file for this must be created with a
+random name, `O_TMPFILE` cannot be used, since for these files
+atomic-replace is not supported, currently.
+
+### Ability to only open regular files
+
+`O_REGULAR` (inspired by the existing `O_DIRECTORY` flag for
+`open()`), which opens a file only if it is of type `S_IFREG`.
+
+**Use-Case:** this would be very useful to write secure programs
+that want to avoid being tricked into opening device nodes with
+special semantics while thinking they operate on regular
+files. This is particularly relevant as many device nodes (or even
+FIFOs) come with blocking I/O (or even blocking `open()`!) by
+default, which is not expected from regular files backed by "fast"
+disk I/O. Consider implementation of a naive web browser which is
+pointed to `file://dev/zero`, not expecting an endless amount of
+data to read.
+
+### Unlinking via two file descriptors
+
+`unlinkat3(dir_fd, name, inode_fd)`: taking one file descriptor
+for the directory to remove a file in, and another one referring
+to the inode of the filename to remove. This call should only
+succeed if the specified filename still refers to the specified
+inode.
+
+**Use-Case:** code that operates on a well-know path that might be
+shared by multiple programs that jointly manage it might want to
+safely remove a filename under the guarantee it still refers to
+the expected inode. As a specific example, consider lock files,
+that should be cleaned up only if they still refer to the assumed
+owner's instance, but leave the file in place if another process
+already took over the filename.
+
+### Determining if a mount point belongs to the current user
+
+Ability to determine if a mount point belongs to the current user
+namespace, in order to check if there's a chance a process can
+safely unmount it (as that only works for mounts owned by the same
+user namespaces — or one further down the tree, but not any up the
+tree). A simple, additional field in `/proc/self/mountinfo`
+container the owning user namespace ID would probably already
+suffice.
+
+**Use-Case:** the `systemd` system and service manager tries to unmount
+all established mounts on shutdown. Inside of container
+environments where specific mounts are established by the
+container manager (and not the payload itself) this will
+ultimately fail if user namespaces are enabled. In order to clean
+up the shutdown logic it would be very good to be able to
+determine whether a specific mount could even possibly be
+unmounted or whether it's not worth the effort to include the
+unmount in the system shutdown transaction.
+
+### Read-only propagation of mounts
+
+A way to mark mounts that receive mount propagation events from
+elsewhere so that these propagated mounts are established
+read-only implicitly. Right now, if a mount receives a mount
+propagation event it will have the exact same `MS_RDONLY`,
+`MS_NODEV`, … flags as it has where it originated. It would be
+very useful if an `MS_RDONLY` could be ORed into the mount flags
+automatically whenever propagated elsewhere.
+
+**Use-Case:** various mount namespace based sandboxes
+(e.g. `systemd`'s `ProtectSystem=` option) mark large parts of the
+host file hierarchy read-only via mounting it
+`MS_RDONLY|MS_BIND|MS_REMOUNT`, but generally intend to leave the
+file hierarchy besides that the way it is, and that includes they
+typically still want to be able to receive mount events to
+directories such as `/mnt/` and `/media/` in these sandboxed
+environments. Right now, any such propagation then happens in
+writable mode, even if the file hierarchy otherwise is almost
+entirely read-only. To close this gap it would be great if such
+propagated mounts could implicitly gain `MS_RDONLY` as they are
+propagated.
+
+### Disabling reception of `SCM_RIGHTS` for `AF_UNIX` sockets
+
+Ability to turn off `SCM_RIGHTS` reception for `AF_UNIX`
+sockets. Right now reception of file descriptors is always on when
+a process makes the mistake of invoking `recvmsg()` on such a
+socket. This is problematic since `SCM_RIGHTS` installs file
+descriptors in the recipient process' file descriptor
+table. Getting rid of these file descriptors is not necessarily
+easy, as they could refer to "slow-to-close" files (think: dirty
+file descriptor referring to a file on an unresponsive NFS server,
+or some device file descriptor), that might cause the recipient to
+block for a longer time when it tries to them. Programs reading
+from an `AF_UNIX` socket currently have three options:
+
+1. Never use `recvmsg()`, and stick to `read()`, `recv()` and
+   similar which do not install file descriptors in the recipients
+   file descriptor table.
+
+2. Ignore the problem, and simply `close()` the received file descriptors
+   it didn't expect, thus possibly locking up for a longer time.
+
+3. Fork off a thread that invokes `close()`, which mitigates the
+   risk of blocking, but still means a sender can cause resource
+   exhaustion in a recipient by flooding it with file descriptors,
+   as for each of them a thread needs to be spawned and a file
+   descriptor is taken while it is in the process of being closed.
+
+(Another option of course is to never talk `AF_UNIX` to peers that
+are not trusted to not send unexpected file descriptors.)
+
+A simple knob that allows turning off `SCM_RIGHTS` right reception
+would be useful to close this weakness, and would allow
+`recvmsg()` to be called without risking file descriptors to be
+installed in the file descriptor table, and thus risking a
+blocking `close()` or a form of potential resource exhaustion.
+
+**Use-Case:** any program that uses `AF_UNIX` sockets and uses (or
+would like to use) `recvmsg()` on it (which is useful to acquire
+other metadata). Example: logging daemons that want to collect
+timestamp or `SCM_CREDS` auxiliary data, or the D-Bus message
+broker and suchlike.
+
+### Filtering on received file descriptors
+
+An alternative to the previous item could be if some form of filtering
+could be enforced on the file descriptors suitable for enqueuing on
+the `AF_UNIX` socket. i.e. allow filtering by superblock type or
+similar, so that policies such as "only `memfd`s are OK to be
+received" may be expressed. (BPF?).
+
+**Use-Case:** as above.
+
+### A reliable way to check for PID namespacing
+
+A reliable (non-heuristic) way to detect from userspace if the
+current process is running in a PID namespace that is not the main
+PID namespace. PID namespaces are probably the primary type of
+namespace that identify a container environment. While many
+heuristics exist to determine generically whether one is executed
+inside a container, it would be good to have a correct,
+well-defined way to determine this.
+
+**Use-Case:** tools such as `systemd-detect-virt` exist to determine
+container execution, but typically resolve to checking for
+specific implementations. It would be much nicer and universally
+applicable if such a check could be done generically. It would
+probably suffice to provide an `ioctl()` call on the `pidns` file
+descriptor that reveals this kind of information in some form.
+
+### Excluding processes watched via `pidfd` from `waitid(P_ALL, …)`
+
+
+**Use-Case:** various programs use `waitid(P_ALL, …)` to collect exit
+information of exited child processes. In particular PID 1 and
+processes using `PR_SET_CHILD_SUBREAPER` use this as they may
+collect unexpected children that have been reparented from dying
+sub-processes, and that need to be reaped in order to clean up the
+PID space. Currently, these programs cannot easily mix waiting for
+specific sub-processes via `pidfd` with waiting for the other
+*unexpected* children via `waitid(P_ALL, …)` since the latter also
+reaps (and thus invalidates) the pidfd-tracked
+children. Specifically, the `systemd` service manager would like
+to use `pidfd`s to remove PID recycling security issues, but
+currently cannot as it also needs to generically wait for such
+unexpected children.
+
+### Mount notifications without rescanning of `/proc/self/mountinfo`
+
+Mount notifications that do not require continuous rescanning of
+`/proc/self/mountinfo`. Currently, if a program wants to track
+mounts established on the system it can receive `poll()`able
+events via a file descriptor to `/proc/self/mountinfo`. When
+receiving them it needs to rescan the file from the top and
+compare it with the previous scan. This is both slow and
+racy. It's slow on systems with a large number of mounts as the
+cost for re-scanning the table has to be paid for every change to
+the mount table. It's racy because quickly added and removed
+mounts might not be noticed.
+
+**Use-Case:** `systemd` tracks the mount table to integrate the mounts
+into it own dependency management.
+
+### Asynchronous `close()`
+
+An asynchronous or forced `close()`, that guarantees that
+userspace doesn't have to risk blocking for longer periods of time
+when trying to get rid of unwanted file descriptors, possibly
+received via `recvmsg()` + `SCM_RIGHTS` (see above). Currently,
+`close()` of various file descriptors (for example those referring
+to slow storage, e.g. non-responding NFS servers and such) might
+take arbitrary amounts of time, potentially into the minute range
+and more. This makes it risky accepting file descriptors on
+publicly accessible `AF_UNIX` sockets, the way like IPC brokers
+(e.g. D-Bus) do it: if a rogue client keeps sending file
+descriptors that because unexpected must be closed immediately it
+might cause the receiving process to effectively crawl, when it is
+busy closing them all. A special form of `close()` that simply
+detaches a file descriptor from the file descriptor table without
+blocking on IO in any form would be great to close this issue.
+
+**Use-Case:** any program that receives file descriptors via `AF_UNIX`
+from untrusted clients would benefit from this. e.g. D-Bus
+brokers.
+
+### `CLOCK_MONOTONIC` network timestamps
+
+Currently network timestamps are exclusively in `CLOCK_REALTIME`, even
+though for many (most?) a monotonic clock would be much preferable, as
+calculations become easier when one doesn't have to think about clock
+jumps and similar.
+
+**Use-Case:** `systemd-journald` collects implicit timestamps via
+`AF_UNIX` time-stamping, in `CLOCK_REALTIME`, even though for its
+internal logic only monotonic timestamps are used, as log records
+are searched via bisection in ordered tables, that require
+strictly increasing timestamps. In particular during boot (where
+`CLOCK_REALTIME` is often not available, stable or subject to
+corrections) it would be good to have reliable, monotonic
+timestamps on all log records.
+
+### Immutable loopback block devices
+
+Truly immutable loopback block devices. Right now setting up a
+loopback block device in read-only mode, backed by a read-only
+file (stored on a regular read/write file system), and then
+mounting it with `ext4` also in `MS_RDONLY`mode *will* result in
+changes to the file, quite unexpectedly 🤯. Ideally, if a loopback
+block device is set up in read-only mode this should guarantee
+that the backing file remains unmodified by it.
+
+**Use-Case:** disk image build tools that want to reproducibly and
+verifiable build images must be able to rely that mounting them in
+read-only mode does not alter the images in any way. In particular
+when working in computer forensics one must be able to rely that
+file systems that are analyzed remain unmodified by the analysis.
+
+### A timeout for the `flock()` syscall
+
+A timeout for the `flock()` syscall. Faking the time-out in userspace
+is nasty: most code does it with `alarm()` (or equivalent APIs), but
+that's racy since on a heavily loaded system the timeout might trigger
+before the `flock()` call is entered, in particular if short time-outs
+shall be used. More accurate is to do the locking in a short-lived
+child processed, but that's difficult already in C, and almost
+impossible in languages that do not allow `fork()` without `execve()`.
+
+**Use-Case:** as mentioned above systemd-udev allows synchronizing
+block device probing via flock(). Often userspace wants to wait
+for that, but without risking to hang forever.
+
+### Extend `mount_setattr()` to allow changing mount properties ignoring any failures 
+
+**Use-Case:** workloads that know that there are mounts in a mount tree
+whose attributes cannot be changed by the caller don't want
+`mount_settattr()` to fail on the first mount it failed to convert. Give
+them a flag to request changes ignoring failures.
+
+### Upgrade masks in `openat2()`
+
+Add upgrade masks to `openat2()`. Extend `struct open_how` to allow
+restrict re-opening of file descriptors.
+
+**Use-Case:** block services or containers from re-opening/upgrading an
+`O_PATH` file descriptor through e.g. `/proc/<pid>/fd/<nr` as `O_WRONLY`.
+
+### Make quotas work with user namespaces
+
+The quota codepaths in the kernel currently broken and inconsistent
+and most interesting operations are guarded behind
+`capable(CAP_SYS_ADMIN)`, i.e., require `CAP_SYS_ADMIN` in the initial
+user namespace. We should rework these codepaths to work with user
+namespaces and then see whether we can make them work with idmapped
+mounts.
+
+**Use-Case:** using quotas correctly in containers.
+
+### Add `MOUNT_SETATTR_LOCKED` flag to `mount_setattr()`
+
+Add a new `MOUNT_SETATTR_LOCKED` flag to `mount_setattr(..., ..., MOUNT_SETATTR_LOCKED, ..., ...)`.
+The `MOUNT_SETATTR_LOCKED` flag allow a `ns_capable(mntns->user_ns,
+CAP_SYS_ADMIN)` caller to lock all mount properties. The mount properties
+cannot be changed anymore.
+
+**Use-Case:** allowing processes to lock mount properties even for
+privileged processes. Locking mount properties would currently involve
+having to have the mount namespace of the container be owned by an ancestor
+user namespace. But this doesn't just lock a single mount or mount subtree
+it locks all mounts in the mount namespace, i.e., the mount table cannot be
+altered.
+
+### Add `OPEN_TREE_CLEAR` flag to `open_tree()`
+
+Add a new `OPEN_TREE_CLEAR` flag to `open_tree()` that can only be
+used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear
+all mount properties from that mount including the mount's idmapping.
+Requires the caller to be `ns_capable(mntns->user_ns)`. If idmapped mounts
+are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)`
+in the filesystems user namespace.
+
+Locked mount properties cannot be changed. A mount's idmapping becomes
+locked if it propagates across user namespaces.
+
+This is useful to get a new, clear mount and also allows the caller to
+create a new detached mount with an idmapping attached to the mount. Iow,
+the caller may idmap the mount afterwards.
+
+**Use-Case:** A user may already use an idmapped mount for their home
+directory. And once a mount has been idmapped the idmapping cannot be
+changed anymore. This allows for simple semantics and allows to avoid
+lifetime complexity in order to account for scenarios where concurrent
+readers or writers might still use a given user namespace while it is about
+to be changed.
+But this poses a problem when the user wants to attach an idmapping to
+a mount that is already idmapped. The new flag allows to solve this
+problem. A sufficiently privileged user such as a container manager can
+create a user namespace for the container which expresses the desired
+ownership. Then they can create a new detached mount without any prior
+mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this
+mount.
+
+### Require a user namespace to have an idmapping when attached
+
+Enforce that the user namespace about to be attached to a mount must
+have an idmapping written.
+
+**Use-Case:** Tighten the semantics.
+
+### Extend `setns()` to allow attaching to all new namespaces of a process
+
+Add an extension to `setns()` to allow attaching to all namespaces of
+a process `SETNS_PIDFD_ALL` different from the caller's namespaces.
+Currently specifying e.g., `CLONE_NEWUSER` fails if the caller is in the
+same user namespace as the target process. This is very inconvenient.
+
+**Use-Case:** Make it trivial to attach to all namespaces of a process
+without having to figure out whether the caller is already in the same
+namespace or not.
+
+### Security hook for `mount_setattr()`
+
+(kAPI) Add security hook to `mount_setattr()`.
+
+**Use-Case:** Allow LSMs to make decisions about what mount properties to
+allow and what to deny.
+
+### Per-cgroup limit for coredump sizes
+
+A per-cgroup knob for coredump sizes. Currently coredump size
+control is strictly per process, and primarily under control of
+the processes themselves. It would be good if we had a per-cgroup
+knob instead, that is under control of the service manager.
+
+**Use-Case:** coredumps can be heavy to generate. For different
+usecases it would be good to be able to opt-in or opt-out
+dynamically from coredumps for specific services, at runtime
+without restarting them.
+
+### Race-free creation and opening of non-file inodes
+
+A way to race-freely create an (non-file) inode and immediately
+open it. For regular files we have open(O_CREAT) for creating a
+new file inode, and returning a pinning fd to it. This is missing
+for other inode types, such as directories, device nodes,
+FIFOs. The lack of such functionality means that when populating a
+directory tree there's always a race involved: the inodes first
+need to be created, and then opened to adjust their
+permissions/ownership/labels/timestamps/acls/xattrs/…, but in the
+time window between the creation and the opening they might be
+replaced by something else. Addressing this race without proper
+APIs is possible (by immediately fstat()ing what was opened, to
+verify that it has the right inode type), but difficult to get
+right. Hence, mkdirat_fd() that creates a directory *and* returns
+an O_DIRECTORY fd to it would be great. As would be mknodeat_fd()
+that creates a device node, FIFO or (dead) socket and returns an
+O_PATH fd to it. And of course symlinkat_fd() that creates a
+symlink and returns an O_PATH fd to it.
+
+**Use-Case:** any program that creates/unpacks not just files, but
+directories, device nodes, fifos, and wants to ensure that they
+safely get the right attributes applied, even if other code might
+simultaneously have access to the same directory tree.
+
+### Extend `io_uring` with classic synchronous system calls
+
+The `io_uring` subsystem is open to adding classic existing synchronous
+system calls (e.g. `setns()` or `mount()` or other) to `io_uring`.
+They also said they would support adding new functionality into
+`io_uring` that is not exposed through system calls yet.
+
+### Map different uids/gids to the same uids/gids?
+
+Explore the idea of mapping different uids/gids to the same uids/gids, i.e.
+65534:1000:1 50000:1000:1. This will only work if the mount is read-only as
+the kernel wouldn't know what uid/gid would need to be put to disk
+otherwise (65534? 50000? the first one that is mapped?).
+
+**Use-Case:** Delegate multiple {g,u}ids to the same user. Merging
+ownership similar to how overlayfs merges files. Bindfs
+(https://bindfs.org/docs/bindfs.1.html#sect3) allows this concept too.
+
+### blobfs
+
+[`blobfs`](https://fuchsia.dev/fuchsia-src/concepts/filesystems/blobfs)
+for Linux. i.e. a minimalistic file system, that can store
+authenticated (Verity) data files, that can be written once, and
+not be modified after that, and provide stable handles (i.e. is
+content-addressable) to them.
+
+**Use-Case:** This would deliver just about enough to place
+trusted OS resources (binaries, kernels, initrds, fs trees, other
+resources) in them, without having to trust the medium and IO
+underneath. Should be simple enough to even implement in a boot
+loader and similar, without making things vulnerable to rogue file
+system image attacks. The OS and its payloads (apps, containers,
+…) could then be composed from these resources, through means like
+overlayfs, namespacing and more.
+
+### Namespaced loop and block devices
+
+Namespace-able loop and block devices, usable inside user namespaces.
+
+**Use-Case:** Allow mounting images inside nspawn containers, and using
+RootImage= and friends in the systemd user manager.
+
+### Support detached mounts with `pivot_root()`
+
+The new rootfs must currently refer to an attached mount. This restriction
+seems unnecessary. We should allow the new rootfs to refer to a detached
+mount.
+
+This will allow a service- or container manager to create a new rootfs as
+a detached, private mount that isn't exposed anywhere in the filesystem and
+then `pivot_root()` into it.
+
+Since `pivot_root()` only takes path arguments the new rootfs would need to
+be passed via `/proc/<pid>/fd/<nr>`. In the long run we should add a new
+`pivot_root()` syscall operating on file descriptors instead of paths.
+
+### Device cgroup guard to allow `mknod()` in non-initial userns
+
+If a container manager restricts its unprivileged (user namespaced)
+children by a device cgroup, it is not necessary to deny `mknod()`
+anymore. Thus, user space applications may map devices on different
+locations in the file system by using `mknod()` inside the container.
+
+**Use-Case:** A use case for this, which is applied by users of GyroidOS,
+is to run `virsh` for VMs inside an unprivileged container. `virsh` or
+libvirt creates device nodes, e.g., `/var/run/libvirt/qemu/11-fgfg.dev/null`
+which currently fails in a non-initial userns, even if a cgroup device white
+list with the corresponding major, minor of `/dev/null` exists. Thus, in
+this case the usual bind mounts or pre populated device nodes under `/dev`
+are not sufficient.
+
+An initial group internal RFC exists in
+(https://github.com/quitschbo/linux/tree/devcg_guard_rfc).
+See commit message for more implementation specific details.
+
+### `AT_EMPTY_PATH` support `for unlinkat()`
+
+**Use-Case:** When dealing with files/directories, allow passing
+around only a file descriptor without having to keep the path around
+to be able to unlink the file/directory.
+
+### Race-free mounting of block devices
+
+Introduce a new struct to `fsconfig()` as an alternative to the
+`source` property. The struct contains at least a pointer to a path,
+possibly a device minor and major, and a diskseq number. The VFS can
+expose a helper that filesystems can call and use the diskseq number
+to verify that the block device they are intending to mount is indeed
+the one they want to mount.
+
+**Use-Case:** Race-free mounting of block devices.
+
+### Ability to reopen a `struct block_device`
+
+Add ability to reopen a `struct block_device`. This would allow using
+`blkdev_get_by_path()`/`blkdev_get_{part,whole}()` to claim a device
+with `BLK_OPEN_READ` and later on reopen with
+`BLK_OPEN_READ | BLK_OPEN_WRITE`. This in turn would opening block
+devices at `fsconfig(FS_CONFIG_SET_*)` time and then at `fill_super()`
+time we would be able to reopen in case the `!(fc->sb_flags & SB_RDONLY)`.
+Overall this has the effect that we're able to open devices early
+giving the user early errors when they set mount options rather than
+very late when the superblock is created.
+
+### Specification of a keyring for dm-verity volume verification
+
+When activating a dm-verity volume allow specifying keyring to
+validate root hash signature against.
+
+**Usecase:** In systemd, we'd like to authenticate Portable Service
+images, system extension images, configuration images, container
+images with different keys, as they typically originate from
+different sources and it should not be possible to generate a
+system extension with a key pair that is supposed to be good for
+container images only.
+
+### Make statx() on a pidfd return additional info
+
+Make statx() on a pidfd return additional recognizable identifiers in
+`.stx_btime`.
+
+**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**
+
+It would be fantastic if issuing statx() on any pidfd would return
+the start time of the process in `.stx_btime` even after the process
+died.
+
+These fields should in particular be queriable *after* the process
+already exited and has been reaped, i.e. after its PID has already
+been recycled.
+
+**Usecase:** In systemd we maintain lists of processes in a hash
+table. Right now, the key is the PID, but this is less than ideal
+because of PID recycling. By being able to use the `.stx_btime`
+and/or `.stx_ino` fields instead would be perfect to safely
+identify, track and compare process even after they ceased to exist.
+
+### API to determine the parent process ID of a pidfd
+
+An API to determine the parent process ID (ppid) of a pidfd would be
+good.
+
+This information is relevant to code dealing with pidfds, since if
+the ppid of a pidfd matches the process own pid it can call
+`waitid()` on the process, if it doesn't it cannot and such a call
+would fail. It would be very useful if this could be determined
+easily before even calling that syscall.
+
+**Usecase:** systemd manages a multitude of processes, most of which
+are its own children, but many which are not. It would be great if
+we could easily determine whether it is worth waiting for
+`SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on
+them is the only way to get exit notification.
+
+### Set `comm` field before `exec()`
+
+There should be a way to control the process' `comm` field if
+started via `fexecve()`/`execveat()`.
+
+Right now, when `fexecve()`/`execveat()` is used, the `comm` field
+(i.e. `/proc/self/comm`) contains a name derived of the numeric fd,
+which breaks `ps -C …` and various other tools.  In particular when
+the fd was opened with `O_CLOEXEC`, the number of the fd in the old
+process is completely meaningless.
+
+The goal is add a way to tell `fexecve()`/`execveat()` what Name to use.
+
+Since `comm` is under user control anyway (via `PR_SET_NAME`), it
+should be safe to also make it somehow configurable at fexecve()
+time.
+
+See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294,
+https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81.
+
+**Usecase:** In systemd we generally would prefer using `fexecve()`
+to safely and race-freely invoke processes, but the fact that `comm`
+is useless after invoking a process that way makes the call
+unfortunately hard to use for systemd.
+
+### Path-based ACL management
+
+The LSM module API should have the ability to do path-based (not
+just inode-based) ACL management.
+
+**Usecase:** This would be useful in BPF-LSM modules such as
+systemd's `mntfsd` which allows unprivileged file system mounts in
+some cases, and which would like to restrict ACL handling based on
+the superblock involved.
+
+### Immutable layers for `overlayfs`
+
+`overlayfs` should permit *immutable* layers, i.e. layers whose
+non-directory inodes may not be overriden in an upper writable
+layer.
+
+**Usecase:** This would be useful when implementing `/etc/` as a
+stack of overlayfs layers, each shipping configuration for a
+different facet of the system, with a writable layer on the top for
+local modifications. In such a scenario it would be useful to allow
+the user to change any configuration it likes, except of the files
+and other inodes shipped in the lower layers.
+
+### `ioctl()` API for `overlayfs`
+
+`overlayfs` should have an `ioctl()`-based API (or similar) for
+querying information of the backing file systems/block devices
+
+**Usecase:** In systemd in various areas we automatically find the
+block device backing the root file system and other file systems
+(Example: `systemd-gpt-auto-generator` or `bootctl` wull try to find
+auxiliary file systems of the OS image by looking in the GPT
+partition table the root file system is located in). While this
+logic is good enough to find the backing block devices of some more
+complex storage such as dm-crypt, dm-verity or btrfs, once
+`overlayfs` is used as backing for the root file system this logic
+does not work anymore. It would be great if there was an API to
+simply query `overlayfs` for the superblock information
+(i.e. `.st_dev`) of the backing layers.
+
+#### Automatic growing of `btrfs` filesystems
+
+An *auto-grow* feature in `btrfs` would be excellent.
+
+If such a mode is enabled, `btrfs` would automatically grow a file
+system up to the size of its backing block devices. Example: btrfs
+is created with 200M in size on a block device 2G in size. Once the
+file system is filled up fully, `btrfs` would automatically grow the
+file system as need in the increments it needs, up to the 2G that
+the backing block device is in size.
+
+**Usecase:** This would allow creating minimal, compact file
+systems: just create them small on a sparse block device, and copy
+files into it, as needed, create subvolumes and whatever else is
+desired. As long as only files are created and written (but not
+modified) the resulting fs should be automatically minimal in size.
+This would specifically be useful in `systemd-homed`, which
+maintains per-user `btrfs` file systems backed by block
+devices. Currently, `homed` grows the file systems manually on login
+and then shrinks them again on logout, but this is less than ideal,
+since btrfs places files all over the backing store, and thus the
+shrinking will generate a lot of nonsensical IO that could be
+reduced if the file system was always kept minimal in size anyway.
+
+### Add process by PIDFD to a cgroup
+
+At the moment the canonical way to add a process to a cgroup is by
+echoing its PID into the `cgroup.procs` attribute in the target
+cgroupfs directory of the cgroup. This is safe as long as the
+process doing so just forked off the process it wants to migrate and
+hence can control that it hasn't been reaped yet, and hence
+guarantees the PID is valid. This is racy however if "foreign"
+processes shall be moved into the cgroup.
+
+**Usecase:** In systemd, all user sessions are wrapped in scope
+units which are backed by a cgroup. The session processes moved into
+the scope unit are typically "foreign" processes, i.e. not children
+of the service manager, hence doing the movement is subject to races
+in case the process dies and its PID is quickly recycled. (This
+assumes systemd can acquire a pidfd of the foreign process without
+races, for example via `SCM_PIDFD` and `SO_PEERPIDFD` or similar.)
+
+### Ability to put user xattrs on `S_IFSOCK` socket inodes
+
+Currently, the kernel only allows extended attributes in the
+`user.*` namespace to be attached to directory and regular file
+inodes. It would be tremendously useful to allow them to be
+associated with socket inodes, too.
+
+**Usecase:** There are two syslog RFCs in use today: RFC3164 and
+RFC5424. `glibc`'s `syslog()` API generates events close to the
+former, but there are programs which would like to generate the
+latter instead (as it supports structured logging). The two formats
+are not backwards compatible: a client sending RFC5424 messages to a
+server only understanding RFC3164 will cause an ugly mess. On Linux
+there's only a single `/dev/log` AF_UNIX/SOCK_DGRAM socket backing
+`syslog()`, which is used in a one-way, fire-and-forget style. This
+means that feature negotation is not really possible within the
+protocol. Various tools bind mount the socket inode into `chroot()`
+and container environments, hence it would be fantastic to associate
+supported feature information directly with the inode (and thus
+outside of the protocol) to make it easy for clients to determine
+which features are spoken on a socket, in a way that survives bind
+mounts. Implementation idea would be that syslog daemons
+implementing RFC5425 could simply set an xattr `user.rfc5424` to `1`
+(or something like that) on the socket inode, and clearly inform
+clients in a natural and simple way that they'd be happy to parse
+the newer format. Also see:
+https://github.com/systemd/systemd/issues/19251 – This idea could
+also be extended to other sockets and other protocols: by setting
+some extended attribute on a socket inodes, services could advertise
+which protocols they support on them. For example D-Bus sockets
+could carry `user.dbus` set to `1`, and Varlink sockets
+`user.varlink` set to `1` and so on.
+
+### Open thread-group leader via `pidfd_open()`
+
+Extend `pidfd_open()` to allow opening the thread-group leader based on the
+PID of an individual thread. Currently we do support:
+
+1. `pidfd_open(1234, 0)` on a thread-group leader PID
+2. `pidfd_open(1234, PIDFD_THREAD)` on a thread
+
+Add an option to go from individual thread to thread-group leader.
+
+**Use-Case:** Allow for a race free way to go from individual thread
+to thread-group leader pidfd.
+
+### Namespace ioctl to translate a PID between PID namespaces
+
+**Use-Case:** This makes it possible to e.g., figure out what a given PID in
+a PID namespace corresponds to in the caller's PID namespace. For example, to
+figure out what the PID of PID 1 inside of a given PID namespace is.
 
 ## Finished Items
 
-* [x] ability to unmount obstructed mounts. (this means: you have a stack
-  of mounts on the very same inode, and you want to remove a mount in
-  the middle. right now, you can only remove the topmost mount.)
-
-  **🙇 instead of the ability to unmount obstructured mounts we gained
-  the ability to mount beneath an existing mount, with mostly
-  equivalent outcome. `6ac392815628f317fcfdca1a39df00b9cc4ebc8b
-  ("fs: allow to mount beneath top mount") 🙇**
+### Unmounting of obstructed mounts
+
+[x] ability to unmount obstructed mounts. (this means: you have a stack
+of mounts on the very same inode, and you want to remove a mount in
+the middle. right now, you can only remove the topmost mount.)
+
+**🙇 instead of the ability to unmount obstructured mounts we gained
+the ability to mount beneath an existing mount, with mostly
+equivalent outcome. `6ac392815628f317fcfdca1a39df00b9cc4ebc8b
+("fs: allow to mount beneath top mount") 🙇**
+
+**use-case:** this is useful for replacing mounts atomically, for
+example for upgrading versioned disk images: first an old version
+of the image is mounted. then a new version is mounted over the
+existing mount point, and then the lower mount point is
+removed. One such software would be `systemd-sysext`.
+
+### `SCM_PIDFD` auxiliary socket message
+
+[x] `SCM_PIDFD` or similar auxiliary socket message, that is a modern
+version of the `SCM_CREDS` message's `.pid` field, and provides a
+`pidfd` file descriptor to the originating peer process.
+
+**🙇 `5e2ff6704a275be00 ("scm: add SO_PASSPIDFD and SCM_PIDFD)")` 🙇**
+
+**Use-Case:** security infrastructure (such as PolicyKit) can safely
+reference clients this way without fearing PID
+recycling. `systemd-journald` can acquire peer metadata this way in
+a less racy fashion, in particular safe against PID recycling.
+
+### Take `IP_UNICAST_IF` into account for routing decisions
+
+[x] `IP_UNICAST_IF` should be taken into account for routing decisions
+at UDP `connect()` time (currently it isn't, only `SO_BINDTOINDEX`
+is, but that does so much more than just that, and one often
+doesn't want that)
+
+**🙇 `0e4d354762cefd3e16b4cff8988ff276e45effc4 ("net-next: Fix
+IP_UNICAST_IF option behavior for connected sockets")` 🙇**
 
-  **use-case:** this is useful for replacing mounts atomically, for
-  example for upgrading versioned disk images: first an old version
-  of the image is mounted. then a new version is mounted over the
-  existing mount point, and then the lower mount point is
-  removed. One such software would be `systemd-sysext`.
+**Use-Case:** DNS resolvers that associate DNS configuration with
+specific network interfaces (example: `systemd-resolved`) typically
+want to preferably route DNS traffic to the per-interface DNS
+server via that interface, but not make further restrictions on the
+origins or received replies, and all that without
+privileges. `IP_UNICAST_IF` fulfills this role fine for TCP, but
+for UDP it is not taken into account for the `connect()` routing
+decision.
 
-* [x] `SCM_PIDFD` or similar auxiliary socket message, that is a modern
-  version of the `SCM_CREDS` message's `.pid` field, and provides a
-  `pidfd` file descriptor to the originating peer process.
+### A system call to query information about a mount
 
-  **🙇 `5e2ff6704a275be00 ("scm: add SO_PASSPIDFD and SCM_PIDFD)")` 🙇**
+[x] Implement a mount-specific companion to `statx()` that puts at least the
+following information into `struct mount_info`:
 
-  **Use-Case:** security infrastructure (such as PolicyKit) can safely
-  reference clients this way without fearing PID
-  recycling. `systemd-journald` can acquire peer metadata this way in
-  a less racy fashion, in particular safe against PID recycling.
+**🙇 46eae99ef73302f9fb3dddcd67c374b3dffe8fd6 ("add statmount(2) syscall")`` 🙇**
 
-* [x] `IP_UNICAST_IF` should be taken into account for routing decisions
-  at UDP `connect()` time (currently it isn't, only `SO_BINDTOINDEX`
-  is, but that does so much more than just that, and one often
-  doesn't want that)
+* mount flags: `MOUNT_ATTR_RDONLY`, ...
+* time flags: `MOUNT_ATTR_RELATIME`, ...
+  Could probably be combined with mount flags.
+* propagation setting: `MS_SHARED)`, ...
+* peer group
+* mnt id of the mount
+* mnt id of the mount's parent
+* owning userns
 
-  **🙇 `0e4d354762cefd3e16b4cff8988ff276e45effc4 ("net-next: Fix
-  IP_UNICAST_IF option behavior for connected sockets")` 🙇**
+There's a bit more advanced stuff systemd would really want but which
+I think is misplaced in a mountinfo system call including:
+* list of primary and auxiliary block device major/minor
+* diskseq value of those device nodes (This is a new block device feature
+  we added that allows preventing device recycling issues when e.g.
+  removing usb devices very quickly and is needed for udev.)
+* uuid/fsid
+* feature flags (`O_TMPFILE`, `RENAME_EXCHANGE` supported etc.)
 
-  **Use-Case:** DNS resolvers that associate DNS configuration with
-  specific network interfaces (example: `systemd-resolved`) typically
-  want to preferably route DNS traffic to the per-interface DNS
-  server via that interface, but not make further restrictions on the
-  origins or received replies, and all that without
-  privileges. `IP_UNICAST_IF` fulfills this role fine for TCP, but
-  for UDP it is not taken into account for the `connect()` routing
-  decision.
+**Use-Case:** low-level userspace tools have to interact with advanced
+mount information constantly. This is currently costly and brittel because
+they have to go and parse `/proc/<pid>/mountinfo`.
 
-* [x] Implement a mount-specific companion to `statx()` that puts at least the
-  following information into `struct mount_info`:
+### Security hook for `create_user_ns()`.
 
-  **🙇 46eae99ef73302f9fb3dddcd67c374b3dffe8fd6 ("add statmount(2) syscall")`` 🙇**
+[x] (kAPI) Add security hook to `create_user_ns()`.
 
-  * mount flags: `MOUNT_ATTR_RDONLY`, ...
-  * time flags: `MOUNT_ATTR_RELATIME`, ...
-    Could probably be combined with mount flags.
-  * propagation setting: `MS_SHARED)`, ...
-  * peer group
-  * mnt id of the mount
-  * mnt id of the mount's parent
-  * owning userns
+**🙇 `7cd4c5c2101c ("security, lsm: Introduce security_create_user_ns()")` 🙇**
 
-  There's a bit more advanced stuff systemd would really want but which
-  I think is misplaced in a mountinfo system call including:
-  * list of primary and auxiliary block device major/minor
-  * diskseq value of those device nodes (This is a new block device feature
-    we added that allows preventing device recycling issues when e.g.
-    removing usb devices very quickly and is needed for udev.)
-  * uuid/fsid
-  * feature flags (`O_TMPFILE`, `RENAME_EXCHANGE` supported etc.)
+**Use-Case:** Allow LSMs to monitor user namespace creation.
 
-  **Use-Case:** low-level userspace tools have to interact with advanced
-  mount information constantly. This is currently costly and brittel because
-  they have to go and parse `/proc/<pid>/mountinfo`.
+### Idmapped mounts for tmpfs
 
-* [x] (kAPI) Add security hook to `create_user_ns()`.
+[x] Support idmapped mounts for tmpfs
 
-  **🙇 `7cd4c5c2101c ("security, lsm: Introduce security_create_user_ns()")` 🙇**
+**🙇 `7a80e5b8c6fa ("shmem: support idmapped mounts for tmpfs")` 🙇**
 
-  **Use-Case:** Allow LSMs to monitor user namespace creation.
+**Use-Case:** Runtimes such as Kubernetes use a lot of `tmpfs` mounts of
+individual files or directories to expose information to containers/pods.
+Instead of having to change ownership permanently allow them to use an
+idmapped mount instead.
 
-* [x] Support idmapped mounts for tmpfs
+@rata and @giuseppe brought this suggestion forward. For Kubernetes it is
+sufficient to support idmapped mounts of `tmpfs` instances mounted in the
+initial user namespace. However, in the future idmapped
+mounts of `tmpfs` instances mounted in user namespaces should be supported.
+Other container runtimes want to make use of this. The kernel is able to
+support this since at least `5.17`.
 
-  **🙇 `7a80e5b8c6fa ("shmem: support idmapped mounts for tmpfs")` 🙇**
+Things to remember are that `tmpfs` mounts can serve as lower- or upper
+layers in `overlayfs` and care needs to be taken that this remains safe if
+idmapped mounts of `tmpfs` instances mounted in user namespaces are
+supported.
 
-  **Use-Case:** Runtimes such as Kubernetes use a lot of `tmpfs` mounts of
-  individual files or directories to expose information to containers/pods.
-  Instead of having to change ownership permanently allow them to use an
-  idmapped mount instead.
+### Additional identifiers for pidfds in `statx()`
 
-  @rata and @giuseppe brought this suggestion forward. For Kubernetes it is
-  sufficient to support idmapped mounts of `tmpfs` instances mounted in the
-  initial user namespace. However, in the future idmapped
-  mounts of `tmpfs` instances mounted in user namespaces should be supported.
-  Other container runtimes want to make use of this. The kernel is able to
-  support this since at least `5.17`.
+[x] Make `statx()` on a pidfd return additional recognizable identifiers
+in `.stx_ino`.
 
-  Things to remember are that `tmpfs` mounts can serve as lower- or upper
-  layers in `overlayfs` and care needs to be taken that this remains safe if
-  idmapped mounts of `tmpfs` instances mounted in user namespaces are
-  supported.
+**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**
 
-* [x] Make statx() on a pidfd return additional recognizable identifiers
-  in `.stx_ino`.
+It would be fantastic if issuing statx() on any pidfd would return some
+reasonably stable 64bit identifier for the process in `.stx_ino`. This would
+be perfect to identify processes pinned by a pidfd, and compare them.
 
-  **🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**
+### Namespace `binfmt_misc` filesystem
 
-  It would be fantastic if issuing statx() on any pidfd would return some
-  reasonably stable 64bit identifier for the process in `.stx_ino`. This would
-  be perfect to identify processes pinned by a pidfd, and compare them.
+[x] Make the `binfmt_misc` filesystem namespaced.
 
-* [x] Namespace `binfmt_misc` filesystem.
+**🙇 `21ca59b365c0 ("binfmt_misc: enable sandboxed mounts")` 🙇**
 
-  **🙇 `21ca59b365c0 ("binfmt_misc: enable sandboxed mounts")` 🙇**
+**Use-Case:** Allow containers and sandboxes to register their own binfmt
+handlers.
 
-  **Use-Case:** Allow containers and sandboxes to register their own binfmt
-  handlers.
+### Support idmapped mounts for `overlayfs`
 
-* [x] Support idmapped mounts for `overlayfs`
+[x] Support idmapped mounts for `overlayfs`
 
-  **🙇 `bc70682a497c ("ovl: support idmapped layers")` 🙇**
+**🙇 `bc70682a497c ("ovl: support idmapped layers")` 🙇**
 
-  **Use-Case:** Allow containers to use `overlayfs` with idmapped mounts.
+**Use-Case:** Allow containers to use `overlayfs` with idmapped mounts.