kernel: Fix `ops.dequeue()` #1175

htejun · 2025-01-10T22:02:49Z

Dequeue handling is complicated for sched_ext because it's the only case where task ownership needs to be taken away from the BPF scheduler. The usual scheduling flow is that once a task wakes up and gets queued on the BPF scheduler (whether that's on a BPF data structure or a custom DSQ), the BPF scheduler has full ownership of the task until the scheduler decides to dispatch it. This clean transition of ownership makes it unnecessary to couple locking on the kernel side (per-rq locking) with whatever synchronization scheme the BPF scheduler uses.

However, dequeue doesn't follow this rule. When a task's property needs to be changed, the task is forcibly dequeued from the scheduler and re-enqueued after the property change. This can happen at any time and can't be denied or delayed. To support this while allowing hot-migration (e.g. for sharing a DSQ across multiple CPUs), sched_ext does a bit of synchronization dancing using p->scx.holding_cpu. This, combined with other factors, also allows dispatches to be opportunistic - a BPF scheduler is allowed to dispatch a task which it doesn't currently own. The sched_ext core will ignore such dispatches and simply try again. This in turn allows BPF schedulers to ignore and not implement ops.dequeue(). Consequently, ops.dequeue() as currently implemented is only called when the sched_ext core knows the task to be on a BPF data structure.
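To make the opportunistic-dispatch idea concrete, here is a minimal user-space sketch. It is a toy model, not the sched_ext implementation: `own_task`, `enq_gen` and every `own_*` function are hypothetical names invented for illustration, and the generation counter is an assumed stand-in for whatever the core actually uses (including `p->scx.holding_cpu`) to detect that ownership has moved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; none of these names are sched_ext APIs. */
enum own_state { OWN_CORE, OWN_BPF };

struct own_task {
	enum own_state owner;	/* who currently owns the task */
	uint32_t enq_gen;	/* assumed token, bumped on every enqueue */
};

/* Core hands the task to the BPF scheduler and returns the new token. */
static uint32_t own_enqueue(struct own_task *t)
{
	t->owner = OWN_BPF;
	return ++t->enq_gen;
}

/* Forced dequeue (e.g. a property change): the scheduler cannot refuse. */
static void own_force_dequeue(struct own_task *t)
{
	t->owner = OWN_CORE;
}

/*
 * Opportunistic dispatch: the scheduler may call this for a task it no
 * longer owns. A stale attempt is ignored instead of being an error.
 * Returns true only if the dispatch took effect.
 */
static bool own_try_dispatch(struct own_task *t, uint32_t gen)
{
	assert(gen <= t->enq_gen);	/* tokens only come from own_enqueue() */
	if (t->owner != OWN_BPF || t->enq_gen != gen)
		return false;
	t->owner = OWN_CORE;
	return true;
}
```

In this model, a scheduler that never hears about the forced dequeue still behaves correctly: its stale dispatch is dropped, and the task shows up again through a later enqueue with a fresh token.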

While not implementing ops.dequeue() works fine most of the time, there are cases where the BPF scheduler wants to track ownership and property transitions, and the current implementation of ops.dequeue() doesn't support that. For example, scx_layered wants to track a per-DSQ sum of the average runtimes of queued tasks, and a straightforward way to do that would be to add to the sum when a task is enqueued and subtract when the task is either dispatched or dequeued. However, because ops.dequeue() is not called when a task is on a DSQ, there's no way to detect dequeues. Instead, scx_layered depends on detecting back-to-back enqueues: https://github.com/sched-ext/scx/blob/main/scheds/rust/scx_layered/src/bpf/main.bpf.c#L1177
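The add-on-enqueue, subtract-on-dispatch-or-dequeue accounting can be sketched in plain user-space C. This is an illustration only, not scx_layered code: `acct_dsq`, `acct_task` and the `acct_*` functions are hypothetical, and the sketch assumes each task's contribution is the average runtime it had at enqueue time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; none of these names come from scx_layered. */
struct acct_dsq {
	uint64_t queued_runtime_sum;	/* sum over currently queued tasks */
	uint32_t nr_queued;
};

struct acct_task {
	uint64_t avg_runtime;	/* may change while the task is queued */
	uint64_t accounted;	/* amount added at enqueue time */
	bool queued;
};

/* Enqueue path: add the task's contribution to the DSQ sum. */
static void acct_enqueue(struct acct_dsq *d, struct acct_task *t)
{
	assert(!t->queued);	/* holds only if every exit path is reported */
	t->accounted = t->avg_runtime;
	d->queued_runtime_sum += t->accounted;
	d->nr_queued++;
	t->queued = true;
}

/* Common exit: subtract exactly what was added at enqueue. */
static void acct_leave(struct acct_dsq *d, struct acct_task *t)
{
	assert(t->queued && d->queued_runtime_sum >= t->accounted);
	d->queued_runtime_sum -= t->accounted;
	d->nr_queued--;
	t->queued = false;
}

/* Dispatch path: the scheduler handed the task over. */
static void acct_dispatch(struct acct_dsq *d, struct acct_task *t)
{
	acct_leave(d, t);
}

/* Dequeue path: the task was taken away from the scheduler. */
static void acct_dequeue(struct acct_dsq *d, struct acct_task *t)
{
	acct_leave(d, t);
}
```

The sum stays exact only if every enqueue is matched by exactly one dispatch or dequeue. If the dequeue notification never arrives, the task's contribution stays in the sum until some other signal, such as a second enqueue of the same task, reveals that it had left.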

Update ops.dequeue() so that an enqueued task is always either dispatched (moved to a local or global DSQ) or dequeued.

htejun added the `kernel` (Expose a kernel issue) label Jan 10, 2025

htejun assigned etsal Jan 10, 2025