kernel: Fix ops.dequeue() #1175

Open
htejun opened this issue Jan 10, 2025 · 0 comments
Labels
kernel Expose a kernel issue

Comments

@htejun
Contributor

htejun commented Jan 10, 2025

Dequeue handling is complicated for sched_ext because it's the only case where task ownership needs to be taken away from the BPF scheduler. In the usual scheduling flow, once a task wakes up and gets queued on the BPF scheduler (whether on a BPF data structure or a custom DSQ), the BPF scheduler has full ownership of the task until it decides to dispatch the task. This clean transition of ownership makes it unnecessary to couple kernel-side locking (per-rq locking) with whatever synchronization scheme the BPF scheduler uses.

However, dequeue doesn't follow this rule. When a task's property needs to be changed, the task is forcibly dequeued from the scheduler and re-enqueued after the property change. This can happen at any time and can't be denied or delayed. To support this while allowing hot-migration (e.g. for sharing a DSQ across multiple CPUs), sched_ext does a bit of synchronization dancing using p->scx.holding_cpu. Combined with other factors, this also allows dispatches to be opportunistic: a BPF scheduler is allowed to dispatch a task which it doesn't currently own. The sched_ext core will ignore such dispatches and simply try again. This in turn allows BPF schedulers to not implement ops.dequeue() at all. Correspondingly, ops.dequeue() as currently implemented is only called when the sched_ext core knows the task to be on a BPF data structure.
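To make the "opportunistic dispatch" idea concrete, here is a minimal user-space sketch. It is a toy model, not the kernel implementation: `toy_task`, `toy_enqueue()`, `toy_force_dequeue()` and `toy_try_dispatch()` are hypothetical names, and a single ownership flag stands in for the real synchronization around p->scx.holding_cpu.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not sched_ext symbols. */
struct toy_task {
    bool owned_by_bpf;  /* true between enqueue and dispatch/dequeue */
    int  nr_runs;       /* how many dispatches the core accepted */
};

/* The task wakes up and is handed to the BPF scheduler. */
static void toy_enqueue(struct toy_task *p)
{
    assert(!p->owned_by_bpf);
    p->owned_by_bpf = true;
}

/* The core forcibly takes the task back, e.g. for a property change. */
static void toy_force_dequeue(struct toy_task *p)
{
    p->owned_by_bpf = false;
}

/* The BPF scheduler asks the core to dispatch p; true means accepted. */
static bool toy_try_dispatch(struct toy_task *p)
{
    if (!p->owned_by_bpf)
        return false;   /* stale dispatch: silently ignored */
    p->owned_by_bpf = false;
    p->nr_runs++;
    return true;
}
```

In this model a scheduler that never hears about the forced dequeue can still call `toy_try_dispatch()` safely; the stale attempt is dropped and a later attempt after the re-enqueue succeeds.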

While not implementing ops.dequeue() works fine most of the time, there are cases where the BPF scheduler wants to track ownership and property transitions, and the current implementation of ops.dequeue() doesn't support that. For example, scx_layered wants to track the per-DSQ sum of queued tasks' average runtimes. A straightforward way to do that would be to add on enqueue and subtract when the task is either dispatched or dequeued. However, because ops.dequeue() is not called when a task is on a DSQ, there's no way to detect dequeues. Instead, scx_layered depends on detecting back-to-back enqueues: https://github.com/sched-ext/scx/blob/main/scheds/rust/scx_layered/src/bpf/main.bpf.c#L1177

Update ops.dequeue() so that an enqueued task is always either dispatched (moved to a local or global DSQ) or dequeued.
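If every enqueue were guaranteed to be followed by exactly one dispatch or dequeue notification, the accounting could become a simple add/subtract pair. The following user-space sketch shows that under this assumption; the `acct_*` names are hypothetical and the handlers only stand in for the corresponding scheduler callbacks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-DSQ statistics; names are made up for this sketch. */
struct acct_dsq {
    uint64_t runtime_sum;  /* sum of avg runtimes of queued tasks */
    uint32_t nr_queued;    /* number of queued tasks */
};

static void acct_add(struct acct_dsq *dsq, uint64_t avg_runtime)
{
    dsq->runtime_sum += avg_runtime;
    dsq->nr_queued++;
}

static void acct_sub(struct acct_dsq *dsq, uint64_t avg_runtime)
{
    assert(dsq->nr_queued > 0 && dsq->runtime_sum >= avg_runtime);
    dsq->runtime_sum -= avg_runtime;
    dsq->nr_queued--;
}

/* One handler per way a task can enter or leave the queue. */
static void acct_on_enqueue(struct acct_dsq *dsq, uint64_t rt)
{
    acct_add(dsq, rt);
}

static void acct_on_dispatch(struct acct_dsq *dsq, uint64_t rt)
{
    acct_sub(dsq, rt);
}

static void acct_on_dequeue(struct acct_dsq *dsq, uint64_t rt)
{
    acct_sub(dsq, rt);
}
```

With the pairing guaranteed, the sum is accurate at every point in time and no per-task "was I already counted" state is needed.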

@htejun htejun added the kernel Expose a kernel issue label Jan 10, 2025