v6.9.1-scx1 #26

Merged
merged 753 commits into from
May 22, 2024

Conversation

Byte-Lab
Contributor

@Byte-Lab Byte-Lab commented May 22, 2024

First official 6.9 release. Includes all sched_ext-specific patches (not necessarily including bpf patches) up to 6f386ca in the sched-ext/sched_ext tree.

ThinkerYzu1 and others added 30 commits March 14, 2024 13:34
According to a report, skeletons fail to assign shadow pointers when
compiled as part of C++ programs. Unlike C, which implicitly converts void
pointers, C++ requires an explicit cast.

To support C++, add an explicit cast for each shadow pointer.

Also add struct_ops_module.skel.h to test_cpp to validate C++
compilation as part of BPF selftests.

Signed-off-by: Kui-Feng Lee <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Acked-by: Quentin Monnet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Currently the 'bpftool link' command does not show pids, e.g.,
  $ tools/build/bpftool/bpftool link
  ...
  4: tracing  prog 23
        prog_type lsm  attach_type lsm_mac
        target_obj_id 1  target_btf_id 31320

Apply the following hack to enable normal libbpf debug output:
  --- a/tools/bpf/bpftool/pids.c
  +++ b/tools/bpf/bpftool/pids.c
  @@ -121,9 +121,9 @@ int build_obj_refs_table(struct hashmap **map, enum bpf_obj_type type)
          /* we don't want output polluted with libbpf errors if bpf_iter is not
           * supported
           */
  -       default_print = libbpf_set_print(libbpf_print_none);
  +       /* default_print = libbpf_set_print(libbpf_print_none); */
          err = pid_iter_bpf__load(skel);
  -       libbpf_set_print(default_print);
  +       /* libbpf_set_print(default_print); */

Rerun the above bpftool command:
  $ tools/build/bpftool/bpftool link
  libbpf: prog 'iter': BPF program load failed: Permission denied
  libbpf: prog 'iter': -- BEGIN PROG LOAD LOG --
  0: R1=ctx() R10=fp0
  ; struct task_struct *task = ctx->task; @ pid_iter.bpf.c:69
  0: (79) r6 = *(u64 *)(r1 +8)          ; R1=ctx() R6_w=ptr_or_null_task_struct(id=1)
  ; struct file *file = ctx->file; @ pid_iter.bpf.c:68
  ...
  ; struct bpf_link *link = (struct bpf_link *) file->private_data; @ pid_iter.bpf.c:103
  80: (79) r3 = *(u64 *)(r8 +432)       ; R3_w=scalar() R8=ptr_file()
  ; if (link->type == bpf_core_enum_value(enum bpf_link_type___local, @ pid_iter.bpf.c:105
  81: (61) r1 = *(u32 *)(r3 +12)
  R3 invalid mem access 'scalar'
  processed 39 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 2
  -- END PROG LOAD LOG --
  libbpf: prog 'iter': failed to load: -13
  ...

'file->private_data' has type 'void *', which caused the subsequent
'link->type' access (insn #81) to fail verification.

To fix the issue, restore the previous BPF_CORE_READ so old kernels also work.
With this patch, 'bpftool link' runs successfully and shows pids:
  $ tools/build/bpftool/bpftool link
  ...
  4: tracing  prog 23
        prog_type lsm  attach_type lsm_mac
        target_obj_id 1  target_btf_id 31320
        pids systemd(1)

Fixes: 44ba7b3 ("bpftool: Use a local copy of BPF_LINK_TYPE_PERF_EVENT in pid_iter.bpf.c")
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Tested-by: Quentin Monnet <[email protected]>
Reviewed-by: Quentin Monnet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
In bpf_object_load_prog(), there's no guarantee that obj->btf is non-NULL
when passing it to btf__fd(), and this function does not perform any
check before dereferencing its argument (as bpf_object__btf_fd() used to
do). As a consequence, we get segmentation fault errors in bpftool (for
example) when trying to load programs that come without BTF information.

v2: Keep btf__fd() in the fix instead of reverting to bpf_object__btf_fd().

Fixes: df7c3f7 ("libbpf: make uniform use of btf__fd() accessor inside libbpf")
Suggested-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Accept additional fields of a struct_ops type with all zero values even if
these fields are not in the corresponding type in the kernel. This provides
a way to be backward compatible. User space programs can use the same map
on a machine running an old kernel by clearing fields that do not exist in
the kernel.

Signed-off-by: Kui-Feng Lee <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
A new version of a type may have additional fields that do not exist in
older versions. Previously, libbpf would reject struct_ops maps with a new
version containing extra fields when running on a machine with an old
kernel. However, we have updated libbpf to ignore these fields if their
values are all zeros or null in order to provide backward compatibility.

Signed-off-by: Kui-Feng Lee <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
…pdated-version'

Kui-Feng Lee says:

====================
Ignore additional fields in the struct_ops maps in an updated version.

According to an offline discussion, it would be beneficial to
implement a backward-compatible method for struct_ops types with
additional fields that are not present in older kernels.

This patchset accepts additional fields of a struct_ops map with all
zero values even if these fields are not in the corresponding type in
the kernel. This provides a way to be backward compatible. User space
programs can use the same map on a machine running an old kernel by
clearing fields that do not exist in the kernel.

For example, a test case adds an additional field "zeroed" that doesn't
exist in the kernel's struct bpf_testmod_ops.

    struct bpf_testmod_ops___zeroed {
    	int (*test_1)(void);
    	void (*test_2)(int a, int b);
    	int (*test_maybe_null)(int dummy, struct task_struct *task);
    	int zeroed;
    };

    SEC(".struct_ops.link")
    struct bpf_testmod_ops___zeroed testmod_zeroed = {
    	.test_1 = (void *)test_1,
    	.test_2 = (void *)test_2_v2,
    };

Here, the "zeroed" field of testmod_zeroed is not assigned, so by default
its value is zero. The map will therefore be accepted by libbpf, which
will skip the "zeroed" field. However, if "zeroed" is assigned any value
other than 0, libbpf will refuse to load the map.
---
Changes from v1:

 - Fix the issue about function pointer fields.

 - Change a warning message, and add an info message for skipping
   fields.

 - Add a small demo of additional arguments that are not in the
   function pointer prototype in the kernel.

v1: https://lore.kernel.org/all/[email protected]/

Kui-Feng Lee (3):
  libbpf: Skip zeroed or null fields if not found in the kernel type.
  selftests/bpf: Ensure libbpf skip all-zeros fields of struct_ops maps.
  selftests/bpf: Accept extra arguments if they are not used.

 tools/lib/bpf/libbpf.c                        |  24 +++-
 .../bpf/prog_tests/test_struct_ops_module.c   | 103 ++++++++++++++++++
 .../bpf/progs/struct_ops_extra_arg.c          |  49 +++++++++
 .../selftests/bpf/progs/struct_ops_module.c   |  16 ++-
 4 files changed, 186 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_extra_arg.c
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
The UEI macros were updated in a prior commit. Apply the changes to the
sched_ext selftests dir.

Signed-off-by: David Vernet <[email protected]>
Right now we're just printing what the user passes to SCX_ERROR(). This
can cause the output from that error message to appear on the same line
as the results output from the test runner. Let's append a newline.

Signed-off-by: David Vernet <[email protected]>
scx: Update selftests to use new UEI macros
Copy over main program's sleepable bit into subprog's info. This might
be important for, e.g., freplace cases.

Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
…_ro()

set_memory_ro() can fail, leaving memory unprotected.

Check its return value and treat a failure as an error.

Link: KSPP/linux#7
Signed-off-by: Christophe Leroy <[email protected]>
Cc: [email protected] <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Message-ID: <286def78955e04382b227cb3e4b6ba272a7442e3.1709850515.git.christophe.leroy@csgroup.eu>
Signed-off-by: Alexei Starovoitov <[email protected]>
…ry_lock_ro()

set_memory_rox() can fail, leaving memory unprotected.

Check return and bail out when bpf_jit_binary_lock_ro() returns
an error.

Link: KSPP/linux#7
Signed-off-by: Christophe Leroy <[email protected]>
Cc: [email protected] <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Puranjay Mohan <[email protected]>
Reviewed-by: Ilya Leoshkevich <[email protected]>  # s390x
Acked-by: Tiezhu Yang <[email protected]>  # LoongArch
Reviewed-by: Johan Almbladh <[email protected]> # MIPS Part
Message-ID: <036b6393f23a2032ce75a1c92220b2afcb798d5d.1709850515.git.christophe.leroy@csgroup.eu>
Signed-off-by: Alexei Starovoitov <[email protected]>
of_gpio.h is deprecated and subject to removal.
The driver doesn't use it; simply remove the unused header.

Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Sebastian Reichel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
There are statements with two semicolons. Remove the second one; it
is redundant.

Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
The free_sge_txq_old() function has an unnecessary NULL check on txq.
The check is not needed because txq is initialized to the address
&txq_info->uldtxq[i], which is never NULL.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: ab677ff ("cxgb4: Allocate Tx queues dynamically")
Signed-off-by: Mikhail Lobanov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
There is an "if (err)" check earlier, so the "if (err < 0)" check that
this patch removes is unnecessary. It was an oversight made while
adjusting bpf_struct_ops_prepare_trampoline() such that the caller does
not have to worry about the new page when the function returns an error.

Fixes: 187e2af ("bpf: struct_ops supports more than one page for trampolines.")
Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
libbpf creates bpf_program/bpf_map structs for each program/map that the
user defines, but it allows disabling the creation/loading of those
objects in the kernel, in which case they won't have an associated file
descriptor (fd < 0). This functionality is used for backward
compatibility with some older kernels.

Nothing prevents users from passing these maps or programs with no
kernel counterpart to libbpf APIs. This change introduces explicit
checks for the existence of the kernel objects, aiming to improve the
visibility of these edge cases and provide meaningful warnings to users.

Signed-off-by: Mykyta Yatsenko <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Last user of arch_unprotect_bpf_trampoline() was removed by
commit 187e2af ("bpf: struct_ops supports more than one page for
trampolines.")

Remove arch_unprotect_bpf_trampoline().

Reported-by: Daniel Borkmann <[email protected]>
Fixes: 187e2af ("bpf: struct_ops supports more than one page for trampolines.")
Signed-off-by: Christophe Leroy <[email protected]>
Link: https://lore.kernel.org/r/42c635bb54d3af91db0f9b85d724c7c290069f67.1710574353.git.christophe.leroy@csgroup.eu
Signed-off-by: Martin KaFai Lau <[email protected]>
arch_protect_bpf_trampoline() and alloc_new_pack() call
set_memory_rox(), which can fail, leaving memory unprotected.

Take the return value of set_memory_rox() into account and add the
__must_check attribute to arch_protect_bpf_trampoline().

Signed-off-by: Christophe Leroy <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/fe1c163c83767fde5cab31d209a4a6be3ddb3a73.1710574353.git.christophe.leroy@csgroup.eu
Signed-off-by: Martin KaFai Lau <[email protected]>
…dly NULL

Make the sanity check a bit more concise and ensure that ops.cgroup_move()
is never called with a NULL source cgroup.
…ugh ops.cgroup_prep_move()

sched_move_task() takes an early exit if the source and destination are
identical. This triggers the warning in scx_cgroup_can_attach() as it leaves
p->scx.cgrp_moving_from uncleared.

Update the cgroup migration path so that ops.cgroup_prep_move() is skipped
for identity migrations so that its invocations always match
ops.cgroup_move() one-to-one.
The BPF map type LPM (Longest Prefix Match) is used heavily
in production by multiple products that have BPF components.
Perf data shows trie_lookup_elem() and longest_prefix_match()
appearing in the kernel's perf top.

For every level in the LPM trie, trie_lookup_elem() calls out
to longest_prefix_match(). The compiler is free to inline this
call but chooses not to, because other slowpath callers
(which can be invoked via syscall) exist, such as trie_update_elem(),
trie_delete_elem() and trie_get_next_key().

 bcc/tools/funccount -Ti 1 'trie_lookup_elem|longest_prefix_match.isra.0'
 FUNC                                    COUNT
 trie_lookup_elem                       664945
 longest_prefix_match.isra.0           8101507

Observation on a single random machine shows a factor of 12 between
the two functions, consistent with an average of 12 levels being
searched in the trie.

This patch force-inlines longest_prefix_match(), but only for
the lookup fastpath, to balance object code size.

In production with AMD CPUs, measuring the function latency of
'trie_lookup_elem' (bcc/tools/funclatency), we see a 7-8% reduction in
function latency with this patch applied (on production kernels 6.6 and
6.1). Analyzing perf data, we can explain this rather large improvement
by the reduced overhead of the AMD side-channel mitigation SRSO
(Speculative Return Stack Overflow).

Fixes: fb3bd91 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/171076828575.2141737.18370644069389889027.stgit@firesoul
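The optimization's shape can be sketched as below. This is a simplified byte-wise stand-in for the real helper (the kernel's longest_prefix_match() works on bits), shown only to illustrate force-inlining the hot helper into the lookup fastpath while slowpath callers would keep an out-of-line copy:

```c
#include <assert.h>

/* Force-inlined helper: every call site gets the body inlined, avoiding a
 * function call and, on affected AMD CPUs, an SRSO-mitigated return. */
static __attribute__((always_inline)) inline
unsigned int prefix_match(const unsigned char *a, const unsigned char *b,
			  unsigned int maxlen)
{
	unsigned int i;

	for (i = 0; i < maxlen; i++)
		if (a[i] != b[i])
			break;
	return i;
}

/* Hot path: the helper above is guaranteed to be inlined here. */
unsigned int lookup_fastpath(const unsigned char *key,
			     const unsigned char *node, unsigned int len)
{
	return prefix_match(key, node, len);
}
```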
scx: cgroup: Fix mismatch between `ops.cgroup_prep_move()` and `ops.cgroup_move()` invocations
Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup
and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed
in tracing progs.

We have an internal use case where, for an application running
in a container (with a pid namespace), the user wants to get
the pid associated with the pid namespace in a cgroup bpf
program. Currently, cgroup bpf progs already allow
bpf_get_current_pid_tgid(). Let us allow bpf_get_ns_current_pid_tgid()
as well.

Auditing the code shows bpf_get_current_pid_tgid() is also used
by sk_msg progs. There is no side effect to exposing these two
helpers to all prog types, since they do not reveal any kernel-specific
data. The detailed discussion is in [1].

So with this patch, both bpf_get_current_pid_tgid() and bpf_get_ns_current_pid_tgid()
are put in bpf_base_func_proto(), making them available to all
program types.

  [1] https://lore.kernel.org/bpf/[email protected]/

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Replace CHECK in the ns_current_pid_tgid selftest with the recommended
ASSERT_* style. Also shorten the subtest names, as their prefix is
already covered by the test name.

This patch also fixes a testing issue: previously, even if
bss->user_{pid,tgid} was incorrect, the test still passed since the clone
function returned 0. Fix it to return a non-zero value if
bss->user_{pid,tgid} is incorrect.

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Refactor some functions in both the user space code and the bpf program,
as these functions are used by later cgroup/sk_msg tests.
Another change is to mark the tp program as optionally loaded, since
later patches will use optional loading as well, given their quite
different attachment and testing logic.

There is no functionality change.

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Add a cgroup bpf program test where the bpf program runs
in a pid namespace. The test succeeds:
  #165/3   ns_current_pid_tgid/new_ns_cgrp:OK

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Add a sk_msg bpf program test where the program runs in a pid
namespace. The test succeeds:
  #165/4   ns_current_pid_tgid/new_ns_sk_msg:OK

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Yonghong Song says:

====================
current_pid_tgid() for all prog types

Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup
and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed
in tracing progs.

We have an internal use case where, for an application running
in a container (with a pid namespace), the user wants to get
the pid associated with the pid namespace in a cgroup bpf
program. Besides cgroup, the only prog type supporting
bpf_get_current_pid_tgid() but not bpf_get_ns_current_pid_tgid()
is sk_msg.

But actually, neither bpf_get_current_pid_tgid() nor
bpf_get_ns_current_pid_tgid() reveals kernel-internal data, and there
is no reason they cannot be used in other program types. This patchset
does exactly that, enabling these two helpers for all program types.

Patch 1 adds the kernel support and patches 2-5 add
the tests for cgroup and sk_msg.

Change logs:
  v1 -> v2:
    - allow bpf_get_[ns_]current_pid_tgid() for all prog types.
    - for network related selftests, using netns.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
…unnel.sh

In some systems, the netcat server can incur a delay before it starts
listening. When this happens, the test can randomly fail at various points.
This is an example error message:

   # ip gre none gso
   # encap 192.168.1.1 to 192.168.1.2, type gre, mac none len 2000
   # test basic connectivity
   # Ncat: Connection refused.

The issue stems from a race condition between the netcat client and server.
The test author had addressed this problem with a sleep, which this patch
removes. Instead, this patch introduces a function that waits for up to two
seconds but terminates early once the port is reported to be listening.

Signed-off-by: Alessandro Carminati (Red Hat) <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Byte-Lab and others added 26 commits May 8, 2024 00:29
In a recent commit, we refactored bpf_exit_bstr_common() into
bstr_format(). This unfortunately regressed printing output passed from
BPF, as we incorrectly invoked bpf_exit_bstr_common() in the caller. Fix
it.

Signed-off-by: David Vernet <[email protected]>
(cherry picked from commit b46212c)
Signed-off-by: David Vernet <[email protected]>
Nothing's gained by keeping these percpu. Let's make them global.

(cherry picked from commit a609649)
No functional changes.

(cherry picked from commit 1c3ec4b)
Add dump_line() which wraps seq_buf_vprintf(). Replace seq_buf_printf() and
seq_buf_putc() usages with the new wrapper. Also, don't print more than one
line. This is preparation for sending debug dump through a tracepoint.

(cherry picked from commit f5b97f2)
Let's output stack trace using dump_line() too.

(cherry picked from commit 9d9b07c)
- Factor out __bstr_format() which takes broken out buffer arguments. Will
  be used to implement incremental formatting.

- bstr_format() now takes struct scx_bstr_buf * instead of implicitly using
  the global buffer.

- Rename scx_bstr_buf to scx_exit_bstr_buf and switch the dump path to its
  own bstr_buf.

No functional changes.

(cherry picked from commit 8ee9ebe)
scx_bpf_dump_bstr() was outputting character by character into the dump
buffer. This is fine for the dump attached to the exit info, but we also want
to output through a tracepoint. This patch makes ops dump output in whole
lines so that dump_line() is always called with a full line.

(cherry picked from commit 9263e7f)
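The line-buffering change can be sketched in userspace C. The names mirror the commit loosely and are illustrative: characters accumulate in a buffer, and the dump_line() consumer only ever sees complete lines, as a tracepoint would.

```c
#include <string.h>

struct line_buf {
	char buf[128];
	unsigned int len;
	unsigned int lines_emitted;
};

/* Consume one complete line; real code would hand it to seq_buf / a TP. */
static void dump_line(struct line_buf *lb)
{
	lb->buf[lb->len] = '\0';
	lb->lines_emitted++;
	lb->len = 0;
}

/* Accumulate characters, emitting a full line on '\n' (or when the
 * buffer fills up, for overlong lines). */
static void dump_putc(struct line_buf *lb, char c)
{
	if (c == '\n') {
		dump_line(lb);
		return;
	}
	if (lb->len == sizeof(lb->buf) - 1)
		dump_line(lb);	/* overlong line: flush what we have */
	lb->buf[lb->len++] = c;
}
```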
Make dump_line() append newline implicitly and add dump_newline() for
outputting a blank line. The latter is needed because __printf format
checking gets unhappy with empty format string "". This doesn't cause any
behavior changes and will be used to enable dump output through a
tracepoint.

(cherry picked from commit b9aeb1b)
- In dump_stack_trace(), seq_buf_commit() triggers BUG() if @s is already
  overflowed when @ns was created. Commit iff @avail is non-zero.

- seq_buf_putc() and seq_buf_vprintf() trigger a warning if @s has zero
  length, which can happen with @ns. Skip them for empty seq_bufs.

(cherry picked from commit 13aae89)
Add a new tracepoint, sched_ext_dump, which also receives the dump output.
The TP output is not limited by the dump buffer size, and truncation will
only happen on trace buffer overflows.

(cherry picked from commit 3766574)
This triggers debug dump in a non-destructive manner.

(cherry picked from commit 5983930)
(cherry picked from commit d180eff)
Some functions were using an scx_rq local variable to cache rq->scx, and
dequeue_task_scx() was taking scx_rq as an argument. "scx_rq->" isn't any
shorter than "rq->scx." and the inconsistency adds to confusion. Let's
always use rq.

(cherry picked from commit 3baae12)
No need for this to be a separate variable.

- As this removes symbol name collision, rename test_rq_online() to
  scx_rq_online().

- [un]likely() annotation moved from its users to scx_rq_online().

- On/offline status should agree with ops->cpu_on/offline(). In the existing
  code, the two states could deviate when rq_on/offline_scx() were called
  for sched domain updates. Fix it so that they always agree.

(cherry picked from commit 3a44769)
The two being separate variables only makes things cumbersome.

- Move scx_dsp_buf into scx_dsp_ctx and dynamically allocate the latter.

- Rename scx_dsp_ctx->buf_cursor to ->cursor for brevity.

(cherry picked from commit 7d196e5)
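Folding the buffer into its context struct can be sketched with a flexible array member, so a single allocation holds both. Field and struct names below loosely mirror the commit and are illustrative:

```c
#include <stdlib.h>

struct dsp_ctx {
	unsigned int cursor;	/* was scx_dsp_ctx->buf_cursor */
	unsigned long buf[];	/* was the separate scx_dsp_buf */
};

/* One dynamic allocation sized for the header plus nr_slots buffer slots. */
static struct dsp_ctx *dsp_ctx_alloc(unsigned int nr_slots)
{
	return calloc(1, sizeof(struct dsp_ctx) +
			 nr_slots * sizeof(unsigned long));
}
```

The flexible array member removes the second variable and the pointer hop between context and buffer, which is the cumbersomeness the commit eliminates.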
…se-after-free

scx_ops_enable() was iterating all tasks without locking its rq and then
calling get_task_struct() on them. However, this can race against task free
path. The task struct itself is accessible but its refcnt may already have
reached zero by the time get_task_struct() is called. This triggers the
following refcnt warning.

  WARNING: CPU: 11 PID: 2834521 at lib/refcount.c:25 refcount_warn_saturate+0x7d/0xe0
  ...
  RIP: 0010:refcount_warn_saturate+0x7d/0xe0
  ...
  Call Trace:
   bpf_scx_reg+0x851/0x890
   bpf_struct_ops_link_create+0xbd/0xe0
   __sys_bpf+0x2862/0x2d20
   __x64_sys_bpf+0x18/0x20
   do_syscall_64+0x3d/0x90
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

This can also affect other loops, as they can call sched functions on tasks
which are already destroyed. Let's fix it by making
scx_task_iter_filtered_locked() also filter out TASK_DEAD tasks and
switching the scx_task_iter_filtered() usage in scx_ops_enable() to
scx_task_iter_filtered_locked(). This makes the next function always test
TASK_DEAD while holding the task's rq lock, and the rq lock guarantees that
the task won't be freed, closing the race window.

(cherry picked from commit f266831)
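The unsafe-increment hazard behind this race can be sketched in userspace C with a "tryget" refcount, analogous to the kernel's refcount_inc_not_zero(): once a count has dropped to zero the object is being freed, so taking a reference must fail rather than resurrect it. (Filtering TASK_DEAD under the rq lock gives the fix an equivalent guarantee.)

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Increment the refcount only if it has not already reached zero.
 * Returns false when the object is already on its way to being freed. */
static bool refcount_inc_not_zero(atomic_int *rc)
{
	int old = atomic_load(rc);

	while (old != 0)
		if (atomic_compare_exchange_weak(rc, &old, old + 1))
			return true;
	return false;
}
```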
Receive the skel.attach() fix.

(cherry picked from commit ddd689b)
Fix "implemenation" to "implementation".

(cherry picked from commit bebdd58)
DSQs have their own lifecycle which is protected by RCU. Let's add a simple
selftest that just stresses the create / free path so we get a bit of coverage.

Signed-off-by: David Vernet <[email protected]>
(cherry picked from commit 67841ae)
In commit f266831 ("scx: Close a small race window in the enable path
which can lead to use-after-free"), we fixed a race window on the enable path
that could cause a crash. This fix is fine, but was a bit too aggressive in
that it could also cause us to miss ops.exit_task() invocations in the
following scenario:

1. A task exits and invokes do_task_dead() (making its state TASK_DEAD), but
   someone still holds a refcount on it somewhere.

2. The scheduler is disabled.

3. On the disable path, we don't invoke ops.exit_task()

4. We don't invoke it in sched_ext_free() either later, because by then the
   scheduler has been disabled.

Let's ensure we don't skip on exiting the task by still calling
scx_ops_exit_task() for TASK_DEAD tasks on the disable path.

Signed-off-by: David Vernet <[email protected]>
(cherry picked from commit db7b5f1)
Signed-off-by: David Vernet <[email protected]>
@Byte-Lab Byte-Lab requested a review from htejun May 22, 2024 20:06
@htejun htejun merged commit 42fdce2 into scx-6.9.y May 22, 2024
1 check passed
@htejun htejun deleted the scx-6.9.1 branch May 22, 2024 20:06
htejun added a commit that referenced this pull request May 22, 2024
SCX: Use "idle core" instead of "whole cpu"