
scx: 90% performance degradation for Unixbench Multi-core Pipe-based Context Switch #998

Open
mmz-zmm opened this issue Nov 28, 2024 · 22 comments


@mmz-zmm

mmz-zmm commented Nov 28, 2024

I am using scx_bpfland v1.0.6 to test Unixbench on a 128-core, SMT-enabled (64 physical cores), 8-NUMA-node AMD server. The overall result is great thanks to your work, except that multi-core pipe-based context switch shows a 90%+ performance degradation vs EEVDF. The kernel version is 6.12.0.rc2.

Multi-core Pipe-based Context Switching (lps)
eevdf: 32508
scx_simple: 596
scx_central: 171
scx_layered: 638
scx_bpfland: 366
scx_lavd: 1839

They all show a 90%+ degradation on this test case. Since scx_simple and scx_central are simple/toy schedulers, it makes sense that they don't do well, but scx_layered/scx_bpfland/scx_lavd also perform badly here. I mainly use scx_bpfland; below are some of my attempts with it.

At first, perf showed that native_queued_spin_lock_slowpath was super high: enqueuing tasks to the global shared DSQ and consuming from it causes spinlock contention, so I added 8 DSQs, one per NUMA node. With that change native_queued_spin_lock_slowpath still exists and the previous call sites show up less, but the context-switch score remains the same. The current perf top output is shown below:

[screenshot: perf top output]

Besides, even when not running the Unixbench pipe-based context switch test, with scx_bpfland just enabled, native_queued_spin_lock_slowpath is still very high, related to the CPU idle path.
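
For reference, the per-NUMA-node DSQ layout I experimented with looks roughly like the sketch below (a minimal sketch using the sched_ext kfuncs as named around kernel 6.12, i.e. scx_bpf_create_dsq()/scx_bpf_dispatch()/scx_bpf_consume(); it is not my actual scx_bpfland modification, and numa_enqueue/numa_dispatch and the cpu_to_node map are made-up names):

```c
/*
 * Minimal per-NUMA-node DSQ sketch -- not the actual scx_bpfland change.
 * Uses the sched_ext kfuncs as named around kernel 6.12.
 */
#include <scx/common.bpf.h>

#define MAX_NUMA_NODES	8
#define MAX_CPUS	1024

char _license[] SEC("license") = "GPL";

/* cpu -> NUMA node map, filled in by the userspace loader before attach */
const volatile u32 cpu_to_node[MAX_CPUS];

s32 BPF_STRUCT_OPS_SLEEPABLE(numa_init)
{
	s32 node, err;

	/* One DSQ per NUMA node instead of a single global DSQ */
	for (node = 0; node < MAX_NUMA_NODES; node++) {
		err = scx_bpf_create_dsq(node, node);
		if (err)
			return err;
	}
	return 0;
}

void BPF_STRUCT_OPS(numa_enqueue, struct task_struct *p, u64 enq_flags)
{
	u32 cpu = scx_bpf_task_cpu(p);

	/* Queue on the DSQ of the task's current node to localize contention */
	scx_bpf_dispatch(p, cpu_to_node[cpu & (MAX_CPUS - 1)],
			 SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(numa_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Pull work only from this CPU's node-local DSQ */
	scx_bpf_consume(cpu_to_node[cpu & (MAX_CPUS - 1)]);
}

SEC(".struct_ops.link")
struct sched_ext_ops numa_dsq_ops = {
	.enqueue	= (void *)numa_enqueue,
	.dispatch	= (void *)numa_dispatch,
	.init		= (void *)numa_init,
	.name		= "numa_dsq_sketch",
};
```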

So is there anything we can do to optimize this? From what I saw and tested, it isn't limited to scx_bpfland; is this a sched_ext kernel problem? All discussions are welcome.

PS: the following are the single-core scores, where scx_bpfland does pretty well.

Single-core Pipe-based Context Switching (lps)
eevdf: 168
scx_simple: 261
scx_central: 25
scx_layered: 44
scx_bpfland: 197
scx_lavd: 127

@mmz-zmm mmz-zmm changed the title scx: 90% performance degradation for Unxibench Multi-core Pipe-based Context Switch scx: 90% performance degradation for Unixbench Multi-core Pipe-based Context Switch Nov 28, 2024
@multics69
Contributor

multics69 commented Nov 28, 2024

@mmz-zmm -- Thanks for testing scx schedulers! On your 8-socket machine, LAVD should create 8 DSQs, one for each NUMA domain. However, I've never tested how much contention LAVD has on a large NUMA machine. If possible, could you share the perf results for LAVD?

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

> @mmz-zmm -- Thanks for testing scx schedulers! On your 8-socket machine, LAVD should create 8 DSQs, one for each NUMA domain. However, I've never tested how much contention LAVD has on a large NUMA machine. If possible, could you share the perf results for LAVD?

Of course. Running scx_lavd with perf record -F 99 -- ./Run -q -c 128 context1; after it finished, the perf report shows the following:

[screenshot: perf report for scx_lavd]

Note: when launching scx_lavd, it warns that the system does not provide proper CPU frequency information; not sure if this affects the final score.

@arighi
Contributor

arighi commented Nov 28, 2024

Can you also share the output of lscpu -e?

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

> Can you also share the output of lscpu -e?

lscpu_-e.txt
Hi arighi, I uploaded it as an attachment.

@arighi
Contributor

arighi commented Nov 28, 2024

Considering that we see idle CPU functions in perf top I was wondering if the per-NUMA idle cpumask change could help in this case...

@mmz-zmm if you have time (and if you can), could you give it a try with this kernel https://github.com/arighi/sched_ext/tree/arighi, in particular the last 2 patches, which split the global idle cpumask into multiple per-node cpumasks:

  sched_ext: Introduce per-NUMA idle cpumasks
  nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()
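
Roughly, the idea is the following (a conceptual sketch only, not the code in those patches; pick_idle_cpu_node() and idle_masks[] are made-up names):

```c
/*
 * Conceptual sketch only -- not the code in the patches above.
 * pick_idle_cpu_node() and idle_masks[] are made-up names; the real
 * patches live in kernel/sched/ext.c and use their own iterators.
 */
#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

static cpumask_var_t idle_masks[MAX_NUMNODES];	/* one idle cpumask per NUMA node */

static s32 pick_idle_cpu_node(const struct cpumask *cpus_allowed, s32 prev_cpu)
{
	int start = cpu_to_node(prev_cpu);
	int i;

	/*
	 * Scan nodes starting from the task's previous node and wrap around,
	 * so most lookups touch a node-local mask instead of one global
	 * cpumask that is shared (and bounced) across all 128 CPUs.
	 */
	for (i = 0; i < nr_node_ids; i++) {
		int node = (start + i) % nr_node_ids;
		int cpu = cpumask_any_and(idle_masks[node], cpus_allowed);

		if (cpu < nr_cpu_ids)
			return cpu;
	}
	return -EBUSY;
}
```

The point is that each CPU scans its own node's mask first, so the cacheline bouncing on a single global idle cpumask goes away.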

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

> Can you also share the output of lscpu -e?

From what I observed, when running the multi-core context1 test, each context1 process forks a child; the two are connected by two pipes and communicate by reading from and writing to them, and one round trip is a context switch. EEVDF prefers to put the context1 pairs in different L3 domains, while scx_bpfland prefers to put them in the same L3 domain until all CPUs in that L3 domain are busy.
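
To illustrate, each context1 instance is essentially the ping-pong loop below (a simplified sketch of what UnixBench's context1 does, not the exact benchmark source):

```c
/*
 * Simplified sketch of what UnixBench's context1 workload does -- not the
 * exact benchmark source. Parent and child ping-pong a counter over two
 * pipes; every round trip forces (at least) two context switches.
 */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int p2c[2], c2p[2];	/* parent->child and child->parent pipes */
	unsigned long iter, v;

	if (pipe(p2c) < 0 || pipe(c2p) < 0) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* Child: read the counter, echo it back. */
		while (read(p2c[0], &v, sizeof(v)) == sizeof(v))
			if (write(c2p[1], &v, sizeof(v)) != sizeof(v))
				break;
		_exit(0);
	}

	/*
	 * Parent: write the counter, wait for the echo, repeat. The real
	 * benchmark loops until an alarm fires and reports loops per second
	 * (lps); here we just do a fixed number of round trips.
	 */
	for (iter = 0; iter < 1000000; iter++) {
		if (write(p2c[1], &iter, sizeof(iter)) != sizeof(iter))
			break;
		if (read(c2p[0], &v, sizeof(v)) != sizeof(v) || v != iter)
			break;
	}

	close(p2c[1]);		/* EOF lets the child exit */
	wait(NULL);
	printf("%lu round trips\n", iter);
	return 0;
}
```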

@arighi
Copy link
Contributor

arighi commented Nov 28, 2024

> Can you also share the output of lscpu -e?
>
> From what I observed, when running the multi-core context1 test, each context1 process forks a child; the two are connected by two pipes and communicate by reading from and writing to them, and one round trip is a context switch. EEVDF prefers to put the context1 pairs in different L3 domains, while scx_bpfland prefers to put them in the same L3 domain until all CPUs in that L3 domain are busy.

I'm pretty sure we can also do something at the scheduler level, but the fact that pretty much all the scx schedulers are showing poor performance with this particular test case / architecture makes me think it's more like a kernel issue...

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

> Considering that we see idle CPU functions in perf top I was wondering if the per-NUMA idle cpumask change could help in this case...
>
> @mmz-zmm if you have time (and if you can), could you give it a try with this kernel https://github.com/arighi/sched_ext/tree/arighi, in particular the last 2 patches, which split the global idle cpumask into multiple per-node cpumasks:
>
>   sched_ext: Introduce per-NUMA idle cpumasks
>   nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()

OK, I will give it a try. It may need some time; I will post the results here once it's done.

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

> Considering that we see idle CPU functions in perf top I was wondering if the per-NUMA idle cpumask change could help in this case...
> @mmz-zmm if you have time (and if you can), could you give it a try with this kernel https://github.com/arighi/sched_ext/tree/arighi, in particular the last 2 patches, which split the global idle cpumask into multiple per-node cpumasks:
>
>   sched_ext: Introduce per-NUMA idle cpumasks
>   nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()
>
> OK, I will give it a try. It may need some time; I will post the results here once it's done.

The score improved from 366 to 630 (_find_next_and_bit is lower), but it is still very low vs EEVDF.
[screenshot: perf report with the patched kernel]

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

If I use 8 DSQs instead of 1 global DSQ, the contention decreases and the score can reach 1400, the same level as scx_lavd. The perf report is shown below:
[screenshot: perf report with 8 DSQs]

@arighi
Contributor

arighi commented Nov 28, 2024

Is scx_simple also triggering that stall with the new kernel, or just bpfland? At least the idle cpumask split seems to improve performance in this case. But it could also be due to other kernel changes... it'd be interesting to see what happens dropping the last 2 patches from my tree, if you still have time to do more tests. BTW thanks tons for testing this!

@mmz-zmm
Author

mmz-zmm commented Nov 29, 2024

> Is scx_simple also triggering that stall with the new kernel, or just bpfland? At least the idle cpumask split seems to improve performance in this case. But it could also be due to other kernel changes... it'd be interesting to see what happens dropping the last 2 patches from my tree, if you still have time to do more tests. BTW thanks tons for testing this!

My bad, I was using a modified scx_bpfland with 8 global DSQs, which caused the crash. It has nothing to do with the original v1.0.6 scx_bpfland, so ignore it.

@multics69
Contributor

@mmz-zmm -- Did you happen to run lavd (creating 8 DSQs) with the per-NUMA cpumask patch? That would show us whether there is another bottleneck. Unfortunately, I couldn't reproduce the problem on my 2-socket (40-CPU) Intel Xeon server. It must have too few cores.

@mmz-zmm
Author

mmz-zmm commented Nov 29, 2024

> @mmz-zmm -- Did you happen to run lavd (creating 8 DSQs) with the per-NUMA cpumask patch? That would show us whether there is another bottleneck. Unfortunately, I couldn't reproduce the problem on my 2-socket (40-CPU) Intel Xeon server. It must have too few cores.

Yes, I did. The score dropped from 1839 to 1290, which is unexpected. Below are the scx_lavd monitor output and perf report.
kernel-6.11.0 + per-NUMA cpumask patch, score 1290:

[screenshots: scx_lavd monitor output and perf report on the patched kernel]

kernel-6.12.0.rc2, score 1839:
[screenshots: scx_lavd monitor output and perf report on kernel 6.12.0.rc2]

@arighi
Contributor

arighi commented Nov 29, 2024

> @mmz-zmm -- Did you happen to run lavd (creating 8 DSQs) with the per-NUMA cpumask patch? That would show us whether there is another bottleneck. Unfortunately, I couldn't reproduce the problem on my 2-socket (40-CPU) Intel Xeon server. It must have too few cores.
>
> Yes, I did. The score dropped from 1839 to 1290, which is unexpected. Below are the scx_lavd monitor output and perf report. kernel-6.11.0 + per-NUMA cpumask patch, score 1290:

Do you get the same regression without changing scx_lavd? IIUC it seemed to improve the performance with scx_bpfland (the unmodified one).

But there might be other factors involved, because these schedulers are doing complex things in ops.select_cpu(). That's why I asked about scx_simple, which basically relies on the built-in idle CPU selection policy, so it's a better test to measure whether the cpumask patch is relevant in this case.
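
For reference, scx_simple's CPU selection is essentially just the built-in helper, along the lines of the sketch below (paraphrased, not the exact scx_simple source):

```c
/*
 * Paraphrased sketch of scx_simple-style CPU selection -- not the exact
 * scx_simple source. It defers entirely to the kernel's built-in idle
 * CPU selection (scx_bpf_select_cpu_dfl), which is the code path the
 * per-NUMA idle cpumask patches touch.
 */
#include <scx/common.bpf.h>

s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	/* Ask the built-in policy for an idle CPU near prev_cpu */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		/* Idle CPU found: dispatch directly to its local DSQ */
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}
```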

@mmz-zmm
Author

mmz-zmm commented Dec 2, 2024

> @mmz-zmm -- Did you happen to run lavd (creating 8 DSQs) with the per-NUMA cpumask patch? That would show us whether there is another bottleneck. Unfortunately, I couldn't reproduce the problem on my 2-socket (40-CPU) Intel Xeon server. It must have too few cores.
>
> Yes, I did. The score dropped from 1839 to 1290, which is unexpected. Below are the scx_lavd monitor output and perf report. kernel-6.11.0 + per-NUMA cpumask patch, score 1290:
>
> Do you get the same regression without changing scx_lavd? IIUC it seemed to improve the performance with scx_bpfland (the unmodified one).

Yes, I didn't change scx_lavd, but its score dropped while scx_bpfland's increased. Since the base kernel version is different, I am going to remove your patches and test again on the same kernel base; it needs some time and I will let you know when the results are available.

> But there might be other factors involved, because these schedulers are doing complex things in ops.select_cpu(). That's why I asked about scx_simple, which basically relies on the built-in idle CPU selection policy, so it's a better test to measure whether the cpumask patch is relevant in this case.

Makes sense, that's absolutely the right way to test your patch, but it seems these three patches didn't improve the Unixbench context-switch score much.

@mmz-zmm
Author

mmz-zmm commented Dec 2, 2024

> @mmz-zmm -- Did you happen to run lavd (creating 8 DSQs) with the per-NUMA cpumask patch? That would show us whether there is another bottleneck. Unfortunately, I couldn't reproduce the problem on my 2-socket (40-CPU) Intel Xeon server. It must have too few cores.
>
> Yes, I did. The score dropped from 1839 to 1290, which is unexpected. Below are the scx_lavd monitor output and perf report. kernel-6.11.0 + per-NUMA cpumask patch, score 1290:
>
> Do you get the same regression without changing scx_lavd? IIUC it seemed to improve the performance with scx_bpfland (the unmodified one).
>
> Yes, I didn't change scx_lavd, but its score dropped while scx_bpfland's increased. Since the base kernel version is different, I am going to remove your patches and test again on the same kernel base; it needs some time and I will let you know when the results are available.

I tested scx_simple, scx_bpfland, and scx_lavd on @arighi's sched_ext repo (https://github.com/arighi/sched_ext/tree/arighi); the base kernel reverts the last two commits:

  sched_ext: Introduce per-NUMA idle cpumasks
  nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()

while the patched kernel keeps these two commits. scx_bpfland and scx_lavd are version v1.0.6 (the dirty flag in the pictures is because of commit #999).

According to the Unixbench multi-core context-switch test, these two patches don't help; below are the statistics:

scx_simple: from 474 to 440
[screenshots: scx_simple base vs patched scores and perf reports]

scx_bpfland: from 426 to 424
[screenshots: scx_bpfland base vs patched scores, perf reports, and monitor output]

scx_lavd: from 1700 to 1253
[screenshots: scx_lavd base vs patched scores, perf reports, and monitor output]

Conclusion

So, there must be something wrong, right? @arighi

@arighi
Contributor

arighi commented Dec 2, 2024

@mmz-zmm thanks for testing and sharing all these results!

But now I'm confused: I thought you noticed a performance gain initially with the 2 extra kernel patches applied, but now there's a performance drop instead. Also, for the kernel patch, let's focus on scx_simple and ignore the others for now (especially bpfland, which may need to be patched as well when the extra kernel patches are applied).

Another thing to keep in mind: this is just one benchmark (and I'm not sure what exactly it's doing), so it'd be interesting to see what happens with other benchmarks too, like a parallel kernel build. I'm not asking you to do this, just mentioning that we should check whether the performance drop/gain is consistent across other workloads; maybe the CPU idle selection policy is simply irrelevant in this case and the main bottleneck is somewhere else (so the different/inconsistent results might be just situational - I'm speaking about the extra kernel patches, not the performance diff between eevdf / sched_ext).

@arighi
Contributor

arighi commented Dec 2, 2024

@mmz-zmm can you share more details about the benchmark you're running? What's the exact command line? Which version of Unixbench? (I'd like to run the test locally to better understand what it's doing).

@mmz-zmm
Author

mmz-zmm commented Dec 2, 2024

> @mmz-zmm can you share more details about the benchmark you're running? What's the exact command line? Which version of Unixbench? (I'd like to run the test locally to better understand what it's doing).

The benchmark is Unixbench 5.1.3 (https://github.com/kdlucas/byte-unixbench/tree/v5.1.3); the command for this specific test is ./Run -c 128 context1. When using perf, I'm running perf record -F 99 -- ./Run -c 128 context1.

@mmz-zmm
Author

mmz-zmm commented Dec 2, 2024

> @mmz-zmm thanks for testing and sharing all these results!
>
> But now I'm confused: I thought you noticed a performance gain initially with the 2 extra kernel patches applied, but now there's a performance drop instead. Also, for the kernel patch, let's focus on scx_simple and ignore the others for now (especially bpfland, which may need to be patched as well when the extra kernel patches are applied).

Got it. I would say this latest test was under more precise control, only changing the kernel (same base) while keeping everything else the same, so it is more convincing. Another thing to note: when running the test I am also using perf record, so the scores may be a little lower than the previous results. Besides, I hit a scx_bpfland crash when running perf record -F 99 -- ./Run -c 128 context1; this time I confirm its version is v1.0.6, with no DIY changes at all (things get really complicated ;) ).

> Another thing to keep in mind: this is just one benchmark (and I'm not sure what exactly it's doing), so it'd be interesting to see what happens with other benchmarks too, like a parallel kernel build. I'm not asking you to do this, just mentioning that we should check whether the performance drop/gain is consistent across other workloads; maybe the CPU idle selection policy is simply irrelevant in this case and the main bottleneck is somewhere else (so the different/inconsistent results might be just situational - I'm speaking about the extra kernel patches, not the performance diff between eevdf / sched_ext).

Yes, I totally agree. The reason I created this issue is that Unixbench is widely used in server kernel testing, and when I found the low score on this specific test case, I thought there might be a problem. If you want me to test kernel building, I will try it. Please be more specific about how to measure the difference: use time to record the build time? Compare an eevdf build vs a scx_simple build, or something else?

As for the CPU idle policy, I built this kernel with make LLVM=1 -j binrpm-pkg and then installed it; its CPU idle policy is None (per cpupower idle-info).

@arighi
Contributor

arighi commented Dec 2, 2024

> As for the CPU idle policy, I built this kernel with make LLVM=1 -j binrpm-pkg and then installed it; its CPU idle policy is None (per cpupower idle-info).

Sorry, I wasn't clear: by "CPU idle selection policy" I mean the logic that the scx scheduler uses to pick an idle CPU when a task needs to run; it has nothing to do with the cpupower idle policy.
