
scx: 90% performance degradation for Unixbench Multi-core Pipe-based Context Switch #998

Open
mmz-zmm opened this issue Nov 28, 2024 · 22 comments


@mmz-zmm

mmz-zmm commented Nov 28, 2024

I am using scx_bpfland v1.0.6 to test Unixbench on a 128-core, SMT-enabled (64 physical cores), 8-NUMA-node AMD server. The overall result is great thanks to your work, except that multi-core pipe-based context switch shows a 90%+ performance degradation vs EEVDF. The kernel version is 6.12.0.rc2.

Multi-core Pipe-based Context Switching (lps)
eevdf: 32508
scx_simple: 596
scx_central: 171
scx_layered: 638
scx_bpfland: 366
scx_lavd: 1839

They all show a 90%+ degradation on this test case. Since scx_simple and scx_central are simple/toy schedulers, it makes sense that they don't do well, but scx_layered/scx_bpfland/scx_lavd also perform badly here. I mainly use scx_bpfland; below are some of my attempts with it.

At first, perf showed that native_queued_spin_lock_slowpath was super high: enqueuing tasks to the global shared DSQ and consuming from it causes spinlock contention, so I added 8 DSQs, one per NUMA node. With that change native_queued_spin_lock_slowpath still exists and the previous call sites show up less, but the context-switch score remains the same. The current perf top output is shown below:

[screenshot: perf top output]

Besides, even when not running the Unixbench pipe-based context switch test, with scx_bpfland just enabled, native_queued_spin_lock_slowpath is still very high, related to the CPU idle path.
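
For reference, the per-NUMA-node DSQ layout I experimented with looks roughly like the sketch below (a minimal sketch using the sched_ext kfuncs as named around kernel 6.12, i.e. scx_bpf_create_dsq()/scx_bpf_dispatch()/scx_bpf_consume(); it is not my actual scx_bpfland modification, and numa_enqueue/numa_dispatch and the cpu_to_node map are made-up names):

```c
/*
 * Minimal per-NUMA-node DSQ sketch -- not the actual scx_bpfland change.
 * Uses the sched_ext kfuncs as named around kernel 6.12.
 */
#include <scx/common.bpf.h>

#define MAX_NUMA_NODES	8
#define MAX_CPUS	1024

char _license[] SEC("license") = "GPL";

/* cpu -> NUMA node map, filled in by the userspace loader before attach */
const volatile u32 cpu_to_node[MAX_CPUS];

s32 BPF_STRUCT_OPS_SLEEPABLE(numa_init)
{
	s32 node, err;

	/* One DSQ per NUMA node instead of a single global DSQ */
	for (node = 0; node < MAX_NUMA_NODES; node++) {
		err = scx_bpf_create_dsq(node, node);
		if (err)
			return err;
	}
	return 0;
}

void BPF_STRUCT_OPS(numa_enqueue, struct task_struct *p, u64 enq_flags)
{
	u32 cpu = scx_bpf_task_cpu(p);

	/* Queue on the DSQ of the task's current node to localize contention */
	scx_bpf_dispatch(p, cpu_to_node[cpu & (MAX_CPUS - 1)],
			 SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(numa_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Pull work only from this CPU's node-local DSQ */
	scx_bpf_consume(cpu_to_node[cpu & (MAX_CPUS - 1)]);
}

SEC(".struct_ops.link")
struct sched_ext_ops numa_dsq_ops = {
	.enqueue	= (void *)numa_enqueue,
	.dispatch	= (void *)numa_dispatch,
	.init		= (void *)numa_init,
	.name		= "numa_dsq_sketch",
};
```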

So is there anything we can do to optimize this? From what I saw and tested, it isn't limited to scx_bpfland; is this a sched_ext kernel problem? All discussions are welcome.

PS: the following are the single-core scores, where scx_bpfland does pretty well.

Single-core Pipe-based Context Switching (lps)
eevdf: 168
scx_simple: 261
scx_central: 25
scx_layered: 44
scx_bpfland: 197
scx_lavd: 127

@mmz-zmm mmz-zmm changed the title scx: 90% performance degradation for Unxibench Multi-core Pipe-based Context Switch scx: 90% performance degradation for Unixbench Multi-core Pipe-based Context Switch Nov 28, 2024
@multics69
Contributor

multics69 commented Nov 28, 2024

@mmz-zmm -- Thanks for testing scx schedulers! On your 8-socket machine, LAVD should create 8 DSQs, one for each NUMA domain. However, I've never tested how much contention LAVD has on a large NUMA machine. If possible, could you share the perf results for LAVD?

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

> @mmz-zmm -- Thanks for testing scx schedulers! On your 8-socket machine, LAVD should create 8 DSQs, one for each NUMA domain. However, I've never tested how much contention LAVD has on a large NUMA machine. If possible, could you share the perf results for LAVD?

Of course. Running scx_lavd with perf record -F 99 -- ./Run -q -c 128 context1; after it finished, the perf report shows the following:

[screenshot: perf report for scx_lavd]

Note: when launching scx_lavd, it warns that the system does not provide proper CPU frequency information; not sure if this affects the final score.

@arighi
Contributor

arighi commented Nov 28, 2024

Can you also share the output of lscpu -e?

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

> Can you also share the output of lscpu -e?

lscpu_-e.txt
Hi arighi, I uploaded it as an attachment.

@arighi
Contributor

arighi commented Nov 28, 2024

Considering that we see idle CPU functions in perf top I was wondering if the per-NUMA idle cpumask change could help in this case...

@mmz-zmm if you have time (and if you can), could you give it a try with this kernel https://github.com/arighi/sched_ext/tree/arighi, in particular the last 2 patches, which split the global idle cpumask into multiple per-node cpumasks:

  sched_ext: Introduce per-NUMA idle cpumasks
  nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()
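
Roughly, the idea is the following (a conceptual sketch only, not the code in those patches; pick_idle_cpu_node() and idle_masks[] are made-up names):

```c
/*
 * Conceptual sketch only -- not the code in the patches above.
 * pick_idle_cpu_node() and idle_masks[] are made-up names; the real
 * patches live in kernel/sched/ext.c and use their own iterators.
 */
#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

static cpumask_var_t idle_masks[MAX_NUMNODES];	/* one idle cpumask per NUMA node */

static s32 pick_idle_cpu_node(const struct cpumask *cpus_allowed, s32 prev_cpu)
{
	int start = cpu_to_node(prev_cpu);
	int i;

	/*
	 * Scan nodes starting from the task's previous node and wrap around,
	 * so most lookups touch a node-local mask instead of one global
	 * cpumask that is shared (and bounced) across all 128 CPUs.
	 */
	for (i = 0; i < nr_node_ids; i++) {
		int node = (start + i) % nr_node_ids;
		int cpu = cpumask_any_and(idle_masks[node], cpus_allowed);

		if (cpu < nr_cpu_ids)
			return cpu;
	}
	return -EBUSY;
}
```

The point is that each CPU scans its own node's mask first, so the cacheline bouncing on a single global idle cpumask goes away.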

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

> Can you also share the output of lscpu -e?

From what I observed, when running the multi-core context1 test, each context1 process forks a child; the two are connected by two pipes and communicate by reading from and writing to them, and one round trip is a context switch. EEVDF prefers to put the context1 pairs in different L3 domains, while scx_bpfland prefers to put them in the same L3 domain until all CPUs in that L3 domain are busy.
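
To illustrate, each context1 instance is essentially the ping-pong loop below (a simplified sketch of what UnixBench's context1 does, not the exact benchmark source):

```c
/*
 * Simplified sketch of what UnixBench's context1 workload does -- not the
 * exact benchmark source. Parent and child ping-pong a counter over two
 * pipes; every round trip forces (at least) two context switches.
 */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int p2c[2], c2p[2];	/* parent->child and child->parent pipes */
	unsigned long iter, v;

	if (pipe(p2c) < 0 || pipe(c2p) < 0) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* Child: read the counter, echo it back. */
		while (read(p2c[0], &v, sizeof(v)) == sizeof(v))
			if (write(c2p[1], &v, sizeof(v)) != sizeof(v))
				break;
		_exit(0);
	}

	/*
	 * Parent: write the counter, wait for the echo, repeat. The real
	 * benchmark loops until an alarm fires and reports loops per second
	 * (lps); here we just do a fixed number of round trips.
	 */
	for (iter = 0; iter < 1000000; iter++) {
		if (write(p2c[1], &iter, sizeof(iter)) != sizeof(iter))
			break;
		if (read(c2p[0], &v, sizeof(v)) != sizeof(v) || v != iter)
			break;
	}

	close(p2c[1]);		/* EOF lets the child exit */
	wait(NULL);
	printf("%lu round trips\n", iter);
	return 0;
}
```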

@arighi
Copy link
Contributor

arighi commented Nov 28, 2024

> Can you also share the output of lscpu -e?
>
> From what I observed, when running the multi-core context1 test, each context1 process forks a child; the two are connected by two pipes and communicate by reading from and writing to them, and one round trip is a context switch. EEVDF prefers to put the context1 pairs in different L3 domains, while scx_bpfland prefers to put them in the same L3 domain until all CPUs in that L3 domain are busy.

I'm pretty sure we can also do something at the scheduler level, but the fact that pretty much all the scx schedulers are showing poor performance with this particular test case / architecture makes me think it's more like a kernel issue...

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

> Considering that we see idle CPU functions in perf top I was wondering if the per-NUMA idle cpumask change could help in this case...
>
> @mmz-zmm if you have time (and if you can), could you give it a try with this kernel https://github.com/arighi/sched_ext/tree/arighi, in particular the last 2 patches, which split the global idle cpumask into multiple per-node cpumasks:
>
>   sched_ext: Introduce per-NUMA idle cpumasks
>   nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()

OK, I will give it a try. It may need some time; I will post the results here once it's done.

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

> Considering that we see idle CPU functions in perf top I was wondering if the per-NUMA idle cpumask change could help in this case...
> @mmz-zmm if you have time (and if you can), could you give it a try with this kernel https://github.com/arighi/sched_ext/tree/arighi, in particular the last 2 patches, which split the global idle cpumask into multiple per-node cpumasks:
>
>   sched_ext: Introduce per-NUMA idle cpumasks
>   nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()
>
> OK, I will give it a try. It may need some time; I will post the results here once it's done.

The score improved from 366 to 630 (_find_next_and_bit is lower), but it is still very low vs EEVDF.
[screenshot: perf report with the patched kernel]

@mmz-zmm
Author

mmz-zmm commented Nov 28, 2024

If I use 8 DSQs instead of 1 global DSQ, the contention decreases and the score can reach 1400, the same level as scx_lavd. The perf report is shown below:
[screenshot: perf report with 8 DSQs]

@arighi
Contributor

arighi commented Nov 28, 2024

Is scx_simple also triggering that stall with the new kernel, or just bpfland? At least the idle cpumask split seems to improve performance in this case. But it could also be due to other kernel changes... it'd be interesting to see what happens dropping the last 2 patches from my tree, if you still have time to do more tests. BTW thanks tons for testing this!

@mmz-zmm
Author

mmz-zmm commented Nov 29, 2024

> Is scx_simple also triggering that stall with the new kernel, or just bpfland? At least the idle cpumask split seems to improve performance in this case. But it could also be due to other kernel changes... it'd be interesting to see what happens dropping the last 2 patches from my tree, if you still have time to do more tests. BTW thanks tons for testing this!

My bad, I was using a modified scx_bpfland with 8 global DSQs, which caused the crash. It has nothing to do with the original v1.0.6 scx_bpfland, so ignore it.

@multics69
Contributor

@mmz-zmm -- Did you happen to run lavd (creating 8 DSQs) with the per-NUMA cpumask patch? That would show us whether there is another bottleneck. Unfortunately, I couldn't reproduce the problem on my 2-socket (40-CPU) Intel Xeon server. It must have too few cores.

@mmz-zmm
Author

mmz-zmm commented Nov 29, 2024

> @mmz-zmm -- Did you happen to run lavd (creating 8 DSQs) with the per-NUMA cpumask patch? That would show us whether there is another bottleneck. Unfortunately, I couldn't reproduce the problem on my 2-socket (40-CPU) Intel Xeon server. It must have too few cores.

Yes, I did. The score dropped from 1839 to 1290, which is unexpected. Below are the scx_lavd monitor output and perf report.
kernel-6.11.0 + per-NUMA cpumask patch, score 1290:

[screenshots: scx_lavd monitor output and perf report on the patched kernel]

kernel-6.12.0.rc2, score 1839:
[screenshots: scx_lavd monitor output and perf report on kernel 6.12.0.rc2]

@arighi
Contributor

arighi commented Nov 29, 2024

> @mmz-zmm -- Did you happen to run lavd (creating 8 DSQs) with the per-NUMA cpumask patch? That would show us whether there is another bottleneck. Unfortunately, I couldn't reproduce the problem on my 2-socket (40-CPU) Intel Xeon server. It must have too few cores.
>
> Yes, I did. The score dropped from 1839 to 1290, which is unexpected. Below are the scx_lavd monitor output and perf report. kernel-6.11.0 + per-NUMA cpumask patch, score 1290:

Do you get the same regression without changing scx_lavd? IIUC it seemed to improve the performance with scx_bpfland (the unmodified one).

But there might be other factors involved, because these schedulers are doing complex things in ops.select_cpu(). That's why I asked about scx_simple, which basically relies on the built-in idle CPU selection policy, so it's a better test to measure whether the cpumask patch is relevant in this case.
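
For reference, scx_simple's CPU selection is essentially just the built-in helper, along the lines of the sketch below (paraphrased, not the exact scx_simple source):

```c
/*
 * Paraphrased sketch of scx_simple-style CPU selection -- not the exact
 * scx_simple source. It defers entirely to the kernel's built-in idle
 * CPU selection (scx_bpf_select_cpu_dfl), which is the code path the
 * per-NUMA idle cpumask patches touch.
 */
#include <scx/common.bpf.h>

s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	/* Ask the built-in policy for an idle CPU near prev_cpu */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		/* Idle CPU found: dispatch directly to its local DSQ */
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}
```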

@mmz-zmm
Author

mmz-zmm commented Dec 2, 2024

> @mmz-zmm -- Did you happen to run lavd (creating 8 DSQs) with the per-NUMA cpumask patch? That would show us whether there is another bottleneck. Unfortunately, I couldn't reproduce the problem on my 2-socket (40-CPU) Intel Xeon server. It must have too few cores.
>
> Yes, I did. The score dropped from 1839 to 1290, which is unexpected. Below are the scx_lavd monitor output and perf report. kernel-6.11.0 + per-NUMA cpumask patch, score 1290:
>
> Do you get the same regression without changing scx_lavd? IIUC it seemed to improve the performance with scx_bpfland (the unmodified one).

Yes, I didn't change scx_lavd, but its score dropped while scx_bpfland's increased. Since the base kernel version is different, I am going to remove your patches and test again on the same kernel base; it needs some time and I will let you know when the results are available.

> But there might be other factors involved, because these schedulers are doing complex things in ops.select_cpu(). That's why I asked about scx_simple, which basically relies on the built-in idle CPU selection policy, so it's a better test to measure whether the cpumask patch is relevant in this case.

Makes sense, that's absolutely the right way to test your patch, but it seems these three patches didn't improve the Unixbench context-switch score much.

@mmz-zmm
Author

mmz-zmm commented Dec 2, 2024

> @mmz-zmm -- Did you happen to run lavd (creating 8 DSQs) with the per-NUMA cpumask patch? That would show us whether there is another bottleneck. Unfortunately, I couldn't reproduce the problem on my 2-socket (40-CPU) Intel Xeon server. It must have too few cores.
>
> Yes, I did. The score dropped from 1839 to 1290, which is unexpected. Below are the scx_lavd monitor output and perf report. kernel-6.11.0 + per-NUMA cpumask patch, score 1290:
>
> Do you get the same regression without changing scx_lavd? IIUC it seemed to improve the performance with scx_bpfland (the unmodified one).
>
> Yes, I didn't change scx_lavd, but its score dropped while scx_bpfland's increased. Since the base kernel version is different, I am going to remove your patches and test again on the same kernel base; it needs some time and I will let you know when the results are available.

I tested scx_simple, scx_bpfland, and scx_lavd on @arighi's sched_ext repo (https://github.com/arighi/sched_ext/tree/arighi); the base kernel reverts the last two commits:

  sched_ext: Introduce per-NUMA idle cpumasks
  nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()

while the patched kernel keeps these two commits. scx_bpfland and scx_lavd are version v1.0.6 (the dirty flag in the pictures is because of commit #999).

According to the Unixbench multi-core context-switch test, these two patches don't help; below are the statistics:

scx_simple: from 474 to 440
[screenshots: scx_simple base vs patched scores and perf reports]

scx_bpfland: from 426 to 424
[screenshots: scx_bpfland base vs patched scores, perf reports, and monitor output]

scx_lavd: from 1700 to 1253
[screenshots: scx_lavd base vs patched scores, perf reports, and monitor output]

Conclusion

So, there must be something wrong, right? @arighi

@arighi
Contributor

arighi commented Dec 2, 2024

@mmz-zmm thanks for testing and sharing all these results!

But now I'm confused: I thought you noticed a performance gain initially with the 2 extra kernel patches applied, but now there's a performance drop instead. Also, for the kernel patch, let's focus on scx_simple and ignore the others for now (especially bpfland, which may need to be patched as well when the extra kernel patches are applied).

Another thing to keep in mind: this is just one benchmark (and I'm not sure what exactly it's doing), so it'd be interesting to see what happens with other benchmarks too, like a parallel kernel build. I'm not asking you to do this, just mentioning that we should check whether the performance drop/gain is consistent across other workloads; maybe the CPU idle selection policy is simply irrelevant in this case and the main bottleneck is somewhere else (so the different/inconsistent results might be just situational - I'm speaking about the extra kernel patches, not the performance diff between eevdf / sched_ext).

@arighi
Contributor

arighi commented Dec 2, 2024

@mmz-zmm can you share more details about the benchmark you're running? What's the exact command line? Which version of Unixbench? (I'd like to run the test locally to better understand what it's doing).

@mmz-zmm
Author

mmz-zmm commented Dec 2, 2024

> @mmz-zmm can you share more details about the benchmark you're running? What's the exact command line? Which version of Unixbench? (I'd like to run the test locally to better understand what it's doing).

The benchmark is Unixbench 5.1.3 (https://github.com/kdlucas/byte-unixbench/tree/v5.1.3); the command for this specific test is ./Run -c 128 context1. When using perf, I'm running perf record -F 99 -- ./Run -c 128 context1.

@mmz-zmm
Author

mmz-zmm commented Dec 2, 2024

> @mmz-zmm thanks for testing and sharing all these results!
>
> But now I'm confused: I thought you noticed a performance gain initially with the 2 extra kernel patches applied, but now there's a performance drop instead. Also, for the kernel patch, let's focus on scx_simple and ignore the others for now (especially bpfland, which may need to be patched as well when the extra kernel patches are applied).

Got it. I would say this latest test was under more precise control, only changing the kernel (same base) while keeping everything else the same, so it is more convincing. Another thing to note: when running the test I am also using perf record, so the scores may be a little lower than the previous results. Besides, I hit a scx_bpfland crash when running perf record -F 99 -- ./Run -c 128 context1; this time I confirm its version is v1.0.6, with no DIY changes at all (things get really complicated ;) ).

> Another thing to keep in mind: this is just one benchmark (and I'm not sure what exactly it's doing), so it'd be interesting to see what happens with other benchmarks too, like a parallel kernel build. I'm not asking you to do this, just mentioning that we should check whether the performance drop/gain is consistent across other workloads; maybe the CPU idle selection policy is simply irrelevant in this case and the main bottleneck is somewhere else (so the different/inconsistent results might be just situational - I'm speaking about the extra kernel patches, not the performance diff between eevdf / sched_ext).

Yes, I totally agree. The reason I created this issue is that Unixbench is widely used in server kernel testing, and when I found the low score on this specific test case, I thought there might be a problem. If you want me to test kernel building, I will try it. Please be more specific about how to measure the difference: use time to record the build time? Compare an eevdf build vs a scx_simple build, or something else?

As for the CPU idle policy, I built this kernel with make LLVM=1 -j binrpm-pkg and then installed it; its CPU idle policy is None (per cpupower idle-info).

@arighi
Contributor

arighi commented Dec 2, 2024

> As for the CPU idle policy, I built this kernel with make LLVM=1 -j binrpm-pkg and then installed it; its CPU idle policy is None (per cpupower idle-info).

Sorry, I wasn't clear: by "CPU idle selection policy" I mean the logic that the scx scheduler uses to pick an idle CPU when a task needs to run; it has nothing to do with the cpupower idle policy.
