scx: 90% performance degradation for Unixbench Multi-core Pipe-based Context Switch #998
@mmz-zmm -- Thanks for testing scx schedulers! On your 8-socket machine, LAVD should create 8 DSQs, one for each NUMA domain. However, I've never tested how much contention LAVD has on a large NUMA machine. If possible, could you share the
Of course. Running scx_lavd with
Note: when running scx_lavd, it warns that the system does not provide proper CPU frequency information; I'm not sure whether this affects the final score.
Can you also share the output of
lscpu_-e.txt |
Considering that we see idle CPU functions in perf top, I was wondering if the per-NUMA idle cpumask change could help in this case... @mmz-zmm if you have time (and if you can), could you give it a try with this kernel https://github.com/arighi/sched_ext/tree/arighi, in particular the last 2 patches, which split the global idle cpumask into multiple per-node cpumasks:
From what I observed, when running the multi-core context1 test, it forks a child; the two are connected with two pipes and communicate with each other by reading from and writing to the pipes, and one round trip is one context switch. EEVDF prefers to put the context1 pair in different L3 domains, while scx_bpfland prefers to put them in the same L3 domain until all CPUs in the L3 domain are busy.
I'm pretty sure we can also do something at the scheduler level, but the fact that pretty much all the scx schedulers show poor performance with this particular test case / architecture makes me think it's more of a kernel issue...
Ok, I will give it a try. It may need some time; I will post the results here when it's done.
The score improved from 366 to 630 (_find_next_and_bit is lower now), but it is still very low vs eevdf.
Is scx_simple also triggering that stall with the new kernel, or just bpfland? At least the idle cpumask split seems to improve performance in this case. But it could also be due to other kernel changes... it'd be interesting to see what happens when dropping the last 2 patches from my tree, if you still have time to do more tests. BTW, thanks a ton for testing this!
My bad, I was using a modified scx_bpfland with 8 global DSQs, which caused the crash. It has nothing to do with the original v1.0.6 scx_bpfland, so ignore it.
@mmz-zmm -- Did you happen to run lavd (creating 8 DSQs) with the per-NUMA cpumask patch? That would show us whether there is another bottleneck. Unfortunately, I couldn't reproduce the problem on my 2-socket (40-CPU) Intel Xeon server. It must have too few cores.
Yes, I did. The score dropped from 1839 to 1290, which is unexpected. Below are the scx_lavd monitor output and perf report.
Do you get the same regression without changing scx_lavd? IIUC it seemed to improve performance with scx_bpfland (the unmodified one). But there might be other factors involved, because these schedulers are doing complex things in
Yes, I didn't change scx_lavd, but its score dropped while scx_bpfland's increased. Since the base kernel version is different, I am going to remove your patches and test again with the same kernel base. It needs some time; I will let you know when the results are available.
Makes sense; that is absolutely the right way to test your patch, but it seems these three patches didn't improve the Unixbench Context-Switch score much.
I tested scx_simple, scx_bpfland, and scx_lavd on @arighi's sched-ext repo (https://github.com/arighi/sched_ext/tree/arighi). The base kernel reverts the last two commits:

- sched_ext: Introduce per-NUMA idle cpumasks
- nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()

while the patched kernel has these two commits. scx_bpfland and scx_lavd are version v1.0.6 (the dirty flag in the picture is because of commit #999). According to the Unixbench Multi-core context-switch test, these two patches did not behave well; below are the statistics for scx_simple, scx_bpfland, and scx_lavd, and my conclusion. So, there must be something wrong, right? @arighi
@mmz-zmm thanks for testing and sharing all these results! But now I'm confused: I thought you noticed a performance gain initially with the 2 extra kernel patches applied, and now there's a performance drop instead. Also, for the kernel patch, let's focus on scx_simple and ignore the others for now (especially bpfland, which may need to be patched as well when the extra kernel patches are applied). Another thing to keep in mind: this is just one benchmark (and I'm not sure what it's doing exactly). It'd be interesting to see what happens with other benchmarks too, like a parallel kernel build. I'm not asking you to do this, just mentioning that we could then see whether the performance drop / gain is consistent across workloads; maybe the CPU idle selection policy is simply irrelevant in this case and the main bottleneck is somewhere else (so the inconsistent results might be just situational - I'm speaking about the extra kernel patches, not the performance diff between eevdf / sched_ext).
@mmz-zmm can you share more details about the benchmark you're running? What's the exact command line? Which version of Unixbench? (I'd like to run the test locally to better understand what it's doing.)
The benchmark is Unixbench 5.1.3 (https://github.com/kdlucas/byte-unixbench/tree/v5.1.3); the command for this specific test is
Got it. I would say that this latest test was under more precise control, only changing the kernel (same base) while keeping everything else the same, so it is more convincing. Another thing to note: when running the test, I was also using perf record, so the scores may be a little lower than the previous results. Besides, I hit a scx_bpfland crash when running
Yes, I totally agree. The reason I created this issue is that Unixbench is widely used in server kernel testing, and when I found its low score on this specific test case, I thought there might be a problem. If you want me to test kernel building, I will try it. Please be more precise about how to measure the difference: e.g. use time to report the build duration? eevdf build vs scx_simple build, or something else? As for the CPU idle policy, I built this kernel with
Sorry, I wasn't clear: by "CPU idle selection policy" I mean the logic the scx scheduler uses to pick an idle CPU when a task needs to run; it has nothing to do with the cpupower idle policy.
I am using scx_bpfland v1.0.6 to test Unixbench on a 128-core, SMT-enabled (64 physical cores), 8-NUMA-node AMD server. The overall result is great thanks to you guys' work, except that multi-core pipe-based context switching shows a 90%+ performance degradation vs eevdf. The kernel version is 6.12.0-rc2.
Multi-core Pipe-based Context Switching (lps)
eevdf: 32508
scx_simple: 596
scx_central: 171
scx_layered: 638
scx_bpfland: 366
scx_lavd: 1839
They all show a 90%+ degradation on this test case. Since scx_simple and scx_central are simple/toy schedulers, it makes sense that they don't do well, but scx_layered/scx_bpfland/scx_lavd also perform badly here. I mainly use scx_bpfland, so below are some of my attempts with it. At first, I saw that native_queued_spin_lock_slowpath was super high; perf shows that enqueueing tasks to the globally shared DSQ and consuming it causes spinlock contention, so I added 8 DSQs, one for each NUMA node. After this, native_queued_spin_lock_slowpath still exists; the previous call sites show up less, but the context-switch score remains the same. The current perf top shows as below. Besides, even when not running the Unixbench pipe-based context-switch test, with just scx_bpfland enabled, native_queued_spin_lock_slowpath is still very high, related to the CPU idle path. So is there anything we can do to optimize these? From what I saw and tested, it is not limited to scx_bpfland; is this an scx kernel problem? All discussion is welcome.
PS: the following are the single-core scores, where scx_bpfland is pretty good.
Single-core Pipe-based Context Switching (lps)
eevdf: 168
scx_simple: 261
scx_central: 25
scx_layered: 44
scx_bpfland: 197
scx_lavd: 127