-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reduce the lock contention in task spawn. #6001
reduce the lock contention in task spawn. #6001
Conversation
The origin version: spawn_tasks_current_thread time: [654.87 ns 664.26 ns 672.13 ns] change: [-4.5814% -1.2511% +2.1604%] (p = 0.48 > 0.05) No change in performance detected. spawn_tasks_current_thread_parallel time: [537.55 ns 540.00 ns 542.20 ns] change: [-1.0603% -0.2591% +0.6283%] (p = 0.58 > 0.05) No change in performance detected. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe spawn_tasks time: [1.0447 µs 1.0533 µs 1.0610 µs] change: [+0.9379% +2.2768% +3.7191%] (p = 0.00 < 0.05) Change within noise threshold. Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) low severe 2 (2.00%) low mild spawn_tasks_parallel time: [992.38 ns 1.0002 µs 1.0073 µs] change: [+0.1973% +1.4194% +2.5819%] (p = 0.02 < 0.05) Change within noise threshold. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) low severe 2 (2.00%) low mild 1 (1.00%) high mild 2 (2.00%) high severe This version: spawn_tasks_current_thread time: [705.92 ns 724.31 ns 739.78 ns] spawn_tasks_current_thread_parallel time: [529.33 ns 531.00 ns 532.61 ns] Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe spawn_tasks time: [881.56 ns 892.21 ns 902.10 ns] Found 3 outliers among 100 measurements (3.00%) 2 (2.00%) low mild 1 (1.00%) high mild spawn_tasks_parallel time: [815.00 ns 819.87 ns 824.60 ns] Found 4 outliers among 100 measurements (4.00%) 2 (2.00%) high mild 2 (2.00%) high severe
f8a5f10
to
8e2c0b2
Compare
12c8a11
to
445d79f
Compare
445d79f
to
a500a79
Compare
Thanks for doing this. I tried something similar but did not measure an improvement. What CPU are you running this on? |
The above benchmark test runs on the Linux, the CPU is I also run the same benchmark test on other cpu architecture like ARM (mac m1), I found the silimar improvement too. The lock contention problem would be heavier, when we use more threads to spwan tasks concurrently. Do you make sure the The reason why I propose to use criterion crate in #5979 is that we could do conrrent jobs more conveniently. With async fn job(iters: usize, procs: usize) {
for _ in 0..procs {
let mut threads_handles = Vec::with_capacity(procs);
threads_handles.push(tokio::spawn(async move {
let mut thread_handles = Vec::with_capacity(iters / procs);
for _ in 0..iters / procs {
thread_handles.push(tokio::spawn(async {
let val = 1 + 1;
tokio::task::yield_now().await;
black_box(val)
}));
}
for handle in thread_handles {
handle.await.unwrap();
}
}));
for handle in threads_handles {
handle.await.unwrap();
}
}
} We can get The falmegraph told the same improvement. This version: The |
2808b31
to
632a8d3
Compare
aea5da1
to
6453017
Compare
Sorry for the delay in review. I've been visiting conferences last week and next week, so I'm not as actively watching the repository right now as I usually am.
Overall, I'm pretty happy with the approach. As you mention, whether we should have a config option is my only real question. I would lean towards no because it's simpler, but if it's unstable, I do not mind having it. Other than that, I think there were only various minor things to fix last I looked. |
I have considered for some time and believe that your idea is better choice. We will not allow users to configure the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the changes below, I'll be happy to merge this.
tokio/src/runtime/builder.rs
Outdated
fn get_spawn_concurrency_level(core_threads : usize) -> usize { | ||
const MAX_SPAWN_CONCURRENCY_LEVEL: usize = 1 << 16; | ||
let mut size = 1; | ||
while size / 4 < core_threads && size < MAX_SPAWN_CONCURRENCY_LEVEL { | ||
size <<= 1; | ||
} | ||
size.min(MAX_SPAWN_CONCURRENCY_LEVEL) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are removing it as an option from the builder, then all changes related to spawn_concurrency_level
in this file and other files should be removed. Instead, I would just move this into runtime/task/list.rs
as a const.
tokio/src/util/linked_list.rs
Outdated
pub(crate) fn for_each<F>(&mut self, mut f: F) | ||
pub(crate) fn for_each<F>(&mut self, f: &mut F) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please undo this change and continue to use mut f: F
. The code in sharded_list.rs
will still work because mutable references are also FnMut
.
27c0a1e
to
db58076
Compare
/// Due to the above reasons, we set a maximum value for the shared list size, | ||
/// denoted as `MAX_SHARED_LIST_SIZE`. | ||
fn gen_shared_list_size(num_cores: usize) -> usize { | ||
const MAX_SHARED_LIST_SIZE: usize = 1 << 16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This maximum seems really large to me. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just used the settings of the previous spawn_concurrency_level
function. The previous function could support user configuration, so this value was set to a large value.
If you want a smaller value, I'm ok with that.
As far as I know, some current server CPUs can reach close to 200 CPU cores (or hyperthreadings), such as the AMD EPYC 9654 which has 192 threads. Maybe we can set the value to 256(4 times this value is 1<<10
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to retract my previous point. I believe this should be a memory-bound scenario, so we only need to ensure a certain number of mutexes. In fact, the number of mutexes should not depend on the number of worker threads, but actually depends on the possible level of concurrency.
Below, I describe the concurrency level that may actually occur, rather than what occurs in the benchmark.
In real applications that use Tokio, multiple threads may perform tokio::spawn
concurrently. However, having too many threads concurrently performing tokio::spawn
at the same time can be considered a poor design choice in the higher-level application architecture. It is more likely that multiple threads will perform removing tasks concurrently at the same time. Nonetheless, removing a task from the list only takes up a minimal amount of runtime during the entire task life cycle, so the concurrency level of removing tasks is usually not a significant concern.
Therefore, I agree with your idea, and it is reasonable to set the upper limit to a small value. Setting this to 64 or 128 might make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright. What do you think about multiplying by 4?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The total number of random memory operations is consistent, providing multiple locks. Each thread obtains one of the locks through round robin, and can only perform random memory access after obtaining the lock. I got the following test results:
mutexes/threads | 1 | 2 | 4 | 8 | 12 | 16 | 24 | 32 | 48 | 64 | 128 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.980172444 | 2.899437975 | 1.447906311 | 0.828731566 | 0.689618066 | 0.612355429 | 0.589401394 | 0.587380871 | 0.525477567 | 0.578456362 | 0.552132325 |
1 | 7.970250774 | 17.29034894 | 22.60164692 | 25.97284605 | 28.12352579 | 33.31359697 | 31.18786342 | 31.61139126 | 29.23225856 | 30.94094675 | 31.59191497 |
2 | 7.883931727 | 15.97845738 | 16.11107368 | 18.73377898 | 20.34614133 | 23.02624802 | 22.69439808 | 23.15802647 | 21.80570219 | 22.48815498 | 22.98585238 |
4 | 7.975676415 | 10.25364766 | 11.88074538 | 15.40198137 | 15.51024255 | 16.35328034 | 15.46874828 | 15.7982897 | 15.48703267 | 15.67227903 | 15.35829948 |
8 | 8.058803258 | 8.138193999 | 7.619081588 | 7.936418179 | 7.654288652 | 7.901945312 | 7.642439744 | 7.861542054 | 7.730389506 | 7.821229611 | 7.748344488 |
16 | 9.797308994 | 6.213334839 | 4.455407945 | 4.496371955 | 4.291254249 | 4.130849346 | 4.347601475 | 4.294096757 | 3.990391527 | 4.028562691 | 4.059085994 |
32 | 8.742854719 | 4.847656612 | 3.301780829 | 2.578327826 | 2.480488617 | 2.331294827 | 2.388718271 | 2.306257478 | 2.421350161 | 2.278177495 | 2.26569423 |
64 | 8.042672888 | 4.963568223 | 3.012473492 | 2.08243512 | 1.828237002 | 1.653421053 | 1.550811454 | 1.536452054 | 1.519761769 | 1.618966043 | 1.48010674 |
128 | 8.62801309 | 4.978525185 | 2.637936755 | 1.777546296 | 1.549096849 | 1.359814529 | 1.43875245 | 1.385468038 | 1.238832309 | 1.249940559 | 1.248131329 |
256 | 8.584906215 | 4.591742459 | 2.441556366 | 1.504790937 | 1.335449235 | 1.169191715 | 1.115906268 | 1.230570609 | 1.075581823 | 1.048285585 | 1.02977064 |
512 | 8.171549127 | 4.182283461 | 2.37535305 | 1.54202412 | 1.1690348 | 1.054650104 | 1.015366906 | 1.153238581 | 0.993319168 | 0.998864737 | 0.981392837 |
1024 | 8.533398132 | 4.175120792 | 2.209645233 | 1.412410651 | 1.055442085 | 0.938202817 | 1.122801927 | 0.940661156 | 0.888767412 | 0.914867532 | 0.92237305 |
This is mainly because when the number of worker threads is relatively small, setting the number of locks to 4 times has a significant performance improvement compared to 2 times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For complete testing, please refer to https://gist.github.com/wathenjiang/30b689a7ef20b4ea667a2e8f358c321d
I think this performance test can provide a reference for how many mutexes are needed for OwnedTasks.
Co-authored-by: Alice Ryhl <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
Bumps tokio from 1.34.0 to 1.35.0. Release notes Sourced from tokio's releases. Tokio v1.35.0 1.35.0 (December 8th, 2023) Added net: add Apple watchOS support (#6176) Changed io: drop the Sized requirements from AsyncReadExt.read_buf (#6169) runtime: make Runtime unwind safe (#6189) runtime: reduce the lock contention in task spawn (#6001) tokio: update nix dependency to 0.27.1 (#6190) Fixed chore: make --cfg docsrs work without net feature (#6166) chore: use relaxed load for unsync_load on miri (#6179) runtime: handle missing context on wake (#6148) taskdump: fix taskdump cargo config example (#6150) taskdump: skip notified tasks during taskdumps (#6194) tracing: avoid creating resource spans with current parent, use a None parent instead (#6107) tracing: make task span explicit root (#6158) Documented io: flush in AsyncWriteExt examples (#6149) runtime: document fairness guarantees and current behavior (#6145) task: document cancel safety of LocalSet::run_until (#6147) #6001: tokio-rs/tokio#6001 #6107: tokio-rs/tokio#6107 #6144: tokio-rs/tokio#6144 #6145: tokio-rs/tokio#6145 #6147: tokio-rs/tokio#6147 #6148: tokio-rs/tokio#6148 #6149: tokio-rs/tokio#6149 #6150: tokio-rs/tokio#6150 #6158: tokio-rs/tokio#6158 #6166: tokio-rs/tokio#6166 #6169: tokio-rs/tokio#6169 #6176: tokio-rs/tokio#6176 #6179: tokio-rs/tokio#6179 #6189: tokio-rs/tokio#6189 #6190: tokio-rs/tokio#6190 #6194: tokio-rs/tokio#6194 Commits 92a3455 chore: prepare Tokio v1.35.0 (#6197) 1968565 chore: use relaxed load for unsync_load (#6203) c9273f1 sync: improve safety comments for WakeList (#6200) e05d0f8 changelog: fix missing link for 1.8.2 (#6199) debcb22 Revert "net: add SocketAddr::as_abstract_namespace (#6144)" (#6198) 83b7397 io: drop the Sized requirements from AsyncReadExt.read_buf (#6169) 3991f9f docs: fix typo in 'tokio/src/sync/broadcast.rs' (#6182) 48c0e62 chore: use relaxed load for unsync_load on miri (#6179) d561b58 taskdump: skip notified tasks during taskdumps (#6194) 3a4aef1 runtime: reduce the lock contention in task spawn (#6001) Additional commits viewable in compare view Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase. Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: @dependabot rebase will rebase this PR @dependabot recreate will recreate this PR, overwriting any edits that have been made to it @dependabot merge will merge this PR after your CI passes on it @dependabot squash and merge will squash and merge this PR after your CI passes on it @dependabot cancel merge will cancel a previously requested merge and block automerging @dependabot reopen will reopen this PR if it is closed @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
We have to use a OwnedTasks to track the pointers of all Tasks, this is a multi-threaded scenario. When the Multi-threaded concurrent
tokio::spawn
is called, the lock contention would be very busy.I propose a solution without losing any single-threaded spawn task performance, to improve the multi-threaded spawn task performance aboout
15% ~ 20%
.This solution is that: let tasks with different task ID to be put into the different LinkedList. The different LinkedList use own Mutex to make sure multi-threaded concurrency safety instead of a runtime unique Mutex.
The origin version:
This version: