
Waker optimization + O(woken) polling for every combinator except chain #115

Closed
wants to merge 13 commits

Conversation

wishawa
Contributor

@wishawa wishawa commented Dec 27, 2022

Performance improvements

WakerArray/WakerVec now keep track of which futures were woken in a list, so no O(n) search is needed when polling. This makes polling O(woken) instead of O(n). The impact is very significant for large combinators.

Also, the new implementation avoids locking and unlocking the Readiness Mutex in a loop: it copies out the data needed for iteration once at the beginning of each poll.
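A minimal sketch of the woken-list bookkeeping (hypothetical names like SharedAwakeness and drain_woken; the real code differs, e.g. it integrates with the wakers):

```rust
use std::sync::Mutex;

// Illustrative sketch, not the crate's actual code.
struct AwakenessVec {
    // flags[i] is true if subfuture i is already in `woken`,
    // preventing duplicate entries.
    flags: Vec<bool>,
    // Indices of subfutures woken since the last poll, in wake order.
    woken: Vec<usize>,
}

struct SharedAwakeness {
    inner: Mutex<AwakenessVec>,
}

impl SharedAwakeness {
    fn new(len: usize) -> Self {
        Self {
            inner: Mutex::new(AwakenessVec {
                flags: vec![true; len],    // everything starts "woken"
                woken: (0..len).collect(), // so the first poll visits all
            }),
        }
    }

    // Called by the waker for subfuture `index`.
    fn set_woken(&self, index: usize) {
        let mut inner = self.inner.lock().unwrap();
        if !inner.flags[index] {
            inner.flags[index] = true;
            inner.woken.push(index);
        }
    }

    // Called once at the start of each poll: takes the whole list so the
    // combinator can iterate without re-locking the mutex per subfuture.
    fn drain_woken(&self) -> Vec<usize> {
        let mut inner = self.inner.lock().unwrap();
        for i in 0..inner.flags.len() {
            inner.flags[i] = false;
        }
        std::mem::take(&mut inner.woken)
    }
}
```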

I've also made WakerArray/WakerVec use a single Arc shared between all wakers (without giving up perfect waking) instead of needing a separate Arc for each. This change involves some unsafe work with RawWaker.

API changes

Race, TryJoin, and RaceOk for tuples now support futures with heterogeneous results. Racing futures with different output types returns an enum whose variants are the possible outputs. If all outputs have the same type, there is a function to convert the enum into that type.
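For example, racing two futures with outputs A and B could yield something like the following (a hypothetical sketch; the names RaceOutput2 and into_inner are made up, not the PR's actual API):

```rust
// Sketch of a heterogeneous race output for a 2-tuple. Hypothetical names.
#[derive(Debug, PartialEq)]
enum RaceOutput2<A, B> {
    // The first future finished first.
    Fut0(A),
    // The second future finished first.
    Fut1(B),
}

impl<T> RaceOutput2<T, T> {
    // When both outputs share one type, collapse the enum into that type.
    fn into_inner(self) -> T {
        match self {
            RaceOutput2::Fut0(t) | RaceOutput2::Fut1(t) => t,
        }
    }
}
```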

RaceOk error types are simplified to just an array/tuple/vec of the errors. I removed the wrapping AggregateError because it can be confusing for tuples (since the error types are no longer necessarily homogeneous).

Organizational changes

As part of rewriting the various combinators, I've merged the code for join/try_join/race/race_ok. There is now a crate-private futures::common module with a generic combinator whose behaviors can be controlled to match join/try_join/race/race_ok by a generic type parameter. For tuples, I basically implement try join and make every other combinator delegate to that.
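As a rough illustration of the behavior-parameterized idea, here is a simplified, synchronous sketch with made-up names (the crate's actual trait operates on polled futures, not an iterator of finished values):

```rust
use std::ops::ControlFlow;

// One generic "drive" loop whose completion semantics are selected by a
// type parameter, mirroring how join/race can share one implementation.
trait Behavior<T> {
    type Output;
    // Called each time a subfuture completes.
    fn on_ready(acc: &mut Vec<T>, item: T) -> ControlFlow<Self::Output>;
    // Called when every subfuture has completed.
    fn on_all_done(acc: Vec<T>) -> Self::Output;
}

struct JoinBehavior;
impl<T> Behavior<T> for JoinBehavior {
    type Output = Vec<T>;
    fn on_ready(acc: &mut Vec<T>, item: T) -> ControlFlow<Vec<T>> {
        acc.push(item);
        ControlFlow::Continue(()) // join waits for everything
    }
    fn on_all_done(acc: Vec<T>) -> Vec<T> {
        acc
    }
}

struct RaceBehavior;
impl<T> Behavior<T> for RaceBehavior {
    type Output = T;
    fn on_ready(_acc: &mut Vec<T>, item: T) -> ControlFlow<T> {
        ControlFlow::Break(item) // race finishes at the first completion
    }
    fn on_all_done(mut acc: Vec<T>) -> T {
        // Unreachable for a nonempty race; a real race over zero futures
        // would pend forever instead.
        acc.pop().unwrap()
    }
}

// Stand-in for the shared poll loop: feed completed items through B.
fn drive<T, B, I>(items: I) -> B::Output
where
    B: Behavior<T>,
    I: IntoIterator<Item = T>,
{
    let mut acc = Vec::new();
    for item in items {
        if let ControlFlow::Break(out) = B::on_ready(&mut acc, item) {
            return out;
        }
    }
    B::on_all_done(acc)
}
```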

I've also upped the edition to Rust 2021. This is not really necessary but the disjoint closure capture saves a few lines of code.

I renamed "Readiness" to "Awakeness" because the former gets confusing since Poll/PollState::Ready means the future is complete rather than awake.

Benchmark

Currently, CountdownFuture/Stream wake and complete in perfect order, which is unrealistic. I added shuffling (with a fixed seed) so that they better represent real workloads.

Below:
after = this PR
before = origin/main with countdown shuffling commit cherry-picked

$ critcmp  after before
group                after                                  before
-----                -----                                  ------
array::join 10       1.00  1854.5±12.26ns        ? ?/sec    1.20      2.2±0.02µs        ? ?/sec
array::join 100      1.00     20.6±0.16µs        ? ?/sec    1.17     24.1±0.28µs        ? ?/sec
array::join 1000     1.00    219.4±3.13µs        ? ?/sec    5.67  1242.9±19.53µs        ? ?/sec
array::merge 10      1.00  1828.5±29.31ns        ? ?/sec    1.32      2.4±0.05µs        ? ?/sec
array::merge 100     1.00     19.9±0.26µs        ? ?/sec    2.15     42.8±3.00µs        ? ?/sec
array::merge 1000    1.00    225.9±3.09µs        ? ?/sec    8.98      2.0±0.04ms        ? ?/sec
array::race 10       1.07  1138.1±18.48ns        ? ?/sec    1.00  1061.1±43.17ns        ? ?/sec
array::race 100      1.00      7.9±0.09µs        ? ?/sec    1.41     11.1±0.17µs        ? ?/sec
array::race 1000     1.00     82.5±1.13µs        ? ?/sec    1.58    130.6±2.91µs        ? ?/sec
tuple::join 10       1.00  1912.5±29.00ns        ? ?/sec    1.14      2.2±0.03µs        ? ?/sec
tuple::merge 10      1.00      2.2±0.03µs        ? ?/sec    1.23      2.7±0.06µs        ? ?/sec
tuple::race 10       1.15  1134.5±20.99ns        ? ?/sec    1.00   987.5±13.09ns        ? ?/sec
vec::join 10         1.00      2.3±0.06µs        ? ?/sec    1.08      2.5±0.05µs        ? ?/sec
vec::join 100        1.00     18.0±0.19µs        ? ?/sec    2.05     36.9±0.47µs        ? ?/sec
vec::join 1000       1.00    202.1±1.68µs        ? ?/sec    9.45  1909.9±161.66µs        ? ?/sec
vec::merge 10        1.00      2.4±0.04µs        ? ?/sec    1.18      2.8±0.05µs        ? ?/sec
vec::merge 100       1.00     20.8±0.25µs        ? ?/sec    2.57     53.3±1.27µs        ? ?/sec
vec::merge 1000      1.00    222.8±3.95µs        ? ?/sec    12.86     2.9±0.05ms        ? ?/sec
vec::race 10         1.33  1349.1±15.53ns        ? ?/sec    1.00  1011.6±13.01ns        ? ?/sec
vec::race 100        1.00      7.3±0.10µs        ? ?/sec    1.45     10.6±0.14µs        ? ?/sec
vec::race 1000       1.00     72.0±1.64µs        ? ?/sec    1.70    122.7±1.75µs        ? ?/sec

@wishawa wishawa changed the title waker array/vec: use a single big Arc shared between all wakers Waker optimization + O(woken) polling for every combinator except chain Dec 30, 2022
@matheus-consoli
Collaborator

wow, thank you for this!

I'll try to make time to review your changes soon -- but I loved the Behavior traits approach, and the bench results are quite impressive!

Collaborator

@matheus-consoli matheus-consoli left a comment


Thank you again for this wonderful work!

I think this overall improves the codebase and seems to be correct.
But since this brings a lot of breaking changes, I'd prefer a second opinion from @yoshuawuyts.

We'll need a new major release after merging, and we should probably work on some details before the release: improving error handling (happy_eyeballs exemplifies that returning Vec<Error> is not very Try-friendly), iterating in a fairer fashion (in the old Indexer sense; as I understand it, this PR removes it), and seeing whether we would benefit from bitvec in some places. We can iterate on this in small steps after the merge.

@matheus-consoli
Collaborator

We may also favor this over #113 and #112; I can close them after merging.

@wishawa
Contributor Author

wishawa commented Jan 4, 2023

Thanks for the review!

It is true that removing AggregateError makes error handling more verbose. Might not have been a good idea on my part.
What if we bring it back but replace array::AggregateError<E, N>/vec::AggregateError<E>/tuple::AggregateErrorN<E1, E2, E3, ...> with a single AggregateError<_> type that can be parameterized as AggregateError<Vec<E>>/AggregateError<[E; N]>/AggregateError<(E1, E2, E3, ...)>? This would avoid having the macro generate a new error type for every tuple length.
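A sketch of what that single parameterized type could look like (illustrative only; the exact bounds and trait impls would need more thought):

```rust
use std::fmt;

// One error type reused for every shape of error collection:
// AggregateError<[E; N]>, AggregateError<Vec<E>>, AggregateError<(E1, E2)>, ...
#[derive(Debug)]
struct AggregateError<T> {
    errors: T,
}

impl<T: fmt::Debug> fmt::Display for AggregateError<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Debug-formatting the inner collection keeps this impl generic;
        // a real version would likely want nicer per-error formatting.
        write!(f, "all futures failed: {:?}", self.errors)
    }
}

impl<T: fmt::Debug> std::error::Error for AggregateError<T> {}
```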

For fairness, things get complicated.

Consider these implementations:

  • Design 0: the current implementation but with Indexer removed.
  • Design 1: the current implementation (with Indexer).
  • Design 2: this PR as is; poll subfutures from first to last in the initial round, and poll them in the order they wake for subsequent rounds.
  • Design 3: this PR with something like Indexer applied; in each round, poll the subfuture that woke second first, third second, on until circling around to poll the one that woke first last.
  • Design 4: this PR with an RNG to shuffle the initial poll order.

Situation A

async fn work() {
    for _ in 0..N {
        shared_mutex.lock().await;
    }
}
(work(), work()).race().await;

Here design 4 is the only fair one. With 0/1/2/3, the left subfuture wins.

Situation B

async fn work() {
    for _ in 0..N {
        shared_mutex.lock().await;
        yield_now().await;
    }
}
(work(), work()).race().await;

Here design 4 is fair as before. Designs 0 and 2 prefer the left subfuture. For 1 and 3, the winner depends on whether the constant N is odd or even.

Situation C

async fn work(time: f32) {
    for _ in 0..N {
        sleep(time).await;
    }
}
(work(1.00), work(1.01)).race().await;

Here, the left subfuture wins in all implementations.

Situation D
Same code as situation C, but make the executor busy enough that it takes longer than 0.01 sec between Race waking up and getting polled.

Here 0, 2, and 4 prefer the left subfuture. 3 prefers the right subfuture (!). 1 depends on oddness/evenness of N.


So the RNG solution (design 4) is the clear winner. But it comes with the RNG's complexity and nondeterminism.

Apart from 4,

  • Design 2 gives unfair head starts to subfutures earlier in the list (but it is still fairer than design 0).
  • Design 3 can end up punishing fast subfutures, as shown in situation D.
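For reference, design 4's shuffled initial poll order could be as cheap as a seeded xorshift driving a Fisher-Yates pass (illustrative sketch; a real implementation would seed from entropy rather than a fixed value):

```rust
// Produce a random-looking permutation of 0..len to use as the initial
// poll order. Hypothetical helper, not code from this PR.
fn shuffle_poll_order(len: usize, mut seed: u64) -> Vec<usize> {
    let mut order: Vec<usize> = (0..len).collect();
    // Fisher-Yates shuffle driven by xorshift64 steps.
    for i in (1..len).rev() {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        let j = (seed % (i as u64 + 1)) as usize;
        order.swap(i, j);
    }
    order
}
```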

But this PR is already too big as it is! So as you said, let's iterate on fairness later.

@yoshuawuyts
Owner

yoshuawuyts commented Jan 9, 2023

We also may favor this over #113 and #112, I can close them after merging

Oops, I missed this! - I was going through the PRs in chronological order, and didn't realize these might in fact conflict. Reviewing this PR now!

@yoshuawuyts
Owner

@wishawa thanks for this PR! - I think there are some interesting things here that I would like to tease apart, possibly into different PRs. I'm trying to summarize the work you've done, so far I'm seeing the following changes:

  1. Changed the bench suite to use a BinaryHeap internally
  2. Changed the APIs of Race, TryJoin, and RaceOk
  3. Replaced the AggregateError structures used in RaceOk entirely
  4. Sped up the internal implementation of WakerArray and WakerVec by reusing one Arc between all wakers
  5. Renamed internal data structures (readiness -> awakeness)
  6. Updated the edition to Rust 2021
  7. Created an internal Behaviour trait implementation which all other traits are based on

This PR by itself is probably too big to review in-depth on its own. However, some of these might be interesting to start off with individually. For example: the edition bump, and changes to the bench suite seem like they would be great stand-alone PRs. I'd also love to see the internal data structure improvements as their own PRs. There are also things I'm less sure about: in particular the changes to the public API may not necessarily be what we want. And I think I'd want to dig in deeper to the proxy trait solution to better understand how that works. But we should be able to do those things incrementally.

@wishawa Would you be on board with splitting this PR into smaller PRs so we can review them in a more granular manner?

@wishawa
Contributor Author

wishawa commented Jan 9, 2023

@yoshuawuyts That's a fair assessment. I'll split the changes into smaller PRs. The PR containing the main change (O(woken subfutures) instead of O(total subfutures) polling) will still be pretty big, though, because it involves changing the Readiness/Awakeness API that all the combinators use.

I'll be doing the rebase manually anyway so no need to revert #112 and #113.

@wishawa
Contributor Author

wishawa commented Jan 10, 2023

Closing in favor of #117, #118, and #119.

@wishawa wishawa closed this Jan 10, 2023