
Waker optimization + O(woken) polling for every combinator except chain #115

Closed
wants to merge 13 commits

Conversation

wishawa
Contributor

@wishawa wishawa commented Dec 27, 2022

Performance improvements

WakerArray/WakerVec now keep track of which futures were woken in a list, so no O(n) search is needed when polling. This makes polling O(woken) instead of O(n). The impact is very significant for large combinators.

Also, the new implementation avoids locking and unlocking the Readiness Mutex in a loop: it copies out the data needed for iteration once at the beginning of each poll.
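A minimal sketch of the woken-list bookkeeping (hypothetical names like SharedAwakeness and drain_woken; the real code differs, e.g. it integrates with the wakers):

```rust
use std::sync::Mutex;

// Illustrative sketch, not the crate's actual code.
struct AwakenessVec {
    // flags[i] is true if subfuture i is already in `woken`,
    // preventing duplicate entries.
    flags: Vec<bool>,
    // Indices of subfutures woken since the last poll, in wake order.
    woken: Vec<usize>,
}

struct SharedAwakeness {
    inner: Mutex<AwakenessVec>,
}

impl SharedAwakeness {
    fn new(len: usize) -> Self {
        Self {
            inner: Mutex::new(AwakenessVec {
                flags: vec![true; len],    // everything starts "woken"
                woken: (0..len).collect(), // so the first poll visits all
            }),
        }
    }

    // Called by the waker for subfuture `index`.
    fn set_woken(&self, index: usize) {
        let mut inner = self.inner.lock().unwrap();
        if !inner.flags[index] {
            inner.flags[index] = true;
            inner.woken.push(index);
        }
    }

    // Called once at the start of each poll: takes the whole list so the
    // combinator can iterate without re-locking the mutex per subfuture.
    fn drain_woken(&self) -> Vec<usize> {
        let mut inner = self.inner.lock().unwrap();
        for i in 0..inner.flags.len() {
            inner.flags[i] = false;
        }
        std::mem::take(&mut inner.woken)
    }
}
```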

I've also made WakerArray/WakerVec use a single Arc shared between all wakers (without giving up perfect waking) instead of needing a separate Arc for each. This change involves some unsafe work with RawWaker.

API changes

Race, TryJoin, and RaceOk for tuples now support futures with heterogeneous results. Racing futures with different output types returns an enum whose variants are the possible outputs. If all outputs have the same type, there is a function to convert the enum into that type.
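For example, racing two futures with outputs A and B could yield something like the following (a hypothetical sketch; the names RaceOutput2 and into_inner are made up, not the PR's actual API):

```rust
// Sketch of a heterogeneous race output for a 2-tuple. Hypothetical names.
#[derive(Debug, PartialEq)]
enum RaceOutput2<A, B> {
    // The first future finished first.
    Fut0(A),
    // The second future finished first.
    Fut1(B),
}

impl<T> RaceOutput2<T, T> {
    // When both outputs share one type, collapse the enum into that type.
    fn into_inner(self) -> T {
        match self {
            RaceOutput2::Fut0(t) | RaceOutput2::Fut1(t) => t,
        }
    }
}
```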

RaceOk error types are simplified to just an array/tuple/vec of the errors. I removed the wrapping AggregateError because it can be confusing for tuples (since the error types are no longer necessarily homogeneous).

Organizational changes

As part of rewriting the various combinators, I've merged the code for join/try_join/race/race_ok. There is now a crate-private futures::common module with a generic combinator whose behaviors can be controlled to match join/try_join/race/race_ok by a generic type parameter. For tuples, I basically implement try join and make every other combinator delegate to that.
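As a rough illustration of the behavior-parameterized idea, here is a simplified, synchronous sketch with made-up names (the crate's actual trait operates on polled futures, not an iterator of finished values):

```rust
use std::ops::ControlFlow;

// One generic "drive" loop whose completion semantics are selected by a
// type parameter, mirroring how join/race can share one implementation.
trait Behavior<T> {
    type Output;
    // Called each time a subfuture completes.
    fn on_ready(acc: &mut Vec<T>, item: T) -> ControlFlow<Self::Output>;
    // Called when every subfuture has completed.
    fn on_all_done(acc: Vec<T>) -> Self::Output;
}

struct JoinBehavior;
impl<T> Behavior<T> for JoinBehavior {
    type Output = Vec<T>;
    fn on_ready(acc: &mut Vec<T>, item: T) -> ControlFlow<Vec<T>> {
        acc.push(item);
        ControlFlow::Continue(()) // join waits for everything
    }
    fn on_all_done(acc: Vec<T>) -> Vec<T> {
        acc
    }
}

struct RaceBehavior;
impl<T> Behavior<T> for RaceBehavior {
    type Output = T;
    fn on_ready(_acc: &mut Vec<T>, item: T) -> ControlFlow<T> {
        ControlFlow::Break(item) // race finishes at the first completion
    }
    fn on_all_done(mut acc: Vec<T>) -> T {
        // Unreachable for a nonempty race; a real race over zero futures
        // would pend forever instead.
        acc.pop().unwrap()
    }
}

// Stand-in for the shared poll loop: feed completed items through B.
fn drive<T, B, I>(items: I) -> B::Output
where
    B: Behavior<T>,
    I: IntoIterator<Item = T>,
{
    let mut acc = Vec::new();
    for item in items {
        if let ControlFlow::Break(out) = B::on_ready(&mut acc, item) {
            return out;
        }
    }
    B::on_all_done(acc)
}
```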

I've also upped the edition to Rust 2021. This is not really necessary but the disjoint closure capture saves a few lines of code.

I renamed "Readiness" to "Awakeness" because the former gets confusing since Poll/PollState::Ready means the future is complete rather than awake.

Benchmark

Currently, CountdownFuture/Stream wake and complete in perfect order, which is unrealistic. I added shuffling (with a fixed seed) so that they better represent real workloads.

Below:
after = this PR
before = origin/main with countdown shuffling commit cherry-picked

$ critcmp  after before
group                after                                  before
-----                -----                                  ------
array::join 10       1.00  1854.5±12.26ns        ? ?/sec    1.20      2.2±0.02µs        ? ?/sec
array::join 100      1.00     20.6±0.16µs        ? ?/sec    1.17     24.1±0.28µs        ? ?/sec
array::join 1000     1.00    219.4±3.13µs        ? ?/sec    5.67  1242.9±19.53µs        ? ?/sec
array::merge 10      1.00  1828.5±29.31ns        ? ?/sec    1.32      2.4±0.05µs        ? ?/sec
array::merge 100     1.00     19.9±0.26µs        ? ?/sec    2.15     42.8±3.00µs        ? ?/sec
array::merge 1000    1.00    225.9±3.09µs        ? ?/sec    8.98      2.0±0.04ms        ? ?/sec
array::race 10       1.07  1138.1±18.48ns        ? ?/sec    1.00  1061.1±43.17ns        ? ?/sec
array::race 100      1.00      7.9±0.09µs        ? ?/sec    1.41     11.1±0.17µs        ? ?/sec
array::race 1000     1.00     82.5±1.13µs        ? ?/sec    1.58    130.6±2.91µs        ? ?/sec
tuple::join 10       1.00  1912.5±29.00ns        ? ?/sec    1.14      2.2±0.03µs        ? ?/sec
tuple::merge 10      1.00      2.2±0.03µs        ? ?/sec    1.23      2.7±0.06µs        ? ?/sec
tuple::race 10       1.15  1134.5±20.99ns        ? ?/sec    1.00   987.5±13.09ns        ? ?/sec
vec::join 10         1.00      2.3±0.06µs        ? ?/sec    1.08      2.5±0.05µs        ? ?/sec
vec::join 100        1.00     18.0±0.19µs        ? ?/sec    2.05     36.9±0.47µs        ? ?/sec
vec::join 1000       1.00    202.1±1.68µs        ? ?/sec    9.45  1909.9±161.66µs        ? ?/sec
vec::merge 10        1.00      2.4±0.04µs        ? ?/sec    1.18      2.8±0.05µs        ? ?/sec
vec::merge 100       1.00     20.8±0.25µs        ? ?/sec    2.57     53.3±1.27µs        ? ?/sec
vec::merge 1000      1.00    222.8±3.95µs        ? ?/sec    12.86     2.9±0.05ms        ? ?/sec
vec::race 10         1.33  1349.1±15.53ns        ? ?/sec    1.00  1011.6±13.01ns        ? ?/sec
vec::race 100        1.00      7.3±0.10µs        ? ?/sec    1.45     10.6±0.14µs        ? ?/sec
vec::race 1000       1.00     72.0±1.64µs        ? ?/sec    1.70    122.7±1.75µs        ? ?/sec

@wishawa wishawa changed the title waker array/vec: use a single big Arc shared between all wakers Waker optimization + O(woken) polling for every combinator except chain Dec 30, 2022
@matheus-consoli
Collaborator

wow, thank you for this!

I'll try to make time to review your changes soon -- but I loved the Behavior traits approach, and the bench results are quite impressive!

Collaborator

@matheus-consoli matheus-consoli left a comment


Thank you again for this wonderful work!

I think this overall improves the codebase and seems to be correct.
But since this brings a lot of breaking changes, I'd prefer a second opinion from @yoshuawuyts.

We'll need a new major release after merging, and we should probably work on some details before the release: improving error handling (happy_eyeballs exemplifies that returning Vec<Error> is not very Try-friendly), iterating in a fairer fashion (in the old Indexer sense; as I understand it, this PR removes it), and seeing whether we would benefit from bitvec in some places. We can iterate on this in small steps after the merge.

@matheus-consoli
Collaborator

We may also favor this over #113 and #112; I can close them after merging.

@wishawa
Contributor Author

wishawa commented Jan 4, 2023

Thanks for the review!

It is true that removing AggregateError makes error handling more verbose. Might not have been a good idea on my part.
What if we bring it back but replace array::AggregateError<E, N>/vec::AggregateError<E>/tuple::AggregateErrorN<E1, E2, E3, ...> with a single AggregateError<_> type that can be parameterized as AggregateError<Vec<E>>/AggregateError<[E; N]>/AggregateError<(E1, E2, E3, ...)>? This would avoid having the macro generate a new error type for every tuple length.
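A sketch of what that single parameterized type could look like (illustrative only; the exact bounds and trait impls would need more thought):

```rust
use std::fmt;

// One error type reused for every shape of error collection:
// AggregateError<[E; N]>, AggregateError<Vec<E>>, AggregateError<(E1, E2)>, ...
#[derive(Debug)]
struct AggregateError<T> {
    errors: T,
}

impl<T: fmt::Debug> fmt::Display for AggregateError<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Debug-formatting the inner collection keeps this impl generic;
        // a real version would likely want nicer per-error formatting.
        write!(f, "all futures failed: {:?}", self.errors)
    }
}

impl<T: fmt::Debug> std::error::Error for AggregateError<T> {}
```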

For fairness, things get complicated.

Consider these implementations:

  • Design 0: the current implementation but with Indexer removed.
  • Design 1: the current implementation (with Indexer).
  • Design 2: this PR as is; poll subfutures from first to last in the initial round, and poll them in the order they wake for subsequent rounds.
  • Design 3: this PR with something like Indexer applied; in each round, poll the subfuture that woke second first, third second, on until circling around to poll the one that woke first last.
  • Design 4: this PR with an RNG to shuffle the initial poll order.

Situation A

async fn work() {
    for _ in 0..N {
        shared_mutex.lock().await;
    }
}
(work(), work()).race().await;

Here design 4 is the only fair one. With 0/1/2/3, the left subfuture wins.

Situation B

async fn work() {
    for _ in 0..N {
        shared_mutex.lock().await;
        yield_now().await;
    }
}
(work(), work()).race().await;

Here design 4 is fair as before. Designs 0 and 2 prefer the left subfuture. For 1 and 3, the winner depends on whether the constant N is odd or even.

Situation C

async fn work(time: f32) {
    for _ in 0..N {
        sleep(time).await;
    }
}
(work(1.00), work(1.01)).race().await;

Here, the left subfuture wins in all implementations.

Situation D
Same code as situation C, but make the executor busy enough that it takes longer than 0.01 sec between Race waking up and getting polled.

Here 0, 2, and 4 prefer the left subfuture. 3 prefers the right subfuture (!). 1 depends on oddness/evenness of N.


So the RNG solution (design 4) is the clear winner. But it comes with the RNG's complexity and nondeterminism.

Apart from 4,

  • Design 2 gives unfair head starts to subfutures earlier in the list (but it is still fairer than design 0).
  • Design 3 can end up punishing fast subfutures, as shown in situation D.
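For reference, design 4's shuffled initial poll order could be as cheap as a seeded xorshift driving a Fisher-Yates pass (illustrative sketch; a real implementation would seed from entropy rather than a fixed value):

```rust
// Produce a random-looking permutation of 0..len to use as the initial
// poll order. Hypothetical helper, not code from this PR.
fn shuffle_poll_order(len: usize, mut seed: u64) -> Vec<usize> {
    let mut order: Vec<usize> = (0..len).collect();
    // Fisher-Yates shuffle driven by xorshift64 steps.
    for i in (1..len).rev() {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        let j = (seed % (i as u64 + 1)) as usize;
        order.swap(i, j);
    }
    order
}
```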

But this PR is already too big as it is! So as you said, let's iterate on fairness later.

@yoshuawuyts
Owner

yoshuawuyts commented Jan 9, 2023

We also may favor this over #113 and #112, I can close them after merging

Oops, I missed this! - I was going through the PRs in chronological order, and didn't realize these might in fact conflict. Reviewing this PR now!

@yoshuawuyts
Owner

@wishawa thanks for this PR! - I think there are some interesting things here that I would like to tease apart, possibly into different PRs. I'm trying to summarize the work you've done, so far I'm seeing the following changes:

  1. Changed the bench suite to use a BinaryHeap internally
  2. Changed the APIs of Race, TryJoin, and RaceOk
  3. Replaced the AggregateError structures used in RaceOk entirely
  4. Sped up the internal implementation of WakerArray and WakerVec by reusing one Arc between all wakers
  5. Renamed internal data structures (readiness -> awakeness)
  6. Updated the edition to Rust 2021
  7. Created an internal Behaviour trait implementation which all other traits are based on

This PR by itself is probably too big to review in-depth on its own. However, some of these might be interesting to start off with individually. For example: the edition bump, and changes to the bench suite seem like they would be great stand-alone PRs. I'd also love to see the internal data structure improvements as their own PRs. There are also things I'm less sure about: in particular the changes to the public API may not necessarily be what we want. And I think I'd want to dig in deeper to the proxy trait solution to better understand how that works. But we should be able to do those things incrementally.

@wishawa Would you be on board with splitting this PR into smaller PRs so we can review them in a more granular manner?

@wishawa
Contributor Author

wishawa commented Jan 9, 2023

@yoshuawuyts That's a fair assessment. I'll split the changes into smaller PRs. The PR containing the main change (O(woken subfutures) instead of O(total subfutures) polling) will still be pretty big, though, because it involves changing the Readiness/Awakeness API that all the combinators use.

I'll be doing the rebase manually anyway so no need to revert #112 and #113.

@wishawa
Contributor Author

wishawa commented Jan 10, 2023

Closing in favor of #117, #118, and #119.

@wishawa wishawa closed this Jan 10, 2023