Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor async_rw_mutex #1379

Draft
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

msimberg
Copy link
Contributor

@msimberg msimberg commented Dec 19, 2024

This is an attempt at slightly simplifying and optimizing the internals of async_rw_mutex. This avoids the needs for a lock to keep track of continuations and instead triggers continuations through operation states. The continuations are linked to each other through an intrusive linked list of operation states. These changes also avoid the need to have a weak shared pointer between shared states.

Overall I'm hoping that the removal of extra reference counting and locks will slightly improve performance, but fundamentally the structure is still the same, requiring the same amount of dynamic allocations (cf. #1125; this PR does not address that) as before (in fact, one more allocation for the value stored by the mutex). So the impact may be minimal in terms of performance. However, I'm also making these changes to make the dependency triggering a bit more understandable.

@msimberg msimberg self-assigned this Dec 19, 2024
Copy link

codacy-production bot commented Dec 19, 2024

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation Diff coverage
-0.08% (target: -1.00%) 100.00% (target: 90.00%)
Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (2ffc93f) 18217 13776 75.62%
Head commit (ef9e0aa) 18181 (-36) 13735 (-41) 75.55% (-0.08%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#1379) 85 85 100.00%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

See your quality gate settings    Change summary preferences

Codacy stopped sending the deprecated coverage status on June 5th, 2024. Learn more

@msimberg msimberg force-pushed the async-rw-mutex-optimizations branch 2 times, most recently from 24103a1 to 0cb0dba Compare January 7, 2025 08:51
@msimberg msimberg marked this pull request as ready for review January 7, 2025 10:09
@msimberg
Copy link
Contributor Author

msimberg commented Jan 7, 2025

I think I'm done with the refactorings for this time at least. I'd appreciate someone having at least a high level look at this.

I've added some more text and diagrams on how the implementation works. It could probably still be expanded, but I'm hoping it provides at least a better explanation than what was there before. Note that the implementation has changed sufficiently that the new implementation description does not match what was done before.

@msimberg msimberg force-pushed the async-rw-mutex-optimizations branch from 0cb0dba to 51812d5 Compare January 7, 2025 10:18
@msimberg
Copy link
Contributor Author

msimberg commented Jan 8, 2025

This in fact introduces quite significant performance regressions on some of the algorithms in DLA-Future when run on GH200. For example, cholesky is significantly slower (new is this branch, old is main where this was branched off):
chol_strong_time_20480
The performance difference is reproducible when I compare the branches manually.

Curiously, bt_band_to_trid is the only one that shows a small, but consistent, performance improvement:
bt_band2trid_strong_time_20480_20480

Many algorithms are unaffected, and the eigensolvers are slightly slower overall.

@msimberg msimberg marked this pull request as draft January 8, 2025 12:09
@msimberg
Copy link
Contributor Author

msimberg commented Jan 9, 2025

It looks like the performance regression is a result of the linked operation states being accessed in reverse order compared to before. This was only affecting the GPU backend in DLA-Future, where GPU work is scheduled inline. I've pushed a commit which restores the order of calling continuations and I'm rerunning benchmarks. We shouldn't make this a guarantee, but I'm keeping the order unchanged in this PR to not upset DLA-Future performance for now. We can see if it's possible to relax the order in the future without affecting performance.

msimberg added a commit to msimberg/DLA-Future that referenced this pull request Jan 9, 2025
@msimberg
Copy link
Contributor Author

msimberg commented Jan 9, 2025

Reversing the order of continuations now brings the performance very close to what it was with the old implementation. Some algorithms in DLA-Future still show a tiny performance improvement, but nothing dramatic.

@msimberg msimberg marked this pull request as ready for review January 9, 2025 15:19
@msimberg
Copy link
Contributor Author

msimberg commented Jan 9, 2025

This is again ready for review. Changes since last time are outlined in the previous comments.

@msimberg msimberg marked this pull request as draft January 10, 2025 08:32
The value is set directly in the constructor.
…c_rw_mutex

Reset the shared state before updating the head of the queue. Once the head of the queue is updated,
there's a small time window where continuations could be run inline, and resetting the shared state
in `done` could release the last reference to the shared state. Since we want to ensure that the
last reference is always released in a continuation, we move the resetting of the shared state to
happen before calling `done`. It's safe to do because if no continuations have been added, the
shared state is still kept alive by senders, and if continuations have been added, they'll also hold
references to the shared state.
@msimberg msimberg force-pushed the async-rw-mutex-optimizations branch from d00c0e1 to 0af4fda Compare January 16, 2025 10:50
Copy link

codacy-production bot commented Jan 16, 2025

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation Diff coverage
-0.03% (target: -1.00%) 100.00% (target: 90.00%)
Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (664c138) 18223 13740 75.40%
Head commit (0af4fda) 18188 (-35) 13708 (-32) 75.37% (-0.03%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#1379) 86 86 100.00%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

See your quality gate settings    Change summary preferences

Codacy stopped sending the deprecated coverage status on June 5th, 2024. Learn more

msimberg added a commit to msimberg/DLA-Future that referenced this pull request Jan 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

1 participant