Refactor `async_rw_mutex` #1379

msimberg · 2024-12-19T12:39:36Z

This is an attempt at slightly simplifying and optimizing the internals of `async_rw_mutex`. It avoids the need for a lock to keep track of continuations and instead triggers continuations through operation states, which are linked to each other through an intrusive linked list. These changes also remove the need for a weak shared pointer between shared states.

Overall I'm hoping that removing the extra reference counting and locks will slightly improve performance. Fundamentally the structure is still the same, though: it requires the same number of dynamic allocations as before (cf. #1125; this PR does not address that), and in fact one more allocation for the value stored by the mutex. So the performance impact may be minimal. However, I'm also making these changes to make the dependency triggering a bit more understandable.
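To illustrate the general idea of linking operation states through an intrusive list instead of guarding a container of continuations with a lock, here is a minimal sketch. It is not the code in this PR: `op_state_base`, `shared_state`, `try_enqueue`, `done` and `recording_op` are hypothetical names chosen for the example, and the "done" sentinel is just one possible way to mark a state as finished.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical base class for an operation state waiting on a shared state.
// The `next` pointer lives inside the operation state itself (intrusive
// list), so queueing a continuation needs no separate node allocation.
struct op_state_base {
    op_state_base* next = nullptr;
    virtual void resume() noexcept = 0;
    virtual ~op_state_base() = default;
};

// Hypothetical shared state holding the head of a lock-free intrusive stack.
class shared_state {
public:
    // Returns true if `op` was queued. Returns false if the state was
    // already marked done, in which case the caller can continue inline.
    bool try_enqueue(op_state_base* op) noexcept {
        assert(op != nullptr);
        op_state_base* expected = head_.load(std::memory_order_acquire);
        do {
            if (expected == done_marker()) { return false; }
            op->next = expected;
        } while (!head_.compare_exchange_weak(expected, op,
            std::memory_order_acq_rel, std::memory_order_acquire));
        return true;
    }

    // Marks the state as done with a single exchange (no compare-and-swap
    // loop) and resumes everything queued so far, newest first.
    void done() noexcept {
        op_state_base* op =
            head_.exchange(done_marker(), std::memory_order_acq_rel);
        if (op == done_marker()) { return; }    // already done earlier
        while (op != nullptr) {
            // Read the link first: resuming may end the lifetime of `op`.
            op_state_base* next = op->next;
            op->resume();
            op = next;
        }
    }

private:
    // Sentinel that can never be the address of a real operation state.
    // The pointer is only compared, never dereferenced.
    op_state_base* done_marker() noexcept {
        return reinterpret_cast<op_state_base*>(this);
    }

    std::atomic<op_state_base*> head_{nullptr};
};

// Example operation state that records the order in which it was resumed.
struct recording_op final : op_state_base {
    recording_op(int id_, std::vector<int>& log_) : id(id_), log(&log_) {}
    void resume() noexcept override { log->push_back(id); }
    int id;
    std::vector<int>* log;
};
```

With this sketch, operation states pushed as 1 and then 2 are resumed as 2 and then 1, because the list is a stack. A `try_enqueue` call made after `done` returns `false`.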

codacy-production · 2024-12-19T12:44:45Z

Coverage summary from Codacy

See diff coverage on Codacy

| Coverage variation | Diff coverage |
|---|---|
| ✅ -0.08% (target: -1.00%) | ✅ 100.00% (target: 90.00%) |

Coverage variation details

| | Coverable lines | Covered lines | Coverage |
|---|---|---|---|
| Common ancestor commit (`2ffc93f`) | 18217 | 13776 | 75.62% |
| Head commit (`ef9e0aa`) | 18181 (-36) | 13735 (-41) | 75.55% (-0.08%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: `<coverage of head commit> - <coverage of common ancestor commit>`

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
|---|---|---|---|
| Pull request (#1379) | 85 | 85 | 100.00% |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: `<covered lines added or modified> / <coverable lines added or modified> * 100%`

msimberg · 2025-01-07T10:10:48Z

I think I'm done with the refactoring, for now at least. I'd appreciate someone having at least a high-level look at this.

I've added some more text and diagrams on how the implementation works. It could probably still be expanded, but I'm hoping it provides at least a better explanation than what was there before. Note that the implementation has changed sufficiently that the new implementation description does not match what was done before.

msimberg · 2025-01-08T12:09:30Z

This in fact introduces quite significant performance regressions on some of the algorithms in DLA-Future when run on GH200. For example, cholesky is significantly slower (new is this branch, old is main where this was branched off):

The performance difference is reproducible when I compare the branches manually.

Curiously, bt_band_to_trid is the only one that shows a small, but consistent, performance improvement:

Many algorithms are unaffected, and the eigensolvers are slightly slower overall.

msimberg · 2025-01-09T13:39:29Z

It looks like the performance regression is a result of the linked operation states being accessed in reverse order compared to before. This only affected the GPU backend in DLA-Future, where GPU work is scheduled inline. I've pushed a commit which restores the order of calling continuations and I'm rerunning benchmarks. We shouldn't make this order a guarantee, but I'm keeping it unchanged in this PR so as not to upset DLA-Future performance for now. We can see if it's possible to relax the order in the future without affecting performance.
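For context on why the order flips: pushing onto the front of a singly linked intrusive list yields newest-first traversal. One generic way to get back to the original order is to reverse the list in place before walking it. The sketch below shows that technique only. The names `waiter`, `reverse_list` and `visit_in_push_order` are hypothetical, and this is not necessarily how the commit in this PR restores the order.

```cpp
#include <cassert>
#include <vector>

// Hypothetical intrusive node standing in for an operation state.
struct waiter {
    int id;
    waiter* next = nullptr;
};

// Reverses a singly linked intrusive list in place and returns the new head.
inline waiter* reverse_list(waiter* head) noexcept {
    waiter* prev = nullptr;
    while (head != nullptr) {
        waiter* next = head->next;
        head->next = prev;
        prev = head;
        head = next;
    }
    return prev;
}

// Takes the head of a list built by pushing to the front (newest first) and
// returns the ids in the order in which the waiters were originally pushed.
inline std::vector<int> visit_in_push_order(waiter* lifo_head) {
    std::vector<int> order;
    for (waiter* w = reverse_list(lifo_head); w != nullptr; w = w->next) {
        order.push_back(w->id);
    }
    return order;
}
```

The reversal is a single pass with no allocation, so it adds little on top of the traversal that calls the continuations.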

msimberg · 2025-01-09T15:17:39Z

Reversing the order of continuations now brings the performance very close to what it was with the old implementation. Some algorithms in DLA-Future still show a tiny performance improvement, but nothing dramatic.

msimberg · 2025-01-09T15:20:24Z

This is again ready for review. Changes since last time are outlined in the previous comments.

…c_rw_mutex Reset the shared state before updating the head of the queue. Once the head of the queue is updated, there's a small time window where continuations could be run inline, and resetting the shared state in `done` could release the last reference to the shared state. Since we want to ensure that the last reference is always released in a continuation, we move the resetting of the shared state to happen before calling `done`. It's safe to do because if no continuations have been added, the shared state is still kept alive by senders, and if continuations have been added, they'll also hold references to the shared state.

codacy-production · 2025-01-16T10:53:51Z

Coverage summary from Codacy

See diff coverage on Codacy

| Coverage variation | Diff coverage |
|---|---|
| ✅ -0.03% (target: -1.00%) | ✅ 100.00% (target: 90.00%) |

Coverage variation details

| | Coverable lines | Covered lines | Coverage |
|---|---|---|---|
| Common ancestor commit (`664c138`) | 18223 | 13740 | 75.40% |
| Head commit (`0af4fda`) | 18188 (-35) | 13708 (-32) | 75.37% (-0.03%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: `<coverage of head commit> - <coverage of common ancestor commit>`

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
|---|---|---|---|
| Pull request (#1379) | 86 | 86 | 100.00% |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: `<covered lines added or modified> / <coverable lines added or modified> * 100%`

msimberg self-assigned this Dec 19, 2024

msimberg mentioned this pull request Dec 20, 2024

Debugging hangs eth-cscs/DLA-Future#1238

Closed

msimberg force-pushed the async-rw-mutex-optimizations branch 2 times, most recently from 24103a1 to 0cb0dba Compare January 7, 2025 08:51

msimberg marked this pull request as ready for review January 7, 2025 10:09

msimberg requested review from aurianer and biddisco as code owners January 7, 2025 10:09

msimberg force-pushed the async-rw-mutex-optimizations branch from 0cb0dba to 51812d5 Compare January 7, 2025 10:18

msimberg marked this pull request as draft January 8, 2025 12:09

msimberg added a commit to msimberg/DLA-Future that referenced this pull request Jan 9, 2025

Update pika branch to pika-org/pika#1379

f9588eb

msimberg marked this pull request as ready for review January 9, 2025 15:19

msimberg marked this pull request as draft January 10, 2025 08:32

msimberg added 12 commits January 16, 2025 11:49

Store async_rw_mutex value in a separate shared state

1d5564c

This allows avoiding synchronization required when passing the value from one shared state to another.

Don't pass shared state from previous to next state in async_rw_mutex

2efe891

Use the shared state already stored in the operation state in continuations.

Trigger continuations of async_rw_mutex shared states in the owning s…

4abd849

…hared state Don't do it in the previous shared state, for simpler reasoning about ownership.

Replace type-erased unique_function with a pointer and virtual functi…

ec9a22d

…on in async_rw_mutex for continuations

Store operation states of continuations in async_rw_mutex as an intru…

ea493e1

…sive linked list

Avoid storing pointer to previous shared state in async_rw_mutex

7b700f8

Manage shared state allocation and reference counting manually in asy…

dfaad33

…nc_rw_mutex

Don't shadow allocator_type typedef in async_rw_mutex

ae3cc62

Avoid shadowing alloc member variable in async_rw_mutex

cba3b98

Make one-parameter constructors explicit in async_rw_mutex

c68679d

Avoid templated one-parameter constructors in async_rw_mutex

920f30d

Explicitly specify expected type to avoid unwanted constructor calls.

Avoid writing out a few full pointer types in favour of auto* in asyn…

1bdd6f2

…c_rw_mutex

msimberg added 15 commits January 16, 2025 11:49

Mark move constructors noexcept in async_rw_mutex

819ab94

Remove unused set_value member function in async_rw_mutex shared state

84298d1

The value is set directly in the constructor.

Revert "Manage shared state allocation and reference counting manuall…

15ca558

…y in async_rw_mutex" This reverts commit 7851830.

Revert "Remove unused set_value member function in async_rw_mutex sha…

e37cfe0

…red state" This reverts commit a584683.

Ensure async_rw_mutex shared state is released as early as possible

95aa0f5

Refactor queue handling in async_rw_mutex operation states to be a bi…

95beac1

…t more straightforward

Add base class for async_rw_mutex shared state to avoid duplication b…

0cfb263

…etween void and non-void case

Expand implementation documentation for async_rw_mutex

accbd86

Remove unused includes and clang-tidy suppressions in async_rw_mutex.hpp

05a63f0

Add noexcept to many member functions in async_rw_mutex

b3f5767

Replace empty while-loop body with semicolon

0cb2e19

Add comment about void* in async_rw_mutex_operation_state_base

01671b3

Avoid compare-and-swap loop when marking async_rw_mutex shared state …

16eee55

…as done

Reverse the order of calling continuations in async_rw_mutex

5957ee8

msimberg force-pushed the async-rw-mutex-optimizations branch from d00c0e1 to 0af4fda Compare January 16, 2025 10:50

msimberg added a commit to msimberg/DLA-Future that referenced this pull request Jan 28, 2025

Update pika branch to pika-org/pika#1379

4f404f5
