-
Notifications
You must be signed in to change notification settings - Fork 607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pre restart probe fixup #25243
base: dev
Are you sure you want to change the base?
Pre restart probe fixup #25243
Conversation
Pre-restart probe is a lengthy operation with scheduling points, report cache refresh may change the data it iterates over. Use lw_shared_ptr to hold 2 copies if pre-restart probe calculation and refresh work concurrently.
CI test resultstest results on build#62600
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, there are a ton loops in health_monitor_backend, would be great to get another pair of 👀 incase I missed something.
absl::erase_if(_status, not_in_members_table); | ||
|
||
_reports = new_reports; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: _reports = std::move(new_reports);
// local report | ||
auto local_report_it = std::as_const(_reports).find(_self); | ||
if (local_report_it == _reports.cend()) { | ||
auto local_report_it = reports->find(_self); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me how using a shared pointer solves the concurrency problem. It looks like the we're still iterating over this structure with (presumably) shared access where iterators e.g. are held across scheduling points?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(my understanding)
all the loops with scheduling points operate on the (shared_ptr) copy of the report. The only modification to _report is in collect_cluster_health() which makes a new shared_ptr (report) and eventually moves it into _report while the original loop iteration can safely operate on a copy until it is alive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we only change data stored under _reports
in collect_cluster_health
where we construct a new object and replace the shared pointer to it. At this point the old object ceases to be referred by _reports
. If it is being iterated over in walk_local_and_remote_reports
it is kept alive by reports
shared pointer (or multiple copies thereof if there are multiple iterations in flight), otherwise it gets destructed immediately. When all iterations are over it eventually gets destructed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if we are making a copy for local concurrent-free iteration, what's the point of the shared pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never make a copy of the data explicitly, we only copy a shared pointer thus creating one more handle to the same dataset.
Also reports collection is never modified, we should change _reports
type to lw_shard_ptr<const report_cache_t>
(I will do this). The collection is only replaced as a whole. When it is replaced, this function still holds a shared pointer to the old version thus keeping it from being destroyed.
Does it shed any light?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bit clearer now, thanks. If ya'll think it's safe then I'm good. It's pretty hard to analyze this stuff as a reviewer without seeing the big picture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually tried to make it const and found a place where it is mutated! Thanks for the heads up.
Following post-merge comment from @dotnwat on #24928
Backports Required
Release Notes