-
Notifications
You must be signed in to change notification settings - Fork 666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement resume-resharding tool #12796
Conversation
@@ -222,6 +222,13 @@ impl FlatStorageResharder { | |||
split_params.clone(), | |||
)), | |||
); | |||
// Do not update parent flat head, to avoid overriding the resharding status. | |||
// In any case, at the end of resharding the parent shard will completely disappear. | |||
self.runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to freeze the parent flat storage in the end, because otherwise something in chain was overriding its status.
@@ -569,6 +576,10 @@ impl FlatStorageResharder { | |||
parent_shard, | |||
FlatStorageStatus::Ready(FlatStorageReadyStatus { flat_head }), | |||
); | |||
self.runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really needed, but included for completeness, same at line 898
@@ -235,6 +236,9 @@ impl FlatStorage { | |||
let shard_id = shard_uid.shard_id(); | |||
let flat_head = match store.get_flat_storage_status(shard_uid) { | |||
Ok(FlatStorageStatus::Ready(ready_status)) => ready_status.flat_head, | |||
Ok(FlatStorageStatus::Resharding(FlatStorageReshardingStatus::SplittingParent( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed to load the parent flat storage stuck in the "middle" of resharding
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #12796 +/- ##
==========================================
- Coverage 70.53% 70.49% -0.05%
==========================================
Files 846 847 +1
Lines 174904 175138 +234
Branches 174904 175138 +234
==========================================
+ Hits 123372 123461 +89
- Misses 46282 46425 +143
- Partials 5250 5252 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
self.runtime | ||
.get_flat_storage_manager() | ||
.get_flat_storage_for_shard(parent_shard) | ||
.map(|flat_storage| flat_storage.set_flat_head_update_mode(true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not needed in FlatStorageReshardingTaskResult::Cancelled
case too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point you raised.
The operation of 'cancelling' resharing in the context of flat storage means that the node got SIGINT'ed. We want to cancel the tasks to shutdown quickly, but continue resharding on startup. So we keep flat storage locked to ensure the its state is locked in Resharding
.
We could argue that unlocking flat storage is not needed also for Failed
since we panic immediately, but well, maybe one day we won't panic
This PR adds a new tool for flat storage, to resume resharding if the node is stopped or crashes before an ongoing resharding can terminate.
Tested in this way:
.near/neard-runner/binaries/neard0 flat-storage resume-resharding --shard-id 0