feat: implement resume-resharding tool #12796

Merged: 5 commits, Jan 24, 2025

Trisfald
Contributor

This PR adds a new tool for flat storage, to resume resharding if the node is stopped or crashes before an ongoing resharding can terminate.

Tested in this way:

  • Grab a forknet
  • Schedule resharding
  • Stop the node while resharding is in progress
  • Start the node, verify it fails
  • Run the tool: .near/neard-runner/binaries/neard0 flat-storage resume-resharding --shard-id 0
  • Verify flat state for parent and children is correct on disk
  • Restart the node and verify it can rejoin the network

@@ -222,6 +222,13 @@ impl FlatStorageResharder {
split_params.clone(),
)),
);
// Do not update parent flat head, to avoid overriding the resharding status.
// In any case, at the end of resharding the parent shard will completely disappear.
self.runtime
Contributor Author

I had to freeze the parent flat storage in the end, because otherwise something in chain was overriding its status.
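
The effect of freezing can be illustrated with a minimal sketch. It assumes, based on the code comment in the hunk above, that moving the flat head rewrites the stored status as Ready, which would erase a resharding status. `ToyFlatStorage`, `ToyStatus`, and their fields are hypothetical stand-ins, not nearcore's actual types; only the name `set_flat_head_update_mode` comes from this PR, and its real signature may differ.

```rust
// Toy model only: hypothetical types, much simpler than nearcore's.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyStatus {
    Ready { flat_head: u64 },
    // Flat head height recorded when the split started.
    SplittingParent { flat_head: u64 },
}

#[derive(Debug)]
pub struct ToyFlatStorage {
    status: ToyStatus,
    head_updates_enabled: bool,
}

impl ToyFlatStorage {
    pub fn new(flat_head: u64) -> Self {
        Self { status: ToyStatus::Ready { flat_head }, head_updates_enabled: true }
    }

    /// Marks the shard as a parent being split, optionally freezing its flat head.
    pub fn start_split(&mut self, freeze: bool) {
        let flat_head = match self.status {
            ToyStatus::Ready { flat_head } | ToyStatus::SplittingParent { flat_head } => flat_head,
        };
        self.status = ToyStatus::SplittingParent { flat_head };
        if freeze {
            self.set_flat_head_update_mode(false);
        }
    }

    pub fn set_flat_head_update_mode(&mut self, enabled: bool) {
        self.head_updates_enabled = enabled;
    }

    /// Assumption: a head update persists a Ready status. Returns true if applied.
    pub fn update_flat_head(&mut self, new_head: u64) -> bool {
        if !self.head_updates_enabled {
            return false;
        }
        self.status = ToyStatus::Ready { flat_head: new_head };
        true
    }

    pub fn status(&self) -> ToyStatus {
        self.status
    }
}
```

In this model, an unfrozen parent loses its `SplittingParent` status on the next head update, while a frozen one ignores the update and keeps it.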

@@ -569,6 +576,10 @@ impl FlatStorageResharder {
parent_shard,
FlatStorageStatus::Ready(FlatStorageReadyStatus { flat_head }),
);
self.runtime
Contributor Author

Not strictly needed, but included for completeness. Same at line 898.

@@ -235,6 +236,9 @@ impl FlatStorage {
let shard_id = shard_uid.shard_id();
let flat_head = match store.get_flat_storage_status(shard_uid) {
Ok(FlatStorageStatus::Ready(ready_status)) => ready_status.flat_head,
Ok(FlatStorageStatus::Resharding(FlatStorageReshardingStatus::SplittingParent(
Contributor Author

Needed to load a parent flat storage that is stuck in the "middle" of resharding.
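
As a rough sketch of the idea, the loader accepts a second status when picking the flat head. The enums below are simplified, hypothetical versions of the status types; in particular, the exact payload of `SplittingParent` is not visible in the hunk above, so storing the flat head directly in it is an assumption made for illustration.

```rust
// Hypothetical, simplified status types; not nearcore's real definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockInfo {
    pub height: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReshardingStatus {
    // Assumption: the parent remembers its flat head while it is being split.
    SplittingParent { flat_head: BlockInfo },
    CreatingChild,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Empty,
    Ready { flat_head: BlockInfo },
    Resharding(ReshardingStatus),
}

/// Picks the flat head to load a flat storage with, or explains why it cannot.
pub fn flat_head_for_loading(status: &Status) -> Result<BlockInfo, String> {
    match status {
        Status::Ready { flat_head } => Ok(*flat_head),
        // The extra arm: a parent interrupted mid-split can still be loaded.
        Status::Resharding(ReshardingStatus::SplittingParent { flat_head }) => Ok(*flat_head),
        other => Err(format!("cannot load flat storage in status {:?}", other)),
    }
}
```

Without the `SplittingParent` arm, a node restarted in the middle of a split would fall into the error branch, which matches the "start the node, verify it fails" step in the test plan.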

codecov bot commented Jan 24, 2025

Codecov Report

Attention: Patch coverage is 9.44882% with 115 lines in your changes missing coverage. Please review.

Project coverage is 70.49%. Comparing base (5b32984) to head (64dabf1).
Report is 9 commits behind head on master.

Files with missing lines                       Patch %   Lines
tools/flat-storage/src/resume_resharding.rs    0.00%     108 Missing ⚠️
tools/flat-storage/src/commands.rs             0.00%     4 Missing ⚠️
core/store/src/flat/storage.rs                 33.33%    2 Missing ⚠️
chain/chain/src/flat_storage_resharder.rs      91.66%    1 Missing ⚠️
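
The headline patch coverage can be re-derived from this table. The sketch below does the arithmetic; the per-file totals are inferred from the reported percentages and missing counts (for example, 33.33% with 2 missing implies 3 changed lines, and 91.66% with 1 missing implies 12), and the struct and function names are made up for this illustration.

```rust
// Illustrative only: names are hypothetical, totals are inferred from the table.
pub struct FileCoverage {
    pub path: &'static str,
    pub total: u32,   // changed lines in the patch (inferred)
    pub missing: u32, // changed lines without coverage (from the report)
}

pub fn reported_files() -> Vec<FileCoverage> {
    vec![
        FileCoverage { path: "tools/flat-storage/src/resume_resharding.rs", total: 108, missing: 108 },
        FileCoverage { path: "tools/flat-storage/src/commands.rs", total: 4, missing: 4 },
        FileCoverage { path: "core/store/src/flat/storage.rs", total: 3, missing: 2 },
        FileCoverage { path: "chain/chain/src/flat_storage_resharder.rs", total: 12, missing: 1 },
    ]
}

/// Percentage of changed lines that are covered; None if there are no changed lines.
pub fn patch_coverage_percent(files: &[FileCoverage]) -> Option<f64> {
    let total: u32 = files.iter().map(|f| f.total).sum();
    let missing: u32 = files.iter().map(|f| f.missing).sum();
    if total == 0 {
        return None;
    }
    Some(100.0 * f64::from(total - missing) / f64::from(total))
}
```

With these numbers, 12 of 127 changed lines are covered, which is consistent with the reported 9.44882% and 115 missing lines.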
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12796      +/-   ##
==========================================
- Coverage   70.53%   70.49%   -0.05%     
==========================================
  Files         846      847       +1     
  Lines      174904   175138     +234     
  Branches   174904   175138     +234     
==========================================
+ Hits       123372   123461      +89     
- Misses      46282    46425     +143     
- Partials     5250     5252       +2     
Flag                     Coverage Δ
backward-compatibility   0.16% <0.00%> (-0.01%) ⬇️
db-migration             0.16% <0.00%> (-0.01%) ⬇️
genesis-check            1.40% <0.00%> (+0.05%) ⬆️
linux                    70.07% <9.44%> (+0.90%) ⬆️
linux-nightly            70.10% <9.44%> (-0.04%) ⬇️
pytests                  1.70% <0.00%> (+0.05%) ⬆️
sanity-checks            1.51% <0.00%> (+0.05%) ⬆️
unittests                70.32% <9.44%> (-0.05%) ⬇️
upgradability            0.20% <0.00%> (-0.01%) ⬇️

@Trisfald Trisfald marked this pull request as ready for review January 24, 2025 17:25
@Trisfald Trisfald requested a review from a team as a code owner January 24, 2025 17:25
Contributor

@staffik staffik left a comment

LGTM!

self.runtime
.get_flat_storage_manager()
.get_flat_storage_for_shard(parent_shard)
.map(|flat_storage| flat_storage.set_flat_head_update_mode(true));
Contributor

Is it not needed in FlatStorageReshardingTaskResult::Cancelled case too?

Contributor Author

It's a good point you raised.
'Cancelling' resharding in the context of flat storage means that the node received a SIGINT. We want to cancel the tasks so the node shuts down quickly, but continue resharding on startup. So we keep flat storage locked to ensure its status stays in Resharding.

We could argue that unlocking flat storage is also unnecessary for Failed, since we panic immediately, but maybe one day we won't panic.
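
The policy described in this reply can be summarized in a small sketch. `TaskResult` is a hypothetical stand-in for `FlatStorageReshardingTaskResult`; the `Cancelled` and `Failed` cases are named in this thread, while `Completed` is an assumed name for the success case.

```rust
// Hypothetical mirror of the task result; only Cancelled and Failed are named in the thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskResult {
    Completed,
    Failed,
    Cancelled,
}

/// Whether flat head updates on the parent should be re-enabled after the split task ends.
pub fn should_unfreeze_parent(result: TaskResult) -> bool {
    match result {
        // The split is over: the parent status no longer needs protecting.
        TaskResult::Completed => true,
        // The node panics right after, so this is mostly for completeness.
        TaskResult::Failed => true,
        // Shutdown in progress: stay frozen so the Resharding status survives
        // and resharding can continue on the next startup.
        TaskResult::Cancelled => false,
    }
}
```

Keeping the match exhaustive, with no wildcard arm, means that adding a new result variant forces an explicit decision about freezing.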

@Trisfald Trisfald added this pull request to the merge queue Jan 24, 2025
Merged via the queue into near:master with commit 5581f65 Jan 24, 2025
29 checks passed
@Trisfald Trisfald deleted the resharding-tool branch January 24, 2025 21:45