test(resharding): enable resharding state sync and modify the resharding state sanity checks accordingly (near#12546)

Here we enable state sync after a resharding by removing the debug
assert that prevented it, and we unignore the corresponding resharding
TestLoop test.

To make this work, we have to modify
`assert_state_sanity_for_children_shard()`, because it currently assumes
that the client tracks all shards. Simply checking whether the shard is
tracked before proceeding is not enough: memtries are unloaded at the
end of an epoch, and this check used to run at the end of the test, when
the head of the chain is in a different epoch than the final head, so
clients tracking a single shard would no longer have those memtries
loaded. So instead of making the check at the end, we make it on every
new block. The check is quick anyway (under 2 ms), and there's no harm
in getting a bit more coverage. While we're at it, we also run the check
for every shard rather than just the child shards, since that is a bit
simpler and the extra coverage has no downside.
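
In rough terms, the per-block check now behaves like the sketch below. This is an illustrative, self-contained sketch rather than the actual TestLoop helper: `Client`, `client_tracks_shard`, `flat_storage_ready`, and the two `check_*` functions are assumed stand-ins.

```rust
// Illustrative sketch only; all types and helpers here are stand-ins for
// the real TestLoop client and trie-comparison helpers.
type ShardId = u64;

struct Client;

fn client_tracks_shard(_client: &Client, _shard_id: ShardId) -> bool {
    // Real test: consult the client's shard tracker for this epoch.
    true
}

fn flat_storage_ready(_client: &Client, _shard_id: ShardId) -> bool {
    // False while a child shard's flat storage is still being built.
    true
}

fn check_memtrie_matches_trie(_client: &Client, _shard_id: ShardId) {
    // Compare the in-memory trie against the on-disk trie.
}

fn check_memtrie_matches_flat_storage(_client: &Client, _shard_id: ShardId) {
    // Compare the in-memory trie against flat storage.
}

/// Run the sanity checks for every shard on every new block, skipping
/// shards this client does not track.
fn check_new_block(client: &Client, all_shards: &[ShardId]) {
    for &shard_id in all_shards {
        if !client_tracks_shard(client, shard_id) {
            continue; // single-shard trackers no longer trip the assertion
        }
        check_memtrie_matches_trie(client, shard_id);
        if flat_storage_ready(client, shard_id) {
            check_memtrie_matches_flat_storage(client, shard_id);
        }
    }
}

fn main() {
    check_new_block(&Client, &[0, 1, 2, 3]);
}
```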

Performing the check on every block does require some extra care to make
sure we're actually checking what we think we are, though. When comparing
flat storage and memtrie contents on arbitrary blocks, flat storage might
not be ready because it belongs to a child shard that is still being
created. That's not a big deal: we simply skip the flat storage and
memtrie comparison in that case (but still check memtrie and trie
equality). We could leave it at that, but then the test could silently
stop checking what it's supposed to if a bug prevented the flat storage
comparison from happening for an entire epoch. So we introduce a
`struct TrieSanityCheck` that keeps track, for each account and each
epoch, of the shards for which we expect flat storage and memtries to
exist. At the end of the test we then iterate over the epoch IDs and make
sure all the expected checks were performed.
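
The test code itself is not reproduced in the hunks on this page, so the following is only a minimal sketch of what such bookkeeping could look like; the field and method names are assumptions rather than the actual `TrieSanityCheck` API.

```rust
use std::collections::{HashMap, HashSet};

// Stand-ins for nearcore types; illustrative only.
type EpochId = [u8; 32];
type AccountId = String;
type ShardId = u64;

/// Tracks, per epoch and per account, which shards the flat storage /
/// memtrie comparison is expected to run for, and which were actually seen.
#[derive(Default)]
struct TrieSanityCheck {
    expected: HashMap<(EpochId, AccountId), HashSet<ShardId>>,
    checked: HashMap<(EpochId, AccountId), HashSet<ShardId>>,
}

impl TrieSanityCheck {
    /// Called when a new epoch is first seen: record which shards this account tracks.
    fn expect_check(&mut self, epoch_id: EpochId, account: AccountId, shard_id: ShardId) {
        self.expected.entry((epoch_id, account)).or_default().insert(shard_id);
    }

    /// Called whenever the per-block check actually compared flat storage and memtrie.
    fn record_check(&mut self, epoch_id: EpochId, account: AccountId, shard_id: ShardId) {
        self.checked.entry((epoch_id, account)).or_default().insert(shard_id);
    }

    /// Called at the end of the test: every expected comparison must have
    /// happened at least once, otherwise the test silently lost coverage.
    fn assert_complete(&self) {
        for (key, expected_shards) in &self.expected {
            let checked = self.checked.get(key).cloned().unwrap_or_default();
            let missing: Vec<_> = expected_shards.difference(&checked).collect();
            assert!(missing.is_empty(), "missing checks for {:?}: {:?}", key, missing);
        }
    }
}

fn main() {
    let mut sanity = TrieSanityCheck::default();
    let epoch = [0u8; 32];
    sanity.expect_check(epoch, "account0".to_string(), 0);
    sanity.record_check(epoch, "account0".to_string(), 0);
    sanity.assert_complete();
}
```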
marcelo-gonzalez authored Dec 3, 2024
1 parent 79db60a commit 4b309ed
Showing 4 changed files with 215 additions and 72 deletions.
5 changes: 5 additions & 0 deletions chain/chain-primitives/src/error.rs
@@ -193,6 +193,8 @@ pub enum Error {
     InvalidShardId(ShardId),
     #[error("Shard index {0} does not exist")]
     InvalidShardIndex(ShardIndex),
+    #[error("Shard id {0} does not have a parent")]
+    NoParentShardId(ShardId),
     /// Invalid shard id
     #[error("Invalid state request: {0}")]
     InvalidStateRequest(String),
@@ -326,6 +328,7 @@ impl Error {
             | Error::InvalidBandwidthRequests(_)
             | Error::InvalidShardId(_)
             | Error::InvalidShardIndex(_)
+            | Error::NoParentShardId(_)
             | Error::InvalidStateRequest(_)
             | Error::InvalidRandomnessBeaconOutput
             | Error::InvalidBlockMerkleRoot
@@ -407,6 +410,7 @@ impl Error {
             Error::InvalidBandwidthRequests(_) => "invalid_bandwidth_requests",
             Error::InvalidShardId(_) => "invalid_shard_id",
             Error::InvalidShardIndex(_) => "invalid_shard_index",
+            Error::NoParentShardId(_) => "no_parent_shard_id",
             Error::InvalidStateRequest(_) => "invalid_state_request",
             Error::InvalidRandomnessBeaconOutput => "invalid_randomness_beacon_output",
             Error::InvalidBlockMerkleRoot => "invalid_block_merkele_root",
@@ -450,6 +454,7 @@ impl From<ShardLayoutError> for Error {
             ShardLayoutError::InvalidShardIndexError { shard_index } => {
                 Error::InvalidShardIndex(shard_index)
             }
+            ShardLayoutError::NoParentError { shard_id } => Error::NoParentShardId(shard_id),
         }
     }
 }
21 changes: 0 additions & 21 deletions chain/chain/src/chain.rs
@@ -798,27 +798,6 @@ impl Chain {
             me,
             &prev_hash,
         )?;
-        let prev_block = self.get_block(&prev_hash)?;
-
-        if prev_block.chunks().len() != epoch_first_block.chunks().len()
-            && !shards_to_state_sync.is_empty()
-        {
-            // Currently, the state sync algorithm assumes that the number of chunks do not change
-            // between the epoch being synced to and the last epoch.
-            // For example, if shard layout changes at the beginning of epoch T, validators
-            // will not be able to sync states at epoch T for epoch T+1
-            // Fortunately, since all validators track all shards for now, this error will not be
-            // triggered in live yet
-            // Instead of propagating the error, we simply log the error here because the error
-            // do not affect processing blocks for this epoch. However, when the next epoch comes,
-            // the validator will not have the states ready so it will halt.
-            error!(
-                "Cannot download states for epoch {:?} because sharding just changed. I'm {:?}",
-                epoch_first_block.header().epoch_id(),
-                me
-            );
-            debug_assert!(false);
-        }
         if shards_to_state_sync.is_empty() {
             Ok(None)
         } else {
7 changes: 4 additions & 3 deletions core/primitives/src/shard_layout.rs
@@ -327,6 +327,7 @@ impl ShardLayoutV2 {
 pub enum ShardLayoutError {
     InvalidShardIdError { shard_id: ShardId },
     InvalidShardIndexError { shard_index: ShardIndex },
+    NoParentError { shard_id: ShardId },
 }
 
 impl fmt::Display for ShardLayoutError {
@@ -624,15 +625,15 @@ impl ShardLayout {
             return Err(ShardLayoutError::InvalidShardIdError { shard_id });
         }
         let parent_shard_id = match self {
-            Self::V0(_) => panic!("shard layout has no parent shard"),
+            Self::V0(_) => return Err(ShardLayoutError::NoParentError { shard_id }),
             Self::V1(v1) => match &v1.to_parent_shard_map {
                 // we can safely unwrap here because the construction of to_parent_shard_map guarantees
                 // that every shard has a parent shard
                 Some(to_parent_shard_map) => {
                     let shard_index = self.get_shard_index(shard_id)?;
                     *to_parent_shard_map.get(shard_index).unwrap()
                 }
-                None => panic!("shard_layout has no parent shard"),
+                None => return Err(ShardLayoutError::NoParentError { shard_id }),
             },
             Self::V2(v2) => match &v2.shards_parent_map {
                 Some(to_parent_shard_map) => {
@@ -641,7 +642,7 @@
                         .ok_or(ShardLayoutError::InvalidShardIdError { shard_id })?;
                     *parent_shard_id
                 }
-                None => panic!("shard_layout has no parent shard"),
+                None => return Err(ShardLayoutError::NoParentError { shard_id }),
             },
         };
         Ok(parent_shard_id)
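
For illustration only: with this change a caller gets a `Result` it can match on instead of hitting a panic. The sketch below uses simplified stand-in types rather than the real `ShardLayout`.

```rust
type ShardId = u64;

// Simplified stand-ins for the shard_layout.rs types; illustrative only.
#[derive(Debug)]
enum ShardLayoutError {
    InvalidShardIdError { shard_id: ShardId },
    NoParentError { shard_id: ShardId },
}

// A layout without a parent map (e.g. V0) has no parent shards, so asking
// for one is now an error instead of a panic.
fn get_parent_shard_id(
    parent_map: Option<&[ShardId]>,
    shard_id: ShardId,
) -> Result<ShardId, ShardLayoutError> {
    let map = parent_map.ok_or(ShardLayoutError::NoParentError { shard_id })?;
    map.get(shard_id as usize)
        .copied()
        .ok_or(ShardLayoutError::InvalidShardIdError { shard_id })
}

fn main() {
    // The caller can now turn this into chain's Error::NoParentShardId and
    // keep going instead of crashing.
    match get_parent_shard_id(None, 3) {
        Ok(parent) => println!("parent shard: {parent}"),
        Err(ShardLayoutError::NoParentError { shard_id }) => {
            println!("shard {shard_id} has no parent shard")
        }
        Err(err) => println!("unexpected: {err:?}"),
    }
}
```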