add instructions on a state snapshot recovery #92

Open
wants to merge 1 commit into
base: main
50 changes: 50 additions & 0 deletions docs/troubleshooting/attach_state_snapshot.md
@@ -0,0 +1,50 @@
---
id: attach-state-snapshot
title: Attach State Snapshot
sidebar_label: Attach State Snapshot
description: Instructions on attaching a supporting state snapshot.
---

## Terminology {#terminology}
A state snapshot is different from a DB snapshot.
A state snapshot is a checkpoint of some columns of the full DB taken at the epoch boundary.

Review comment (Contributor):

Technically it's a checkpoint of the whole DB with some unneeded columns deleted. At the end of the day we have hardlinks to all of the SST files (at least without compaction). I would keep it simple and just say it's a checkpoint of the full DB, and not mention being selective about some columns. This is in line with the expected size of the snapshot too.

It is used in state sync and resharding.

State snapshots are identified by the last block hash of the epoch.
We save the state snapshot in `$NEAR_HOME_DATA/state_snapshot/$BLOCK_HASH`.
We also save `$BLOCK_HASH` in the DB so that we know which path to open when we need to use the snapshot.
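
As a minimal sketch of the layout described above, the snippet below builds the snapshot path from the data directory and the block hash, and checks whether that directory exists. The function names `snapshot_dir` and `snapshot_exists` are hypothetical helpers for illustration, not part of neard.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the state snapshot path from the data directory
# and the last block hash of the epoch, following the layout described above.
snapshot_dir() {
    local data_dir="$1" block_hash="$2"
    printf '%s/state_snapshot/%s\n' "$data_dir" "$block_hash"
}

# Hypothetical helper: succeed only if the snapshot directory is present.
snapshot_exists() {
    [ -d "$(snapshot_dir "$1" "$2")" ]
}
```

For example, `snapshot_exists "$NEAR_HOME_DATA" "$BLOCK_HASH" || echo "no snapshot on disk"` gives a quick check before attempting the steps below.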

## How to attach a state snapshot to an existing node {#how-to-attach}
1. Download the state snapshot to your machine.
You can download it to any directory, but `$NEAR_HOME_DATA/state_snapshot/$BLOCK_HASH` has to point to your new state snapshot.
2. Create a support directory anywhere on the node. We will refer to it as `$OTHER_HOME`.
3. Copy the config to the new directory:
```bash
cp $NEAR_HOME/config.json $OTHER_HOME/config.json
```

Review comment (Contributor):

I also copied genesis and node key. I'm not sure if both were required but at least one of them was.

4. Point the data directory of `$OTHER_HOME` to the state snapshot:
```bash
ln -s <state snapshot path> $OTHER_HOME/test-data
```
5. Change the `$OTHER_HOME` config to work with the state snapshot:
```bash
cat <<< $(jq '.archive = false | .cold_store = null | .store.path = "test-data"' $OTHER_HOME/config.json) > $OTHER_HOME/config.json
```

Review comment (Contributor):

I'll trust your bash scripting :)

6. Change the state snapshot `DBKind` to suit your node.
If you are running a split-storage archival node, run:
```bash
$NEARD --unsafe-fast-startup --home $OTHER_HOME database change-db-kind --new-kind Hot change-hot
```
If you are running an RPC node, run:
```bash
$NEARD --unsafe-fast-startup --home $OTHER_HOME database change-db-kind --new-kind RPC change-hot
```
7. You can delete `$OTHER_HOME` now.
8. If you are fixing a problem for the 1.37 or 1.38 release, you need to build a binary from [this tool branch](https://github.com/near/nearcore/tree/1.37.0-fix).
Changes from this branch will be included in the 1.39 release by default.
9. Stop your node.
10. Run a binary with the [tool branch](https://github.com/near/nearcore/tree/1.37.0-fix) changes to save `$BLOCK_HASH` in RocksDB:
```bash
$NEARD_TOOL --unsafe-fast-startup --home $NEAR_HOME database write-crypto-hash --hash $BLOCK_HASH
```
11. Restart your node
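
Steps 2 to 5 above can be sketched as a single shell function. This is an illustration under stated assumptions, not a tested recovery tool: `prepare_other_home` is a hypothetical name, it requires `jq`, and it writes the rewritten config through a temporary file rather than reading and overwriting the same file in one command. It copies only `config.json`; a reviewer reported also copying the genesis file and node key, which this sketch does not do.

```bash
#!/usr/bin/env bash
# Hypothetical helper covering steps 2-5.
# Arguments: $1 = $NEAR_HOME, $2 = state snapshot path, $3 = $OTHER_HOME
prepare_other_home() {
    local near_home="$1" snapshot_path="$2" other_home="$3"

    # Fail early if an input is missing, before touching the disk.
    [ -f "$near_home/config.json" ] || { echo "missing $near_home/config.json" >&2; return 1; }
    [ -d "$snapshot_path" ] || { echo "missing snapshot dir $snapshot_path" >&2; return 1; }

    # Step 2: create the support directory.
    mkdir -p "$other_home" || return 1

    # Step 4: point the data directory of the support home at the snapshot.
    # -n replaces an existing test-data symlink instead of descending into it.
    ln -sfn "$snapshot_path" "$other_home/test-data" || return 1

    # Steps 3 and 5: copy the config and rewrite it in one pass, using the
    # same jq filter as step 5 and a temporary file for the output.
    local tmp
    tmp="$(mktemp "$other_home/config.json.XXXXXX")" || return 1
    if jq '.archive = false | .cold_store = null | .store.path = "test-data"' \
        "$near_home/config.json" > "$tmp"; then
        mv "$tmp" "$other_home/config.json"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

A call would look like `prepare_other_home "$NEAR_HOME" "$NEAR_HOME_DATA/state_snapshot/$BLOCK_HASH" "$OTHER_HOME"`. The `change-db-kind` and `write-crypto-hash` commands from steps 6 and 10 still have to be run separately.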
10 changes: 10 additions & 0 deletions docs/troubleshooting/resharding.md
@@ -60,6 +60,16 @@ If you observe problems with block production or resharding performance, you can
This does not require a node restart, you can send a signal to the neard process to load the new config.
Read more [on github](https://github.com/near/nearcore/blob/master/docs/architecture/how/resharding.md#monitoring).

### Mitigating state snapshot issue {#state-snapshot}
A node must have a state snapshot for resharding to run.
A state snapshot is a smaller checkpoint of the whole DB taken at the epoch boundary.

Review comment (Contributor):

I would also add something like:
"If the node fails to capture a snapshot at the epoch boundary it will not be able to proceed with resharding. In this case manual recovery will be needed."

If you see any errors around creating or opening the state snapshot, you may download a state snapshot and attach it to your node.
Look for `ERROR state_snapshot` log lines around the epoch switch times.
For 1.37 the epoch switch happened around `2024-03-11 19:28:30`.
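
One way to look for those lines is a plain `grep` over the node's log. The sketch below assumes the log is a plain-text file with one entry per line and that the timestamp appears textually in each line; `find_snapshot_errors` and its arguments are hypothetical, and where your logs actually live depends on how you run neard.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print state snapshot errors from a neard log file.
# Arguments: $1 = log file, $2 = optional fixed string to narrow the match
#            (for example a date prefix, if your log lines contain one).
find_snapshot_errors() {
    local log_file="$1" when="${2:-}"
    if [ -n "$when" ]; then
        grep -F 'ERROR state_snapshot' "$log_file" | grep -F -- "$when"
    else
        grep -F 'ERROR state_snapshot' "$log_file"
    fi
}
```

For example, `find_snapshot_errors /path/to/neard.log 2024-03-11` would narrow the output to the day of the 1.37 epoch switch, provided the log uses that date format.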

Further instructions are on the [Attach State Snapshot](/troubleshooting/attach-state-snapshot) page.
For the 1.37 release resharding, use block hash `EqT4A5h9ayaALpJZNX4SK3dG3HDPWUH9QDuhfCcWSXHi`.

### After resharding {#after 1.37}
If your node failed to reshard or is not able to sync with the network after the protocol upgrade, you will need to download the latest DB snapshot provided by Pagoda from s3
[Node Data Snapshots](/intro/node-data-snapshots).