Sync error handling #119

Open
dknopik opened this issue Jan 31, 2025 · 2 comments

@dknopik
Member

dknopik commented Jan 31, 2025

I noticed during my local tests that

  • sync sometimes crashes instead of retrying
  • when sync crashes, the rest of the process lives on

IMO:

  • Sync should never crash, but retry indefinitely
  • If it crashes anyway due to a bug, the rest of the process should quit as well (so that an external restart can be triggered)
  • Maybe we need some flag that is set while sync is stalled for some reason, so that we stop performing duties during that time and avoid signing for validators that have been removed from our clusters (rough sketch below).
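To illustrate what I mean by the flag, a rough sketch. All names here (SyncStatus, should_perform_duty) are made up for illustration and not existing Anchor code; this is just to show the shape of the check:

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;

    /// Hypothetical shared flag: set by the sync task when it detects that it
    /// is stalled or far behind, cleared once it has caught up again.
    #[derive(Clone, Default)]
    pub struct SyncStatus(Arc<AtomicBool>);

    impl SyncStatus {
        pub fn set_stalled(&self, stalled: bool) {
            self.0.store(stalled, Ordering::Relaxed);
        }

        pub fn is_stalled(&self) -> bool {
            self.0.load(Ordering::Relaxed)
        }
    }

    /// The duty flow would check this before signing anything, so we do not
    /// sign for validators that may have been removed while we were behind.
    pub fn should_perform_duty(status: &SyncStatus) -> bool {
        !status.is_stalled()
    }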
@Zacholme7
Member

A few follow-ups.

  1. What do you mean here by "crash"? What specific error is it encountering, or are there warn logs emitted? Is there a specific block range it gets hung up on?
  2. It could retry indefinitely, but I think there are also arguments that it should not. The chain data does not change, so pretty much the only issue would be the RPC being inaccessible or going down. Do we want to keep hitting an RPC that we know does not work, or just give it a set number of retries and exit, since we can't make progress anyway?
  3. I think a flag could be a good addition. I don't think it is functionally necessary, but it is good to have. Say we have 4 operators and 1 has a bad RPC and does not see the validator removed event. The other 3 will stop the duties, and the remaining one will not be able to reach consensus anyway. So there is really no harm, but we are wasting resources there.

@dknopik
Member Author

dknopik commented Feb 3, 2025

  1. Here, for example, we never retry.

    let current_block = self.rpc_client.get_block_number().await.map_err(|e| {
        error!(?e, "Failed to fetch block number");
        ExecutionError::RpcError(format!("Unable to fetch block number {}", e))
    })?;

    This is the very first EL call we make, and therefore very prone to failing, because the anchor client might be started simultaneously with the EL and CL, and the EL might not have its endpoint up yet.

  2. Do we want to keep hitting an RPC that we know does not work, or just give it a set number of retries and exit, since we can't make progress anyway?

    IMO, we should keep trying, ideally with capped, randomized exponential backoff so we do not DoS the RPC (rough sketch after this list).

    If we stay running, we at least have a chance of recovering. So in my opinion, we should stay running and keep trying to connect. In most cases, users have some kind of automatic restart, but we should not rely on that and should always try to resume - after all, we want to perform duties.

  3. Your example is correct, but I am still worried about cases where multiple operators suffer similar problems - just continuing would then cause issues. So strictly speaking, we should not perform duties unless we are synced, or at least synced to within a certain margin. Yes, this is pedantic - but when we share responsibility for up to 500 * 32 ETH, safety is rather important.
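For the backoff mentioned in 2., roughly this is what I have in mind. This is only a sketch: the function name, the bounds, and the use of tokio, tracing and rand here are my assumptions, not existing Anchor code.

    use std::time::Duration;
    use rand::Rng;

    /// Retry an async operation indefinitely with capped, randomized
    /// exponential backoff. Sketch only; names and bounds are made up.
    async fn retry_with_backoff<T, E, F, Fut>(mut op: F) -> T
    where
        F: FnMut() -> Fut,
        Fut: std::future::Future<Output = Result<T, E>>,
        E: std::fmt::Debug,
    {
        let mut delay = Duration::from_millis(500);
        let max_delay = Duration::from_secs(60);
        loop {
            match op().await {
                Ok(value) => return value,
                Err(e) => {
                    // Warn and wait; jitter avoids all operators hammering the RPC in lockstep.
                    tracing::warn!(?e, ?delay, "RPC call failed, retrying");
                    let jitter = rand::thread_rng().gen_range(0..=delay.as_millis() as u64 / 2);
                    tokio::time::sleep(delay + Duration::from_millis(jitter)).await;
                    delay = (delay * 2).min(max_delay);
                }
            }
        }
    }

Calls like the get_block_number one in 1. could then be wrapped, e.g. retry_with_backoff(|| self.rpc_client.get_block_number()).await, instead of bubbling the error up with ?.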
