Sync error handling #119

Open
dknopik opened this issue Jan 31, 2025 · 2 comments

@dknopik
Member

dknopik commented Jan 31, 2025

I noticed during my local tests that

  • sync sometimes crashes instead of retrying
  • when sync crashes, the rest of the process lives on

IMO:

  • Sync should never crash, but retry indefinitely
  • If it crashes anyway due to a bug, the rest of the process should quit as well (so that an external restart can be triggered)
  • Maybe we need some flag that is set while sync is stalled for some reason, so that we stop performing duties during that time and avoid signing for validators that have been removed from our clusters (rough sketch below).
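To illustrate what I mean by the flag, a rough sketch. All names here (SyncStatus, should_perform_duty) are made up for illustration and not existing Anchor code; this is just to show the shape of the check:

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;

    /// Hypothetical shared flag: set by the sync task when it detects that it
    /// is stalled or far behind, cleared once it has caught up again.
    #[derive(Clone, Default)]
    pub struct SyncStatus(Arc<AtomicBool>);

    impl SyncStatus {
        pub fn set_stalled(&self, stalled: bool) {
            self.0.store(stalled, Ordering::Relaxed);
        }

        pub fn is_stalled(&self) -> bool {
            self.0.load(Ordering::Relaxed)
        }
    }

    /// The duty flow would check this before signing anything, so we do not
    /// sign for validators that may have been removed while we were behind.
    pub fn should_perform_duty(status: &SyncStatus) -> bool {
        !status.is_stalled()
    }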
@Zacholme7
Member

A few follow-ups.

  1. What do you mean here by "crash"? What specific error is it encountering, or are there warn logs emitted? Is there a specific block range it gets hung up on?
  2. It could retry indefinitely, but I think there are also arguments that it should not. The chain data does not change, so pretty much the only issue would be the RPC being inaccessible or going down. Do we want to keep hitting an RPC that we know does not work, or just give it a set number of retries and exit, since we can't make progress anyway?
  3. I think a flag could be a good addition. I don't think it is functionally necessary, but it is good to have. Say we have 4 operators and 1 has a bad RPC and does not see the validator removed event. The other 3 will stop the duties, and the remaining one will not be able to reach consensus anyway. So there is really no harm, but we are wasting resources there.

@dknopik
Member Author

dknopik commented Feb 3, 2025

  1. Here, for example, we never retry.

    let current_block = self.rpc_client.get_block_number().await.map_err(|e| {
        error!(?e, "Failed to fetch block number");
        ExecutionError::RpcError(format!("Unable to fetch block number {}", e))
    })?;

    This is the very first EL call we make, and therefore very prone to failing, because the anchor client might be started simultaneously with the EL and CL, and the EL might not have its endpoint up yet.

  2. Do we want to keep hitting an RPC that we know does not work, or just give it a set number of retries and exit, since we can't make progress anyway?

    IMO, we should keep trying, ideally with capped, randomized exponential backoff so we do not DoS the RPC (rough sketch after this list).

    If we stay running, we at least have a chance of recovering. So in my opinion, we should stay running and keep trying to connect. In most cases, users have some kind of automatic restart, but we should not rely on that and should always try to resume - after all, we want to perform duties.

  3. Your example is correct, but I am still worried about cases where multiple operators suffer similar problems - just continuing would then cause issues. So strictly speaking, we should not perform duties unless we are synced, or at least synced to within a certain margin. Yes, this is pedantic - but when we share responsibility for up to 500 * 32 ETH, safety is rather important.
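For the backoff mentioned in 2., roughly this is what I have in mind. This is only a sketch: the function name, the bounds, and the use of tokio, tracing and rand here are my assumptions, not existing Anchor code.

    use std::time::Duration;
    use rand::Rng;

    /// Retry an async operation indefinitely with capped, randomized
    /// exponential backoff. Sketch only; names and bounds are made up.
    async fn retry_with_backoff<T, E, F, Fut>(mut op: F) -> T
    where
        F: FnMut() -> Fut,
        Fut: std::future::Future<Output = Result<T, E>>,
        E: std::fmt::Debug,
    {
        let mut delay = Duration::from_millis(500);
        let max_delay = Duration::from_secs(60);
        loop {
            match op().await {
                Ok(value) => return value,
                Err(e) => {
                    // Warn and wait; jitter avoids all operators hammering the RPC in lockstep.
                    tracing::warn!(?e, ?delay, "RPC call failed, retrying");
                    let jitter = rand::thread_rng().gen_range(0..=delay.as_millis() as u64 / 2);
                    tokio::time::sleep(delay + Duration::from_millis(jitter)).await;
                    delay = (delay * 2).min(max_delay);
                }
            }
        }
    }

Calls like the get_block_number one in 1. could then be wrapped, e.g. retry_with_backoff(|| self.rpc_client.get_block_number()).await, instead of bubbling the error up with ?.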
