Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validators' view numbers go out of sync #2100

Open
DrZoltanFazekas opened this issue Jan 6, 2025 · 6 comments
Open

Validators' view numbers go out of sync #2100

DrZoltanFazekas opened this issue Jan 6, 2025 · 6 comments
Assignees

Comments

@DrZoltanFazekas
Copy link
Contributor

DrZoltanFazekas commented Jan 6, 2025

As observed twice on the proto-testnet and documented on Slack:
https://zilliqa-team.slack.com/archives/C04D147S9QX/p1734803926514389?thread_ts=1734802780.508109&cid=C04D147S9QX
https://zilliqa-team.slack.com/archives/C04D147S9QX/p1734978319797799
without any obvious reasons such as nodes being down, the validators formed two groups that ended up being 3 views apart. Neither of the groups was large enough to reach supermajority so no blocks could be proposed for a couple of hours until the supermajority was in the same view again.

Investigate the GCP logs to figure out what caused the validators' view numbers to diverge and how it can be prevented in the future?

@DrZoltanFazekas DrZoltanFazekas added bug Agate Required for mainnet launch and removed Agate Required for mainnet launch labels Jan 7, 2025
@86667
Copy link
Contributor

86667 commented Jan 8, 2025

The issues seen here are (at least partly) caused by a bug fixed here - #2064.

proto-testnet is currently running v0.5.1 on bootstrap and validator and fe0ca8fc on all other nodes.

The code linked has been running in devnet over the weekend and we have not seen any timeouts which have failed to resolve within seconds.

I suggest we get this code running in proto-testnet. I'll upgrade once ive confirmed with James

@DrZoltanFazekas
Copy link
Contributor Author

We should deploy #2064 asap no doubt. But were the issues reported above caused by NewView messages being ignored? I mean we introduced broadcasting of NewViews as a measure to help the network recover when nodes were down like it is the case during upgrades. But in the above cases all nodes were up and in sync and were supposed to vote on block proposals without the need to broadcast any NewView message.

@86667
Copy link
Contributor

86667 commented Jan 13, 2025

I'm seeing nodes receiving a Proposal but spending time on BlockRequest, NewView and Vote messages before processing the Proposal to the point of reaching a timeout for the round and moving view on, thereby invalidating the Proposal (proposal is outdated). This is because there are too many messages to process before getting to Proposal.

Vote and NewView are relatively fast to execute but BlockRequest takes longer (~0.4ms) and there can be many of them. These take up the node's processing time when the network is messy as proto-testnet and proto-mainnet have been.

We're seeing nodes stuck in a conversation around who has which blocks, sending blocks which they have and requesting others, with Validators sometimes not having chance to process Proposal and thus timing out.

Proposed fixes

  • A node which is syncing should indeed request blocks before processing anything else. It should perhaps throw away other messages until it has good reason to believe it is near head.
  • A Validator node at head should prioritise Proposal messages over BlockRequest

Shawn's block syncing work may help as there will be a reduction in BlockRequest messages.

@shawn-zil
Copy link
Contributor

I don't believe that there is any priority order at the moment. The selects are unbiased.

@86667
Copy link
Contributor

86667 commented Jan 13, 2025

I don't believe that there is any priority order at the moment. The selects are unbiased.

Thanks you're right. Updated comment to reflect that.

@86667
Copy link
Contributor

86667 commented Jan 24, 2025

Status update: I'm holding this ticket open and will re-assess once #2089 is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants