Stricter "failed" node definition for dBFT #2057
Comments
We tried this configuration before, @roman-khimov. I do not remember exactly, but by also considering messages on the current view we would skip change view as soon as messages arrive from, at least, neo/src/neo/Consensus/ConsensusService.cs Line 639 in 5d5bd02
I believe that this modification can, yes, be implemented. As we are discussing in your other recently opened issue #2058, I remember from our past experiments and trials that |
@roman-khimov, it is more conservative to keep it the way it is now, considering the last view and the current view. Then I suggest that you update |
We have exactly this without this change. Nodes that don't receive some messages in this round for any reason start a CV; with this change they instead try to make a recovery first, and only if/when they get messages from other nodes do they start a CV. So I think we get more conservative behavior with this change, and it's also well-tested by now in production.
I see, @roman-khimov.
Summary or problem description
We're using the notion of "failed" nodes for some dBFT logic (even though technically it'd be better to call them "unknown", since we can't really judge whether a node failed or not in a BFT algorithm) and we count them this way:
Which means that any node that we didn't receive any message from for the current or previous block is considered to be failed. This value is then used to determine whether to change view and whether to accept some messages.
I think it's problematic in that the availability of a node one block ago doesn't say anything about its state for the current block. Moreover, even for the current block, presence at view N doesn't mean that the node is fine at N+1. This is especially relevant wrt the liveness lock described in the paper from #2029 (and observed many times both in tests and in the wild): a node not receiving any packets initially will try to change view, and if some node is locked in the Commit state, it'll help the others gather the required number of CVs and jump to the next view.
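To make the rule described above concrete, here is a minimal Go sketch of height-based counting. It is an illustration only, not the code from either implementation: the function name `countFailedByHeight`, the plain slice of last-seen heights, and the `-1` sentinel for "never heard from" are all assumptions made for this example.

```go
package main

import "fmt"

// countFailedByHeight is a hypothetical helper illustrating the rule above:
// a node counts as "failed" if its last message was seen neither for the
// current block nor for the previous one.
// lastSeenHeight[i] holds the height of the last message seen from node i,
// with -1 meaning that nothing was ever received from it (an assumption
// of this sketch).
func countFailedByHeight(lastSeenHeight []int64, currentHeight int64) int {
	failed := 0
	for _, h := range lastSeenHeight {
		if h < currentHeight-1 {
			failed++
		}
	}
	return failed
}

func main() {
	// Four nodes: seen at the current block, seen one block ago,
	// seen two blocks ago, and never seen.
	lastSeen := []int64{100, 99, 98, -1}
	fmt.Println(countFailedByHeight(lastSeen, 100)) // prints 2
}
```

Note that the node last seen at height 99 is not counted, even though nothing is known about it for block 100.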
Do you have any solution you want to propose?
I'd like to share and propose adopting the modification we're using in neo-go to count "failed" nodes (initially added in the 0.72.0 release in February this year, with a modification applied in 0.76.0 in June this year). We're doing it this way:
Any node that we didn't receive anything from in the current view of the current height is considered to be in an unknown state and is added to this counter. It makes the liveness lock less likely to happen (making consensus more robust), but we're not sure it solves this problem completely (most likely it doesn't).
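As a rough illustration of the stricter rule described above, the following Go sketch counts a node as unknown unless a message from it was seen at the current height and the current view. This is not the neo-go source: the `lastSeen` type, the `countUnknown` function, and the `-1` height sentinel for "never heard from" are hypothetical names and conventions chosen for this example.

```go
package main

import "fmt"

// lastSeen is a hypothetical record of the most recent message received
// from a node: the block height and the view it belonged to.
type lastSeen struct {
	height int64 // -1 means nothing was ever received (assumption of this sketch)
	view   int
}

// countUnknown counts nodes from which nothing was received in the current
// view of the current height.
func countUnknown(nodes []lastSeen, height int64, view int) int {
	unknown := 0
	for _, n := range nodes {
		if n.height < height || (n.height == height && n.view < view) {
			unknown++
		}
	}
	return unknown
}

func main() {
	nodes := []lastSeen{
		{height: 100, view: 1}, // seen in the current view
		{height: 100, view: 0}, // seen at this height, but in an earlier view
		{height: 99, view: 1},  // seen only for the previous block
		{height: -1, view: 0},  // never seen
	}
	fmt.Println(countUnknown(nodes, 100, 1)) // prints 3
}
```

With the same data, a purely height-based rule that tolerates the previous block would count only the never-seen node, so the stricter rule reports more unknown nodes at view 1.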
Neo Version
Where in the software does this update apply to?