Stricter "failed" node definition for dBFT #2057

Open
roman-khimov opened this issue Nov 11, 2020 · 4 comments
Labels
Discussion Initial issue state - proposed but not yet accepted

Comments

@roman-khimov
Contributor

Summary or problem description
We're using the notion of "failed" nodes for some dBFT logic (even though technically it'd be better to call them "unknown", since we can't really judge whether a node has failed or not in a BFT algorithm), and we count them this way:

Validators.Count(p => !LastSeenMessage.TryGetValue(p, out var value) || value < (Block.Index - 1))

This means that any node we haven't received any message from during the current or previous block is considered failed. The resulting count is then used to decide whether to change view and whether to accept certain messages.
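For reference, here is the same rule written out as a small Go helper; the names below are made up for the example and don't match the actual C# or neo-go identifiers:

    // countFailedCurrentRule mirrors the C# predicate above: a validator is
    // counted as failed if we have never seen a message from it, or if its
    // last message is older than the previous block.
    func countFailedCurrentRule(validators []string, lastSeenHeight map[string]uint32, blockIndex uint32) int {
        count := 0
        for _, v := range validators {
            if h, ok := lastSeenHeight[v]; !ok || h < blockIndex-1 {
                count++
            }
        }
        return count
    }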

I think this is problematic: availability of a node one block ago says nothing about its state during the current block. Moreover, even within the current block, presence at view N doesn't mean the node is fine at view N+1. This is especially relevant with respect to the liveness lock described in the paper from #2029 (and observed many times both in tests and in the wild): a node that initially receives no packets will try to change view, and if some node is locked in the Commit state, this helps the others gather the required number of CVs and jump to the next view.

Do you have any solution you want to propose?
I'd like to share, and propose to adopt, the modification we're using in neo-go to count "failed" nodes (initially added in the 0.72.0 release in February this year, with a further modification in 0.76.0 in June this year). We do it this way:

        for _, hv := range c.LastSeenMessage {
                if hv == nil || hv.Height < c.BlockIndex || hv.View < c.ViewNumber {
                        count++
                }
        }

Any node we haven't received anything from in the current view of the current height is considered to be in an unknown state and added to this counter. This makes the liveness lock less likely to happen (making consensus more robust), but we're not sure it solves the problem completely (most likely it doesn't).
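For illustration, a self-contained version of this check might look like the following; the struct and function names are made up for this sketch and don't match neo-go's internals exactly:

    // heightView records the height and view of the last message seen from a
    // validator; a nil entry means nothing was ever received from it.
    type heightView struct {
        Height uint32
        View   byte
    }

    // countUnknown counts validators from which nothing has been received in
    // the current view of the current height.
    func countUnknown(lastSeen []*heightView, blockIndex uint32, viewNumber byte) int {
        count := 0
        for _, hv := range lastSeen {
            if hv == nil || hv.Height < blockIndex || hv.View < viewNumber {
                count++
            }
        }
        return count
    }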

Neo Version

  • Neo 2
  • Neo 3

Where in the software does this update apply?

  • Consensus
roman-khimov added the Discussion (Initial issue state - proposed but not yet accepted) label on Nov 11, 2020
@vncoelho
Member

vncoelho commented Nov 11, 2020

We tried this configuration before, @roman-khimov.

I do not remember exactly, but by also considering messages in the current view we would skip the change view as soon as messages arrive from at least 2f + 1 nodes:

if ((context.CountCommitted + context.CountFailed) > context.F)

I believe that this modification can indeed be implemented.

As we are discussing in your other recently opened issue #2058, I remember from our past experiments and trials that CountFailed is not really needed if we keep consistency across views. That is, by implementing an improvement such as dBFT 2.0+ we will probably be able to remove this synchrony condition.

@vncoelho
Member

@roman-khimov, it is more conservative to keep it the way it is now, considering both the last view and the current view.

The LastSeenMessage is updated every time OnConsensusPayload is received.

I suggest that you update neo-go to be more conservative and consider inactivity from the previous round, in order to avoid premature change views in edge cases.

@roman-khimov
Contributor Author

> premature change view in edge cases.

We have exactly that without this change. Nodes that don't receive some messages in this round for any reason start a CV; with this change they instead try a recovery first, and only if/when they get messages from other nodes do they start a CV. So I think we get more conservative behavior with this change, and it's also well-tested in production by now.
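To make that behavior concrete, here is a simplified sketch of the view-timer expiry path, assuming the dBFT 2.0 guard quoted above; the names here are hypothetical and the real implementation has more conditions:

    // consensusContext is a hypothetical minimal interface for this sketch.
    type consensusContext interface {
        CountCommitted() int
        CountFailed() int
        F() int
        SendRecoveryRequest()
        SendChangeView()
    }

    // onTimeout sketches the timer-expiry path: with the stricter "failed"
    // definition an isolated node quickly sees committed+failed > f, so it
    // requests recovery (and waits for responses) instead of immediately
    // voting to change the view.
    func onTimeout(c consensusContext) {
        if c.CountCommitted()+c.CountFailed() > c.F() {
            c.SendRecoveryRequest()
            return
        }
        c.SendChangeView()
    }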

@vncoelho
Member

I see, @roman-khimov.
I had confused my logic; you are right.
In fact, your implementation is stricter in terms of avoiding CVs, because it does not consider previous messages. I believe the first version of this mechanism in dBFT 2.0 was like that.
