Stricter "failed" node definition for dBFT #2057

Open
roman-khimov opened this issue Nov 11, 2020 · 4 comments
Labels
Discussion Initial issue state - proposed but not yet accepted

Comments

@roman-khimov
Contributor

Summary or problem description
We're using the notion of "failed" nodes for some dBFT logic (even though technically it'd be better to call them "unknown", since we can't really judge whether a node has failed or not in a BFT algorithm), and we count them this way:

Validators.Count(p => !LastSeenMessage.TryGetValue(p, out var value) || value < (Block.Index - 1))

This means that any node we haven't received any message from during the current or previous block is considered failed. The resulting count is then used to decide whether to change view and whether to accept certain messages.
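For reference, here is the same rule written out as a small Go helper; the names below are made up for the example and don't match the actual C# or neo-go identifiers:

    // countFailedCurrentRule mirrors the C# predicate above: a validator is
    // counted as failed if we have never seen a message from it, or if its
    // last message is older than the previous block.
    func countFailedCurrentRule(validators []string, lastSeenHeight map[string]uint32, blockIndex uint32) int {
        count := 0
        for _, v := range validators {
            if h, ok := lastSeenHeight[v]; !ok || h < blockIndex-1 {
                count++
            }
        }
        return count
    }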

I think this is problematic: availability of a node one block ago says nothing about its state during the current block. Moreover, even within the current block, presence at view N doesn't mean the node is fine at view N+1. This is especially relevant with respect to the liveness lock described in the paper from #2029 (and observed many times both in tests and in the wild): a node that initially receives no packets will try to change view, and if some node is locked in the Commit state, this helps the others gather the required number of CVs and jump to the next view.

Do you have any solution you want to propose?
I'd like to share, and propose to adopt, the modification we're using in neo-go to count "failed" nodes (initially added in the 0.72.0 release in February this year, with a further modification in 0.76.0 in June this year). We do it this way:

        for _, hv := range c.LastSeenMessage {
                if hv == nil || hv.Height < c.BlockIndex || hv.View < c.ViewNumber {
                        count++
                }
        }

Any node we haven't received anything from in the current view of the current height is considered to be in an unknown state and added to this counter. This makes the liveness lock less likely to happen (making consensus more robust), but we're not sure it solves the problem completely (most likely it doesn't).
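For illustration, a self-contained version of this check might look like the following; the struct and function names are made up for this sketch and don't match neo-go's internals exactly:

    // heightView records the height and view of the last message seen from a
    // validator; a nil entry means nothing was ever received from it.
    type heightView struct {
        Height uint32
        View   byte
    }

    // countUnknown counts validators from which nothing has been received in
    // the current view of the current height.
    func countUnknown(lastSeen []*heightView, blockIndex uint32, viewNumber byte) int {
        count := 0
        for _, hv := range lastSeen {
            if hv == nil || hv.Height < blockIndex || hv.View < viewNumber {
                count++
            }
        }
        return count
    }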

Neo Version

  • Neo 2
  • Neo 3

Where in the software does this update apply?

  • Consensus
roman-khimov added the Discussion (Initial issue state - proposed but not yet accepted) label on Nov 11, 2020
@vncoelho
Member

vncoelho commented Nov 11, 2020

We tried this configuration before, @roman-khimov.

I do not remember exactly, but by also considering messages in the current view we would skip the change view as soon as messages arrive from at least 2f + 1 nodes:

if ((context.CountCommitted + context.CountFailed) > context.F)

I believe that this modification can indeed be implemented.

As we are discussing in your other recently opened issue #2058, I remember from our past experiments and trials that CountFailed is not really needed if we keep consistency across views. That is, by implementing an improvement such as dBFT 2.0+ we will probably be able to remove this synchrony condition.

@vncoelho
Member

@roman-khimov, it is more conservative to keep it the way it is now, considering both the last view and the current view.

The LastSeenMessage is updated every time OnConsensusPayload is received.

I suggest that you update neo-go to be more conservative and consider inactivity from the previous round, in order to avoid premature change views in edge cases.

@roman-khimov
Contributor Author

> premature change view in edge cases.

We have exactly that without this change. Nodes that don't receive some messages in this round for any reason start a CV; with this change they instead try a recovery first, and only if/when they get messages from other nodes do they start a CV. So I think we get more conservative behavior with this change, and it's also well-tested in production by now.
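To make that behavior concrete, here is a simplified sketch of the view-timer expiry path, assuming the dBFT 2.0 guard quoted above; the names here are hypothetical and the real implementation has more conditions:

    // consensusContext is a hypothetical minimal interface for this sketch.
    type consensusContext interface {
        CountCommitted() int
        CountFailed() int
        F() int
        SendRecoveryRequest()
        SendChangeView()
    }

    // onTimeout sketches the timer-expiry path: with the stricter "failed"
    // definition an isolated node quickly sees committed+failed > f, so it
    // requests recovery (and waits for responses) instead of immediately
    // voting to change the view.
    func onTimeout(c consensusContext) {
        if c.CountCommitted()+c.CountFailed() > c.F() {
            c.SendRecoveryRequest()
            return
        }
        c.SendChangeView()
    }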

@vncoelho
Member

I see, @roman-khimov.
I had confused my logic; you are right.
In fact, your implementation is stricter in terms of avoiding CVs, because it does not consider previous messages. I believe the first version of this mechanism in dBFT 2.0 was like that.
