Increase fixed delay with small randomness in failover delay logic to avoid same election #1669
base: unstable
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## unstable #1669 +/- ##
============================================
- Coverage 71.02% 71.00% -0.02%
============================================
Files 121 121
Lines 65254 65255 +1
============================================
- Hits 46344 46334 -10
- Misses 18910 18921 +11
a98062a to c7a032a
@hpatro @enjoy-binbin could you take a look at this PR when you have time, please?
Can you try keeping only this change and testing it again? I did have some doubts when I added this line.
e669dbf to 6357c9b
Could you update the high-level comment describing the behavior of this PR?
I tried it hundreds of times with only this change.
R1
R2
6357c9b to 431bff3
I want to know whether the main cause here is the delay or the fact that the pfail node is not included in the rank. From the logs, I am guessing the main cause is the delay. Can this be fixed if we only improve the delay logic? From the log, we can see that it is because of the
Is this 3 also caused by the delay design? My original idea was that when a primary is up again, the primaries that were delayed by rank can initiate elections as soon as possible.
I agree with you, but what I really expect from this change is that it will reduce the chance of elections starting at the same time. (There are cases where the pfail node is not included in the rank, which is why I added this change.) And yes, we can refine this delay logic or implement an entirely new approach later. For example, we could add a flag indicating whether a node is in the election process and compare the sender's current epoch with its auth epoch to detect (or assume) an election conflict, then immediately reset the election if necessary (could be wrong, just my first thought).
You are right, but I thought it would be better to add a slight delay than to risk an election conflict, since in the worst case both replicas must wait until the auth timeout (node timeout * 4). If you are OK with it, I can limit this PR to increasing the fixed delay with a small random component (1000 + random() % 100), as this is a clear improvement we can make right away. What do you think?
Yes, please do it. We also need other eyes on this delay logic after the changes.
Signed-off-by: Seungmin Lee <[email protected]>
431bff3 to 0f5f0be
Issue #1640
Problem
We introduced the primary failover rank to delay elections when multiple primaries fail simultaneously, preventing elections in the same epoch and failover timeouts caused by insufficient votes. However, under certain timing conditions, this delay can still result in multiple nodes initiating elections in the same epoch.
e.g., two replicas started their elections at the same time, hit the failover timeout, and then restarted the election.
Solution
In some cases, a replica node may not receive all FAILED propagation messages before the election starts, potentially resulting in an incorrect failover rank or a default of rank 0.
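To make the rank-0 fallback concrete, here is a minimal sketch of a rank-based delay. It is not the code from this PR: `ranked_delay_ms` is a hypothetical helper, and the 500 ms step per failed-primary rank is an assumed value chosen only for illustration.

```c
#include <assert.h>

/* Assumed extra delay (ms) per failed-primary rank; illustrative value only,
 * not taken from this PR. */
#define ASSUMED_RANK_STEP_MS 500

/* Hypothetical helper: total election delay (ms) for a replica, made of a
 * fixed part, a random jitter part, and a part proportional to the rank of
 * its failed primary. */
long long ranked_delay_ms(long long fixed_ms, long long jitter_ms, int failed_primary_rank) {
    assert(fixed_ms >= 0 && jitter_ms >= 0 && failed_primary_rank >= 0);
    return fixed_ms + jitter_ms + (long long)failed_primary_rank * ASSUMED_RANK_STEP_MS;
}
```

Under these assumptions, a replica that correctly computes rank 1 waits one extra step compared with a rank-0 replica. If it falls back to rank 0 because it has not yet seen the other failure, both replicas compute the same rank component and only the jitter separates their elections.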
To prevent simultaneous elections, we need to introduce a sufficient delay before starting the election. I observed that replicas can begin elections for the same epoch within a 500ms difference, and the random delay should be less than the fixed delay to avoid the same epoch election issue in the worst case.
With 500 + random() % 500:
R1 (starts failover handling at 10:00:00.000): 500ms + (499 % 500) = 999ms delay → election starts at 10:00:00.999
R2 (starts failover handling at 10:00:00.500): 500ms + (0 % 500) = 500ms delay → election starts at 10:00:01.000
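The worst case above can be checked with a small calculation. The sketch below is illustrative only and is not code from this PR: `election_start_ms` and `worst_case_gap_ms` are hypothetical helpers, and it assumes each replica delays by a fixed part plus a jitter drawn from `[0, jitter_range_ms)`.

```c
#include <assert.h>

/* Hypothetical helper: time (ms) at which a replica starts its election,
 * given when it began failover handling, the fixed delay, and its jitter. */
long long election_start_ms(long long handling_start_ms, long long fixed_ms, long long jitter_ms) {
    return handling_start_ms + fixed_ms + jitter_ms;
}

/* Hypothetical helper: smallest possible gap (ms) between the two elections
 * when R2 begins failover handling start_diff_ms after R1. The worst case is
 * R1 drawing the largest jitter (jitter_range_ms - 1) and R2 drawing 0.
 * A result of zero or less means R2 can start at the same time as R1 or
 * before it. */
long long worst_case_gap_ms(long long start_diff_ms, long long fixed_ms, long long jitter_range_ms) {
    assert(jitter_range_ms > 0);
    long long r1 = election_start_ms(0, fixed_ms, jitter_range_ms - 1);
    long long r2 = election_start_ms(start_diff_ms, fixed_ms, 0);
    return r2 - r1;
}
```

For the example values, `worst_case_gap_ms(500, 500, 500)` is 1 ms. With a narrower jitter such as the `1000 + random() % 100` form proposed in the review thread, `worst_case_gap_ms(500, 1000, 100)` is 401 ms for the same 500 ms start difference. In this simplified model the fixed part cancels out of the gap; it matters only as extra time for failure reports to arrive before the rank is computed.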
Once the failover rank is upgraded, it should not be downgraded again because it will likely trigger the same epoch election.

e.g.,
Test
Ran the test over a thousand times.