
Increase fixed delay with small randomness in failover delay logic to avoid same election #1669

Open · wants to merge 1 commit into base: unstable from fix-flaky-failover2
Conversation

@sungming2 (Contributor) commented Feb 5, 2025:

Issue #1640

Problem

We introduced the primary failover rank to delay elections when multiple primaries fail simultaneously, preventing elections in the same epoch and failover timeouts caused by insufficient votes. However, under certain timing conditions, this delay can still result in multiple nodes initiating elections in the same epoch.

e.g., two replicas started elections for the same epoch at nearly the same time, split the votes (2 and 3 of a needed 4), hit the failover timeout, and then retried:

654043:S 30 Jan 2025 16:19:37.076 * Starting a failover election for epoch 9, node config epoch is 7
654043:S 30 Jan 2025 16:19:39.110 * Currently unable to failover: Waiting for votes, but majority still not reached.
654043:S 30 Jan 2025 16:19:39.110 * Needed quorum: 4. Number of votes received so far: 2
...
654043:S 30 Jan 2025 16:19:48.034 * Currently unable to failover: Failover attempt expired.
654043:S 30 Jan 2025 16:19:48.034 * Needed quorum: 4. Number of votes received so far: 2
...
654043:S 30 Jan 2025 16:19:57.876 * Starting a failover election for epoch 10, node config epoch is 7
654043:S 30 Jan 2025 16:19:58.440 * Failover election won: I'm the new primary.
654043:S 30 Jan 2025 16:19:58.440 * configEpoch set to 10 after successful failover

654224:S 30 Jan 2025 16:19:37.074 * Starting a failover election for epoch 9, node config epoch is 4
654224:S 30 Jan 2025 16:19:38.095 * Needed quorum: 4. Number of votes received so far: 3
654224:S 30 Jan 2025 16:19:39.104 * Currently unable to failover: Waiting for votes, but majority still not reached.
...
654224:S 30 Jan 2025 16:19:47.096 * Currently unable to failover: Failover attempt expired.
654224:S 30 Jan 2025 16:19:47.096 * Needed quorum: 4. Number of votes received so far: 3
...
654224:S 30 Jan 2025 16:19:58.472 * Starting a failover election for epoch 11, node config epoch is 4
654224:S 30 Jan 2025 16:19:58.849 * Failover election won: I'm the new primary.
654224:S 30 Jan 2025 16:19:58.849 * configEpoch set to 11 after successful failover

Solution

  1. In some cases, a replica may not receive all FAIL propagation messages before the election starts, potentially resulting in an incorrect failover rank or defaulting to rank 0.

  2. To prevent simultaneous elections, we need a sufficiently long delay before starting the election. I observed that replicas can begin elections for the same epoch when their failover handling starts within about 500ms of each other, and the random component of the delay should be smaller than the fixed component to avoid the same-epoch election issue in the worst case (see the sketch after this list).

    • Worst case with 500 + random() % 500:
      R1 (starts failover handling at 10:00:00.000): 500ms + (499 % 500) = 999ms delay → election starts at 10:00:00.999
      R2 (starts failover handling at 10:00:00.500): 500ms + (0 % 500) = 500ms delay → election starts at 10:00:01.000
      The two elections begin only 1ms apart, i.e., effectively simultaneously.
  3. Once the failover rank is upgraded, it should not be downgraded again, because a downgrade will likely trigger another same-epoch election.
    e.g., [diagram from the PR omitted]
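
For context, a minimal sketch of the two base-delay schemes discussed above, as C helpers; the names are illustrative and this is not the PR's actual diff (in the real code, a rank-based delay is added on top of this base delay):

    #include <stdint.h>
    #include <stdlib.h>

    typedef int64_t mstime_t;

    /* Old scheme: the jitter span (0..499ms) is as large as the fixed
     * delay, so two replicas whose failover handling starts up to 500ms
     * apart can still begin their elections at effectively the same time. */
    static mstime_t old_base_delay(void) {
        return 500 + random() % 500;
    }

    /* Proposed scheme: a larger fixed delay gives FAIL messages more time
     * to propagate, while the small jitter span (0..99ms) stays well below
     * the observed ~500ms start skew between replicas. */
    static mstime_t new_base_delay(void) {
        return 1000 + random() % 100;
    }

The design point is that the jitter span must be much smaller than the fixed delay, so jitter alone can no longer erase the earlier replica's head start.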

Test

Ran the test over a thousand times.

@sungming2 marked this pull request as draft on February 5, 2025 13:16

codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 71.00%. Comparing base (fc55142) to head (0f5f0be).

Files with missing lines    Patch %    Lines
src/cluster_legacy.c        85.71%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1669      +/-   ##
============================================
- Coverage     71.02%   71.00%   -0.02%     
============================================
  Files           121      121              
  Lines         65254    65255       +1     
============================================
- Hits          46344    46334      -10     
- Misses        18910    18921      +11     
Files with missing lines    Coverage Δ
src/cluster_legacy.c        85.93% <85.71%> (-0.15%) ⬇️

... and 14 files with indirect coverage changes

@sungming2 force-pushed the fix-flaky-failover2 branch from a98062a to c7a032a on February 5, 2025 23:07
@sungming2 (Contributor, Author) commented:

@hpatro @enjoy-binbin Could you take a look at this PR when you have time?

@sungming2 marked this pull request as ready for review on February 5, 2025 23:17
@enjoy-binbin (Member) commented:

> Once the failover rank is upgraded, it should not be downgraded again because it will likely trigger the same epoch election.

Can you try keeping only this change and test it again? I had some doubts when I added this line.

@sungming2 force-pushed the fix-flaky-failover2 branch 2 times, most recently from e669dbf to 6357c9b on February 6, 2025 05:40
@hpatro (Collaborator) left a comment:

Could you update the high-level comment around the behavior of this PR?

@sungming2 (Contributor, Author) commented Feb 6, 2025:

> Can you try keeping only this change and test it again? I had some doubts when I added this line.

I tried it hundreds of times with only this change. I don't see any specific error due to this change, but I do see some same-epoch elections because the delay is insufficient:

R1

162615:S 06 Feb 2025 11:41:23.803 * Start of election delayed for 672 milliseconds (rank #0, primary rank #0, offset 14).
162615:S 06 Feb 2025 11:41:24.210 * Failed primary rank updated to #1, added 500 milliseconds of delay.
162615:S 06 Feb 2025 11:41:25.014 * Starting a failover election for epoch 8, node config epoch is 0, rank is 1

R2

162730:S 06 Feb 2025 11:41:24.209 * Start of election delayed for 735 milliseconds (rank #0, primary rank #0, offset 14).
162730:S 06 Feb 2025 11:41:25.012 * Starting a failover election for epoch 8, node config epoch is 3, rank is 0

@enjoy-binbin (Member) commented:

I want to know whether the main reason here is the delay or the fact that the pfail node is not included in the rank. From the logs I am guessing the main reason is the delay; can this be fixed if we only improve the delay logic?

From the log, we can see that it is because of the added 500 milliseconds of delay that they initiate the election at the same time. This is a familiar shortcoming and widely accepted. The reason, I think, is that the tcl pause_process has some delay, so these primaries don't go down at the same time but within a 500ms range. Maybe it is an acceptable shortcoming, or we can try to improve the TCL code? Or we can improve the delay logic; I haven't figured this out yet (it does happen: no matter how you adjust it, the delay will always randomly conflict).

> Once the failover rank is upgraded, it should not be downgraded again because it will likely trigger the same epoch election.

Is point 3 also caused by the delay design? My original idea was that when a primary comes back up, the replicas that were delayed by rank can initiate elections as soon as possible.

@sungming2 (Contributor, Author) commented Feb 7, 2025:

I agree with you, but what I am really expecting from this change is that it will reduce the chance of elections starting at the same time. (There are cases where the pfail node is not included in the rank; that is why I added this change.)

And yes, we can refine this delay logic or implement an entirely new approach later, such as adding a flag to indicate whether a node is in the election process and, by comparing the sender's current epoch with its auth epoch, detecting (or assuming) an election conflict and immediately resetting it if necessary (could be wrong, just my instant thought; a hypothetical sketch follows).
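
A purely hypothetical sketch of that conflict-detection idea; every name below is invented for illustration and nothing here is in this PR or in Valkey's current code:

    #include <stdbool.h>
    #include <stdint.h>

    /* Invented state for illustration only. */
    typedef struct {
        bool in_election;             /* are we currently requesting votes? */
        uint64_t failover_auth_epoch; /* the epoch we requested votes for   */
    } election_state;

    /* On receiving a cluster-bus message from another replica: if the
     * sender advertises the same election epoch we are voting in, assume
     * a split vote is likely and tell the caller to reset the attempt so
     * the retry re-randomizes its delay. */
    static bool election_conflicts(const election_state *st,
                                   uint64_t sender_current_epoch) {
        return st->in_election &&
               sender_current_epoch == st->failover_auth_epoch;
    }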

> My original idea was that when a primary comes back up, the replicas that were delayed by rank can initiate elections as soon as possible.

You are right, but I thought it would be better to add a slight delay rather than risk an election conflict, since both replicas must wait until the auth timeout (node timeout * 4) in the worst case.

For this PR, I can just apply an increased fixed delay with a small random component (1000 + random() % 100), if you are OK with it, as this is a clear improvement we can make right away. What do you think? A worked bound is below.
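
A quick worked bound on that proposal, assuming the two replicas begin failover handling 500ms apart as in the worst example above:

    old, 500 + random() % 500:   R1 elects by 0 + 999ms at the latest; R2 elects at 500 + 500ms at the earliest → gap can shrink to 1ms
    new, 1000 + random() % 100:  R1 elects by 0 + 1099ms at the latest; R2 elects at 500 + 1000ms at the earliest → gap is at least 401ms

The 401ms worst-case separation exceeds the entire 100ms jitter span, so under that skew two elections can no longer coincide.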

@enjoy-binbin (Member) commented:

> For this PR, I can just apply an increased fixed delay with a small random component (1000 + random() % 100), if you are OK with it, as this is a clear improvement we can make right away. What do you think?

Yes, please do it. We also need other eyes to take a look at this delay logic after the changes.

@sungming2 force-pushed the fix-flaky-failover2 branch from 431bff3 to 0f5f0be on February 7, 2025 20:03
@hpatro requested a review from PingXie on February 7, 2025 22:05
@sungming2 changed the title from "Fix flaky test in failover2" to "Increase fixed delay with small randomness in failover delay logic to avoid same election" on Feb 18, 2025