
Increase fixed delay with small randomness in failover delay logic to avoid same election #1669

Open · wants to merge 1 commit into base: unstable from fix-flaky-failover2
Conversation

@sungming2 (Contributor) commented Feb 5, 2025:

Issue #1640

Problem

We introduced the primary failover rank to delay elections when multiple primaries fail simultaneously, preventing elections in the same epoch and failover timeouts caused by insufficient votes. However, under certain timing conditions, this delay can still result in multiple nodes initiating elections in the same epoch.

e.g., two replicas started elections for the same epoch at nearly the same time, split the votes (2 and 3 of a needed 4), hit the failover timeout, and then retried:

654043:S 30 Jan 2025 16:19:37.076 * Starting a failover election for epoch 9, node config epoch is 7
654043:S 30 Jan 2025 16:19:39.110 * Currently unable to failover: Waiting for votes, but majority still not reached.
654043:S 30 Jan 2025 16:19:39.110 * Needed quorum: 4. Number of votes received so far: 2
...
654043:S 30 Jan 2025 16:19:48.034 * Currently unable to failover: Failover attempt expired.
654043:S 30 Jan 2025 16:19:48.034 * Needed quorum: 4. Number of votes received so far: 2
...
654043:S 30 Jan 2025 16:19:57.876 * Starting a failover election for epoch 10, node config epoch is 7
654043:S 30 Jan 2025 16:19:58.440 * Failover election won: I'm the new primary.
654043:S 30 Jan 2025 16:19:58.440 * configEpoch set to 10 after successful failover

654224:S 30 Jan 2025 16:19:37.074 * Starting a failover election for epoch 9, node config epoch is 4
654224:S 30 Jan 2025 16:19:38.095 * Needed quorum: 4. Number of votes received so far: 3
654224:S 30 Jan 2025 16:19:39.104 * Currently unable to failover: Waiting for votes, but majority still not reached.
...
654224:S 30 Jan 2025 16:19:47.096 * Currently unable to failover: Failover attempt expired.
654224:S 30 Jan 2025 16:19:47.096 * Needed quorum: 4. Number of votes received so far: 3
...
654224:S 30 Jan 2025 16:19:58.472 * Starting a failover election for epoch 11, node config epoch is 4
654224:S 30 Jan 2025 16:19:58.849 * Failover election won: I'm the new primary.
654224:S 30 Jan 2025 16:19:58.849 * configEpoch set to 11 after successful failover

Solution

  1. In some cases, a replica may not receive all FAIL propagation messages before the election starts, potentially resulting in an incorrect failover rank or defaulting to rank 0.

  2. To prevent simultaneous elections, we need a sufficiently long delay before starting the election. I observed that replicas can begin elections for the same epoch when their failover handling starts within about 500ms of each other, and the random component of the delay should be smaller than the fixed component to avoid the same-epoch election issue in the worst case (see the sketch after this list).

    • Worst case with 500 + random() % 500:
      R1 (starts failover handling at 10:00:00.000): 500ms + (499 % 500) = 999ms delay → election starts at 10:00:00.999
      R2 (starts failover handling at 10:00:00.500): 500ms + (0 % 500) = 500ms delay → election starts at 10:00:01.000
      The two elections begin only 1ms apart, i.e., effectively simultaneously.
  3. Once the failover rank is upgraded, it should not be downgraded again, because a downgrade will likely trigger another same-epoch election.
    e.g., [diagram from the PR omitted]
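
For context, a minimal sketch of the two base-delay schemes discussed above, as C helpers; the names are illustrative and this is not the PR's actual diff (in the real code, a rank-based delay is added on top of this base delay):

    #include <stdint.h>
    #include <stdlib.h>

    typedef int64_t mstime_t;

    /* Old scheme: the jitter span (0..499ms) is as large as the fixed
     * delay, so two replicas whose failover handling starts up to 500ms
     * apart can still begin their elections at effectively the same time. */
    static mstime_t old_base_delay(void) {
        return 500 + random() % 500;
    }

    /* Proposed scheme: a larger fixed delay gives FAIL messages more time
     * to propagate, while the small jitter span (0..99ms) stays well below
     * the observed ~500ms start skew between replicas. */
    static mstime_t new_base_delay(void) {
        return 1000 + random() % 100;
    }

The design point is that the jitter span must be much smaller than the fixed delay, so jitter alone can no longer erase the earlier replica's head start.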

Test

Ran the test over a thousand times.

@sungming2 marked this pull request as draft on February 5, 2025 13:16

codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 71.00%. Comparing base (fc55142) to head (0f5f0be).

Files with missing lines    Patch %    Lines
src/cluster_legacy.c        85.71%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1669      +/-   ##
============================================
- Coverage     71.02%   71.00%   -0.02%     
============================================
  Files           121      121              
  Lines         65254    65255       +1     
============================================
- Hits          46344    46334      -10     
- Misses        18910    18921      +11     
Files with missing lines    Coverage Δ
src/cluster_legacy.c        85.93% <85.71%> (-0.15%) ⬇️

... and 14 files with indirect coverage changes

@sungming2 force-pushed the fix-flaky-failover2 branch from a98062a to c7a032a on February 5, 2025 23:07
@sungming2 (Contributor, Author) commented:

@hpatro @enjoy-binbin Could you take a look at this PR when you have time?

@sungming2 marked this pull request as ready for review on February 5, 2025 23:17
@enjoy-binbin (Member) commented:

> Once the failover rank is upgraded, it should not be downgraded again because it will likely trigger the same epoch election.

Can you try keeping only this change and test it again? I had some doubts when I added this line.

@sungming2 force-pushed the fix-flaky-failover2 branch 2 times, most recently from e669dbf to 6357c9b on February 6, 2025 05:40
@hpatro (Collaborator) left a comment:

Could you update the high-level comment around the behavior of this PR?

@sungming2 (Contributor, Author) commented Feb 6, 2025:

> Can you try keeping only this change and test it again? I had some doubts when I added this line.

I tried it hundreds of times with only this change. I don't see any specific error due to this change, but I do see some same-epoch elections because the delay is insufficient:

R1

162615:S 06 Feb 2025 11:41:23.803 * Start of election delayed for 672 milliseconds (rank #0, primary rank #0, offset 14).
162615:S 06 Feb 2025 11:41:24.210 * Failed primary rank updated to #1, added 500 milliseconds of delay.
162615:S 06 Feb 2025 11:41:25.014 * Starting a failover election for epoch 8, node config epoch is 0, rank is 1

R2

162730:S 06 Feb 2025 11:41:24.209 * Start of election delayed for 735 milliseconds (rank #0, primary rank #0, offset 14).
162730:S 06 Feb 2025 11:41:25.012 * Starting a failover election for epoch 8, node config epoch is 3, rank is 0

@enjoy-binbin (Member) commented:

I want to know whether the main reason here is the delay or the fact that the pfail node is not included in the rank. From the logs I am guessing the main reason is the delay; can this be fixed if we only improve the delay logic?

From the log, we can see that it is because of the added 500 milliseconds of delay that they initiate the election at the same time. This is a familiar shortcoming and widely accepted. The reason, I think, is that the tcl pause_process has some delay, so these primaries don't go down at the same time but within a 500ms range. Maybe it is an acceptable shortcoming, or we can try to improve the TCL code? Or we can improve the delay logic; I haven't figured this out yet (it does happen: no matter how you adjust it, the delay will always randomly conflict).

> Once the failover rank is upgraded, it should not be downgraded again because it will likely trigger the same epoch election.

Is point 3 also caused by the delay design? My original idea was that when a primary comes back up, the replicas that were delayed by rank can initiate elections as soon as possible.

@sungming2 (Contributor, Author) commented Feb 7, 2025:

I agree with you, but what I am really expecting from this change is that it will reduce the chance of elections starting at the same time. (There are cases where the pfail node is not included in the rank; that is why I added this change.)

And yes, we can refine this delay logic or implement an entirely new approach later, such as adding a flag to indicate whether a node is in the election process and, by comparing the sender's current epoch with its auth epoch, detecting (or assuming) an election conflict and immediately resetting it if necessary (could be wrong, just my instant thought; a hypothetical sketch follows).
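
A purely hypothetical sketch of that conflict-detection idea; every name below is invented for illustration and nothing here is in this PR or in Valkey's current code:

    #include <stdbool.h>
    #include <stdint.h>

    /* Invented state for illustration only. */
    typedef struct {
        bool in_election;             /* are we currently requesting votes? */
        uint64_t failover_auth_epoch; /* the epoch we requested votes for   */
    } election_state;

    /* On receiving a cluster-bus message from another replica: if the
     * sender advertises the same election epoch we are voting in, assume
     * a split vote is likely and tell the caller to reset the attempt so
     * the retry re-randomizes its delay. */
    static bool election_conflicts(const election_state *st,
                                   uint64_t sender_current_epoch) {
        return st->in_election &&
               sender_current_epoch == st->failover_auth_epoch;
    }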

> My original idea was that when a primary comes back up, the replicas that were delayed by rank can initiate elections as soon as possible.

You are right, but I thought it would be better to add a slight delay rather than risk an election conflict, since both replicas must wait until the auth timeout (node timeout * 4) in the worst case.

For this PR, I can just apply an increased fixed delay with a small random component (1000 + random() % 100), if you are OK with it, as this is a clear improvement we can make right away. What do you think? A worked bound is below.
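
A quick worked bound on that proposal, assuming the two replicas begin failover handling 500ms apart as in the worst example above:

    old, 500 + random() % 500:   R1 elects by 0 + 999ms at the latest; R2 elects at 500 + 500ms at the earliest → gap can shrink to 1ms
    new, 1000 + random() % 100:  R1 elects by 0 + 1099ms at the latest; R2 elects at 500 + 1000ms at the earliest → gap is at least 401ms

The 401ms worst-case separation exceeds the entire 100ms jitter span, so under that skew two elections can no longer coincide.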

@enjoy-binbin (Member) commented:

> For this PR, I can just apply an increased fixed delay with a small random component (1000 + random() % 100), if you are OK with it, as this is a clear improvement we can make right away. What do you think?

Yes, please do it. We also need other eyes to take a look at this delay logic after the changes.

@sungming2 force-pushed the fix-flaky-failover2 branch from 431bff3 to 0f5f0be on February 7, 2025 20:03
@hpatro requested a review from PingXie on February 7, 2025 22:05
@sungming2 changed the title from "Fix flaky test in failover2" to "Increase fixed delay with small randomness in failover delay logic to avoid same election" on Feb 18, 2025