Implement: CLUSTER REPLICATE NO ONE #1674

Open

skolosov-snap wants to merge 1 commit into unstable from skolosov/replicate-no-one

Conversation

@skolosov-snap commented Feb 5, 2025

Currently, Valkey doesn't allow detaching a replica from the primary node it is attached to. So if you want to change the cluster topology, the only way to do it is to reset the node (with the CLUSTER RESET command). However, this removes the node from the cluster, which affects clients: all clients will keep sending traffic to this node (and getting inaccurate responses) until they refresh their topology.

This change adds support for a new argument to the CLUSTER REPLICATE command: CLUSTER REPLICATE NO ONE. When this command is called, the node is converted from a replica into an empty primary while still remaining in the cluster. Thus, all client traffic arriving at this node can be redirected to the correct node.
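
For illustration only, here is a minimal sketch of what such a handler could look like inside src/cluster_legacy.c. The helper names (replicationUnsetPrimary(), clusterPromoteSelfToPrimary(), resetManualFailover()) come from the snippets quoted later in this review; the validation, ordering, and reply handling below are assumptions, not the PR's verbatim code.

/* Hypothetical handler for CLUSTER REPLICATE NO ONE: detach from the current
 * primary and become an empty primary while staying in the cluster. */
static void clusterReplicateNoOneCommand(client *c) {
    if (!nodeIsReplica(myself)) {
        addReply(c, shared.ok); /* assumed: nothing to do if we are already a primary */
        return;
    }
    replicationUnsetPrimary();     /* stop replicating from the old primary */
    clusterPromoteSelfToPrimary(); /* mark ourselves as a primary in the cluster state */
    resetManualFailover();         /* abort any in-progress manual failover */
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_SAVE_CONFIG);
    addReply(c, shared.ok);
}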

@skolosov-snap force-pushed the skolosov/replicate-no-one branch from 91589d1 to ff96c0f on February 5, 2025 22:51

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 40.00000% with 15 lines in your changes missing coverage. Please review.

Project coverage is 71.10%. Comparing base (591ae9a) to head (445e078).
Report is 35 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 40.00% 15 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1674      +/-   ##
============================================
+ Coverage     71.04%   71.10%   +0.06%     
============================================
  Files           121      123       +2     
  Lines         65254    65555     +301     
============================================
+ Hits          46357    46612     +255     
- Misses        18897    18943      +46     
Files with missing lines Coverage Δ
src/commands.def 100.00% <ø> (ø)
src/cluster_legacy.c 85.46% <40.00%> (-0.68%) ⬇️

... and 28 files with indirect coverage changes

@hpatro (Collaborator) left a comment

The cluster-replicate.json file should be updated, and commands.def will get regenerated as part of the build. If it was accidentally left unstaged, please add it.

Also, could you run clang-format on your end to fix some of the formatting issues?

Comment on lines 7056 to 7048
if (server.primary != NULL) {
    replicationUnsetPrimary();
}
Collaborator

All other invocations of replicationUnsetPrimary() are not wrapped in this condition. Is it unnecessary to invoke it when server.primary is NULL?

Collaborator

Also, I'm unsure why clusterPromoteSelfToPrimary was introduced; it seems to be the same behavior at this point, but it's good to have this abstraction if additional steps are introduced for cluster mode in the future.

Author

Also, I'm unsure why clusterPromoteSelfToPrimary was introduced; it seems to be the same behavior at this point, but it's good to have this abstraction if additional steps are introduced for cluster mode in the future.

Ok.

Author

All other invocations of replicationUnsetPrimary() are not wrapped in this condition. Is it unnecessary to invoke it when server.primary is NULL?

Probably not needed. It looks like a leftover from KeyDB's implementation, where the primary needed to be passed to replicationUnsetMaster.

@skolosov-snap force-pushed the skolosov/replicate-no-one branch 2 times, most recently from fde1ab6 to 4abbae8 on February 7, 2025 23:46
@skolosov-snap (Author)

The cluster-replicate.json file should be updated, and commands.def will get regenerated as part of the build. If it was accidentally left unstaged, please add it.

Also, could you run clang-format on your end to fix some of the formatting issues?

Updated.

@zuiderkwast added the major-decision-pending (Major decision pending by TSC team) label on Feb 10, 2025
@zuiderkwast (Contributor) left a comment

This feature makes sense to me.

@valkey-io/core-team New arguments = major decision. Please approve or vote if you agree.

@skolosov-snap force-pushed the skolosov/replicate-no-one branch from 4abbae8 to 85238e6 on February 10, 2025 16:12
@zuiderkwast (Contributor)

The CI job "DCO" is failing. You need to use git commit -s. See the Details link next to the DCO job.

Why do we need it? See here: https://github.com/valkey-io/valkey/blob/unstable/CONTRIBUTING.md#developer-certificate-of-origin Thanks!

@skolosov-snap force-pushed the skolosov/replicate-no-one branch from 85238e6 to 3789227 on February 10, 2025 17:24
@skolosov-snap force-pushed the skolosov/replicate-no-one branch from 3789227 to e4e8b24 on February 10, 2025 17:36
@skolosov-snap (Author)

The CI job "DCO" is failing. You need to use git commit -s. See the Details link next to the DCO job.

Why do we need it? See here: https://github.com/valkey-io/valkey/blob/unstable/CONTRIBUTING.md#developer-certificate-of-origin Thanks!

Done

Comment on lines 7049 to 7051
/* Reset manual failover state. */
resetManualFailover();
Member

Once it becomes a primary, do we need this reset? It has no way to enter the failover check logic. Or at least we could remove the comment, since this line is self-explanatory.

Suggested change
- /* Reset manual failover state. */
- resetManualFailover();
+ resetManualFailover();

@skolosov-snap (Author) Feb 11, 2025

I'm not sure what you mean by "no way to enter the failover check logic". This is to abort any in-progress failover (which can happen, right?).
The comment has been removed.

Member

I mean, when the node becomes a primary, it has no way to enter this manual failover handling logic.

        if (nodeIsReplica(myself)) {
            clusterHandleManualFailover();

@skolosov-snap (Author) Feb 11, 2025

Sorry, I'm not sure I understood you correctly. If you mean that there is no way for a manual failover to be in progress by the time we reach resetManualFailover(), then I would disagree: a manual failover could start before we receive CLUSTER REPLICATE NO ONE. Also, CLUSTER RESET (which may likewise change a replica into a primary) does the same thing: it resets any in-progress failover with resetManualFailover().

@enjoy-binbin (Member) Feb 12, 2025

I know it could start before the NO ONE. I mean, if a manual failover is in progress, will it be a problem if it is not reset? What will happen? I know CLUSTER RESET does the same thing, but that is old behavior. I am OK with resetting it; I was just thinking about this, since we definitely have other places where we don't reset it when the role changes.

Contributor

we definitely have other places where we don't reset it when the role changes.

Interesting. Is this a problem?

Author

What is the reason for not resetting it?

Member

I am not against it; this is just a question: after it becomes a primary, will there be any problem with the uncleared manual failover state? As far as I know there is no problem (only a logging issue, e.g. the replica (now an empty primary) will print something like a manual failover timeout message). The original primary has a known issue with the pause timeout.

I am OK with resetting it. I would probably prefer an implementation that refuses to execute NO ONE if the replica is in a failover, or at least in a manual failover. The admin should do this in a stable state, which means the replica should not be in a failover state (or at least not in a manual failover). But I guess there is no harm (I don't see a problem here), so I am also OK with it.

@skolosov-snap force-pushed the skolosov/replicate-no-one branch from e4e8b24 to bed392b on February 11, 2025 16:31
@skolosov-snap (Author)

Any objections to merging it?

@zuiderkwast (Contributor)

We're busy making the 8.1.0 release candidate just now. This one will need to wait and get merged after that.

@madolson (Member)

Any objections to merging it?

We should also have some tests validating that this new behavior works as intended: set up a cluster, detach the replica, and make sure the slots/shards and everything else are still consistent and the rest of the cluster agrees on the state.

{
    "name": "no-one",
    "type": "block",
    "arguments": [
Member

We should also add a since field and a history field to mention this argument.

Author

Updated.

- } else if (!strcasecmp(c->argv[1]->ptr, "replicate") && c->argc == 3) {
-     /* CLUSTER REPLICATE <NODE ID> */
+ } else if (!strcasecmp(c->argv[1]->ptr, "replicate") && (c->argc == 3 || c->argc == 4)) {
+     /* CLUSTER REPLICATE (<NODE ID> | NO ONE) */
      /* Lookup the specified node in our table. */
Member

Please also move this comment down, near the clusterLookupNode one.

Author

Updated.

@PingXie (Member) commented Feb 17, 2025

if you want to change the cluster topology, the only way to do it is to reset the node (with the CLUSTER RESET command). However, this removes the node from the cluster, which affects clients.

Can we introduce a new mode so it doesn't forget all the nodes in the cluster? I think conceptually we are still discussing a form of reset, so the solution seems too tactical to me. Maybe CLUSTER RESET SOFT?

BTW, I just noticed that the forget path is not always working. The reset node joined back to the cluster quickly.

This change adds support for a new argument to the CLUSTER REPLICATE command: CLUSTER REPLICATE NO ONE. When this command is called, the node is converted from a replica into an empty primary while still remaining in the cluster. Thus, all client traffic arriving at this node can be redirected to the correct node.

I don't see the implementation moving the node to a new shard. This would leave two primaries (one real and one empty) in the original shard, which will confuse clients.

@hpatro (Collaborator) commented Feb 17, 2025

I don't see the implementation moving the node to a new shard. This would leave two primaries (one real and one empty) in the original shard, which will confuse clients.

Good catch!

@skolosov-snap (Author) commented Feb 19, 2025

if you want to change the cluster topology, the only way to do it is to reset the node (with the CLUSTER RESET command). However, this removes the node from the cluster, which affects clients.

Can we introduce a new mode so it doesn't forget all the nodes in the cluster? I think conceptually we are still discussing a form of reset, so the solution seems too tactical to me. Maybe CLUSTER RESET SOFT?

IMHO that is just a naming question. Whatever command name we come up with, the behavior would be the same: turn the replica into a primary while leaving it in the cluster. If you think CLUSTER RESET SOFT is a better name, I can implement it under that name. My personal opinion is that CLUSTER RESET is not the best command for this feature, because the main thing CLUSTER RESET does is remove the node from the cluster, and that is exactly what we want to avoid. On the other hand, CLUSTER REPLICATE NO ONE is consistent with the non-cluster REPLICAOF NO ONE, which should not confuse clients but rather give them a familiar pattern.

BTW, I just noticed that the forget path is not always working.

What do you mean by "forget path is not always working"? What forget path? We are not doing any forgetting here.

The reset node joined back to the cluster quickly.

AFAIU, if a node is reset it will not be added back to the cluster automatically, only when somebody does it explicitly. So if you want to remove a replica manually (e.g. the node is scheduled for maintenance), the only option you have is a reset, which affects all clients (they start seeing errors).

This change adds support for a new argument to the CLUSTER REPLICATE command: CLUSTER REPLICATE NO ONE. When this command is called, the node is converted from a replica into an empty primary while still remaining in the cluster. Thus, all client traffic arriving at this node can be redirected to the correct node.

I don't see the implementation moving the node to a new shard. This would leave two primaries (one real and one empty) in the original shard, which will confuse clients.

What is a shard? Would you please shed some light on this term? Is it something specific to Valkey? AFAIU a shard is a primary with the replicas attached to it. So, when we switch the node's role from replica to empty primary, doesn't that mean we moved it to a separate new shard?

@skolosov-snap force-pushed the skolosov/replicate-no-one branch from bed392b to b419cfb on February 19, 2025 23:04
@skolosov-snap (Author)

I believe I figured it out. There is an internal shards dict that needs to be updated.
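
For context, a rough sketch of what that update could involve. It assumes the shard bookkeeping helpers clusterRemoveNodeFromShard()/clusterAddNodeToShard(), the myself->shard_id field, and getRandomHexChars() that back CLUSTER SHARDS; the PR's actual calls may differ. The idea is that once the node stops replicating, it must also leave its former primary's shard and start a one-node shard of its own, otherwise the shards dict would still report two primaries in one shard.

/* Hypothetical helper: move myself out of the old primary's shard into a
 * fresh single-node shard after CLUSTER REPLICATE NO ONE. */
static void clusterMoveMyselfToOwnShard(void) {
    clusterRemoveNodeFromShard(myself);                   /* leave the old shard's node list */
    getRandomHexChars(myself->shard_id, CLUSTER_NAMELEN); /* assumed: generate a fresh shard id */
    clusterAddNodeToShard(myself->shard_id, myself);      /* register the new one-node shard */
    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG);       /* persist the updated cluster config */
}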

@skolosov-snap force-pushed the skolosov/replicate-no-one branch 2 times, most recently from b9ad4fd to 95c0b42 on February 20, 2025 00:06
@zuiderkwast (Contributor)

@PingXie

Can we introduce a new mode so it doesn't forget all the nodes in the cluster? I think conceptually we are still discussing a form of reset, so the solution seems too tactical to me. Maybe CLUSTER RESET SOFT?

I agree with @skolosov-snap that CLUSTER REPLICATE NO ONE is better. The node is not leaving the cluster, so RESET seems less intuitive.

BTW, I just noticed that the forget path is not always working. The reset node joined back to the cluster quickly.

To make the cluster forget a node, you need to send CLUSTER FORGET to the other nodes in the cluster. If you only reset a node, it will disconnect but the other nodes will find it again soon. I believe this is the documented behavior.

@skolosov-snap (Author) commented Feb 20, 2025

To make the cluster forget a node, you need to send CLUSTER FORGET to the other nodes in the cluster. If you only reset a node, it will disconnect but the other nodes will find it again soon. I believe this is the documented behavior.

Looking at the docs, it should behave differently:

This command is mainly useful to re-provision a Valkey Cluster node in order to be used in the context of a new, different cluster.

So I believe it could potentially be fixed in the future.

@PingXie, @zuiderkwast You are right about the current CLUSTER RESET behavior. However, it takes about 15 seconds to re-join the cluster, so we are going to see 15 seconds of errors, which is quite long for such a scenario:

27859:M 19 Feb 2025 17:11:14.472 * Connection with replica 127.0.0.1:17383 lost.
27859:M 19 Feb 2025 17:11:29.523 * Sending MEET packet to node 096e0230bddddb11d32e36497adcc4235bffa7d1 () because there is no inbound link for it
27859:M 19 Feb 2025 17:11:29.593 * Successfully completed handshake with 096e0230bddddb11d32e36497adcc4235bffa7d1 ()

@PingXie (Member) commented Feb 20, 2025

@skolosov-snap @zuiderkwast you have a good point. CLUSTER REPLICATE NO ONE works better.

BTW, a follow-up thought: should we consider a single token NONE as opposed to NO ONE? I understand the rationale of mimicking REPLICAOF NO ONE, but using two tokens for one concept seems quite arbitrary. I wonder if it is the lesser evil to go with NONE this time? So CLUSTER REPLICATE NONE, perhaps?

@zuiderkwast (Contributor)

a single token NONE as opposed to NO ONE?

I suggested it earlier in this PR in comment #1674 (comment) and I was already convinced that NO ONE is better. 😆

@PingXie (Member) commented Feb 20, 2025

a single token NONE as opposed to NO ONE?

I suggested it earlier in this PR in comment #1674 (comment) and I was already convinced that NO ONE is better. 😆

Hmm the comment link seems broken. What convinced you? :)

@skolosov-snap (Author) commented Feb 20, 2025

BTW, a follow-up thought: should we consider a single token NONE as opposed to NO ONE? I understand the rationale of mimicking REPLICAOF NO ONE, but using two tokens for one concept seems quite arbitrary. I wonder if it is the lesser evil to go with NONE this time? So CLUSTER REPLICATE NONE, perhaps?

Not a big deal to me, but IMHO consistency/similarity between commands is better. What benefit will we get from replacing it with a single token?
Also, I understand that the probability of a node ID being NONE is near zero, but it is not zero, e.g. if in the future we allow the ID to be set by the user, or in some kind of test.
Again, not a big deal to me, but I don't see any benefit from using a single token (except maybe a shorter if-statement in the code).

@PingXie (Member) commented Feb 20, 2025

Also, I understand that the probability of a node ID being NONE is near zero, but it is not zero.

The probability will be practically 0.

e.g. if in the future we allow the ID to be set by the user, or in some kind of test.

We have human_nodename for this purpose.

    sds human_nodename;                     /* The known human readable nodename for this node */

@zuiderkwast (Contributor)

a single token NONE as opposed to NO ONE?

I suggested it earlier in this PR in comment #1674 (comment) and I was already convinced that NO ONE is better. 😆

Hmm the comment link seems broken. What convinced you? :)

Interesting, the link works for me. It's one of the resolved comments above on src/commands/cluster-replicate.json.

Quoting @skolosov-snap's comment, which made sense to me:

I think similarity to REPLICAOF NO ONE is better than trying to keep the same arity for different cases. I think clients are not thinking about arity at all, but similarity between commands is easier for them to remember. In addition, it is easier for us to distinguish <node-id> from a special case.
