Implement: CLUSTER REPLICATE NO ONE #1674

Open

skolosov-snap wants to merge 1 commit into unstable from skolosov/replicate-no-one

Conversation

@skolosov-snap commented Feb 5, 2025

Currently, Valkey doesn't allow detaching a replica from the primary node it is attached to. So if you want to change the cluster topology, the only way to do it is to reset the node (with the CLUSTER RESET command). However, this removes the node from the cluster, which affects clients: all clients will keep sending traffic to this node (and getting inaccurate responses) until they refresh their topology.

This change adds support for a new argument to the CLUSTER REPLICATE command: CLUSTER REPLICATE NO ONE. When this command is called, the node is converted from a replica into an empty primary while still remaining in the cluster. Thus, all client traffic arriving at this node can be redirected to the correct node.
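
For illustration only, here is a minimal sketch of what such a handler could look like inside src/cluster_legacy.c. The helper names (replicationUnsetPrimary(), clusterPromoteSelfToPrimary(), resetManualFailover()) come from the snippets quoted later in this review; the validation, ordering, and reply handling below are assumptions, not the PR's verbatim code.

/* Hypothetical handler for CLUSTER REPLICATE NO ONE: detach from the current
 * primary and become an empty primary while staying in the cluster. */
static void clusterReplicateNoOneCommand(client *c) {
    if (!nodeIsReplica(myself)) {
        addReply(c, shared.ok); /* assumed: nothing to do if we are already a primary */
        return;
    }
    replicationUnsetPrimary();     /* stop replicating from the old primary */
    clusterPromoteSelfToPrimary(); /* mark ourselves as a primary in the cluster state */
    resetManualFailover();         /* abort any in-progress manual failover */
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_SAVE_CONFIG);
    addReply(c, shared.ok);
}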

@skolosov-snap force-pushed the skolosov/replicate-no-one branch from 91589d1 to ff96c0f on February 5, 2025 22:51

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 40.00000% with 15 lines in your changes missing coverage. Please review.

Project coverage is 71.10%. Comparing base (591ae9a) to head (445e078).
Report is 35 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 40.00% 15 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1674      +/-   ##
============================================
+ Coverage     71.04%   71.10%   +0.06%     
============================================
  Files           121      123       +2     
  Lines         65254    65555     +301     
============================================
+ Hits          46357    46612     +255     
- Misses        18897    18943      +46     
Files with missing lines Coverage Δ
src/commands.def 100.00% <ø> (ø)
src/cluster_legacy.c 85.46% <40.00%> (-0.68%) ⬇️

... and 28 files with indirect coverage changes

@hpatro (Collaborator) left a comment

The cluster-replicate.json file should be updated, and commands.def will get regenerated as part of the build. If it was accidentally left unstaged, please add it.

Also, could you run clang-format on your end to fix some of the formatting issues?

Comment on lines 7056 to 7048
if (server.primary != NULL) {
    replicationUnsetPrimary();
}
Collaborator

All other invocations of replicationUnsetPrimary() are not wrapped in this condition. Is it unnecessary to invoke it when server.primary is NULL?

Collaborator

Also, I'm unsure why clusterPromoteSelfToPrimary was introduced; it seems to be the same behavior at this point, but it's good to have this abstraction if additional steps are introduced for cluster mode in the future.

Author

Also, I'm unsure why clusterPromoteSelfToPrimary was introduced; it seems to be the same behavior at this point, but it's good to have this abstraction if additional steps are introduced for cluster mode in the future.

Ok.

Author

All other invocations of replicationUnsetPrimary() are not wrapped in this condition. Is it unnecessary to invoke it when server.primary is NULL?

Probably not needed. It looks like a leftover from KeyDB's implementation, where the primary needed to be passed to replicationUnsetMaster.

@skolosov-snap force-pushed the skolosov/replicate-no-one branch 2 times, most recently from fde1ab6 to 4abbae8 on February 7, 2025 23:46
@skolosov-snap (Author)

The cluster-replicate.json file should be updated, and commands.def will get regenerated as part of the build. If it was accidentally left unstaged, please add it.

Also, could you run clang-format on your end to fix some of the formatting issues?

Updated.

@zuiderkwast added the major-decision-pending (Major decision pending by TSC team) label on Feb 10, 2025
@zuiderkwast (Contributor) left a comment

This feature makes sense to me.

@valkey-io/core-team New arguments = major decision. Please approve or vote if you agree.

@skolosov-snap force-pushed the skolosov/replicate-no-one branch from 4abbae8 to 85238e6 on February 10, 2025 16:12
@zuiderkwast (Contributor)

The CI job "DCO" is failing. You need to use git commit -s. See the Details link next to the DCO job.

Why do we need it? See here: https://github.com/valkey-io/valkey/blob/unstable/CONTRIBUTING.md#developer-certificate-of-origin Thanks!

@skolosov-snap force-pushed the skolosov/replicate-no-one branch from 85238e6 to 3789227 on February 10, 2025 17:24
@skolosov-snap force-pushed the skolosov/replicate-no-one branch from 3789227 to e4e8b24 on February 10, 2025 17:36
@skolosov-snap (Author)

The CI job "DCO" is failing. You need to use git commit -s. See the Details link next to the DCO job.

Why do we need it? See here: https://github.com/valkey-io/valkey/blob/unstable/CONTRIBUTING.md#developer-certificate-of-origin Thanks!

Done

Comment on lines 7049 to 7051
/* Reset manual failover state. */
resetManualFailover();
Member

Once it becomes a primary, do we need this reset? It has no way to enter the failover check logic. Or at least we could remove the comment, since this line is self-explanatory.

Suggested change
- /* Reset manual failover state. */
- resetManualFailover();
+ resetManualFailover();

@skolosov-snap (Author) Feb 11, 2025

I'm not sure what you mean by "no way to enter the failover check logic". This is to abort any in-progress failover (which can happen, right?).
The comment has been removed.

Member

I mean, when the node becomes a primary, it has no way to enter this manual failover handling logic.

        if (nodeIsReplica(myself)) {
            clusterHandleManualFailover();

@skolosov-snap (Author) Feb 11, 2025

Sorry, I'm not sure I understood you correctly. If you mean that there is no way for a manual failover to be in progress by the time we reach resetManualFailover(), then I would disagree: a manual failover could start before we receive CLUSTER REPLICATE NO ONE. Also, CLUSTER RESET (which may likewise change a replica into a primary) does the same thing: it resets any in-progress failover with resetManualFailover().

@enjoy-binbin (Member) Feb 12, 2025

I know it could start before the NO ONE. I mean, if a manual failover is in progress, will it be a problem if it is not reset? What will happen? I know CLUSTER RESET does the same thing, but that is old behavior. I am OK with resetting it; I was just thinking about this, since we definitely have other places where we don't reset it when the role changes.

Contributor

we definitely have other places where we don't reset it when the role changes.

Interesting. Is this a problem?

Author

What is the reason for not resetting it?

Member

I am not against it; this is just a question: after it becomes a primary, will there be any problem with the uncleared manual failover state? As far as I know there is no problem (only a logging issue, e.g. the replica (now an empty primary) will print something like a manual failover timeout message). The original primary has a known issue with the pause timeout.

I am OK with resetting it. I would probably prefer an implementation that refuses to execute NO ONE if the replica is in a failover, or at least in a manual failover. The admin should do this in a stable state, which means the replica should not be in a failover state (or at least not in a manual failover). But I guess there is no harm (I don't see a problem here), so I am also OK with it.

@skolosov-snap force-pushed the skolosov/replicate-no-one branch from e4e8b24 to bed392b on February 11, 2025 16:31
@skolosov-snap (Author)

Any objections to merging it?

@zuiderkwast (Contributor)

We're busy making the 8.1.0 release candidate just now. This one will need to wait and get merged after that.

@madolson (Member)

Any objections to merging it?

We should also have some tests validating that this new behavior works as intended: set up a cluster, detach the replica, and make sure the slots/shards and everything else are still consistent and the rest of the cluster agrees on the state.

{
    "name": "no-one",
    "type": "block",
    "arguments": [
Member

We should also add a since field and a history field to mention this argument.

Author

Updated.

- } else if (!strcasecmp(c->argv[1]->ptr, "replicate") && c->argc == 3) {
-     /* CLUSTER REPLICATE <NODE ID> */
+ } else if (!strcasecmp(c->argv[1]->ptr, "replicate") && (c->argc == 3 || c->argc == 4)) {
+     /* CLUSTER REPLICATE (<NODE ID> | NO ONE) */
      /* Lookup the specified node in our table. */
Member

Please also move this comment down, near the clusterLookupNode one.

Author

Updated.

@PingXie (Member) commented Feb 17, 2025

if you want to change the cluster topology, the only way to do it is to reset the node (with the CLUSTER RESET command). However, this removes the node from the cluster, which affects clients.

Can we introduce a new mode so it doesn't forget all the nodes in the cluster? I think conceptually we are still discussing a form of reset, so the solution seems too tactical to me. Maybe CLUSTER RESET SOFT?

BTW, I just noticed that the forget path is not always working. The reset node joined back to the cluster quickly.

This change adds support for a new argument to the CLUSTER REPLICATE command: CLUSTER REPLICATE NO ONE. When this command is called, the node is converted from a replica into an empty primary while still remaining in the cluster. Thus, all client traffic arriving at this node can be redirected to the correct node.

I don't see the implementation moving the node to a new shard. This would leave two primaries (one real and one empty) in the original shard, which will confuse clients.

@hpatro (Collaborator) commented Feb 17, 2025

I don't see the implementation moving the node to a new shard. This would leave two primaries (one real and one empty) in the original shard, which will confuse clients.

Good catch!

@skolosov-snap (Author) commented Feb 19, 2025

if you want to change the cluster topology, the only way to do it is to reset the node (with the CLUSTER RESET command). However, this removes the node from the cluster, which affects clients.

Can we introduce a new mode so it doesn't forget all the nodes in the cluster? I think conceptually we are still discussing a form of reset, so the solution seems too tactical to me. Maybe CLUSTER RESET SOFT?

IMHO that is just a naming question. Whatever command name we come up with, the behavior would be the same: turn the replica into a primary while leaving it in the cluster. If you think CLUSTER RESET SOFT is a better name, I can implement it under that name. My personal opinion is that CLUSTER RESET is not the best command for this feature, because the main thing CLUSTER RESET does is remove the node from the cluster, and that is exactly what we want to avoid. On the other hand, CLUSTER REPLICATE NO ONE is consistent with the non-cluster REPLICAOF NO ONE, which should not confuse clients but rather give them a familiar pattern.

BTW, I just noticed that the forget path is not always working.

What do you mean by "forget path is not always working"? What forget path? We are not doing any forgetting here.

The reset node joined back to the cluster quickly.

AFAIU, if a node is reset it will not be added back to the cluster automatically, only when somebody does it explicitly. So if you want to remove a replica manually (e.g. the node is scheduled for maintenance), the only option you have is a reset, which affects all clients (they start seeing errors).

This change adds support for a new argument to the CLUSTER REPLICATE command: CLUSTER REPLICATE NO ONE. When this command is called, the node is converted from a replica into an empty primary while still remaining in the cluster. Thus, all client traffic arriving at this node can be redirected to the correct node.

I don't see the implementation moving the node to a new shard. This would leave two primaries (one real and one empty) in the original shard, which will confuse clients.

What is a shard? Would you please shed some light on this term? Is it something specific to Valkey? AFAIU a shard is a primary with the replicas attached to it. So, when we switch the node's role from replica to empty primary, doesn't that mean we moved it to a separate new shard?

@skolosov-snap force-pushed the skolosov/replicate-no-one branch from bed392b to b419cfb on February 19, 2025 23:04
@skolosov-snap (Author)

I believe I figured it out. There is an internal shards dict that needs to be updated.
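
For context, a rough sketch of what that update could involve. It assumes the shard bookkeeping helpers clusterRemoveNodeFromShard()/clusterAddNodeToShard(), the myself->shard_id field, and getRandomHexChars() that back CLUSTER SHARDS; the PR's actual calls may differ. The idea is that once the node stops replicating, it must also leave its former primary's shard and start a one-node shard of its own, otherwise the shards dict would still report two primaries in one shard.

/* Hypothetical helper: move myself out of the old primary's shard into a
 * fresh single-node shard after CLUSTER REPLICATE NO ONE. */
static void clusterMoveMyselfToOwnShard(void) {
    clusterRemoveNodeFromShard(myself);                   /* leave the old shard's node list */
    getRandomHexChars(myself->shard_id, CLUSTER_NAMELEN); /* assumed: generate a fresh shard id */
    clusterAddNodeToShard(myself->shard_id, myself);      /* register the new one-node shard */
    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG);       /* persist the updated cluster config */
}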

@skolosov-snap force-pushed the skolosov/replicate-no-one branch 2 times, most recently from b9ad4fd to 95c0b42 on February 20, 2025 00:06
@zuiderkwast (Contributor)

@PingXie

Can we introduce a new mode so it doesn't forget all the nodes in the cluster? I think conceptually we are still discussing a form of reset, so the solution seems too tactical to me. Maybe CLUSTER RESET SOFT?

I agree with @skolosov-snap that CLUSTER REPLICATE NO ONE is better. The node is not leaving the cluster, so RESET seems less intuitive.

BTW, I just noticed that the forget path is not always working. The reset node joined back to the cluster quickly.

To make the cluster forget a node, you need to send CLUSTER FORGET to the other nodes in the cluster. If you only reset a node, it will disconnect but the other nodes will find it again soon. I believe this is the documented behavior.

@skolosov-snap (Author) commented Feb 20, 2025

To make the cluster forget a node, you need to send CLUSTER FORGET to the other nodes in the cluster. If you only reset a node, it will disconnect but the other nodes will find it again soon. I believe this is the documented behavior.

Looking at the docs, it should behave differently:

This command is mainly useful to re-provision a Valkey Cluster node in order to be used in the context of a new, different cluster.

So I believe it could potentially be fixed in the future.

@PingXie, @zuiderkwast You are right about the current CLUSTER RESET behavior. However, it takes about 15 seconds to re-join the cluster, so we are going to see 15 seconds of errors, which is quite long for such a scenario:

27859:M 19 Feb 2025 17:11:14.472 * Connection with replica 127.0.0.1:17383 lost.
27859:M 19 Feb 2025 17:11:29.523 * Sending MEET packet to node 096e0230bddddb11d32e36497adcc4235bffa7d1 () because there is no inbound link for it
27859:M 19 Feb 2025 17:11:29.593 * Successfully completed handshake with 096e0230bddddb11d32e36497adcc4235bffa7d1 ()

@PingXie (Member) commented Feb 20, 2025

@skolosov-snap @zuiderkwast you have a good point. CLUSTER REPLICATE NO ONE works better.

BTW, a follow-up thought: should we consider a single token NONE as opposed to NO ONE? I understand the rationale of mimicking REPLICAOF NO ONE, but using two tokens for one concept seems quite arbitrary. I wonder if it is the lesser evil to go with NONE this time? So CLUSTER REPLICATE NONE, perhaps?

@zuiderkwast (Contributor)

a single token NONE as opposed to NO ONE?

I suggested it earlier in this PR in comment #1674 (comment) and I was already convinced that NO ONE is better. 😆

@PingXie (Member) commented Feb 20, 2025

a single token NONE as opposed to NO ONE?

I suggested it earlier in this PR in comment #1674 (comment) and I was already convinced that NO ONE is better. 😆

Hmm the comment link seems broken. What convinced you? :)

@skolosov-snap (Author) commented Feb 20, 2025

BTW, a follow-up thought: should we consider a single token NONE as opposed to NO ONE? I understand the rationale of mimicking REPLICAOF NO ONE, but using two tokens for one concept seems quite arbitrary. I wonder if it is the lesser evil to go with NONE this time? So CLUSTER REPLICATE NONE, perhaps?

Not a big deal to me, but IMHO consistency/similarity between commands is better. What benefit will we get from replacing it with a single token?
Also, I understand that the probability of a node ID being NONE is near zero, but it is not zero, e.g. if in the future we allow the ID to be set by the user, or in some kind of test.
Again, not a big deal to me, but I don't see any benefit from using a single token (except maybe a shorter if-statement in the code).

@PingXie (Member) commented Feb 20, 2025

Also, I understand that the probability of a node ID being NONE is near zero, but it is not zero.

The probability will be practically 0.

e.g. if in the future we allow the ID to be set by the user, or in some kind of test.

We have human_nodename for this purpose.

    sds human_nodename;                     /* The known human readable nodename for this node */

@zuiderkwast (Contributor)

a single token NONE as opposed to NO ONE?

I suggested it earlier in this PR in comment #1674 (comment) and I was already convinced that NO ONE is better. 😆

Hmm the comment link seems broken. What convinced you? :)

Interesting, the link works for me. It's one of the resolved comments above on src/commands/cluster-replicate.json.

Quoting @skolosov-snap's comment, which made sense to me:

I think similarity to REPLICAOF NO ONE is better than trying to keep the same arity for different cases. I think clients are not thinking about arity at all, but similarity between commands is easier for them to remember. In addition, it is easier for us to distinguish <node-id> from a special case.
