
Add multi-database support to cluster mode #1671

Open · wants to merge 9 commits into base: unstable
Conversation

@xbasel (Member) commented Feb 5, 2025:

This commit introduces multi-database support in cluster mode while maintaining backward compatibility and requiring no API changes. Key features include:

  • Database-agnostic hashing: the hashing algorithm is unchanged, so identical keys map to the same slot in every database; the slot calculation takes no database index as input (see the sketch after this list). This keeps key distribution consistent and preserves compatibility with existing single-database setups.

  • Implementation is fully backward compatible with no API changes.

  • The core structure remains an array of databases, each containing a list of hashtables (one per slot).
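
For reference, here is a minimal sketch of the slot calculation, mirroring keyHashSlot() in src/cluster.c (crc16() is the table-driven CRC16 implemented in src/crc16.c). The database index never enters the computation, which is what makes the hashing database-agnostic:

#include <stdint.h>

uint16_t crc16(const char *buf, int len); /* table-driven CRC16, see src/crc16.c */

/* Mirrors keyHashSlot() in src/cluster.c: the database index is not an
 * input, so a given key hashes to the same slot in every database. */
unsigned int keyHashSlot(char *key, int keylen) {
    int s, e; /* start-end indexes of '{' and '}' */

    for (s = 0; s < keylen; s++)
        if (key[s] == '{') break;

    /* No '{'? Hash the whole key. */
    if (s == keylen) return crc16(key, keylen) & 0x3FFF;

    /* '{' found, look for the matching '}'. */
    for (e = s + 1; e < keylen; e++)
        if (key[e] == '}') break;

    /* No '}' or nothing between {}? Hash the whole key. */
    if (e == keylen || e == s + 1) return crc16(key, keylen) & 0x3FFF;

    /* Otherwise hash only the tag between '{' and '}'. */
    return crc16(key + s + 1, e - s - 1) & 0x3FFF;
}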

Cluster management commands remain global, except for GETKEYSINSLOT and COUNTKEYSINSLOT, which run in the selected-DB context.

The MIGRATE command operates in the selected-DB context. Note that the MIGRATE parameter destination-db is honored: keys can be migrated to a different database on the target node, just as in non-cluster mode.

The slot migration process changes when multiple databases are used:

	for each database:
		SELECT database
		keys = GETKEYSINSLOT
		MIGRATE source target keys
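
Concretely, the per-slot loop could look like the following hiredis-based sketch. This is illustrative only, not the actual valkey-cli code; the batch size (16), the 5000 ms timeout, and the omitted error handling are assumptions:

#include <stdio.h>
#include <hiredis/hiredis.h>

/* Migrate all keys of `slot` from `src` to the destination node,
 * database by database. GETKEYSINSLOT and MIGRATE both run in the
 * selected-DB context, so we SELECT each database in turn. */
void migrateSlot(redisContext *src, const char *dst_host, int dst_port,
                 int slot, int dbnum) {
    for (int db = 0; db < dbnum; db++) {
        redisReply *r = redisCommand(src, "SELECT %d", db);
        freeReplyObject(r);

        for (;;) {
            redisReply *keys = redisCommand(src, "CLUSTER GETKEYSINSLOT %d %d", slot, 16);
            if (!keys || keys->type != REDIS_REPLY_ARRAY || keys->elements == 0) {
                if (keys) freeReplyObject(keys);
                break; /* no keys left in this slot for this database */
            }
            for (size_t i = 0; i < keys->elements; i++) {
                /* destination-db == db keeps each key in the same
                 * database on the target, preserving the logical layout. */
                redisReply *m = redisCommand(src, "MIGRATE %s %d %s %d %d",
                                             dst_host, dst_port,
                                             keys->element[i]->str, db, 5000);
                freeReplyObject(m);
            }
            freeReplyObject(keys);
        }
    }
}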

Valkey-cli has been updated to support resharding across all databases.

#1319

codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 94.91525% with 3 lines in your changes missing coverage. Please review.

Project coverage is 71.15%. Comparing base (2eac2cc) to head (a670e9d).
Report is 41 commits behind head on unstable.

Files with missing lines   Patch %   Lines
src/valkey-cli.c           92.59%    2 Missing ⚠️
src/cluster.c              90.00%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1671      +/-   ##
============================================
+ Coverage     70.97%   71.15%   +0.17%     
============================================
  Files           121      123       +2     
  Lines         65238    65542     +304     
============================================
+ Hits          46305    46638     +333     
+ Misses        18933    18904      -29     
Files with missing lines   Coverage Δ
src/cluster_legacy.c       86.17% <100.00%> (+0.27%) ⬆️
src/config.c               78.35% <ø> (-0.06%) ⬇️
src/db.c                   89.94% <ø> (+0.37%) ⬆️
src/valkey-benchmark.c     61.75% <ø> (+1.61%) ⬆️
src/cluster.c              89.12% <90.00%> (-0.12%) ⬇️
src/valkey-cli.c           56.30% <92.59%> (+0.41%) ⬆️

... and 32 files with indirect coverage changes

@xbasel xbasel marked this pull request as ready for review February 10, 2025 21:37
@xbasel xbasel requested a review from zuiderkwast February 10, 2025 22:13
@@ -1728,12 +1714,6 @@ void swapMainDbWithTempDb(serverDb *tempDb) {
void swapdbCommand(client *c) {
int id1, id2;

/* Not allowed in cluster mode: we have just DB 0 there. */


Would that be enough for SWAPDB to work in cluster mode? What would happen in a setup with 2 shards, each responsible for half of the slots in the DBs?

@xbasel (Member, Author) commented Feb 11, 2025:

With this implementation SWAPDB must be executed on all primary nodes. There are three options:

  1. Allow SWAPDB and shift responsibility to the user – risky and non-atomic; it can cause temporary inconsistency and data corruption, and needs strong warnings.
  2. Keep SWAPDB disabled in cluster mode – safest, avoids inconsistency.
  3. Make SWAPDB cluster-wide and atomic – complex, with unclear feasibility.

I think option 2 is the safest bet. @JoBeR007 wdyt?

Contributor:

Is SWAPDB replicated as a single command? Then it's atomic.

If it's risky, it's risky in standalone mode with replicas too, right?

I think we can allow it. Swapping the data can only be done in some non-realtime workloads anyway I think.

Another commenter replied:

I think "risky because of replication" and "risky because SWAPDB must be executed on all primary nodes" are unrelated: as a user you can't control the first, while the user is the main risk in the second case.
I would keep SWAPDB disabled in cluster mode, if we decide to continue with this implementation.

Contributor:

In cluster mode, consistency is per slot.

@xbasel (Member, Author) replied:

> Is SWAPDB replicated as a single command? Then it's atomic.
>
> If it's risky, it's risky in standalone mode with replicas too, right?
>
> I think we can allow it. Swapping the data can only be done in some non-realtime workloads anyway I think.

I don't think it's very risky with standalone replicas. The only downside is that if SWAPDB propagation to the replica takes time, a client might still read from the wrong database; at least the client can't modify the wrong database, since replicas are read-only.
In cluster mode, the same logical DB can be DB0 on one node and DB1 on another, but similar issues already exist today: FLUSHDB on one node doesn't clear the entire DB, since data exists in other slots/nodes. But as you said, consistency is per slot.

Contributor:

Yes, FLUSHDB is very similar in this regard. If a failover happens just before this command has been propagated to replicas, it's a big thing, but it's no surprise I think. The client can use WAIT or check replication offset to make sure the FLUSHDB or SWAPDB was successful on the replicas.
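
For illustration, a minimal hiredis sketch of that WAIT check (the server address and the WAIT arguments, 1 replica and a 1000 ms timeout, are assumptions):

#include <stdio.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *ctx = redisConnect("127.0.0.1", 6379); /* illustrative address */
    if (!ctx || ctx->err) return 1;

    redisReply *r = redisCommand(ctx, "SWAPDB 0 1");
    freeReplyObject(r);

    /* WAIT numreplicas timeout: block until at least 1 replica has
     * acknowledged the preceding writes, or 1000 ms have passed. */
    r = redisCommand(ctx, "WAIT 1 1000");
    if (r && r->type == REDIS_REPLY_INTEGER)
        printf("SWAPDB acknowledged by %lld replica(s)\n", r->integer);
    freeReplyObject(r);

    redisFree(ctx);
    return 0;
}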

Member:

Regarding this, I think it is not just a multi-database issue but is more related to atomic slot migration (ATM). If a shard is in a stable state (not undergoing slot migration), then FLUSHDB/FLUSHALL/SWAPDB are safe. However, if slot migration is in progress, they might lead to data inconsistency.

I think this needs to be considered alongside atomic slot migration:

  1. During the ATM process, for slots being migrated, if we encounter FLUSHALL/FLUSHDB, we could send a command like flushslot or flushslotall to the target shard.
  2. As for SWAPDB, I recommend temporarily prohibiting execution during the ATM process.

@PingXie @enjoy-binbin, please also take note of this.

Member:

Makes sense. @murphyjacob4 FYI

Member:

I made a comment on the issue about this, but it's also worth mentioning that it's hard to orchestrate SWAPDB. Even in steady state, FLUSHDB and FLUSHALL are idempotent (you can send them multiple times) but SWAPDB isn't. If a command times out on one node, it's hard to reason about whether it was successful and how to retry it. I think we should continue to disable SWAPDB in cluster mode for now, unless we introduce an idempotent way to do the swap.

@soloestoy soloestoy requested review from soloestoy and removed request for zuiderkwast February 12, 2025 06:28
@@ -1102,7 +1110,7 @@ getNodeByQuery(client *c, struct serverCommand *cmd, robj **argv, int argc, int
* NODE <node-id>. */
int flags = LOOKUP_NOTOUCH | LOOKUP_NOSTATS | LOOKUP_NONOTIFY | LOOKUP_NOEXPIRE;
if ((migrating_slot || importing_slot) && !pubsubshard_included) {
-if (lookupKeyReadWithFlags(&server.db[0], thiskey, flags) == NULL)
+if (lookupKeyReadWithFlags(c->db, thiskey, flags) == NULL)
Member:

Here, I modified it to use c->db, so for most commands the key it wants to access can be correctly located. However, some cross-DB commands, such as COPY, still require additional checks. The ultimate solution is atomic slot migration, I believe. Once ATM is implemented, the TRYAGAIN issue will no longer occur.

Member:

I noticed that getNodeByQuery doesn't follow SELECTs either, so this might not be the right database. If you for example have:

SELECT 0
GET FOO
SELECT 1
GET FOO

c->db won't be correct here either. COPY and MOVE are also such problems, as mentioned. I wonder if there is some way to make this correct without having ATM, so we can limit the breakage when moving from standalone to cluster.

Member:

Generally, c->db provides the correct context. Are you referring to the scenario where the SELECT command is used within a transaction (MULTI/EXEC)?

@soloestoy (Member) commented:
I'm happy that we did "Unified db rehash method for both standalone and cluster #12848" when developing kvstore, which made the implementation of multi-database support simpler.

@ranshid ranshid added the release-notes This issue should get a line item in the release notes label Feb 17, 2025
@hpatro (Collaborator) left a comment:

We need to add history entries to the SWAPDB, SELECT and MOVE JSON files to indicate this is supported since 9.0.

Comment on lines +6881 to +6888
int dbHasNoKeys(void) {
for (int i = 0; i < server.dbnum; i++) {
if (kvstoreSize(server.db[i].keys) != 0) {
return 0;
}
}
return 1;
}
Collaborator:

I think we should move this to db.c

@@ -196,7 +196,7 @@ proc ::valkey_cluster::__method__masternode_notfor_slot {id slot} {
error "Slot $slot is everywhere"
}

-proc ::valkey_cluster::__dispatch__ {id method args} {
+proc ::valkey_cluster::__dispatch__ {id method args} {
Collaborator:

Avoid this whitespace-only change.

@@ -0,0 +1,481 @@
# Tests multi-databases in cluster mode
Member:

This is the legacy clustering system. Ideally this test should be in unit/cluster

@@ -0,0 +1,481 @@
# Tests multi-databases in cluster mode

proc pause {{message "Hit Enter to continue ==> "}} {
Member:

I see you used

proc bp {{s {}}} {

for breakpoints in other places in the code; make sure to clean these all up.

}
}
}

} ;# tags

set ::singledb $old_singledb
Member:

Worth noting we no longer need this constraint, so we might want to consider removing this throughout the codebase except when we are running in external mode.



@@ -1,5 +1,12 @@
start_server {tags {"lazyfree"}} {
test "UNLINK can reclaim memory in background" {

# The test framework invokes "flushall", replacing kvstores even if empty.
Member:

In that case I would rather we did a sync flushall in the test framework, so we don't have these random waits all over the place.

Labels
release-notes This issue should get a line item in the release notes