IGNITE-23304 #4821

Open · wants to merge 33 commits into base: main

Conversation

ascherbakoff (Contributor) commented Dec 3, 2024

This PR changes the safe timestamp generation behavior for partition replication groups.

safeTs is a concept tied to majority-based replication protocols and is used for serializable reads on backup replicas.
Each raft command is assigned a monotonic ts, and a replica updates its local ts value on receiving replication commands.
All reads at a safe ts are serializable.
Currently safeTs is assigned on the primary replica, which requires additional synchronization (currently a huge critical section) and retries (added latency).
It is also bad from the pluggable replication point of view, because not all protocols require this concept.

Safe ts behavior was modified in the following way (a simplified sketch follows the list):

  1. Safe timestamp generation is moved out of the primary replica to the replication layer, making it protocol specific. All request synchronization is removed from the primary replica.
  2. The generated timestamp is applied to a command by binary patching when it enters the raft pipeline on the leader.
  3. Guarantees of monotonic ts generation when the raft leader changes were added:
    3.1 The raft election timeout now accounts for the max clock skew. When a new election starts on a node, its local time is higher than the last generated safe ts.
    3.2 The HLC is propagated in TimeoutNow requests when a leader directly transfers ownership to another candidate, to maintain proper clock ordering.
  4. Safe timestamp reordering is now an assertion condition that should never happen. The corresponding error code is removed, as the user should never see it.
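
For illustration, a minimal sketch of the intended scheme. All names here (SafeTimestampAssigner, nextSafeTs, effectiveElectionTimeoutMs) are hypothetical and simplified; this is not the actual Ignite code:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of protocol-level safe ts assignment; not the real Ignite classes. */
class SafeTimestampAssigner {
    private final AtomicLong lastTs = new AtomicLong();

    /**
     * Assigns a safe ts on the replication layer, right before the command enters the
     * raft pipeline on the leader. The result is strictly monotonic and never behind
     * the local clock, so the primary replica needs no extra synchronization or retries.
     */
    long nextSafeTs(long localClockNow) {
        return lastTs.updateAndGet(prev -> Math.max(prev + 1, localClockNow));
    }

    /**
     * On leader change the election timeout must cover the max clock skew, so a newly
     * elected node's local time is already above the last safe ts generated elsewhere.
     */
    static long effectiveElectionTimeoutMs(long configuredTimeoutMs, long maxClockSkewMs) {
        return Math.max(configuredTimeoutMs, maxClockSkewMs);
    }
}
```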

Benchmark results
Oracle JDK 21.0.4, Xeon Silver 4314, aipersist engine (20G page cache size)

  1. Direct writes to storage (IGNITE_SKIP_REPLICATION_IN_BENCHMARK=true, IGNITE_SKIP_STORAGE_UPDATE_IN_BENCHMARK=false)
    master revision=32737c0dc9fcd0632ba37e2949a40b199429fddb

All runs: UpsertKvBenchmark.upsert, batch=1, fsync=false, partitionCount=1, thrpt mode, 20 iterations.

| Threads | Version | Score (ops/s) | Error       |
|--------:|---------|--------------:|-------------|
|       8 | new     |    197936.874 | ± 12727.709 |
|      16 | new     |    254981.169 | ± 21278.635 |
|      32 | new     |    286127.032 | ± 16145.256 |
|       8 | old     |     86624.141 | ± 3472.632  |
|      16 | old     |     89446.504 | ± 6623.490  |
|      32 | old     |     89516.016 | ± 6092.740  |

It is clear that the old version shows no scaling for writes to a single partition.

  2. Full raft pipeline, same hardware
    LOGIT_STORAGE_ENABLED=true
    IGNITE_SKIP_REPLICATION_IN_BENCHMARK=false
    IGNITE_SKIP_STORAGE_UPDATE_IN_BENCHMARK=false

All runs: UpsertKvBenchmark.upsert, batch=1, fsync=false, partitionCount=32, thrpt mode, 20 iterations.

| Threads | Version | Score (ops/s) | Error       |
|--------:|---------|--------------:|-------------|
|      32 | new     |    229083.089 | ± 36856.962 |
|      32 | old     |    181908.090 | ± 26821.026 |

@@ -447,6 +447,10 @@ public boolean startRaftNode(
// Thread pools are shared by all raft groups.
NodeOptions nodeOptions = opts.copy();

// When a new election starts on a node, it has local physical time higher than the last generated safe ts
// because we wait out the clock skew.
nodeOptions.setElectionTimeoutMs(Math.max(nodeOptions.getElectionTimeoutMs(), groupOptions.maxClockSkew()));
Contributor:

How are you going to guarantee that it has a local physical time higher than the last generated safe ts in the case of an immediate leader election?
E.g. if there is only one node in the partition (say the partition was rebalanced from A to B).
I'm not sure whether that is the only case of an immediate leader election attempt.

Contributor Author:

Generally, the leader lease timeout enforces this condition.
I know of only one scenario where manual ordering propagation is required; see the comment below on TimeoutNowRequest.
For a single-node partition I see zero issues.
Can you provide more details?

Contributor Author:

I've investigated this scenario and made sure everything is OK, because:

  1. When the configuration is changed from A to B, on the new configuration commit A steps down and sends a TimeoutNowRequest to B (a simplified sketch of this handoff follows).
  2. If A dies before sending the request, B will elect itself leader after the previous leader's (A's) lease timeout.

Added a new test for this scenario: org.apache.ignite.distributed.ReplicasSafeTimePropagationTest#testSafeTimeReorderingOnClusterShrink
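
A minimal sketch of the direct handoff path, assuming hypothetical message and handler names (TimeoutNowMsg, onTimeoutNow); the actual jraft messages and handlers differ:

```java
// Illustrative sketch only; names are hypothetical, not the jraft API.
class CandidateNode {
    /** HLC value piggybacked by the old leader on its TimeoutNow-style message. */
    record TimeoutNowMsg(long senderHlc) {}

    private long localHlc;

    /**
     * On a direct leadership handoff the candidate first advances its own clock past the
     * old leader's HLC, so any safe ts it generates after winning the election stays
     * above every safe ts already produced by the old leader.
     */
    void onTimeoutNow(TimeoutNowMsg msg) {
        localHlc = Math.max(localHlc, msg.senderHlc() + 1);
        startElection();
    }

    private void startElection() {
        // Begin the election as usual; omitted here.
    }
}
```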

Contributor:

> B will elect itself leader after the previous leader's (A's) lease timeout

Do you mean the raft-leader-lease timeout or the primary-replica-lease timeout here?

Contributor Author:

raft-leader-lease

Contributor:

Could you please share the place in the code where raft waits for the previous leader's lease to expire before proposing a new one?

Contributor Author:

This happens in [1] and [2]:
[1] org.apache.ignite.raft.jraft.core.NodeImpl#handleElectionTimeout
[2] org.apache.ignite.raft.jraft.core.NodeImpl#handlePreVoteRequest
Elections don't start while the current leader is active by lease.
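
A simplified sketch of that lease check, illustrative only and not the actual NodeImpl logic:

```java
// The election-timeout and pre-vote handlers ignore the trigger while the current
// leader's lease is still considered valid.
final class LeaderLeaseCheck {
    static boolean canStartElection(long nowMs, long lastLeaderContactMs, long leaderLeaseTimeoutMs) {
        boolean leaseStillValid = nowMs - lastLeaderContactMs < leaderLeaseTimeoutMs;
        return !leaseStillValid;
    }
}
```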

Contributor:

So getLeaderLeaseTimeoutMs is required to be >= maxClockSkew, right? It seems that this is not guaranteed, because it's possible to set any value using org.apache.ignite.raft.jraft.option.NodeOptions#setElectionTimeoutMs. With the defaults it should work, though.

Contributor Author:

Not quite. We take the max of both, so the final value is safe to use:
Math.max(nodeOptions.getElectionTimeoutMs(), groupOptions.maxClockSkew())

…nto ignite-23304

# Conflicts:
#	modules/replicator/src/main/java/org/apache/ignite/internal/replicator/ReplicaManager.java