Improve failover mechanism for primary and secondary replica connections #243

lolski · 2021-01-24T19:52:40Z

What is the goal of this PR?

We have increased resilience by improving the failover mechanism between replicas.

What are the changes implemented in this PR?

Remove the terminology "leader" / "non-leader", which is Raft-specific. We now use the term "replica" to refer to a single copy of a database. The active replica that can receive data is now called the "primary replica", whereas the passive ones are called "secondary replicas".
Split GraknOptions into GraknOptions.core() and GraknOptions.cluster(), the latter of which contains an option to read from a secondary replica.
Increase resilience:
- Database and cluster discovery are now re-attempted against all cluster members instead of just one of them
- When the cluster has not yet decided which replica is the primary replica, the client waits and retries instead of simply failing
Change info-level logs to debug
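
The two resilience points above can be illustrated with a minimal sketch. It is not the code in this PR: `FailoverSketch`, `callWithFailover`, both exception types, and the `maxRounds` / `waitMillis` parameters are hypothetical names chosen for illustration. It assumes an unreachable member means "try the next member" and a missing primary means "wait, then start over".

```java
import java.util.List;
import java.util.function.Function;

// Illustrative only: every name below is hypothetical and not taken from client-java.
class FailoverSketch {

    /** Stands in for "this cluster member cannot be reached". */
    static class UnreachableException extends RuntimeException {}

    /** Stands in for "the cluster has not decided on a primary replica yet". */
    static class NoPrimaryYetException extends RuntimeException {}

    /** Tries each member in turn; waits and starts another round if none succeeded. */
    static <T> T callWithFailover(List<String> members, Function<String, T> request,
                                  int maxRounds, long waitMillis) {
        RuntimeException last = null;
        for (int round = 0; round < maxRounds; round++) {
            for (String member : members) {
                try {
                    return request.apply(member);
                } catch (UnreachableException e) {
                    last = e;   // this member is down: try the next one
                } catch (NoPrimaryYetException e) {
                    last = e;   // no primary yet: stop this round and wait
                    break;
                }
            }
            if (round < maxRounds - 1) sleep(waitMillis);
        }
        throw new IllegalStateException("No cluster member could serve the request", last);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag set for callers
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }
}
```

Bounding the number of rounds is a design choice of this sketch; whether the actual client retries a fixed number of times is not stated in the description.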


lolski · 2021-01-25T12:51:46Z

rpc/RPCSession.java

+
+        private void sleepWait() {
+            try {
+                Thread.sleep(2000);


I think the use of Thread.sleep here is justified, since there's no other way to wait before performing a retry.

GraknClient.java

lolski · 2021-01-25T13:11:05Z

Grakn.java

-            READ_REPLICA(2),
-            WRITE(1);
+            WRITE(1),
+            READ_SECONDARY(2);


The reason for renaming it to READ_SECONDARY is that it allows you to read from secondary replicas, whereas READ and WRITE both go to the primary replica.

We agreed in a verbal discussion with Haikal to get rid of the READ_SECONDARY transaction type and replace it with a new Option.

Fixed in fec34cc, 62c4368, 8f31ae6, and b8d3aa5 (I don't know why it took me 4 commits to do so :D )
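
To illustrate the outcome of this thread, here is a rough sketch of routing driven by an option rather than by a third transaction type. The names `ReplicaSelectionSketch`, `Replica`, `ClusterOptions`, `readAnyReplica`, and `select` are assumptions made for this example; the option name actually introduced by the commits above is not shown in this conversation.

```java
import java.util.List;

// Illustrative only: the class, field, and method names are hypothetical.
class ReplicaSelectionSketch {

    enum TransactionType { READ, WRITE }

    static final class Replica {
        final String address;
        final boolean primary;

        Replica(String address, boolean primary) {
            this.address = address;
            this.primary = primary;
        }
    }

    static final class ClusterOptions {
        private boolean readAnyReplica = false; // hypothetical flag, off by default

        ClusterOptions readAnyReplica(boolean value) {
            this.readAnyReplica = value;
            return this;
        }

        boolean readAnyReplica() {
            return readAnyReplica;
        }
    }

    /** Writes always go to the primary; reads may go elsewhere only if the option allows it. */
    static Replica select(List<Replica> replicas, TransactionType type, ClusterOptions options) {
        boolean secondaryAllowed = type == TransactionType.READ && options.readAnyReplica();
        if (secondaryAllowed && !replicas.isEmpty()) {
            return replicas.get(0); // any known replica is acceptable
        }
        return replicas.stream()
                .filter(r -> r.primary)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("No primary replica known"));
    }
}
```

With this shape the transaction type keeps only READ and WRITE, and the choice of replica becomes a property of the options object.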

lolski · 2021-01-25T13:11:49Z

dependencies/graknlabs/repositories.bzl

@@ -50,8 +50,8 @@ def graknlabs_dependencies():
 def graknlabs_protocol():
    git_repository(
        name = "graknlabs_protocol",
-        remote = "https://github.com/graknlabs/protocol",
-        tag = "2.0.0-alpha-6", # sync-marker: do not remove this comment, this is used for sync-dependencies by @graknlabs_protocol
+        remote = "https://github.com/lolski/protocol",


Revert to graknlabs when typedb/typedb-protocol#110 is merged.

lolski · 2021-01-25T13:21:52Z

common/exception/GraknClientException.java

        assert !getMessage().contains("%s");
        this.errorMessage = error;
    }

    public static GraknClientException of(StatusRuntimeException statusRuntimeException) {
        if (statusRuntimeException.getStatus().getCode() == Status.Code.UNAVAILABLE) {
            return new GraknClientException(ErrorMessage.Client.UNABLE_TO_CONNECT);
+        } else if (statusRuntimeException.getStatus().getCode() == Status.Code.INTERNAL && statusRuntimeException.getStatus().getDescription().contains("[RFT01]")) {


In this particular if block I want to check if the server throws an exception because there's no leader yet.

The server will throw an exception with the code "[RFT01]", and the only way to propagate that information is to embed it in the message, hence the need for getDescription().contains(...).

Is there a better way to propagate exceptions than this?

Do you have better ideas?

The best alternatives that we have right now would be to either:

change the protocol and server: don't throw an exception if there's no leader, but rather encode the error in protobuf (but this modifies the behaviour in a strange way); or

add a new method to protocol and server to check if the leader exists - but this is both inefficient and non-atomic.

So in conclusion parsing the error message is the least bad option right now.

There are plans, in the future, to implement a dedicated Error message in protocol, which we will be able to interpret without the need for hacks: see https://github.com/graknlabs/client-java/issues/180 .

I've moved the hacky code into its own method and added a TODO: 268c884
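
For illustration, the extracted check could look roughly like the sketch below. It is not the code from 268c884: `ReplicaErrorSketch`, `Code`, and `isNoPrimaryError` are hypothetical names, and the `Code` enum is a stand-in for the gRPC status code so that the example stays self-contained. The "[RFT01]" marker is the one quoted in the diff above. The null check is an addition of this sketch, on the assumption that a status may carry no description.

```java
// Illustrative only: a self-contained stand-in for the gRPC status types.
class ReplicaErrorSketch {

    /** Stand-in for the status codes relevant to this check. */
    enum Code { UNAVAILABLE, INTERNAL, OTHER }

    // TODO: replace message parsing once the protocol has a dedicated error message.
    static final String NO_PRIMARY_MARKER = "[RFT01]";

    /** True if the server reported that no primary replica has been decided yet. */
    static boolean isNoPrimaryError(Code code, String description) {
        return code == Code.INTERNAL
                && description != null
                && description.contains(NO_PRIMARY_MARKER);
    }
}
```

Keeping the string match in a single method means only one place changes when the protocol gains a structured error.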


alexjpwalker

Updating review to reflect recent chat

alexjpwalker · 2021-01-26T09:34:55Z

GraknOptions.java

+    }
+
+    public Cluster asCluster() {
+        if (isCluster()) return (Cluster) this;


This could be implemented more cleanly by just throwing in this method, and overriding asCluster in GraknOptions.Cluster to return this.

This does look nicer: 3041170
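
The suggestion in this thread can be sketched as follows. This is a simplified illustration rather than the contents of 3041170: `OptionsSketch` and its nested classes are hypothetical stand-ins for GraknOptions and GraknOptions.Cluster, and IllegalStateException stands in for the client's own exception type.

```java
// Illustrative only: simplified stand-ins for GraknOptions and GraknOptions.Cluster.
class OptionsSketch {

    static class Options {
        boolean isCluster() {
            return false;
        }

        /** The base class cannot be viewed as cluster options, so it always throws. */
        Cluster asCluster() {
            throw new IllegalStateException("Illegal cast: these are not cluster options");
        }
    }

    static class Cluster extends Options {
        @Override
        boolean isCluster() {
            return true;
        }

        /** The subclass overrides the method and simply returns itself, with no cast. */
        @Override
        Cluster asCluster() {
            return this;
        }
    }
}
```

Because the base method already throws, callers do not need a separate isCluster() guard before calling asCluster().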

rpc/RPCSession.java

alexjpwalker · 2021-01-26T10:02:13Z

rpc/RPCSession.java

+                Thread.sleep(2000);
+            } catch (InterruptedException e2) {
+                throw new GraknClientException(UNEXPECTED_INTERRUPTION);
+            }
        }

        public static class Database {


Is this class distinct from the concept of a database (i.e. a knowledge graph)? If so, do you think we can disambiguate?

Yes, it is distinct from the concept of a "Grakn database" but I think it should be fine since it's already contextualised to be within RPCSession.

The name is consistent with the name that we are using in the protocol definition: https://github.com/graknlabs/protocol/blob/master/protobuf/cluster/database.proto.

I do feel that something here can be improved, but it would have to be the overall structure rather than just this one part. I think we should do it as part of the major refactor we're going to do after we're done with Cluster alpha.

What do you think @alexjpwalker ?

Perfectly fine with that

alexjpwalker · 2021-01-26T10:04:13Z

rpc/RPCSession.java

-            );
-            LOG.info("Opening a transaction of of type '{}' to leader '{}'", type, selected);
-            return selection.transaction(type, options);
+            if (!options.isCluster()) throw new GraknClientException(ILLEGAL_CAST, options);


I believe this line is redundant and should be deleted. GraknOptions.asCluster already throws if you're performing an illegal cast.

Suggested change

if (!options.isCluster()) throw new GraknClientException(ILLEGAL_CAST, options);

Fixed in 8f7cf33


Improve failover mechanism for both read/write to primary and read to secondary replica

7baefda

lolski requested review from alexjpwalker and haikalpribadi as code owners January 24, 2021 19:52

grabl assigned haikalpribadi and alexjpwalker Jan 24, 2021

lolski added type: feature x: do not merge labels Jan 24, 2021

Ganeshwara Hananda added 5 commits January 24, 2021 20:01

Merge branch 'master' into failover-reattempt-v2

4cf0cfa

Add log.debug()

7f2207f

Fix log statements

9a44e1e

Add exception helper

d937287

Improve logging

2c8751f

lolski commented Jan 25, 2021


lolski removed the x: do not merge label Jan 25, 2021

Replace the terminology leader / non-leader to primary / secondary replica

af4c91d

alexjpwalker approved these changes Jan 25, 2021


alexjpwalker self-requested a review January 25, 2021 16:01

alexjpwalker requested changes Jan 25, 2021


Ganeshwara Hananda added 9 commits January 25, 2021 17:00

Update replica exception code and extract the hacky method

268c884

Introduce GraknOptions.core() and GraknOptions.cluster()

fec34cc

Update @graknlabs_protocol

f353ade

Update GraknOptions

62c4368

Bugfix

c197c4a

Bugfix

2561b6b

Fix checkstyle errors

01c7361

Remove READ_SECONDARY options

8f31ae6

Remove READ_SECONDARY options

b8d3aa5

alexjpwalker requested changes Jan 26, 2021


Ganeshwara Hananda added 2 commits January 26, 2021 12:39

Update @graknlabs_protocol and sync the new field name. Rename sleepWait().

641b576

Change the condition for determining connecting to primary or secondary replica

5dd3a4d

Ganeshwara Hananda added 2 commits January 26, 2021 12:46

Make asCluster() more concise

3041170

Remove redundant checks which is already done by asCluster()

8f7cf33

alexjpwalker approved these changes Jan 26, 2021


lolski merged commit 41c18e4 into typedb:master Jan 26, 2021

lolski deleted the failover-reattempt-v2 branch January 26, 2021 13:56

lolski added this to the 2.0.0-alpha-7 milestone Jan 28, 2021
