Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve failover mechanism for primary and secondary replica connections #243

Merged
merged 20 commits into from
Jan 26, 2021

Conversation

lolski
Copy link
Member

@lolski lolski commented Jan 24, 2021

What is the goal of this PR?

We have increased resilience by improving the failover mechanism between replicas.

What are the changes implemented in this PR?

  • Remove the terminology leader / non-leader which are Raft specific. We now use the terminology "replica" to refer to a single copy of a database. The active replica that can receive data is now called "primary replica" whereas the passive ones "secondary replica"
  • Split GraknOptions to GraknOptions.core() and GraknOptions.cluster(), the later of which contains an option to read from secondary replica
  • Increase resilience:
    • Database and cluster discovery will now be re-attempted to all cluster members instead of just to one of them
    • When the cluster have not decided which replica is the primary replica, the client will wait and retry instead of simply failing
  • Changed info-level log to debug


private void sleepWait() {
try {
Thread.sleep(2000);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the use of Thread.sleep here is justified since there's no other way to wait before performing retry

GraknClient.java Outdated Show resolved Hide resolved
Grakn.java Outdated
READ_REPLICA(2),
WRITE(1);
WRITE(1),
READ_SECONDARY(2);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasoning to rename it to READ_SECONDARY is because it allows you to read from secondary replicas, whereas READ and WRITE reads to the primary replica.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agreed in a verbal discussion with Haikal to get rid of the READ_SECONDARY transaction type and replace it with a new Option.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fec34cc, 62c4368, 8f31ae6, and b8d3aa5 (I don't know why it took me 4 commits to do so :D )

@@ -50,8 +50,8 @@ def graknlabs_dependencies():
def graknlabs_protocol():
git_repository(
name = "graknlabs_protocol",
remote = "https://github.com/graknlabs/protocol",
tag = "2.0.0-alpha-6", # sync-marker: do not remove this comment, this is used for sync-dependencies by @graknlabs_protocol
remote = "https://github.com/lolski/protocol",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert to graknlabs when typedb/typedb-protocol#110 is merged.

assert !getMessage().contains("%s");
this.errorMessage = error;
}

public static GraknClientException of(StatusRuntimeException statusRuntimeException) {
if (statusRuntimeException.getStatus().getCode() == Status.Code.UNAVAILABLE) {
return new GraknClientException(ErrorMessage.Client.UNABLE_TO_CONNECT);
} else if (statusRuntimeException.getStatus().getCode() == Status.Code.INTERNAL && statusRuntimeException.getStatus().getDescription().contains("[RFT01]")) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular if block I want to check if the server throws an exception because there's no leader yet.

The server will throw an exception of code "[RFT01]", and the only way to propagate that information about the exception is by embedding it in the message, hence the need for getDescription().contains(...).

Is there a better way to propagate exceptions than this?

Do you have better ideas?

Copy link
Member

@alexjpwalker alexjpwalker Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best alternatives that we have right now would be to either:

  • change the protocol and server: don't throw an exception if there's no leader, but rather encode the error in protobuf (but this modifies the behaviour in a strange way); or
  • add a new method to protocol and server to check if the leader exists - but this is both inefficient and non-atomic.

So in conclusion parsing the error message is the least bad option right now.

There are plans, in the future, to implement a dedicated Error message in protocol, which we will be able to interpret without the need for hacks: see https://github.com/graknlabs/client-java/issues/180 .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the hacky code into its own method and added a TODO: 268c884

@alexjpwalker alexjpwalker self-requested a review January 25, 2021 16:01
Copy link
Member

@alexjpwalker alexjpwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating review to reflect recent chat

}

public Cluster asCluster() {
if (isCluster()) return (Cluster) this;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be implemented more cleanly by just throwing in this method, and overriding asCluster in GraknOptions.Cluster to return this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does look nicer: 3041170

rpc/RPCSession.java Outdated Show resolved Hide resolved
rpc/RPCSession.java Outdated Show resolved Hide resolved
rpc/RPCSession.java Show resolved Hide resolved
Thread.sleep(2000);
} catch (InterruptedException e2) {
throw new GraknClientException(UNEXPECTED_INTERRUPTION);
}
}

public static class Database {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this class distinct from the concept of a database (i.e. a knowledge graph?) If so, do you think we can disambiguate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is distinct from the concept of a "Grakn database" but I think it should be fine since it's already contextualised to be within RPCSession.

The name is consistent with the name that we are using in the protocol definition: https://github.com/graknlabs/protocol/blob/master/protobuf/cluster/database.proto.

I do feel that something here can be improved, but it would have to be the overall structure rather than just this one part. I think we should do it as part of the major refactor we're going to do after we're done with Cluster alpha.

What do you think @alexjpwalker ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfectly fine with that

);
LOG.info("Opening a transaction of of type '{}' to leader '{}'", type, selected);
return selection.transaction(type, options);
if (!options.isCluster()) throw new GraknClientException(ILLEGAL_CAST, options);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this line is redundant and should be deleted. GraknOptions.asCluster already throws if you're performing an illegal cast.

Suggested change
if (!options.isCluster()) throw new GraknClientException(ILLEGAL_CAST, options);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 8f7cf33

@lolski lolski merged commit 41c18e4 into typedb:master Jan 26, 2021
@lolski lolski deleted the failover-reattempt-v2 branch January 26, 2021 13:56
@lolski lolski added this to the 2.0.0-alpha-7 milestone Jan 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants