-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve failover mechanism for primary and secondary replica connections #243
Conversation
… secondary replica
rpc/RPCSession.java
Outdated
|
||
private void sleepWait() { | ||
try { | ||
Thread.sleep(2000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the use of Thread.sleep
here is justified since there's no other way to wait before performing retry
Grakn.java
Outdated
READ_REPLICA(2), | ||
WRITE(1); | ||
WRITE(1), | ||
READ_SECONDARY(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reasoning to rename it to READ_SECONDARY
is because it allows you to read from secondary replicas, whereas READ
and WRITE
reads to the primary replica.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We agreed in a verbal discussion with Haikal to get rid of the READ_SECONDARY
transaction type and replace it with a new Option
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -50,8 +50,8 @@ def graknlabs_dependencies(): | |||
def graknlabs_protocol(): | |||
git_repository( | |||
name = "graknlabs_protocol", | |||
remote = "https://github.com/graknlabs/protocol", | |||
tag = "2.0.0-alpha-6", # sync-marker: do not remove this comment, this is used for sync-dependencies by @graknlabs_protocol | |||
remote = "https://github.com/lolski/protocol", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert to graknlabs
when typedb/typedb-protocol#110 is merged.
assert !getMessage().contains("%s"); | ||
this.errorMessage = error; | ||
} | ||
|
||
public static GraknClientException of(StatusRuntimeException statusRuntimeException) { | ||
if (statusRuntimeException.getStatus().getCode() == Status.Code.UNAVAILABLE) { | ||
return new GraknClientException(ErrorMessage.Client.UNABLE_TO_CONNECT); | ||
} else if (statusRuntimeException.getStatus().getCode() == Status.Code.INTERNAL && statusRuntimeException.getStatus().getDescription().contains("[RFT01]")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular if block I want to check if the server throws an exception because there's no leader yet.
The server will throw an exception of code "[RFT01]", and the only way to propagate that information about the exception is by embedding it in the message, hence the need for getDescription().contains(...)
.
Is there a better way to propagate exceptions than this?
Do you have better ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best alternatives that we have right now would be to either:
- change the protocol and server: don't throw an exception if there's no leader, but rather encode the error in protobuf (but this modifies the behaviour in a strange way); or
- add a new method to protocol and server to check if the leader exists - but this is both inefficient and non-atomic.
So in conclusion parsing the error message is the least bad option right now.
There are plans, in the future, to implement a dedicated Error
message in protocol, which we will be able to interpret without the need for hacks: see https://github.com/graknlabs/client-java/issues/180 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved the hacky code into its own method and added a TODO: 268c884
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating review to reflect recent chat
GraknOptions.java
Outdated
} | ||
|
||
public Cluster asCluster() { | ||
if (isCluster()) return (Cluster) this; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be implemented more cleanly by just throwing in this method, and overriding asCluster
in GraknOptions.Cluster
to return this
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does look nicer: 3041170
Thread.sleep(2000); | ||
} catch (InterruptedException e2) { | ||
throw new GraknClientException(UNEXPECTED_INTERRUPTION); | ||
} | ||
} | ||
|
||
public static class Database { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this class distinct from the concept of a database (i.e. a knowledge graph?) If so, do you think we can disambiguate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is distinct from the concept of a "Grakn database" but I think it should be fine since it's already contextualised to be within RPCSession
.
The name is consistent with the name that we are using in the protocol definition: https://github.com/graknlabs/protocol/blob/master/protobuf/cluster/database.proto.
I do feel that something here can be improved, but it would have to be the overall structure rather than just this one part. I think we should do it as part of the major refactor we're going to do after we're done with Cluster alpha.
What do you think @alexjpwalker ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfectly fine with that
rpc/RPCSession.java
Outdated
); | ||
LOG.info("Opening a transaction of of type '{}' to leader '{}'", type, selected); | ||
return selection.transaction(type, options); | ||
if (!options.isCluster()) throw new GraknClientException(ILLEGAL_CAST, options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this line is redundant and should be deleted. GraknOptions.asCluster
already throws if you're performing an illegal cast.
if (!options.isCluster()) throw new GraknClientException(ILLEGAL_CAST, options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 8f7cf33
What is the goal of this PR?
We have increased resilience by improving the failover mechanism between replicas.
What are the changes implemented in this PR?
GraknOptions
toGraknOptions.core()
andGraknOptions.cluster()
, the later of which contains an option to read from secondary replica