-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support TE connection dynamic config + fix heartbeat deadlock #585
Conversation
Test Results130 files +1 130 suites +1 8m 11s ⏱️ +46s For more details on these failures, see this check. Results for commit 94d2c5c. ± Comparison against base commit 81bce54. ♻️ This comment has been updated with latest results. |
Uploaded ArtifactsTo use these artifacts in your Gradle project, paste the following lines in your build.gradle.
|
bbfe915
to
069aec3
Compare
0, | ||
heartBeatInterval.getSize(), | ||
heartBeatInterval.getUnit()); | ||
return new CustomScheduler() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's from google common lib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice - I think this is probably what we should use for ExponentialBackoffAbstractScheduledService.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @fdc-ntflx
...er/mantis-server-agent/src/main/java/io/mantisrx/server/agent/ResourceManagerGatewayCxn.java
Outdated
Show resolved
Hide resolved
0, | ||
heartBeatInterval.getSize(), | ||
heartBeatInterval.getUnit()); | ||
return new CustomScheduler() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice - I think this is probably what we should use for ExponentialBackoffAbstractScheduledService.
* limitations under the License. | ||
*/ | ||
|
||
package io.mantisrx.common.properties; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this to mantis-config instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is "mantis-config"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am proposing moving these classes under a package that's specific to configuration such as io.mantisrx.config.dynamic
. Currently, this class is included in the mantis-common package. However, I believe it would be more efficient to utilize a gradle module agnostic approach. This would enable us to refactor or divide mantis-common into multiple modules with ease. As a result, users of this class would only need to modify the location from which they retrieve the code, rather than modifying the import statement as well.
@@ -60,6 +63,7 @@ class ResourceManagerGatewayCxn extends ExponentialBackoffAbstractScheduledServi | |||
@Getter | |||
private volatile boolean registered = false; | |||
|
|||
private boolean hasRan = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
volatile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think there is no actual multi-thread access for this one?
...er/mantis-server-agent/src/main/java/io/mantisrx/server/agent/ResourceManagerGatewayCxn.java
Outdated
Show resolved
Hide resolved
@@ -194,6 +193,9 @@ public TaskExecutor( | |||
this.registeredState = new DurableBooleanState( | |||
new File(workerConfiguration.getRegistrationStoreDir(), | |||
"rmCxnState.txt").getAbsolutePath()); | |||
this.rpcCallTimeoutMsDp = | |||
ConfigUtils.getDynamicPropertyLong("heartbeatTimeoutMs", WorkerConfiguration.class, | |||
workerConfiguration.heartbeatTimeoutMs(), this.dynamicPropertiesLoader); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we define this property in both places. Wondering if we can just have this defined in one place to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Discussed offline]: leaving as it is to avoid duplicating the skife config annotations.
@@ -307,13 +309,21 @@ private ResourceManagerGatewayCxn newResourceManagerCxn() { | |||
ResourceClusterGateway resourceManagerGateway = resourceClusterGatewaySupplier.getCurrent(); | |||
|
|||
// let's register ourselves with the resource manager | |||
// todo: move timeout/retry to apply values from this.dynamicPropertiesLoader | |||
LongDynamicProperty heartbeatIntervalDp = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
public abstract void initalize(); | ||
void initalize(); | ||
|
||
public abstract void shutdown(); | ||
void shutdown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get rid of these methods in the interface.
protected final T defaultValue; | ||
protected T lastValue; | ||
protected Instant lastRefreshTime; | ||
private final long refreshDuration; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Use java::util::Duration or reflect the unit in the name.
* limitations under the License. | ||
*/ | ||
|
||
package io.mantisrx.common.properties; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am proposing moving these classes under a package that's specific to configuration such as io.mantisrx.config.dynamic
. Currently, this class is included in the mantis-common package. However, I believe it would be more efficient to utilize a gradle module agnostic approach. This would enable us to refactor or divide mantis-common into multiple modules with ease. As a result, users of this class would only need to modify the location from which they retrieve the code, rather than modifying the import statement as well.
protected T lastValue; | ||
protected Instant lastRefreshTime; | ||
private final long refreshDuration; | ||
private final Clock clock; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
this.refreshDuration = Long.parseLong( | ||
propertiesLoader.getStringValue(DYNAMICPROPERTY_REFRESH_SECONDS_KEY, "30")); | ||
} | ||
catch (NumberFormatException ex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move this to the previous line.
return this.propertiesLoader.getStringValue(this.propertyName, this.lastValue.toString()); | ||
} | ||
|
||
protected boolean shouldRefresh() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason why you want to expose 'shouldRefresh' to child classes? Instead, you can perform the check in this class and only reach out to the subclass when necessary to obtain the value. This approach eliminates the need to duplicate this logic in all subclasses.
import java.lang.reflect.Method; | ||
import org.skife.config.Config; | ||
|
||
public class ConfigUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would move this under the config package too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one has a dep to skife (which is why it's not in common module in the first place) and it's better to keep the Skife-related parts on a higher level module so it doesn't pollute the whole hierarchy.
72786e6
to
94d2c5c
Compare
Context
Add TE-level dynamic configs for connection interval/timeout.
Refactor resource gateway connection on TE so that the startup call doesn't deadlock on the main thread when calling "getCurrentReport" to build heartbeat payload. Move both registration and HB calls into service runIteration.
Also, remove the "callAsync" on TaskExecutor's getCurrentReport method which caused a deadlock timeout on TE reconnection (where resource gateway cxn tries to create HB request on the main thread while the main thread is on "cxn.startAsync().awaitRunning()").This might cause stale HB info to the control plane in a race scenario but should be recoverable on the control plane side with retry.
Checklist
./gradlew build
compiles code correctly./gradlew test
passes all tests