Support TE connection dynamic config + fix heartbeat deadlock #585

Merged
Andyz26 merged 9 commits into master from andyz/teConfigOnFP
Nov 16, 2023

Conversation

Andyz26
Collaborator

@Andyz26 Andyz26 commented Nov 13, 2023

Context

Add TE-level dynamic configs for connection interval/timeout.

Refactor the resource gateway connection on the TE so that the startup call doesn't deadlock on the main thread when calling "getCurrentReport" to build the heartbeat (HB) payload. Move both the registration and HB calls into the service's runIteration.

Also remove the "callAsync" from TaskExecutor's getCurrentReport method, which caused a deadlock timeout on TE reconnection: the resource gateway cxn tries to create the HB request on the main thread while the main thread is blocked in "cxn.startAsync().awaitRunning()".
As a result, the control plane might receive stale HB info in a race scenario, but it should be able to recover on its side with a retry.
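
The deadlock described above can be reproduced in miniature. The sketch below is not the Mantis code: `mainThread`, `reportViaMainThread`, and `reportDirect` are hypothetical stand-ins, and it assumes the TE's state is owned by a single-threaded executor. It shows why hopping back onto a thread that is already blocked waiting for you times out, and why a direct (possibly stale) read does not.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class HeartbeatDeadlockSketch {
    // Hypothetical stand-in for the single thread that owns the TE's state.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();
    // volatile so a direct read from another thread sees the latest write.
    private volatile String currentReport = "available";

    // Old shape: build the report by hopping onto the main thread.
    CompletableFuture<String> reportViaMainThread() {
        return CompletableFuture.supplyAsync(() -> currentReport, mainThread);
    }

    // New shape: read the state directly; the value may be slightly stale.
    String reportDirect() {
        return currentReport;
    }

    // Simulates startup: the main thread blocks until the connection thread
    // has built its first heartbeat, or until timeoutMs elapses.
    String startOnMainThread(boolean hopToMainThread, long timeoutMs) {
        Future<String> startup = mainThread.submit(() -> {
            CompletableFuture<String> heartbeat = CompletableFuture.supplyAsync(
                () -> hopToMainThread ? reportViaMainThread().join() : reportDirect());
            try {
                return heartbeat.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                return "timed out";
            }
        });
        try {
            return startup.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        mainThread.shutdownNow();
    }

    public static void main(String[] args) {
        HeartbeatDeadlockSketch sketch = new HeartbeatDeadlockSketch();
        System.out.println(sketch.startOnMainThread(true, 200));  // prints "timed out"
        System.out.println(sketch.startOnMainThread(false, 200)); // prints "available"
        sketch.shutdown();
    }
}
```

The trade-off matches the one stated in the description: the direct read avoids the wait cycle at the cost of possibly reporting a value that is about to change.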

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

github-actions bot commented Nov 13, 2023

Test Results

130 files +1, 130 suites +1, duration 8m 11s (+46s)
548 tests +2: 539 passed (+1), 8 skipped (±0), 1 failed (+1)
549 runs +3: 540 passed (+2), 8 skipped (±0), 1 failed (+1)

For more details on these failures, see this check.

Results for commit 94d2c5c. ± Comparison against base commit 81bce54.

♻️ This comment has been updated with latest results.

github-actions bot commented Nov 13, 2023

Uploaded Artifacts

To use these artifacts in your Gradle project, paste the following lines in your build.gradle.

resolutionStrategy {
    force "io.mantisrx:mantis-client:0.1.0-20231114.042158-442"
    force "io.mantisrx:mantis-common:0.1.0-20231114.042158-441"
    force "io.mantisrx:mantis-common-serde:0.1.0-20231114.042158-441"
    force "io.mantisrx:mantis-discovery-proto:0.1.0-20231114.042158-441"
    force "io.mantisrx:mantis-network:0.1.0-20231114.042158-441"
    force "io.mantisrx:mantis-remote-observable:0.1.0-20231114.042158-442"
    force "io.mantisrx:mantis-runtime:0.1.0-20231114.042158-442"
    force "io.mantisrx:mantis-runtime-loader:0.1.0-20231114.042158-442"
    force "io.mantisrx:mantis-shaded:0.1.0-20231114.042158-440"
    force "io.mantisrx:mantis-testcontainers:0.1.0-20231114.042158-111"
    force "io.mantisrx:mantis-connector-iceberg:0.1.0-20231114.042158-440"
    force "io.mantisrx:mantis-connector-job:0.1.0-20231114.042158-442"
    force "io.mantisrx:mantis-connector-kafka:0.1.0-20231114.042158-442"
    force "io.mantisrx:mantis-connector-publish:0.1.0-20231114.042158-441"
    force "io.mantisrx:mantis-control-plane-client:0.1.0-20231114.042158-441"
    force "io.mantisrx:mantis-control-plane-core:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-control-plane-server:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-examples-core:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-examples-groupby-sample:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-examples-jobconnector-sample:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-examples-mantis-publish-sample:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-examples-sine-function:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-examples-synthetic-sourcejob:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-examples-twitter-sample:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-examples-wordcount:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-publish-core:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-publish-netty:0.1.0-20231114.042158-434"
    force "io.mantisrx:mantis-publish-netty-guice:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-server-agent:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-server-worker:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-server-worker-client:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-source-job-kafka:0.1.0-20231114.042158-435"
    force "io.mantisrx:mantis-source-job-publish:0.1.0-20231114.042158-435"
}

@Andyz26 Andyz26 changed the title from "Andyz/te config on fp" to "Support TE connection dynamic config" Nov 13, 2023
0,
heartBeatInterval.getSize(),
heartBeatInterval.getUnit());
return new CustomScheduler() {
Contributor

Where is this defined?

Collaborator Author

It's from Google's common library (Guava).

Contributor

This is really nice - I think this is probably what we should use for ExponentialBackoffAbstractScheduledService.
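
For readers unfamiliar with the idea: Guava's `AbstractScheduledService.CustomScheduler` asks the service for the next delay before every iteration (via `getNextSchedule()`), so a changed heartbeat interval can take effect without restarting the service. The sketch below is a JDK-only analog of that pattern, not the PR's code and not Guava's implementation; `DynamicIntervalSketch`, `intervalMs`, and `runAndReschedule` are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class DynamicIntervalSketch {
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();
    private final LongSupplier intervalMs; // hypothetical stand-in for a dynamic property
    private final Runnable iteration;

    DynamicIntervalSketch(LongSupplier intervalMs, Runnable iteration) {
        this.intervalMs = intervalMs;
        this.iteration = iteration;
    }

    void start() {
        executor.schedule(this::runAndReschedule, 0, TimeUnit.MILLISECONDS);
    }

    private void runAndReschedule() {
        try {
            iteration.run();
        } catch (RuntimeException e) {
            // In this sketch a failed iteration does not stop later ones.
        }
        try {
            // Re-read the interval each time so a changed value applies to the next run.
            executor.schedule(this::runAndReschedule, intervalMs.getAsLong(),
                TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException e) {
            // stop() was called; nothing more to schedule.
        }
    }

    void stop() {
        executor.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong interval = new AtomicLong(50);
        CountDownLatch threeRuns = new CountDownLatch(3);
        DynamicIntervalSketch heartbeat =
            new DynamicIntervalSketch(interval::get, threeRuns::countDown);
        heartbeat.start();
        interval.set(10); // picked up the next time a delay is computed
        threeRuns.await(2, TimeUnit.SECONDS);
        heartbeat.stop();
    }
}
```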

* limitations under the License.
*/

package io.mantisrx.common.properties;
Contributor

Can we move this to mantis-config instead?

Collaborator Author

What is "mantis-config"?

Contributor

I am proposing moving these classes under a package that's specific to configuration, such as io.mantisrx.config.dynamic. Currently, this class lives in the mantis-common module. However, I believe a package name that is agnostic of the Gradle module would serve us better: it would let us refactor or split mantis-common into multiple modules with ease, and users of this class would only need to change the dependency they pull the code from, not the import statement as well.

@@ -60,6 +63,7 @@ class ResourceManagerGatewayCxn extends ExponentialBackoffAbstractScheduledServi
@Getter
private volatile boolean registered = false;

private boolean hasRan = false;
Contributor

volatile

Collaborator Author

I think there is no actual multi-threaded access to this one?

@Andyz26 Andyz26 changed the title from "Support TE connection dynamic config" to "Support TE connection dynamic config + fix heartbeat deadlock" Nov 15, 2023
@@ -194,6 +193,9 @@ public TaskExecutor(
this.registeredState = new DurableBooleanState(
new File(workerConfiguration.getRegistrationStoreDir(),
"rmCxnState.txt").getAbsolutePath());
this.rpcCallTimeoutMsDp =
ConfigUtils.getDynamicPropertyLong("heartbeatTimeoutMs", WorkerConfiguration.class,
workerConfiguration.heartbeatTimeoutMs(), this.dynamicPropertiesLoader);
Contributor

It looks like we define this property in both places. Wondering if we can just have this defined in one place to avoid confusion.

Collaborator Author

[Discussed offline]: leaving it as is to avoid duplicating the Skife config annotations.
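
One plausible reading of the `ConfigUtils.getDynamicPropertyLong(...)` call above, and only an assumption since its body is not shown in this thread, is that it looks up the named method on the configuration interface and reads the property key from that method's annotation, so the key string lives in exactly one place. A minimal sketch of that idea follows; `ConfigKey` is a hypothetical annotation standing in for Skife's `@Config`, and `WorkerSettings`, `keyFor`, `resolveLong`, and the key `worker.heartbeat.timeout.ms` are invented for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;

final class AnnotatedKeySketch {
    // Hypothetical stand-in for a config-key annotation such as Skife's @Config.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface ConfigKey {
        String value();
    }

    // Hypothetical configuration interface; the key is declared only here.
    interface WorkerSettings {
        @ConfigKey("worker.heartbeat.timeout.ms")
        long heartbeatTimeoutMs();
    }

    // Finds the property key declared on the given no-argument method.
    static String keyFor(Class<?> configType, String methodName) {
        try {
            Method method = configType.getMethod(methodName);
            ConfigKey key = method.getAnnotation(ConfigKey.class);
            if (key == null) {
                throw new IllegalArgumentException(methodName + " has no ConfigKey");
            }
            return key.value();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No method " + methodName, e);
        }
    }

    // Returns the override if present and parseable, else the static fallback.
    static long resolveLong(Map<String, String> overrides, Class<?> configType,
                            String methodName, long fallback) {
        String raw = overrides.get(keyFor(configType, methodName));
        if (raw == null) {
            return fallback;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        Map<String, String> overrides = Map.of("worker.heartbeat.timeout.ms", "9000");
        System.out.println(
            resolveLong(overrides, WorkerSettings.class, "heartbeatTimeoutMs", 5000L));
    }
}
```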

@@ -307,13 +309,21 @@ private ResourceManagerGatewayCxn newResourceManagerCxn() {
ResourceClusterGateway resourceManagerGateway = resourceClusterGatewaySupplier.getCurrent();

// let's register ourselves with the resource manager
// todo: move timeout/retry to apply values from this.dynamicPropertiesLoader
LongDynamicProperty heartbeatIntervalDp =
Contributor

Same as above.

Comment on lines 21 to 23
- public abstract void initalize();
+ void initalize();

- public abstract void shutdown();
+ void shutdown();
Contributor

Can we get rid of these methods in the interface?

protected final T defaultValue;
protected T lastValue;
protected Instant lastRefreshTime;
private final long refreshDuration;
Contributor

nit: Use java.time.Duration or reflect the unit in the name.
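
One way to act on this nit, sketched with hypothetical names (`RefreshIntervalSketch`, `parseRefresh`): parse the raw seconds string once into a `java.time.Duration`, so the unit travels with the value instead of being implied by a bare `long`. The 30-second default mirrors the value in this PR's diff; the rest is an assumption.

```java
import java.time.Duration;

final class RefreshIntervalSketch {
    static final Duration DEFAULT_REFRESH = Duration.ofSeconds(30);

    // Parses a seconds value such as "45"; falls back to the default when the
    // input is missing or not a number.
    static Duration parseRefresh(String rawSeconds) {
        if (rawSeconds == null) {
            return DEFAULT_REFRESH;
        }
        try {
            return Duration.ofSeconds(Long.parseLong(rawSeconds.trim()));
        } catch (NumberFormatException ex) {
            return DEFAULT_REFRESH;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRefresh("45").getSeconds());           // prints 45
        System.out.println(parseRefresh("not-a-number").getSeconds()); // prints 30
    }
}
```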

protected T lastValue;
protected Instant lastRefreshTime;
private final long refreshDuration;
private final Clock clock;
Contributor

nice

this.refreshDuration = Long.parseLong(
propertiesLoader.getStringValue(DYNAMICPROPERTY_REFRESH_SECONDS_KEY, "30"));
}
catch (NumberFormatException ex) {
Contributor

nit: move this to the previous line.

return this.propertiesLoader.getStringValue(this.propertyName, this.lastValue.toString());
}

protected boolean shouldRefresh() {
Contributor

Is there a specific reason why you want to expose 'shouldRefresh' to child classes? Instead, you can perform the check in this class and only reach out to the subclass when necessary to obtain the value. This approach eliminates the need to duplicate this logic in all subclasses.
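
The suggestion above is essentially the template-method pattern. A minimal sketch under stated assumptions: `Property`, `load`, and `SteppingClock` are hypothetical names, not the classes in this PR, and the injected `Clock` is used the same way the reviewed class appears to use it, to make the refresh check testable.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

final class RefreshingPropertySketch {
    // The base class owns the refresh check; subclasses only supply the value.
    abstract static class Property<T> {
        private final Duration refreshInterval;
        private final Clock clock;
        private T lastValue;
        private Instant lastRefreshTime; // null until the first load

        Property(T defaultValue, Duration refreshInterval, Clock clock) {
            this.lastValue = defaultValue;
            this.refreshInterval = refreshInterval;
            this.clock = clock;
        }

        // Hypothetical hook: return the latest value, or `current` if unavailable.
        protected abstract T load(T current);

        public final synchronized T getValue() {
            Instant now = clock.instant();
            if (lastRefreshTime == null
                    || !now.isBefore(lastRefreshTime.plus(refreshInterval))) {
                lastValue = load(lastValue);
                lastRefreshTime = now;
            }
            return lastValue;
        }
    }

    // Test clock that only moves when told to.
    static final class SteppingClock extends Clock {
        private Instant now = Instant.EPOCH;

        void advance(Duration step) { now = now.plus(step); }

        @Override public ZoneId getZone() { return ZoneOffset.UTC; }
        @Override public Clock withZone(ZoneId zone) { return this; }
        @Override public Instant instant() { return now; }
    }

    public static void main(String[] args) {
        SteppingClock clock = new SteppingClock();
        Property<Long> counter = new Property<Long>(0L, Duration.ofSeconds(30), clock) {
            @Override protected Long load(Long current) { return current + 1; }
        };
        System.out.println(counter.getValue()); // 1: the first read always loads
        clock.advance(Duration.ofSeconds(29));
        System.out.println(counter.getValue()); // 1: still within the interval
        clock.advance(Duration.ofSeconds(1));
        System.out.println(counter.getValue()); // 2: interval elapsed, loads again
    }
}
```

With this shape, a subclass for a new value type implements only `load` and cannot forget or mis-copy the refresh check.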

import java.lang.reflect.Method;
import org.skife.config.Config;

public class ConfigUtils {
Contributor

nit: I would move this under the config package too.

Collaborator Author

This one has a dependency on Skife (which is why it's not in the common module in the first place), and it's better to keep the Skife-related parts in a higher-level module so they don't pollute the whole hierarchy.

@Andyz26 Andyz26 merged commit b8ceac2 into master Nov 16, 2023
3 of 5 checks passed
@Andyz26 Andyz26 deleted the andyz/teConfigOnFP branch November 16, 2023 05:24