Implement batch schedule request #580
Conversation
Test Results: 130 files ±0, 130 suites ±0, 7m 34s ⏱️ +14s. For more details on these failures, see this check. Results for commit 73c723f. ± Comparison against base commit c0c403c. This pull request removes 1 test.
♻️ This comment has been updated with latest results.
a1ca00d to d42bc82
Uploaded Artifacts: To use these artifacts in your Gradle project, paste the following lines in your build.gradle.
...trol-plane-core/src/main/java/io/mantisrx/server/master/resourcecluster/ResourceCluster.java (review thread resolved)
...-plane-server/src/main/java/io/mantisrx/master/resourcecluster/ExecutorStateManagerImpl.java (review thread resolved)
log.info("Received batch schedule request event: {}", event); | ||
|
||
// If the size of the batch request is 1 we'll handle it as the "old" schedule request | ||
if (event.getRequest().getScheduleRequests().size() == 1) { |
any reason/upside to keeping the 1 node case special?
Good question. For actions like autoscaling or resubmitting a worker, scheduling requests will always involve one worker at a time. In those situations we'd rather use the ScheduleRequestEvent lifecycle, not BatchScheduleRequestEvent, because ScheduleRequestEvent lets us retry at the worker level. That means setting up a timer per worker (keyed by worker-id) instead of a timer for the entire job (keyed by job-id) as BatchScheduleRequestEvent would, which would collide.
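To make that concrete, here is a minimal sketch of the dispatch described above. The event and request types are simplified stand-ins, not the real Mantis classes; only the size check mirrors the actual code in the diff.

```java
import java.util.List;

// Sketch only: simplified stand-ins for the scheduler actor's event types.
class BatchDispatchSketch {
    record ScheduleRequest(String jobId, String workerId) {}
    record BatchScheduleRequestEvent(String jobId, List<ScheduleRequest> scheduleRequests) {}

    void onBatchScheduleRequestEvent(BatchScheduleRequestEvent event) {
        if (event.scheduleRequests().size() == 1) {
            // Autoscaling / worker resubmission: a single worker, so use the per-worker
            // lifecycle and key the retry timer on the worker-id.
            scheduleSingleWorker(event.scheduleRequests().get(0));
        } else {
            // Full job submission: keep the batch together and key the retry timer on the job-id.
            scheduleBatch(event);
        }
    }

    void scheduleSingleWorker(ScheduleRequest request) { /* per-worker retry path */ }

    void scheduleBatch(BatchScheduleRequestEvent event) { /* per-job retry path */ }
}
```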
private void onFailedToBatchScheduleRequestEvent(FailedToBatchScheduleRequestEvent event) {
    batchSchedulingFailures.increment();
    if (event.getAttempt() >= this.maxScheduleRetries) {
        log.error("Failed to submit the batch request {} because of ", event.getScheduleRequestEvent(), event.getThrowable());
in this case does it need to route some message/status back to the job actor?
I'm following the behavior we currently have during scheduling ops -> Line 206 in 81bce54:
log.error("Failed to submit the request {} because of ", event.getScheduleRequestEvent(), event.getThrowable());
do you think we should change this behavior? If so, how?
I guess in this case there are no more retries/actions for this job, so maybe we should mark the job as failed instead? That way users won't think the job is still being handled.
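If we did want that, one possible shape is sketched below. This is purely hypothetical: the message type, the job-router interface, and the wiring are invented for illustration and do not exist in Mantis; only the retries-exhausted check reflects the diff above.

```java
// Hypothetical sketch: route the terminal scheduling failure back to the owning job actor
// instead of only logging it. All types here are simplified stand-ins.
class BatchScheduleFailureSketch {
    interface JobRouter { void tell(Object message, Object sender); }
    record FailedBatch(String jobId, int attempt, Throwable cause) {}
    record BatchScheduleFailed(String jobId, Throwable cause) {}

    private final JobRouter jobRouter;
    private final int maxScheduleRetries;

    BatchScheduleFailureSketch(JobRouter jobRouter, int maxScheduleRetries) {
        this.jobRouter = jobRouter;
        this.maxScheduleRetries = maxScheduleRetries;
    }

    void onFailedToBatchScheduleRequestEvent(FailedBatch event) {
        if (event.attempt() >= maxScheduleRetries) {
            // Retries exhausted: notify the job actor so it can mark the job as Failed,
            // rather than leaving users thinking the job is still being scheduled.
            jobRouter.tell(new BatchScheduleFailed(event.jobId(), event.cause()), null);
        }
        // Otherwise the existing retry path (timer keyed by job-id) kicks in.
    }
}
```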
@Override
public void unscheduleJob(String jobId) {
    schedulerActor.tell(CancelBatchRequestEvent.of(jobId), null);
CancelBatchRequestEvent seems to be only cancelling retrying on the scheduler side. I am wondering what happens if the rcActor is sending the assignment request while this is happening? Will the agent still get the cancel request?
Not sure I understand the question. If you're asking what happens if the user decides to kill the job, there are 2 cases:
- job is in pending state because not enough resources are available. In this case we'll simply delete the timer so the scheduler won't retry.
- job is in accepted state and workers are already assigned. Those will still get cancelled as we're currently doing via unscheduleAndTerminateWorker.
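For reference, a rough sketch of those two paths. The pending-request map and timer abstraction are stand-ins for the scheduler actor's internal state; only the "drop the retry timer" idea reflects what the PR actually does.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: what cancelling a batch request on the scheduler side roughly amounts to.
class CancelBatchSketch {
    interface RetryTimers { void cancel(String key); }

    private final Map<String, Object> pendingJobRequests = new ConcurrentHashMap<>();
    private final RetryTimers timers;

    CancelBatchSketch(RetryTimers timers) {
        this.timers = timers;
    }

    void onCancelBatchRequestEvent(String jobId) {
        // Case 1: the job is still pending (not enough resources): drop the cached request
        // and its retry timer so the scheduler stops retrying; no workers were assigned yet.
        pendingJobRequests.remove(jobId);
        timers.cancel("batch-retry-" + jobId);
        // Case 2: workers were already assigned: nothing to do here; the job actor cancels
        // them through the existing unscheduleAndTerminateWorker path.
    }
}
```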
@@ -1357,7 +1358,14 @@ private void initializeRunningWorkers() {

scheduler.initializeRunningWorker(scheduleRequest, wm.getSlave(), wm.getSlaveID());
} else if (wm.getState().equals(WorkerState.Accepted)) {
Is this part a potential source of state mismatch: a worker has already been submitted to an agent, but the worker info wm wasn't updated before a leader switch, so it gets queued again (while the previously assigned agent is already running the same worker)?
assertEquals(SCALE_GROUP_1,
    Objects.requireNonNull(bestFitO.get().getRight().getRegistration())
    Objects.requireNonNull(bestFitO.get().getBestFit().values().stream().findFirst().get().getRight().getRegistration())
it would be nice to have some more UTs around the pending worker set scenarios here.
lgtm thanks a lot!
@@ -317,12 +386,35 @@ public GetClusterUsageResponse getClusterUsage(GetClusterUsageRequest req) {
} else {
    usageByGroupKey.put(groupKey, kvState);
}

if ((value.isAssigned() || value.isRunningTask()) && value.getWorkerId() != null) {
I think we need metrics to track the pending set size with tags.
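Something along these lines could work. This is illustrative only: it uses Netflix Spectator as the metrics API, and the metric name, tags, and wiring are assumptions, not the current Mantis implementation.

```java
import com.netflix.spectator.api.Registry;

// Sketch: publish the size of the pending worker set per group key as a tagged gauge.
class PendingSetMetricsSketch {
    private final Registry registry;

    PendingSetMetricsSketch(Registry registry) {
        this.registry = registry;
    }

    void reportPendingCount(String clusterId, String groupKey, int pendingCount) {
        registry.gauge(registry.createId("resourceCluster.pendingWorkerCount",
                "cluster", clusterId, "groupKey", groupKey))
            .set(pendingCount);
    }
}
```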
});

GetClusterUsageResponseBuilder resBuilder = GetClusterUsageResponse.builder().clusterID(req.getClusterID());
usageByGroupKey.forEach((key, value) -> resBuilder.usage(UsageByGroupKey.builder()
    .usageGroupKey(key)
    .idleCount(value.getLeft())
    .idleCount(value.getLeft() - pendingCountByGroupKey.get(key))
Currently, we disable the scaler during an upgrade. With this logic, job scheduling might see more interruptions; do you think we need some special handling during upgrades?
28f2756 to 73c723f
Introduction
This PR addresses the issue we've been having with large job requests and management of idle containers.
Today, if a large job request (200+ nodes) is scheduled and uses up all idle containers on the RC, the scaler is not aware of the remaining pending workers and will still only scale up to the configured idle size (so any newly added containers are immediately consumed by pending workers). This forces us to pre-scale ASGs manually and allocate a hefty idle worker buffer. To improve system efficiency, we've revised this mechanic to include pending worker counts in the scaler logic.
Details
In essence, the scheduling procedure shifts from per-worker to per-job. Now, once a job is submitted, it either gets all the required workers or none. For instance, if there are 20 idle workers of a certain type and a new job requires 50, we hold the request until the scaler scales up the ASG with 50 extra containers. The present 20 idle containers can then be utilized for other tasks such as job autoscaling, worker resubmission, among others.
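To make the all-or-nothing behavior concrete, here is a heavily simplified sketch of the idea. The real best-fit logic in ExecutorStateManagerImpl accounts for SKUs, scale groups, and more, so treat the types and method here as illustrative stand-ins.

```java
import java.util.List;
import java.util.Optional;

// Sketch: per-job, all-or-nothing matching. Either every requested worker gets an idle
// executor, or nothing is assigned and the job stays pending until the scaler adds capacity.
class AllOrNothingSketch {
    record WorkerRequest(String workerId) {}
    record Executor(String executorId) {}

    Optional<List<Executor>> findBestFit(List<WorkerRequest> requests, List<Executor> idleExecutors) {
        if (idleExecutors.size() < requests.size()) {
            // Not enough idle capacity for the whole job: leave it pending. Because pending
            // workers are now reported to the scaler, the ASG will be scaled up first.
            return Optional.empty();
        }
        // Enough capacity: hand out one idle executor per requested worker.
        return Optional.of(idleExecutors.subList(0, requests.size()));
    }
}
```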
Our solution provides several benefits:
Detailed List of Changes per Component
This update affects the principal components/actors, which include the job actor, the scheduler actor, and the RC actor. Let's delve into the detailed changes for each:
Job Actor
The Job Actor will now only issue batch requests, with certain modifications. For autoscaling or worker resubmission requests, only one worker will be contained in the batch request. For job submission requests (the submitInitialWorkers function), all requested workers will be included. This also applies to all requests still pending before a master leader change (the initializeRunningWorkers function). Apart from these changes, the actor functions as previously.
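As an illustration of the two call shapes, here is a sketch with simplified stand-in types; only the single-worker versus all-workers distinction and the scheduleWorkers name come from this PR.

```java
import java.util.List;

// Sketch only: how the job actor builds batch requests in the two situations described above.
class JobActorBatchSketch {
    record WorkerRequest(String workerId) {}
    record BatchScheduleRequest(String jobId, List<WorkerRequest> workers) {}
    interface Scheduler { void scheduleWorkers(BatchScheduleRequest request); }

    private final Scheduler scheduler;
    private final String jobId;

    JobActorBatchSketch(Scheduler scheduler, String jobId) {
        this.scheduler = scheduler;
        this.jobId = jobId;
    }

    // Job submission (and pending workers replayed after a leader change): all workers in one batch.
    void submitInitialWorkers(List<WorkerRequest> allWorkers) {
        scheduler.scheduleWorkers(new BatchScheduleRequest(jobId, allWorkers));
    }

    // Autoscaling or worker resubmission: still a batch request, but containing a single worker.
    void resubmitWorker(WorkerRequest worker) {
        scheduler.scheduleWorkers(new BatchScheduleRequest(jobId, List.of(worker)));
    }
}
```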
Scheduler Actor
Initially, scheduleWorker was replaced by scheduleWorkers to manage any schedule request.
RC Actor
The RC actor now handles assignment requests for one or more workers.
Scaler Actor
There are no specific changes needed for the Scaler Actor. It works by comparing the idle count and total count for a given SKU against the target idle count. The change here is that pending workers are now subtracted from the reported idle count in the usage response produced by the RC actor.
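A small worked example of the adjusted usage numbers; the figures and the exact scale-up formula are illustrative, not taken from the scaler code.

```java
// Illustrative numbers only: one SKU with 20 idle executors and a pending job needing 50 workers.
class IdleAdjustmentExample {
    public static void main(String[] args) {
        int idleCount = 20;
        int pendingCount = 50;
        int reportedIdle = idleCount - pendingCount;    // -30 reported to the scaler instead of 20
        int targetIdleCount = 10;                       // the scaler's desired idle buffer for this SKU
        int scaleUpBy = targetIdleCount - reportedIdle; // 40 extra containers requested
        // After scale-up: 20 + 40 = 60 available, 50 consumed by the pending job, 10 idle remain.
        System.out.println("scale up by " + scaleUpBy);
    }
}
```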
Tests
The new scheduler logic was tested across various scenarios, including single/multiple jobs, different SKUs, and autoscaling cases, as well as unusual and error cases such as leader failover or jobs stuck in the accepted state for a long time. This is the full list:
Todo
unscheduleJob() function: at present, this function manages the scheduler timers responsible for retrying batch schedule requests, and it covers a specific corner case around invalidating the pending job cache. If we can make effective use of cache timeouts, its role may no longer be necessary; the goal is to devise a cleaner strategy that, ideally, capitalizes on cache timeouts.
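One possible direction, sketched under the assumption of a Caffeine-style cache (which is not necessarily what the scheduler uses today): let pending batch entries expire on their own, so unscheduleJob() invalidation becomes a fast path rather than a correctness requirement.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

// Sketch only: pending batch requests expire automatically after a timeout.
class PendingJobCacheSketch {
    private final Cache<String, Object> pendingBatches = Caffeine.newBuilder()
        .expireAfterWrite(Duration.ofMinutes(10)) // illustrative timeout
        .build();

    void track(String jobId, Object batchRequest) {
        pendingBatches.put(jobId, batchRequest);
    }

    // Explicit invalidation (what unscheduleJob() does today) becomes an optional fast path.
    void cancel(String jobId) {
        pendingBatches.invalidate(jobId);
    }
}
```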