Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

batchRoute: Better handle job cancellation and deletion #1092

Merged
merged 1 commit into from
Nov 8, 2024

Conversation

tahini
Copy link
Collaborator

@tahini tahini commented Nov 7, 2024

fixes #710

  • Cancelled jobs sometimes were not really cancelled if a checkpoint was set after the job status was set to cancelled in the main thread, but before the automatic task refresh every 5 seconds, so we refresh the task before updating it in the checkpoint update.
  • Also, refreshing and saving the job would throw an error if the job is deleted from the DB and would kill the running thread, which would cause the trRouting process to be left defunct. So we add a catch when refreshing/updating the job and stopping the trRouting process is done in the finally block, to make sure it is always stopped, even in case of an exception.

@tahini tahini requested a review from greenscientist November 7, 2024 19:17
@tahini
Copy link
Collaborator Author

tahini commented Nov 7, 2024

@greenscientist This mitigates issue #710, but it is not perfect yet, #1091 tracks what the ideal solution should be, but it requires some more major refactoring. With this PR, the trRouting process is never left defunct anymore and cancellations/deletions are properly handled.

The only remaining race condition described in #1091 would cause a cancelled task to continue running, refreshing the task list would show it as "in progress" and it can be cancelled again, so the impact is not so bad compared to having the restart the pod.

@greenscientist
Copy link
Collaborator

@tahini , I think you refered to the wrong issue here, 710 is a PR.

Copy link
Collaborator

@greenscientist greenscientist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.
Just check my notes about the comments.


this.options.progressEmitter.emit('progress', { name: 'StoppingRoutingParallelServers', progress: 1.0 });
console.log('trRouting multiple stopStatus', stopStatus);
// FIXME Should we return here if the job is cancelled? Or we still
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an issue to track this? That seems important to get an answer to that

@@ -65,7 +71,7 @@ const getTaskCancelledFct = (task: ExecutableJob<JobDataType>) => {
clearInterval(intervalObj);
}
})
.catch(() => (refreshError = true));
.catch(() => (refreshError = true)); // This will catch deleted jobs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking, you added the same comment as before without changing the code. This is accurate and not just a bad copy/paste ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's accurate, it adds an explanation of which kind of use case would be caught here.

fixes chairemobilite#709

* Cancelled jobs sometimes were not really cancelled if a checkpoint was
  set after the job status was set to cancelled in the main thread, but
  before the automatic task refresh every 5 seconds, so we refresh the
  task before updating it in the checkpoint update.
* Also, refreshing and saving the job would throw an error if the job
  is deleted from the DB and would kill the running thread, which would
  cause the trRouting process to be left defunct. So we add a catch when
  refreshing/updating the job and stopping the trRouting process is done
  in the finally block, to make sure it is always stopped, even in case of
  an exception.
@tahini
Copy link
Collaborator Author

tahini commented Nov 7, 2024

@tahini , I think you refered to the wrong issue here, 710 is a PR.

It's #709... how could I mix 709 and 710?

@tahini tahini merged commit 75ef4cc into chairemobilite:main Nov 8, 2024
6 checks passed
@tahini tahini deleted the cancelTask branch November 8, 2024 01:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants