-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
batchRoute: Better handle job cancellation and deletion #1092
Conversation
@greenscientist This mitigates issue #710, but it is not perfect yet, #1091 tracks what the ideal solution should be, but it requires some more major refactoring. With this PR, the trRouting process is never left defunct anymore and cancellations/deletions are properly handled. The only remaining race condition described in #1091 would cause a cancelled task to continue running, refreshing the task list would show it as "in progress" and it can be cancelled again, so the impact is not so bad compared to having the restart the pod. |
@tahini , I think you refered to the wrong issue here, 710 is a PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
Just check my notes about the comments.
|
||
this.options.progressEmitter.emit('progress', { name: 'StoppingRoutingParallelServers', progress: 1.0 }); | ||
console.log('trRouting multiple stopStatus', stopStatus); | ||
// FIXME Should we return here if the job is cancelled? Or we still |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an issue to track this? That seems important to get an answer to that
@@ -65,7 +71,7 @@ const getTaskCancelledFct = (task: ExecutableJob<JobDataType>) => { | |||
clearInterval(intervalObj); | |||
} | |||
}) | |||
.catch(() => (refreshError = true)); | |||
.catch(() => (refreshError = true)); // This will catch deleted jobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking, you added the same comment as before without changing the code. This is accurate and not just a bad copy/paste ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's accurate, it adds an explanation of which kind of use case would be caught here.
fixes chairemobilite#709 * Cancelled jobs sometimes were not really cancelled if a checkpoint was set after the job status was set to cancelled in the main thread, but before the automatic task refresh every 5 seconds, so we refresh the task before updating it in the checkpoint update. * Also, refreshing and saving the job would throw an error if the job is deleted from the DB and would kill the running thread, which would cause the trRouting process to be left defunct. So we add a catch when refreshing/updating the job and stopping the trRouting process is done in the finally block, to make sure it is always stopped, even in case of an exception.
fixes #710