batchRouting: trRouting is sometimes left in defunct state #709
Comments
Is it that it cannot be started from Transition's perspective, or that the socket cannot be bound?
One short-term solution would be to check the state of the process pointed to by the PID file and wipe the file if the process is defunct. That should allow for a restart.
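The PID file check suggested above could look roughly like the sketch below. It assumes a Linux host, where the state letter in `/proc/<pid>/stat` is `Z` for a defunct (zombie) process. The names `FileAccess`, `parseProcState` and `clearStalePidFile` are hypothetical and do not come from the Transition code base; file access is injected so the logic can be exercised without touching the disk.

```typescript
// Hypothetical helper names; not taken from the Transition code base.

/** Minimal file-system surface, injected so the logic is testable. */
interface FileAccess {
    read(path: string): string | undefined; // undefined when the file is missing
    remove(path: string): void;
}

/**
 * Extract the one-letter process state from the content of /proc/<pid>/stat.
 * The command name is wrapped in parentheses and may itself contain spaces
 * or ")", so the state is read after the LAST closing parenthesis.
 */
function parseProcState(statContent: string): string | undefined {
    const end = statContent.lastIndexOf(")");
    if (end === -1) return undefined;
    const fields = statContent.slice(end + 1).trim().split(/\s+/);
    return fields[0] || undefined;
}

/**
 * Remove the PID file when the process it points to no longer exists,
 * is a zombie ("Z"), or when the PID file content is not a number.
 * Returns true if the PID file was removed.
 */
function clearStalePidFile(pidFilePath: string, files: FileAccess): boolean {
    const pidText = files.read(pidFilePath);
    if (pidText === undefined) return false; // no PID file, nothing to clean

    const pid = Number.parseInt(pidText.trim(), 10);
    const stat = Number.isNaN(pid) ? undefined : files.read(`/proc/${pid}/stat`);
    const state = stat === undefined ? undefined : parseProcState(stat);

    if (state === undefined || state === "Z") {
        files.remove(pidFilePath);
        return true;
    }
    return false; // process is alive, keep the PID file
}
```

In a real Node.js process, `FileAccess` could be backed by `fs.readFileSync` and `fs.unlinkSync`. This only unblocks a restart; it does not explain why the process went defunct in the first place.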
Might be related to issues like chairemobilite/trRouting#255
I don't think so, the cache files are ok in this case, since the trRouting process for the instance (on port 4000) runs fine. But maybe at some point there was a state of bad cache files, which caused the trRouting process of port 14000 to go defunct this way...
Yes, that's what I meant. The defunct state is probably caused by some race condition during a crash or something like that.
On a machine with the defunct process:
Possible log from the defunct process
Also worth noting is the high-level flow of what was done in Transition when this situation occurred:
Maybe something in the delete/cancel sequence confuses Node.js and prevents it from acknowledging its child process.
A thought: when we cancel a batch calculation we need to kill the running trRouting. But maybe we have some requests still in flight. We might need to first stop the calculation and make sure we terminate all in-flight requests before we go and kill trRouting. If we kill trRouting first, maybe we leave some socket in a weird state. (Should not happen, but we never know)
I think that the p-queue.clear() on cancellation is too aggressive in TrRoutingBatch:
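The ordering proposed above (stop the calculation, let in-flight requests settle, then kill trRouting) can be sketched as follows. This is only an illustration under assumptions: `InFlightTracker`, `cancelBatch` and the `stopTrRouting` callback are hypothetical names, and the real code uses p-queue rather than this hand-rolled tracker.

```typescript
// Hypothetical sketch; the real TrRoutingBatch code is organised differently.

/** Tracks in-flight requests so that cancellation can wait for them. */
class InFlightTracker {
    private readonly pending = new Set<Promise<unknown>>();
    private accepting = true;

    /** Start a request unless cancellation has begun; undefined when refused. */
    run<T>(request: () => Promise<T>): Promise<T> | undefined {
        if (!this.accepting) return undefined;
        const p = request();
        this.pending.add(p);
        const forget = () => { this.pending.delete(p); };
        p.then(forget, forget); // stop tracking on success or failure
        return p;
    }

    /** Refuse new requests, then wait until every started request has settled. */
    async drain(): Promise<void> {
        this.accepting = false;
        const settled = Array.from(this.pending).map((p) =>
            p.then(() => undefined, () => undefined)
        );
        await Promise.all(settled);
    }
}

/** Cancel in the order suggested in the thread: drain first, kill afterwards. */
async function cancelBatch(
    tracker: InFlightTracker,
    stopTrRouting: () => Promise<void>
): Promise<void> {
    await tracker.drain();
    await stopTrRouting();
}
```

With this ordering, no request is still talking to trRouting's socket when the process is killed. Whether that is actually the cause of the defunct state is not established in this thread.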
OK, I don't see anything that we could do with the p-queue directly.
Problem happened again, and again, just after job cancellation.
fixes chairemobilite#709
* Cancelled jobs sometimes were not really cancelled if a checkpoint was set after the job status was set to cancelled in the main thread, but before the automatic task refresh every 5 seconds, so we refresh the task before updating it in the checkpoint update.
* Also, refreshing and saving the job would throw an error if the job is deleted from the DB and would kill the running thread, which would cause the trRouting process to be left defunct. So we add a catch when refreshing/updating the job, and stopping the trRouting process is done in the finally block, to make sure it is always stopped, even in case of an exception.
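The shape of the fix described in this commit message can be sketched as below. It is not the actual patch: `BatchDeps`, `runBatch` and all of their members are hypothetical names chosen for illustration. The sketch shows the two points from the message: the job is refreshed before each checkpoint update, with errors caught, and trRouting is stopped in a `finally` block.

```typescript
// Hypothetical sketch of the described fix; names are not from the real patch.

interface BatchDeps {
    /** Runs the calculation; the checkpoint callback returns false to ask for a stop. */
    calculate(onCheckpoint: (done: number) => Promise<boolean>): Promise<void>;
    /** Reloads the job from the DB; assumed to throw if the job was deleted. */
    refreshJob(): Promise<"running" | "cancelled">;
    saveCheckpoint(done: number): Promise<void>;
    stopTrRouting(): Promise<void>;
}

async function runBatch(deps: BatchDeps): Promise<"completed" | "cancelled" | "failed"> {
    let cancelled = false;
    try {
        await deps.calculate(async (done) => {
            try {
                // Refresh first, so a cancellation set in the main thread is
                // seen before the checkpoint overwrites the job.
                if ((await deps.refreshJob()) === "cancelled") {
                    cancelled = true;
                    return false;
                }
                await deps.saveCheckpoint(done);
                return true;
            } catch (error) {
                // The job may have been deleted from the DB: stop the
                // calculation instead of letting the error kill the thread.
                cancelled = true;
                return false;
            }
        });
        return cancelled ? "cancelled" : "completed";
    } catch (error) {
        return "failed";
    } finally {
        // Always stop trRouting, even when an exception was thrown.
        await deps.stopTrRouting();
    }
}
```

Putting the stop call in `finally` means every exit path, whether normal completion, cancellation or an unexpected exception, goes through it.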
I've seen it locally once or twice and it happens on the server too: trRouting remains defunct, so all subsequent batch queries fail because trRouting cannot be started. It is considered as still running, but it doesn't answer...
No idea why, trRouting's last logs don't show anything. And so much has happened on the server since then that the logs from that moment are lost.