
batchRouting: trRouting is sometimes left in defunct state #709

Closed
tahini opened this issue Sep 8, 2023 · 13 comments · Fixed by #710
Assignees: tahini
Labels: bug Something isn't working

Comments

@tahini
Collaborator

tahini commented Sep 8, 2023

I've seen it locally once or twice and it happens on the server too: trRouting remains in a defunct state, so all batch queries after that fail because trRouting cannot be restarted; it is considered as still running, but it no longer answers...

No idea why; trRouting's last logs don't show anything, and so much happened on the server that the logs from that moment are lost.

@tahini added the bug (Something isn't working) label on Sep 8, 2023
@greenscientist
Collaborator

Cannot be started from the Transition perspective, or because the socket cannot be bound?

@greenscientist
Collaborator

One "short term" solution would be to check the state of the process pointed by the PID file and wipe the file if the process is defunct. That should allow for a restart.

@greenscientist linked a pull request on Sep 8, 2023 that will close this issue
@greenscientist
Collaborator

Might be related to issues like chairemobilite/trRouting#255

@tahini
Collaborator Author

tahini commented Sep 8, 2023

Might be related to issues like chairemobilite/trRouting#255

I don't think so, the cache files are OK in this case, since the trRouting process for the instance (on port 4000) runs fine. But maybe at some point there was a state of bad cache files, which caused the trRouting process on port 14000 to go defunct this way...

@greenscientist
Collaborator

Yes, that's what I meant. The defunct state is probably caused by some race condition during a crash or something like that.

@greenscientist
Collaborator

On a machine with the defunct process:
ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 2580 936 pts/0 Ss+ Apr18 0:00 /bin/sh /config/start.sh
root 9 1.0 0.0 2148308 37420 pts/0 Sl+ Apr18 13:54 ./json2capnp 2000 /app/projects/demo_transition/cache/demo_transition/
root 326 0.0 0.0 861308 79516 pts/0 Sl+ Apr18 0:09 node /opt/yarn-v1.22.19/bin/yarn.js start:tracing
root 347 0.0 0.0 2584 912 pts/0 S+ Apr18 0:00 /bin/sh -c yarn workspace transition-backend run start:tracing
root 348 0.0 0.0 927872 80704 pts/0 Sl+ Apr18 0:09 /usr/local/bin/node /opt/yarn-v1.22.19/bin/yarn.js workspace transition-backend run start:tracing
root 359 0.0 0.0 864288 84596 pts/0 Sl+ Apr18 0:08 /usr/local/bin/node /opt/yarn-v1.22.19/lib/cli.js run start:tracing
root 370 0.0 0.0 2588 920 pts/0 S+ Apr18 0:00 /bin/sh -c yarn node -r ../../tracing.js --max-old-space-size=4096 lib/server.js
root 371 0.0 0.0 797988 83216 pts/0 Sl+ Apr18 0:09 /usr/local/bin/node /opt/yarn-v1.22.19/lib/cli.js node -r ../../tracing.js --max-old-space-size=4096 lib/server.js
root 382 4.8 0.8 2042800 1072224 pts/0 Sl+ Apr18 64:56 /usr/local/bin/node -r ../../tracing.js --max-old-space-size=4096 lib/server.js
root 103254 63.5 0.0 0 0 pts/0 Z+ Apr18 637:14 [trRouting]
root 126577 1.0 0.8 1709776 1146584 pts/0 Sl+ 13:08 0:21 trRouting --port=4000 --osrmPort=5000 --osrmHost=osrm-quebecetottawa-walking --debug=0 --cachePath=/app/projects/demo_transition/cache/demo_tran
root 126638 0.0 0.0 4316 3532 pts/2 Ss+ 13:34 0:00 bash
root 126789 0.0 0.0 4324 3628 pts/1 Ss 13:37 0:00 bash
root 127020 0.0 0.0 8248 4500 pts/1 R+ 13:43 0:00 ps aux

@greenscientist
Collaborator

Possibly the log of the defunct process:
trRouting.log

@tahini
Collaborator Author

tahini commented Apr 19, 2024

Also worth noting, the high-level flow of what was done in Transition at the moment this situation occurred:

1. Deleted the job in progress
2. Cancelled that same in-progress job, thinking I maybe should have done that first, before deleting it
3. Deleted some [lines]
4. Updated the cache
5. Started a new batch calculation

@greenscientist
Collaborator

Maybe there's something in the delete/cancel sequence that confuses Node.js and prevents it from acknowledging (reaping) its child process.

@greenscientist
Collaborator

A thought: when we cancel a batch calculation we need to kill the running trRouting, but maybe we still have some requests in flight. We might need to first stop the calculation and make sure we terminate all in-flight requests before we go and kill trRouting. If we kill trRouting first, we may leave some socket in a weird state. (Should not happen, but we never know.)

@greenscientist
Collaborator

I think that the p-queue.clear() on cancellation is too aggressive in TrRoutingBatch:
if (this.options.isCancelled()) {
    promiseQueue.clear();
}
This will immediately empty the queue, and then the await onIdle will return. Some already-started queries might still be pending; we might then kill trRouting before those queries are done.

@greenscientist
Collaborator

OK, I don't see anything that we could do with the p-queue directly.
What we could do instead: use a reader/writer lock. Take the read lock before the call to odTripTask, release it in the finally, and then before calling stopBatch we can attempt to take the write lock, which would block until all the current requests have finished.
Ok, yes, we can probably just have a "running" counter and check that it's zero before doing the stopBatch. Not sure how to await on this check, though; one possible way is sketched below.
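A minimal sketch of that "running counter" idea: odTripTask, stopBatch and promiseQueue are the names used in the comments above; InFlightCounter and the usage around it are made up for illustration, not the actual Transition code.

class InFlightCounter {
    private count = 0;
    private waiters: Array<() => void> = [];

    enter(): void {
        this.count++;
    }

    exit(): void {
        this.count--;
        if (this.count === 0) {
            this.waiters.forEach((resolve) => resolve());
            this.waiters = [];
        }
    }

    // Resolves immediately if nothing is in flight, otherwise when the
    // last in-flight request calls exit().
    whenIdle(): Promise<void> {
        return this.count === 0
            ? Promise.resolve()
            : new Promise((resolve) => this.waiters.push(resolve));
    }
}

// Around each query:
//   inFlight.enter();
//   try { await odTripTask(odTrip); } finally { inFlight.exit(); }
// And on cancellation:
//   promiseQueue.clear();
//   await inFlight.whenIdle(); // wait for in-flight queries to finish
//   await stopBatch();         // only now is it safe to kill trRouting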

@tahini
Collaborator Author

tahini commented Nov 6, 2024

The problem happened again, and again it was just after a job cancellation.

@tahini self-assigned this on Nov 7, 2024
tahini added a commit to tahini/transition that referenced this issue Nov 7, 2024
fixes chairemobilite#709

* Cancelled jobs sometimes were not really cancelled if a checkpoint was
  set after the job status was set to cancelled in the main thread, but
  before the automatic task refresh every 5 seconds, so we refresh the
  task before updating it in the checkpoint update.
* Also, refreshing and saving the job would throw an error if the job
  is deleted from the DB and would kill the running thread, which would
  cause the trRouting process to be left defunct. So we add a catch when
  refreshing/updating the job and stopping the trRouting process is done
  in the finally block, to make sure it is always stopped, even in case of
  an exception.
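A minimal sketch of the pattern described in the commit message above; the types and function names are illustrative, not the actual Transition code:

interface Job {
    status: string;
    refresh(): Promise<void>;
    setCheckpoint(checkpoint: number): Promise<void>;
}

const runBatch = async (
    job: Job,
    checkpoints: number[],
    processCheckpoint: (checkpoint: number) => Promise<void>,
    stopTrRoutingProcess: () => Promise<void>
): Promise<void> => {
    try {
        for (const checkpoint of checkpoints) {
            try {
                // Refresh the job first, so a cancellation done in the main
                // thread is seen before the checkpoint is written back.
                await job.refresh();
                if (job.status === 'cancelled') {
                    return;
                }
                await job.setCheckpoint(checkpoint);
            } catch (error) {
                // The job may have been deleted from the DB: log and stop
                // instead of letting the exception kill the worker thread.
                console.error('Error refreshing/updating job', error);
                return;
            }
            await processCheckpoint(checkpoint);
        }
    } finally {
        // Always stop trRouting, even if something above threw, so the
        // child process is never left defunct.
        await stopTrRoutingProcess();
    }
};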
@tahini closed this as completed in 75ef4cc on Nov 8, 2024