
batchRouting: trRouting is sometimes left in defunct state #709

Closed
tahini opened this issue Sep 8, 2023 · 13 comments · Fixed by #710
Assignees: tahini
Labels: bug Something isn't working

Comments

@tahini
Collaborator

tahini commented Sep 8, 2023

I've seen it locally once or twice and it happens on the server too: trRouting remains in a defunct state, so all batch queries after that fail because trRouting cannot be restarted; it is considered as still running, but it no longer answers...

No idea why; trRouting's last logs don't show anything, and so much happened on the server that the logs from that moment are lost.

@tahini added the bug (Something isn't working) label on Sep 8, 2023
@greenscientist
Collaborator

Cannot be started from the Transition perspective, or because the socket cannot be bound?

@greenscientist
Collaborator

One "short term" solution would be to check the state of the process pointed by the PID file and wipe the file if the process is defunct. That should allow for a restart.

@greenscientist linked a pull request on Sep 8, 2023 that will close this issue
@greenscientist
Collaborator

Might be related to issues like chairemobilite/trRouting#255

@tahini
Collaborator Author

tahini commented Sep 8, 2023

Might be related to issues like chairemobilite/trRouting#255

I don't think so, the cache files are OK in this case, since the trRouting process for the instance (on port 4000) runs fine. But maybe at some point there was a state of bad cache files, which caused the trRouting process on port 14000 to go defunct this way...

@greenscientist
Collaborator

Yes, that's what I meant. The defunct state is probably caused by some race condition during a crash or something like that.

@greenscientist
Collaborator

On a machine with the defunct process:
ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 2580 936 pts/0 Ss+ Apr18 0:00 /bin/sh /config/start.sh
root 9 1.0 0.0 2148308 37420 pts/0 Sl+ Apr18 13:54 ./json2capnp 2000 /app/projects/demo_transition/cache/demo_transition/
root 326 0.0 0.0 861308 79516 pts/0 Sl+ Apr18 0:09 node /opt/yarn-v1.22.19/bin/yarn.js start:tracing
root 347 0.0 0.0 2584 912 pts/0 S+ Apr18 0:00 /bin/sh -c yarn workspace transition-backend run start:tracing
root 348 0.0 0.0 927872 80704 pts/0 Sl+ Apr18 0:09 /usr/local/bin/node /opt/yarn-v1.22.19/bin/yarn.js workspace transition-backend run start:tracing
root 359 0.0 0.0 864288 84596 pts/0 Sl+ Apr18 0:08 /usr/local/bin/node /opt/yarn-v1.22.19/lib/cli.js run start:tracing
root 370 0.0 0.0 2588 920 pts/0 S+ Apr18 0:00 /bin/sh -c yarn node -r ../../tracing.js --max-old-space-size=4096 lib/server.js
root 371 0.0 0.0 797988 83216 pts/0 Sl+ Apr18 0:09 /usr/local/bin/node /opt/yarn-v1.22.19/lib/cli.js node -r ../../tracing.js --max-old-space-size=4096 lib/server.js
root 382 4.8 0.8 2042800 1072224 pts/0 Sl+ Apr18 64:56 /usr/local/bin/node -r ../../tracing.js --max-old-space-size=4096 lib/server.js
root 103254 63.5 0.0 0 0 pts/0 Z+ Apr18 637:14 [trRouting]
root 126577 1.0 0.8 1709776 1146584 pts/0 Sl+ 13:08 0:21 trRouting --port=4000 --osrmPort=5000 --osrmHost=osrm-quebecetottawa-walking --debug=0 --cachePath=/app/projects/demo_transition/cache/demo_tran
root 126638 0.0 0.0 4316 3532 pts/2 Ss+ 13:34 0:00 bash
root 126789 0.0 0.0 4324 3628 pts/1 Ss 13:37 0:00 bash
root 127020 0.0 0.0 8248 4500 pts/1 R+ 13:43 0:00 ps aux

@greenscientist
Collaborator

Possibly the log of the defunct process:
trRouting.log

@tahini
Collaborator Author

tahini commented Apr 19, 2024

Also worth noting, the high-level flow of what was done in Transition at the moment this situation occurred:

1. Deleted the job in progress
2. Cancelled that same in-progress job, thinking I maybe should have done that first, before deleting it
3. Deleted some [lines]
4. Updated the cache
5. Started a new batch calculation

@greenscientist
Collaborator

Maybe there's something in the delete/cancel sequence that confuses Node.js and prevents it from acknowledging (reaping) its child process.

@greenscientist
Collaborator

A thought: when we cancel a batch calculation we need to kill the running trRouting, but maybe we still have some requests in flight. We might need to first stop the calculation and make sure we terminate all in-flight requests before we go and kill trRouting. If we kill trRouting first, we may leave some socket in a weird state. (Should not happen, but we never know.)

@greenscientist
Collaborator

I think that the p-queue.clear() on cancellation is too aggressive in TrRoutingBatch:
if (this.options.isCancelled()) {
    promiseQueue.clear();
}
This will immediately empty the queue, and then the await onIdle will return. Some already-started queries might still be pending; we might then kill trRouting before those queries are done.

@greenscientist
Collaborator

OK, I don't see anything that we could do with the p-queue directly.
What we could do instead: use a reader/writer lock. Take the read lock before the call to odTripTask, release it in the finally, and then before calling stopBatch we can attempt to take the write lock, which would block until all the current requests have finished.
Ok, yes, we can probably just have a "running" counter and check that it's zero before doing the stopBatch. Not sure how to await on this check, though; one possible way is sketched below.
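A minimal sketch of that "running counter" idea: odTripTask, stopBatch and promiseQueue are the names used in the comments above; InFlightCounter and the usage around it are made up for illustration, not the actual Transition code.

class InFlightCounter {
    private count = 0;
    private waiters: Array<() => void> = [];

    enter(): void {
        this.count++;
    }

    exit(): void {
        this.count--;
        if (this.count === 0) {
            this.waiters.forEach((resolve) => resolve());
            this.waiters = [];
        }
    }

    // Resolves immediately if nothing is in flight, otherwise when the
    // last in-flight request calls exit().
    whenIdle(): Promise<void> {
        return this.count === 0
            ? Promise.resolve()
            : new Promise((resolve) => this.waiters.push(resolve));
    }
}

// Around each query:
//   inFlight.enter();
//   try { await odTripTask(odTrip); } finally { inFlight.exit(); }
// And on cancellation:
//   promiseQueue.clear();
//   await inFlight.whenIdle(); // wait for in-flight queries to finish
//   await stopBatch();         // only now is it safe to kill trRouting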

@tahini
Collaborator Author

tahini commented Nov 6, 2024

The problem happened again, and again it was just after a job cancellation.

@tahini self-assigned this on Nov 7, 2024
tahini added a commit to tahini/transition that referenced this issue Nov 7, 2024
fixes chairemobilite#709

* Cancelled jobs sometimes were not really cancelled if a checkpoint was
  set after the job status was set to cancelled in the main thread, but
  before the automatic task refresh every 5 seconds, so we refresh the
  task before updating it in the checkpoint update.
* Also, refreshing and saving the job would throw an error if the job
  is deleted from the DB and would kill the running thread, which would
  cause the trRouting process to be left defunct. So we add a catch when
  refreshing/updating the job and stopping the trRouting process is done
  in the finally block, to make sure it is always stopped, even in case of
  an exception.
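A minimal sketch of the pattern described in the commit message above; the types and function names are illustrative, not the actual Transition code:

interface Job {
    status: string;
    refresh(): Promise<void>;
    setCheckpoint(checkpoint: number): Promise<void>;
}

const runBatch = async (
    job: Job,
    checkpoints: number[],
    processCheckpoint: (checkpoint: number) => Promise<void>,
    stopTrRoutingProcess: () => Promise<void>
): Promise<void> => {
    try {
        for (const checkpoint of checkpoints) {
            try {
                // Refresh the job first, so a cancellation done in the main
                // thread is seen before the checkpoint is written back.
                await job.refresh();
                if (job.status === 'cancelled') {
                    return;
                }
                await job.setCheckpoint(checkpoint);
            } catch (error) {
                // The job may have been deleted from the DB: log and stop
                // instead of letting the exception kill the worker thread.
                console.error('Error refreshing/updating job', error);
                return;
            }
            await processCheckpoint(checkpoint);
        }
    } finally {
        // Always stop trRouting, even if something above threw, so the
        // child process is never left defunct.
        await stopTrRoutingProcess();
    }
};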
@tahini closed this as completed in 75ef4cc on Nov 8, 2024