Make Stetl multithreaded #41
From the Stetl Gitter conversation: "Attended the PyAmsterdam virtual Meetup yesterday (June 24, 2020, JvdB). Very interesting presentation by Clayton Bezuidenhout, see on YouTube from minute 16: https://youtu.be/Aqu5PE3tzV0?t=998 . In essence something Stetl-like (basic Pipeline architecture, driven by configuration), but every module is a Thread. Communication runs via Queues. I have asked him whether he is willing to share code. Celery is a kind of alternative, but as far as I know it is multi-process with messaging etc., which is too heavy. In GeoHealthCheck I have had good experience with scheduling (the APScheduler package) and multi-threading (every Healthcheck is a thread); it is very stable. I am posting it here so I don't forget..." The framework is Open Source:
So the core architecture of Stetl is a Chain/Pipeline of Components (Inputs, Filters, Outputs) that pass Data Packets to each other. Likewise, a Component (or a group of linked Components) could run in its own Thread and pass Data Packets via Queues to other Component Threads. So instead of being connected directly, Components could be connected via Queues. In other cases we may consider running multiple instances of a Chain, e.g. with the Dutch Key Registers (Basisregistraties) there are typically multiple files for which the order of processing is not significant.
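To make the Queue idea concrete, here is a minimal sketch using only the Python standard library. It does not use Stetl's actual classes: `run_component`, `run_chain` and the `_DONE` sentinel are hypothetical names made up for this illustration, and the "Input" is simply the calling thread feeding packets into the first Queue.

```python
import queue
import threading

# Hypothetical end-of-stream marker (not part of Stetl).
_DONE = object()


def run_component(func, inbox, outbox):
    """Apply func to every packet read from inbox; forward results to outbox."""
    while True:
        packet = inbox.get()
        if packet is _DONE:
            # Pass the end-of-stream marker downstream, then stop this thread.
            if outbox is not None:
                outbox.put(_DONE)
            break
        result = func(packet)
        if outbox is not None:
            outbox.put(result)


def run_chain(packets, filter_func):
    """Input (calling thread) -> Filter thread -> Output thread, linked by Queues."""
    # Bounded queues apply back-pressure when a downstream stage is slower.
    q_filter = queue.Queue(maxsize=16)
    q_output = queue.Queue(maxsize=16)
    results = []

    filter_thread = threading.Thread(
        target=run_component, args=(filter_func, q_filter, q_output))
    # The "Output" here just collects packets in a list.
    output_thread = threading.Thread(
        target=run_component, args=(results.append, q_output, None))
    filter_thread.start()
    output_thread.start()

    for packet in packets:
        q_filter.put(packet)
    q_filter.put(_DONE)

    filter_thread.join()
    output_thread.join()
    return results
```

For example, `run_chain(["a", "b"], str.upper)` pushes two packets through a Filter that upper-cases them. With one thread per stage and FIFO Queues, the packet order is preserved.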
The best solution depends on the workflow. I would keep Stetl as 'atomic' as possible: just use it for a single task. IMO this means that it should be executed on a single machine, and in that case I agree that threads are much more efficient than processes. An example is loading the BGT into a database. This can be seen as a single job, which can be parallelized perfectly well. On the other hand, there are many situations in which you want to run multiple Stetl jobs. In that case processes should be used, and if you want to perform the processing on multiple machines, Celery or a similar task queue for distributed processing is needed. So I would suggest focusing on options to make Stetl multithreaded when performing a single job.
Stetl is an ideal application to be made multithreaded. Most of the time it is processing datasets that consist of multiple files, and it is run in a (server or desktop) environment where multiple processors or cores are available.
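For the multiple-files case, a rough sketch of what "one Chain instance per file" could look like with a thread pool from the standard library. This is only an illustration under assumptions: `run_chain_on_file` is a hypothetical placeholder for running a complete Stetl Chain on one file, and `run_all` and `max_workers=4` are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_chain_on_file(path):
    """Hypothetical placeholder: run one complete Chain on a single file."""
    # A real job would read, filter and write the file's data here.
    return f"processed {path}"


def run_all(paths, max_workers=4):
    """Process files concurrently; the files are assumed independent of each other."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of the input paths,
        # even if the files finish processing in a different order.
        return list(pool.map(run_chain_on_file, paths))
```

Note that in CPython threads mainly pay off when the work is I/O-bound (reading files, writing to a database); how much a real Stetl job would gain would need to be measured.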
See also nlextract/NLExtract#194