Make Stetl multithreaded #41

fsteggink · 2016-07-04T14:36:21Z

Stetl is an ideal candidate for multithreading. Most of the time it processes datasets that consist of multiple files, and it runs in a (server or desktop) environment where multiple processors or cores are available.

See also nlextract/NLExtract#194

justb4 · 2020-07-22T12:27:37Z

From the Stetl Gitter conversation:

"Attended the PyAmsterdam virtual Meetup yesterday (June 24, 2020, JvdB). Very interesting presentation by Clayton Bezuidenhout, see on YouTube after minute 16: https://youtu.be/Aqu5PE3tzV0?t=998 . It is essentially something Stetl-like (a basic Pipeline architecture, driven by configuration), but each module is a Thread. Communication runs via Queues. I asked him whether he would be willing to share code. Celery is a kind of alternative, but as far as I can tell it is multi-process with messaging etc., which is too heavy. In GeoHealthCheck I have had good experiences with scheduling (the APScheduler package) and multi-threading (each Healthcheck is a thread); it is very stable. Posting it here so I don't forget..."

The framework is Open Source:
https://bitbucket.org/clayton-bezuidenhout/threads-and-queues-example-app/src/master/

justb4 · 2020-07-22T12:36:18Z

The core architecture of Stetl is a Chain/Pipeline of Components (Inputs, Filters, Outputs) that pass Data Packets to each other. A Component (or a group of linked Components) could therefore run in its own Thread and pass Data Packets via Queues to other Component Threads. Instead of being connected directly, Components would be connected via Queues.
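To make the idea concrete, here is a minimal sketch of a thread-per-component pipeline using only the Python standard library. It is not Stetl code: `run_component`, `run_pipeline`, and the `SENTINEL` end-of-stream marker are hypothetical names, and each stage is assumed to be a plain function that takes one packet and returns one packet (or `None` to drop it).

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker passed down the pipeline


def run_component(process, inbox, outbox):
    """Take packets from inbox, apply process, and put results on outbox."""
    while True:
        packet = inbox.get()
        if packet is SENTINEL:
            outbox.put(SENTINEL)  # tell the next component that we are done
            break
        result = process(packet)
        if result is not None:  # a stage may drop a packet by returning None
            outbox.put(result)


def feed(packets, outbox):
    """Play the role of the Input component: push packets, then end-of-stream."""
    for packet in packets:
        outbox.put(packet)
    outbox.put(SENTINEL)


def run_pipeline(packets, stages, maxsize=8):
    """Run every stage in its own thread, linked by bounded queues."""
    queues = [queue.Queue(maxsize=maxsize) for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=feed, args=(packets, queues[0]))]
    for i, stage in enumerate(stages):
        threads.append(
            threading.Thread(
                target=run_component, args=(stage, queues[i], queues[i + 1])
            )
        )
    for thread in threads:
        thread.start()

    # The calling thread plays the role of the Output: it drains the last queue.
    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)

    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5), [lambda x: x * 2, lambda x: x + 1]))
```

The bounded queues provide back-pressure, so a fast Input cannot flood a slow Filter. The sketch leaves out error handling: if a stage raises, its thread ends without forwarding the marker and the pipeline would hang, so a real implementation would need to propagate failures.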

In other cases we may consider running multiple instances of a Chain in parallel. With the Dutch key registries (Basisregistraties), for example, there are typically multiple files whose processing order is not significant.
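The multiple-instances variant could look roughly like the sketch below, which uses `concurrent.futures.ThreadPoolExecutor` from the standard library. The names `run_chains` and `fake_chain` are hypothetical; `fake_chain` merely stands in for whatever would run one complete Chain on one input file, and the file names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_chains(run_chain, paths, max_workers=4):
    """Run one Chain instance per file; completion order does not matter."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the file it is processing.
        futures = {pool.submit(run_chain, path): path for path in paths}
        for future in as_completed(futures):
            # result() re-raises any exception from the worker thread.
            results[futures[future]] = future.result()
    return results


def fake_chain(path):
    """Hypothetical stand-in for running a complete Chain on one file."""
    return f"processed {path}"


if __name__ == "__main__":
    files = ["part1.xml", "part2.xml", "part3.xml"]  # made-up file names
    print(run_chains(fake_chain, files))
```

Because of CPython's global interpreter lock, threads like these mainly pay off when a Chain spends most of its time waiting on I/O, such as reading files or writing to a database; whether that holds for a given Stetl job would need to be measured.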

fsteggink · 2020-08-05T08:52:17Z

The best solution depends on the workflow. I would keep Stetl as 'atomic' as possible: just use it for a single task. IMO this means that it should be executed on a single machine, and in that case I agree that threads are much more efficient than processes. An example is loading the BGT into a database. This can be seen as a single job, which lends itself very well to parallelization.

On the other hand, there are many situations where you want to run multiple Stetl jobs. In that case processes should be used, and if you want to spread the processing over multiple machines, Celery or a similar task queue for distributed processing is needed.

So I would suggest focusing on options to make Stetl multithreaded when it performs a single job.

fsteggink mentioned this issue Jul 4, 2016

Maak BAG-extract multithreaded nlextract/NLExtract#194

Closed

justb4 added the enhancement label Jul 13, 2016

justb4 added this to the Version - Horizon milestone Jul 13, 2016
