Architecture
The figure below depicts the architecture of our data pipeline. It shows that our infrastructure is based on three main technologies: (i) Spark Streaming, which ensures a highly parallel and distributed (cluster-based) MARC21-to-BIBFRAME transformation as MARC21 records are streamed to it; (ii) Kafka, which serves as a fault-tolerant temporary buffer (messaging system) within the infrastructure, effectively decoupling producers and consumers; and (iii) Akka Streams, which enables writing reactive, stream-based components that support fast, asynchronous, fault-tolerant data production and consumption in and out of the pipeline.
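As an illustration of the transformation stage, the sketch below shows how a Spark job could read MARC21 records from a Kafka topic, transform them, and write the results back to Kafka. It is a minimal sketch using Spark Structured Streaming; the topic names (marc21-records, bibframe-records), the broker address and the toBibframe placeholder are assumptions for illustration, not the project's actual code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Marc21ToBibframeJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("marc21-to-bibframe")   // hypothetical application name
      .getOrCreate()
    import spark.implicits._

    // Hypothetical placeholder for the actual MARC21 -> BIBFRAME conversion logic.
    val toBibframe = udf((marc21: String) => marc21 /* transform here */)

    // Read MARC21 records streamed into a Kafka topic (topic/broker names assumed).
    val records = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "marc21-records")
      .load()
      .select($"key".cast("string").as("key"), $"value".cast("string").as("value"))

    // Transform each record and write the result back to another Kafka topic.
    val query = records
      .withColumn("value", toBibframe($"value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "bibframe-records")
      .option("checkpointLocation", "/tmp/checkpoints/marc21-to-bibframe")
      .start()

    query.awaitTermination()
  }
}
```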
(iii) is critical because Spark and Kafka are cluster-based solutions: any component that produces to or consumes from the pipeline needs to be extremely fast, otherwise the processing power of the pipeline is underutilized. This is what (iii) provides: the ability to easily write sophisticated asynchronous components with self-healing properties, for high responsiveness with fault tolerance. In addition, because they are reactive, these components provide automatic backpressure and easy throttling setup, which makes them ideal for interfacing with legacy systems if necessary (see the Akka-Stream Wiki Page for more details).
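To make this role concrete, here is a minimal sketch, assuming Akka 2.6+ and the Alpakka Kafka connector, of a producer that reads records from a legacy source, throttles the flow, and pushes them into a Kafka topic; backpressure from Kafka propagates upstream automatically. The topic name, broker address and the marc21Records source are hypothetical.

```scala
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer
import scala.concurrent.duration._

object ThrottledMarc21Producer extends App {
  implicit val system: ActorSystem = ActorSystem("marc21-producer")

  val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092")

  // Hypothetical source of MARC21 records pulled from a legacy system;
  // in practice this would wrap the legacy API or an export feed.
  def marc21Records: Source[String, _] =
    Source(List("record-1", "record-2", "record-3"))

  marc21Records
    .throttle(100, 1.second)                        // throttling to protect the legacy system
    .map(rec => new ProducerRecord[String, String]("marc21-records", rec))
    .runWith(Producer.plainSink(producerSettings))  // backpressure propagates from Kafka upstream
}
```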
Zooming in further, we can see that the architecture showcases two pipelining jobs, represented by the "Dump" components and the "Update" components. This is because a continuous pipeline is actually a two-step process: (a) first we replicate/transform the full dump of the catalogue, and (b) then we propagate new creations, updates and deletions on a continuous basis.
From a Pipeline Architecture point of view:
- A Dump scenario happens only once; it is not something done continuously. It is the first step of a continuous pipeline, and an initial conversion pipeline in its own right. It is kept separate from the continuous pipeline because it has different requirements: in a dump, full parallelization is possible at each stage of the pipeline, since the order of propagation does not matter at all. What we want is simply to replicate the data, in a different format, into whatever downstream system we have.
- The Update pipeline, however, which comprises new creations as well as changes and deletions, needs to have its ordering respected while running continuously. We do not want the update event of a record (in practice, the record itself) to be propagated before its creation event. Order matters! Some parallelization is still possible during the transformation in Spark Streaming, but re-ordering needs to happen when records are sent back into the topic for consumption by downstream consumers, as sketched below.
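One common way to preserve per-record ordering in Kafka, consistent with the re-ordering requirement above, is to key each message by the record identifier so that all events for a given record land in the same partition (Kafka guarantees ordering only within a partition). The sketch below illustrates this with the plain Kafka producer API; the topic name and the RecordEvent wrapper are assumptions, not the project's actual code.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

// Hypothetical event wrapper: eventType is "create", "update" or "delete".
final case class RecordEvent(recordId: String, eventType: String, payload: String)

object OrderedUpdateProducer {
  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  private val producer = new KafkaProducer[String, String](props)

  // Using the record id as the Kafka message key sends every event for a given
  // record to the same partition, so Kafka preserves creation -> update -> deletion
  // order for that record even though different records are processed in parallel.
  def publish(event: RecordEvent): Unit = {
    producer.send(new ProducerRecord("updates-bibframe", event.recordId, event.payload))
    ()
  }
}
```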
A Proposed Architecture for producing Linked-Data leveraging our Legacy Cataloguing system (Workflow1)
- The Proof of Concept introduced above partially instantiates the Architecture depicted below.