This repository has been archived by the owner on Oct 8, 2020. It is now read-only.

Consider Apache Beam? #1

Open
ghost opened this issue May 10, 2018 · 8 comments

@ghost

ghost commented May 10, 2018

Apache Beam is a unified programming model for batch and stream processing with multiple backends: Apache Spark, Apache Flink, Apache Apex, Apache Gearpump, and Google Cloud Dataflow.

It could deduplicate code across the different backends. Apache Zeppelin also provides a Beam interpreter.

@GezimSejdiu
Member

Hi @t3476 ,
many thanks for the suggestion. Yes, indeed, we have considered Apache Beam since it came out at the end of 2016.

Since we provide the same functionality (or at least try to keep it aligned as much as we can) on both engines (Apache Spark and Apache Flink), we were also thinking of having an underlying engine that works on both runners with a single pipeline. The good news was that Apache Beam came out, and we have been looking at whether we can adapt it to our framework as well.

We are still discussing the possibility of switching to Apache Beam, and looking forward to seeing whether Apache Beam supports a Scala API as well, since most of our code is written in Scala.

We will have to see how easy it is to re-implement the functionality we built in SANSA on the Apache Beam framework.

We will keep posting here once we decide on something. It would be great to keep this option open and discuss the benefits of using one engine that covers most of the distributed frameworks out there.

Best,

@ghost
Author

ghost commented Jun 12, 2018

Hi @GezimSejdiu,
Have you checked out scio, a Scala API for Apache Beam? Its feature set is very rich, and the examples look a lot like Apache Spark's. I hope this increases the possibility.

@JensLehmann JensLehmann added this to the 0.5 milestone Jun 15, 2018
@JensLehmann
Member

We might consider this for the SANSA 0.5 release (December 2018). It's under discussion until then.

@JensLehmann
Member

We finally decided not to support this in the 0.5 release of SANSA.

@JensLehmann JensLehmann modified the milestones: 0.5, 0.6 Dec 7, 2018
@Aklakan
Member

Aklakan commented Jan 11, 2019

Because Linux tooling is often much faster than dedicated Big Data frameworks for most conventional workloads (several GB of data), it may be interesting to see whether the performance of some data processing workflows could be maximized if they were written in Beam and run with a `SystemEnvironmentRunner`, which would use Linux tooling and pipes.

So running a workflow (pseudocode) such as

PCollection.from("someFile.ttl")
  .apply(RDFReader.readTriplesFrom(Lang.TURTLE))
  .apply(Sorter.create())
  .apply(TextIO.write().to("outputFile.nt"))

should give:

cat someFile.ttl | rapper -i turtle -o ntriples - http://foo | sort -u > outputFile.nt

In principle, operators for certain common operations would have to be implemented.
The idea would be "write once, run everywhere", but in the case of integrating Linux tooling, the effort needed to (a) implement new operators that (b) have roughly portable semantics and (c) actually execute such a workflow might be too high.

Yet, it might be interesting to see whether Beam would in principle allow for doing this, e.g. for:

  • Validation of Triples (rapper, jena)
  • Conversion of format
  • Filtering by regex (grep)
  • Filtering by predicate expression
  • Sorting (sort)
  • Join (join command?)
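As a rough illustration of how a couple of the listed operations map onto plain Linux tooling (a minimal sketch; the file names and sample triples below are made up):

```shell
# Tiny made-up N-Triples input (hypothetical data)
cat > data.nt <<'EOF'
<http://example.org/b> <http://example.org/knows> <http://example.org/c> .
<http://example.org/a> <http://example.org/knows> <http://example.org/b> .
<http://example.org/a> <http://example.org/knows> <http://example.org/b> .
EOF

# Filtering by regex (grep): keep triples whose subject is <http://example.org/a>,
# then sorting with deduplication (sort -u)
grep '^<http://example.org/a>' data.nt | sort -u > filtered.nt

cat filtered.nt
# 1 distinct triple remains
```

Note that `join` requires its inputs to be sorted on the join field, which fits naturally after a `sort` step in such a pipeline.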

@ghost
Author

ghost commented Apr 18, 2019

@Aklakan Apache Beam includes a Direct Runner, which executes pipelines locally on your machine.

@Aklakan
Member

Aklakan commented Oct 4, 2019

So I have put together a module which I guess is conceptually related to Apache Beam when it comes to dataset processing:

https://github.com/SmartDataAnalytics/jena-sparql-api/tree/develop/jena-sparql-api-conjure

Workflows are assembled in RDF using static factory classes, just like in Beam.
Furthermore, it seems that my terminology can be mapped to Beam's: Executor = Runner, Workflow = Pipeline.

My implementation, however, is native RDF.

@Aklakan
Member

Aklakan commented Oct 4, 2019

> @Aklakan Apache Beam includes a Direct Runner, which executes pipelines locally on your machine.

Cool, need to check that out

@JensLehmann JensLehmann modified the milestones: 0.6, 0.9 Jun 26, 2020

4 participants