This repository has been archived by the owner on Oct 8, 2020. It is now read-only.

Consider Apache Beam? #1

Open
ghost opened this issue May 10, 2018 · 8 comments

@ghost

ghost commented May 10, 2018

Apache Beam is a unified programming model for batch and stream processing with multiple backends: Apache Spark, Apache Flink, Apache Apex, Apache Gearpump, and Google Cloud Dataflow.

It could deduplicate code across the different backends. Apache Zeppelin also provides a Beam interpreter.

@GezimSejdiu
Member

Hi @t3476 ,
many thanks for the suggestion. Yes, indeed, we have considered Apache Beam since it came out at the end of 2016.

Since we provide the same functionality (or at least try to keep it aligned as much as we can) on both engines (Apache Spark and Apache Flink), we were also thinking of having an underlying engine that works on both runners with a single pipeline. The good news was that Apache Beam came out, and we have been looking at whether we can adapt it to our framework as well.

We are still discussing the possibility of switching to Apache Beam, and looking forward to seeing whether Apache Beam supports a Scala API as well, since most of our code is written in Scala.

We will have to see how easy it is to re-implement the functionality we built in SANSA on the Apache Beam framework.

We will keep posting here once we decide on something. It would be great to keep this option open and discuss the benefits of using one engine that covers most of the distributed frameworks out there.

Best,

@ghost
Author

ghost commented Jun 12, 2018

Hi @GezimSejdiu,
Have you checked out scio, a Scala API for Apache Beam? Its feature set is very rich, and the examples look a lot like Apache Spark's. I hope this increases the possibility.

@JensLehmann JensLehmann added this to the 0.5 milestone Jun 15, 2018
@JensLehmann
Member

We might consider this for the SANSA 0.5 release (December 2018). It's under discussion until then.

@JensLehmann
Member

We finally decided not to support this in the 0.5 release of SANSA.

@JensLehmann JensLehmann modified the milestones: 0.5, 0.6 Dec 7, 2018
@Aklakan
Member

Aklakan commented Jan 11, 2019

Because Linux tooling is often much faster than dedicated Big Data frameworks for most conventional workloads (several GB of data), it may be interesting to see whether the performance of some data processing workflows could be maximized if they were written in Beam and run with a `SystemEnvironmentRunner`, which would use Linux tooling and pipes.

So running a workflow (pseudocode) such as

PCollection.from("someFile.ttl")
  .apply(RDFReader.readTriplesFrom(Lang.TURTLE))
  .apply(Sorter.create())
  .apply(TextIO.write().to("outputFile.nt"))

should give:

cat someFile.ttl | rapper -i turtle -o ntriples - http://foo | sort -u > outputFile.nt

In principle, operators for certain common operations would have to be implemented.
The idea would be "write once, run everywhere", but in the case of integrating Linux tooling, the effort needed to (a) implement new operators that (b) have roughly portable semantics and (c) actually execute such a workflow might be too high.

Yet, it might be interesting to see whether Beam would in principle allow for doing this, e.g. for:

  • Validation of Triples (rapper, jena)
  • Conversion of format
  • Filtering by regex (grep)
  • Filtering by predicate expression
  • Sorting (sort)
  • Join (join command?)
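As a rough illustration of how a couple of the listed operations map onto plain Linux tooling (a minimal sketch; the file names and sample triples below are made up):

```shell
# Tiny made-up N-Triples input (hypothetical data)
cat > data.nt <<'EOF'
<http://example.org/b> <http://example.org/knows> <http://example.org/c> .
<http://example.org/a> <http://example.org/knows> <http://example.org/b> .
<http://example.org/a> <http://example.org/knows> <http://example.org/b> .
EOF

# Filtering by regex (grep): keep triples whose subject is <http://example.org/a>,
# then sorting with deduplication (sort -u)
grep '^<http://example.org/a>' data.nt | sort -u > filtered.nt

cat filtered.nt
# 1 distinct triple remains
```

Note that `join` requires its inputs to be sorted on the join field, which fits naturally after a `sort` step in such a pipeline.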

@ghost
Author

ghost commented Apr 18, 2019

@Aklakan Apache Beam includes a Direct Runner, which executes pipelines locally on your machine.

@Aklakan
Member

Aklakan commented Oct 4, 2019

So I have put together a module which I guess is conceptually related to Apache Beam when it comes to dataset processing:

https://github.com/SmartDataAnalytics/jena-sparql-api/tree/develop/jena-sparql-api-conjure

Workflows are assembled in RDF using static factory classes, just like in Beam.
Furthermore, it seems that my terminology can be mapped to Beam's: Executor = Runner, Workflow = Pipeline.

My implementation, however, is native RDF.

@Aklakan
Member

Aklakan commented Oct 4, 2019

> @Aklakan Apache Beam includes a Direct Runner, which executes pipelines locally on your machine.

Cool, need to check that out

@JensLehmann JensLehmann modified the milestones: 0.6, 0.9 Jun 26, 2020

4 participants