Consider Apache Beam? #1
Hi @t3476, since we provide the same functionality (or at least try to keep it aligned as much as we can) on both engines (Apache Spark and Apache Flink), we were also thinking of having an underlying engine that runs a single pipeline on both runners. The good news is that Apache Beam came out, and we have been looking at whether we can adapt it to our framework as well. We are still discussing the possibility of switching to Apache Beam, and we are keen to see whether Apache Beam offers a Scala API as well, since most of our code is written in Scala. We will have to see how easy it is to re-implement the functionality we built in SANSA on top of Apache Beam. We will keep posting here in case we decide on something. It would be great to keep this option open and discuss the benefits of using one engine that covers most of the distributed frameworks out there. Best,
Hi @GezimSejdiu,
We might consider this for the SANSA 0.5 release (December 2018). It's under discussion until then. |
We finally decided not to support this in the 0.5 release of SANSA. |
Because Linux tooling is often much faster than dedicated Big Data frameworks for most conventional workloads (several GB of data), it is intriguing to ask whether the performance of some data-processing workflows could be improved by writing them in Beam and running them with a `SystemEnvironmentRunner` that executes pipelines using Linux tooling and pipes; a rough sketch of such a workflow follows below.
Operators for certain common operations would have to be implemented first, but it would be interesting to see whether Beam would in principle allow for this.
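As a hedged sketch of what such a workflow could look like (standard Beam Java API; `input.txt`, `matches`, and the `ERROR` filter are made-up illustrations, and `SystemEnvironmentRunner` does not actually exist in Beam):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;

public class GrepWorkflow {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.read().from("input.txt"))                     // read lines
     .apply(Filter.by((String line) -> line.contains("ERROR")))  // keep matching lines
     .apply(TextIO.write().to("matches").withoutSharding());     // single output file
    // A hypothetical SystemEnvironmentRunner could compile this down to
    // roughly: grep ERROR input.txt > matches
    p.run().waitUntilFinish();
  }
}
```

On the Direct Runner this executes in-process; the point of the hypothetical runner would be to replace that execution with ordinary Unix processes connected by pipes.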
@Aklakan Apache Beam includes the Direct Runner, which executes pipelines locally on your machine.
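For context, a minimal sketch of local execution on the Direct Runner (it is the default when no `--runner` flag is given; `PAssert` is Beam's standard testing utility):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Create;

public class DirectRunnerSmokeTest {
  public static void main(String[] args) {
    // With no --runner flag, Beam falls back to the Direct Runner,
    // which runs the pipeline in-process on the local machine.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    PAssert.that(p.apply(Create.of(1, 2, 3))).containsInAnyOrder(1, 2, 3);
    p.run().waitUntilFinish();
  }
}
```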
So I have rolled a module which I guess is conceptually related to Apache Beam when it comes to dataset processing: https://github.com/SmartDataAnalytics/jena-sparql-api/tree/develop/jena-sparql-api-conjure. Workflows are assembled in RDF using static factory classes - just like in Beam. My implementation, however, is native RDF.
Cool, need to check that out |
Apache Beam is a unified interface for batch and streaming with multiple backends: Apache Spark, Apache Flink, Apache Apex, Apache Gearpump and Google Cloud Dataflow.
It could deduplicate the code currently written separately for each backend. Apache Zeppelin also supports a Beam interpreter.
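To make the deduplication point concrete, here is a minimal sketch (file names are made up for illustration): the pipeline code stays identical, and the backend is chosen at launch time via `--runner`, provided the corresponding runner artifact is on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.ToString;

public class PortableLineCount {
  public static void main(String[] args) {
    // One code base; the backend is picked per invocation, e.g.
    //   --runner=DirectRunner  (local)
    //   --runner=SparkRunner   (Apache Spark)
    //   --runner=FlinkRunner   (Apache Flink)
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.read().from("input.txt"))   // hypothetical input path
     .apply(Count.globally())                  // count all lines
     .apply(ToString.elements())               // Long -> String for TextIO
     .apply(TextIO.write().to("line-count"));  // hypothetical output prefix
    p.run().waitUntilFinish();
  }
}
```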