Skip to content

Commit

Permalink
support spark 3, update scala, fix dockerfile, provide new rosbag dow…
Browse files Browse the repository at this point in the history
…nload, move to new home
  • Loading branch information
wiegelmann committed Oct 10, 2020
1 parent e610d3d commit b7974c8
Show file tree
Hide file tree
Showing 18 changed files with 103 additions and 12,678 deletions.
10 changes: 5 additions & 5 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
RUN python2 -m pip install --upgrade pip && \
python3 -m pip install --upgrade pip && \
pip3 install --no-cache-dir --upgrade jupyter && \
pip2 install --no-cache-dir --upgrade pyspark matplotlib pandas tensorflow keras Pillow && \
pip2 install --no-cache-dir --upgrade pyspark==3.0.1 matplotlib==2.2.3 pandas==0.23.2 tensorflow==1.9.0 keras==2.0.7 Pillow==6.2.2 && \
python2 -m pip install ipykernel && \
python2 -m ipykernel install && \
python3 -m pip install ipykernel && \
Expand Down Expand Up @@ -38,12 +38,12 @@ RUN mkdir -p /opt/ros_hadoop/latest
RUN mkdir -p /opt/ros_hadoop/master/dist/
RUN mkdir -p /opt/apache/
ADD . /opt/ros_hadoop/master
RUN bash -c "curl -s https://api.github.com/repos/valtech/ros_hadoop/releases/latest | egrep -io 'https://api.github.com/repos/valtech/ros_hadoop/tarball/[^\"]*' | xargs wget --quiet -O /opt/ros_hadoop/latest.tgz"
RUN bash -c "if [ ! -f /opt/ros_hadoop/master/dist/hadoop-3.0.2.tar.gz ] ; then wget --quiet -O /opt/ros_hadoop/master/dist/hadoop-3.0.2.tar.gz http://www.eu.apache.org/dist/hadoop/common/hadoop-3.0.2/hadoop-3.0.2.tar.gz ; fi"
RUN bash -c "curl -s https://api.github.com/repos/autovia/ros_hadoop/releases/latest | egrep -io 'https://api.github.com/repos/autovia/ros_hadoop/tarball/[^\"]*' | xargs wget --quiet -O /opt/ros_hadoop/latest.tgz"
RUN bash -c "if [ ! -f /opt/ros_hadoop/master/dist/hadoop-3.0.2.tar.gz ] ; then wget --quiet -O /opt/ros_hadoop/master/dist/hadoop-3.0.2.tar.gz https://archive.apache.org/dist/hadoop/common/hadoop-3.0.2/hadoop-3.0.2.tar.gz ; fi"
RUN tar -xzf /opt/ros_hadoop/latest.tgz -C /opt/ros_hadoop/latest --strip-components=1 && rm /opt/ros_hadoop/latest.tgz
RUN tar -xzf /opt/ros_hadoop/master/dist/hadoop-3.0.2.tar.gz -C /opt/apache && rm /opt/ros_hadoop/master/dist/hadoop-3.0.2.tar.gz

RUN ln -s /opt/apache/hadoop-3.0.2 /opt/apache/hadoop
RUN ln -s /opt/apache/hadoop-3.0.2 /opt/apache/hadoop
RUN bash -c "if [ ! -f /opt/ros_hadoop/latest/lib/rosbaginputformat.jar ] ; then ln -s /opt/ros_hadoop/master/lib/rosbaginputformat.jar /opt/ros_hadoop/latest/lib/rosbaginputformat.jar ; fi"

RUN printf "<configuration>\n\n<property>\n<name>fs.defaultFS</name>\n<value>hdfs://localhost:9000</value>\n</property>\n</configuration>" > /opt/apache/hadoop/etc/hadoop/core-site.xml && \
Expand All @@ -55,7 +55,7 @@ RUN printf "<configuration>\n\n<property>\n<name>fs.defaultFS</name>\n<value>hdf
RUN printf "#! /bin/bash\nset -e\nsource \"/opt/ros/$ROS_DISTRO/setup.bash\"\n/start_hadoop.sh\nexec \"\$@\"\n" > /ros_hadoop.sh && \
chmod a+x /ros_hadoop.sh

RUN bash -c "if [ ! -f /opt/ros_hadoop/master/dist/HMB_4.bag ] ; then wget --quiet -O /opt/ros_hadoop/master/dist/HMB_4.bag https://xfiles.valtech.io/f/c494d168522045e3bcc0/?dl=1 ; fi" && \
RUN bash -c "if [ ! -f /opt/ros_hadoop/master/dist/HMB_4.bag ] ; then wget --quiet -O /opt/ros_hadoop/master/dist/HMB_4.bag https://data.autovia.io/f/ros_hadoop/HMB_4.bag ; fi" && \
java -jar "$ROSIF_JAR" -f /opt/ros_hadoop/master/dist/HMB_4.bag

RUN bash -c "/start_hadoop.sh" && \
Expand Down
15 changes: 2 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,6 @@ RosbagInputFormat is an open source **splittable** Hadoop InputFormat for the RO

The complete source code is available in src/ folder and the jar file is generated using SBT (see build.sbt)

For an example of rosbag file larger than 2 GB see doc/Rosbag larger than 2 GB.ipynb Solved the issue https://github.com/valtech/ros_hadoop/issues/6 The issue was due to ByteBuffer being limitted by JVM Integer size and has nothing to do with Spark or how the RosbagMapInputFormat works within Spark. It was only problematic to extract the conf index with the jar.

# Workflow

1. Download latest release jar file and put it in classpath
Expand All @@ -21,18 +19,14 @@ hdfs dfs -put
```python
sc.newAPIHadoopFile(
path = "hdfs://127.0.0.1:9000/user/spark/HMB_4.bag",
inputFormatClass = "de.valtech.foss.RosbagMapInputFormat",
inputFormatClass = "io.autovia.foss.RosbagMapInputFormat",
keyClass = "org.apache.hadoop.io.LongWritable",
valueClass = "org.apache.hadoop.io.MapWritable",
conf = {"RosbagInputFormat.chunkIdx":"/opt/ros_hadoop/master/dist/HMB_4.bag.idx.bin"})
```

Example data can be found for instance at https://github.com/udacity/self-driving-car/tree/master/datasets published under MIT License.

# Usage on a Kubernetes cluster for processing at scale

Check out our Flux project (Machine Learning Stack for Big Data): https://github.com/flux-project/flux

# Documentation
The [doc/](doc/) folder contains a jupyter notebook with a few basic usage examples.

Expand Down Expand Up @@ -108,7 +102,7 @@ hdfs dfs -ls
```python
fin = sc.newAPIHadoopFile(
path = "hdfs://127.0.0.1:9000/user/root/HMB_4.bag",
inputFormatClass = "de.valtech.foss.RosbagMapInputFormat",
inputFormatClass = "io.autovia.foss.RosbagMapInputFormat",
keyClass = "org.apache.hadoop.io.LongWritable",
valueClass = "org.apache.hadoop.io.MapWritable",
conf = {“RosbagInputFormat.chunkIdx”:”/opt/ros_hadoop/master/dist/HMB_4.bag.idx.bin"})
Expand Down Expand Up @@ -272,8 +266,3 @@ One can sample of course and collect the data in the driver to train a model on
Note that the msg is the most granular unit but you could replace the flatMap with a mapPartitions to apply such a Keras function to a whole split.

Another option would be to have a map.reduceByKey before the flatMap so that the function argument would be a whole interval instead of a msg. The idea is to key on time.

We hope that the RosbagInputFormat would be useful to you.

## Please do not forget to send us your [feedback](AUTHORS).
![doc/images/browse-tutorial.png](doc/images/browse-tutorial.png)
6 changes: 3 additions & 3 deletions build.sbt
Original file line number Diff line number Diff line change
Expand Up @@ -4,17 +4,17 @@ lazy val root = (project in file(".")).
settings(
inThisBuild(List(
organization := "org.apache.spark.input",
scalaVersion := "2.11.8",
scalaVersion := "2.12.10",
version := "0.9.8"
)),
name := "RosbagInputFormat",
libraryDependencies ++= Seq(
scalaTest % Test,
"com.google.code.gson" % "gson" % "2.8.0",
"org.apache.spark" %% "spark-core" % "2.1.0",
"org.apache.spark" %% "spark-core" % "3.0.1",
"com.google.protobuf" % "protobuf-java" % "3.3.0"
),
packageOptions in (Compile, packageBin) +=
Package.ManifestAttributes(
"Class-Path" -> Seq(".","scala-library-2.11.8.jar","protobuf-java-3.3.0.jar").mkString(" "))
"Class-Path" -> Seq(".","scala-library-2.12.10.jar","protobuf-java-3.3.0.jar").mkString(" "))
)
2 changes: 1 addition & 1 deletion doc/Rosbag larger than 2 GB.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@
"source": [
"# Show that the we can read the index\n",
"\n",
"Solved the issue https://github.com/valtech/ros_hadoop/issues/6 \n",
"Solved the issue https://github.com/autovia/ros_hadoop/issues/6 \n",
"\n",
"The issue was due to ByteBuffer being limitted by JVM Integer size and has nothing to do with Spark or how the RosbagMapInputFormat works within Spark. It was only problematic to extract the conf index with the jar.\n",
"\n",
Expand Down
203 changes: 29 additions & 174 deletions doc/Tutorial.ipynb

Large diffs are not rendered by default.

Loading

0 comments on commit b7974c8

Please sign in to comment.