PROJECT #5
OBJECTIVE

The objectives of this project are to gain experience coding with:
- Spark
- Spark SQL
- Spark Streaming
- Kafka
- Scala and functional programming

DATA SET

The data set is the one that you analyzed in Course 1: the STM GTFS data.

PROBLEM STATEMENT

We receive the STM data every day and need to run an ETL pipeline that enriches it for reporting and analysis in real time. The data is split in two:
- A set of tables that build the dimensions (batch style)
- Stop times that need to be enriched for analysis and reporting (streaming)

PROJECT REQUIREMENTS
- Data pipeline installation
  Create a directory on HDFS for the staging area called /user/[GROUP]/[YOUR NAME]/project5/, where [GROUP] is the program (ask the instructor if you don't have it) and [YOUR NAME] is a nickname of your choice, in lowercase letters only.
  Create a directory for each source table called /user/[GROUP]/[YOUR NAME]/project5/[TABLE NAME], where [TABLE NAME] is one of:
  - trips
  - calendar_dates
  - routes
  Create a database called [GROUP][YOUR NAME] in Hive. If you already have one, just use it; do not try to create multiple databases.
  Create a Kafka topic called [GROUP][YOUR NAME]_stop_times.
  Create a directory for the result: /user/[GROUP]/[YOUR NAME]/project5/enriched_stop_time
  Create an external table in Hive that points to this folder so we can verify the results. (Schema follows.)
  A sketch of these setup commands is shown below.
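For example, the setup might look like the following sketch. It assumes the hdfs, hive, and kafka-topics.sh commands are on your PATH and ZooKeeper is on localhost:2181, and it uses "bigdata"/"alice" as placeholder [GROUP]/[YOUR NAME] values; newer Kafka releases take --bootstrap-server localhost:9092 instead of --zookeeper.

    # Staging area: one directory per source table, plus one for the result.
    hdfs dfs -mkdir -p /user/bigdata/alice/project5/trips
    hdfs dfs -mkdir -p /user/bigdata/alice/project5/calendar_dates
    hdfs dfs -mkdir -p /user/bigdata/alice/project5/routes
    hdfs dfs -mkdir -p /user/bigdata/alice/project5/enriched_stop_time

    # Hive database (reuse it if it already exists).
    hive -e "CREATE DATABASE IF NOT EXISTS bigdataalice;"

    # Kafka topic for the stop times stream.
    kafka-topics.sh --create --zookeeper localhost:2181 \
      --replication-factor 1 --partitions 1 --topic bigdataalice_stop_times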
- Extract data from STM to the staging area
  Download the STM GTFS data set from http://stm.info/sites/default/files/gtfs/gtfs_stm.zip and put the extracted files into /user/[GROUP]/[YOUR NAME]/project5/[TABLE NAME] on HDFS, where [TABLE NAME] is the name of the file without its extension. We just need the following tables (see the sketch below):
  - trips
  - calendar_dates
  - routes
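A sketch of the extraction, again with the placeholder values; the GTFS tables ship as .txt files inside the zip:

    # Download and unpack the GTFS feed locally.
    wget http://stm.info/sites/default/files/gtfs/gtfs_stm.zip
    unzip gtfs_stm.zip -d gtfs_stm

    # Load the three tables we need; each HDFS directory is the file name
    # without its .txt extension.
    hdfs dfs -put gtfs_stm/trips.txt          /user/bigdata/alice/project5/trips/
    hdfs dfs -put gtfs_stm/calendar_dates.txt /user/bigdata/alice/project5/calendar_dates/
    hdfs dfs -put gtfs_stm/routes.txt         /user/bigdata/alice/project5/routes/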
- Data pipeline
  - BATCH: Enrich trips with calendar dates and routes
    a. Read trips, calendar_dates, and routes into DataFrames.
    b. Enrich them to create an enrichedTrip DataFrame. You can use either a SQL query or the "join" API. (A sketch follows below.)
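A minimal Scala sketch of the batch job, assuming the staging paths above (with the placeholder values); the join keys route_id and service_id are the standard GTFS foreign keys:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("project5-batch")
      .enableHiveSupport()
      .getOrCreate()

    val base = "/user/bigdata/alice/project5"  // assumption: your staging area

    // GTFS files are headered CSV.
    val trips         = spark.read.option("header", "true").csv(s"$base/trips")
    val calendarDates = spark.read.option("header", "true").csv(s"$base/calendar_dates")
    val routes        = spark.read.option("header", "true").csv(s"$base/routes")

    // Enrich each trip with its route attributes and its service dates.
    val enrichedTrip = trips
      .join(routes, Seq("route_id"))
      .join(calendarDates, Seq("service_id"))

    // Cache it: every streaming micro-batch below reuses this dimension.
    enrichedTrip.cache()

The SQL alternative is to register the three DataFrames with createOrReplaceTempView and express the same two joins in one spark.sql query.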
  - STREAM: Stream stop times through Kafka and enrich them with the enriched trip information. (A sketch of the complete streaming job follows this list.)
    - Use a command line tool (kafka-console-producer) to produce stop times to the [GROUP][YOUR NAME]_stop_times topic. (Stop times is a huge data set; in order to avoid breaking the cluster, produce only 100 records, as shown below.)
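For example (a sketch assuming a broker on localhost:9092; older Kafka releases take --broker-list where newer ones take --bootstrap-server):

    # Skip the CSV header line, then send only the first 100 records.
    tail -n +2 gtfs_stm/stop_times.txt | head -n 100 | \
      kafka-console-producer.sh --broker-list localhost:9092 \
        --topic bigdataalice_stop_times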
    - Stream the stop_times topic into the Spark Streaming application.
    - For each micro-batch, enrich the RDD of stop times with the enriched trips dimension.
    - Save the enriched stop times on HDFS under the result directory.
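The streaming half might look like the following sketch. It assumes the enrichedTrip DataFrame and spark session from the batch step, the spark-streaming-kafka-0-10 connector, a broker on localhost:9092, and that each Kafka record value is one stop_times CSV line whose first five fields are trip_id, arrival_time, departure_time, stop_id, and stop_sequence (the standard GTFS order):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import spark.implicits._

    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))  // 10 s micro-batches (assumption)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",  // assumption: broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "project5",
      "auto.offset.reset"  -> "earliest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("bigdataalice_stop_times"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Parse each CSV line; keep the first five GTFS stop_times fields.
      val stopTimes = rdd.map(_.value().split(",", -1))
        .map(f => (f(0), f(1), f(2), f(3), f(4)))
        .toDF("trip_id", "arrival_time", "departure_time", "stop_id", "stop_sequence")

      // Enrich the micro-batch with the cached trips dimension and append
      // the result under the directory backing the Hive external table.
      stopTimes.join(enrichedTrip, Seq("trip_id"))
        .write.mode("append")
        .csv("/user/bigdata/alice/project5/enriched_stop_time")
    }

    ssc.start()
    ssc.awaitTermination()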