- Sparkify provides music streaming to end users. Song details and user activity are captured as JSON files.
- AWS Redshift serves as the data warehousing platform, enabling persistent storage and ad hoc queries.
- Apache Airflow serves as the data pipeline solution, supporting automation and monitoring of the ETL process.

To prepare the AWS environment (see the boto3 sketch after this list):

- Create a Redshift IAM role with the AmazonS3FullAccess policy.
- Create a Redshift cluster.
- Attach the IAM role to the cluster.
- Create the empty staging tables, dimension tables, and fact table as groundwork; the DDLs live in create_tables.sql.
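
A minimal sketch of those four steps with boto3 and psycopg2, assuming AWS credentials are already configured; the role name, cluster identifier, region, and database credentials are placeholders to adapt:

```python
import json

import boto3
import psycopg2

iam = boto3.client("iam")
redshift = boto3.client("redshift", region_name="us-west-2")

# 1. IAM role that Redshift can assume, with full S3 access for COPY.
role = iam.create_role(
    RoleName="sparkify-redshift-role",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)
iam.attach_role_policy(
    RoleName="sparkify-redshift-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)

# 2 + 3. Create the cluster with the role attached, then wait for it.
redshift.create_cluster(
    ClusterIdentifier="sparkify-cluster",  # hypothetical identifier
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="sparkify",
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe123",  # placeholder
    IamRoles=[role["Role"]["Arn"]],
)
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="sparkify-cluster")

# 4. Run the DDLs in create_tables.sql to create the empty tables.
host = redshift.describe_clusters(ClusterIdentifier="sparkify-cluster")[
    "Clusters"][0]["Endpoint"]["Address"]
conn = psycopg2.connect(host=host, port=5439, dbname="sparkify",
                        user="awsuser", password="ChangeMe123")
with conn, conn.cursor() as cur, open("create_tables.sql") as f:
    cur.execute(f.read())
conn.close()
```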

Then set up Airflow:

- Start Airflow from the command line with `/opt/airflow/start.sh`.
- Create a connection to S3.
- Create a connection to Redshift (both connections can be scripted; see the sketch after this list).
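
The two connections can be entered in the Airflow UI (Admin > Connections) or scripted against the metadata database, as in this sketch. The connection ids `aws_credentials` and `redshift` are assumptions; use whatever ids the DAG's operators reference:

```python
from airflow import settings
from airflow.models import Connection

# Assumed connection ids; they must match what the operators expect.
aws_conn = Connection(
    conn_id="aws_credentials",
    conn_type="aws",
    login="<AWS_ACCESS_KEY_ID>",
    password="<AWS_SECRET_ACCESS_KEY>",
)
redshift_conn = Connection(
    conn_id="redshift",
    conn_type="postgres",
    host="<cluster-endpoint>",  # from the Redshift console
    schema="sparkify",
    login="awsuser",
    password="<password>",
    port=5439,
)

session = settings.Session()
session.add(aws_conn)
session.add(redshift_conn)
session.commit()
```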

The pipeline reads two source datasets from S3:

- Log data: `s3://udacity-dend/log_data`
- Song data: `s3://udacity-dend/song_data`
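
To peek at the source files before wiring up the pipeline (assuming your credentials have read access to the `udacity-dend` bucket):

```python
import boto3

s3 = boto3.resource("s3", region_name="us-west-2")
bucket = s3.Bucket("udacity-dend")

# Print the first few log-data keys to confirm the layout.
for i, obj in enumerate(bucket.objects.filter(Prefix="log_data")):
    print(obj.key)
    if i >= 4:
        break
```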

The warehouse consists of one fact table and four dimension tables (an illustrative DDL follows the list):

- Songplay Fact Table
- Users Dimension Table
- Songs Dimension Table
- Artists Dimension Table
- Time Dimension Table
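
For orientation, a hypothetical shape of the fact table is sketched below; the column names are illustrative guesses, and the authoritative definitions for all seven tables live in create_tables.sql:

```python
# Illustrative only -- the real DDLs are in create_tables.sql.
SONGPLAYS_DDL = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0,1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```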

Project files:

- `create_tables.sql`: DDLs for the 2 staging tables, 4 dimension tables, and 1 fact table
- `udac_example_dag.py`: the DAG definition for Apache Airflow
- `stage_redshift.py`: custom operator that loads data from S3 into the Redshift staging tables
- `load_fact.py`: custom operator that populates the fact table in Redshift
- `load_dimension.py`: custom operator that loads the dimension tables in Redshift
- `data_quality.py`: custom operator that checks data quality in all tables, e.g. that each table contains at least one record (sketches of the staging and data-quality operators follow this list)
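
A condensed sketch of what the staging operator might look like; the parameter names and the JSON format option are assumptions, and the actual stage_redshift.py in this repo may differ:

```python
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class StageToRedshiftOperator(BaseOperator):
    """Copies JSON files from S3 into a Redshift staging table."""

    copy_sql = """
        COPY {table}
        FROM '{s3_path}'
        ACCESS_KEY_ID '{access_key}'
        SECRET_ACCESS_KEY '{secret_key}'
        FORMAT AS JSON '{json_option}'
    """

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift",
                 aws_credentials_id="aws_credentials",
                 table="", s3_path="", json_option="auto", *args, **kwargs):
        super(StageToRedshiftOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.aws_credentials_id = aws_credentials_id
        self.table = table
        self.s3_path = s3_path
        self.json_option = json_option

    def execute(self, context):
        credentials = AwsHook(self.aws_credentials_id).get_credentials()
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        redshift.run(StageToRedshiftOperator.copy_sql.format(
            table=self.table,
            s3_path=self.s3_path,
            access_key=credentials.access_key,
            secret_key=credentials.secret_key,
            json_option=self.json_option,
        ))
```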
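
And a minimal sketch of the "at least one record" data quality check described above, assuming the operator receives a list of table names:

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Fails the task if any of the given tables holds no records."""

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", tables=None, *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records("SELECT COUNT(*) FROM {}".format(table))
            if not records or records[0][0] < 1:
                raise ValueError("Data quality check failed: {} is empty".format(table))
            self.log.info("Data quality check passed for %s", table)
```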