The purpose of this project is to upgrade Sparkify's data analytics capabilities by moving from a data warehouse (AWS Redshift) to a data lake. The raw data resides in two AWS S3 buckets:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
The goal is to:
- Load the raw data from the S3 buckets
- Process the data into analytics tables using Spark
- Load them back into S3 as a set of dimensional tables
The ETL pipeline will:
- Load the song and log datasets from the S3 buckets into an AWS EMR cluster
- Transform the raw data into analytical tables optimized for queries (via partitioning)
- Store the transformed data in Parquet format on an S3 bucket (see the sketch below)
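A minimal sketch of one step of this pipeline (building the songs table), assuming a SparkSession with S3 access; the path glob and the output bucket name are assumptions, not project constants:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

# Read the raw JSON song data from S3 (the nested path glob is assumed
# to match the dataset's directory layout).
song_data = spark.read.json("s3://udacity-dend/song_data/*/*/*/*.json")

# Build the songs dimensional table and write it back to S3 as Parquet,
# partitioned by year and artist_id so queries can prune partitions.
songs = (
    song_data
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)
songs.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3://<output-bucket>/songs/")  # placeholder output bucket
```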
The resulting analytics tables form a star schema:

| Table | Description |
|---|---|
| songplays | fact table of song plays (which user played which song, when, and on which device) |
| users | dimensional table for users (name, gender, and level) |
| songs | dimensional table for songs (artist, title, year, and duration) |
| artists | dimensional table for artists (name and location info) |
| time | dimensional table breaking songplay timestamps into units (hour, day, week, month, year, weekday) |
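As an illustration of how the time table can be derived, a sketch assuming `log_df` holds the raw log data and its `ts` column stores epoch milliseconds (as the event logs do):

```python
from pyspark.sql import functions as F

# Convert epoch milliseconds to a timestamp, then break it into units.
time_table = (
    log_df
    .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
    .select("start_time")
    .dropDuplicates()
    .withColumn("hour", F.hour("start_time"))
    .withColumn("day", F.dayofmonth("start_time"))
    .withColumn("week", F.weekofyear("start_time"))
    .withColumn("month", F.month("start_time"))
    .withColumn("year", F.year("start_time"))
    .withColumn("weekday", F.dayofweek("start_time"))
)
```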
- Top 10 most-played artists: `SELECT artist_id, COUNT(*) AS cnt FROM songplays GROUP BY artist_id ORDER BY cnt DESC LIMIT 10;`
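The same query can be run with Spark directly against the Parquet output; a sketch with a placeholder output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-analytics").getOrCreate()

# Load the fact table written by the ETL job (path is a placeholder).
songplays = spark.read.parquet("s3://<output-bucket>/songplays/")
songplays.createOrReplaceTempView("songplays")

# Top 10 most-played artists.
spark.sql("""
    SELECT artist_id, COUNT(*) AS cnt
    FROM songplays
    GROUP BY artist_id
    ORDER BY cnt DESC
    LIMIT 10
""").show()
```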