The purpose of this project is to upgrade Sparkify's data analytics capabilities by moving from a data warehouse (AWS Redshift) to a data lake. The raw data resides in two AWS S3 buckets:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
The goal is to:
- Load the raw data from the S3 buckets
- Process the data into analytics tables using Spark
- Load them back into S3 as a set of dimensional tables
The ETL pipeline will:
- Load the song and log datasets from the S3 buckets into an AWS EMR cluster
- Transform the raw data into analytical tables optimized for queries (via partitioning)
- Store the transformed data in Parquet format on an S3 bucket (see the sketch below)
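A minimal sketch of one step of this pipeline (building the songs table), assuming a SparkSession with S3 access; the path glob and the output bucket name are assumptions, not project constants:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

# Read the raw JSON song data from S3 (the nested path glob is assumed
# to match the dataset's directory layout).
song_data = spark.read.json("s3://udacity-dend/song_data/*/*/*/*.json")

# Build the songs dimensional table and write it back to S3 as Parquet,
# partitioned by year and artist_id so queries can prune partitions.
songs = (
    song_data
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)
songs.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3://<output-bucket>/songs/")  # placeholder output bucket
```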
The resulting analytics tables form a star schema:

| Table | Description |
|---|---|
| songplays | fact table of song plays (which user played which song, when, and on which device) |
| users | dimensional table for users (name, gender, and level) |
| songs | dimensional table for songs (artist, title, year, and duration) |
| artists | dimensional table for artists (name and location info) |
| time | dimensional table breaking songplay timestamps into units (hour, day, week, month, year, weekday) |
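As an illustration of how the time table can be derived, a sketch assuming `log_df` holds the raw log data and its `ts` column stores epoch milliseconds (as the event logs do):

```python
from pyspark.sql import functions as F

# Convert epoch milliseconds to a timestamp, then break it into units.
time_table = (
    log_df
    .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
    .select("start_time")
    .dropDuplicates()
    .withColumn("hour", F.hour("start_time"))
    .withColumn("day", F.dayofmonth("start_time"))
    .withColumn("week", F.weekofyear("start_time"))
    .withColumn("month", F.month("start_time"))
    .withColumn("year", F.year("start_time"))
    .withColumn("weekday", F.dayofweek("start_time"))
)
```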
- Top 10 most-played artists: `SELECT artist_id, COUNT(*) AS cnt FROM songplays GROUP BY artist_id ORDER BY cnt DESC LIMIT 10;`
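The same query can be run with Spark directly against the Parquet output; a sketch with a placeholder output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-analytics").getOrCreate()

# Load the fact table written by the ETL job (path is a placeholder).
songplays = spark.read.parquet("s3://<output-bucket>/songplays/")
songplays.createOrReplaceTempView("songplays")

# Top 10 most-played artists.
spark.sql("""
    SELECT artist_id, COUNT(*) AS cnt
    FROM songplays
    GROUP BY artist_id
    ORDER BY cnt DESC
    LIMIT 10
""").show()
```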