xavji/lastfm-exercise

To build an uber-jar for running the jobs in Spark, run: sbt assembly. By default, the assembly task also runs the tests.

This project contains three functional flavours of Spark jobs:

  • jobs that count the number of distinct tracks played by each user:
    • these jobs take exactly two command line arguments:
      1. path to the downloaded input TSV file
      2. path to output directory where CSV files with the results are generated
    • RDD version in lastfm.DistinctSongsByUserRddJob
    • SQL version in lastfm.DistinctSongsByUserSqlJob
    • example: spark-submit --master local[*] --class lastfm.DistinctSongsByUserSqlJob /project-dir/lastfm-exercise-assembly-0.1.0-SNAPSHOT.jar /data/userid-timestamp-artid-artname-traid-traname.tsv /out/distinct-songs
  • jobs that rank the tracks by number of plays and return the artist and track name of the top N tracks:
    • these jobs take exactly three command line arguments:
      1. path to the downloaded input TSV file
      2. path to output directory where CSV files with the results are generated
      3. maximum song rank to return in the results
    • RDD version in lastfm.PopularSongsRddJob
    • SQL version in lastfm.PopularSongsSqlJob
    • example: spark-submit --master local[*] --class lastfm.PopularSongsSqlJob /project-dir/lastfm-exercise-assembly-0.1.0-SNAPSHOT.jar /data/userid-timestamp-artid-artname-traid-traname.tsv /out/popular 100
  • jobs that find the N longest user sessions (in terms of duration):
    • these jobs take exactly three command line arguments:
      1. path to the downloaded input TSV file
      2. path to output directory where CSV files with the results are generated
      3. maximum session duration rank to return in the results
    • RDD version in lastfm.LongestSessionsRddJob
    • SQL version in lastfm.LongestSessionsSqlJob
    • example: spark-submit --driver-memory 4G --master local[*] --class lastfm.LongestSessionsRddJob /project-dir/lastfm-exercise-assembly-0.1.0-SNAPSHOT.jar /data/userid-timestamp-artid-artname-traid-traname.tsv /out/sessions 10
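
As a rough illustration of what the three job flavours compute, here is a minimal sketch in plain Scala collections, without Spark. It is not the code of the jobs in this project. The names LastFmSketch, Play, Session and all method names are hypothetical. The column order is taken from the input file name (userid, timestamp, artid, artname, traid, traname); ISO-8601 timestamps, identifying a track by its artist and track name, breaking ties by name, and the session rule (a play belongs to the current session if it starts within maxGap of the previous play) are all assumptions.

```scala
import java.time.{Duration, Instant}
import scala.util.Try

object LastFmSketch {
  // One play event; only the columns needed by the sketches are kept.
  final case class Play(userId: String, timestamp: Instant, artistName: String, trackName: String)

  // A listening session of one user, from first to last play start time.
  final case class Session(userId: String, start: Instant, end: Instant) {
    def duration: Duration = Duration.between(start, end)
  }

  // Parse one TSV line (userid, timestamp, artid, artname, traid, traname).
  // Lines without exactly six columns or with an unparsable timestamp yield None.
  def parse(line: String): Option[Play] =
    line.split("\t", -1) match {
      case Array(user, ts, _, artist, _, track) =>
        Try(Play(user, Instant.parse(ts), artist, track)).toOption
      case _ => None
    }

  // Flavour 1: number of distinct (artist, track) pairs played by each user.
  def distinctTracksByUser(plays: Seq[Play]): Map[String, Int] =
    plays.groupBy(_.userId).map { case (user, ps) =>
      user -> ps.map(p => (p.artistName, p.trackName)).distinct.size
    }

  // Flavour 2: the n most played (artist, track) pairs with their play counts.
  // Ties are broken by artist and track name (an assumption, for determinism).
  def topTracks(plays: Seq[Play], n: Int): Seq[((String, String), Int)] =
    plays.groupBy(p => (p.artistName, p.trackName))
      .map { case (key, ps) => key -> ps.size }
      .toSeq
      .sortBy { case (key, count) => (-count, key) }
      .take(n)

  // Split each user's plays into sessions: a play extends the current session
  // when it starts at most maxGap after the previous play of that user.
  def sessions(plays: Seq[Play], maxGap: Duration): Seq[Session] =
    plays.groupBy(_.userId).toSeq.flatMap { case (user, ps) =>
      val times = ps.map(_.timestamp).sortBy(_.toEpochMilli)
      times.foldLeft(List.empty[Session]) {
        case (current :: done, t) if Duration.between(current.end, t).compareTo(maxGap) <= 0 =>
          current.copy(end = t) :: done
        case (acc, t) =>
          Session(user, t, t) :: acc
      }
    }

  // Flavour 3: the n sessions with the longest duration.
  def longestSessions(plays: Seq[Play], maxGap: Duration, n: Int): Seq[Session] =
    sessions(plays, maxGap).sortBy(s => -s.duration.toMillis).take(n)
}
```

For example, LastFmSketch.longestSessions(plays, Duration.ofMinutes(20), 10) would return the ten longest sessions under a 20-minute gap; the 20-minute value is only an example, not something stated in this README. Grouping everything in memory like this only works for small samples, which is why the actual jobs run on Spark.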

The output of the longest sessions job is included in sessions.csv. The SQL version of the longest sessions job cannot yet process the complete dataset, probably because of the immutable data structures used during the aggregation; lack of time prevented fixing this.
