Skip to content

Latest commit

 

History

History
27 lines (22 loc) · 1.06 KB

File metadata and controls

27 lines (22 loc) · 1.06 KB

Spark Fundamentals Week

  • match_details
    • a row for every players performance in a match
  • matches
    • a row for every match
  • medals_matches_players
    • a row for every medal type a player gets in a match
  • medals
    • a row for every medal type

Your goal is to make the following things happen:

  • Build a Spark job that
    • Disabled automatic broadcast join with spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    • Explicitly broadcast JOINs medals and maps
    • Bucket join match_details, matches, and medal_matches_players on match_id with 16 buckets
    • Aggregate the joined data frame to figure out questions like:
      • Which player averages the most kills per game?
      • Which playlist gets played the most?
      • Which map gets played the most?
      • Which map do players get the most Killing Spree medals on?
    • With the aggregated data set
      • Try different .sortWithinPartitions to see which has the smallest data size (hint: playlists and maps are both very low cardinality)

Save these as .py files and submit them this way!