Companies like Spotify create personalized recommendations. How can this be extended to groups of people? This project uses the Echo Nest Taste Profile Dataset of user play histories to create recommended playlists for groups of users.
In many social situations, like parties and road trips, groups of people with different music tastes will listen to music together. It's hard to please everyone, and it's even harder if you don't know what each person likes. This project extends the personalization of music recommendations for an individual to multiple people. There are multiple ways to synthesize the preferences of a group. Should you average their preferences? Should you try to make sure that no one hates the choices at the cost of excluding someone's favorites? This project explores these questions by first generating a music profile for each person, and then applying different strategies to combine their preferences into a single playlist.
Python 3.9:

ML modules:
- scikit-learn
- surprise

Data modules:
- pandas
- numpy
- csv
- pickle
- tables
- hdf5_getters

Plotting modules:
- matplotlib
- seaborn

Web development modules:
- flask
- wtforms
- img: Figure image files
- src: Python files
A presentation summarizing the data analysis and results can be found here.
This project explores the Echo Nest Taste Profiles subset of the Million Song Dataset. A reference paper explaining the dataset is:
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
The Taste Profiles dataset contains:
- 1,019,318 unique users
- 384,546 unique songs
- 48,373,586 entries
The main table is in the format: user ID – song ID – play count
The song information table follows the format: song ID – song title – artist.

The dataset does not have explicit ratings, so I generated implicit ratings inferred from the song play counts, under the assumption that users tend to rate highly the songs they listen to frequently. To generate the ratings, I first needed to explore the data.
The play counts per user-song pair ranged from 1 to 10,000, which is a large dynamic range. About half of the songs had only 1 play.
To deal with the large dynamic range, first I took the logarithm (base-10) of the plays. The play count distribution is shown below:
Even after this transformation, the distribution was heavily weighted toward zero and had a very long tail. I found that only 1% of play counts exceeded 24, so I designated 24 as the ceiling. I normalized the play counts by this new maximum, took the logarithm, and then mapped them onto a 1-5 scale for interpretability:

`rating = 1 + 4 * log10(min(plays, 24)) / log10(24)`

Here is the distribution of generated ratings:

Most (99.99%) user-song pairs were missing from the dataset, as each user listened to only a small fraction of the ~400,000 songs. When the sparsity is this high, learning algorithms have difficulty finding the underlying patterns. I simplified the problem by reducing the dataset to the top 250 songs. This detracts from the real-world practicality, but could be remedied in the future by adding more complex song and user information to the training data to improve model predictions. The reduction lowered the sparsity from 99.99% to 97% and retained 11% of the dataset.
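As a concrete sketch of this transformation (the triplet file name and column names are assumptions; the project's actual preprocessing lives in src):

```python
import numpy as np
import pandas as pd

# Load the user-song-play-count triplets (tab-separated; file name assumed).
plays = pd.read_csv("train_triplets.txt", sep="\t",
                    names=["user_id", "song_id", "play_count"])

CEILING = 24  # the 99th-percentile play count, used as the effective maximum

def plays_to_rating(counts):
    """Clip play counts at the ceiling, log-scale them, and map onto 1-5."""
    clipped = np.clip(counts, 1, CEILING)
    return 1 + 4 * np.log10(clipped) / np.log10(CEILING)

plays["rating"] = plays_to_rating(plays["play_count"])
```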
To train and test different models, I split the data into a random 80/20 train/validation split. All of the models I tested come from the Surprise Python library for recommender systems. A brief description of each model's algorithm is below (more information can be found here), and a short training sketch follows the list.
- Normal
- Fits the ratings to a normal distribution and randomly generates ratings from this distribution. This is a baseline model against which the others are compared.
- Global Mean
- Assigns the global average rating to each prediction. This is an alternative baseline model.
- K-Nearest Neighbors (KNN)
- Predictions are based on the average values from the k nearest neighbors.
- Nonnegative Matrix Factorization (NMF)
- Fits prediction values based on factoring a matrix into user and item vectors, with elements assumed to be non-negative. The values are optimized iteratively through stochastic gradient descent.
- Co-cluster
- Similar to k-means clustering, but users and items are assigned to clusters C_u and C_i, as well as co-clusters C_ui.
- Alternating Least Squares Baseline
- Uses the alternating least squares algorithm to estimate baseline vectors for items and for users, and sums them with a global baseline. Note that this is also incorporated into the KNN and SVD models.
- Singular Value Decomposition (SVD)
- Uses a matrix factorization algorithm similar to NMF, without the non-negative assumption.
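Here is a minimal sketch of the training and evaluation loop with Surprise, using SVD as the example model (the DataFrame column names carry over from the preprocessing sketch above):

```python
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split

# Build a Surprise dataset from the generated 1-5 ratings.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(plays[["user_id", "song_id", "rating"]], reader)

# Random 80/20 train/validation split.
trainset, valset = train_test_split(data, test_size=0.2, random_state=42)

model = SVD()
model.fit(trainset)
predictions = model.test(valset)
accuracy.rmse(predictions)  # root-mean-square error on the validation set
```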
Additionally, I combined 3 of the best-performing models into an ensemble model, excluding the ALS Baseline model because it is too similar to the others. I averaged the predictions of the 3 models, weighting each model by its performance on the validation set. The ensemble model outperformed all of the individual models, but may be too slow for practical use without modification.
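A sketch of the weighted average, with the inverse-RMSE weighting shown here as one plausible choice rather than the project's exact scheme:

```python
import numpy as np

def make_ensemble(models, val_rmses):
    """Combine fitted Surprise models into a weighted-average predictor.

    Weights are inversely proportional to each model's validation RMSE,
    so better-performing models contribute more. (This exact weighting
    is an illustrative assumption.)
    """
    weights = 1.0 / np.asarray(val_rmses)
    weights = weights / weights.sum()

    def predict(user_id, song_id):
        estimates = [m.predict(user_id, song_id).est for m in models]
        return float(np.dot(weights, estimates))

    return predict

# Hypothetical usage with three fitted models and their validation RMSEs:
# ensemble = make_ensemble([svd, knn, cocluster], [0.62, 0.65, 0.66])
```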
Here are the results of all of the models, using root-mean-square error of predicted versus actual ratings.
Additionally, I optimized the hyperparameters of the SVD model, both for its use as a standalone model and as part of the ensemble model. I varied the learning rate and number of epochs. I also varied the number of factors in the SVD matrix, but it showed little effect. Below is the error as a function of learning rate and number of epochs.
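This kind of sweep can be reproduced with Surprise's GridSearchCV; the grid values below bracket the settings discussed here:

```python
from surprise import SVD
from surprise.model_selection import GridSearchCV

param_grid = {
    "lr_all": [0.001, 0.005, 0.01],  # learning rate
    "n_epochs": [25, 75, 125],       # number of training epochs
    "n_factors": [50, 100],          # latent factors (showed little effect)
}

search = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
search.fit(data)  # `data` is the Surprise dataset built earlier

print(search.best_params["rmse"])
print(search.best_score["rmse"])
```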
The results showed that a low learning rate (0.001) with a high number of epochs (125) performed best. Pushing these values further toward the extremes may offer marginal gains in performance, but the training time increases nonlinearly, so I stopped there.

I then trained the model on all of the data in preparation for generating recommendations. The ranking algorithm is as follows. For a user, the model predicts ratings for all 250 songs. The known song ratings are imputed into the predictions, constituting ~3% of the predictions. The ratings are then sorted and assigned a ranking of 0-249, where 0 is the ranking of the highest-rated song.
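A sketch of the per-user ranking step (function and variable names are illustrative, not the project's actual API):

```python
def rank_songs(model, user_id, song_ids, known_ratings):
    """Rank all songs for one user, where rank 0 is the best song.

    known_ratings maps song_id -> actual rating for the songs the user
    has already played (~3% of the catalog); these override the model's
    predictions before sorting.
    """
    ratings = {song: model.predict(user_id, song).est for song in song_ids}
    ratings.update({s: r for s, r in known_ratings.items() if s in ratings})

    # Sort by rating, highest first, and assign ranks 0..len(song_ids)-1.
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return {song: rank for rank, song in enumerate(ordered)}
```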
When a group of users is submitted for recommendations, the algorithm generates rankings for each user as described above. I implemented 3 different strategies, rooted in psychology research, for synthesizing the sets of rankings into a single group ranking. The first and most basic strategy is to average the users' rankings for each song; this is referred to as the "average strategy". A 5-song playlist recommendation for 3 users from the dataset is below, where Rank 1, Rank 2, and Rank 3 are the rankings for User 1, User 2, and User 3.
Sometimes this results in recommending songs that most group members rank highly but one or more rank very poorly. The "least misery strategy" addresses this issue by defining the group ranking for each song as the worst (maximum) ranking among the group members. The idea is that "the group is only as happy as its least happy member." A playlist recommendation using the "least misery strategy" is below.
Last, the downside of the "least misery strategy" is that it may generate recommendations that no one hates but that few or none of the members love. The complement to the "least misery strategy" is the "most pleasure strategy": group rankings for each song are defined as the best (minimum) ranking among the group members. The enjoyment of others may be contagious, so group members may feed off the energy of the happiest member. An example using this strategy is below.
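All three strategies reduce to simple aggregations over the per-user rank columns; a minimal sketch with toy data:

```python
import pandas as pd

# Per-user rankings over a toy catalog (0 = that user's favorite song).
ranks = pd.DataFrame(
    {"user_1": [0, 1, 2, 3], "user_2": [3, 0, 1, 2], "user_3": [1, 3, 0, 2]},
    index=["song_a", "song_b", "song_c", "song_d"],
)

group = pd.DataFrame({
    "average":       ranks.mean(axis=1),  # average strategy
    "least_misery":  ranks.max(axis=1),   # worst (highest) rank in the group
    "most_pleasure": ranks.min(axis=1),   # best (lowest) rank in the group
})

# Top-2 playlist under each strategy: lower group score = recommended first.
for strategy in group:
    print(strategy, group[strategy].nsmallest(2).index.tolist())
```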
The strategies and their descriptions were adapted from this reference paper:
De Pessemier, T., Dooms, S. & Martens, L. Comparison of group recommendation algorithms. Multimed Tools Appl 72, 2497–2541 (2014). https://doi.org/10.1007/s11042-013-1563-0
This prototype was completed in a limited timespan, but can be improved in the future by:
- Incorporating the ~300 GB and >50 categories of song metadata from the Million Song Dataset for item-item similarity
- Expanding to a much larger song catalog
- Incorporating more complex criteria such as
- Diversity of recommendations
- Harmony, or extra weighting toward item similarity in recommendations
- Serendipity, or recommendations novel to users, but still enjoyed by them
Author: Christopher Shaffer