In fulfilling its mission to grant people all over the world access to the music they love, Spotify has developed a number of innovative technologies. Playlists in particular are an effective vehicle by which Spotify relays and recommends tracks to millions of users. While most playlists on the platform are generated by users themselves, Spotify-curated playlists are amongst the most popular, being followed in some cases by millions of users. These playlists are created through a combination of algorithmic and human-driven processes. These playlists are both personalized (such as the esteemed Discover Weekly playlist) as well as intended for general browsing based on genres, moods, or even current events.
This capstone project aims to answer the following question: to what extent can we predict the success
of a Spotify-curated playlist, based only on the songs that comprise that playlist? Do the intrinsic
features of a song, from its acoustic properties to its relative position within a playlist, contribute
to how well users will respond to that playlist?
More specifically, our project aims to develop a predictive model of playlist success that can be used to generate playlists algorithmically, given an optimization function, a sampling algorithm from Spotify's universe of tracks and artists, and only one input song of interest from which to seed the playlist.
The primary data source used for this project was the publicly available Spotify API. From the API, we obtained valuable information about playlists, from their lengths and popularities to their component songs and a list of Spotify-derived audio features for each track. For this project we analyzed 2,000 playlists that have been individually curated by Spotify, encompassing a universe of roughly 20,000 songs.
Our project is divided into the following broad spheres of analysis:
This carousel cycles through primary observations learned through the data exploration phase of the project, such as understanding the relationship between the mean popularity of a playlist's tracks and the overall popularity of a playlist.
This section describes a series of models that predict the popularity of a playlist, defined as the number of playlist followers binned into 5 popularity levels based on quantile. The inputs to this model include various attributes relating to the playlist as well as its component songs. As the number of songs per playlist varies, summary metrics were decided upon and tested in order to generate a final array of playlist-level features that could be used for a suitable prediction.
The Spotify API provides a number of important data points on the track level, namely the popularity of the track on an index basis from 1 to 100, as well as the number of markets the song is available in. The API also notably includes the following audio-based features per song: acousticness, danceability, duration, energy, instrumentalness, key, liveness, loudness, mode, tempo, valence, and time signature.
Playlist-level data includes three primary metrics: the number of followers (in other words, the response variable we are trying to predict), the number of tracks that are in the playlist, and an indicator corresponding to whether the playlist is featured or not.
One important observation that guided our modeling approach for playlist popularity prediction
was the understanding that the mean popularity of a playlist's tracks were a rough proxy or
independently strong predictor of popularity. In order to improve upon the predictive ability of an
all-encompassing model, however, our goal was to determine whether the addition of acoustical
features to a training set would improve the classification rate of a predictive model.
In order to generate a training set of playlist-level features, a careful consideration of track-level analysis had to be made in order to aggregate song data in a meaningful way. While simple averaging or addition could have been a simple solution, capturing the distribution of acoustic features more precisely should contain better predictive ability. As a result, our first approach was to approximate the distribution of acoustic features by representing each feature with 5 different columns, each corresponding to the quantile median of that feature's distribution across the playlist. The second, and chosen, approach was to instead reflect the distribution through sequencing, by providing the median value of that acoustic feature for songs that belong to a particular quartile of the playlist's progression.
Various models were tested for predicting playlist popularity, including logistic regression, support vector machines, and random forest classifiers. The best performing model was a random forest whose overall accuracy was 78% on a training set when both acoustic features and metadata were included.
After implementing a rigorous regime of training and testing with cross-validation and tuning, the best performing model was also a random forest classifier of depth 29. While the training set was classified with almost perfect accuracy, results on an out-of-bag testing sample were roughly 50%, indicating a rare case of overfitting within the random forest.
As part of our dataset, we have 30 second raw audio samples for 15,000 songs. We extracted 34 acoustic features from raw audio at each second using a
feature extraction library available here.
The acoustic features include 13 Mel Frequency Cepstral Coefficients (MFCC), 12 Chroma coefficients, in addition to energy and other spectral features.
These features are a characteristic of the frequency and pitch of the songs. Combining all 34 features taken at 30 timestamps of a song produces 1,020
dimensional predictors for each track.
Extraction of energy from track with frame rate = 1 second
This section describes modelling techniques used to predict the popularity of a track given the raw audio features.
The predictors are the mean value of the individual raw audio features extracted at 30 timestamps, and the response is the track popularity (in range 0-100) available from the Spotify API. Dimensionality reduction techniques like PCA did not give improvement in model performance, so we decided to take the mean value of each feature.
The traditional approach to determining acoustic similarity is by hand. This is clearly infeasible for large quantities of music. Some researchers have tried to analyize MIDI music data to find melody contours for each section of a music track. But the type of music similarity required for playlist construction is based on the overall sound of music rather than just the melody. One such approach is to use indexing based on matching features such as Mel-frequency cepstral coefficients (MFCCs), Chroma coefficients, energy, or other such features.
Obtaining the spectral signature
The spectral signature attempts to capture the main statistics of the frequency and chroma coefficients to characterize the main types of sounds present in a track. We achieve this by dividing each song into 30 frames each of 1 second.
It is important to note that the clustering is local to each song. The clusters are characterized by means, covariances and weights (weight is proportional to the number of frames present in each cluster). The set of clusters define the signature of a song. Mathematically, $\{(μ_{p_1}, Σ_{p_1}, w_{p_1}),.....(μ_{p_m}, Σ_{p_m}, w_{p_m})\}$ is the signature of a song where $μ_{p_1}$ and $Σ_{p_1}$ are the mean and covariance respectively of the cluster $p_i$ and $w_i$ is the weight of that cluster.
Once we obtain the spectral signature of two songs, we use EMD to calculate the similarity between songs. We use the symmetric form of Kullback–Leibler divergence to calculate the distance between two clusters. Linear programming is used to minimize the cost (defined by multiplying flow with the KL divergence distance). Finally, EMD is calculated as the weighted sum of distances between clusters where weights are inputted as the flow (probability mass). The lower the EMD value, the higher the similarity between songs.
Optimizing song search
Our final database contained around 15,000 songs with spectral signatures. Given a track, it is computationally expensive to calculate EMD for the entire database of songs. To optimize the search process, we suggest a vantage point method. The database of songs is clustered into 10 clusters using K-means clustering and the cluster centers are chosen as vantage points. Given an input track, we find the closest vantage point and then return the top 200 acoustically similar songs from that cluster (where acoustic similarity is calculated using EMD as previously explained). This reduces the computation of EMD to a subset of songs.
As well as selecting one bucket of candidate songs, our implementation allows the user to select multiple buckets and input a weighting to define the proportions between buckets included in the candidate list. We can understand the effects of using different buckets by varying the weights and measuring the resulting energy from simulated annealing. For example, if we select tracks from both A) top tracks by related artists as well as B) acoustically similar tracks from related artists, we can observe that increasing the proportion of top tracks will give us somewhat better results for the track Shape of You by Ed Sheeran.
However, using a higher proportion of top tracks will also prevent the final playlist from including songs that are acoustically similar to the seed song, and will limit the addition of new or undiscovered music that may be unlikely to appear in top tracks from related artists.
In order to return the playlist that maximizes the number of followers, we optimize the set and order of tracks using simulated annealing, an optimization technique for estimating the global optimum of a function.
See Appendix A for more detailed background about simulated annealing.
We begin by initializing four simulated annealing processes in parallel. For each process:
Cost Function: We define our cost (energy) $E$ as $10^4 *(1-(0.25P_4+0.75P_5))$, where $P_j$ is the class (pseudo) probability of that the playlist is in class $j$
Simulated annealing effectively lowers the energy, plateauing after several hundred transitions.
Optimization performance is significantly better than the baseline of random sampling, resulting in substantially higher probabilities for classes 4 and 5 (top two classes).
Song Name | Artist | Preview |
---|---|---|
Shape of You | Ed Sheeran | |
Piano Sonata No. 14 in C Minor | Mozart | |
Daydream | Ciele (ft. Chris Ho) |
Method |
---|
Top Tracks From Related Artists |
Similar Tracks From Related Artists |
Similar Tracks From Vantage Points Database |
Recommended Playlist | Popularity |
---|
Playlists can be effectively optimized for popularity using acoustic and non-acoustic features extracted from component songs. Non-acoustic features include the song's popularity, its availability in world markets, and other metadata obtained from the record summary. Acoustic features include prescribed signals for energy, liveliness, tempo, and key, with additional features engineered and sampled by us.
Generating a playlist given an input song of interest was performed using simulated annealing, which optimized combinations and sequences of songs across a pool of candidate tracks. Our final module allows a user to specify the pool of candidate songs from a list of three options: a network of top tracks from related artists, a network of acoustically similar tracks from related artists, or a system of similar tracks that are 100% acoustically-derived from raw audio across the Spotify universe.
We recommend two areas of future research: first, steps must be taken to mitigate the rare case of overfitting that affected our random forest classifier, which over-predicted unpopular playlists in the testing set. Second, more metadata must be collected and then incorporated as predictors of popularity to ensure that acoustical features are not capturing information such as release date or other non-intrinsic characteristics.
Evaluating music recommendation systems quantitatively is difficult, as there is often no definitive metric for acoustic similarity. This problem is usually approached in a subjective way, whereby music experts rate the similarity of songs and finally the majority vote is taken to generate similarity scores between songs. Based on literature review, we found that songs we deemed to be acoustically similar belonged to the same genre more that 90% of the time. To further qualitatively evaluate our model for acoustic similarity, we listened to the songs ourselves and noted striking similarities in the acoustics, vocal style, and genre.
Evaluating our playlist generation engine is also challenging for two primary reasons.
Simulated Annealing is one of the most popular techniques for global optimization. The technique is based on the physical annealing process, in which a system is first heated to a melting state and then cooled down slowly. When the solid is heated, its molecules start moving randomly, and its energy increases. If the subsequent process of cooling is slow, the energy of the solid decreases slowly, but there are also random increases in the energy governed by the Boltzmann distribution. If the cooling is slow enough, and deep enough to unstress the solid, the system will eventually settle down to the lowest energy state where all the molecules are arranged to have minimal potential energy.
Suppose the system is in a state with energy $E_i$ (function at current $x_i$ has value $E_i$) and subsequent to the random fluctuation to position $x_j$, now has energy state $E_j$. We denote the change in the energy due to the state change as $ΔE=E_j−E_i$. The move to the new position $x_j$ is done via a proposal. If the new state has lower energy than the current state, we go into this new state. This stochastic acceptance of higher energy states, allows our process to escape local minima.
As the temperature is lowered, there will be more concentrated search effort near the current local minimum, since only few uphill moves will be allowed. Thus, if we tune our temperature decrease schedule appropriately, we hope to converge to a global minimum. If the lowering of the temperature is sufficiently slow, the system reaches “thermal equilibrium” at each temperature.
Hill Climbing with Simulated Annealing
By Kingpin13 - Own work, CC0, Link