
Playlist Perfection: Predicting Song Skips on Spotify



Business Motivation

Help Spotify improve its song recommendation algorithms, which in turn improves the user experience and keeps users engaged as they listen to their favorite music on the platform.



Project Overview

  • Employed three classification models (logistic regression, classification tree, random forest) to predict whether users will skip tracks in a listening session

  • Deployed unsupervised learning (k-means clustering) to identify and understand listeners' behaviors

  • Preprocessed the data by checking for duplicates and encoding categorical variables

  • Performed exploratory data analysis (EDA) to understand correlations between variables and gather summary statistics


Main Insights

1. The random forest model performs best at classifying song-skipping behavior

2. Skipping behavior in a session is correlated with prior skipping behavior within the same session

3. A skip rate above 50% can indicate that users are unhappy with the current selection of tracks



Tools

Software: RStudio

Libraries: ggplot2, dplyr, pROC, tree, randomForest, corrplot


 

Table of Contents:

I. Data Preprocessing

II. Exploratory Data Analysis

III. Model Building

a. K-means Clustering

b. Classification (Logistic Regression, CART, Random Forest)

IV. Model Performance

 

I. Data Preprocessing


The original dataset contains 160 million Spotify listening sessions and user interactions (350GB of unzipped data). Because this is far more data than I could process, I drew a random sample from the original dataset.
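As a rough illustration of the sampling step, here is a minimal sketch in base R. The `sessions` data frame, its columns, and the sample size are made-up stand-ins, not the real data or the exact procedure used; with 350GB of raw files, the sampling would in practice have to be done file by file rather than on one in-memory table.

```r
# Hypothetical stand-in for one chunk of the listening log (not the real data).
set.seed(42)  # fix the seed so the sample is reproducible
sessions <- data.frame(
  session_id = rep(1:1000, each = 10),
  skip_1     = sample(c("True", "False"), 10000, replace = TRUE)
)

# Draw a simple random sample of rows without replacement.
sample_size  <- 2000  # illustrative size only
sampled_rows <- sample(nrow(sessions), size = sample_size)
data_sample  <- sessions[sampled_rows, ]

nrow(data_sample)
```

Note that this samples individual rows; sampling whole sessions instead would keep each session's tracks together.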


The random sample consists of ~2M rows (tracks) and 24 independent variables. There are 4 target columns: skip_1, skip_2, skip_3, and not_skipped, which measure different degrees of skipping a track. Our prediction focuses only on skip_1.



For data cleaning, I checked for duplicate rows and missing values, and encoded categorical variables.

# Count duplicated rows
sum(duplicated(data))
## [1] 0 # Zero duplicated rows
# Encode "True"/"False" strings as 1/0
data$shuffle <- ifelse(data$hist_user_behavior_is_shuffle == "True", 1, 0)
data$premium <- ifelse(data$premium == "True", 1, 0)
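The missing-value check mentioned above is not shown in the snippet. One way it could be done is sketched below on a toy data frame; the `toy` table and its values are invented for illustration, and this is not necessarily the check that was actually run.

```r
# Toy data frame standing in for the sampled data (values are made up).
toy <- data.frame(
  premium        = c("True", "False", "True", NA),
  session_length = c(20, 15, NA, 10),
  stringsAsFactors = FALSE
)

# Count missing values per column and overall.
na_per_column <- colSums(is.na(toy))
total_na      <- sum(is.na(toy))

# Encode a "True"/"False" string column as 1/0; a missing value stays NA.
toy$premium_flag <- ifelse(toy$premium == "True", 1, 0)

na_per_column
total_na
```

Because `ifelse` propagates `NA`, it is worth running the missing-value check before encoding so that gaps are not silently carried into the model inputs.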


II. Exploratory Data Analysis


First, I explored the most popular tracks in the dataset:
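A minimal sketch of how the most-played tracks could be counted in base R. The `plays` data frame and the `track_id` column name are hypothetical; the real dataset's column names may differ.

```r
# Toy play log; `track_id` is a hypothetical column name.
plays <- data.frame(
  track_id = c("t_a", "t_b", "t_a", "t_c", "t_a", "t_b")
)

# Count plays per track and sort from most to least played.
play_counts <- sort(table(plays$track_id), decreasing = TRUE)

# Keep the top 2 tracks for this toy example.
top_tracks <- head(play_counts, 2)
top_tracks
```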




In addition, I plotted a correlation matrix to detect multicollinearity in the dataset.

library(corrplot)
# data_corr must contain only numeric columns
CorMatrix <- cor(data_corr)
corrplot(CorMatrix, method = "square")

Here are some takeaways from the matrix:

- Positive correlation between context_switch and start_clickrow, and between start_backbutton and end_backbutton

- Negative correlation between start_trackdone and start_forwardbutton
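Takeaways like these can also be read off the matrix programmatically rather than by eye. The sketch below uses made-up data and an arbitrary 0.8 cutoff (neither comes from the analysis) to list variable pairs whose absolute correlation is high.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
toy <- data.frame(
  x     = x,
  y_pos = x + rnorm(n, sd = 0.1),   # strongly positively correlated with x
  y_neg = -x + rnorm(n, sd = 0.1),  # strongly negatively correlated with x
  noise = rnorm(n)                  # unrelated to the others
)

cor_matrix <- cor(toy)

# Look at each pair once (upper triangle) and keep those with |r| above the cutoff.
threshold <- 0.8  # illustrative cutoff, not taken from the analysis
idx <- which(upper.tri(cor_matrix) & abs(cor_matrix) > threshold, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)
strong_pairs
```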




III. Model Building


a. K-means Clustering

I performed k-means clustering with k = 6.

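# k-means with 6 clusters; nstart = 30 tries 30 random starts and keeps the best
SixCenters <- kmeans(xdata, 6, nstart = 30)
# Number of observations in each cluster
SixCenters$size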
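The write-up does not say how k = 6 was chosen or whether the features were scaled. One common approach, sketched here on made-up data, is to standardize the features and compare the total within-cluster sum of squares across several values of k (the "elbow" heuristic). The toy matrix and the choice of k = 3 below are assumptions for illustration only.

```r
set.seed(123)
# Toy feature matrix with three well-separated groups (made-up data).
toy_x <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Standardize columns so no single feature dominates the distance.
toy_scaled <- scale(toy_x)

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 30)$tot.withinss
})
wss

# Fit the chosen k, then inspect cluster sizes and centers.
fit <- kmeans(toy_scaled, centers = 3, nstart = 30)
fit$size
fit$centers
```

Looking at `fit$centers` alongside `fit$size` is what turns the clusters into descriptions of listener behavior, since each row of centers is the average profile of one cluster.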

b. Classification Models


For classification, I employed 3 models to compare performance:

- Logistic Regression (base model)

I chose logistic regression for its interpretability, which makes it easier to generate actionable insights for stakeholders. This will be our base model.


- Classification Tree (CART)

Compared with logistic regression, a classification tree has an advantage in interpretability and is easier to visualize.


- Random Forest

Random forest can capture complicated patterns in the data and mitigates overfitting, but it is less interpretable than a classification tree.


I used 5-fold cross-validation: the data is split into 5 subsets, and each model is trained on 4 of them and tested on the remaining one, rotating through all 5. I evaluated the models using out-of-sample (OOS) accuracy.
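The cross-validation loop can be sketched as follows for the logistic regression base model. The toy data, the `prev_skip` predictor, and the 0.5 classification cutoff are assumptions made for illustration; they are not the project's actual features or settings.

```r
set.seed(2024)
n <- 500
# Toy data: `prev_skip` is a hypothetical predictor, `skip` a 0/1 target.
toy <- data.frame(prev_skip = rbinom(n, 1, 0.5), noise = rnorm(n))
toy$skip <- rbinom(n, 1, ifelse(toy$prev_skip == 1, 0.8, 0.2))

# Randomly assign each row to one of 5 folds of equal size.
n_folds <- 5
fold_id <- sample(rep(1:n_folds, length.out = n))

oos_accuracy <- numeric(n_folds)
for (k in 1:n_folds) {
  train <- toy[fold_id != k, ]
  test  <- toy[fold_id == k, ]

  # Fit on the other 4 folds, then predict on the held-out fold.
  fit  <- glm(skip ~ prev_skip + noise, data = train, family = binomial)
  prob <- predict(fit, newdata = test, type = "response")
  pred <- ifelse(prob > 0.5, 1, 0)

  # Share of held-out rows classified correctly.
  oos_accuracy[k] <- mean(pred == test$skip)
}

mean(oos_accuracy)  # mean OOS accuracy across the 5 folds
```

The same loop structure would apply to the other two models by swapping the `glm` call for a fit from the `tree` or `randomForest` package and adjusting the prediction step accordingly.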



IV. Model Performance


I evaluated the 3 models using OOS accuracy. With 5-fold cross-validation, I computed the mean OOS accuracy for each model and got the following results:



The results indicated that Random Forest outperformed the other approaches on the test and validation sets.


 


