Business Motivation
Help Spotify improve song recommendation algorithms, thus improving user experience and keeping users engaged as they listen to their favorite music on the platform
Project Overview
Employed three classification models (logistic regression, classification tree, random forest) to predict whether users will skip tracks in a listening session
Deployed unsupervised learning (k-means clustering) to identify and understand listeners' behaviors
Preprocessed the data by checking for duplicates and encoding categorical variables
Performed exploratory data analysis (EDA) to examine correlations between variables and gather summary statistics
Main Insights
1. The random forest model performs best at classifying song-skipping behavior
2. Skipping behavior in a session is correlated with prior skipping behavior within the same session
3. A skip rate above 50% can indicate that users are unhappy with the current selection of tracks
Tools
Software: RStudio
Libraries: ggplot2, dplyr, pROC, tree, randomForest, corrplot
Table of Contents:
a. K-means Clustering
b. Classification (Logistic Regression, CART, Random Forest)
I. Data Preprocessing
The original dataset contains 160 million Spotify listening sessions and user interactions (350 GB of unzipped data). Given this volume, I decided to randomly sample a subset of the original dataset.
The random sample consists of ~2M rows (tracks) and 24 independent variables. The target variable comes in four forms: skip_1, skip_2, skip_3, and not_skipped, which measure different degrees of skipping a track. Our prediction focuses only on skip_1.
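As a minimal sketch of drawing such a random subset, the snippet below samples rows from a toy data frame with base R (`session_logs` and the sample size are illustrative stand-ins, not the project's actual data or code):

```r
# Sketch: draw a reproducible random sample of rows from a data frame.
# `session_logs` is a toy stand-in for one loaded chunk of the listening logs.
session_logs <- data.frame(
  session_id = rep(1:5, each = 4),
  skip_1     = rep(c(0, 1), 10)
)

set.seed(42)                                   # make the sample reproducible
n_sample    <- 10                              # target subset size
sample_rows <- sample(nrow(session_logs), n_sample)
subset_logs <- session_logs[sample_rows, ]

nrow(subset_logs)                              # 10 rows retained
```

In practice the same row-sampling idea would be applied per file chunk, since the full 350 GB dataset cannot be loaded into memory at once.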
For data cleaning, I checked for duplicate rows, missing values, and encoded categorical variables.
sum(duplicated(data))  # counts duplicated rows directly
## [1] 0  # zero duplicated rows
data$shuffle <- ifelse(data$hist_user_behavior_is_shuffle=="True", 1, 0)
data$premium <- ifelse(data$premium=="True", 1, 0)
II. Exploratory Data Analysis
First, I explored the most popular tracks in the dataset:
In addition, I plotted a correlation matrix to detect multicollinearity in the dataset.
CorMatrix <- cor(data_corr)
corrplot(CorMatrix, method = "square")
Here are some takeaways from the matrix:
- Positive correlations: context_switch with start_clickrow, and start_backbutton with end_backbutton
- Negative correlation: start_trackdone with start_forwardbutton
III. Model Building
a. K-means Clustering
I performed k-means clustering with k = 6.
SixCenters <- kmeans(xdata, 6 ,nstart=30)
SixCenters$size
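The choice of k = 6 can be sanity-checked with an elbow plot of the total within-cluster sum of squares. The sketch below runs on synthetic data (`xdata_demo` is an illustrative stand-in for the scaled features, not the project's data):

```r
# Sketch: elbow method for choosing k, on synthetic 2-D data.
set.seed(1)
xdata_demo <- matrix(rnorm(200), ncol = 2)     # stand-in for scaled features

k_values <- 1:8
tot_withinss <- sapply(k_values, function(k) {
  kmeans(xdata_demo, centers = k, nstart = 30)$tot.withinss
})

# Total within-cluster SS shrinks as k grows; look for the "elbow"
# where additional clusters stop paying off.
plot(k_values, tot_withinss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```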
b. Classification Models
For classification, I employed 3 models to compare performance:
- Logistic Regression (base model)
I chose logistic regression for its interpretability, making it easier to generate actionable insights for stakeholders. This will be our base model.
- Classification Tree (CART)
Compared with logistic regression, a classification tree is even easier to interpret and to visualize.
- Random Forest
Random forest can capture complex patterns in the data and mitigates overfitting, but it is less interpretable than a single classification tree.
I performed 5-fold cross-validation, splitting the data into five train/test folds, and evaluated each model using out-of-sample (OOS) accuracy.
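As a minimal illustration of the 5-fold procedure, here is a sketch that computes mean OOS accuracy for a logistic regression on synthetic data (the data frame, formula, and 0.5 threshold are placeholders, not the project's actual features or code):

```r
# Sketch: 5-fold cross-validation for OOS accuracy with logistic regression.
set.seed(7)
n <- 500
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$skip_1 <- rbinom(n, 1, plogis(1.5 * demo$x1 - demo$x2))

folds <- sample(rep(1:5, length.out = n))      # random fold assignment
oos_acc <- sapply(1:5, function(k) {
  train <- demo[folds != k, ]
  test  <- demo[folds == k, ]
  fit   <- glm(skip_1 ~ x1 + x2, data = train, family = binomial)
  pred  <- ifelse(predict(fit, test, type = "response") > 0.5, 1, 0)
  mean(pred == test$skip_1)                    # accuracy on the held-out fold
})
mean(oos_acc)                                  # mean OOS accuracy across folds
```

The same loop structure applies to the tree and random forest models; only the fitting and prediction calls change.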
IV. Model Performance
I evaluated the three models using OOS accuracy. With 5-fold cross-validation, I computed the mean OOS accuracy and obtained the following results:
The results indicate that random forest outperformed the other approaches on the test and validation sets.