Business Motivation
Help Spotify improve song recommendation algorithms, thus improving user experience and keeping users engaged as they listen to their favorite music on the platform
Project Overview
Employed three classification models (logistic regression, classification tree, random forest) to predict whether users will skip tracks in a listening session
Deployed unsupervised learning (k-means clustering) to identify and understand listeners' behaviors
Preprocessed the data by checking for duplicates and encoding categorical variables
Performed exploratory data analysis (EDA) to examine correlations between variables and gather summary statistics
Main Insights
1. The random forest model performs best at classifying song-skipping behavior
2. Skipping behavior in a session is correlated with prior skipping behavior within the same session
3. A skip rate above 50% can indicate that users are unhappy with the current selection of tracks
Tools
Software: RStudio
Libraries: ggplot2, dplyr, pROC, tree, randomForest, corrplot
Table of Contents:
a. K-means Clustering
b. Classification (Logistic Regression, CART, Random Forest)
I. Data Preprocessing
The original dataset contains 160 million Spotify listening sessions and user interactions (350 GB of unzipped data). Given this volume, I decided to randomly sample a subset of the original dataset.
The random sample consists of ~2M rows (tracks) and 24 independent variables. The target variable comes in four forms: skip_1, skip_2, skip_3, and not_skipped, which measure different degrees of skipping a track. Our prediction focuses only on skip_1.
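As a minimal sketch of drawing such a random subset, the snippet below samples rows from a toy data frame with base R (`session_logs` and the sample size are illustrative stand-ins, not the project's actual data or code):

```r
# Sketch: draw a reproducible random sample of rows from a data frame.
# `session_logs` is a toy stand-in for one loaded chunk of the listening logs.
session_logs <- data.frame(
  session_id = rep(1:5, each = 4),
  skip_1     = rep(c(0, 1), 10)
)

set.seed(42)                                   # make the sample reproducible
n_sample    <- 10                              # target subset size
sample_rows <- sample(nrow(session_logs), n_sample)
subset_logs <- session_logs[sample_rows, ]

nrow(subset_logs)                              # 10 rows retained
```

In practice the same row-sampling idea would be applied per file chunk, since the full 350 GB dataset cannot be loaded into memory at once.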
For data cleaning, I checked for duplicate rows, missing values, and encoded categorical variables.
sum(duplicated(data))  # counts duplicated rows directly
## [1] 0  # zero duplicated rows
data$shuffle <- ifelse(data$hist_user_behavior_is_shuffle=="True", 1, 0)
data$premium <- ifelse(data$premium=="True", 1, 0)
II. Exploratory Data Analysis
First, I explored the most popular tracks in the dataset:
In addition, I plotted a correlation matrix to detect multicollinearity in the dataset.
CorMatrix <- cor(data_corr)
corrplot(CorMatrix, method = "square")
Here are some takeaways from the matrix:
- Positive correlations: context_switch with start_clickrow, and start_backbutton with end_backbutton
- Negative correlation: start_trackdone with start_forwardbutton
III. Model Building
a. K-means Clustering
I performed k-means clustering with k = 6.
SixCenters <- kmeans(xdata, 6 ,nstart=30)
SixCenters$size
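The choice of k = 6 can be sanity-checked with an elbow plot of the total within-cluster sum of squares. The sketch below runs on synthetic data (`xdata_demo` is an illustrative stand-in for the scaled features, not the project's data):

```r
# Sketch: elbow method for choosing k, on synthetic 2-D data.
set.seed(1)
xdata_demo <- matrix(rnorm(200), ncol = 2)     # stand-in for scaled features

k_values <- 1:8
tot_withinss <- sapply(k_values, function(k) {
  kmeans(xdata_demo, centers = k, nstart = 30)$tot.withinss
})

# Total within-cluster SS shrinks as k grows; look for the "elbow"
# where additional clusters stop paying off.
plot(k_values, tot_withinss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```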
b. Classification Models
For classification, I employed 3 models to compare performance:
- Logistic Regression (base model)
I chose logistic regression for its interpretability, making it easier to generate actionable insights for stakeholders. This will be our base model.
- Classification Tree (CART)
Compared with logistic regression, a classification tree is even easier to interpret and to visualize.
- Random Forest
Random forest can capture complex patterns in the data and mitigates overfitting, but it is less interpretable than a single classification tree.
I performed 5-fold cross-validation, splitting the data into five train/test folds, and evaluated each model using out-of-sample (OOS) accuracy.
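As a minimal illustration of the 5-fold procedure, here is a sketch that computes mean OOS accuracy for a logistic regression on synthetic data (the data frame, formula, and 0.5 threshold are placeholders, not the project's actual features or code):

```r
# Sketch: 5-fold cross-validation for OOS accuracy with logistic regression.
set.seed(7)
n <- 500
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$skip_1 <- rbinom(n, 1, plogis(1.5 * demo$x1 - demo$x2))

folds <- sample(rep(1:5, length.out = n))      # random fold assignment
oos_acc <- sapply(1:5, function(k) {
  train <- demo[folds != k, ]
  test  <- demo[folds == k, ]
  fit   <- glm(skip_1 ~ x1 + x2, data = train, family = binomial)
  pred  <- ifelse(predict(fit, test, type = "response") > 0.5, 1, 0)
  mean(pred == test$skip_1)                    # accuracy on the held-out fold
})
mean(oos_acc)                                  # mean OOS accuracy across folds
```

The same loop structure applies to the tree and random forest models; only the fitting and prediction calls change.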
IV. Model Performance
I evaluated the three models using OOS accuracy. With 5-fold cross-validation, I computed the mean OOS accuracy and obtained the following results:
The results indicate that random forest outperformed the other approaches on the test and validation sets.